@Dongjoon Hyun <dongjoon.h...@gmail.com> , Sure, and I have updated the JIRA already :) https://issues.apache.org/jira/browse/SPARK-29106 If anything is missing, please let me know. Thank you.
On Thu, Sep 19, 2019 at 12:44 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, Tianhua.
>
> Could you summarize the details on the JIRA once more?
> It will be very helpful for the community. Also, I've been waiting on that
> JIRA. :)
>
> Bests,
> Dongjoon.
>
> On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>
>> @shane knapp <skn...@berkeley.edu> thank you very much, I opened an
>> issue for this: https://issues.apache.org/jira/browse/SPARK-29106, we can
>> talk through the details there :)
>> And we will prepare an arm instance today and will send the info to your
>> email later.
>>
>> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <skn...@berkeley.edu> wrote:
>>
>>> @Tianhua huang <huangtianhua...@gmail.com> sure, i think we can get
>>> something sorted for the short term.
>>>
>>> all we need is ssh access (i can provide an ssh key), and i can then
>>> have our jenkins master launch a remote worker on that instance.
>>>
>>> instance setup, etc., will be up to you. my support for the time being
>>> will be to create the job and 'best effort' for everything else.
>>>
>>> this should get us up and running asap.
>>>
>>> is there an open JIRA for jenkins/arm test support? we can move the
>>> technical details about this idea there.
>>>
>>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>
>>>> @Sean Owen <sro...@gmail.com> , so sorry for the late reply, we had a
>>>> Mid-Autumn holiday :)
>>>>
>>>> If you hope to integrate the ARM CI into the amplab jenkins, we can
>>>> offer the arm instance, and then the ARM job will run together with the
>>>> other x86 jobs. Is there a guideline for doing this? @shane knapp
>>>> <skn...@berkeley.edu> would you help us?
>>>>
>>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I don't know what's involved in actually accepting or operating those
>>>>> machines, so can't comment there, but in the meantime it's good that you
>>>>> are running these tests and can help report changes needed to keep it
>>>>> working with ARM. I would continue with that for now.
>>>>>
>>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> For the whole work process of the spark ARM CI, we want to make two
>>>>>> things clear.
>>>>>>
>>>>>> The first thing is:
>>>>>> About the spark ARM CI, we now have two periodic jobs: one job[1] is
>>>>>> based on commit[2] (which already fixed the replay test failures[3];
>>>>>> we made a new test branch based on date 09-09-2019), and the other
>>>>>> job[4] is based on spark master.
>>>>>>
>>>>>> The first job tests the specified branch to prove that our ARM CI is
>>>>>> good and stable.
>>>>>> The second job checks spark master every day, so we can find out
>>>>>> whether the latest commits affect the ARM CI. The build history and
>>>>>> results show that some problems are easier to find on ARM, like
>>>>>> SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>, and
>>>>>> that we are making the effort to trace them and figure them out; so
>>>>>> far we have found and fixed several problems[5][6][7], thanks to
>>>>>> everyone in the community :). And we believe that ARM CI is very
>>>>>> necessary, right?
>>>>>>
>>>>>> The second thing is:
>>>>>> We plan to run the jobs for a period of time, and you can see the
>>>>>> results and logs in the 'build history' of the jobs console. If
>>>>>> everything goes well for one or two weeks, could the community accept
>>>>>> the ARM CI? Or how long would the periodic jobs need to run before the
>>>>>> community has enough confidence to accept the ARM CI? As you suggested
>>>>>> before, it would be good to integrate the ARM CI into the amplab
>>>>>> jenkins; we agree, and we can donate the ARM instances and then
>>>>>> maintain the ARM-related test jobs together with the community. Any
>>>>>> thoughts?
>>>>>>
>>>>>> Thank you all!
>>>>>>
>>>>>> [1] http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>>>> [2] https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>>>> [4] http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>>>> [5] https://github.com/apache/spark/pull/25186
>>>>>> [6] https://github.com/apache/spark/pull/25279
>>>>>> [7] https://github.com/apache/spark/pull/25673
>>>>>>
>>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I think it's just local caching. After you run the build you
>>>>>>> should find lots of stuff cached at ~/.m2/repository and it won't
>>>>>>> download every time.
>>>>>>>
>>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>> Thanks for the reply, and apologies for the confusion. I know the
>>>>>>>> dependencies are downloaded by SBT or Maven. But the Spark QA job
>>>>>>>> also executes "mvn clean package", so why doesn't its log[1] show
>>>>>>>> any downloading of jars from Maven central, and why does it build so
>>>>>>>> fast? Is the reason that the Spark Jenkins builds the Spark jars on
>>>>>>>> physical machines and doesn't destroy the test env after a job
>>>>>>>> finishes? Then a later job building Spark would get the dependency
>>>>>>>> jars from the local cache, since previous jobs running "mvn package"
>>>>>>>> had already downloaded those dependencies onto the worker machine.
>>>>>>>> Am I right? Is that the reason the job log[1] doesn't print any
>>>>>>>> downloading information from Maven Central?
>>>>>>>>
>>>>>>>> Thank you very much.
>>>>>>>>
>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> ZhaoBo
>>>>>>>>
>>>>>>>> Sean Owen <sro...@gmail.com> wrote on Fri, Aug 16, 2019, 10:38 AM:
>>>>>>>>
>>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT
>>>>>>>>> and Maven like in any other project, and nothing about it is
>>>>>>>>> specific to Spark.
>>>>>>>>> The worker machines cache artifacts that are downloaded from these,
>>>>>>>>> but this is a function of Maven and SBT, not Spark. You may find
>>>>>>>>> that the initial download takes a long time.
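>>>>>>>>>
>>>>>>>>> As a minimal sketch of that caching behavior (assuming a fresh
>>>>>>>>> machine with network access and a Spark checkout; nothing here is
>>>>>>>>> specific to the Jenkins setup):
>>>>>>>>>
>>>>>>>>> # first build on a fresh machine: Maven prints "Downloading ..."
>>>>>>>>> # lines as it populates the local repository
>>>>>>>>> ./build/mvn -DskipTests clean package
>>>>>>>>> # the artifacts are now cached under the default local repo
>>>>>>>>> du -sh ~/.m2/repository
>>>>>>>>> # a second identical build resolves everything from the cache and
>>>>>>>>> # prints no "Downloading" lines
>>>>>>>>> ./build/mvn -DskipTests clean package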
>>>>>>>>>
>>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sean,
>>>>>>>>>>
>>>>>>>>>> Thanks very much for pointing out the roadmap. ;-). Then I think
>>>>>>>>>> we will continue to focus on our test environment.
>>>>>>>>>>
>>>>>>>>>> Regarding the networking problems: we can access Maven Central,
>>>>>>>>>> and the jobs can download the required jar packages at a high
>>>>>>>>>> network speed. What we want to know is why the Spark QA test
>>>>>>>>>> jobs'[1] logs show no sign of the job script/maven build
>>>>>>>>>> downloading the jar packages. Could you tell us the reason for
>>>>>>>>>> that? Thank you. The reason we raised the "networking problems" is
>>>>>>>>>> a phenomenon we noticed during testing: if we execute "mvn clean
>>>>>>>>>> package" in a new test environment (in our test environment we
>>>>>>>>>> destroy the test VMs after each job finishes), maven downloads the
>>>>>>>>>> dependency jar packages from Maven Central, but in the job
>>>>>>>>>> "spark-master-test-maven-hadoop"[2] the log shows no jar packages
>>>>>>>>>> being downloaded at all. What is the reason for that?
>>>>>>>>>> Also, when we build the Spark jar and download dependencies from
>>>>>>>>>> Maven Central, it takes almost 1 hour, while [2] takes just 10
>>>>>>>>>> min. But if we run "mvn package" in a VM that has already executed
>>>>>>>>>> "mvn package" before, it takes 14 min, which is very close to [2].
>>>>>>>>>> So we suspect that downloading the jar packages takes most of the
>>>>>>>>>> time. For the goal of the ARM CI, we expect the performance of the
>>>>>>>>>> new ARM CI to be close to the existing x86 CI, so users will
>>>>>>>>>> accept it more easily.
>>>>>>>>>>
>>>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>>> [2] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>>
>>>>>>>>>> Best regards
>>>>>>>>>>
>>>>>>>>>> ZhaoBo
>>>>>>>>>>
>>>>>>>>>> Sean Owen <sro...@gmail.com> wrote on Thu, Aug 15, 2019, 9:58 PM:
>>>>>>>>>>
>>>>>>>>>>> I think the right goal is to fix the remaining issues first. If
>>>>>>>>>>> we set up CI/CD it will only tell us there are still some test
>>>>>>>>>>> failures. If it's stable, and not hard to add to the existing
>>>>>>>>>>> CI/CD, yes it could be done automatically later. You can continue
>>>>>>>>>>> to test on ARM independently for now.
>>>>>>>>>>>
>>>>>>>>>>> It sounds indeed like there are some networking problems in the
>>>>>>>>>>> test system if you're not able to download from Maven Central.
>>>>>>>>>>> That rarely takes significant time, and there aren't
>>>>>>>>>>> project-specific mirrors here. You might be able to point at a
>>>>>>>>>>> closer public mirror, depending on where you are.
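>>>>>>>>>>>
>>>>>>>>>>> As a rough sketch of pointing at a mirror (the mirror id and URL
>>>>>>>>>>> below are placeholders, not a recommendation of any particular
>>>>>>>>>>> mirror):
>>>>>>>>>>>
>>>>>>>>>>> # write a minimal ~/.m2/settings.xml that routes requests for
>>>>>>>>>>> # Maven Central through a nearby mirror
>>>>>>>>>>> tee ~/.m2/settings.xml > /dev/null <<'EOF'
>>>>>>>>>>> <settings>
>>>>>>>>>>>   <mirrors>
>>>>>>>>>>>     <mirror>
>>>>>>>>>>>       <id>nearby-central</id>
>>>>>>>>>>>       <mirrorOf>central</mirrorOf>
>>>>>>>>>>>       <url>https://mirror.example.org/maven2</url>
>>>>>>>>>>>     </mirror>
>>>>>>>>>>>   </mirrors>
>>>>>>>>>>> </settings>
>>>>>>>>>>> EOF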
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to discuss the spark ARM CI again. We ran some tests on
>>>>>>>>>>>> an arm instance based on master; the job includes
>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/13 and the k8s
>>>>>>>>>>>> integration https://github.com/theopenlab/spark/pull/17/ .
>>>>>>>>>>>> There are several things I want to talk about:
>>>>>>>>>>>>
>>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>>> 1. We have fixed some problems, like
>>>>>>>>>>>> https://github.com/apache/spark/pull/25186 and
>>>>>>>>>>>> https://github.com/apache/spark/pull/25279; thanks to Sean Owen
>>>>>>>>>>>> and others for helping us.
>>>>>>>>>>>> 2. We tried the k8s integration test on arm and hit an error:
>>>>>>>>>>>> apk fetch hangs. The tests passed after adding the
>>>>>>>>>>>> '--network host' option to the `docker build` command, see:
>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176
>>>>>>>>>>>> ; the solution refers to
>>>>>>>>>>>> https://github.com/gliderlabs/docker-alpine/issues/307 . I don't
>>>>>>>>>>>> know whether this has ever happened in the community CI; maybe
>>>>>>>>>>>> we should submit a PR to pass '--network host' to
>>>>>>>>>>>> `docker build`?
>>>>>>>>>>>> 3. We found two tests that fail after the commit
>>>>>>>>>>>> https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>>> ReplayListenerSuite:
>>>>>>>>>>>> - ...
>>>>>>>>>>>> - End-to-end replay *** FAILED ***
>>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>> - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>
>>>>>>>>>>>> We tried reverting the commit and then the tests passed. The
>>>>>>>>>>>> patch is big, and we are sorry we haven't found the root cause
>>>>>>>>>>>> yet; if you are interested, please try it, and it would be much
>>>>>>>>>>>> appreciated if someone could help us figure it out.
>>>>>>>>>>>>
>>>>>>>>>>>> Second, about the test time: we increased the flavor of the arm
>>>>>>>>>>>> instance to 16U16G, but there was no significant improvement.
>>>>>>>>>>>> The k8s integration test took about one and a half hours, and
>>>>>>>>>>>> the QA test (like the spark-master-test-maven-hadoop-2.7
>>>>>>>>>>>> community jenkins job) took about seventeen hours (it is too
>>>>>>>>>>>> long :( ); we suspect the reasons are performance and network.
>>>>>>>>>>>> When we split the jobs by project, such as sql, core and so on
>>>>>>>>>>>> (see the sketch below), the time decreased to about seven hours,
>>>>>>>>>>>> see https://github.com/theopenlab/spark/pull/19 . We also looked
>>>>>>>>>>>> at the Spark QA tests like
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/ ,
>>>>>>>>>>>> and none of those tests ever seem to download jar packages from
>>>>>>>>>>>> the maven central repo (such as
>>>>>>>>>>>> https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar).
>>>>>>>>>>>> So we want to know how the jenkins jobs do that: is there an
>>>>>>>>>>>> internal maven repo running? Maybe we can do the same thing to
>>>>>>>>>>>> avoid the network cost of downloading the dependency jar
>>>>>>>>>>>> packages.
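>>>>>>>>>>>>
>>>>>>>>>>>> As a rough sketch, the per-project split above amounts to
>>>>>>>>>>>> running each module's tests separately with standard Maven
>>>>>>>>>>>> flags (the module names here are only illustrative):
>>>>>>>>>>>>
>>>>>>>>>>>> # install all modules once without running tests, so each test
>>>>>>>>>>>> # job can resolve sibling modules from the local repository
>>>>>>>>>>>> ./build/mvn -DskipTests install
>>>>>>>>>>>> # then run one module's tests per job, e.g. core or sql/core
>>>>>>>>>>>> ./build/mvn -pl core test
>>>>>>>>>>>> ./build/mvn -pl sql/core test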
>>>>>>>>>>>>
>>>>>>>>>>>> Third, the most important thing: the ARM CI of spark. We believe
>>>>>>>>>>>> it is necessary, right? And you can see we have really made a
>>>>>>>>>>>> lot of effort; the basic arm build/test jobs are now OK, so we
>>>>>>>>>>>> suggest adding arm jobs to the community. We can set them to
>>>>>>>>>>>> non-voting at first and improve/enrich the jobs step by step.
>>>>>>>>>>>> Generally, there are two ways in our mind to integrate the ARM
>>>>>>>>>>>> CI for spark:
>>>>>>>>>>>> 1) We introduce the openlab ARM CI into spark as a custom CI
>>>>>>>>>>>> system. We provide the human resources and test ARM VMs, and we
>>>>>>>>>>>> will also focus on the ARM-related issues about Spark. We will
>>>>>>>>>>>> push the PRs into the community.
>>>>>>>>>>>> 2) We donate ARM VM resources to the existing amplab Jenkins. We
>>>>>>>>>>>> still provide the human resources, focus on the ARM-related
>>>>>>>>>>>> issues about Spark, and push the PRs into the community.
>>>>>>>>>>>> With both options we will provide human resources for
>>>>>>>>>>>> maintenance; of course it would be great if we can work
>>>>>>>>>>>> together. So please tell us which option you would prefer, and
>>>>>>>>>>>> let's move forward. Waiting for your reply, thank you very much.
>>>>>>>>>>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>