Hi, Tianhua. Could you summarize the details on the JIRA once more? It will be very helpful for the community. Also, I've been waiting on that JIRA. :)
Bests, Dongjoon.

On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <huangtianhua...@gmail.com> wrote:

> @shane knapp <skn...@berkeley.edu> thank you very much, I opened an issue for this: https://issues.apache.org/jira/browse/SPARK-29106; we can talk about the details in it :)
> And we will prepare an arm instance today and will send the info to your email later.
>
> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <skn...@berkeley.edu> wrote:
>
>> @Tianhua huang <huangtianhua...@gmail.com> sure, i think we can get something sorted for the short-term.
>>
>> all we need is ssh access (i can provide an ssh key), and i can then have our jenkins master launch a remote worker on that instance.
>>
>> instance setup, etc., will be up to you. my support for the time being will be to create the job and 'best effort' for everything else.
>>
>> this should get us up and running asap.
>>
>> is there an open JIRA for jenkins/arm test support? we can move the technical details about this idea there.
>>
>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>
>>> @Sean Owen <sro...@gmail.com>, so sorry to reply late, we had a Mid-Autumn holiday :)
>>>
>>> If you hope to integrate ARM CI into amplab jenkins, we can offer the arm instance, and then the ARM job will run together with the other x86 jobs. Maybe there is a guideline for doing this? @shane knapp <skn...@berkeley.edu> would you help us?
>>>
>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I don't know what's involved in actually accepting or operating those machines, so can't comment there, but in the meantime it's good that you are running these tests and can help report changes needed to keep it working with ARM. I would continue with that for now.
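Shane's short-term plan above (ssh access plus a master-launched remote worker) can be sketched roughly as follows. This is a hedged sketch, not the amplab setup: the user, hostname, and key filenames are hypothetical, and the actual agent launch is normally configured through the Jenkins SSH build agent plugin rather than by hand.

```shell
# On the ARM instance (hypothetical 'jenkins' user): authorize the
# public key that the Jenkins admin provides.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
cat jenkins_worker_key.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# From the master's side, the SSH agent plugin essentially opens an ssh
# session and starts the remoting agent there. A manual smoke test of
# the connectivity and Java runtime it needs (hostname is hypothetical):
ssh -i jenkins_worker_key jenkins@arm-instance.example.org 'java -version'
```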
>>>>
>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> For the whole work process of spark ARM CI, we want to make two things clear.
>>>>>
>>>>> The first thing is:
>>>>> About spark ARM CI, we now have two periodic jobs: one job[1] is based on commit[2] (which already fixed the failed replay tests issue[3]; we made a new test branch based on date 09-09-2019), and the other job[4] is based on spark master.
>>>>>
>>>>> The first job tests on the specified branch to prove that our ARM CI is good and stable.
>>>>> The second job checks spark master every day, so we can find out whether the latest commits affect the ARM CI. The build history and results show that some problems are easier to find on ARM, like SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>, and that we are making efforts to trace and fix them; so far we have found and fixed several problems[5][6][7], thanks to everyone in the community :). And we believe that ARM CI is very necessary, right?
>>>>>
>>>>> The second thing is:
>>>>> We plan to run the jobs for a period of time, and you can see the results and logs in the 'build history' of the jobs' consoles. If everything goes well for one or two weeks, could the community accept the ARM CI? Or how long should the periodic jobs run before the community has enough confidence to accept the ARM CI? As you suggested before, it's good to integrate ARM CI into amplab jenkins; we agree, and we can donate the ARM instances and then maintain the ARM-related test jobs together with the community. Any thoughts?
>>>>>
>>>>> Thank you all!
>>>>>
>>>>> [1] http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>>> [2] https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>>> [4] http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>>> [5] https://github.com/apache/spark/pull/25186
>>>>> [6] https://github.com/apache/spark/pull/25279
>>>>> [7] https://github.com/apache/spark/pull/25673
>>>>>
>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> Yes, I think it's just local caching. After you run the build you should find lots of stuff cached at ~/.m2/repository, and it won't download every time.
>>>>>>
>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sean,
>>>>>>> Thanks for the reply, and apologies for the confusion.
>>>>>>> I know the dependencies will be downloaded by SBT or Maven. But the Spark QA job also executes "mvn clean package", so why doesn't the log[1] print "downloading some jar from Maven central", and why does it build very fast? Is the reason that Spark Jenkins builds the Spark jars on physical machines and doesn't destroy the test env after the job is finished? Then other jobs building Spark get the dependency jars from the local cache, since the previous jobs executed "mvn package" and those dependencies had already been downloaded on the local worker machine. Am I right? Is that the reason the job log[1] doesn't print any downloading information from Maven Central?
>>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> ZhaoBo
>>>>>>>
>>>>>>> On Fri, Aug 16, 2019 at 10:38 AM, Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT and Maven like in any other project, and nothing about it is specific to Spark.
>>>>>>>> The worker machines cache artifacts that are downloaded from these, but this is a function of Maven and SBT, not Spark. You may find that the initial download takes a long time.
>>>>>>>>
>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> Thanks very much for pointing out the roadmap. ;-). Then I think we will continue to focus on our test environment.
>>>>>>>>>
>>>>>>>>> For the networking problems, I mean that we can access Maven Central, and jobs could download the required jar packages at a high network speed. What we want to know is: why do the logs of the Spark QA test jobs[1] show that the job script/maven build doesn't seem to download the jar packages? Could you tell us the reason for that? Thank you.
>>>>>>>>> The reason we raise the "networking problems" is a phenomenon we found during our tests: if we execute "mvn clean package" in a new test environment (in our test environment, we destroy the test VMs after the job is finished), maven will download the dependency jar packages from Maven Central; but in the job "spark-master-test-maven-hadoop" [2], from the log, we didn't see it download any jar packages. What is the reason for that?
>>>>>>>>> Also, when we build the Spark jar and download the dependencies from Maven Central, it costs almost 1 hour, while we found [2] costs only 10 min. But if we run "mvn package" in a VM which already executed "mvn package" before, it costs only 14 min, very close to [2]. So we suspect that downloading the jar packages costs so much time. For the goal of ARM CI, we expect the performance of the NEW ARM CI to be close to the existing X86 CI, so users can accept it more easily.
>>>>>>>>>
>>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>> [2] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> ZhaoBo
>>>>>>>>>
>>>>>>>>> On Thu, Aug 15, 2019 at 9:58 PM, Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think the right goal is to fix the remaining issues first.
>>>>>>>>>> If we set up CI/CD now, it will only tell us there are still some test failures. If it's stable, and not hard to add to the existing CI/CD, yes, it could be done automatically later. You can continue to test on ARM independently for now.
>>>>>>>>>>
>>>>>>>>>> It sounds indeed like there are some networking problems in the test system if you're not able to download from Maven Central. That rarely takes significant time, and there aren't project-specific mirrors here. You might be able to point at a closer public mirror, depending on where you are.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I want to discuss spark ARM CI again. We ran some tests on an arm instance based on master; the job includes https://github.com/theopenlab/spark/pull/13 and the k8s integration https://github.com/theopenlab/spark/pull/17/. There are several things I want to talk about:
>>>>>>>>>>>
>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>> 1. we have fixed some problems, like https://github.com/apache/spark/pull/25186 and https://github.com/apache/spark/pull/25279; thanks to Sean Owen and others for helping us.
>>>>>>>>>>> 2. we tried the k8s integration test on arm and met an error: apk fetch hangs. The tests passed after adding the '--network host' option to the `docker build` command, see: https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176; the solution refers to https://github.com/gliderlabs/docker-alpine/issues/307. I don't know whether it has ever happened in the community CI, or maybe we should submit a PR to pass '--network host' to `docker build`?
>>>>>>>>>>> 3. we found two tests failed after the commit https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>> ReplayListenerSuite:
>>>>>>>>>>> - ...
>>>>>>>>>>> - End-to-end replay *** FAILED ***
>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>> - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>
>>>>>>>>>>> We tried reverting the commit and then the tests passed. The patch is too big and, sorry, we haven't found the reason yet; if you are interested please try it, and it will be much appreciated if someone can help us figure it out.
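The apk-fetch workaround in item 2 above amounts to running the image's build-time steps on the host network. A minimal sketch (the image tag and Dockerfile path are hypothetical; Spark's k8s integration tests actually drive `docker build` through their own scripts):

```shell
# By default, 'docker build' runs RUN steps on Docker's bridge network;
# on some hosts, DNS or MTU problems there make 'apk fetch' hang.
# '--network host' makes the build-time RUN steps use the host's network
# stack instead, which was the fix that let the tests pass.
docker build --network host -t spark-k8s-test:latest -f Dockerfile .
```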
>>>>>>>>>>>
>>>>>>>>>>> Second, about the test time: we increased the flavor of the arm instance to 16U16G, but there seems to be no significant improvement. The k8s integration test took about one and a half hours, and the QA test (like the spark-master-test-maven-hadoop-2.7 community jenkins job) took about seventeen hours (it is too long :(). We suspect the reason is the performance and the network. We split the jobs based on projects such as sql, core and so on, and the time can be decreased to about seven hours, see https://github.com/theopenlab/spark/pull/19. We looked at the Spark QA tests, https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/, and it looks like the tests never download the jar packages from the maven central repo (such as https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar). So we want to know how the jenkins jobs do that. Is there an internal maven repo launched? Maybe we can do the same thing to avoid the network cost of downloading the dependent jar packages.
>>>>>>>>>>>
>>>>>>>>>>> Third, the most important thing: it's about the ARM CI of spark. We believe that it is necessary, right? And you can see we really made a lot of effort. Now the basic arm build/test jobs are ok, so we suggest adding arm jobs to the community; we can set them to non-voting at first, and improve/enrich the jobs step by step. Generally, there are two ways in our mind to integrate the ARM CI for spark:
>>>>>>>>>>> 1) We introduce openlab ARM CI into spark as a custom CI system.
We provide human resources and test ARM VMs, also we will >>>>>>>>>>> focus on >>>>>>>>>>> the ARM related issues about Spark. We will push the PR into >>>>>>>>>>> community. >>>>>>>>>>> 2) We donate ARM VM resources into existing amplab Jenkins. >>>>>>>>>>> We still provide human resources, focus on the ARM related issues >>>>>>>>>>> about >>>>>>>>>>> Spark and push the PR into community. >>>>>>>>>>> Both options, we will provide human resources to maintain, of >>>>>>>>>>> course it will be great if we can work together. So please tell us >>>>>>>>>>> which >>>>>>>>>>> option you would like? And let's move forward. Waiting for your >>>>>>>>>>> reply, >>>>>>>>>>> thank you very much. >>>>>>>>>>> >>>>>>>>>> >> >> -- >> Shane Knapp >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> >