Re: Ask for ARM CI for spark

Tianhua huang Mon, 16 Sep 2019 23:49:43 -0700

@shane knapp <skn...@berkeley.edu> thank you very much. I opened an issue
for this, https://issues.apache.org/jira/browse/SPARK-29106, and we can
discuss the details there :)
We will prepare an ARM instance today and send the info to your
email later.


On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <skn...@berkeley.edu> wrote:

> @Tianhua huang <huangtianhua...@gmail.com> sure, i think we can get
> something sorted for the short-term.
>
> all we need is ssh access (i can provide an ssh key), and i can then have
> our jenkins master launch a remote worker on that instance.
>
> instance setup, etc, will be up to you.  my support for the time being
> will be to create the job and 'best effort' for everything else.
>
> this should get us up and running asap.
>
> is there an open JIRA for jenkins/arm test support?  we can move the
> technical details about this idea there.
>
> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <huangtianhua...@gmail.com>
> wrote:
>
>> @Sean Owen <sro...@gmail.com>, sorry for the late reply, we had the
>> Mid-Autumn holiday :)
>>
>> If you would like to integrate ARM CI into amplab jenkins, we can offer the
>> ARM instance, and the ARM job would then run alongside the other x86 jobs.
>> Is there a guideline for doing this? @shane knapp <skn...@berkeley.edu>
>> would you help us?
>>
>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I don't know what's involved in actually accepting or operating those
>>> machines, so can't comment there, but in the meantime it's good that you
>>> are running these tests and can help report changes needed to keep it
>>> working with ARM. I would continue with that for now.
>>>
>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <
>>> huangtianhua...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> For the overall process of the Spark ARM CI work, we want to make two
>>>> things clear.
>>>>
>>>> The first thing is:
>>>> For Spark ARM CI we now have two periodic jobs: one job[1] based on
>>>> commit[2] (which already fixes the failing replay tests issue[3]; we made
>>>> a new test branch dated 09-09-2019), and the other job[4] based on Spark
>>>> master.
>>>>
>>>> The first job tests the pinned branch to show that our ARM CI is good
>>>> and stable.
>>>> The second job checks Spark master every day, so we can see whether the
>>>> latest commits affect the ARM CI. The build history and results show
>>>> that some problems are easier to find on ARM, like
>>>> SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>, and
>>>> that we are putting in the effort to trace and resolve them. So far we
>>>> have found and fixed several problems[5][6][7]; thanks to everyone in
>>>> the community :). And we believe that ARM CI is very necessary, right?
>>>>
>>>> The second thing is:
>>>> We plan to run the jobs for a period of time, and you can see the
>>>> results and logs in the 'build history' of the jobs console. If everything
>>>> goes well for one or two weeks, could the community accept the ARM CI? If
>>>> not, how long should the periodic jobs run before the community has enough
>>>> confidence to accept it? As you suggested before, it would be good to
>>>> integrate ARM CI into amplab jenkins. We agree, and we can donate the ARM
>>>> instances and then maintain the ARM-related test jobs together with the
>>>> community. Any thoughts?
>>>>
>>>> Thank you all!
>>>>
>>>> [1]
>>>> http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>> [2]
>>>> https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>> [4]
>>>> http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>> [5] https://github.com/apache/spark/pull/25186
>>>> [6] https://github.com/apache/spark/pull/25279
>>>> [7] https://github.com/apache/spark/pull/25673
>>>>
>>>>
>>>>
>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Yes, I think it's just local caching. After you run the build you
>>>>> should find lots of stuff cached at ~/.m2/repository and it won't download
>>>>> every time.
>>>>>
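A minimal sketch of how the local cache described above could be checked, assuming Maven's default local repository at ~/.m2/repository (settings.xml can relocate it) and the standard groupId/artifactId/version directory layout. The helper names `cached_jar_path` and `is_cached` are hypothetical, for illustration only, and are not part of Maven or Spark; the coordinates are the okapi-api artifact linked elsewhere in this thread.

```python
from pathlib import Path

# Assumption: Maven's default local repository; settings.xml can override it.
DEFAULT_REPO = Path.home() / ".m2" / "repository"


def cached_jar_path(group_id, artifact_id, version, repo_root=DEFAULT_REPO):
    """Return where Maven's standard layout would store the artifact's jar."""
    # Dots in the groupId become directory separators.
    group_dir = Path(*group_id.split("."))
    jar_name = f"{artifact_id}-{version}.jar"
    return Path(repo_root) / group_dir / artifact_id / version / jar_name


def is_cached(group_id, artifact_id, version, repo_root=DEFAULT_REPO):
    """True if the jar already exists locally, so a build need not fetch it."""
    return cached_jar_path(group_id, artifact_id, version, repo_root).is_file()


if __name__ == "__main__":
    coords = ("org.opencypher", "okapi-api", "0.4.2")
    print(cached_jar_path(*coords))
    print("cached:", is_cached(*coords))
```

On a worker that keeps its home directory between jobs, `is_cached` would be expected to return True after the first build, which matches the absence of download lines in the job log.
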
>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <bzhaojyathousa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Sean,
>>>>>> Thanks for the reply, and apologies for the confusion.
>>>>>> I know the dependencies are downloaded by SBT or Maven. But the
>>>>>> Spark QA job also runs "mvn clean package", so why does its log[1] show
>>>>>> no jar downloads from Maven Central, and why does it build so fast? Is
>>>>>> the reason that Spark Jenkins builds the Spark jars on physical machines
>>>>>> and does not destroy the test environment after a job finishes? Then
>>>>>> other jobs that build Spark would get the dependency jars from the local
>>>>>> cache, because previous jobs ran "mvn package" and those dependencies
>>>>>> had already been downloaded to the local worker machine. Am I right? Is
>>>>>> that why the job log[1] does not print any download information from
>>>>>> Maven Central?
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> [1]
>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> ZhaoBo
>>>>>>
>>>>>> On Fri, Aug 16, 2019 at 10:38 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT
>>>>>>> and Maven like in any other project, and nothing about it is specific to
>>>>>>> Spark.
>>>>>>> The worker machines cache artifacts that are downloaded from these,
>>>>>>> but this is a function of Maven and SBT, not Spark. You may find that 
>>>>>>> the
>>>>>>> initial download takes a long time.
>>>>>>>
>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <
>>>>>>> bzhaojyathousa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>>
>>>>>>>> Thanks very much for pointing out the roadmap. ;-) We will continue
>>>>>>>> to focus on our test environment.
>>>>>>>>
>>>>>>>> About the networking problems: we can access Maven Central, and our
>>>>>>>> jobs can download the required jar packages at a high network speed.
>>>>>>>> What we want to know is why the logs of the Spark QA test jobs[1] show
>>>>>>>> that the job script/Maven build does not seem to download any jar
>>>>>>>> packages. Could you tell us the reason? Thank you. We raised the
>>>>>>>> "networking problems" because of something we observed during testing:
>>>>>>>> if we execute "mvn clean package" in a new test environment (in our
>>>>>>>> setup, we destroy the test VMs after the job finishes), Maven downloads
>>>>>>>> the dependency jar packages from Maven Central. But in the log of the
>>>>>>>> "spark-master-test-maven-hadoop" job [2], we did not find any jar
>>>>>>>> downloads. What is the reason for that?
>>>>>>>> Also, when we build the Spark jar with dependencies downloaded from
>>>>>>>> Maven Central, it takes about 1 hour, while [2] takes just 10 min.
>>>>>>>> But if we run "mvn package" in a VM that has already run "mvn package"
>>>>>>>> before, it takes just 14 min, which is very close to [2]. So we suspect
>>>>>>>> that downloading the jar packages accounts for most of the time. For
>>>>>>>> the goal of ARM CI, we expect the performance of the new ARM CI to be
>>>>>>>> close to the existing x86 CI, so that users can accept it more easily.
>>>>>>>>
>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>> [2]
>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> ZhaoBo
>>>>>>>>
>>>>>>>> On Thu, Aug 15, 2019 at 9:58 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think the right goal is to fix the remaining issues first. If we
>>>>>>>>> set up CI/CD it will only tell us there are still some test failures.
>>>>>>>>> If it's stable, and not hard to add to the existing CI/CD, yes it
>>>>>>>>> could be done automatically later. You can continue to test on ARM
>>>>>>>>> independently for now.
>>>>>>>>>
>>>>>>>>> It sounds indeed like there are some networking problems in the
>>>>>>>>> test system if you're not able to download from Maven Central. That
>>>>>>>>> rarely takes significant time, and there aren't project-specific
>>>>>>>>> mirrors here. You might be able to point at a closer public mirror,
>>>>>>>>> depending on where you are.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <
>>>>>>>>> huangtianhua...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I want to discuss Spark ARM CI again. We ran some tests on an ARM
>>>>>>>>>> instance based on master; the job includes
>>>>>>>>>> https://github.com/theopenlab/spark/pull/13 and k8s integration
>>>>>>>>>> https://github.com/theopenlab/spark/pull/17/ . There are several
>>>>>>>>>> things I want to talk about:
>>>>>>>>>>
>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>     1. We have fixed some problems, like
>>>>>>>>>> https://github.com/apache/spark/pull/25186 and
>>>>>>>>>> https://github.com/apache/spark/pull/25279; thanks to Sean Owen and
>>>>>>>>>> others for helping us.
>>>>>>>>>>     2. We tried the k8s integration test on ARM and hit an error: apk
>>>>>>>>>> fetch hangs. The tests passed after adding the '--network host'
>>>>>>>>>> option to the `docker build` command, see:
>>>>>>>>>>
>>>>>>>>>> https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176
>>>>>>>>>> . The solution follows
>>>>>>>>>> https://github.com/gliderlabs/docker-alpine/issues/307 . I don't
>>>>>>>>>> know whether this has ever happened in the community CI, or whether
>>>>>>>>>> we should submit a PR to pass '--network host' to `docker build`?
>>>>>>>>>>     3. We found two tests that fail after the commit
>>>>>>>>>> https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>        ReplayListenerSuite:
>>>>>>>>>>        - ...
>>>>>>>>>>        - End-to-end replay *** FAILED ***
>>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>        - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>
>>>>>>>>>>         We reverted the commit and the tests then passed. The
>>>>>>>>>> patch is too big, and we are sorry that we have not found the cause
>>>>>>>>>> yet. If you are interested, please try it; we would greatly
>>>>>>>>>> appreciate it if someone could help us figure it out.
>>>>>>>>>>
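A minimal sketch of the workaround from item 2 above: assembling the `docker build` argument list with `--network host` added. The helper name `docker_build_command`, the image tag and the build context are hypothetical placeholders, and the Spark scripts may assemble the command differently; only the `--network` and `-t` options of `docker build` are assumed here.

```python
import shlex


def docker_build_command(tag, context=".", host_network=True):
    """Build the argv for `docker build`, optionally using the host network."""
    cmd = ["docker", "build"]
    if host_network:
        # Run the build's RUN steps on the host network stack, the
        # workaround reported above for the hanging `apk fetch`.
        cmd += ["--network", "host"]
    cmd += ["-t", tag, context]
    return cmd


if __name__ == "__main__":
    # Hypothetical tag and context; prints the command instead of running it.
    print(shlex.join(docker_build_command("spark-arm64:test", ".")))
```

Keeping the flag behind a parameter means the same helper could serve x86 workers that do not need the workaround.
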
>>>>>>>>>> Second, about the test time: we increased the flavor of the ARM
>>>>>>>>>> instance to 16U16G, but there seemed to be no significant
>>>>>>>>>> improvement. The k8s integration test took about one and a half
>>>>>>>>>> hours, and the QA test (like the
>>>>>>>>>> spark-master-test-maven-hadoop-2.7 community jenkins job) took about
>>>>>>>>>> seventeen hours (it is too long :( ). We suspect the cause is
>>>>>>>>>> performance and network.
>>>>>>>>>> When we split the jobs by project, such as sql, core and so on,
>>>>>>>>>> the time decreases to about seven hours, see
>>>>>>>>>> https://github.com/theopenlab/spark/pull/19 . We looked at the Spark
>>>>>>>>>> QA tests like
>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/ ,
>>>>>>>>>> and it looks like the tests never download the jar packages from the
>>>>>>>>>> Maven Central repo (such as
>>>>>>>>>> https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar).
>>>>>>>>>> So we want to know how the jenkins jobs do that. Is there an
>>>>>>>>>> internal Maven repo running? Maybe we can do the same thing to avoid
>>>>>>>>>> the network cost of downloading the dependency jar packages.
>>>>>>>>>>
>>>>>>>>>> Third, and most important, is the ARM CI for Spark itself. We
>>>>>>>>>> believe that it is necessary, right? You can see we have really put
>>>>>>>>>> in a lot of effort, and the basic ARM build/test jobs now work. So
>>>>>>>>>> we suggest adding ARM jobs to the community CI; we can set them to
>>>>>>>>>> non-voting at first and improve/enrich the jobs step by step.
>>>>>>>>>> Generally, we see two ways to integrate the ARM CI for Spark:
>>>>>>>>>>      1) We introduce openlab ARM CI into Spark as a custom CI
>>>>>>>>>> system. We provide human resources and test ARM VMs, and we will
>>>>>>>>>> focus on the ARM-related issues in Spark. We will push the PRs to
>>>>>>>>>> the community.
>>>>>>>>>>      2) We donate ARM VM resources to the existing amplab Jenkins.
>>>>>>>>>> We still provide human resources, focus on the ARM-related issues in
>>>>>>>>>> Spark, and push the PRs to the community.
>>>>>>>>>> With both options we will provide human resources for maintenance;
>>>>>>>>>> of course it would be great if we could work together. So please
>>>>>>>>>> tell us which option you would prefer, and let's move forward.
>>>>>>>>>> Waiting for your reply, thank you very much.
>>>>>>>>>>
>>>>>>>>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
