@Dongjoon Hyun <dongjoon.h...@gmail.com> , Sure, and I have updated the JIRA already :) https://issues.apache.org/jira/browse/SPARK-29106 If anything is missing, please let me know. Thank you.
On Thu, Sep 19, 2019 at 12:44 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, Tianhua.
>
> Could you summarize the details on the JIRA once more?
> It will be very helpful for the community. Also, I've been waiting on that
> JIRA. :)
>
> Bests,
> Dongjoon.
>
> On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>
>> @shane knapp <skn...@berkeley.edu> thank you very much, I opened an
>> issue for this: https://issues.apache.org/jira/browse/SPARK-29106, we can
>> talk through the details there :)
>> And we will prepare an arm instance today and will send the info to your
>> email later.
>>
>> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <skn...@berkeley.edu> wrote:
>>
>>> @Tianhua huang <huangtianhua...@gmail.com> sure, i think we can get
>>> something sorted for the short term.
>>>
>>> all we need is ssh access (i can provide an ssh key), and i can then
>>> have our jenkins master launch a remote worker on that instance.
>>>
>>> instance setup, etc., will be up to you. my support for the time being
>>> will be to create the job and 'best effort' for everything else.
>>>
>>> this should get us up and running asap.
>>>
>>> is there an open JIRA for jenkins/arm test support? we can move the
>>> technical details about this idea there.
>>>
>>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>
>>>> @Sean Owen <sro...@gmail.com> , so sorry for the late reply, we had a
>>>> Mid-Autumn holiday :)
>>>>
>>>> If you hope to integrate the ARM CI into the amplab jenkins, we can
>>>> offer the arm instance, and then the ARM job will run together with the
>>>> other x86 jobs. Is there a guideline for doing this? @shane knapp
>>>> <skn...@berkeley.edu> would you help us?
>>>>
>>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I don't know what's involved in actually accepting or operating those
>>>>> machines, so can't comment there, but in the meantime it's good that you
>>>>> are running these tests and can help report changes needed to keep it
>>>>> working with ARM. I would continue with that for now.
>>>>>
>>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> For the whole work process of the spark ARM CI, we want to make two
>>>>>> things clear.
>>>>>>
>>>>>> The first thing is:
>>>>>> About the spark ARM CI, we now have two periodic jobs: one job[1] is
>>>>>> based on commit[2] (which already fixed the replay test failures[3];
>>>>>> we made a new test branch based on date 09-09-2019), and the other
>>>>>> job[4] is based on spark master.
>>>>>>
>>>>>> The first job tests the specified branch to prove that our ARM CI is
>>>>>> good and stable.
>>>>>> The second job checks spark master every day, so we can find out
>>>>>> whether the latest commits affect the ARM CI. The build history and
>>>>>> results show that some problems are easier to find on ARM, like
>>>>>> SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>, and
>>>>>> that we are making the effort to trace them and figure them out; so
>>>>>> far we have found and fixed several problems[5][6][7], thanks to
>>>>>> everyone in the community :). And we believe that ARM CI is very
>>>>>> necessary, right?
>>>>>>
>>>>>> The second thing is:
>>>>>> We plan to run the jobs for a period of time, and you can see the
>>>>>> results and logs in the 'build history' of the jobs console. If
>>>>>> everything goes well for one or two weeks, could the community accept
>>>>>> the ARM CI? Or how long would the periodic jobs need to run before the
>>>>>> community has enough confidence to accept the ARM CI? As you suggested
>>>>>> before, it would be good to integrate the ARM CI into the amplab
>>>>>> jenkins; we agree, and we can donate the ARM instances and then
>>>>>> maintain the ARM-related test jobs together with the community. Any
>>>>>> thoughts?
>>>>>>
>>>>>> Thank you all!
>>>>>>
>>>>>> [1] http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>>>> [2] https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>>>> [4] http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>>>> [5] https://github.com/apache/spark/pull/25186
>>>>>> [6] https://github.com/apache/spark/pull/25279
>>>>>> [7] https://github.com/apache/spark/pull/25673
>>>>>>
>>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I think it's just local caching. After you run the build you
>>>>>>> should find lots of stuff cached at ~/.m2/repository and it won't
>>>>>>> download every time.
>>>>>>>
>>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>> Thanks for the reply, and apologies for the confusion. I know the
>>>>>>>> dependencies are downloaded by SBT or Maven. But the Spark QA job
>>>>>>>> also executes "mvn clean package", so why doesn't its log[1] show
>>>>>>>> any downloading of jars from Maven central, and why does it build so
>>>>>>>> fast? Is the reason that the Spark Jenkins builds the Spark jars on
>>>>>>>> physical machines and doesn't destroy the test env after a job
>>>>>>>> finishes? Then a later job building Spark would get the dependency
>>>>>>>> jars from the local cache, since previous jobs running "mvn package"
>>>>>>>> had already downloaded those dependencies onto the worker machine.
>>>>>>>> Am I right? Is that the reason the job log[1] doesn't print any
>>>>>>>> downloading information from Maven Central?
>>>>>>>>
>>>>>>>> Thank you very much.
>>>>>>>>
>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> ZhaoBo
>>>>>>>>
>>>>>>>> Sean Owen <sro...@gmail.com> wrote on Fri, Aug 16, 2019, 10:38 AM:
>>>>>>>>
>>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT
>>>>>>>>> and Maven like in any other project, and nothing about it is
>>>>>>>>> specific to Spark.
>>>>>>>>> The worker machines cache artifacts that are downloaded from these,
>>>>>>>>> but this is a function of Maven and SBT, not Spark. You may find
>>>>>>>>> that the initial download takes a long time.
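>>>>>>>>>
>>>>>>>>> As a minimal sketch of that caching behavior (assuming a fresh
>>>>>>>>> machine with network access and a Spark checkout; nothing here is
>>>>>>>>> specific to the Jenkins setup):
>>>>>>>>>
>>>>>>>>> # first build on a fresh machine: Maven prints "Downloading ..."
>>>>>>>>> # lines as it populates the local repository
>>>>>>>>> ./build/mvn -DskipTests clean package
>>>>>>>>> # the artifacts are now cached under the default local repo
>>>>>>>>> du -sh ~/.m2/repository
>>>>>>>>> # a second identical build resolves everything from the cache and
>>>>>>>>> # prints no "Downloading" lines
>>>>>>>>> ./build/mvn -DskipTests clean package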
>>>>>>>>>
>>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sean,
>>>>>>>>>>
>>>>>>>>>> Thanks very much for pointing out the roadmap. ;-). Then I think
>>>>>>>>>> we will continue to focus on our test environment.
>>>>>>>>>>
>>>>>>>>>> Regarding the networking problems: we can access Maven Central,
>>>>>>>>>> and the jobs can download the required jar packages at a high
>>>>>>>>>> network speed. What we want to know is why the Spark QA test
>>>>>>>>>> jobs'[1] logs show no sign of the job script/maven build
>>>>>>>>>> downloading the jar packages. Could you tell us the reason for
>>>>>>>>>> that? Thank you. The reason we raised the "networking problems" is
>>>>>>>>>> a phenomenon we noticed during testing: if we execute "mvn clean
>>>>>>>>>> package" in a new test environment (in our test environment we
>>>>>>>>>> destroy the test VMs after each job finishes), maven downloads the
>>>>>>>>>> dependency jar packages from Maven Central, but in the job
>>>>>>>>>> "spark-master-test-maven-hadoop"[2] the log shows no jar packages
>>>>>>>>>> being downloaded at all. What is the reason for that?
>>>>>>>>>> Also, when we build the Spark jar and download dependencies from
>>>>>>>>>> Maven Central, it takes almost 1 hour, while [2] takes just 10
>>>>>>>>>> min. But if we run "mvn package" in a VM that has already executed
>>>>>>>>>> "mvn package" before, it takes 14 min, which is very close to [2].
>>>>>>>>>> So we suspect that downloading the jar packages takes most of the
>>>>>>>>>> time. For the goal of the ARM CI, we expect the performance of the
>>>>>>>>>> new ARM CI to be close to the existing x86 CI, so users will
>>>>>>>>>> accept it more easily.
>>>>>>>>>>
>>>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>>> [2] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>>
>>>>>>>>>> Best regards
>>>>>>>>>>
>>>>>>>>>> ZhaoBo
>>>>>>>>>>
>>>>>>>>>> Sean Owen <sro...@gmail.com> wrote on Thu, Aug 15, 2019, 9:58 PM:
>>>>>>>>>>
>>>>>>>>>>> I think the right goal is to fix the remaining issues first. If
>>>>>>>>>>> we set up CI/CD it will only tell us there are still some test
>>>>>>>>>>> failures. If it's stable, and not hard to add to the existing
>>>>>>>>>>> CI/CD, yes it could be done automatically later. You can continue
>>>>>>>>>>> to test on ARM independently for now.
>>>>>>>>>>>
>>>>>>>>>>> It sounds indeed like there are some networking problems in the
>>>>>>>>>>> test system if you're not able to download from Maven Central.
>>>>>>>>>>> That rarely takes significant time, and there aren't
>>>>>>>>>>> project-specific mirrors here. You might be able to point at a
>>>>>>>>>>> closer public mirror, depending on where you are.
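>>>>>>>>>>>
>>>>>>>>>>> As a rough sketch of pointing at a mirror (the mirror id and URL
>>>>>>>>>>> below are placeholders, not a recommendation of any particular
>>>>>>>>>>> mirror):
>>>>>>>>>>>
>>>>>>>>>>> # write a minimal ~/.m2/settings.xml that routes requests for
>>>>>>>>>>> # Maven Central through a nearby mirror
>>>>>>>>>>> tee ~/.m2/settings.xml > /dev/null <<'EOF'
>>>>>>>>>>> <settings>
>>>>>>>>>>>   <mirrors>
>>>>>>>>>>>     <mirror>
>>>>>>>>>>>       <id>nearby-central</id>
>>>>>>>>>>>       <mirrorOf>central</mirrorOf>
>>>>>>>>>>>       <url>https://mirror.example.org/maven2</url>
>>>>>>>>>>>     </mirror>
>>>>>>>>>>>   </mirrors>
>>>>>>>>>>> </settings>
>>>>>>>>>>> EOF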
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to discuss the spark ARM CI again. We ran some tests on
>>>>>>>>>>>> an arm instance based on master; the job includes
>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/13 and the k8s
>>>>>>>>>>>> integration https://github.com/theopenlab/spark/pull/17/ .
>>>>>>>>>>>> There are several things I want to talk about:
>>>>>>>>>>>>
>>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>>> 1. We have fixed some problems, like
>>>>>>>>>>>> https://github.com/apache/spark/pull/25186 and
>>>>>>>>>>>> https://github.com/apache/spark/pull/25279; thanks to Sean Owen
>>>>>>>>>>>> and others for helping us.
>>>>>>>>>>>> 2. We tried the k8s integration test on arm and hit an error:
>>>>>>>>>>>> apk fetch hangs. The tests passed after adding the
>>>>>>>>>>>> '--network host' option to the `docker build` command, see:
>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176
>>>>>>>>>>>> ; the solution refers to
>>>>>>>>>>>> https://github.com/gliderlabs/docker-alpine/issues/307 . I don't
>>>>>>>>>>>> know whether this has ever happened in the community CI; maybe
>>>>>>>>>>>> we should submit a PR to pass '--network host' to
>>>>>>>>>>>> `docker build`?
>>>>>>>>>>>> 3. We found two tests that fail after the commit
>>>>>>>>>>>> https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>>> ReplayListenerSuite:
>>>>>>>>>>>> - ...
>>>>>>>>>>>> - End-to-end replay *** FAILED ***
>>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>> - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>
>>>>>>>>>>>> We tried reverting the commit and then the tests passed. The
>>>>>>>>>>>> patch is big, and we are sorry we haven't found the root cause
>>>>>>>>>>>> yet; if you are interested, please try it, and it would be much
>>>>>>>>>>>> appreciated if someone could help us figure it out.
>>>>>>>>>>>>
>>>>>>>>>>>> Second, about the test time: we increased the flavor of the arm
>>>>>>>>>>>> instance to 16U16G, but there was no significant improvement.
>>>>>>>>>>>> The k8s integration test took about one and a half hours, and
>>>>>>>>>>>> the QA test (like the spark-master-test-maven-hadoop-2.7
>>>>>>>>>>>> community jenkins job) took about seventeen hours (it is too
>>>>>>>>>>>> long :( ); we suspect the reasons are performance and network.
>>>>>>>>>>>> When we split the jobs by project, such as sql, core and so on
>>>>>>>>>>>> (see the sketch below), the time decreased to about seven hours,
>>>>>>>>>>>> see https://github.com/theopenlab/spark/pull/19 . We also looked
>>>>>>>>>>>> at the Spark QA tests like
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/ ,
>>>>>>>>>>>> and none of those tests ever seem to download jar packages from
>>>>>>>>>>>> the maven central repo (such as
>>>>>>>>>>>> https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar).
>>>>>>>>>>>> So we want to know how the jenkins jobs do that: is there an
>>>>>>>>>>>> internal maven repo running? Maybe we can do the same thing to
>>>>>>>>>>>> avoid the network cost of downloading the dependency jar
>>>>>>>>>>>> packages.
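>>>>>>>>>>>>
>>>>>>>>>>>> As a rough sketch, the per-project split above amounts to
>>>>>>>>>>>> running each module's tests separately with standard Maven
>>>>>>>>>>>> flags (the module names here are only illustrative):
>>>>>>>>>>>>
>>>>>>>>>>>> # install all modules once without running tests, so each test
>>>>>>>>>>>> # job can resolve sibling modules from the local repository
>>>>>>>>>>>> ./build/mvn -DskipTests install
>>>>>>>>>>>> # then run one module's tests per job, e.g. core or sql/core
>>>>>>>>>>>> ./build/mvn -pl core test
>>>>>>>>>>>> ./build/mvn -pl sql/core test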
>>>>>>>>>>>>
>>>>>>>>>>>> Third, the most important thing: the ARM CI of spark. We believe
>>>>>>>>>>>> it is necessary, right? And you can see we have really made a
>>>>>>>>>>>> lot of effort; the basic arm build/test jobs are now OK, so we
>>>>>>>>>>>> suggest adding arm jobs to the community. We can set them to
>>>>>>>>>>>> non-voting at first and improve/enrich the jobs step by step.
>>>>>>>>>>>> Generally, there are two ways in our mind to integrate the ARM
>>>>>>>>>>>> CI for spark:
>>>>>>>>>>>> 1) We introduce the openlab ARM CI into spark as a custom CI
>>>>>>>>>>>> system. We provide the human resources and test ARM VMs, and we
>>>>>>>>>>>> will also focus on the ARM-related issues about Spark. We will
>>>>>>>>>>>> push the PRs into the community.
>>>>>>>>>>>> 2) We donate ARM VM resources to the existing amplab Jenkins. We
>>>>>>>>>>>> still provide the human resources, focus on the ARM-related
>>>>>>>>>>>> issues about Spark, and push the PRs into the community.
>>>>>>>>>>>> With both options we will provide human resources for
>>>>>>>>>>>> maintenance; of course it would be great if we can work
>>>>>>>>>>>> together. So please tell us which option you would prefer, and
>>>>>>>>>>>> let's move forward. Waiting for your reply, thank you very much.
>>>>>>>>>>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>