Hi, Tianhua. Could you summarize the details on the JIRA once more? It will be very helpful for the community. Also, I've been waiting on that JIRA. :)
Bests, Dongjoon.

On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <huangtianhua...@gmail.com> wrote:

> @shane knapp <skn...@berkeley.edu> thank you very much, I opened an issue for this: https://issues.apache.org/jira/browse/SPARK-29106; we can talk about the details in it :)
> And we will prepare an arm instance today and will send the info to your email later.
>
> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <skn...@berkeley.edu> wrote:
>
>> @Tianhua huang <huangtianhua...@gmail.com> sure, i think we can get something sorted for the short-term.
>>
>> all we need is ssh access (i can provide an ssh key), and i can then have our jenkins master launch a remote worker on that instance.
>>
>> instance setup, etc., will be up to you. my support for the time being will be to create the job and 'best effort' for everything else.
>>
>> this should get us up and running asap.
>>
>> is there an open JIRA for jenkins/arm test support? we can move the technical details about this idea there.
>>
>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>
>>> @Sean Owen <sro...@gmail.com>, so sorry to reply late, we had a Mid-Autumn holiday :)
>>>
>>> If you hope to integrate ARM CI into amplab jenkins, we can offer the arm instance, and then the ARM job will run together with the other x86 jobs. Maybe there is a guideline for doing this? @shane knapp <skn...@berkeley.edu> would you help us?
>>>
>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I don't know what's involved in actually accepting or operating those machines, so can't comment there, but in the meantime it's good that you are running these tests and can help report changes needed to keep it working with ARM. I would continue with that for now.
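Shane's short-term plan above (ssh access plus a master-launched remote worker) can be sketched roughly as follows. This is a hedged sketch, not the amplab setup: the user, hostname, and key filenames are hypothetical, and the actual agent launch is normally configured through the Jenkins SSH build agent plugin rather than by hand.

```shell
# On the ARM instance (hypothetical 'jenkins' user): authorize the
# public key that the Jenkins admin provides.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
cat jenkins_worker_key.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# From the master's side, the SSH agent plugin essentially opens an ssh
# session and starts the remoting agent there. A manual smoke test of
# the connectivity and Java runtime it needs (hostname is hypothetical):
ssh -i jenkins_worker_key jenkins@arm-instance.example.org 'java -version'
```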
>>>>
>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> For the whole work process of spark ARM CI, we want to make two things clear.
>>>>>
>>>>> The first thing is:
>>>>> About spark ARM CI, we now have two periodic jobs: one job[1] is based on commit[2] (which already fixed the failed replay tests issue[3]; we made a new test branch based on date 09-09-2019), and the other job[4] is based on spark master.
>>>>>
>>>>> The first job tests on the specified branch to prove that our ARM CI is good and stable.
>>>>> The second job checks spark master every day, so we can find out whether the latest commits affect the ARM CI. The build history and results show that some problems are easier to find on ARM, like SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>, and that we are making efforts to trace and fix them; so far we have found and fixed several problems[5][6][7], thanks to everyone in the community :). And we believe that ARM CI is very necessary, right?
>>>>>
>>>>> The second thing is:
>>>>> We plan to run the jobs for a period of time, and you can see the results and logs in the 'build history' of the jobs' consoles. If everything goes well for one or two weeks, could the community accept the ARM CI? Or how long should the periodic jobs run before the community has enough confidence to accept the ARM CI? As you suggested before, it's good to integrate ARM CI into amplab jenkins; we agree, and we can donate the ARM instances and then maintain the ARM-related test jobs together with the community. Any thoughts?
>>>>>
>>>>> Thank you all!
>>>>>
>>>>> [1] http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>>> [2] https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>>> [4] http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>>> [5] https://github.com/apache/spark/pull/25186
>>>>> [6] https://github.com/apache/spark/pull/25279
>>>>> [7] https://github.com/apache/spark/pull/25673
>>>>>
>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> Yes, I think it's just local caching. After you run the build you should find lots of stuff cached at ~/.m2/repository, and it won't download every time.
>>>>>>
>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sean,
>>>>>>> Thanks for the reply, and apologies for the confusion.
>>>>>>> I know the dependencies will be downloaded by SBT or Maven. But the Spark QA job also executes "mvn clean package", so why doesn't the log[1] print "downloading some jar from Maven central", and why does it build very fast? Is the reason that Spark Jenkins builds the Spark jars on physical machines and doesn't destroy the test env after the job is finished? Then other jobs building Spark get the dependency jars from the local cache, since the previous jobs executed "mvn package" and those dependencies had already been downloaded on the local worker machine. Am I right? Is that the reason the job log[1] doesn't print any downloading information from Maven Central?
>>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> ZhaoBo
>>>>>>>
>>>>>>> On Fri, Aug 16, 2019 at 10:38 AM, Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT and Maven like in any other project, and nothing about it is specific to Spark.
>>>>>>>> The worker machines cache artifacts that are downloaded from these, but this is a function of Maven and SBT, not Spark. You may find that the initial download takes a long time.
>>>>>>>>
>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> Thanks very much for pointing out the roadmap. ;-). Then I think we will continue to focus on our test environment.
>>>>>>>>>
>>>>>>>>> For the networking problems, I mean that we can access Maven Central, and jobs could download the required jar packages at a high network speed. What we want to know is: why do the logs of the Spark QA test jobs[1] show that the job script/maven build doesn't seem to download the jar packages? Could you tell us the reason for that? Thank you.
>>>>>>>>> The reason we raise the "networking problems" is a phenomenon we found during our tests: if we execute "mvn clean package" in a new test environment (in our test environment, we destroy the test VMs after the job is finished), maven will download the dependency jar packages from Maven Central; but in the job "spark-master-test-maven-hadoop" [2], from the log, we didn't see it download any jar packages. What is the reason for that?
>>>>>>>>> Also, when we build the Spark jar and download the dependencies from Maven Central, it costs almost 1 hour, while we found [2] costs only 10 min. But if we run "mvn package" in a VM which already executed "mvn package" before, it costs only 14 min, very close to [2]. So we suspect that downloading the jar packages costs so much time. For the goal of ARM CI, we expect the performance of the NEW ARM CI to be close to the existing X86 CI, so users can accept it more easily.
>>>>>>>>>
>>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>> [2] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> ZhaoBo
>>>>>>>>>
>>>>>>>>> On Thu, Aug 15, 2019 at 9:58 PM, Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think the right goal is to fix the remaining issues first.
>>>>>>>>>> If we set up CI/CD now, it will only tell us there are still some test failures. If it's stable, and not hard to add to the existing CI/CD, yes, it could be done automatically later. You can continue to test on ARM independently for now.
>>>>>>>>>>
>>>>>>>>>> It sounds indeed like there are some networking problems in the test system if you're not able to download from Maven Central. That rarely takes significant time, and there aren't project-specific mirrors here. You might be able to point at a closer public mirror, depending on where you are.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I want to discuss spark ARM CI again. We ran some tests on an arm instance based on master; the job includes https://github.com/theopenlab/spark/pull/13 and the k8s integration https://github.com/theopenlab/spark/pull/17/. There are several things I want to talk about:
>>>>>>>>>>>
>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>> 1. we have fixed some problems, like https://github.com/apache/spark/pull/25186 and https://github.com/apache/spark/pull/25279; thanks to Sean Owen and others for helping us.
>>>>>>>>>>> 2. we tried the k8s integration test on arm and met an error: apk fetch hangs. The tests passed after adding the '--network host' option to the `docker build` command, see: https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176; the solution refers to https://github.com/gliderlabs/docker-alpine/issues/307. I don't know whether it has ever happened in the community CI, or maybe we should submit a PR to pass '--network host' to `docker build`?
>>>>>>>>>>> 3. we found two tests failed after the commit https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>> ReplayListenerSuite:
>>>>>>>>>>> - ...
>>>>>>>>>>> - End-to-end replay *** FAILED ***
>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>> - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>
>>>>>>>>>>> We tried reverting the commit and then the tests passed. The patch is too big and, sorry, we haven't found the reason yet; if you are interested please try it, and it will be much appreciated if someone can help us figure it out.
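The apk-fetch workaround in item 2 above amounts to running the image's build-time steps on the host network. A minimal sketch (the image tag and Dockerfile path are hypothetical; Spark's k8s integration tests actually drive `docker build` through their own scripts):

```shell
# By default, 'docker build' runs RUN steps on Docker's bridge network;
# on some hosts, DNS or MTU problems there make 'apk fetch' hang.
# '--network host' makes the build-time RUN steps use the host's network
# stack instead, which was the fix that let the tests pass.
docker build --network host -t spark-k8s-test:latest -f Dockerfile .
```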
>>>>>>>>>>>
>>>>>>>>>>> Second, about the test time: we increased the flavor of the arm instance to 16U16G, but there seems to be no significant improvement. The k8s integration test took about one and a half hours, and the QA test (like the spark-master-test-maven-hadoop-2.7 community jenkins job) took about seventeen hours (it is too long :(). We suspect the reason is the performance and the network. We split the jobs based on projects such as sql, core and so on, and the time can be decreased to about seven hours, see https://github.com/theopenlab/spark/pull/19. We looked at the Spark QA tests, https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/, and it looks like the tests never download the jar packages from the maven central repo (such as https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar). So we want to know how the jenkins jobs do that. Is there an internal maven repo launched? Maybe we can do the same thing to avoid the network cost of downloading the dependent jar packages.
>>>>>>>>>>>
>>>>>>>>>>> Third, the most important thing: it's about the ARM CI of spark. We believe that it is necessary, right? And you can see we really made a lot of effort. Now the basic arm build/test jobs are ok, so we suggest adding arm jobs to the community; we can set them to non-voting at first, and improve/enrich the jobs step by step. Generally, there are two ways in our mind to integrate the ARM CI for spark:
>>>>>>>>>>> 1) We introduce openlab ARM CI into spark as a custom CI system.
We provide human resources and test ARM VMs, also we will >>>>>>>>>>> focus on >>>>>>>>>>> the ARM related issues about Spark. We will push the PR into >>>>>>>>>>> community. >>>>>>>>>>> 2) We donate ARM VM resources into existing amplab Jenkins. >>>>>>>>>>> We still provide human resources, focus on the ARM related issues >>>>>>>>>>> about >>>>>>>>>>> Spark and push the PR into community. >>>>>>>>>>> Both options, we will provide human resources to maintain, of >>>>>>>>>>> course it will be great if we can work together. So please tell us >>>>>>>>>>> which >>>>>>>>>>> option you would like? And let's move forward. Waiting for your >>>>>>>>>>> reply, >>>>>>>>>>> thank you very much. >>>>>>>>>>> >>>>>>>>>> >> >> -- >> Shane Knapp >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> >