Re: Ask for ARM CI for spark

Shane Knapp Mon, 16 Sep 2019 13:40:21 -0700

@Tianhua huang <[email protected]> sure, i think we can get
something sorted for the short-term.


all we need is ssh access (i can provide an ssh key), and i can then have
our jenkins master launch a remote worker on that instance.

instance setup, etc, will be up to you.  my support for the time being will
be to create the job and 'best effort' for everything else.

this should get us up and running asap.

is there an open JIRA for jenkins/arm test support?  we can move the
technical details about this idea there.

On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <[email protected]>
wrote:

> @Sean Owen <[email protected]> , sorry for the late reply, we had the
> Mid-Autumn holiday :)
>
> If you hope to integrate ARM CI into amplab Jenkins, we can offer the ARM
> instance, and then the ARM job will run together with the other x86 jobs.
> Is there a guideline for doing this? @shane knapp <[email protected]>
> could you help us?
>
> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <[email protected]> wrote:
>
>> I don't know what's involved in actually accepting or operating those
>> machines, so can't comment there, but in the meantime it's good that you
>> are running these tests and can help report changes needed to keep it
>> working with ARM. I would continue with that for now.
>>
>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> For the whole work process of spark ARM CI, we want to make 2 things
>>> clear.
>>>
>>> The first thing is:
>>> For Spark ARM CI, we now have two periodic jobs: one job [1] based on
>>> commit [2] (which already includes the fix for the failing replay tests
>>> [3]; we made a new test branch as of 09-09-2019), and the other job [4]
>>> based on Spark master.
>>>
>>> With the first job, we test on the specified branch to prove that our ARM
>>> CI is good and stable.
>>> The second job checks Spark master every day, so we can find out whether
>>> the latest commits affect the ARM CI. The build history and results show
>>> that some problems are easier to find on ARM, like
>>> SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>, and they
>>> also show that we make the effort to trace and figure them out. So far
>>> we have found and fixed several problems [5][6][7]; thanks to everyone in
>>> the community :). And we believe that ARM CI is very necessary, right?
>>>
>>> The second thing is:
>>> We plan to run the jobs for a period of time, and you can see the results
>>> and logs in the 'build history' of the jobs console. If everything goes
>>> well for one or two weeks, could the community accept the ARM CI? Or how
>>> long should the periodic jobs run before the community has enough
>>> confidence to accept the ARM CI? As you suggested before, it's good to
>>> integrate ARM CI into amplab Jenkins. We agree with that, and we can
>>> donate the ARM instances and then maintain the ARM-related test jobs
>>> together with the community. Any thoughts?
>>>
>>> Thank you all!
>>>
>>> [1]
>>> http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>> [2]
>>> https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>> [4]
>>> http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>> [5] https://github.com/apache/spark/pull/25186
>>> [6] https://github.com/apache/spark/pull/25279
>>> [7] https://github.com/apache/spark/pull/25673
>>>
>>>
>>>
>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <[email protected]> wrote:
>>>
>>>> Yes, I think it's just local caching. After you run the build you
>>>> should find lots of stuff cached at ~/.m2/repository and it won't download
>>>> every time.
>>>>
>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Sean,
>>>>> Thanks for the reply, and apologies for the confusion.
>>>>> I know the dependencies are downloaded by SBT or Maven. But the
>>>>> Spark QA job also executes "mvn clean package", so why doesn't its log
>>>>> print "downloading some jar from Maven Central" [1], and why is the
>>>>> build so fast? Is the reason that Spark Jenkins builds the Spark jars on
>>>>> physical machines and doesn't destroy the test environment after a job
>>>>> finishes? Then other jobs that build Spark would get the dependency jars
>>>>> from the local cache, since previous jobs that executed "mvn package"
>>>>> had already downloaded them to the local worker machine. Am I right? Is
>>>>> that why the job log [1] didn't print any download information from
>>>>> Maven Central?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> [1]
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>
>>>>>
>>>>> Best regards
>>>>>
>>>>> ZhaoBo
>>>>>
>>>>> On Fri, Aug 16, 2019 at 10:38 AM Sean Owen <[email protected]> wrote:
>>>>>
>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT
>>>>>> and Maven like in any other project, and nothing about it is specific to
>>>>>> Spark.
>>>>>> The worker machines cache artifacts that are downloaded from these,
>>>>>> but this is a function of Maven and SBT, not Spark. You may find that the
>>>>>> initial download takes a long time.
>>>>>>
>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Sean,
>>>>>>>
>>>>>>> Thanks very much for pointing out the roadmap. ;-). Then I think we
>>>>>>> will continue to focus on our test environment.
>>>>>>>
>>>>>>> As for the networking problems, I mean that we can access Maven
>>>>>>> Central, and jobs could download the required jar packages at a high
>>>>>>> network speed. What we want to know is why the logs of the Spark QA
>>>>>>> test jobs [1] show that the job script/Maven build doesn't seem to
>>>>>>> download the jar packages. Could you tell us the reason? Thank you.
>>>>>>> We raised the "networking problems" because of something we noticed
>>>>>>> during testing: if we execute "mvn clean package" in a new test
>>>>>>> environment (in our test environment, we destroy the test VMs after
>>>>>>> the job finishes), Maven downloads the dependency jar packages from
>>>>>>> Maven Central. But in the log of the job
>>>>>>> "spark-master-test-maven-hadoop" [2], we didn't find it downloading
>>>>>>> any jar packages. What is the reason for that?
>>>>>>> Also, when we build the Spark jar while downloading dependencies from
>>>>>>> Maven Central, it takes roughly 1 hour, whereas [2] takes just 10 min.
>>>>>>> But if we run "mvn package" in a VM that has already executed "mvn
>>>>>>> package" before, it takes just 14 min, which is very close to [2]. So
>>>>>>> we suspect that downloading the jar packages accounts for most of the
>>>>>>> time. For the goal of ARM CI, we expect the performance of the new ARM
>>>>>>> CI to be close to that of the existing x86 CI, so that users can accept
>>>>>>> it more easily.
>>>>>>>
>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>> [2]
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> ZhaoBo
>>>>>>>
>>>>>>> On Thu, Aug 15, 2019 at 9:58 PM Sean Owen <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think the right goal is to fix the remaining issues first. If we
>>>>>>>> set up CI/CD it will only tell us there are still some test failures.
>>>>>>>> If it's stable, and not hard to add to the existing CI/CD, yes it could
>>>>>>>> be done automatically later. You can continue to test on ARM
>>>>>>>> independently for now.
>>>>>>>>
>>>>>>>> It sounds indeed like there are some networking problems in the
>>>>>>>> test system if you're not able to download from Maven Central. That
>>>>>>>> rarely takes significant time, and there aren't project-specific
>>>>>>>> mirrors here. You might be able to point at a closer public mirror,
>>>>>>>> depending on where you are.
>>>>>>>>
>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I want to discuss Spark ARM CI again. We ran some tests on an ARM
>>>>>>>>> instance based on master; the jobs include
>>>>>>>>> https://github.com/theopenlab/spark/pull/13 and k8s integration
>>>>>>>>> https://github.com/theopenlab/spark/pull/17/ . There are several
>>>>>>>>> things I want to talk about:
>>>>>>>>>
>>>>>>>>> First, about the failed tests:
>>>>>>>>>     1. We have fixed some problems, such as
>>>>>>>>> https://github.com/apache/spark/pull/25186 and
>>>>>>>>> https://github.com/apache/spark/pull/25279 . Thanks to Sean Owen and
>>>>>>>>> others for helping us.
>>>>>>>>>     2. We tried the k8s integration test on ARM and hit an error: apk
>>>>>>>>> fetch hangs. The tests passed after adding the '--network host' option
>>>>>>>>> to the `docker build` command, see:
>>>>>>>>>
>>>>>>>>> https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176
>>>>>>>>> . The solution follows
>>>>>>>>> https://github.com/gliderlabs/docker-alpine/issues/307 . I
>>>>>>>>> don't know whether this has ever happened in the community CI, or maybe
>>>>>>>>> we should submit a PR to pass '--network host' to `docker build`?
>>>>>>>>>     3. We found that two tests fail after the commit
>>>>>>>>> https://github.com/apache/spark/pull/23767 :
>>>>>>>>>        ReplayListenerSuite:
>>>>>>>>>        - ...
>>>>>>>>>        - End-to-end replay *** FAILED ***
>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>        - End-to-end replay with compression *** FAILED ***
>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>
>>>>>>>>>         We tried reverting the commit, and then the tests passed.
>>>>>>>>> The patch is too big, and we are sorry that we haven't found the cause
>>>>>>>>> so far. If you are interested, please try it; we would very much
>>>>>>>>> appreciate it if someone could help us figure it out.
>>>>>>>>>
>>>>>>>>> Second, about the test time: we increased the flavor of the ARM
>>>>>>>>> instance to 16U16G, but there seemed to be no significant improvement.
>>>>>>>>> The k8s integration test took about one and a half hours, and the QA
>>>>>>>>> test (like the spark-master-test-maven-hadoop-2.7 community Jenkins
>>>>>>>>> job) took about seventeen hours (it is too long :(). We suspect that
>>>>>>>>> the reason is performance and network.
>>>>>>>>> When we split the jobs by project, such as sql, core and so on,
>>>>>>>>> the time decreases to about seven hours, see
>>>>>>>>> https://github.com/theopenlab/spark/pull/19 . Looking at the Spark QA
>>>>>>>>> tests, like
>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/ ,
>>>>>>>>> it looks like none of the tests ever download the jar packages from
>>>>>>>>> the Maven Central repo (such as
>>>>>>>>> https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar).
>>>>>>>>> So we want to know how the Jenkins jobs can do that. Is there an
>>>>>>>>> internal Maven repo running? Maybe we can do the same thing to avoid
>>>>>>>>> the network cost of downloading the dependency jar packages.
>>>>>>>>>
>>>>>>>>> Third, and most importantly, about ARM CI for Spark: we
>>>>>>>>> believe that it is necessary, right? And you can see we have really
>>>>>>>>> put in a lot of effort. Now the basic ARM build/test jobs are OK, so
>>>>>>>>> we suggest adding ARM jobs to the community CI. We can set them to
>>>>>>>>> non-voting at first, and improve/enrich the jobs step by step.
>>>>>>>>> Generally, we have two ways in mind to integrate ARM CI for Spark:
>>>>>>>>>      1) We introduce openlab ARM CI into Spark as a custom CI
>>>>>>>>> system. We provide human resources and test ARM VMs, and we will also
>>>>>>>>> focus on the ARM-related issues in Spark. We will push the PRs to the
>>>>>>>>> community.
>>>>>>>>>      2) We donate ARM VM resources to the existing amplab Jenkins.
>>>>>>>>> We still provide human resources, focus on the ARM-related issues in
>>>>>>>>> Spark, and push the PRs to the community.
>>>>>>>>> With both options, we will provide human resources for maintenance; of
>>>>>>>>> course it will be great if we can work together. So please tell us
>>>>>>>>> which option you would like, and let's move forward. Waiting for your
>>>>>>>>> reply, thank you very much.
>>>>>>>>>
>>>>>>>>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
