hello from the canary islands!  ;)

i just saw this thread, and another one about a quick power loss at the
colo where our machines are hosted.  the master is on UPS but the workers
aren't...  and when they come back up, the PATH variable specified in the
workers' configs gets dropped, and we see behavior like this.

josh rosen (whom i am talking with over chat) will be restarting the
ssh/worker processes on all of the worker nodes immediately.  this will fix
the problem.
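
for the curious, the fix amounts to something like this on each worker
(the host and service names here are illustrative, not our actual configs):

# illustrative sketch -- real host/service names differ
for w in worker-01 worker-02; do
  ssh "$w" 'sudo systemctl restart jenkins-worker'
done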

now, back to my holiday!  :)

On Sun, Nov 5, 2017 at 5:01 PM, Xin Lu <x...@salesforce.com> wrote:

> Also, another thing to look at is whether you guys have any kind of
> nightly cleanup script for these workers that completely nukes the conda
> environments.  If there is one, maybe that's why some of them recover after
> a while.  I don't know enough about your infra right now to understand all
> the things that could cause the current unstable behavior, so these are just
> some guesses.  Anyway, I sent a previous email about running spark tests in
> docker and no one responded.  At Databricks the whole build infra for
> running spark tests was very different: the tests ran in docker, on a
> jenkins dedicated to them.  Perhaps that's something that could be
> replicated for OSS.
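>
> For concreteness, a minimal sketch of that kind of setup (the image name
> and mount are my assumptions, not the actual Databricks configs; Spark's
> real test entry point is dev/run-tests):
>
> docker run --rm \
>   -v "$WORKSPACE":/spark \
>   -w /spark \
>   spark-test-env:latest \
>   ./dev/run-tests
>
> Each run then starts from a clean image, so no job can corrupt the
> environment for the next one.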
>
> On Sun, Nov 5, 2017 at 8:45 AM, Xin Lu <x...@salesforce.com> wrote:
>
>> So, right now it looks like workers 2 and 6 are still broken, but 7 has
>> recovered:
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/SparkPullRequestBuilder/buildTimeTrend
>>
>> What I am suggesting is to modify the SparkPullRequestBuilder
>> configuration to run "which python" and then "python -V", to see what the
>> pull request builder is seeing before it exits.  Perhaps the
>> SparkPullRequestBuilder jobs are erroneously targeting a different conda
>> environment because you have multiple nodes on each worker.  It looks like
>> some build is changing the environment, and that's causing the workers to
>> break and recover somewhat randomly.
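>>
>> For example, something like this at the top of the build step (the PATH
>> echo is an extra I'd add for context):
>>
>> which python
>> python -V
>> echo "PATH=$PATH"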
>>
>> Xin
>>
>> On Sun, Nov 5, 2017 at 8:29 AM, Alyssa Morrow <morrowaly...@gmail.com>
>> wrote:
>>
>>> Hi Xin,
>>>
>>> The full extent of the exports our projects set is:
>>>
>>> export JAVA_HOME=/usr/java/jdk1.8.0_60
>>> export CONDA_BIN=/home/anaconda/bin/
>>> export MVN_BIN=/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.1.1/bin/
>>> export PATH=${JAVA_HOME}/bin/:${MVN_BIN}:${CONDA_BIN}:${PATH}
>>>
>>> As for python, *which python* gives us the python installed in the conda
>>> virtual environment:
>>> ~/.conda/envs/buildxxxx/bin/python
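>>>
>>> That path is what you'd see after the per-build environment is activated,
>>> roughly like this (the create command is my assumption; "buildxxxx" stands
>>> in for the real per-build name):
>>>
>>> conda create -y -n buildxxxx python=2.7
>>> source activate buildxxxx   # prepends ~/.conda/envs/buildxxxx/bin to PATH
>>> which python                # -> ~/.conda/envs/buildxxxx/bin/python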
>>>
>>> These steps look similar to how spark sets up its build. Not sure if
>>> this helps. Let me know if any other information would be helpful.
>>>
>>> Best,
>>>
>>> Alyssa Morrow
>>> akmor...@berkeley.edu
>>> 414-254-6645
>>>
>>> On Nov 5, 2017, at 8:15 AM, Xin Lu <x...@salesforce.com> wrote:
>>>
>>> Thanks, I actually don't have access to the machines or build configs to
>>> do proper debugging on this.  It looks like these workers are shared with
>>> other build configurations like avocado and cannoli as well, and really any
>>> of the shared configs could be changing your JAVA_HOME and python
>>> environments.  It is fairly easy to debug if you can just change the spark
>>> build to run "which python" and run it on one of the currently broken
>>> machines.
>>>
>>> On Sat, Nov 4, 2017 at 11:50 PM, Frank Austin Nothaft <
>>> fnoth...@berkeley.edu> wrote:
>>>
>>>> Hi Xin!
>>>>
>>>> Alyssa and I chatted just now and both reviewed the mango build
>>>> scripts. We don’t see anything in the mango build scripts that looks
>>>> concerning. To give a bit more context, Mango is a Spark-based application
>>>> for visualizing genomics data that is built in Scala, but which has python
>>>> language bindings and a node.js frontend. During CI, the mango build runs
>>>> the following steps (sketched in shell form after the list):
>>>>
>>>> • Creates a temp directory
>>>> • Runs maven to build the Java artifacts
>>>> • Copies the built artifacts into the temp directory, and cd’s into the
>>>> temp directory. Inside the temp directory, we:
>>>>   • Create a temporary conda environment and install node.js into it
>>>>   • Pull down a pre-built distribution of Spark
>>>>   • Run our python build
>>>> • Once this is done, we:
>>>>   • Deactivate and remove the conda environment
>>>>   • Delete the temp directory
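>>>>
>>>> In shell form, the whole flow is roughly the following (paths, env name,
>>>> and the python build command are illustrative, not the exact mango
>>>> scripts):
>>>>
>>>> TMP_DIR=$(mktemp -d)
>>>> mvn package                                # build the Java artifacts
>>>> cp target/*.jar "$TMP_DIR"
>>>> cd "$TMP_DIR"
>>>> conda create -y -n mango-build nodejs      # temp env with node.js
>>>> source activate mango-build
>>>> # (pull down a pre-built Spark distribution here)
>>>> # (run the python build from inside the temp directory)
>>>> source deactivate
>>>> conda remove -y -n mango-build --all       # remove the conda environment
>>>> cd /tmp && rm -rf "$TMP_DIR"               # delete the temp directory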
>>>>
>>>> This is very similar to the ADAM build, which has been running Python
>>>> builds since mid-summer. We don’t manipulate any python dependencies
>>>> outside of the conda environment, which we delete at the end of the build,
>>>> so we are pretty confident that we’re not doing anything that should be
>>>> breaking the PySpark builds.
>>>>
>>>> To help debug, it would be useful if you could provide the path to the
>>>> Python executables that get run during both a good and a bad build, as well
>>>> as the Python versions. From our side (mango/ADAM), we’ve seen some oddness
>>>> over the last few months with the environment on some of the Jenkins
>>>> executors (things like the JAVA_HOME getting changed), but we haven’t been
>>>> able to root cause those issues.
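>>>>
>>>> (For instance, a few lines like these in the build step would capture
>>>> that per run; just a sketch:)
>>>>
>>>> echo "node=$(hostname) python=$(which python)"
>>>> python --version
>>>> echo "JAVA_HOME=${JAVA_HOME}"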
>>>>
>>>> Regards,
>>>>
>>>> Frank Austin Nothaft
>>>> fnoth...@berkeley.edu
>>>> fnoth...@eecs.berkeley.edu
>>>> 202-340-0466
>>>>
>>>> On Nov 4, 2017, at 10:50 PM, Frank A Nothaft <fnoth...@berkeley.edu>
>>>> wrote:
>>>>
>>>> Hi Xin!
>>>>
>>>> Mango does install python dependencies, but they should all be inside
>>>> a conda environment. My guess is that somewhere in the mango Jenkins
>>>> build, something is getting installed outside of the conda environment.
>>>> I'll be looking into this shortly.
>>>>
>>>> Regards,
>>>>
>>>> Frank Austin Nothaft
>>>>
>>>> On Nov 4, 2017, at 9:25 PM, Xin Lu <x...@salesforce.com> wrote:
>>>>
>>>> I'm not entirely sure if it's the cause because I can't see the build
>>>> configurations, but just looking at the build logs it looks like they share
>>>> a pool and those mango builds run some setup with python.
>>>>
>>>> On Sat, Nov 4, 2017 at 9:19 PM, Frank Austin Nothaft <
>>>> fnoth...@berkeley.edu> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> Alyssa (cc’ed) and I manage the mango build on the AMPLab Jenkins. I
>>>>> will start looking into this to see what the connection between the mango
>>>>> builds and the failing Spark builds is.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Frank Austin Nothaft
>>>>> fnoth...@berkeley.edu
>>>>> fnoth...@eecs.berkeley.edu
>>>>> 202-340-0466
>>>>>
>>>>> On Nov 4, 2017, at 9:15 PM, Xin Lu <x...@salesforce.com> wrote:
>>>>>
>>>>> Sorry, mango wasn't added recently, but it looks like after successful
>>>>> builds of this specific configuration, the workers break:
>>>>>
>>>>> https://amplab.cs.berkeley.edu/jenkins/job/mango/HADOOP_VERSION=2.6.0,SCALAVER=2.11,SPARK_VERSION=1.6.1,label=centos/
>>>>>
>>>>> And then after another configuration runs it recovers.
>>>>>
>>>>> Xin
>>>>>
>>>>> On Sat, Nov 4, 2017 at 9:09 PM, Xin Lu <x...@salesforce.com> wrote:
>>>>>
>>>>>> It has happened with other workers as well, namely 3 and 4, which then
>>>>>> recovered. Looking at the build history, it looks like a project called
>>>>>> mango has recently been added to this pool of machines:
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/mango/
>>>>>>
>>>>>> It looks like the slaves start to fail spark pull request builds
>>>>>> after some runs of mango.
>>>>>>
>>>>>> Xin
>>>>>>
>>>>>> On Sat, Nov 4, 2017 at 1:23 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I assume it is as it says:
>>>>>>>
>>>>>>> Python versions prior to 2.7 are not supported.
>>>>>>>
>>>>>>>
>>>>>>> Looks like this happens on workers 2, 6 and 7, from my observation.
>>>>>>>
>>>>>>>
>>>>>>> On 4 Nov 2017 5:15 pm, "Sean Owen" <so...@cloudera.com> wrote:
>>>>>>>
>>>>>>> Agree, seeing this somewhat regularly on the pull request builder.
>>>>>>> Do some machines inadvertently have Python 2.6? Some builds succeed, so it
>>>>>>> may just be one or a few. CC Shane.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 2, 2017 at 5:39 PM Pralabh Kumar <pralabhku...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Dev
>>>>>>>>
>>>>>>>> Spark build is failing in Jenkins
>>>>>>>>
>>>>>>>>
>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83353/consoleFull
>>>>>>>>
>>>>>>>>
>>>>>>>> Python versions prior to 2.7 are not supported.
>>>>>>>> Build step 'Execute shell' marked build as failure
>>>>>>>> Archiving artifacts
>>>>>>>> Recording test results
>>>>>>>> ERROR: Step 'Publish JUnit test result report' failed: No test report
>>>>>>>> files were found. Configuration error?
>>>>>>>>
>>>>>>>>
>>>>>>>> Please help
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Pralabh Kumar
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
