alright, the system load graphs show that we've had a generally decreasing
load since friday, and have burned through ~3k builds/day since the reboot
last week!  i don't see many timeouts, and the PRB builds have been
generally green for a couple of days.

again, i will keep an eye on things but i feel we're out of the woods right
now.  :)

shane

On Fri, Jul 10, 2020 at 3:43 PM Frank Yin <ukby.1...@gmail.com> wrote:

> Great. Thanks.
>
> On Fri, Jul 10, 2020 at 3:39 PM shane knapp ☠ <skn...@berkeley.edu> wrote:
>
>> no, 8 hours is plenty.  things will speed up soon once the backlog of
>> builds works through....  i limited the number of PRB builds to 4 per
>> worker, and things are looking better.  let's see how we look next week.
>>
>> On Fri, Jul 10, 2020 at 3:31 PM Frank Yin <ukby.1...@gmail.com> wrote:
>>
>>> Can we also increase the build timeout?
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
>>> This one fails because it times out, not because of test failures.
>>>
>>> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin <ukby.1...@gmail.com> wrote:
>>>
>>>> Yeah, that's what I figured -- those workers are under load. Thanks.
>>>>
>>>> On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ <skn...@berkeley.edu>
>>>> wrote:
>>>>
>>>>> only 125561, 125562 and 125564 were impacted by -9.
>>>>>
>>>>> 125565 exited w/a status of 143, i.e. signal 15 (143 - 128, SIGTERM),
>>>>> which means the process was terminated for unknown reasons.
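
The "143 - 128" arithmetic above follows the usual shell convention that a process killed by signal N is reported with exit status 128 + N. Below is a minimal Python sketch of that decoding. The function name `describe_exit_status` is hypothetical and is not part of Jenkins or the Spark build scripts. The signal names come from Python's standard `signal` module and assume a POSIX host.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status (hypothetical helper, for illustration only)."""
    if status == 0:
        return "success"
    if status > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with error code {status}"


if __name__ == "__main__":
    # 143 is the status discussed for build 125565; 137 is what 'kill -9' produces.
    for status in (143, 137):
        print(status, "->", describe_exit_status(status))
```

Under this convention 143 maps to signal 15 (SIGTERM, the default signal sent by `kill`) and 137 maps to signal 9 (SIGKILL). The status alone does not say who sent the signal.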
>>>>>
>>>>> 125563 looks like mima failed due to a bunch of errors.
>>>>>
>>>>> i just spot checked a bunch of recent failed PRB builds from today and
>>>>> they all seemed to be legit.
>>>>>
>>>>> another thing that might be happening is an overload of PRB builds on
>>>>> the workers due to the backlog...  the workers are under a LOT of load
>>>>> right now, and i can put some rate limiting in to see if that helps out.
>>>>>
>>>>> shane
>>>>>
>>>>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin <ukby.1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> For example, builds 125561 through 125565 were all impacted by kill -9.
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ <skn...@berkeley.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> define "a lot" and provide some links to those builds, please.
>>>>>>> there are roughly 2000 builds per day, and i can't do more than keep a
>>>>>>> cursory eye on things.
>>>>>>>
>>>>>>> the infrastructure that the tests run on hasn't changed one bit on
>>>>>>> any of the workers, and 'kill -9' could be a timeout, flakiness caused
>>>>>>> by old build processes remaining on the workers after the master went
>>>>>>> down, or me trying to clean things up w/o a reboot.  or, perhaps,
>>>>>>> something wrong w/the infra.  :)
>>>>>>>
>>>>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin <ukby.1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Agreed, but I've seen a lot of builds killed by signal 9. I'm assuming
>>>>>>>> that's infrastructure?
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ <skn...@berkeley.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> yeah, i can't do much for flaky tests...  just flaky
>>>>>>>>> infrastructure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> A couple of flaky tests can happen; that's normal. It seems to have
>>>>>>>>>> gotten better now, at least. I will keep monitoring the builds.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 10, 2020 at 4:33 PM ukby1234 <ukby.1...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looks like Jenkins still isn't stable. My PR has failed twice in
>>>>>>>>>>> a row:
>>>>>>>>>>>
>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>>>>>
>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Shane Knapp
>>>>>>>>> Computer Guy / Voice of Reason
>>>>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>>>> https://rise.cs.berkeley.edu
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
