Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp ☠
no, 8 hours is plenty.  things will speed up soon once the backlog of
builds works through.  i limited the number of PRB builds to 4 per
worker, and things are looking better.  let's see how we look next week.
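
The cap of 4 PRB builds per worker is a concurrency limit. As a rough
illustration only (this is not how Jenkins itself is configured, and the
names MAX_BUILDS_PER_WORKER and run_build are hypothetical), the same idea
can be sketched with a semaphore in Python:

```python
import threading
import time

# Assumption for illustration: one "worker" that may run at most 4 builds at once.
MAX_BUILDS_PER_WORKER = 4
slots = threading.BoundedSemaphore(MAX_BUILDS_PER_WORKER)

lock = threading.Lock()
running = 0  # builds currently holding a slot
peak = 0     # highest number of simultaneous builds observed


def run_build(build_id: int) -> None:
    """Hypothetical build: waits for a free slot, then pretends to work."""
    global running, peak
    with slots:  # blocks while MAX_BUILDS_PER_WORKER builds are already running
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)  # stand-in for the real build work
        with lock:
            running -= 1


# Queue up a backlog of 12 builds; only 4 can hold a slot at any moment.
threads = [threading.Thread(target=run_build, args=(i,)) for i in range(12)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"peak concurrent builds: {peak}")
```

The backlog still drains completely; the cap only bounds how many builds
compete for the worker at the same time.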

On Fri, Jul 10, 2020 at 3:31 PM Frank Yin wrote:

> Can we also increase the build timeout?
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
> This one fails because it times out, not because of test failures.

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread Frank Yin
Yeah, that's what I figured -- those workers are under load. Thanks.

On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ wrote:

> only 125561, 125562 and 125564 were impacted by -9.
>
> 125565 exited w/a code of 143 (128 + 15, i.e. signal 15, SIGTERM), which
> means the process was terminated for unknown reasons.
>
> 125563 looks like mima failed due to a bunch of errors.
>
> i just spot checked a bunch of recent failed PRB builds from today and
> they all seemed to be legit.
>
> another thing that might be happening is an overload of PRB builds on the
> workers due to the backlog...  the workers are under a LOT of load right
> now, and i can put some rate limiting in to see if that helps out.
>
> shane


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp ☠
only 125561, 125562 and 125564 were impacted by -9.

125565 exited w/a code of 143 (128 + 15, i.e. signal 15, SIGTERM), which
means the process was terminated for unknown reasons.
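
To make the arithmetic above concrete: a shell reports 128 + N when a
process dies from signal N, so 143 corresponds to signal 15 (SIGTERM) and
137 to signal 9 (SIGKILL, i.e. 'kill -9'). A minimal sketch in Python; the
helper name describe_exit_status is made up for illustration and is not
part of Jenkins or the Spark build scripts:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper)."""
    if status > 128:
        signum = status - 128  # shells encode "killed by signal N" as 128 + N
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with code {status}"


print(describe_exit_status(143))  # the status seen for build 125565
print(describe_exit_status(137))  # the status a 'kill -9' produces
```

The signal number says how the process was stopped, not who sent the
signal or why, so the cause still has to come from the worker's logs.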

125563 looks like mima failed due to a bunch of errors.

i just spot checked a bunch of recent failed PRB builds from today and they
all seemed to be legit.

another thing that might be happening is an overload of PRB builds on the
workers due to the backlog...  the workers are under a LOT of load right
now, and i can put some rate limiting in to see if that helps out.

shane

On Fri, Jul 10, 2020 at 11:31 AM Frank Yin wrote:

> For example, build numbers 125561 through 125565 were all impacted by kill -9.
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console



Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp ☠
define "a lot" and provide some links to those builds, please.  there are
roughly 2000 builds per day, and i can't do more than keep a cursory eye on
things.

the infrastructure that the tests run on hasn't changed one bit on any of
the workers, and 'kill -9' could be a timeout, flakiness caused by old
build processes remaining on the workers after the master went down, or me
trying to clean things up w/o a reboot.  or, perhaps, something wrong w/the
infra.  :)
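
One of the possibilities listed above, a timeout that ends in 'kill -9',
can be reproduced locally. A minimal sketch in Python, assuming a POSIX
machine; the child command and the half-second limit are placeholders and
this is not the actual Jenkins timeout mechanism:

```python
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; if it exceeds timeout_s seconds, kill it and return its code."""
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # on POSIX this sends SIGKILL, the same as 'kill -9'
        proc.wait()  # reap the child so returncode is populated
    return proc.returncode


# Placeholder "build" that sleeps far longer than the allowed time.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
rc = run_with_timeout(slow_cmd, 0.5)

# On POSIX, Python reports death-by-signal as a negative return code.
if rc < 0:
    print(f"child was killed by signal {-rc} ({signal.Signals(-rc).name})")
else:
    print(f"child exited with code {rc}")
```

A build killed this way looks the same in the console whether the signal
came from a timeout, a manual cleanup, or a leftover process being reaped.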

On Fri, Jul 10, 2020 at 9:28 AM Frank Yin wrote:

> Agreed, but I’ve seen a lot of builds killed by signal 9. I'm assuming
> that's infrastructure?



Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp ☠
yeah, i can't do much for flaky tests...  just flaky infrastructure.


On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon wrote:

> A couple of flaky tests can happen; that's usual. It seems to have gotten
> better now, at least. I will keep monitoring the builds.



Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread Hyukjin Kwon
A couple of flaky tests can happen; that's usual. It seems to have gotten
better now, at least. I will keep monitoring the builds.

On Fri, Jul 10, 2020 at 4:33 PM, ukby1234 wrote:

> Looks like Jenkins still isn't stable. My PR failed two times in a row:
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread ukby1234
Looks like Jenkins still isn't stable. My PR failed two times in a row:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org