Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
no, 8 hours is plenty. things will speed up soon once the backlog of builds works through.

i limited the number of PRB builds to 4 per worker, and things are looking better. let's see how we look next week.

On Fri, Jul 10, 2020 at 3:31 PM Frank Yin wrote:

> Can we also increase the build timeout?
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
> This one fails because it times out, not because of test failures.
>
> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin wrote:
>
>> Yeah, that's what I figured -- those workers are under load. Thanks.

--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
Yeah, that's what I figured -- those workers are under load. Thanks.

On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ wrote:

> only 125561, 125562 and 125564 were impacted by -9.
>
> 125565 exited w/a code of 143, i.e. signal 15 (143 - 128), which means
> the process was terminated (SIGTERM) for unknown reasons.
>
> 125563 looks like mima failed due to a bunch of errors.
>
> i just spot checked a bunch of recent failed PRB builds from today and
> they all seemed to be legit.
>
> another thing that might be happening is an overload of PRB builds on the
> workers due to the backlog... the workers are under a LOT of load right
> now, and i can put some rate limiting in to see if that helps out.
>
> shane
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
only 125561, 125562 and 125564 were impacted by -9.

125565 exited w/a code of 143, i.e. signal 15 (143 - 128), which means the process was terminated (SIGTERM) for unknown reasons.

125563 looks like mima failed due to a bunch of errors.

i just spot checked a bunch of recent failed PRB builds from today and they all seemed to be legit.

another thing that might be happening is an overload of PRB builds on the workers due to the backlog... the workers are under a LOT of load right now, and i can put some rate limiting in to see if that helps out.

shane

On Fri, Jul 10, 2020 at 11:31 AM Frank Yin wrote:

> Like from build number 125565 to 125561, all impacted by kill -9.
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console

--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
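The arithmetic in the message above (143 - 128 = 15) follows the common shell convention that a process killed by signal N is reported with exit status 128 + N, so 137 corresponds to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM). A minimal sketch of that decoding in Python, assuming a POSIX worker; the helper name describe_exit_status is hypothetical and is not part of Jenkins or the Spark build scripts:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status, decoding 128 + N as signal N."""
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a signal on this platform.
            name = "unknown signal"
        return f"exit status {status}: killed by signal {signum} ({name})"
    return f"exit status {status}: exited normally with that code"


if __name__ == "__main__":
    # 137 and 143 are the two cases discussed in the thread; 1 is an
    # ordinary failure code included for contrast.
    for status in (137, 143, 1):
        print(describe_exit_status(status))
```

The status only says which signal arrived. It does not say who sent it, so a 137 could still come from a timeout, an out-of-memory kill, or manual cleanup.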
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
define "a lot" and provide some links to those builds, please. there are roughly 2000 builds per day, and i can't do more than keep a cursory eye on things.

the infrastructure that the tests run on hasn't changed one bit on any of the workers, and 'kill -9' could be a timeout, flakiness caused by old build processes remaining on the workers after the master went down, or me trying to clean things up w/o a reboot. or, perhaps, something wrong w/the infra. :)

On Fri, Jul 10, 2020 at 9:28 AM Frank Yin wrote:

> Agree, but I’ve seen a lot of kills by signal 9; I'm assuming that's
> infrastructure?

--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
yeah, i can't do much for flaky tests... just flaky infrastructure.

On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon wrote:

> A couple of flaky tests can happen; that's usual. It seems to have gotten
> better now, at least. I will keep monitoring the builds.

--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
A couple of flaky tests can happen; that's usual. It seems to have gotten better now, at least. I will keep monitoring the builds.

On Fri, Jul 10, 2020 at 4:33 PM, ukby1234 wrote:

> Looks like Jenkins still isn't stable. My PR failed two times in a row:
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
Looks like Jenkins still isn't stable. My PR failed two times in a row:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org