alright, the system load graphs show that we've had a generally decreasing load since friday, and have burned through ~3k builds/day since the reboot last week! i don't see many timeouts, and the PRB builds have been generally green for a couple of days.
again, i will keep an eye on things but i feel we're out of the woods right now. :)

shane

On Fri, Jul 10, 2020 at 3:43 PM Frank Yin <ukby.1...@gmail.com> wrote:

> Great. Thanks.
>
> On Fri, Jul 10, 2020 at 3:39 PM shane knapp ☠ <skn...@berkeley.edu> wrote:
>
>> no, 8 hours is plenty. things will speed up soon once the backlog of
>> builds works through.... i limited the number of PRB builds to 4 per
>> worker, and things are looking better. let's see how we look next week.
>>
>> On Fri, Jul 10, 2020 at 3:31 PM Frank Yin <ukby.1...@gmail.com> wrote:
>>
>>> Can we also increase the build timeout?
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
>>> This one fails because it times out, not because of test failures.
>>>
>>> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin <ukby.1...@gmail.com> wrote:
>>>
>>>> Yeah, that's what I figured -- those workers are under load. Thanks.
>>>>
>>>> On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ <skn...@berkeley.edu> wrote:
>>>>
>>>>> only 125561, 125562 and 125564 were impacted by -9.
>>>>>
>>>>> 125565 exited w/signal 15 (exit code 143 - 128), which means the
>>>>> process was terminated for unknown reasons.
>>>>>
>>>>> 125563 looks like mima failed due to a bunch of errors.
>>>>>
>>>>> i just spot checked a bunch of recent failed PRB builds from today and
>>>>> they all seemed to be legit.
>>>>>
>>>>> another thing that might be happening is an overload of PRB builds on
>>>>> the workers due to the backlog... the workers are under a LOT of load
>>>>> right now, and i can put some rate limiting in to see if that helps out.
>>>>>
>>>>> shane
>>>>>
>>>>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin <ukby.1...@gmail.com> wrote:
>>>>>
>>>>>> Like from build number 125565 to 125561, all impacted by kill -9.
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ <skn...@berkeley.edu> wrote:
>>>>>>
>>>>>>> define "a lot" and provide some links to those builds, please.
>>>>>>> there are roughly 2000 builds per day, and i can't do more than keep a
>>>>>>> cursory eye on things.
>>>>>>>
>>>>>>> the infrastructure that the tests run on hasn't changed one bit on
>>>>>>> any of the workers, and 'kill -9' could be a timeout, flakiness caused
>>>>>>> by old build processes remaining on the workers after the master went
>>>>>>> down, or me trying to clean things up w/o a reboot. or, perhaps,
>>>>>>> something wrong w/the infra. :)
>>>>>>>
>>>>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin <ukby.1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Agree, but I’ve seen a lot of kills by signal 9. I'm assuming
>>>>>>>> that's infrastructure?
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ <skn...@berkeley.edu> wrote:
>>>>>>>>
>>>>>>>>> yeah, i can't do much for flaky tests... just flaky
>>>>>>>>> infrastructure.
>>>>>>>>>
>>>>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Couple of flaky tests can happen. It's usual. Seems it got better
>>>>>>>>>> now at least. I will keep monitoring the builds.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 10, 2020 at 4:33 PM ukby1234 <ukby.1...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looks like Jenkins isn't stable still. My PR fails two times in
>>>>>>>>>>> a row:
>>>>>>>>>>>
>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport

--
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
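
For anyone reading the console logs referenced in this thread: the "143 - 128 = 15" arithmetic and the "kill -9" cases both follow the common shell convention that a process killed by signal N is reported with exit status 128 + N. Below is a minimal sketch of that decoding, assuming a POSIX worker and that the status being inspected really is a shell-style exit status. The function name `describe_exit_status` is hypothetical and is not part of Jenkins or the Spark build scripts.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (hypothetical helper).

    Assumes the 128 + N convention used by POSIX shells for
    processes terminated by signal N.
    """
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit status {status} (no known signal {signum})"
        return f"killed by signal {signum} ({name})"
    return f"exited with error code {status}"


if __name__ == "__main__":
    # 143 is the status discussed for build 125565; 137 corresponds to kill -9.
    for status in (143, 137, 1, 0):
        print(status, "->", describe_exit_status(status))
```

Under this convention, 143 decodes to signal 15 (SIGTERM, a polite termination request) and 137 to signal 9 (SIGKILL). The status alone does not say who sent the signal, so a timeout, an operator cleanup, or the kernel's out-of-memory killer all remain possible causes.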