Thank you very much, Shane!

Xiao

On Mon, Jul 13, 2020 at 10:15 AM shane knapp ☠ <skn...@berkeley.edu> wrote:

> alright, the system load graphs show that we've had a generally decreasing
> load since friday, and have burned through ~3k builds/day since the reboot
> last week! i don't see many timeouts, and the PRB builds have been
> generally green for a couple of days.
>
> again, i will keep an eye on things but i feel we're out of the woods
> right now. :)
>
> shane
>
> On Fri, Jul 10, 2020 at 3:43 PM Frank Yin <ukby.1...@gmail.com> wrote:
>
>> Great. Thanks.
>>
>> On Fri, Jul 10, 2020 at 3:39 PM shane knapp ☠ <skn...@berkeley.edu>
>> wrote:
>>
>>> no, 8 hours is plenty. things will speed up soon once the backlog of
>>> builds works through.... i limited the number of PRB builds to 4 per
>>> worker, and things are looking better. let's see how we look next week.
>>>
>>> On Fri, Jul 10, 2020 at 3:31 PM Frank Yin <ukby.1...@gmail.com> wrote:
>>>
>>>> Can we also increase the build timeout?
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
>>>> This one fails because it times out, not because of test failures.
>>>>
>>>> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin <ukby.1...@gmail.com> wrote:
>>>>
>>>>> Yeah, that's what I figured -- those workers are under load. Thanks.
>>>>>
>>>>> On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ <skn...@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>>> only 125561, 125562 and 125564 were impacted by -9.
>>>>>>
>>>>>> 125565 exited w/a code of 143 (128 + 15, i.e. signal 15/SIGTERM),
>>>>>> which means the process was terminated for unknown reasons.
>>>>>>
>>>>>> 125563 looks like mima failed due to a bunch of errors.
>>>>>>
>>>>>> i just spot checked a bunch of recent failed PRB builds from today
>>>>>> and they all seemed to be legit.
>>>>>>
>>>>>> another thing that might be happening is an overload of PRB builds on
>>>>>> the workers due to the backlog... the workers are under a LOT of load
>>>>>> right now, and i can put some rate limiting in to see if that helps out.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin <ukby.1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> For example, builds 125561 through 125565 were all impacted by
>>>>>>> kill -9.
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>>>>>>
>>>>>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ <skn...@berkeley.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> define "a lot" and provide some links to those builds, please.
>>>>>>>> there are roughly 2000 builds per day, and i can't do more than keep a
>>>>>>>> cursory eye on things.
>>>>>>>>
>>>>>>>> the infrastructure that the tests run on hasn't changed one bit on
>>>>>>>> any of the workers, and 'kill -9' could be a timeout, flakiness caused
>>>>>>>> by old build processes remaining on the workers after the master went
>>>>>>>> down, or me trying to clean things up w/o a reboot. or, perhaps,
>>>>>>>> something wrong w/the infra. :)
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin <ukby.1...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Agreed, but I’ve seen a lot of builds killed by signal 9. I assume
>>>>>>>>> that's the infrastructure?
>>>>>>>>>
>>>>>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ <skn...@berkeley.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> yeah, i can't do much for flaky tests... just flaky
>>>>>>>>>> infrastructure.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon <
>>>>>>>>>> gurwls...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> A couple of flaky tests can happen; that's usual. It seems to
>>>>>>>>>>> have gotten better now, at least. I will keep monitoring the
>>>>>>>>>>> builds.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 10, 2020 at 4:33 PM ukby1234 <ukby.1...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looks like Jenkins still isn't stable. My PR failed two times
>>>>>>>>>>>> in a row:
>>>>>>>>>>>>
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>>>>>>
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Sent from:
>>>>>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Shane Knapp
>>>>>>>>>> Computer Guy / Voice of Reason
>>>>>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>>>>> https://rise.cs.berkeley.edu
>>>>>>>>>>

--
<https://databricks.com/sparkaisummit/north-america>
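
For reference, the exit-status arithmetic discussed in the quoted thread (143 - 128 = 15, and 'kill -9') can be sketched in Python. This is only a minimal sketch, assuming the common POSIX shell convention that a status above 128 means "killed by signal (status - 128)"; a program can also choose to exit with such a code on its own, so the result is a hint rather than proof. The helper name describe_exit_status is hypothetical and is not part of Jenkins or the Spark build scripts.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Assumes the shell convention that status > 128 encodes
    128 + signal number.
    """
    if status > 128:
        signum = status - 128
        try:
            # Signal names are platform-dependent; fall back if unknown.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited normally with code {status}"


if __name__ == "__main__":
    # 143 is the value implied by "(143 - 128)" in the thread;
    # 137 is what a shell typically reports after 'kill -9'.
    for status in (143, 137, 1):
        print(status, "->", describe_exit_status(status))
```

Under that convention, a build that was hit by 'kill -9' would normally surface as status 137, while 143 points at signal 15, which is a plain termination request rather than a forced kill.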