But is there any non-memory-leak reason why the tests should need more memory? In theory each test should be cleaning up its own SparkContext etc., right? My memory is that past OOM issues in the tests have been indicative of memory leaks somewhere.
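(Just to be concrete about the kind of cleanup I mean, here's a rough sketch with made-up names -- not our actual test helpers -- assuming a plain ScalaTest suite:)

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}

// Illustrative only: create a local SparkContext per test and stop it afterwards,
// so each test releases its driver-side memory before the next one starts.
class ExampleSuite extends FunSuite with BeforeAndAfterEach {
  private var sc: SparkContext = _

  override def beforeEach(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("example"))
  }

  override def afterEach(): Unit = {
    if (sc != null) {
      sc.stop()  // without this, contexts (and their memory) pile up across tests
      sc = null
    }
  }

  test("contexts get cleaned up between tests") {
    assert(sc.parallelize(1 to 10).count() === 10L)
  }
}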
I do agree that it doesn't seem likely to be an infrastructure issue; I can't explain why rebooting would improve things.

On Thu, Jan 5, 2017 at 4:38 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Seems like the OOM is coming from tests, which most probably means
> it's not an infrastructure issue. Maybe tests just need more memory
> these days and we need to update maven / sbt scripts.
>
> On Thu, Jan 5, 2017 at 1:19 PM, shane knapp <skn...@berkeley.edu> wrote:
> > as of first thing this morning, here's the list of recent GC overhead
> > build failures:
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console
> >
> > i haven't really found anything that jumps out at me except perhaps
> > auditing/upping the java memory limits across the build. this seems
> > to be a massive shot in the dark, and time consuming, so let's just
> > call this a "method of last resort".
> >
> > looking more closely at the systems themselves, it looked to me that
> > there was enough java "garbage" that had accumulated over the last 5
> > months (since the last reboot) that system reboots would be a good
> > first step.
> >
> > https://www.youtube.com/watch?v=nn2FB1P_Mn8
> >
> > over the course of this morning i've been sneaking in worker reboots
> > during quiet times... the ganglia memory graphs look a lot better
> > (free memory up, cached memory down!), and i'll keep an eye on things
> > over the course of the next few days to see if the build failure
> > frequency is affected.
> >
> > also, i might be scheduling quarterly system reboots if this indeed
> > fixes the problem.
> >
> > shane
> >
> > On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <skn...@berkeley.edu> wrote:
> >> preliminary findings: seems to be transient, and affecting 4% of
> >> builds from late december until now (which is as far back as we keep
> >> build records for the PRB builds).
> >>
> >> 408 builds
> >> 16 builds.gc <--- failures
> >>
> >> it's also happening across all workers at about the same rate.
> >>
> >> and best of all, there seems to be no pattern to which tests are
> >> failing (different each time). i'll look a little deeper and decide
> >> what to do next.
> >>
> >> On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <skn...@berkeley.edu> wrote:
> >>> nope, no changes to jenkins in the past few months. ganglia graphs
> >>> show higher, but not worrying, memory usage on the workers when the
> >>> jobs failed...
> >>>
> >>> i'll take a closer look later tonite/first thing tomorrow morning.
> >>>
> >>> shane
> >>>
> >>> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <k...@eecs.berkeley.edu> wrote:
> >>>> I've noticed a bunch of the recent builds failing because of GC limits, for
> >>>> seemingly unrelated changes (e.g. 70818, 70840, 70842). Shane, have there
> >>>> been any recent changes in the build configuration that might be causing
> >>>> this? Does anyone else have any ideas about what's going on here?
> >>>>
> >>>> -Kay
>
> --
> Marcelo
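Separately, if it does turn out that the suites genuinely need more headroom, the maven / sbt change Marcelo mentions would presumably look something like the following on the sbt side (just a sketch; the flags and heap size here are guesses, not what our build actually sets):

// build.sbt fragment (illustrative values): fork the test JVM and give it a bigger heap.
fork in Test := true
javaOptions in Test ++= Seq("-Xmx4g", "-XX:ReservedCodeCacheSize=512m")

The maven side would presumably be the equivalent bump to the test plugin's JVM arguments, but I'd rather confirm it's not a leak before we go there.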