+1 to self destruct timer. Great analysis as usual Dawid!
On Fri, Dec 14, 2018 at 8:48 PM Dawid Weiss <[email protected]> wrote:

> So, digging into the huge logs problem, I discovered a subtle issue with
> suite timeouts (thanks for keeping an eye open, Steve!) -- the
> framework can in fact hang while trying to interrupt leaked threads;
> this, in combination with the spinning Hadoop zombie threads that keep
> on logging, results in filling up all the disk space. And once the disk
> space is exhausted, everything goes down.
>
> The hanging suite timeouts issue is interesting. The loop in the
> randomized runner used the thread.join(timeoutMillis) method and an iteration
> count to try to kill leaked threads. Well, turns out join(timeout) can
> hang indefinitely because this method is synchronized on the thread
> being joined... So regardless of the timeout value, it'll never return
> if the thread's monitor is never released... Nasty.
>
> https://github.com/randomizedtesting/randomizedtesting/issues/275
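
The join(timeout) behavior described above can be reproduced in isolation. What follows is only a minimal sketch, not the randomized runner's code: the class and method names (JoinTimeoutDemo, measureJoin) are made up for illustration, and the "leaked" thread holds its own monitor for a bounded time instead of forever so that the demo terminates.

```java
import java.util.concurrent.CountDownLatch;

class JoinTimeoutDemo {

    /**
     * Starts a thread that holds the monitor of its own Thread object for
     * holdMillis, then returns how long join(joinTimeoutMillis) actually
     * blocked the caller, in milliseconds. (Hypothetical helper for this demo.)
     */
    static long measureJoin(long holdMillis, long joinTimeoutMillis)
            throws InterruptedException {
        CountDownLatch monitorHeld = new CountDownLatch(1);

        Thread leaked = new Thread(() -> {
            // Lock the Thread object itself: the same monitor that
            // Thread.join(long) has to acquire before it can start waiting.
            synchronized (Thread.currentThread()) {
                monitorHeld.countDown();
                long deadline = System.nanoTime() + holdMillis * 1_000_000L;
                while (System.nanoTime() < deadline) {
                    // Spin and ignore interrupts, like a zombie thread.
                }
            }
        });
        leaked.setDaemon(true);
        leaked.start();
        monitorHeld.await(); // make sure the monitor is already taken

        long start = System.nanoTime();
        leaked.join(joinTimeoutMillis); // the timeout only applies once the monitor is acquired
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        long waited = measureJoin(1000, 50);
        System.out.println("join(50) blocked for about " + waited + " ms");
    }
}
```

With a thread that never releases its monitor, the same join call would not return at all, which is the hang described above.
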
>
> I wonder if we should add the universal JVM kill switch option to
> forked JVMs... This wouldn't solve things, but would at least kill
> those hung forked processes before they become insane. The option that
> kills the JVM after a certain amount of time in OpenJDK is
> -XX:SelfDestructTimer=[mins]. Just a thought...
>
> Dawid
> On Fri, Dec 14, 2018 at 6:43 PM Dawid Weiss <[email protected]> wrote:
> >
> > Correction: I can reproduce the problem (on Linux). Looking into why
> > suite timeout doesn't work properly.
> >
> > D.
> > On Fri, Dec 14, 2018 at 6:12 PM Dawid Weiss <[email protected]> wrote:
> > >
> > > > Hadoop is up to 2.9.2, we're on 2.7.4. Have you seen any hint that
> > > > this behavior is better in a more recent version?
> > >
> > > I can't reproduce this problem, unfortunately. Even with the thread
> > > locked in an endless loop the runner should make progress and
> > > eventually terminate. That it doesn't is very suspicious; could be a
> > > bug somewhere, but without a stack trace from the hung JVM I can't
> > > really figure out why it's stalling.
> > >
> > > A better way to move forward would be to remove those annotations that
> > > currently let threads and resources leak between tests, but I realize it's
> > > difficult with external software we don't have full control over.
> > >
> > > D.
>
> --
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com