So, digging into the huge logs problem, I discovered a subtle issue
with suite timeouts (thanks for keeping an eye open, Steve!): the
framework can in fact hang while trying to interrupt leaked threads.
This, combined with the spinning Hadoop zombie threads that keep on
logging, ends up filling all the disk space. And once the disk space
is exhausted, everything goes down.

The hanging suite timeouts issue is interesting. The loop in the
randomized runner used the thread.join(timeoutMillis) method and an
iteration count to try to kill leaked threads. Well, it turns out
join(timeout) can hang indefinitely because this method is
synchronized on the thread being joined: the caller has to acquire
that thread's monitor before the timed wait even starts. So regardless
of the timeout value, it'll never return if the thread's monitor is
never released... Nasty.

https://github.com/randomizedtesting/randomizedtesting/issues/275
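
To make the join(timeout) behaviour concrete, here is a minimal sketch
(not the runner's code, and not necessarily the fix applied in the
issue above). The class and method names are made up for illustration.
A thread holds its own monitor for a bounded time so the demo
terminates; join(50) then blocks for roughly the whole hold time rather
than 50 ms. The second method polls isAlive() instead, which, as far as
I know, does not need the thread's monitor in OpenJDK.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; names are illustrative only.
class JoinMonitorDemo {

    /**
     * Starts a daemon thread that locks its own Thread object and keeps
     * the lock for holdMillis. Returns once the lock is actually held.
     */
    static Thread startMonitorHolder(long holdMillis) throws InterruptedException {
        CountDownLatch monitorHeld = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (Thread.currentThread()) {
                monitorHeld.countDown();
                try {
                    // sleep() does not release the monitor.
                    Thread.sleep(holdMillis);
                } catch (InterruptedException e) {
                    // Exit early if interrupted; the monitor is released on exit.
                }
            }
        }, "monitor-holder");
        holder.setDaemon(true);
        holder.start();
        monitorHeld.await();
        return holder;
    }

    /**
     * Bounded wait that never touches the target thread's monitor:
     * poll isAlive() and sleep in between. Returns true if the thread
     * terminated before the timeout elapsed.
     */
    static boolean awaitTermination(Thread t, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (t.isAlive()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread holder = startMonitorHolder(1000);
        long start = System.nanoTime();
        holder.join(50); // blocks until the holder releases its monitor
        long joinMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("join(50) took " + joinMillis + " ms");

        Thread second = startMonitorHolder(1000);
        start = System.nanoTime();
        boolean finished = awaitTermination(second, 50);
        long pollMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("polling took " + pollMillis + " ms, finished=" + finished);
    }
}
```

If the holder never left its synchronized block, the join(50) call
above would never return, which matches the hang described here.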

I wonder if we should add the universal JVM kill switch option to
forked JVMs... This wouldn't solve things, but would at least kill
those hung forked processes before they become insane. The option that
kills the JVM after a certain amount of time in OpenJDK is
-XX:SelfDestructTimer=[mins]. Just a thought...

Dawid

On Fri, Dec 14, 2018 at 6:43 PM Dawid Weiss <[email protected]> wrote:
>
> Correction: I can reproduce the problem (on Linux). Looking into why
> suite timeout doesn't work properly.
>
> D.
> On Fri, Dec 14, 2018 at 6:12 PM Dawid Weiss <[email protected]> wrote:
> >
> > > Hadoop is up to 2.9.2, we're on 2.7.4. Have you seen any hint that
> > > this behavior is better in a more recent version?
> >
> > I can't reproduce this problem, unfortunately. Even with the thread
> > locked in an endless loop the runner should make progress and
> > eventually terminate. That it doesn't is very suspicious; could be a
> > bug somewhere, but without a stack trace from the hung JVM I can't
> > really figure out why it's stalling.
> >
> > A better way to move forward would be to remove those annotations that
> > currently leak threads and resources between tests, but I realize it's
> > difficult with external software we don't have full control over.
> >
> > D.
