So, digging into the huge-logs problem, I discovered a subtle issue
with suite timeouts (thanks for keeping an eye open, Steve!) -- the
framework can in fact hang while trying to interrupt leaked threads.
Combined with the spinning Hadoop zombie threads that keep on logging,
this fills up all the disk space, and once the disk space is
exhausted, everything goes down.
The hanging suite timeout issue is interesting. The loop in the
randomized runner used the thread.join(timeoutMillis) method and an
iteration count to try to kill leaked threads. Well, it turns out
join(timeout) can hang indefinitely because this method is
synchronized on the thread being joined... So regardless of the
timeout value, it will never return if the thread's monitor is never
released... Nasty. (A small sketch of the hang, and of a monitor-free
alternative, is appended below the quoted thread.)

https://github.com/randomizedtesting/randomizedtesting/issues/275

I wonder if we should add the universal JVM kill switch option to
forked JVMs... This wouldn't solve things, but would at least kill
those hung forked processes before they go insane. The option that
kills the JVM after a certain amount of time in OpenJDK is
-XX:SelfDestructTimer=[mins]. Just a thought...

Dawid

On Fri, Dec 14, 2018 at 6:43 PM Dawid Weiss <[email protected]> wrote:
>
> Correction: I can reproduce the problem (on Linux). Looking into why
> the suite timeout doesn't work properly.
>
> D.
>
> On Fri, Dec 14, 2018 at 6:12 PM Dawid Weiss <[email protected]> wrote:
> >
> > > Hadoop is up to 2.9.2; we're on 2.7.4. Have you seen any hint that
> > > this behavior is better in a more recent version?
> >
> > I can't reproduce this problem, unfortunately. Even with the thread
> > locked in an endless loop, the runner should make progress and
> > eventually terminate. That it doesn't is very suspicious; it could
> > be a bug somewhere, but without a stack trace from the hung JVM I
> > can't really figure out why it's stalling.
> >
> > A better way to move forward would be to remove the annotations
> > that currently allow threads and resources to leak between tests,
> > but I realize that's difficult with external software we don't have
> > full control over.
> >
> > D.
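
Appendix: a minimal sketch of the join() hang described above. Class
and variable names here are mine, and the "holder" thread stands in
for whatever code synchronizes on the Thread object in the real case.
The point is only that Thread.join(millis) is a synchronized method,
so it must first acquire the Thread object's own monitor -- its
timeout never even starts ticking while someone else holds that lock.

    import java.util.concurrent.CountDownLatch;

    public class JoinHangDemo {
        public static void main(String[] args) throws Exception {
            // A "leaked" thread that spins forever, like the zombies.
            Thread leaked = new Thread(() -> { while (true) { } });
            leaked.setDaemon(true);
            leaked.start();

            // Another thread grabs the monitor of the Thread object
            // itself and never releases it.
            CountDownLatch monitorHeld = new CountDownLatch(1);
            Thread holder = new Thread(() -> {
                synchronized (leaked) {
                    monitorHeld.countDown();
                    while (true) { } // hold the monitor forever
                }
            });
            holder.setDaemon(true);
            holder.start();

            monitorHeld.await();
            // Blocks here trying to acquire the monitor on 'leaked'
            // and never returns, regardless of the timeout value.
            leaked.join(1000);
            System.out.println("never reached while the monitor is held");
        }
    }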
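
And for completeness, one monitor-free way to wait with a real
deadline -- again just a sketch of the general technique, not what the
runner actually does for the fix: poll isAlive() against a nanoTime
deadline instead of relying on join(millis).

    // Timed wait that never touches the Thread object's monitor:
    // polls isAlive() until the thread dies or the deadline passes.
    static boolean awaitTermination(Thread t, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (t.isAlive()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // still alive after the timeout
            }
            Thread.sleep(10);
        }
        return true; // thread terminated within the timeout
    }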
