+1 to the self-destruct timer. Great analysis as usual, Dawid!

On Fri, Dec 14, 2018 at 8:48 PM Dawid Weiss wrote:
> So, digging in the huge logs problem I discovered a subtle issue with
> suite timeouts (thanks for keeping an eye open, Steve!) -- the
> framework can in fact hang while trying to interrupt leaked threads;
> this in combination with the spinning hadoop zombie threads that keep
> on logging results in filling up all the disk space. And once the disk
> space is exhausted, everything goes down.
>
> The hanging suite timeouts issue is interesting. The loop in the
> randomized runner used the thread.join(timeoutMillis) method and an
> iteration count to try to kill leaked threads. Well, turns out
> join(timeout) can hang indefinitely because this method is synchronized
> on the thread being joined... So regardless of the timeout value, it'll
> never return if the thread's monitor is never released... Nasty.
>
> https://github.com/randomizedtesting/randomizedtesting/issues/275
>
> I wonder if we should add the universal JVM kill switch option to
> forked JVMs... This wouldn't solve things, but would at least kill
> those hung forked processes before they become insane. The option that
> kills the JVM after a certain amount of time in OpenJDK is
> -XX:SelfDestructTimer=[mins]. Just a thought...
>
> Dawid
>
> On Fri, Dec 14, 2018 at 6:43 PM Dawid Weiss wrote:
> >
> > Correction: I can reproduce the problem (on Linux). Looking into why
> > suite timeout doesn't work properly.
> >
> > D.
> > On Fri, Dec 14, 2018 at 6:12 PM Dawid Weiss wrote:
> > >
> > > > Hadoop is up to 2.9.2, we're on 2.7.4. Have you seen any hint that
> > > > this behavior is better in a more recent version?
> > >
> > > I can't reproduce this problem, unfortunately. Even with the thread
> > > locked in an endless loop the runner should make progress and
> > > eventually terminate. That it doesn't is very suspicious; could be a
> > > bug somewhere, but without a stack trace from the hung JVM I can't
> > > really figure out why it's stalling.
> > >
> > > A better way to move forward would be to remove those annotations that
> > > currently leak threads and resources between tests, but I realize it's
> > > difficult with external software we don't have full control over.
> > >
> > > D.

--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
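
For anyone who wants to see the join(timeout) behaviour from the quoted analysis in isolation, here is a minimal Java sketch. It is not the randomized runner's code: the class name JoinTimeoutDemo, the helper joinOutlivesTimeout, and the 50 ms / 1000 ms values are made up for illustration. It assumes only what the analysis states, namely that Thread.join(long) has to acquire the monitor of the thread being joined, so it cannot even begin its timed wait while something else holds that monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class JoinTimeoutDemo {

    /**
     * Hypothetical helper: returns true if leaked.join(timeoutMillis) had still
     * not returned after observeMillis, i.e. the timeout was not honoured.
     */
    static boolean joinOutlivesTimeout(long timeoutMillis, long observeMillis)
            throws InterruptedException {
        CountDownLatch monitorHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Stand-in for a leaked thread: it locks its own Thread object and
        // keeps the monitor until we tell it to let go.
        Thread leaked = new Thread(() -> {
            synchronized (Thread.currentThread()) {
                monitorHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // fall through and release the monitor
                }
            }
        }, "leaked");
        leaked.setDaemon(true);
        leaked.start();
        monitorHeld.await();

        // Call join(timeout) from a helper thread so the demo itself cannot hang.
        CountDownLatch joinReturned = new CountDownLatch(1);
        Thread joiner = new Thread(() -> {
            try {
                leaked.join(timeoutMillis); // blocks on monitor entry, before any timed wait
            } catch (InterruptedException ignored) {
                // not expected in this demo
            }
            joinReturned.countDown();
        }, "joiner");
        joiner.setDaemon(true);
        joiner.start();

        boolean returned = joinReturned.await(observeMillis, TimeUnit.MILLISECONDS);

        // Clean up: release the monitor so both threads can finish.
        release.countDown();
        joiner.join();
        return !returned;
    }

    public static void main(String[] args) throws Exception {
        boolean stuck = joinOutlivesTimeout(50, 1000);
        System.out.println("join(50) still blocked after 1000 ms: " + stuck);
    }
}
```

In the sketch the monitor is eventually released so the program can exit; a zombie thread that never releases it would keep the joiner blocked for good, which is the situation where an external kill switch is the only way out.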
