[ https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim updated STORM-2194: -------------------------------- Fix Version/s: 1.1.1 > ReportErrorAndDie doesn't always die > ------------------------------------ > > Key: STORM-2194 > URL: https://issues.apache.org/jira/browse/STORM-2194 > Project: Apache Storm > Issue Type: Bug > Components: storm-core > Affects Versions: 2.0.0, 1.0.2 > Reporter: Craig Hawco > Assignee: Paul Poulosky > Fix For: 2.0.0, 1.0.4, 1.1.1, 1.2.0 > > Attachments: scrubbed-thread-dump.txt > > Time Spent: 2h 40m > Remaining Estimate: 0h > > I've been trying to track down a cause of some of our issues with some > exceptions leaving Storm workers in a zombified state for some time. I > believe I've isolated the bug to the behaviour in > :report-error-and-die/reportErrorAndDie in the executor. Essentially: > {code} > :report-error-and-die (fn [error] > (try > ((:report-error <>) error) > (catch Exception e > (log-message "Error while reporting error to > cluster, proceeding with shutdown"))) > (if (or > (exception-cause? InterruptedException > error) > (exception-cause? > java.io.InterruptedIOException error)) > (log-message "Got interrupted excpetion > shutting thread down...") > ((:suicide-fn <>)))) > {code} > has the grouping for the if statement slightly wrong. It shouldn't log OR die > from InterruptedException/InterruptedIOException, but it should log under > that condition, and ALWAYS die. > Basically: > {code} > :report-error-and-die (fn [error] > (try > ((:report-error <>) error) > (catch Exception e > (log-message "Error while reporting error to > cluster, proceeding with shutdown"))) > (if (or > (exception-cause? InterruptedException > error) > (exception-cause? > java.io.InterruptedIOException error)) > (log-message "Got interrupted excpetion > shutting thread down...")) > ((:suicide-fn <>))) > {code} > After digging into the Java port of this code, it looks like a different bug > was introduced while porting: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > suicideFn.run(); > } > {code} > Was how this was initially ported, and STORM-2142 changed this to: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > } else { > suicideFn.run(); > } > {code} > However, I believe the correct port is as described above: > {code} > if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e) > || > Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) { > LOG.info("Got interrupted exception shutting thread down..."); > } > suicideFn.run(); > {code} > I'll look into providing patches for the 1.x and 2.x branches shortly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)