Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11317 )

Change subject: IMPALA-7483: abort stuck impalad/catalogd on JVM deadlock
......................................................................


Patch Set 2:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/11317/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11317/2//COMMIT_MSG@17
PS2, Line 17: cheap
> Just curious if you tried this out on a loaded JVM with a bunch of threads
I tried running a local stress test with up to 500 concurrent queries. I added 
a timer around findDeadlockedThreads() 10/15 times it took 100s of microseconds 
and 5/15 it took 15-50ms. I'm not sure how valid the measurement is because the 
time could include the thread being descheduled by the OS and the thread 
getting stuck in a GC pause.

So then I enable JVM debug output with

   LIBHDFS_OPTS="${LIBHDFS_OPTS} -XX:+PrintGCApplicationStoppedTime  
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"

The time spent in the operation was always < 1ms but sometimes it blocked for 
longer waiting to get into the safepoint. I suspect that's something to do with 
the sheer number of impalad threads runnning. It looks like there's a periodic 
safepoint anyway in hotspot every 1s by default so running 
findDeadlockedThreads() every 60s should only make any problem 1/60 worse.


http://gerrit.cloudera.org:8080/#/c/11317/2//COMMIT_MSG@22
PS2, Line 22: findDeadlockedThreads
> Maybe worth injecting a deadlock locally and see if findDeadlockedThreads()
Extended the testing utility code to induce a deadlock.


http://gerrit.cloudera.org:8080/#/c/11317/2//COMMIT_MSG@29
PS2, Line 29: I'm open to ideas about how to write an automated test for this 
and/or
> We could write a Java test that at least invokes the loop that checks for d
Yeah I once did an undergrad project where a teammate made heavy use of 
thread.stop() - it wasn't the most stable bit of code...

I think I'll skip automating the test if thats ok. I think we'd want an 
end-to-end test if anything - a unit test would probably mostly be checking 
whether findDeadlockThreads() works. We could do a python test that launches a 
program with a deadlock and checks that it exits, but it feels like overkill.


http://gerrit.cloudera.org:8080/#/c/11317/2/be/src/util/jni-util.cc
File be/src/util/jni-util.cc:

http://gerrit.cloudera.org:8080/#/c/11317/2/be/src/util/jni-util.cc@28
PS2, Line 28: 60
> IMO, configuring it something like 300s is reasonable too since VM seems to
My reason for keeping it lower was to reduce the potential duration of an 
outage resulting from this - if it's 60s then I'd expect the cluster to recover 
from this in < 2min (up to 60s to detect the hang, then maybe 30s for the 
statestore to detect the process exit and propagate the failure).

Qualitatively a 5 minute+ outage feels worse to me than a shorter outage. As 
you said though, this is rare so maybe that is ok so long as it resolves itself.


http://gerrit.cloudera.org:8080/#/c/11317/2/be/src/util/jni-util.cc@29
PS2, Line 29:     "Interval between JVM deadlock checks. If set to 0 or a 
negative value, deadlock "
> nit: mark it (Advanced)?
Done


http://gerrit.cloudera.org:8080/#/c/11317/2/fe/src/main/java/org/apache/impala/util/JvmPauseMonitor.java
File fe/src/main/java/org/apache/impala/util/JvmPauseMonitor.java:

http://gerrit.cloudera.org:8080/#/c/11317/2/fe/src/main/java/org/apache/impala/util/JvmPauseMonitor.java@176
PS2, Line 176:           return;
> not related to your change but I just noticed this, can we log this as an e
Done


http://gerrit.cloudera.org:8080/#/c/11317/2/fe/src/main/java/org/apache/impala/util/JvmPauseMonitor.java@216
PS2, Line 216:         for (ThreadInfo thread : deadlockedThreads)  
LOG.error(thread.toString());
> nit: in theory, given the threads are deadlocked, it should be guaranteed t
Done


http://gerrit.cloudera.org:8080/#/c/11317/2/fe/src/main/java/org/apache/impala/util/JvmPauseMonitor.java@217
PS2, Line 217:         LOG.fatal("Aborting because of deadlocked threads in 
JVM.");
> Should we dump the entire thread stacks (not just the deadlocked ones) and
Dumping all the other state feels out of scope for this change. It seems like 
useful infrastructure but I don't want to do a once-off version of it for this 
one case. If we want to add a utility to dump all JVM state then we could add a 
call to that here.


http://gerrit.cloudera.org:8080/#/c/11317/2/fe/src/main/java/org/apache/impala/util/JvmPauseMonitor.java@217
PS2, Line 217:         LOG.fatal("Aborting because of deadlocked threads in 
JVM.");
> +1 for other thread stacks.
Done


http://gerrit.cloudera.org:8080/#/c/11317/2/fe/src/main/java/org/apache/impala/util/JvmPauseMonitor.java@217
PS2, Line 217:         LOG.fatal("Aborting because of deadlocked threads in 
JVM.");
> oh, also, does log4j "fatal" actually guarantee the process exits? You dont
This made use of the fact that we hooked up log4j to glog. Glog fatal does 
abort the process. That does merit a comment so I added a comment and an 
exit(1) in case this is used in some context where its not hooked up to glog.



--
To view, visit http://gerrit.cloudera.org:8080/11317
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If46f879e7ab9eeead0c6e9a09844e1356012898d
Gerrit-Change-Number: 11317
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Sat, 25 Aug 2018 00:20:23 +0000
Gerrit-HasComments: Yes

Reply via email to