[
https://issues.apache.org/jira/browse/HBASE-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439167#comment-16439167
]
Chia-Ping Tsai commented on HBASE-20169:
----------------------------------------
{quote}Can you please explain the fix? And what's the actually stack trace you
saw? The stack trace is always incomplete on jenkins.
{quote}
Pardon me, I only added some logging to trace the null object, so I have no
full stack trace to attach here.
The story is a race between two threads, one calling ProcedureExecutor#stop
and the other calling ProcedureExecutor#join.
*first thread*: When shutting down the mini cluster, master#shutdown is
executed.
{code:java}
activeMaster.master.shutdown();{code}
ServerManager#shutdownCluster is then called. Because the rs that failed to
instantiate its cp is dead, onlineServers.isEmpty() returns true, so
master.stop is invoked. That makes Master#run leave its loop (*second thread*).
{code:java}
public void shutdownCluster() {
  String statusStr = "Cluster shutdown requested of master=" +
      this.master.getServerName();
  LOG.info(statusStr);
  this.clusterShutdown.set(true);
  if (onlineServers.isEmpty()) {
    // we do not synchronize here so this may cause a double stop, but not a
    // big deal
    master.stop("OnlineServer=0 right after cluster shutdown set");
  }
}{code}
Since ProcedureExecutor#join sets timeoutExecutor to null, an NPE happens if
ProcedureExecutor#stop reaches timeoutExecutor.sendStopSignal() after
ProcedureExecutor#join has run.
{code:java}
public void join() {
  assert !isRunning() : "expected not running";
  // stop the timeout executor
  timeoutExecutor.awaitTermination();
  timeoutExecutor = null;{code}
{code:java}
public void stop() {
  if (!running.getAndSet(false)) {
    return;
  }
  LOG.info("Stopping");
  scheduler.stop();
  timeoutExecutor.sendStopSignal();
}{code}
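To make the ordering problem concrete, here is a minimal, self-contained sketch. It is not the HBase code and not necessarily what the attached patches do: FakeTimeoutExecutor, UnsafeExecutor, GuardedExecutor and StopJoinRaceDemo are hypothetical names, the running check in join() is left out for brevity, and the two calls are made sequentially on one thread to force the "stop after join" ordering deterministically. The guarded variant shows one possible defence, reading the field once into a local and null-checking it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical driver: forces the "join, then stop" ordering on one thread. */
class StopJoinRaceDemo {
  static boolean unsafeStopAfterJoinThrows() {
    UnsafeExecutor e = new UnsafeExecutor();
    e.join();
    try {
      e.stop();
      return false;
    } catch (NullPointerException npe) {
      return true;
    }
  }

  static boolean guardedStopAfterJoinThrows() {
    GuardedExecutor e = new GuardedExecutor();
    e.join();
    try {
      e.stop();
      return false;
    } catch (NullPointerException npe) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("unsafe stop() after join() throws NPE: "
        + unsafeStopAfterJoinThrows());
    System.out.println("guarded stop() after join() throws NPE: "
        + guardedStopAfterJoinThrows());
  }
}

/** Hypothetical stand-in for the timeout executor; both methods are no-ops. */
class FakeTimeoutExecutor {
  void sendStopSignal() {}
  void awaitTermination() {}
}

/** Same shape as the snippets above: join() nulls the field, stop() dereferences it. */
class UnsafeExecutor {
  private final AtomicBoolean running = new AtomicBoolean(true);
  private FakeTimeoutExecutor timeoutExecutor = new FakeTimeoutExecutor();

  void stop() {
    if (!running.getAndSet(false)) {
      return;
    }
    timeoutExecutor.sendStopSignal(); // throws NPE if join() already ran
  }

  void join() {
    timeoutExecutor.awaitTermination();
    timeoutExecutor = null;
  }
}

/** Assumed defensive variant: read the field once, then null-check the local. */
class GuardedExecutor {
  private final AtomicBoolean running = new AtomicBoolean(true);
  private volatile FakeTimeoutExecutor timeoutExecutor = new FakeTimeoutExecutor();

  void stop() {
    if (!running.getAndSet(false)) {
      return;
    }
    FakeTimeoutExecutor local = timeoutExecutor; // single read of the shared field
    if (local != null) {
      local.sendStopSignal();
    }
  }

  void join() {
    FakeTimeoutExecutor local = timeoutExecutor;
    if (local != null) {
      local.awaitTermination();
    }
    timeoutExecutor = null;
  }
}
```

Note that running.getAndSet(false) only protects against a double stop; it does nothing about join() clearing the field first, which is why the sketch guards the dereference itself.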
> NPE when calling HBTU.shutdownMiniCluster (TestAssignmentManagerMetrics is
> flakey)
> ----------------------------------------------------------------------------------
>
> Key: HBASE-20169
> URL: https://issues.apache.org/jira/browse/HBASE-20169
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: Duo Zhang
> Assignee: stack
> Priority: Major
> Attachments: HBASE-20169.branch-2.001.patch,
> HBASE-20169.branch-2.002.patch, HBASE-20169.branch-2.003.patch,
> HBASE-20169.branch-2.004.patch, HBASE-20169.branch-2.005.patch,
> HBASE-20169.v0.addendum.patch
>
>
> This usually happens when some master or rs is already down before we
> call shutdownMiniCluster.
> See
> https://builds.apache.org/job/HBASE-Flaky-Tests/27223/testReport/junit/org.apache.hadoop.hbase.master/TestAssignmentManagerMetrics/org_apache_hadoop_hbase_master_TestAssignmentManagerMetrics/
> and also
> http://104.198.223.121:8080/job/HBASE-Flaky-Tests/34873/testReport/junit/org.apache.hadoop.hbase.master/TestRestartCluster/testRetainAssignmentOnRestart/
> {noformat}
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.master.TestAssignmentManagerMetrics.after(TestAssignmentManagerMetrics.java:100)
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.master.TestRestartCluster.testRetainAssignmentOnRestart(TestRestartCluster.java:156)
> {noformat}