Ke Han created HBASE-28659:
------------------------------
Summary: NPE in hmaster (setServerState function)
Key: HBASE-28659
URL: https://issues.apache.org/jira/browse/HBASE-28659
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 3.0.0-beta-2
Reporter: Ke Han
Attachments: hbase--master-d16bb50815b7.log
I met NPE in master node after migrating data from 2.5.8.
{code:java}
[ERROR LOG]
executionId = LrmpjV32
ConfigIdx = test9
Node02024-05-11T10:45:57,896 ERROR [PEWorker-15] procedure2.ProcedureExecutor:
CODE-BUG: Uncaught runtime exception: pid=48,
state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure
hregion1,16020,1715424228375, splitWal=true, meta=true
java.lang.NullPointerException: null
at
org.apache.hadoop.hbase.master.assignment.RegionStates.setServerState(RegionStates.java:409)
~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
at
org.apache.hadoop.hbase.master.assignment.RegionStates.logSplitting(RegionStates.java:435)
~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
at
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:226)
~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
2024-05-11T10:45:57,918 ERROR [PEWorker-15] procedure2.ProcedureExecutor: Root
Procedure pid=48, state=FAILED:SERVER_CRASH_SPLIT_LOGS, hasLock=true,
exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime
exception: pid=48, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true;
ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true,
meta=true:java.lang.NullPointerException; ServerCrashProcedure
hregion1,16020,1715424228375, splitWal=true, meta=true does not support
rollback but the execution failed and try to rollback, code bug?
org.apache.hadoop.hbase.procedure2.RemoteProcedureException:
java.lang.NullPointerException
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1826)
~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1484)
~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:80)
~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] {code}
h1. Reproduce
This bug cannot be reproduced deterministically.
I upgraded hbase cluster form 2.5.8 to 3.0.0 (commit: 516c89e8597fb) with 4
nodes (1HM, 2RS, 1HDFS)
h1. Root Case
>From the stack trace, the bug is in setServerState state function.
{code:java}
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
private void setServerState(ServerName serverName, ServerState state) {
ServerStateNode serverNode = getServerNode(serverName);
synchronized (serverNode) { // NPE!
serverNode.setState(state);
}
} {code}
The serverNode sometimes could be null and there's no null pointer check.
{code:java}
/** Returns Pertinent ServerStateNode or NULL if none found (Do not make
modifications). */
public ServerStateNode getServerNode(final ServerName serverName) {
return serverMap.get(serverName);
} {code}
The potential fix could be adding a NULL serverNode. However, how it runs into
this buggy state is unclear.
I am running the workloads that could trigger the bug multiple times to see if
I can find more information.
I have attached the error log.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)