Ke Han created HBASE-28659:
------------------------------

             Summary: NPE in hmaster (setServerState function)
                 Key: HBASE-28659
                 URL: https://issues.apache.org/jira/browse/HBASE-28659
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 3.0.0-beta-2
            Reporter: Ke Han
         Attachments: hbase--master-d16bb50815b7.log

I met NPE in master node after migrating data from 2.5.8.
{code:java}
[ERROR LOG]
executionId = LrmpjV32
ConfigIdx = test9
Node02024-05-11T10:45:57,896 ERROR [PEWorker-15] procedure2.ProcedureExecutor: 
CODE-BUG: Uncaught runtime exception: pid=48, 
state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure 
hregion1,16020,1715424228375, splitWal=true, meta=true
java.lang.NullPointerException: null
        at 
org.apache.hadoop.hbase.master.assignment.RegionStates.setServerState(RegionStates.java:409)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.assignment.RegionStates.logSplitting(RegionStates.java:435)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:226)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
2024-05-11T10:45:57,918 ERROR [PEWorker-15] procedure2.ProcedureExecutor: Root 
Procedure pid=48, state=FAILED:SERVER_CRASH_SPLIT_LOGS, hasLock=true, 
exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime 
exception: pid=48, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; 
ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true, 
meta=true:java.lang.NullPointerException; ServerCrashProcedure 
hregion1,16020,1715424228375, splitWal=true, meta=true does not support 
rollback but the execution failed and try to rollback, code bug?
org.apache.hadoop.hbase.procedure2.RemoteProcedureException: 
java.lang.NullPointerException
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1826)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1484)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:80)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] {code}
h1. Reproduce

This bug cannot be reproduced deterministically.

I upgraded hbase cluster form 2.5.8 to 3.0.0 (commit: 516c89e8597fb) with 4 
nodes (1HM, 2RS, 1HDFS)
h1. Root Case

>From the stack trace, the bug is in setServerState state function.
{code:java}
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java

  private void setServerState(ServerName serverName, ServerState state) {
    ServerStateNode serverNode = getServerNode(serverName);
    synchronized (serverNode) { // NPE!
      serverNode.setState(state);
    }
  } {code}
The serverNode sometimes could be null and there's no null pointer check.
{code:java}
  /** Returns Pertinent ServerStateNode or NULL if none found (Do not make 
modifications). */
  public ServerStateNode getServerNode(final ServerName serverName) {
    return serverMap.get(serverName);
  } {code}
The potential fix could be adding a NULL serverNode. However, how it runs into 
this buggy state is unclear. 

I am running the workloads that could trigger the bug multiple times to see if 
I can find more information.

I have attached the error log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to