Ke Han created HBASE-28659: ------------------------------ Summary: NPE in hmaster (setServerState function) Key: HBASE-28659 URL: https://issues.apache.org/jira/browse/HBASE-28659 Project: HBase Issue Type: Bug Components: master Affects Versions: 3.0.0-beta-2 Reporter: Ke Han Attachments: hbase--master-d16bb50815b7.log
I met NPE in master node after migrating data from 2.5.8. {code:java} [ERROR LOG] executionId = LrmpjV32 ConfigIdx = test9 Node02024-05-11T10:45:57,896 ERROR [PEWorker-15] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=48, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true, meta=true java.lang.NullPointerException: null at org.apache.hadoop.hbase.master.assignment.RegionStates.setServerState(RegionStates.java:409) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] at org.apache.hadoop.hbase.master.assignment.RegionStates.logSplitting(RegionStates.java:435) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:226) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] 2024-05-11T10:45:57,918 ERROR [PEWorker-15] procedure2.ProcedureExecutor: Root Procedure pid=48, state=FAILED:SERVER_CRASH_SPLIT_LOGS, hasLock=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=48, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true, meta=true:java.lang.NullPointerException; ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true, meta=true does not support rollback but the execution failed and try to rollback, code bug? org.apache.hadoop.hbase.procedure2.RemoteProcedureException: java.lang.NullPointerException at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1826) ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1484) ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:80) ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] {code} h1. Reproduce This bug cannot be reproduced deterministically. I upgraded hbase cluster form 2.5.8 to 3.0.0 (commit: 516c89e8597fb) with 4 nodes (1HM, 2RS, 1HDFS) h1. Root Case >From the stack trace, the bug is in setServerState state function. {code:java} hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java private void setServerState(ServerName serverName, ServerState state) { ServerStateNode serverNode = getServerNode(serverName); synchronized (serverNode) { // NPE! serverNode.setState(state); } } {code} The serverNode sometimes could be null and there's no null pointer check. {code:java} /** Returns Pertinent ServerStateNode or NULL if none found (Do not make modifications). */ public ServerStateNode getServerNode(final ServerName serverName) { return serverMap.get(serverName); } {code} The potential fix could be adding a NULL serverNode. However, how it runs into this buggy state is unclear. I am running the workloads that could trigger the bug multiple times to see if I can find more information. I have attached the error log. -- This message was sent by Atlassian Jira (v8.20.10#820010)