[ https://issues.apache.org/jira/browse/HBASE-29251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani updated HBASE-29251:
---------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed
          Status: Resolved  (was: Patch Available)

> Procedure gets stuck if the procedure state cannot be persisted
> ---------------------------------------------------------------
>
>                 Key: HBASE-29251
>                 URL: https://issues.apache.org/jira/browse/HBASE-29251
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.18, 3.0.0-beta-1, 2.5.11, 2.6.2
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> When a given regionserver stops or aborts, the corresponding ServerCrashProcedure (SCP) is initiated by the active master. We recently came across a case where the initial state of the SCP, SERVER_CRASH_START, could not be persisted in the master local region store:
> {code:java}
> 2025-04-09 19:00:23,538 ERROR [RegionServerTracker-0] region.RegionProcedureStore - Failed to update proc pid=60020, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false
> java.io.InterruptedIOException: No ack received after 55s and a timeout of 55s
>   at org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938)
>   at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:692)
>   at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:580)
>   at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>   at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:85)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:666)
> {code}
> As a result, no further action was taken on the SCP; it stayed stuck until the active master was restarted manually.
> After the manual restart, the new active master was able to proceed further with the SCP:
> {code:java}
> 2025-04-09 20:43:07,693 DEBUG [master/hmaster-3:60000:becomeActiveMaster] procedure2.ProcedureExecutor - Stored pid=60771, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false
> 2025-04-09 20:44:15,312 INFO [PEWorker-18] procedure2.ProcedureExecutor - Finished pid=60771, state=SUCCESS; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false in 1 mins, 7.667 sec
> {code}
> While it is well known that the file system backing the master local region must be healthy for the active master to operate without functional issues, it is worth noting that HDFS can have transient issues, and the master should be able to recover procedures such as the SCP unless the HDFS problems persist for a longer duration.
> A couple of proposals:
> * Provide retries for proc store persist failures
> * Abort the active master so that a new master can continue the recovery (deployment systems such as k8s or Ambari usually ensure that aborted servers are restarted automatically); a sketch of this option follows below
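> For illustration only, here is a minimal, self-contained Java sketch of the combined idea: bounded retries on a persist failure, then an abort once retries are exhausted so that a standby master can take over the recovery. It is not based on the actual RegionProcedureStore or ProcedureExecutor code; persistWithRetries, the persistState callable, and the AbortHook interface are hypothetical stand-ins used only to show the control flow.
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.Callable;
>
> public final class ProcStorePersistSketch {
>
>   /** Hypothetical abort hook; stands in for aborting the active master. */
>   interface AbortHook {
>     void abort(String reason, Throwable cause);
>   }
>
>   /**
>    * Try to persist procedure state up to maxAttempts times with a fixed backoff;
>    * if every attempt fails, invoke the abort hook so a standby can take over recovery.
>    */
>   static void persistWithRetries(Callable<Void> persistState, int maxAttempts,
>       long backoffMillis, AbortHook abortHook) {
>     Exception last = null;
>     for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>       try {
>         persistState.call();   // stand-in for the WAL sync that failed with InterruptedIOException
>         return;                // state persisted, nothing more to do
>       } catch (Exception e) {
>         last = e;
>         System.err.println("Persist attempt " + attempt + "/" + maxAttempts + " failed: " + e);
>         try {
>           Thread.sleep(backoffMillis);
>         } catch (InterruptedException ie) {
>           Thread.currentThread().interrupt();
>           break;               // stop retrying if the retry loop itself is interrupted
>         }
>       }
>     }
>     // Retries exhausted: abort so that a freshly started master can redo the SCP,
>     // mirroring what the manual restart achieved in the logs above.
>     abortHook.abort("Cannot persist procedure state", last);
>   }
>
>   public static void main(String[] args) {
>     persistWithRetries(
>         () -> { throw new IOException("simulated hflush timeout"); },
>         3, 100L,
>         (reason, cause) -> System.err.println("Aborting master: " + reason + " (" + cause + ")"));
>   }
> }
> {code}
> The trade-off is between masking short HDFS hiccups (retry) and failing over quickly (abort); the attempt count and fixed backoff above are placeholders, not recommended values.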