[ 
https://issues.apache.org/jira/browse/HBASE-29251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-29251:
---------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed
          Status: Resolved  (was: Patch Available)

> Procedure gets stuck if the procedure state cannot be persisted
> ---------------------------------------------------------------
>
>                 Key: HBASE-29251
>                 URL: https://issues.apache.org/jira/browse/HBASE-29251
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.18, 3.0.0-beta-1, 2.5.11, 2.6.2
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> When a given regionserver stops or aborts, the corresponding 
> ServerCrashProcedure (SCP) is initiated by the active master. We recently came 
> across a case where the initial state of the SCP, SERVER_CRASH_START, could 
> not be persisted in the master local region store:
> {code:java}
> 2025-04-09 19:00:23,538 ERROR [RegionServerTracker-0] 
> region.RegionProcedureStore - Failed to update proc pid=60020, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server1,60020,1731526432248, splitWal=true, meta=false
> java.io.InterruptedIOException: No ack received after 55s and a timeout of 55s
>     at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938)
>     at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:692)
>     at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:580)
>     at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>     at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:85)
>     at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:666)
>  {code}
>  
> As a result, no further action was taken on the SCP; it stayed stuck until the 
> active master was restarted manually.
> After the restart, the new active master was able to proceed with the SCP:
> {code:java}
> 2025-04-09 20:43:07,693 DEBUG [master/hmaster-3:60000:becomeActiveMaster] 
> procedure2.ProcedureExecutor - Stored pid=60771, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server1,60020,1731526432248, splitWal=true, meta=false
> 2025-04-09 20:44:15,312 INFO  [PEWorker-18] procedure2.ProcedureExecutor - 
> Finished pid=60771, state=SUCCESS; ServerCrashProcedure 
> server1,60020,1731526432248, splitWal=true, meta=false in 1 mins, 7.667 sec 
> {code}
>  
> While it is well known that the file system backing the master local region 
> must be healthy for the active master to operate without functional issues, it 
> is worth noting that HDFS can suffer transient problems, and the master should 
> be able to recover procedures like the SCP unless the HDFS issues persist for 
> a longer duration.
> A couple of proposals (a rough sketch combining both ideas follows below):
>  * Provide retries for the proc store persist failures
>  * Abort the active master so that a new active master can continue the 
> recovery (deployment systems such as Kubernetes or Ambari usually ensure that 
> aborted servers are restarted automatically)
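>
> The sketch below is illustrative only, not the committed change: it retries 
> the procedure store update a bounded number of times and, if the store is 
> still not writable, aborts the active master so that a standby can take over 
> the recovery. The RetryingProcStoreUpdater class, the retry/backoff parameters 
> and the exception handling are assumptions for illustration; ProcedureStore, 
> Procedure and Abortable are existing HBase types.
> {code:java}
> // Illustrative sketch only: bounded retries around the proc store update,
> // followed by a master abort if the store remains unwritable.
> import org.apache.hadoop.hbase.Abortable;
> import org.apache.hadoop.hbase.procedure2.Procedure;
> import org.apache.hadoop.hbase.procedure2.store.ProcedureStore;
>
> public final class RetryingProcStoreUpdater {
>
>   private final ProcedureStore store;
>   private final Abortable master;      // typically the active HMaster
>   private final int maxAttempts;       // e.g. 5 (hypothetical value)
>   private final long backoffMillis;    // e.g. 1000 (hypothetical value)
>
>   public RetryingProcStoreUpdater(ProcedureStore store, Abortable master,
>       int maxAttempts, long backoffMillis) {
>     this.store = store;
>     this.master = master;
>     this.maxAttempts = maxAttempts;
>     this.backoffMillis = backoffMillis;
>   }
>
>   /** Persist the procedure state, retrying transient store failures. */
>   public void update(Procedure<?> proc) {
>     RuntimeException lastFailure = null;
>     for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>       try {
>         store.update(proc);
>         return;                        // persisted successfully
>       } catch (RuntimeException e) {
>         // Assumption: the store surfaces the underlying IOException (e.g. the
>         // InterruptedIOException from HDFS) wrapped in a RuntimeException.
>         lastFailure = e;
>         try {
>           Thread.sleep(backoffMillis * attempt);   // simple linear backoff
>         } catch (InterruptedException ie) {
>           Thread.currentThread().interrupt();
>           break;
>         }
>       }
>     }
>     // Retries exhausted: abort so that a new active master can resume the SCP.
>     master.abort("Cannot persist state of " + proc, lastFailure);
>   }
> }
> {code}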



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
