[
https://issues.apache.org/jira/browse/HBASE-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Umesh Kumar Kumawat resolved HBASE-29110.
-----------------------------------------
Resolution: Cannot Reproduce
I tried to reproduce it using different mock exception but unable to do it. And
in production also we only faced it once. After that we never faced this
perticuler issue when RS got the request but didn't replied to HMaster.
> HMaster SplitWalProcedure stays RUNNABLE state infinitely if Worker
> Regionserver doesn't reports back SplitWALRemoteProcedure RPC response to
> HMaster
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29110
> URL: https://issues.apache.org/jira/browse/HBASE-29110
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 3.0.0-beta-1, 2.6.1, 2.5.10
> Reporter: Nirdosh Kumar Yadav
> Assignee: Umesh Kumar Kumawat
> Priority: Minor
>
> In an HBase cluster, we encountered a scenario where the RegionServer crash
> procedure (SCP) took for more than 3 hours to complete. The incident was
> triggered by temporary network unavailability in the HBase cluster. Upon
> debugging, we found that the SCP was stuck due to the child
> SplitWALProcedure, which was waiting for the completion of the SplitWALRemote
> procedure by the RegionServer worker.
> While executing, the SplitWALRemote procedure encountered an unknown
> exception. The logs show an error message: "hdfs.DataStreamer - No ack
> received," indicating an issue with the RegionServer's connection to the
> DataNode. After this error, the thread either became stuck or terminated, as
> no related logs were available.
> During this period, inconsistent regions were reported. Worker RS was healthy
> and didn't reported any signs of abort or unhealthy. All procedures were
> restarted and completed successfully after the Active HMaster service was
> bounced.
> Here if we can setTimeout(<backoff>) for retrying the SplitWAL procedure,
> Procedure will wake up and status can be checked paused procedure
> Related Bug: HBASE-29041 on RegionServer component.
> Related logs:
> [HMASTER-4]
> 2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor -
> Initialized subprocedures=[
> {pid=6003288, ppid=6002575, state=RUNNABLE;
> SplitWALRemoteProcedureregionserver-53.regionserver.hbase.hbasexxxxxx.
> %2Cxxxxx%2C1730886070174.1733410178680,
> worker=regionserver-15.regionserver.xxxxx,1730878238028}
> ]
> [RS-15]
> 2024-12-05 14:55:11,461 DEBUG
> [iority.RWQ.Fifo.read.handler=83,queue=2,port=xxxx]
> regionserver.RSRpcServices - Executing remote procedure
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0]
> hdfs.DataStreamer - No ack received, took 25002ms (threshold=25000ms). File
> being written:
> /hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbasexxa.xxxxx.is%2Cxxxxx%2C1730886070174.1733410178680.temp,
> block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451,
> Write pipeline datanodes:
> [DatanodeInfoWithStorage[10.60.52.107:xxxxx,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD],
>
> DatanodeInfoWithStorage[10.60.75.52:xxxxxx,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].
> [HMASTER-1]
> 2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster]
> store.ProcedureTree -Procedure Procedure(pid=6003288, ppid=6002575,
> class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack
> ids=[3592]
> [HMASTER-1]
> 2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster]
> procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575,
> state=RUNNABLE; SplitWALRemoteProcedure
> regionserver-53.regionserver.hbase.xxxxxx.is%2C60020%2C1730886070174.1733410178680,
>
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxx.is,xxxxx,1730878238028
> [RS-15]
> 2024-12-05 18:11:42,769 DEBUG
> [iority.RWQ.Fifo.read.handler=79,queue=7,port=60020]
> regionserver.RSRpcServices - Executing remote procedure
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 18:11:48,247 DEBUG
> [_REPLAY_OPS-regionserver/regionserver-15:xxxxx-192]
> regionserver.RemoteProcedureResultReporter - Successfully complete execution
> of pid=6003288
> [HMASTER-1]
> 2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor -
> Finished pid=6002575, ppid=6000775, state=SUCCESS;
> SplitWALProcedureregionserver-53.regionserver.hbase.hbasexxa.xxxxxxx.is%2Cxxxxx%2C1730886070174.1733410178680,
>
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxxx,xxxxxx,1730878238028
> in 3 hrs, 16 mins, 52.806 sec
--
This message was sent by Atlassian Jira
(v8.20.10#820010)