Nirdosh Kumar Yadav created HBASE-29110:
-------------------------------------------
Summary: HMaster SplitWalProcedure stays in waiting state if
Worker Regionserver doesn't reports back RPC call to HMaster
Key: HBASE-29110
URL: https://issues.apache.org/jira/browse/HBASE-29110
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 2.5.10, 2.6.1, 3.0.0-beta-1
Reporter: Nirdosh Kumar Yadav
Assignee: Nirdosh Kumar Yadav
In an HBase cluster, we encountered a scenario where the RegionServer crash
procedure (SCP) took for more than 3 hours to complete. The incident was
triggered by temporary network unavailability in the HBase cluster. Upon
debugging, we found that the SCP was stuck due to the child SplitWALProcedure,
which was waiting for the completion of the SplitWALRemote procedure by the
RegionServer worker.
While executing, the SplitWALRemote procedure encountered an unknown exception.
The logs show an error message: "hdfs.DataStreamer - No ack received,"
indicating an issue with the RegionServer's connection to the DataNode. After
this error, the thread either became stuck or terminated, as no related logs
were available.
During this period, inconsistent regions were reported. Worker RS was healthy
and didn't reported any signs of abort or unhealthy. All procedures were
restarted and completed successfully after the Active HMaster service was
bounced.
Here if we can setTimeout(<backoff>) for retrying the SplitWALRemote procedure,
Procedure will wake up and status can be checked paused procedure
Related Bug: HBASE-29041 on RegionServer component.
Related logs:
[HMASTER-4]
2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor -
Initialized subprocedures=[
{pid=6003288, ppid=6002575, state=RUNNABLE;
SplitWALRemoteProcedureregionserver-53.regionserver.hbase.hbasexxxxxx.
%2Cxxxxx%2C1730886070174.1733410178680,
worker=regionserver-15.regionserver.xxxxx,1730878238028}
]
[RS-15]
2024-12-05 14:55:11,461 DEBUG
[iority.RWQ.Fifo.read.handler=83,queue=2,port=xxxx] regionserver.RSRpcServices
- Executing remote procedure
classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
[RS-15]
2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0] hdfs.DataStreamer
- No ack received, took 25002ms (threshold=25000ms). File being written:
/hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbasexxa.xxxxx.is%2Cxxxxx%2C1730886070174.1733410178680.temp,
block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451,
Write pipeline datanodes:
[DatanodeInfoWithStorage[10.60.52.107:xxxxx,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD],
DatanodeInfoWithStorage[10.60.75.52:xxxxxx,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].
[HMASTER-1]
2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster]
store.ProcedureTree -Procedure Procedure(pid=6003288, ppid=6002575,
class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack
ids=[3592]
[HMASTER-1]
2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster]
procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575,
state=RUNNABLE; SplitWALRemoteProcedure
regionserver-53.regionserver.hbase.xxxxxx.is%2C60020%2C1730886070174.1733410178680,
worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxx.is,xxxxx,1730878238028
[RS-15]
2024-12-05 18:11:42,769 DEBUG
[iority.RWQ.Fifo.read.handler=79,queue=7,port=60020] regionserver.RSRpcServices
- Executing remote procedure
classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
[RS-15]
2024-12-05 18:11:48,247 DEBUG
[_REPLAY_OPS-regionserver/regionserver-15:xxxxx-192]
regionserver.RemoteProcedureResultReporter - Successfully complete execution of
pid=6003288
[HMASTER-1]
2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor -
Finished pid=6002575, ppid=6000775, state=SUCCESS;
SplitWALProcedureregionserver-53.regionserver.hbase.hbasexxa.xxxxxxx.is%2Cxxxxx%2C1730886070174.1733410178680,
worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxxx,xxxxxx,1730878238028
in 3 hrs, 16 mins, 52.806 sec
--
This message was sent by Atlassian Jira
(v8.20.10#820010)