Nirdosh Kumar Yadav created HBASE-29041:
-------------------------------------------
Summary: Set UncaughtException Handler for RegionServer ExecutorService
Key: HBASE-29041
URL: https://issues.apache.org/jira/browse/HBASE-29041
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 2.5.10, 2.6.1, 3.0.0
Reporter: Nirdosh Kumar Yadav
In an HBase cluster we encountered a scenario where a RegionServer server crash
procedure (SCP) waited for more than 3 hours. The incident was triggered by
temporary network unavailability in the HBase cluster. On debugging, we found
that the SCP was stuck on its child {{SplitWALProcedure}}, which was waiting for
a regionserver worker to complete the SplitWALRemote procedure. While running,
the SplitWALRemote procedure encountered an unknown exception. In the logs we
can see the error "hdfs.DataStreamer - No ack received" while the regionserver
was connecting to a DataNode. After this error the thread was either stuck or
had died, as no further related logs exist. Inconsistent regions were reported
during this period. All procedures were restarted and completed after the
active HMaster service was bounced.
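
To illustrate the kind of change the summary asks for, here is a minimal sketch using plain {{java.util.concurrent}}. It is not the HBase patch and does not use HBase's own executor classes. The names {{HandlerSketch}}, {{newPool}} and {{runFailingTask}} are hypothetical, and the handler here only records the error message, where a real one would presumably log it and report the failure. The sketch assumes tasks are handed to the pool with {{execute()}}. With {{submit()}} the exception is stored in the returned {{Future}} and the handler is never called.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration class; not part of HBase.
final class HandlerSketch {

    /** Builds a fixed pool whose worker threads pass uncaught Throwables to the given handler. */
    static ExecutorService newPool(int threads, String prefix,
                                   Thread.UncaughtExceptionHandler handler) {
        AtomicInteger seq = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + seq.incrementAndGet());
            t.setDaemon(true);
            // Without this, a Throwable escaping a task only reaches the default handler.
            t.setUncaughtExceptionHandler(handler);
            return t;
        };
        return Executors.newFixedThreadPool(threads, factory);
    }

    /** Runs one failing task and returns the message the handler observed (null on timeout). */
    static String runFailingTask() {
        AtomicReference<String> seen = new AtomicReference<>();
        CountDownLatch handled = new CountDownLatch(1);
        ExecutorService pool = newPool(1, "sketch-worker", (thread, error) -> {
            seen.set(error.getMessage());
            handled.countDown();
        });
        try {
            // execute(), not submit(): submit() would capture the failure in a Future.
            pool.execute(() -> {
                throw new IllegalStateException("simulated failure");
            });
            handled.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("handler saw: " + runFailingTask());
    }
}
```

With a handler in place, a worker thread that dies from an unexpected exception leaves at least one log line behind instead of disappearing silently, which is the gap described above.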
Related logs: