Nirdosh Kumar Yadav created HBASE-29041:
-------------------------------------------

             Summary: Set UncaughtException Handler for RegionServer ExecutorService
                 Key: HBASE-29041
                 URL: https://issues.apache.org/jira/browse/HBASE-29041
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 2.5.10, 2.6.1, 3.0.0
            Reporter: Nirdosh Kumar Yadav


In an HBase cluster we encountered a scenario where a regionserver's server crash procedure (SCP) waited for more than 3 hours. The incident was triggered by temporary network unavailability in the cluster.

On debugging, we found that the SCP was stuck because of its child {{SplitWALProcedure}}, which was waiting for a regionserver worker to complete the SplitWALRemote procedure. While running, the SplitWALRemote procedure encountered an unknown exception. In the logs we can see a "hdfs.DataStreamer - No ack received" error while the regionserver was connecting to a DataNode. After this error the thread was either stuck or had died, as no further related logs exist.

Inconsistent regions were reported during this period. All procedures were restarted and completed after the active HMaster service was bounced.

Related logs:
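
To illustrate the idea in the summary, here is a minimal sketch in plain java.util.concurrent of a pool whose worker threads report otherwise silent failures through an uncaught exception handler. It is not the HBase ExecutorService code or the actual patch; the class name HandlerSketch, the helpers loggingFactory and runFailingTask, and the "worker" thread prefix are all hypothetical. Note that the handler only fires for tasks started with execute(); with submit() the exception is captured in the returned Future instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; not taken from the HBase code base.
class HandlerSketch {

    /** Builds a thread factory whose threads pass uncaught throwables to the given handler. */
    static ThreadFactory loggingFactory(String prefix, Thread.UncaughtExceptionHandler handler) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true);
            t.setUncaughtExceptionHandler(handler);
            return t;
        };
    }

    /** Runs one failing task and returns the throwable the handler saw, or null on timeout. */
    static Throwable runFailingTask() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        Thread.UncaughtExceptionHandler handler = (thread, error) -> {
            // Without a handler like this, the worker could die with no application log entry.
            System.err.println("Uncaught in " + thread.getName() + ": " + error);
            seen.set(error);
            done.countDown();
        };
        ExecutorService pool = Executors.newFixedThreadPool(1, loggingFactory("worker", handler));
        try {
            // execute(), not submit(): submit() would store the exception in a Future.
            pool.execute(() -> {
                throw new IllegalStateException("simulated failure");
            });
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("handler saw: " + runFailingTask());
    }
}
```

A handler of this kind only makes the failure visible; whether the failed procedure should then be retried or reported back to the master is a separate design decision for the actual fix.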