Hi,
We recently ran into a production issue. Here is a summary of events, in timeline order:

1. One of the region servers went down (it became inaccessible).
2. Region transitions were initiated, and some regions of multiple tables got stuck in transition. Most of them are in state "OPEN_FAILED", "OPENING", "PENDING", or "CLOSE_FAILED".
3. Client requests to those tables are still being routed to the lost server, causing failures/timeouts. (What can we do about that?)
4. After waiting many hours, we ran hbck -repair per table, which resolved the issues for some of them.
5. One table, whose data goes stale within hours, we planned to recreate to avoid any corruption. Disabling the table went through fine, but dropping it got stuck in state "DELETE_TABLE_PRE_OPERATION", waiting for regions in transition to finish. The regions it complains about are in "OPENING" state. Here is the exception:

2017-12-08 18:59:17,975 WARN [ProcedureExecutor-10] procedure.DeleteTableProcedure: Retriable error trying to delete table=Queue-SCKAD state=DELETE_TABLE_PRE_OPERATION
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out while waiting on regions Queue-SCKAD,B19,1502479054304.15a44cf47634d7d2264eaf00d61f6036. in transition
    at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:123)
    at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:103)

Where things stand now:

1. This drop operation has been running for more than 24 hours without timing out (isn't there a 2-hour timeout for client operations at the HBase level?). Re-enabling the table also just queues up with no progress.
2. Because the table is in disabled state, running hbck doesn't help; it reports regions = 0.
3. We added a new node to the cluster to replace the old one, but the HBase balancer doesn't kick in at all. So, basically, region movement is totally stuck.
4. There is no missing data on HDFS; it is 100% consistent.
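For the record, the repair and recreate steps above were run roughly as follows. The commands are echoed here rather than executed, since exact hbck flags can vary by HBase version; the table name is the one from the log:

```shell
# Commands run during the incident (steps 4 and 5 above), echoed for
# the record -- substitute your own table names before running for real.
TABLE="Queue-SCKAD"

echo "hbase hbck -repair ${TABLE}"                 # per-table repair (step 4)
echo "echo \"disable '${TABLE}'\" | hbase shell"   # disable: completed fine
echo "echo \"drop '${TABLE}'\" | hbase shell"      # drop: stuck in DELETE_TABLE_PRE_OPERATION
```

The disable completed, but the drop is the operation that hung in DELETE_TABLE_PRE_OPERATION as shown in the exception above.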
An hbck detail report on the whole cluster also returns OK. I can provide additional logs on request, but can you suggest how we can resolve this problem with the cluster? Would restarting the HBase master process help? We can't afford another outage on this cluster, which makes the situation tricky.

My questions:

1. Why does the drop operation need to wait for regions in transition to finish? Is there a way to abort the ongoing region movement, or even the drop operation itself?
2. Why are rebalancing and the rest of the operations stuck?
3. Can you please suggest what action we can take to resolve this?

Thank you for your time and help.

Regards