Hi,


We recently ran into a production issue. Here is a summary of the events
we went through, in timeline order:



   1. One of the region servers went down (it became inaccessible).
   2. Region reassignment was initiated, but some regions of multiple
   tables got stuck in transition. Most of them are in status
   “OPEN_FAILED”, “OPENING”, “PENDING”, or “CLOSE_FAILED”.
   3. Client requests to those tables are still being routed to the lost
   server, causing failures/timeouts. (What can we do about this?)
   4. After waiting for many hours, we ran hbck -repair per table, which
   resolved the issues for some of them.
   5. One table's data can go stale within hours, so we planned to
   recreate it to avoid any corruption. Disabling the table went through
   fine, but dropping it got stuck in state “DELETE_TABLE_PRE_OPERATION”;
   it is waiting for regions in transition to finish. The regions it is
   complaining about are in “OPENING” status.
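For reference, the hbck invocations in step 4 looked roughly like this
(run as the hbase user; the table name is the one from the log below):

                hbase hbck -details                  # full consistency report
                hbase hbck -repair Queue-SCKAD       # per-table repair (step 4)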

Here is the exception:



2017-12-08 18:59:17,975 WARN  [ProcedureExecutor-10] procedure.DeleteTableProcedure: Retriable error trying to delete table=Queue-SCKAD state=DELETE_TABLE_PRE_OPERATION

org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out while waiting on regions Queue-SCKAD,B19,1502479054304.15a44cf47634d7d2264eaf00d61f6036. in transition

                at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:123)

                at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:103)



   1. This operation has been running for more than 24 hours and doesn't
   time out (isn't there a 2-hour timeout for client operations at the
   HBase level?). Re-enabling the table also just queues up with no
   progress.
   2. Because the table is disabled, running hbck doesn't help, as it
   reports regions = 0.
   3. We added a new node to the cluster to replace the old one, but the
   HBase balancer doesn't kick in at all. So, basically, region movement
   is totally stuck.
   4. No data is missing on HDFS; it is 100% consistent. The hbck detail
   report on the whole cluster also returns OK.
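On the balancer point: these are the standard hbase shell commands we
know of for it, listed here in case we missed a step:

                balance_switch true    # ensure the balancer is enabled (prints the previous state)
                balancer               # ask the master to run a balancing pass now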



I can provide additional logs on request, but can you suggest how we can
resolve this problem with the cluster? Would restarting the HBase master
process help? We can't afford another outage on the cluster, which makes
the situation tricky.
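If a master restart is the recommended path, would a controlled failover
be safer than a hard restart? Something like the following, assuming a
backup master is configured:

                # on the active master host
                bin/hbase-daemon.sh stop master
                # a configured backup master should then take over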



My questions:



   1. Why does the drop operation need to wait for regions in transition
   to finish? Is there a way to abort the ongoing region movement, or
   even the drop operation itself?
   2. Why are rebalancing and the rest of the operations stuck?
   3. Can you suggest what action we can take to resolve this?
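On question 1: we noticed recent shell versions have list_procedures and
abort_procedure commands. Would something like this be safe against the
stuck DeleteTableProcedure (the proc id below is a placeholder)?

                list_procedures              # find the id of the stuck DeleteTableProcedure
                abort_procedure <proc_id>    # attempt to abort it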



Thank you for your time and help.



Regards
