Hi,
We recently ran into a production issue. Here is a summary of events, in timeline order:

1. One of the region servers went down (it became inaccessible).
2. Region transitions were initiated, and some regions of multiple tables got stuck in transition. Most of them are in state "OPEN_FAILED", "OPENING", "PENDING", or "CLOSE_FAILED".
3. Client requests to those tables are still being routed to the lost server, causing failures/timeouts. (What can we do about that?)
4. After waiting many hours, we ran hbck -repair per table, which resolved the issues for some of them.
5. One table, whose data goes stale within hours, we planned to recreate to avoid any corruption. Disabling the table went through fine, but dropping it got stuck in state "DELETE_TABLE_PRE_OPERATION", waiting for regions in transition to finish. The regions it complains about are in "OPENING" state. Here is the exception:

2017-12-08 18:59:17,975 WARN [ProcedureExecutor-10] procedure.DeleteTableProcedure: Retriable error trying to delete table=Queue-SCKAD state=DELETE_TABLE_PRE_OPERATION
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out while waiting on regions Queue-SCKAD,B19,1502479054304.15a44cf47634d7d2264eaf00d61f6036. in transition
    at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:123)
    at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.waitFor(ProcedureSyncWait.java:103)

Where things stand now:

1. This drop operation has been running for more than 24 hours without timing out (isn't there a 2-hour timeout for client operations at the HBase level?). Re-enabling the table also just queues up with no progress.
2. Because the table is in disabled state, running hbck doesn't help; it reports regions = 0.
3. We added a new node to the cluster to replace the old one, but the HBase balancer doesn't kick in at all. So, basically, region movement is totally stuck.
4. There is no missing data on HDFS; it is 100% consistent.
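For the record, the repair and recreate steps above were run roughly as follows. The commands are echoed here rather than executed, since exact hbck flags can vary by HBase version; the table name is the one from the log:

```shell
# Commands run during the incident (steps 4 and 5 above), echoed for
# the record -- substitute your own table names before running for real.
TABLE="Queue-SCKAD"

echo "hbase hbck -repair ${TABLE}"                 # per-table repair (step 4)
echo "echo \"disable '${TABLE}'\" | hbase shell"   # disable: completed fine
echo "echo \"drop '${TABLE}'\" | hbase shell"      # drop: stuck in DELETE_TABLE_PRE_OPERATION
```

The disable completed, but the drop is the operation that hung in DELETE_TABLE_PRE_OPERATION as shown in the exception above.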
An hbck detail report on the whole cluster also returns OK. I can provide additional logs on request, but can you suggest how we can resolve this problem with the cluster? Would restarting the HBase master process help? We can't afford another outage on this cluster, which makes the situation tricky.

My questions:

1. Why does the drop operation need to wait for regions in transition to finish? Is there a way to abort the ongoing region movement, or even the drop operation itself?
2. Why are rebalancing and the rest of the operations stuck?
3. Can you please suggest what action we can take to resolve this?

Thank you for your time and help.

Regards