Andrew Purtell created HBASE-13238:
--------------------------------------
Summary: Time out locks and abort if HDFS is wedged
Key: HBASE-13238
URL: https://issues.apache.org/jira/browse/HBASE-13238
Project: HBase
Issue Type: Brainstorming
Reporter: Andrew Purtell
This is a brainstorming issue on the topic of timing out locks and aborting if
HDFS is wedged.
We had a minor production incident where a region had still not closed after 24
hours. The CloseRegionHandler was waiting for a write lock on the
ReentrantReadWriteLock we take in HRegion#doClose. There were outstanding read
locks. Three other threads were stuck in scanning, all blocked on the same
DFSInputStream. Two were blocked in DFSInputStream#getFileLength, the third was
waiting in epoll from SocketIOWithTimeout$SelectorPool#select with apparent
infinite timeout from PacketReceiver#readChannelFully.
This is similar to other issues we have seen before, where a region wants to
finish a compaction but cannot because some HDFS issue has made the reader
extremely slow, if not wedged.
The Hadoop version was 2.3 (specifically 2.3 CDH 5.0.1), and we are planning to
upgrade, but [~lhofhansl] and I were discussing the issue in general and wonder
whether we should be timing out locks such as the ReentrantReadWriteLock and, on
timeout, aborting the regionserver. In this case this would have caused recovery and
reassignment of the region in question and we would not have had a prolonged
availability problem.
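
To make the idea concrete, here is a minimal sketch of a timed write-lock
acquisition that aborts on timeout. It is not HBase code: the class name, the
Abortable interface, the closeWithTimeout helper and the 200 ms timeout are all
hypothetical, chosen only for illustration. The only real APIs used are
ReentrantReadWriteLock and the timed tryLock on its write lock. A stuck scanner
is simulated by a thread that holds a read lock and never makes progress.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TimedCloseLockSketch {

    /** Hypothetical stand-in for whatever would abort the regionserver. */
    interface Abortable {
        void abort(String why);
    }

    /**
     * Tries to take the write lock within the given timeout. If the lock is
     * acquired, runs closeAction and returns true. If readers never drain,
     * calls server.abort(...) and returns false instead of waiting forever.
     */
    static boolean closeWithTimeout(ReentrantReadWriteLock lock, long timeout, TimeUnit unit,
                                    Runnable closeAction, Abortable server)
            throws InterruptedException {
        if (!lock.writeLock().tryLock(timeout, unit)) {
            server.abort("Timed out after " + timeout + " " + unit
                + " waiting for the close lock");
            return false;
        }
        try {
            closeAction.run();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Simulated scanner: takes a read lock, then blocks as if HDFS were wedged.
        Thread scanner = new Thread(() -> {
            lock.readLock().lock();
            try {
                readHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        scanner.start();
        readHeld.await();

        // The close path gives up after 200 ms and asks for an abort.
        boolean closed = closeWithTimeout(lock, 200, TimeUnit.MILLISECONDS,
            () -> System.out.println("region closed"),
            why -> System.out.println("ABORT: " + why));
        System.out.println("closed=" + closed);

        // Let the simulated scanner finish so the JVM can exit.
        release.countDown();
        scanner.join();
    }
}
```

Note that a timeout on the lock only helps the closing thread; the readers
blocked in the DFS client stay blocked, which is why the abort, rather than a
retry, is what would actually trigger recovery and reassignment.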