Just as a note, I think I had the same issue. This is on my 7+1 cluster during an MR import job:
2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for region ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643. Current region memstore size 64.2m
2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.99.38:51729 remote=/192.168.99.38:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 bad datanode[0] 192.168.99.38:50010
2009-12-08 01:15:57,410 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 in pipeline 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,567 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.37:50010 failed 1 times. Pipeline was 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 bad datanode[0] 192.168.99.38:50010
2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 in pipeline 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,583 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.37:50010 failed 2 times. Pipeline was 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 bad datanode[0] 192.168.99.38:50010
2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 in pipeline 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,591 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.37:50010 failed 3 times. Pipeline was 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 bad datanode[0] 192.168.99.38:50010
2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 in pipeline 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,598 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.37:50010 failed 4 times. Pipeline was 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 bad datanode[0] 192.168.99.38:50010
2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 in pipeline 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,608 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.37:50010 failed 5 times. Pipeline was 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 bad datanode[0] 192.168.99.38:50010
2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 in pipeline 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,615 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.37:50010 failed 6 times. Pipeline was 192.168.99.38:50010, 192.168.99.37:50010. Marking primary datanode as bad.
2009-12-08 01:15:58,625 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.38:50010 failed 1 times. Pipeline was 192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,637 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.38:50010 failed 2 times. Pipeline was 192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,654 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.38:50010 failed 3 times. Pipeline was 192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,668 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.38:50010 failed 4 times. Pipeline was 192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,678 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.38:50010 failed 5 times. Pipeline was 192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,685 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.38:50010 failed 6 times. Pipeline was 192.168.99.38:50010. Aborting...
2009-12-08 01:15:58,685 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:946)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:839)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
Caused by: java.io.IOException: Error Recovery for block blk_2400329754585253075_931440 failed because recovery from primary datanode 192.168.99.38:50010 failed 6 times. Pipeline was 192.168.99.38:50010. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2584)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2241)
2009-12-08 01:15:58,688 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=11.9, regions=37, stores=296, storefiles=697, storefileIndexSize=66, memstoreSize=1184, usedHeap=3319, maxHeap=4087, blockCacheSize=7033392, blockCacheFree=850216848, blockCacheCount=0, blockCacheHitRatio=0
2009-12-08 01:15:58,688 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver/192.168.99.38:60020.cacheFlusher exiting
2009-12-08 01:15:58,779 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020, call put([...@7237791f, [Lorg.apache.hadoop.hbase.client.Put;@17f11cce) from 192.168.99.34:34211: error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
    at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:15:58,796 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 13 on 60020, call put([...@569a24a9, [Lorg.apache.hadoop.hbase.client.Put;@21dcffaa) from 192.168.99.36:42492: error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
    at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: exiting
...
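One thing that jumps out of that trace is the 10000 millis write timeout, which looks much lower than the stock HDFS write-side default (8 minutes, if I remember right), so it may have been overridden somewhere on this cluster. Worth double-checking while the recovery problem itself is open. A minimal sketch of setting the two socket timeouts explicitly on the client configuration, assuming dfs.datanode.socket.write.timeout and dfs.socket.timeout are the keys in play on this version:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CheckDfsTimeouts {
  public static void main(String[] args) throws Exception {
    // Client-side configuration that the DFSClient inside the region
    // server (or this little test program) will pick up.
    Configuration conf = new Configuration();

    // Write-side socket timeout used when streaming packets to the
    // datanode pipeline; 0 disables it. The 10000 in the trace above
    // suggests it was set to 10s somewhere.
    conf.setInt("dfs.datanode.socket.write.timeout", 8 * 60 * 1000);

    // Read-side socket timeout used while waiting on datanode replies.
    conf.setInt("dfs.socket.timeout", 60 * 1000);

    FileSystem fs = FileSystem.get(conf);
    System.out.println("fs=" + fs.getUri()
        + " write.timeout=" + conf.get("dfs.datanode.socket.write.timeout")
        + " read.timeout=" + conf.get("dfs.socket.timeout"));
    fs.close();
  }
}
{code}

The same keys can of course live in hdfs-site.xml (or hbase-site.xml on the region server side); the sketch is just a way to confirm what the client configuration actually resolves to.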
On Sun, Dec 13, 2009 at 1:03 AM, stack <[email protected]> wrote:

I wrote hdfs-dev to see how to proceed. We could try running a vote to get it committed to 0.21.
St.Ack
On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <[email protected]> wrote:

I do. I think I saw it just last week with a failure case as follows on a small testbed (aren't they all? :-/ ) that some of our devs are working with:
- Local RS and datanode are talking

- Something happens to the datanode

org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.

- RS won't try talking to other datanodes elsewhere on the cluster

org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7040605219500907455_6449696
org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5367929502764356875_6449620
org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7075535856966512941_6449680
org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_77095304474221514_6449685

- RS goes down

org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException ...
Not a blocker in that the downed RS with working sync in 0.21 won't lose data and can be restarted. But it is a critical issue, because it will be frequently encountered and will cause processes on the cluster to shut down. Without some kind of "god" monitor or human intervention, eventually there will be insufficient resources to carry all regions.
- Andy
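What the write path needs in that last case is to remember which datanodes have already failed and ask the namenode for a block that avoids them, rather than abandoning block after block against the same dead node; as far as I can tell, that is roughly what HDFS-630 (referenced further down this thread) is about. Purely as an illustration of that retry-with-exclusion idea, with made-up names rather than the real DFSClient/namenode API:

{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: retry block allocation while excluding datanodes that
// already failed, instead of abandoning block after block against the same
// dead node. Class and method names are made up, not DFSClient/namenode API.
public class ExcludeBadDatanodesSketch {

  private static final int MAX_RETRIES = 3;

  public static void main(String[] args) {
    List<String> cluster = new ArrayList<String>();
    cluster.add("192.168.99.38:50010");   // pretend this one is dead
    cluster.add("192.168.99.37:50010");
    cluster.add("192.168.99.36:50010");

    Set<String> excluded = new HashSet<String>();
    String chosen = null;

    for (int attempt = 1; attempt <= MAX_RETRIES && chosen == null; attempt++) {
      // Stand-in for asking the namenode for a new block location while
      // passing along the nodes we already know are bad.
      String candidate = pickDatanode(cluster, excluded);
      if (candidate == null) {
        break;  // nothing left to try on this (small) cluster
      }
      if (connect(candidate)) {
        chosen = candidate;
      } else {
        // Remember the failure so the next request avoids this node,
        // rather than getting the same dead datanode handed back again.
        excluded.add(candidate);
        System.out.println("Abandoning attempt via " + candidate
            + "; excluding it and retrying");
      }
    }
    System.out.println(chosen == null
        ? "No usable datanode found" : "Writing via " + chosen);
  }

  private static String pickDatanode(List<String> cluster, Set<String> excluded) {
    for (String dn : cluster) {
      if (!excluded.contains(dn)) {
        return dn;
      }
    }
    return null;
  }

  private static boolean connect(String datanode) {
    // Fake connectivity check for the sketch: the .38 node is "down".
    return !datanode.startsWith("192.168.99.38");
  }
}
{code}

On a small testbed like the one described above this matters even more, since a single excluded node is a large fraction of the candidate set.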
________________________________
From: Stack <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Sat, December 12, 2009 1:01:49 PM
Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run
So we think this is critical to HBase?
Stack
On Dec 12, 2009, at 12:43 PM, Andrew Purtell <[email protected]> wrote:

All HBase committers should jump on that issue and +1. We should make that kind of statement for the record.
________________________________
From: stack (JIRA) <[email protected]>
To: [email protected]
Sent: Sat, December 12, 2009 12:39:18 PM
Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run
[ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack resolved HBASE-1972.
--------------------------
Resolution: Won't Fix
Marking as invalid; addressed by HDFS-630. Thanks for looking at this, Cosmin. Want to open an issue on getting 630 into 0.21? There will be pushback, I'd imagine, since it's not "critical", but it might make 0.21.1.
Failed split results in closed region and non-registration of daughters; fix the order in which things are run
---------------------------------------------------------------------------------------------------------------
Key: HBASE-1972
URL: https://issues.apache.org/jira/browse/HBASE-1972
Project: Hadoop HBase
Issue Type: Bug
Reporter: stack
Priority: Blocker
Fix For: 0.21.0
As part of a split, we go to close the region. The close fails because flush failed -- a DN was down and HDFS refuses to move past it -- so we jump up out of the close with an IOE. But the region has been closed, yet it's still in the .META. as online.
Here is where the hole is:

1. CompactSplitThread calls split.
2. This calls HRegion splitRegion.
3. splitRegion calls close(false).
4. Down at the end of the close, we get as far as the LOG.info("Closed " + this)... but a DFSClient running thread throws an exception because it can't allocate a block for the flush made as part of the close (ain't sure how... we should add more try/catch in here):
{code}
2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
{code}
Marking this as blocker.
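On the "add more try/catch in here" aside in step 4: the gap is that the IOE from the flush escapes after the region has already been torn down but before .META. is touched, so the two end up disagreeing. A rough sketch of the kind of guard that could close that gap, using hypothetical names rather than the actual HRegion/CompactSplitThread code: only move on to daughter registration and .META. edits once close (including its flush) has returned cleanly, and on failure leave the region usable instead of closed-but-still-online.

{code}
import java.io.IOException;

// Illustration only (hypothetical names, not HRegion/CompactSplitThread):
// never touch .META. until the close -- including its flush -- has finished,
// and on failure keep the region serving so the split can simply be retried.
public class SplitCloseSketch {

  static class RegionHandle {
    volatile boolean online = true;

    void flush() throws IOException {
      // This is where the real failure surfaces ("Unable to create new
      // block", "Bad connect ack with firstBadLink", etc.).
      throw new IOException("Bad connect ack with firstBadLink");
    }

    void close() throws IOException {
      flush();          // flush memstore as part of close
      online = false;   // only reached if the flush succeeded
    }
  }

  static void splitRegion(RegionHandle parent) {
    try {
      parent.close();
    } catch (IOException e) {
      // Close (or its flush) failed: abort the split here, before any
      // .META. edits, so region state and .META. cannot drift apart.
      System.out.println("Split aborted, region still online="
          + parent.online + ": " + e.getMessage());
      return;
    }
    // Only now is it safe to create daughters and update .META.
    System.out.println("Close succeeded; registering daughter regions");
  }

  public static void main(String[] args) {
    splitRegion(new RegionHandle());
  }
}
{code}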