I do. I think I saw it just last week, with a failure case as follows, on a small
testbed (aren't they all? :-/ ) that some of our devs are working with:
- Local RS and datanode are talking
- Something happens to the datanode
org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
java.io.IOException: Unable to create new block.
- RS won't try talking to other datanodes elsewhere on the cluster
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_7040605219500907455_6449696
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_-5367929502764356875_6449620
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_7075535856966512941_6449680
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_77095304474221514_6449685
- RS goes down
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
required.
Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException ...
Not a blocker, in that with working sync in 0.21 the downed RS won't lose data
and can be restarted. But it is a critical issue, because it will be frequently
encountered and will cause processes on the cluster to shut down. Without some
kind of "god" monitor or human intervention, eventually there will be
insufficient resources to carry all regions.
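The loop in the log above, where the client abandons block after block without ever steering away from the dead datanode, is what HDFS-630 addresses: let the client tell the namenode which nodes have already failed for this file. A minimal sketch of the idea, with hypothetical names (this is not the actual DFSClient or namenode API):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of HDFS-630-style block allocation: the client remembers
// datanodes that failed for this file and skips them when picking a
// target for the next block. Names here are illustrative, not Hadoop's.
public class BlockAllocatorSketch {
    private final List<String> liveNodes;           // nodes the "namenode" offers
    private final Set<String> excluded = new HashSet<>();

    public BlockAllocatorSketch(List<String> liveNodes) {
        this.liveNodes = liveNodes;
    }

    // Called when createBlockOutputStream fails against a node.
    public void markFailed(String node) {
        excluded.add(node);
    }

    // Pick a target for the next block, skipping nodes that already failed.
    // Without the exclusion set (the pre-630 behavior) this would hand back
    // the same dead node every time, producing the "Abandoning block" loop.
    public String nextTarget() throws IOException {
        for (String node : liveNodes) {
            if (!excluded.contains(node)) {
                return node;
            }
        }
        throw new IOException("Unable to create new block.");
    }
}
```

With the exclusion set, the retry lands on a different datanode; without it, the RS burns through its retries against the same dead node and forces the shutdown shown above.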
- Andy
________________________________
From: Stack <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Sat, December 12, 2009 1:01:49 PM
Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
region and non-registration of daughters; fix the order in which things are run
So we think this is critical to HBase?
Stack
On Dec 12, 2009, at 12:43 PM, Andrew Purtell <[email protected]> wrote:
> All HBase committers should jump on that issue and +1. We should make that
> kind of statement for the record.
>
>
> ________________________________
> From: stack (JIRA) <[email protected]>
> To: [email protected]
> Sent: Sat, December 12, 2009 12:39:18 PM
> Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed region
> and non-registration of daughters; fix the order in which things are run
>
>
> [
> https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> stack resolved HBASE-1972.
> --------------------------
>
> Resolution: Won't Fix
>
> Marking as invalid; addressed by HDFS-630. Thanks for looking at this, Cosmin.
> Want to open an issue on getting 630 into 0.21? There will be pushback, I'd
> imagine, since it's not "critical", but it might make 0.21.1.
>
>> Failed split results in closed region and non-registration of daughters; fix
>> the order in which things are run
>> --------------------------------------------------------------------------------------------------------------
>>
>> Key: HBASE-1972
>> URL: https://issues.apache.org/jira/browse/HBASE-1972
>> Project: Hadoop HBase
>> Issue Type: Bug
>> Reporter: stack
>> Priority: Blocker
>> Fix For: 0.21.0
>>
>>
>> As part of a split, we go to close the region. The close fails because the
>> flush failed -- a DN was down and HDFS refuses to move past it -- so we jump
>> up out of the close with an IOE. But the region has been closed, yet it's
>> still listed in .META. as online.
>> Here is where the hole is:
>> 1. CompactSplitThread calls split.
>> 2. This calls HRegion splitRegion.
>> 3. splitRegion calls close(false).
>> 4. Down at the end of the close, we get as far as the LOG.info("Closed " +
>> this)..... but a DFSClient running thread throws an exception because it
>> can't allocate a block for the flush made as part of the close (Ain't sure
>> how... we should add more try/catch in here):
>> {code}
>> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Added
>> hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566,
>> entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to
>> TestTable,,1257986664542
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
>> ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017,
>> compaction requested=false
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: closed info
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_1351692500502810095_1391
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-3310646336307339512_1391
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_3070440586900692765_1393
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-5656011219762164043_1393
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-2359634393837722978_1393
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-1626727145091780831_1393
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient:
>> DataStreamer Exception: java.io.IOException: Unable to create new block.
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient:
>> Could not get block locations. Source file
>> "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151"
>> - Aborting...
>> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
>> failed for region TestTable,,1257986664542
>> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> {code}
>> Marking this as blocker.
>
> --This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
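For the record, the ordering hole in steps 1-4 of the quoted issue comes down to treating the region as closed before the close's flush is known to have succeeded. A hedged sketch of the safer ordering, with hypothetical method names (not the actual HRegion/CompactSplitThread code): only proceed toward registering daughters after close() returns cleanly, and put the parent back online if it throws.

```java
import java.io.IOException;

// Sketch of the split ordering fix: catch the IOException from close()
// and roll back, so the parent never ends up offline while .META. still
// lists it as online. Interface and method names are hypothetical.
public class SplitSketch {
    interface Region {
        void close() throws IOException;   // flushes the memstore; may fail
        void reopen();                     // undo a failed close
    }

    // Returns true only if the split may proceed to META registration.
    static boolean closeForSplit(Region parent) {
        try {
            parent.close();
            return true;                   // safe to create daughters
        } catch (IOException e) {
            parent.reopen();               // leave the parent serving
            return false;                  // abort the split, retry later
        }
    }
}
```

The key point is that the IOE is caught at the step that decides whether the split continues, rather than escaping past the point where the region's state and .META. have already diverged.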