I do. I think I saw it just last week, with a failure case as follows, on a small
testbed (aren't they all? :-/ ) that some of our devs are working with:
- Local RS and datanode are talking
- Something happens to the datanode
org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
java.io.IOException: Unable to create new block.
- RS won't try talking to other datanodes elsewhere on the cluster
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_7040605219500907455_6449696
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_-5367929502764356875_6449620
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_7075535856966512941_6449680
org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_77095304474221514_6449685
- RS goes down
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
required.
Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException ...
Not a blocker, in that with working sync in 0.21 the downed RS won't lose data
and can be restarted. But it is a critical issue, because it will be frequently
encountered and will cause processes on the cluster to shut down. Without some
kind of "god" monitor or human intervention, eventually there will be
insufficient resources to carry all regions.
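The loop in the log above, where the client abandons block after block without ever steering away from the dead datanode, is what HDFS-630 addresses: let the client tell the namenode which nodes have already failed for this file. A minimal sketch of the idea, with hypothetical names (this is not the actual DFSClient or namenode API):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of HDFS-630-style block allocation: the client remembers
// datanodes that failed for this file and skips them when picking a
// target for the next block. Names here are illustrative, not Hadoop's.
public class BlockAllocatorSketch {
    private final List<String> liveNodes;           // nodes the "namenode" offers
    private final Set<String> excluded = new HashSet<>();

    public BlockAllocatorSketch(List<String> liveNodes) {
        this.liveNodes = liveNodes;
    }

    // Called when createBlockOutputStream fails against a node.
    public void markFailed(String node) {
        excluded.add(node);
    }

    // Pick a target for the next block, skipping nodes that already failed.
    // Without the exclusion set (the pre-630 behavior) this would hand back
    // the same dead node every time, producing the "Abandoning block" loop.
    public String nextTarget() throws IOException {
        for (String node : liveNodes) {
            if (!excluded.contains(node)) {
                return node;
            }
        }
        throw new IOException("Unable to create new block.");
    }
}
```

With the exclusion set, the retry lands on a different datanode; without it, the RS burns through its retries against the same dead node and forces the shutdown shown above.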
- Andy
________________________________
From: Stack <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Sat, December 12, 2009 1:01:49 PM
Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
region and non-registration of daughters; fix the order in which things are run
So we think this is critical to HBase?
Stack
On Dec 12, 2009, at 12:43 PM, Andrew Purtell <[email protected]> wrote:
> All HBase committers should jump on that issue and +1. We should make that
> kind of statement for the record.
>
>
> ________________________________
> From: stack (JIRA) <[email protected]>
> To: [email protected]
> Sent: Sat, December 12, 2009 12:39:18 PM
> Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed region
> and non-registration of daughters; fix the order in which things are run
>
>
> [
> https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> stack resolved HBASE-1972.
> --------------------------
>
> Resolution: Won't Fix
>
> Marking as invalid; addressed by HDFS-630. Thanks for looking at this, Cosmin.
> Want to open an issue on getting 630 into 0.21? There will be pushback, I'd
> imagine, since it's not "critical", but it might make 0.21.1.
>
>> Failed split results in closed region and non-registration of daughters; fix
>> the order in which things are run
>> --------------------------------------------------------------------------------------------------------------
>>
>> Key: HBASE-1972
>> URL: https://issues.apache.org/jira/browse/HBASE-1972
>> Project: Hadoop HBase
>> Issue Type: Bug
>> Reporter: stack
>> Priority: Blocker
>> Fix For: 0.21.0
>>
>>
>> As part of a split, we go to close the region. The close fails because the
>> flush failed -- a DN was down and HDFS refuses to move past it -- so we jump
>> up out of the close with an IOE. But the region has been closed, yet it's
>> still listed in .META. as online.
>> Here is where the hole is:
>> 1. CompactSplitThread calls split.
>> 2. This calls HRegion splitRegion.
>> 3. splitRegion calls close(false).
>> 4. Down at the end of the close, we get as far as the LOG.info("Closed " +
>> this)..... but a DFSClient running thread throws an exception because it
>> can't allocate a block for the flush made as part of the close (Ain't sure
>> how... we should add more try/catch in here):
>> {code}
>> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Added
>> hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566,
>> entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to
>> TestTable,,1257986664542
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
>> ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017,
>> compaction requested=false
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: closed info
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_1351692500502810095_1391
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-3310646336307339512_1391
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_3070440586900692765_1393
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-5656011219762164043_1393
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-2359634393837722978_1393
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect ack
>> with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-1626727145091780831_1393
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient:
>> DataStreamer Exception: java.io.IOException: Unable to create new block.
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient:
>> Could not get block locations. Source file
>> "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151"
>> - Aborting...
>> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
>> failed for region TestTable,,1257986664542
>> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> {code}
>> Marking this as blocker.
>
> --This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
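For the record, the ordering hole in steps 1-4 of the quoted issue comes down to treating the region as closed before the close's flush is known to have succeeded. A hedged sketch of the safer ordering, with hypothetical method names (not the actual HRegion/CompactSplitThread code): only proceed toward registering daughters after close() returns cleanly, and put the parent back online if it throws.

```java
import java.io.IOException;

// Sketch of the split ordering fix: catch the IOException from close()
// and roll back, so the parent never ends up offline while .META. still
// lists it as online. Interface and method names are hypothetical.
public class SplitSketch {
    interface Region {
        void close() throws IOException;   // flushes the memstore; may fail
        void reopen();                     // undo a failed close
    }

    // Returns true only if the split may proceed to META registration.
    static boolean closeForSplit(Region parent) {
        try {
            parent.close();
            return true;                   // safe to create daughters
        } catch (IOException e) {
            parent.reopen();               // leave the parent serving
            return false;                  // abort the split, retry later
        }
    }
}
```

The key point is that the IOE is caught at the step that decides whether the split continues, rather than escaping past the point where the region's state and .META. have already diverged.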