I wrote hdfs-dev to see how to proceed. We could try running a vote to get it committed to 0.21. St.Ack
On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <[email protected]> wrote:
> I do. I think I saw it just last week with a failure case as follows on a
> small testbed (aren't they all? :-/ ) that some of our devs are working with:
>
> - Local RS and datanode are talking
>
> - Something happens to the datanode
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
>
> - RS won't try talking to other datanodes elsewhere on the cluster
> org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7040605219500907455_6449696
> org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5367929502764356875_6449620
> org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7075535856966512941_6449680
> org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_77095304474221514_6449685
>
> - RS goes down
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required.
> Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException ...
>
> Not a blocker in that the downed RS with working sync in 0.21 won't lose
> data and can be restarted. But, a critical issue because it will be
> frequently encountered and will cause processes on the cluster to shut down.
> Without some kind of "god" monitor or human intervention, eventually there
> will be insufficient resources to carry all regions.
>
>    - Andy
>
>
> ________________________________
> From: Stack <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Sat, December 12, 2009 1:01:49 PM
> Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
> region and non-registration of daughters; fix the order in which things are run
>
> So we think this critical to hbase?
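[Editor's note: the fix Andy alludes to, HDFS-630, lets the client report the bad datanode back to the namenode and get a replacement, instead of abandoning block after block against the same dead node. The toy sketch below models that retry-with-exclusion idea only; every name in it is illustrative, and it is not the real DFSClient or NameNode API.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Toy model of the behaviour HDFS-630 adds to DFSClient: when a
 * datanode in a newly allocated pipeline cannot be reached, abandon
 * the block, remember the bad node, and ask the (here, simulated)
 * namenode for a fresh block that excludes it -- rather than retrying
 * the same dead node until the whole write gives up.
 */
public class ExcludeBadDatanodeSketch {

    // Stand-in for namenode block placement: first datanode the
    // client has not excluded.
    static String allocateBlock(List<String> datanodes, List<String> excluded) {
        for (String dn : datanodes) {
            if (!excluded.contains(dn)) {
                return dn;
            }
        }
        throw new IllegalStateException("no datanodes left to try");
    }

    // Stand-in for createBlockOutputStream: only the "bad" node fails.
    static boolean connect(String datanode, String badNode) {
        return !datanode.equals(badNode);
    }

    public static void main(String[] args) {
        List<String> datanodes = Arrays.asList("dn-140:51010", "dn-141:51010", "dn-142:51010");
        String badNode = "dn-140:51010";   // the node the RS keeps hitting in the logs
        List<String> excluded = new ArrayList<>();

        String target = null;
        while (target == null) {
            String candidate = allocateBlock(datanodes, excluded);
            if (connect(candidate, badNode)) {
                target = candidate;        // pipeline established, write proceeds
            } else {
                // "Abandoning block ...": exclude this node from the
                // next allocation instead of looping on it.
                excluded.add(candidate);
            }
        }
        System.out.println("wrote block via " + target + ", excluded " + excluded);
    }
}
```

Without the exclusion list, the loop above would get dn-140 back from every allocation and spin until a retry limit aborts the write -- which is exactly the "Abandoning block ... Unable to create new block" pattern in Andy's logs.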
> Stack
>
>
> On Dec 12, 2009, at 12:43 PM, Andrew Purtell <[email protected]> wrote:
>
> > All HBase committers should jump on that issue and +1. We should make
> > that kind of statement for the record.
> >
> >
> > ________________________________
> > From: stack (JIRA) <[email protected]>
> > To: [email protected]
> > Sent: Sat, December 12, 2009 12:39:18 PM
> > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
> > region and non-registration of daughters; fix the order in which things are run
> >
> >     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > stack resolved HBASE-1972.
> > --------------------------
> >
> >     Resolution: Won't Fix
> >
> > Marking as invalid; addressed by HDFS-630. Thanks for looking at this,
> > Cosmin. Want to open an issue on getting 630 into 0.21? There will be
> > pushback, I'd imagine, since it's not "critical", but it might make 0.21.1.
> >
> >> Failed split results in closed region and non-registration of daughters;
> >> fix the order in which things are run
> >> --------------------------------------------------------------------------------------------------------------
> >>
> >>                 Key: HBASE-1972
> >>                 URL: https://issues.apache.org/jira/browse/HBASE-1972
> >>             Project: Hadoop HBase
> >>          Issue Type: Bug
> >>            Reporter: stack
> >>            Priority: Blocker
> >>             Fix For: 0.21.0
> >>
> >>
> >> As part of a split, we go to close the region. The close fails because
> >> the flush failed -- a DN was down and HDFS refuses to move past it -- so we
> >> jump up out of the close with an IOE. But the region has been closed, yet it's
> >> still in the .META. as online.
> >> Here is where the hole is:
> >> 1. CompactSplitThread calls split.
> >> 2. This calls HRegion splitRegion.
> >> 3. splitRegion calls close(false).
> >> 4. Down the end of the close, we get as far as the LOG.info("Closed " + this)...
> >> but a DFSClient running thread throws an exception because it can't allocate
> >> a block for the flush made as part of the close (ain't sure how... we should
> >> add more try/catch in here):
> >> {code}
> >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
> >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
> >> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> {code}
> >> Marking this as blocker.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
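[Editor's note: the hole the issue description walks through (split calls close, close's flush dies on the bad datanode, the IOE escapes, and the region stays listed as online in .META. while actually closed) comes down to a missing try/catch that puts the region back online. The sketch below is a toy illustration of that missing step only; the class and method names are made up, not the real HRegion/CompactSplitThread code.]

```java
/**
 * Toy sketch of HBASE-1972: if close() throws mid-split, the region
 * must be brought back online (or the server aborted), not left closed
 * while .META. still advertises it as open.
 */
public class SplitCloseGuardSketch {

    static class Region {
        boolean online = true;
        boolean flushFails;          // simulate the dead-datanode flush failure

        void close() throws java.io.IOException {
            online = false;          // region stops serving
            if (flushFails) {
                // the flush inside close hits the bad datanode
                throw new java.io.IOException("Unable to create new block");
            }
        }

        void reopen() { online = true; }
    }

    /** Returns true if the split ran; on failure the region is back online. */
    static boolean splitRegion(Region r) {
        try {
            r.close();               // step 3 in the issue: close(false)
        } catch (java.io.IOException e) {
            r.reopen();              // the missing step: undo the close on failure
            return false;
        }
        // ... create daughters, register them in .META., offline the parent ...
        return true;
    }

    public static void main(String[] args) {
        Region r = new Region();
        r.flushFails = true;
        boolean split = splitRegion(r);
        System.out.println("split=" + split + " online=" + r.online);
    }
}
```

Without the catch block, the exception would propagate out of splitRegion exactly as in the log above, leaving `online` false on the server while .META. still points clients at the region -- the closed-but-registered state the issue title describes.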
