First, "ulimit: 1024"
That's fatal. You need to up the file descriptor limit to something like
32K. See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6.
From there, let's see.

 - Andy

> From: Oded Rosen <o...@legolas-media.com>
> Subject: DFSClient errors during massive HBase load
> To: hbase-user@hadoop.apache.org
> Date: Thursday, April 1, 2010, 1:19 PM
>
> Hi all,
>
> I have a problem with a massive HBase loading job. It goes from raw
> files to HBase, through some mapreduce processing + manipulation (so
> loading directly to files will not be easy).
>
> After some dozen million successful writes - a few hours of load - some
> of the regionservers start to die, one by one, until the whole cluster
> is kaput. The HBase master sees a "znode expired" error each time a
> regionserver falls. The regionserver errors are attached.
>
> Current configuration:
> Four nodes - one namenode+master, three datanodes+regionservers.
> dfs.datanode.max.xcievers: 2047
> ulimit: 1024
> Servers: Fedora
> hadoop-0.20, hbase-0.20, hdfs (private servers, not on EC2 or anything).
>
> *The specific errors from the regionserver log (from <IP6>, see comment):*
>
> 2010-04-01 11:36:00,224 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_7621973847448611459_244908java.io.IOException: Bad response 1 for block
> blk_7621973847448611459_244908 from datanode <IP2>:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
>
> *After that, some of this appears:*
>
> 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> createBlockOutputStream java.io.IOException: Bad connect ack with
> firstBadLink <IP2>:50010
> 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> block blk_4280490438976631008_245009
>
> *And the FATAL:*
>
> 2010-04-01 11:36:32,634 FATAL org.apache.hadoop.hbase.regionserver.HLog:
> Could not append.
> Requesting close of hlog
> java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
>
> *This FATAL error appears many times until this one kicks in:*
>
> 2010-04-01 11:38:57,281 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> Replay of hlog required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region: .META.,,1
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
>         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
> Caused by: java.io.IOException: Bad connect ack with firstBadLink
> <IP2>:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
>
> *(Then the regionserver starts closing itself.)*
>
> The regionserver on <IP6> was shut down, but the problems are correlated
> with <IP2> (notice the IP in the error messages). <IP2> was also
> considered a dead node after these errors, according to the Hadoop
> namenode web UI. I think this is an HDFS failure, rather than
> HBase/ZooKeeper (although it is probably because of HBase's high load...).
>
> On the datanodes, once in a while I had:
>
> 2010-04-01 11:24:59,265 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(<IP2>:50010,
> storageID=DS-1822315410-<IP2>-50010-1266860406782, infoPort=50075,
> ipcPort=50020):DataXceiver
>
> but these errors occurred at different times, and not even around the
> crashes. No fatal errors were found in the datanode log (but it still
> crashed).
>
> I haven't seen this exact error on the web (only similar ones).
> This guy (http://osdir.com/ml/hbase-user-hadoop-apache/2009-02/msg00186.html)
> had a similar problem, but not exactly the same.
>
> Any ideas?
> thanks,
>
> --
> Oded
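For the ulimit item, something along these lines is the usual fix on a
Linux box. This is a sketch, not from the thread: the user name "hadoop"
and the 32768 value are assumptions -- substitute whatever user actually
runs the regionserver/datanode daemons on your Fedora hosts.

```shell
# Show the current per-process open-file limit for this shell
# (the cluster in the thread reports 1024, which is the common default):
ulimit -n

# To raise it persistently, as root append entries to
# /etc/security/limits.conf, e.g. for an assumed "hadoop" user:
#
#   hadoop  soft  nofile  32768
#   hadoop  hard  nofile  32768
#
# The limit is applied at login, so the daemons must be restarted from a
# fresh session of that user before `ulimit -n` reflects the new value.
```

Note the limit is per-process and inherited at startup; changing the file
alone does nothing for regionservers that are already running, which is
easy to miss when the crashes only show up hours into a load.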