First, 

  "ulimit: 1024"

That's fatal. You need to raise the file descriptor limit to something like 32K. 

See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
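
For example, on Fedora you would typically raise the nofile limit in
/etc/security/limits.conf (I'm assuming here that the hadoop/hbase daemons
run as a user called "hadoop"; substitute whichever user you actually use),
then log in again and restart the daemons so the new limit takes effect:

    # /etc/security/limits.conf
    # <user>    <type>    <item>    <value>
    hadoop      -         nofile    32768

    # verify from a fresh login as that user
    $ ulimit -n
    32768

Note that limits.conf only applies to PAM login sessions, so if the daemons
are started some other way (init script, etc.) you may need an explicit
"ulimit -n 32768" in the startup script before launching them.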

From there, let's see.

    - Andy

> From: Oded Rosen <o...@legolas-media.com>
> Subject: DFSClient errors during massive HBase load
> To: hbase-user@hadoop.apache.org
> Date: Thursday, April 1, 2010, 1:19 PM
> Hi all,
> 
> I have a problem with a massive HBase loading job.
> It goes from raw files to hbase, through some mapreduce
> processing and manipulation (so loading directly to files
> will not be easy).
> 
> After a few dozen million successful writes - a few hours
> of load - some of the regionservers start to die, one by one,
> until the whole cluster is kaput.
> The hbase master sees a "znode expired" error each time a
> regionserver
> falls. The regionserver errors are attached.
> 
> Current configurations:
> Four nodes - one namenode+master, three datanodes+regionservers.
> dfs.datanode.max.xcievers: 2047
> ulimit: 1024
> servers: fedora
> hadoop-0.20, hbase-0.20, hdfs (private servers, not on ec2 or anything).
> 
> 
> *The specific errors from the regionserver log (from
> <IP6>, see comment):*
> 
> 2010-04-01 11:36:00,224 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_7621973847448611459_244908java.io.IOException: Bad response 1 for block blk_7621973847448611459_244908 from datanode <IP2>:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
> 
> *after that, some of this appears:*
> 
> 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
> 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_4280490438976631008_245009
> 
> *and the FATAL:*
> 
> 2010-04-01 11:36:32,634 FATAL org.apache.hadoop.hbase.regionserver.HLog: Could not append. Requesting close of hlog
> java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> 
> *this FATAL error appears many times until this one kicks
> in:*
> 
> 2010-04-01 11:38:57,281 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region: .META.,,1
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
>     at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
> Caused by: java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> 
> *(then the regionserver starts closing itself)*
> 
> The regionserver on <IP6> was shut down, but the problems
> are correlated with <IP2> (notice the ip in the error msgs).
> <IP2> was also considered a dead node after these errors,
> according to the hadoop namenode web ui.
> I think this is an hdfs failure, rather than hbase/zookeeper
> (although it is probably triggered by the high hbase load...).
> 
> On the datanodes, once in a while I had:
> 
> 2010-04-01 11:24:59,265 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(<IP2>:50010, storageID=DS-1822315410-<IP2>-50010-1266860406782, infoPort=50075, ipcPort=50020):DataXceiver
> 
> but these errors occurred at different times, not even around
> the crashes. No fatal errors were found in the datanode log
> (but it still crashed).
> 
> I haven't seen this exact error on the web (only similar
> ones);
> This guy (http://osdir.com/ml/hbase-user-hadoop-apache/2009-02/msg00186.html)
> had a similar problem, but not exactly the same.
> 
> Any ideas?
> thanks,
> 
> -- 
> Oded
> 