Hi Jay,

Yes, the whole log would be interesting, plus the logs of the datanode
on the same box as the dead RS.
What are your HBase & HDFS versions?

The RS should be immune to HDFS errors. There are known issues (see
HDFS-3701), but it seems you have something different...
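For what it's worth, on 0.23/2.x clients the reaction to a dead pipeline
datanode is tunable; I don't believe these keys exist on 1.x, so treat
this as a sketch to check against whatever version you report back:

    // Sketch only -- assumes a Hadoop 0.23/2.x client where these keys
    // exist; they are not in 1.x, so verify before relying on them.
    import org.apache.hadoop.conf.Configuration;

    public class PipelinePolicy {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Ask the client to pick a replacement datanode when one in the
        // write pipeline fails, instead of continuing with fewer replicas.
        conf.setBoolean(
            "dfs.client.block.write.replace-datanode-on-failure.enable",
            true);
        // DEFAULT only replaces when it is reasonably safe to do so;
        // ALWAYS and NEVER are the other accepted values.
        conf.set(
            "dfs.client.block.write.replace-datanode-on-failure.policy",
            "DEFAULT");
        System.out.println(conf.get(
            "dfs.client.block.write.replace-datanode-on-failure.policy"));
      }
    }

The idea behind DEFAULT is to trade a little write latency for keeping
the pipeline at full strength after a node dies.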
This part of your log:
> java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949
> remote=/10.128.204.225:50010]

Seems to say that the error occurred between the RS and the datanode on
the same box?
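Also, 15000 ms is far below the usual client write timeout (the 1.x
default is 480000 ms), so my assumption is that it was lowered somewhere
in your hdfs-site.xml; worth checking. A sketch of the client-side knobs,
assuming 1.x-style key names:

    // Sketch only -- 1.x-style key names; that the 15000 ms in the log
    // comes from a lowered dfs.datanode.socket.write.timeout is an
    // assumption on my part, not something the log proves.
    import org.apache.hadoop.conf.Configuration;

    public class ClientTimeouts {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Read timeout between client and datanode (default 60000 ms).
        conf.setInt("dfs.socket.timeout", 60000);
        // Write timeout between client and datanode (default 480000 ms);
        // this is the timeout the DataStreamer hit in your stack trace.
        conf.setInt("dfs.datanode.socket.write.timeout", 480000);
        System.out.println(
            conf.getInt("dfs.datanode.socket.write.timeout", -1));
      }
    }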

Nicolas

On Mon, Jul 30, 2012 at 6:43 PM, Jay T <jay.pyl...@gmail.com> wrote:
>  A couple of our region servers (in a 16 node cluster) crashed due to
> underlying Data Node errors. I am trying to understand how errors on remote
> data nodes impact other region server processes.
>
> *To briefly describe what happened:*
> 1) Cluster was in operation. All 16 nodes were up, reads and writes were
> happening extensively.
> 2) Nodes 7 and 8 were shut down for maintenance. (No graceful shutdown:
> the DN and RS services were still running and the power was simply pulled.)
> 3) Nodes 2 and 5 flushed and the DFS client started reporting errors. From
> the log it seems that DFS blocks were being replicated to the nodes that
> were shut down (7 and 8), and since replication could not complete, the
> DFS client raised errors on 2 and 5 and eventually the RS itself died.
>
> The question I am trying to get an answer to is: is a Region Server immune
> to errors on remote data nodes (those that are part of the replication
> pipeline) or not?
> *Part of the Region Server Log:* (Node 5)
>
> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad
> connect ack with firstBadLink
> as 10.128.204.228:50010
> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> block blk_-316956372096761177_489798
> 2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
> datanode 10.128.204.228:50010
> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile:
> NO General Bloom and NO DeleteFamily was added to HFile
> (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store:
> Flushed, sequenceid=4046717645, memsize=256.5m, into tmp file
> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
> 2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store:
> Renaming flushed file at
> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
> to hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
> 2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store:
> Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124,
> entries=1137956, sequenceid=4046717645, filesize=13.2m
> 2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.net.SocketTimeoutException: 15000 millis timeout while
> waiting for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949
> remote=/10.128.204.225:50010]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_5116092240243398556_489796 bad datanode[0]
> 10.128.204.225:50010
> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_5116092240243398556_489796 in pipeline
> 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad
> datanode 10.128.204.225:50010
>
> I can pastebin the entire log, but this is when things started going wrong
> for Node 5; eventually the shutdown hook for the RS ran and the RS was
> shut down.
>
> Any help in troubleshooting this is greatly appreciated.
>
> Thanks,
> Jay
