Hi Jay,

Yes, the whole log would be interesting, plus the logs of the datanode on the same box as the dead RS. What are your HBase and HDFS versions?

The RS should be immune to HDFS errors. There are known issues (see HDFS-3701), but it seems you have something different...
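Losing a *remote* datanode mid-write is exactly the case the write path is designed to survive: the client abandons the block, excludes the bad node, and retries on a fresh pipeline, which is what the first three lines of your excerpt show. A toy sketch of that loop, just to illustrate the behavior (this is not the real DFSClient logic; the addresses are borrowed from your log):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (not DFSClient code) of pipeline recovery: a bad connect ack
// leads to "Abandoning block" + "Excluding datanode", and the client
// retries with a fresh pipeline built from the remaining datanodes.
public class PipelineRetrySketch {
    public static void main(String[] args) {
        List<String> datanodes = Arrays.asList(
                "10.128.204.225:50010", "10.128.204.228:50010",
                "10.128.204.221:50010", "10.128.204.227:50010");
        Set<String> dead = new HashSet<String>(
                Arrays.asList("10.128.204.228:50010")); // powered off, like your node 8
        Set<String> excluded = new HashSet<String>();

        while (excluded.size() < datanodes.size()) {
            // Build a 3-node pipeline, skipping excluded datanodes.
            List<String> pipeline = new ArrayList<String>();
            for (String dn : datanodes) {
                if (!excluded.contains(dn) && pipeline.size() < 3) {
                    pipeline.add(dn);
                }
            }
            String firstBadLink = null;
            for (String dn : pipeline) {
                if (dead.contains(dn)) { firstBadLink = dn; break; }
            }
            if (firstBadLink == null) {
                System.out.println("write succeeds on pipeline " + pipeline);
                return; // a purely remote failure is survived; no RS crash
            }
            System.out.println("Bad connect ack with firstBadLink " + firstBadLink
                    + " -> abandoning block, excluding datanode");
            excluded.add(firstBadLink);
        }
    }
}

So the exclusion and retry in your log are normal; whatever killed the RS has to be something else.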
This, from your log:

> java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949 remote=/10.128.204.225:50010]

seems to say that the error was between the RS and the datanode on the same box?
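The 15000 ms write timeout is suspicious as well. If I remember the 1.x code correctly, DFSClient computes its write timeout as dfs.datanode.socket.write.timeout plus a 5 second extension per datanode in the pipeline, so a 3-node pipeline hits exactly 15000 only when the base property is set to 0 (the stock default is 8 minutes). Worth checking your client-side config; a minimal sketch, assuming the usual *-site.xml files are on the classpath (the value set at the end is only an example):

import org.apache.hadoop.conf.Configuration;

// Print the client-side datanode write timeout. The property name is the
// real Hadoop 1.x one; whether it explains your 15000 ms is an assumption.
public class WriteTimeoutCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // picks up core/hdfs-site.xml
        System.out.println("dfs.datanode.socket.write.timeout = "
                + conf.get("dfs.datanode.socket.write.timeout",
                        "<unset; DFSClient falls back to 480000 ms>"));
        // Illustrative: restore something near the default instead of 0.
        conf.setInt("dfs.datanode.socket.write.timeout", 480000);
    }
}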
Nicolas

On Mon, Jul 30, 2012 at 6:43 PM, Jay T <jay.pyl...@gmail.com> wrote:
> A couple of our region servers (in a 16 node cluster) crashed due to underlying Data Node errors. I am trying to understand how errors on remote data nodes impact other region server processes.
>
> *To briefly describe what happened:*
>
> 1) Cluster was in operation. All 16 nodes were up, reads and writes were happening extensively.
> 2) Nodes 7 and 8 were shut down for maintenance. (No graceful shutdown: the DN and RS services were running and the power was just pulled.)
> 3) Nodes 2 and 5 flushed and the DFS client started reporting errors. From the log it seems like DFS blocks were being replicated to the nodes that were shut down (7 and 8), and since replication could not go through successfully the DFS client raised errors on 2 and 5 and eventually the RS itself died.
>
> The question I am trying to get an answer for is: is a Region Server immune to remote data node errors (that are part of the replication pipeline) or not?
>
> *Part of the Region Server Log* (Node 5):
>
> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.128.204.228:50010
> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-316956372096761177_489798
> 2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.128.204.228:50010
> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=4046717645, memsize=256.5m, into tmp file hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
> 2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124 to hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
> 2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124, entries=1137956, sequenceid=4046717645, filesize=13.2m
> 2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949 remote=/10.128.204.225:50010]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 bad datanode[0] 10.128.204.225:50010
> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_5116092240243398556_489796 in pipeline 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad datanode 10.128.204.225:50010
>
> I can pastebin the entire log, but this is when things started going wrong for Node 5; eventually the shutdown hook for the RS started and the RS was shut down.
>
> Any help in troubleshooting this is greatly appreciated.
>
> Thanks,
> Jay