So I restart one of the data nodes and everything continues to work just fine, even though the local datanode is no longer valid. I can also restart n-1 nodes without any problem and HBase continues to work. However, as soon as I restart the last data node, the RSs start dying. hbck and fsck both say everything is ok before I restart that last one.

fsck:
Status: HEALTHY
 Total size:    7527623612 B
 Total dirs:    2488
 Total files:    6306
 Total blocks (validated):    5687 (avg. block size 1323654 B)
 Minimally replicated blocks:    5687 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    4
 Average block replication:    2.9280815
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        4
 Number of racks:        1
FSCK ended at Mon Feb 06 13:29:09 MST 2012 in 119 milliseconds

hbck:
0 inconsistencies detected.
Status: OK

So if everything is indeed ok, it seems like I shouldn't get the following exception, which kills the region server:

12/02/06 13:28:55 WARN hdfs.DFSClient: Error while syncing
java.io.IOException: All datanodes 10.1.37.4:50010 are bad. Aborting...

I understand the logic of removing a DN from the list to keep the client from retrying a dead node over and over. However, HDFS continually heals itself, and a node that was invalid earlier may now be valid again and able to accept data; after 20 minutes it is quite likely that a previously excluded node is available. Instead, HBase / the DFSClient just gives up completely and aborts.
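
To make this concrete, here is a rough sketch of what I mean -- purely hypothetical, not the real DFSClient code, with made-up class and method names -- where an exclusion expires after a grace period so a restarted datanode gets another chance, and "All datanodes ... are bad" would only be reached if every node stayed bad past that grace period:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical exclusion list whose entries expire, so a datanode that failed
// once (e.g. during a restart) is retried later instead of being written off
// for the lifetime of the stream. Not the real DFSClient; names are made up.
class ExpiringExclusionList {
    private final Map<String, Long> excludedUntil = new HashMap<String, Long>();
    private final long graceMillis;

    ExpiringExclusionList(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Record a datanode (e.g. "10.49.129.134:50010") that just failed a write.
    void exclude(String datanode) {
        excludedUntil.put(datanode, System.currentTimeMillis() + graceMillis);
    }

    // Datanodes eligible for a new pipeline; expired exclusions get retried.
    List<String> usable(Collection<String> allDatanodes) {
        long now = System.currentTimeMillis();
        List<String> ok = new ArrayList<String>();
        for (String dn : allDatanodes) {
            Long until = excludedUntil.get(dn);
            if (until == null || until <= now) {
                excludedUntil.remove(dn);   // node may have come back; give it another chance
                ok.add(dn);
            }
        }
        // Only if this list stays empty past the grace period would the client abort.
        return ok;
    }
}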

The rule here, I think, was that you do not want RSes to switch over to writing to a remote DN just because the first one in the pipeline (always the local one) failed. Hence they're pulled down instead of a retry being attempted.

Personally I'd rather it try again than bring down my region server needlessly.

~Jeff

On 2/6/2012 1:17 PM, Harsh J wrote:
This is the normal behavior of the sync API (when the first DN in the pipeline fails, the whole op is failed); correct me if I am wrong.

The rule here, I think, was that you do not want RSes to switch over to writing to a remote DN just because the first one in the pipeline (always the local one) failed. Hence they're pulled down instead of a retry being attempted.
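
To make that concrete, here is a simplified sketch -- hypothetical code, not the actual HBase source, just modeled on the "Using syncFs -- HDFS-200" line and the "java.io.IOException: Reflection" traces quoted below -- of how a failed sync reaches the region server:

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Simplified sketch of the WAL sync path (hypothetical class, modeled on the
// SequenceFileLogWriter/HLog frames in the traces below, not the real source).
class WalSyncSketch {
    private final Object writer;   // the underlying SequenceFile.Writer instance
    private final Method syncFs;   // the HDFS-200 syncFs() method, found via reflection

    WalSyncSketch(Object writer, Method syncFs) {
        this.writer = writer;
        this.syncFs = syncFs;
    }

    // Called by the log syncer thread; an IOException here makes the region
    // server treat its WAL as unusable and abort.
    void sync() throws IOException {
        try {
            // This is where the DFSClient's "All datanodes ... are bad" surfaces.
            syncFs.invoke(writer);
        } catch (InvocationTargetException | IllegalAccessException e) {
            // Matches the "java.io.IOException: Reflection" wrapper in the log.
            throw new IOException("Reflection", e);
        }
    }
}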

On Tue, Feb 7, 2012 at 1:38 AM, Jeff Whiting <je...@qualtrics.com> wrote:
What would "hadoop fsck /" report for that type of problem if there really were no nodes with that data? The worst I've seen is: Target Replicas is 4 but found 3 replica(s).

~Jeff


On 2/6/2012 12:45 PM, Ted Yu wrote:
In your case Error Recovery wasn't successful because of:
All datanodes 10.49.29.92:50010 are bad. Aborting...

On Mon, Feb 6, 2012 at 10:28 AM, Jeff Whiting <je...@qualtrics.com> wrote:

I was increasing the storage on some of my data nodes and thus had to do a restart of the data node. I use cdh3u2 and ran "/etc/init.d/hadoop-0.20-datanode restart" (I don't think this is a cdh problem). Unfortunately doing the restart caused region servers to go offline. Is this expected behavior? It seems like it should recover just fine without giving up and dying, since there were other data nodes available. Here are the logs on the region server from when I restarted the data node to when it decided to give up. To give you a little background, I'm running a small cluster with 4 region servers and 4 data nodes.

Thanks,
~Jeff

12/02/06 18:06:03 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-4249058562504578427_18197 java.io.IOException: Bad response 1 for block blk_-4249058562504578427_18197 from datanode 10.49.129.134:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)

12/02/06 18:06:03 INFO hdfs.DFSClient: Error Recovery for block blk_-4249058562504578427_18197 waiting for responder to exit.
12/02/06 18:06:03 WARN hdfs.DFSClient: Error Recovery for block blk_-4249058562504578427_18197 bad datanode[2] 10.49.129.134:50010
12/02/06 18:06:03 WARN hdfs.DFSClient: Error Recovery for block blk_-4249058562504578427_18197 in pipeline 10.59.39.142:50010, 10.234.50.225:50010, 10.49.129.134:50010, 10.49.29.92:50010: bad datanode 10.49.129.134:50010
12/02/06 18:06:03 WARN wal.HLog: HDFS pipeline error detected. Found 3 replicas but expecting 4 replicas. Requesting close of hlog.
12/02/06 18:06:03 INFO wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
12/02/06 18:06:03 WARN regionserver.ReplicationSourceManager: Replication stopped, won't add new log
12/02/06 18:06:03 INFO wal.HLog: Roll /hbase/.logs/ip-10-59-39-142.eu-west-1.compute.internal,60020,1328142685179/ip-10-59-39-142.eu-west-1.compute.internal%3A60020.1328549504988, entries=3644, filesize=12276680. New hlog /hbase/.logs/ip-10-59-39-142.eu-west-1.compute.internal,60020,1328142685179/ip-10-59-39-142.eu-west-1.compute.internal%3A60020.1328551563518

12/02/06 18:06:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 10.49.129.134:50010
12/02/06 18:06:04 INFO hdfs.DFSClient: Abandoning block blk_6156813298944908969_18211
12/02/06 18:06:04 INFO hdfs.DFSClient: Excluding datanode 10.49.129.134:50010
12/02/06 18:06:04 WARN wal.HLog: HDFS pipeline error detected. Found 3 replicas but expecting 4 replicas. Requesting close of hlog.
12/02/06 18:07:06 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-165678744483388406_18211 java.io.IOException: Bad response 1 for block blk_-165678744483388406_18211 from datanode 10.234.50.225:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)

12/02/06 18:07:06 INFO hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18211 waiting for responder to exit.
12/02/06 18:07:06 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18211 bad datanode[2] 10.234.50.225:50010
12/02/06 18:07:06 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18211 in pipeline 10.59.39.142:50010, 10.49.29.92:50010, 10.234.50.225:50010: bad datanode 10.234.50.225:50010
12/02/06 18:09:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-165678744483388406_18214 java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
    at sun.nio.ch.IOUtil.read(IOUtil.java:210)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2634)

12/02/06 18:09:21 INFO hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18214 waiting for responder to exit.
12/02/06 18:09:21 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18214 bad datanode[0] 10.59.39.142:50010
12/02/06 18:09:21 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18214 in pipeline 10.59.39.142:50010, 10.49.29.92:50010: bad datanode 10.59.39.142:50010
12/02/06 18:09:55 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-165678744483388406_18221 java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
    at sun.nio.ch.IOUtil.read(IOUtil.java:210)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2634)

12/02/06 18:09:55 INFO hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18221 waiting for responder to exit.
12/02/06 18:09:56 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18221 bad datanode[0] 10.49.29.92:50010
12/02/06 18:09:56 WARN hdfs.DFSClient: Error while syncing
java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
12/02/06 18:09:56 WARN hdfs.DFSClient: Error while syncing
java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
12/02/06 18:09:56 FATAL wal.HLog: Could not append. Requesting close of hlog
java.io.IOException: Reflection
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:981)
    at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:958)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
    ... 2 more
Caused by: java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
12/02/06 18:09:56 ERROR wal.HLog: Error while syncing, requesting close of hlog
java.io.IOException: Reflection
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:981)
    at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:958)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
    ... 2 more
Caused by: java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...

--
Jeff Whiting
Qualtrics Senior Software Engineer
je...@qualtrics.com


