[ https://issues.apache.org/jira/browse/HDFS-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327836#comment-15327836 ]

Kihwal Lee commented on HDFS-10490:
-----------------------------------

{{BlockReceiver}} flushes after writing each packet locally, so the reported 
issue can happen in two cases (a simplified sketch of the per-packet ordering 
follows the list):
1) a datanode in the pipeline relayed the first packet downstream, but the 
local write hung or got stuck on flush.  The client would receive one ack if 
this node is not the last node in the pipeline.  The second packet won't get 
through since this node is stuck.  If a new node is added during the recovery, 
it will try to transfer the first packet.
2) a datanode in the pipeline got stuck on sending the first packet 
downstream.  The client won't receive any ack.  No actual data will be copied 
during recovery.
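
To make the two cases concrete, here is a simplified sketch of the per-packet 
ordering. This is not the actual {{BlockReceiver}} code; the class and field 
names are made up for the illustration, and the comments mark where a hang 
produces case 1) or case 2).
{noformat}
// Illustration only -- not the actual BlockReceiver code.
import java.io.IOException;
import java.io.OutputStream;

class PacketRelaySketch {
  private final OutputStream mirrorOut;   // next datanode in the pipeline
  private final OutputStream dataOut;     // local block file
  private final OutputStream checksumOut; // local meta (checksum) file

  PacketRelaySketch(OutputStream mirrorOut, OutputStream dataOut,
                    OutputStream checksumOut) {
    this.mirrorOut = mirrorOut;
    this.dataOut = dataOut;
    this.checksumOut = checksumOut;
  }

  void receivePacket(byte[] packet, byte[] data, byte[] checksums)
      throws IOException {
    // Case 2): a hang here means the packet never leaves this node, so the
    // client gets no ack and no data needs to be copied during recovery.
    mirrorOut.write(packet);
    mirrorOut.flush();

    // Case 1): a hang here means the packet was already relayed downstream, so
    // the client can still get one ack from the rest of the pipeline, but this
    // node's on-disk replica (and buffered meta header) lags behind.
    dataOut.write(data);
    checksumOut.write(checksums);
    dataOut.flush();
    checksumOut.flush();
  }
}
{noformat}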

Also, for simple pipeline recovery without adding any node, {{stopWriter()}} 
will cause {{IOUtils.closeStream()}} to be called against the active 
{{BlockReceiver}} instance, so both the checksum and data outputs will be 
flushed and closed. However, {{transferReplicaForPipelineRecovery()}} does not 
take care of the active writer.
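
For contrast, a hedged illustration of that asymmetry. The method names and 
bodies below are stand-ins (a thread interrupt for {{stopWriter()}}, a plain 
stream copy for the transfer), not the actual datanode code.
{noformat}
// Hedged stand-ins only; not the real stopWriter()/IOUtils.closeStream()/
// transferReplicaForPipelineRecovery() implementations.
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class RecoveryPathsSketch {
  // Simple pipeline recovery (no node added): the active writer is stopped and
  // its streams are closed, which flushes any buffered bytes -- including the
  // meta header -- to disk.
  static void simpleRecovery(Thread writerThread, Closeable dataOut,
                             Closeable checksumOut) throws IOException {
    writerThread.interrupt();  // stand-in for stopWriter(); in the real code
                               // the interrupted writer closes its own streams
    checksumOut.close();       // close() implies a flush of the buffered header
    dataOut.close();
  }

  // Transfer-based recovery: the rbw replica is read straight off disk and sent
  // to the new node while the active writer keeps running, so bytes still
  // sitting in its BufferedOutputStream are invisible to the transfer.
  static void transferForRecovery(InputStream rbwOnDisk, OutputStream toNewNode)
      throws IOException {
    rbwOnDisk.transferTo(toNewNode);  // no writer stop, no flush of its buffers
  }
}
{noformat}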

If an rbw copy failed in case 1), it was not a good node to include anyway.  
Before HDFS-9106, a single transfer failure would cause a permanent failure, so 
if this was the cause, it could have survived with HDFS-9106.

If 2) was the case and the stuck node was the first node in the pipeline, the 
recovery can be tricky. As stated in the description, the connections 
downstream might still be up and the header might not have been flushed on the 
remaining "healthy" nodes. But normally, a timeout causes the connection to 
break and {{closeStream()}} to be called. I see you had to short-circuit 
{{close()}} to artificially keep the connection open in the test case.

I can think of several potential solutions to this case.
1) The approach taken by the current patch: flush the meta file after the 
header is written (a minimal sketch follows this list).
2) Revisit the design of {{transferReplicaForPipelineRecovery()}} and 
{{waitForMinLength()}}.  Make it stop the active writer if possible.
3) Since no packet has been acked, the state of the datanodes is uncertain to 
the client. Treat it like a block output stream creation failure, i.e. do 
{{abandonBlock()}} and retry with the suspected bad node excluded.
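
A minimal sketch of 1), assuming a buffered {{DataOutputStream}} over the meta 
file as in the description. The method and the header field layout (version, 
checksum type, bytes per checksum) are illustrative, not the actual 
{{BlockReceiver}}/{{BlockMetadataHeader}} code.
{noformat}
// Illustrative only: write the meta-file header and flush it right away while
// the stream stays open for the rest of the block write.
import java.io.DataOutputStream;
import java.io.IOException;

class MetaHeaderFlushSketch {
  static void writeAndFlushHeader(DataOutputStream checksumOut, short version,
                                  byte checksumType, int bytesPerChecksum)
      throws IOException {
    checksumOut.writeShort(version);
    checksumOut.writeByte(checksumType);
    checksumOut.writeInt(bytesPerChecksum);
    // The essence of option 1): flush immediately after the header so it hits
    // the meta file even if later packets hang and the stream is never closed,
    // and a concurrent transferBlock never observes a zero-length meta file
    // (which would make it fall back to DataChecksum(NULL, 512)).
    checksumOut.flush();
  }
}
{noformat}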

1) will address most cases, but 3) (a sledgehammer approach) may be the surest 
way.  2) has a bigger impact and may need to be considered in a separate jira.  
As for the patch, {{closedInTest}} doesn't seem to serve any purpose.

> Client may never recovery replica after a timeout during sending packet
> -----------------------------------------------------------------------
>
>                 Key: HDFS-10490
>                 URL: https://issues.apache.org/jira/browse/HDFS-10490
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: He Tianyi
>         Attachments: HDFS-10490.0001.patch, HDFS-10490.patch
>
>
> For a newly created replica, a meta file is created in the constructor of 
> {{BlockReceiver}} (for the {{WRITE_BLOCK}} op). Its header will be written 
> lazily (buffered in memory first by a {{BufferedOutputStream}}). 
> If subsequent packets fail to be delivered (e.g. under extreme network 
> conditions), the header may never get flushed until the stream is closed. 
> However, {{BlockReceiver}} will not call close until block receiving has 
> finished or an exception is encountered. Also, under extreme network 
> conditions, neither RST nor FIN may be delivered in time. 
> In this case, if the client initiates a {{transferBlock}} to a new datanode 
> (in {{addDatanode2ExistingPipeline}}), the existing datanode will see an 
> empty meta file if its {{BlockReceiver}} has not closed in time. 
> Then, after HDFS-3429, a default {{DataChecksum}} (NULL, 512) will be used 
> during the transfer. So when the client then tries to recover the pipeline 
> after the transfer has completed, it may encounter the following exception:
> {noformat}
> java.io.IOException: Client requested checksum DataChecksum(type=CRC32C, 
> chunkSize=4096) when appending to an existing block with different chunk 
> size: DataChecksum(type=NULL, chunkSize=512)
>         at 
> org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.createStreams(ReplicaInPipeline.java:230)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:226)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:798)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:76)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:243)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This will repeat until the datanode replacement policy is exhausted.
> Also note that, with bad luck (as in my case), 20k clients may all be doing 
> this at once. It is to some extent a DDoS attack on the NameNode (because of 
> the {{getAdditionalDataNode}} calls).
> I suggest we flush immediately after the header is written, so nobody ever 
> sees an empty meta file, which avoids the issue.


