[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails

Zhe Zhang (JIRA) Fri, 11 Sep 2015 00:13:53 -0700

    [ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740321#comment-14740321 ]


Zhe Zhang commented on HDFS-8704:
---------------------------------

Thanks for updating the patch, Bo. My main concern is still the nested {{run()}} 
structure in {{StripedDataStreamer}}:
{code}
   @Override
+  public void run() {
+
+    while (!toTerminate && !streamerClosed &&
+        dfsClient.clientRunning && !errorState.hasError()) {
+      super.run();
{code}

[~walter.k.su] is exploring the idea of a group streamer in HDFS-9040, and 
[~jingzhao] is trying to move {{locateFollowBlock}} to the {{DFSOutputStream}} 
level. If either direction works, the role of a streamer will be limited to 
transferring a single internal block, which will solve this problem. So I 
suggest we keep this JIRA open and wait for a conclusion on these two efforts. 

> Erasure Coding: client fails to write large file when one datanode fails
> ------------------------------------------------------------------------
>
>                 Key: HDFS-8704
>                 URL: https://issues.apache.org/jira/browse/HDFS-8704
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Li Bo
>            Assignee: Li Bo
>         Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, 
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, 
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch, 
> HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a datanode 
> is corrupt, the client succeeds in writing a file smaller than a block group 
> but fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only 
> tests files smaller than a block group; this JIRA will add more test cases.
> A streamer may encounter bad datanodes when writing the blocks allocated to 
> it. When it fails to connect to a datanode or to send a packet, the streamer 
> needs to prepare for the next block. First, it removes the packets of the 
> current block from its data queue. If the first packet of the next block is 
> already in the data queue, the streamer resets its state and starts waiting 
> for the next block to be allocated to it; otherwise it just waits for the 
> first packet of the next block. While waiting, the streamer periodically 
> checks whether it has been asked to terminate.
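
The recovery steps in the description above can be sketched roughly as follows. This is a simplified, hypothetical model and not the code from any HDFS-8704 patch: the class {{StreamerRecoverySketch}}, the {{Packet}} type, the {{NextStep}} enum, and all method names are assumptions made for illustration, and "reset its state" is reduced to a counter.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BooleanSupplier;

/** Hypothetical, simplified model of the recovery steps; not the real HDFS classes. */
class StreamerRecoverySketch {
    /** Illustrative packet: tagged with its block index and whether it opens the block. */
    static final class Packet {
        final int blockIndex;
        final boolean firstInBlock;
        Packet(int blockIndex, boolean firstInBlock) {
            this.blockIndex = blockIndex;
            this.firstInBlock = firstInBlock;
        }
    }

    /** What the streamer waits for after dropping the failed block's packets. */
    enum NextStep { WAIT_FOR_NEXT_BLOCK_ALLOCATION, WAIT_FOR_FIRST_PACKET }

    private final Deque<Packet> dataQueue = new ArrayDeque<>();
    private volatile boolean toTerminate = false;
    private int stateResets = 0; // stand-in for "reset its state"

    void enqueue(Packet p) { dataQueue.addLast(p); }
    void terminate() { toTerminate = true; }
    int queuedPackets() { return dataQueue.size(); }
    int stateResets() { return stateResets; }

    /** Called after a connect or send failure while writing block {@code failedBlock}. */
    NextStep onDatanodeFailure(int failedBlock) {
        // Step 1: remove the packets of the current (failed) block from the data queue.
        dataQueue.removeIf(p -> p.blockIndex == failedBlock);

        // Step 2: if the first packet of the next block is already queued,
        // reset state and wait for the next block allocation.
        Packet head = dataQueue.peekFirst();
        if (head != null && head.blockIndex == failedBlock + 1 && head.firstInBlock) {
            stateResets++;
            return NextStep.WAIT_FOR_NEXT_BLOCK_ALLOCATION;
        }
        // Otherwise just wait for the first packet of the next block to arrive.
        return NextStep.WAIT_FOR_FIRST_PACKET;
    }

    /**
     * Polls {@code ready}, checking between polls whether termination was requested.
     * Returns true once {@code ready} holds, false if terminated or interrupted first.
     */
    boolean waitUntil(BooleanSupplier ready, long pollMillis) {
        while (!ready.getAsBoolean()) {
            if (toTerminate) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

In this sketch the streamer would call {{onDatanodeFailure}} once per failed block and then {{waitUntil}} with a condition matching the returned step, so a termination request is noticed within one polling interval.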



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
