[ https://issues.apache.org/jira/browse/HDFS-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728246#comment-14728246 ]
Zhe Zhang commented on HDFS-8383:
---------------------------------

Thanks Walter for creating the patch. Below is a list of comments, some on the overall write fault tolerance design, and others on this patch.
# {{DataStreamer#ErrorState#externalError}} looks like a key concept. [~szetszwo]: Does it mean "error from peer streamers"? We should take this chance to add a Javadoc.
# Right now when a DN (e.g. DN_0) fails, we handle the other streamers (DN_1~DN_5) as if each of them had a failed DN: we trigger {{processDatanodeError}} to close the stream and open it again with the same DN. This overhead isn't really necessary. IIUC all we want to do is to bump the {{GenerationStamp}} for internal blocks 1~5. Can we do that by sending a packet (or piggybacking on a data packet) to each DN?
# By doing the above we can also simplify the error handling logic. All we need is an {{AtomicInteger groupGS}} in {{DFSStripedOutputStream}} recording the current GS. Each failed streamer should increment {{groupGS}}. Each streamer can compare {{groupGS}} with its own current GS before sending the next packet.
# Regardless of this change, the write error handling logic is already very complex IMO. Maybe we can consider moving {{locateFollowingBlock}} to the OutputStream level so that the streamer's task is capped within a single block. For non-EC files this refactor will also facilitate HDFS-8955.

Nits on the patch:
# Is {{BlockRecoveryTrigger}} a singleton? If so, do we need the synchronization?
# {{private Integer numScheduled}} looks like it is used as a boolean?

> Tolerate multiple failures in DFSStripedOutputStream
> ----------------------------------------------------
>
>                 Key: HDFS-8383
>                 URL: https://issues.apache.org/jira/browse/HDFS-8383
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Walter Su
>         Attachments: HDFS-8383.00.patch
>
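
To make design comment 3 more concrete, here is a rough sketch of the {{groupGS}} idea: a shared {{AtomicInteger}} that failed streamers increment and that every streamer checks before sending its next packet. This is only an illustration under assumptions, not code from the patch or from Hadoop. The class names {{GroupGsSketch}} and {{StreamerSketch}}, their methods, and the starting GS value are all hypothetical, and the step that would actually tell the DN about the new GS is left as a comment.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the shared state that would live in DFSStripedOutputStream. */
final class GroupGsSketch {
    // Current generation stamp of the block group, shared by all streamers.
    private final AtomicInteger groupGS;

    GroupGsSketch(int initialGS) {
        this.groupGS = new AtomicInteger(initialGS);
    }

    /** Called by a streamer whose DN has failed; returns the new group GS. */
    int reportFailure() {
        return groupGS.incrementAndGet();
    }

    int current() {
        return groupGS.get();
    }
}

/** Hypothetical per-DN streamer; this is not the real DataStreamer. */
final class StreamerSketch {
    private final GroupGsSketch group;
    private int localGS;       // GS this streamer last used with its DN
    private int bumpsSent = 0; // number of GS updates this streamer has made

    StreamerSketch(GroupGsSketch group) {
        this.group = group;
        this.localGS = group.current();
    }

    /**
     * Compare the shared groupGS with the local GS before sending a packet.
     * Returns true if the local GS was stale and had to be bumped.
     */
    boolean beforeSendPacket() {
        int target = group.current();
        if (target == localGS) {
            return false; // nothing changed, send the packet as usual
        }
        // Assumption: here the streamer would send the new GS to its DN,
        // either in a dedicated packet or piggybacked on a data packet.
        localGS = target;
        bumpsSent++;
        return true;
    }

    int localGS() {
        return localGS;
    }

    int bumpsSent() {
        return bumpsSent;
    }
}

final class GroupGsDemo {
    public static void main(String[] args) {
        GroupGsSketch group = new GroupGsSketch(1000); // arbitrary starting GS
        StreamerSketch dn1 = new StreamerSketch(group);
        StreamerSketch dn2 = new StreamerSketch(group);

        dn1.beforeSendPacket(); // no failure yet, so no bump
        group.reportFailure();  // e.g. the streamer for DN_0 fails
        dn1.beforeSendPacket(); // dn1 notices the change and bumps its GS
        dn2.beforeSendPacket(); // dn2 does the same, independently

        System.out.println("dn1 GS = " + dn1.localGS() + ", dn2 GS = " + dn2.localGS());
    }
}
```

In this sketch a healthy streamer never closes and reopens its stream; it only catches up with {{groupGS}}. If two streamers fail before a healthy one sends its next packet, that streamer jumps straight to the latest value in a single update.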