[ 
https://issues.apache.org/jira/browse/HDFS-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728246#comment-14728246
 ] 

Zhe Zhang commented on HDFS-8383:
---------------------------------

Thanks Walter for creating the patch. Below is a list of comments, some on the 
overall write fault tolerance design, and others on this patch.

# {{DataStreamer#ErrorState#externalError}} looks like a key concept. [~szetszwo]: 
Does it mean "error from peer streamers"? We should take this chance to add a 
Javadoc.
# Right now, when a DN (e.g. DN_0) fails, we handle the other streams 
(DN_1~DN_5) as if each of them had a failed DN. We trigger 
{{processDatanodeError}} to close the stream and reopen it with the same DN. 
This overhead isn't really necessary. IIUC all we want to do is bump the 
{{GenerationStamp}} for internal blocks 1~5. Can we do that by sending a packet 
(or piggybacking on a data packet) to the DN?
# By doing the above we can also simplify the error handling logic. All we need 
is an {{AtomicInteger groupGS}} in {{DFSStripedOutputStream}} recording the 
current GS. Each failed streamer should increment {{groupGS}}. Each streamer 
can compare {{groupGS}} with its current GS before sending the next packet.
# Regardless of this change, the write error handling logic is already very 
complex IMO. Maybe we can consider moving {{locateFollowingBlock}} to the 
OutputStream level so that the streamer's task is capped within a single block. 
For non-EC files this refactoring will also facilitate HDFS-8955.
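
To make the {{groupGS}} suggestion above concrete, here is a minimal sketch, not code from the patch. It assumes a shared {{AtomicInteger groupGS}} as proposed; the class {{GroupGsSketch}}, the nested {{Streamer}} class, and the methods {{onFailure}} and {{syncBeforeSend}} are hypothetical names used only for illustration and do not correspond to the real {{DataStreamer}} API. The actual packet that would carry the new GS to the DN is left out.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class GroupGsSketch {

    /** Hypothetical stand-in for one streamer of a striped block group. */
    static final class Streamer {
        private final AtomicInteger groupGS; // shared by all streamers in the group
        private int currentGS;               // GS this streamer last used with its DN

        Streamer(AtomicInteger groupGS) {
            this.groupGS = groupGS;
            this.currentGS = groupGS.get();
        }

        /** Called when this streamer's DN fails: advance the group-wide GS. */
        void onFailure() {
            groupGS.incrementAndGet();
        }

        /**
         * Called before sending the next packet. Returns true if the group GS
         * moved ahead, i.e. this streamer would need to bump the GS on its DN
         * (for example by piggybacking on the next data packet).
         */
        boolean syncBeforeSend() {
            int latest = groupGS.get();
            if (latest == currentGS) {
                return false;
            }
            currentGS = latest;
            return true;
        }

        int currentGS() {
            return currentGS;
        }
    }

    public static void main(String[] args) {
        AtomicInteger groupGS = new AtomicInteger(1000); // arbitrary starting GS
        Streamer s0 = new Streamer(groupGS);
        Streamer s1 = new Streamer(groupGS);

        s0.onFailure(); // DN_0 fails; healthy streamers are not restarted
        System.out.println("s1 needs GS bump: " + s1.syncBeforeSend());
        System.out.println("s1 current GS: " + s1.currentGS());
    }
}
```

The point of the sketch is that a healthy streamer only compares two integers per packet and never closes or reopens its stream. Whether an {{int}} is wide enough for the real generation stamp is a separate question that the sketch does not address.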

Nits on the patch:
# Is {{BlockRecoveryTrigger}} a singleton? If so, do we need the synchronization?
# {{private Integer numScheduled}} looks like it is used as a boolean. Should it 
be declared as one?

> Tolerate multiple failures in DFSStripedOutputStream
> ----------------------------------------------------
>
>                 Key: HDFS-8383
>                 URL: https://issues.apache.org/jira/browse/HDFS-8383
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Walter Su
>         Attachments: HDFS-8383.00.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
