[ https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741388#comment-14741388 ]
Jing Zhao commented on HDFS-9040:
---------------------------------

Thanks for the patch, Walter! I think this looks much clearer than the current implementation. Some thoughts and comments:

# In general I think it's the right direction to push all the coordination logic into one place and let all the other streamers simply transfer data.
# Currently the new block allocation step and the failure handling steps can still be interleaved. To me this makes correctness too hard to guarantee. For example, we need to handle the scenario where some data streamer has not yet fetched the new block when the coordinator starts handling a failure. The current patch tries to handle this by checking the corresponding following block queue. But since a data streamer can be in a state where it has fetched the new block but has not yet assigned new values to its nodes/storageTypes, we may still have a race condition. Thus I agree with Nicholas's comment [here|https://issues.apache.org/jira/browse/HDFS-8383?focusedCommentId=14737962&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737962], i.e., we need to add some "barriers" to sync all the data streamers so as to simplify the problem.
# More specifically, my current proposal for failure handling looks like this. The coordinator side:
#* Check periodically whether there is any failure. If we use DFSStripedOutputStream as the coordinator, we can easily do this in {{writeChunk}}, e.g., check for failures whenever we've received one stripe of data.
#* If there is a new failure, first wait until all the healthy streamers have fetched the new block and are in the DATA_STREAMING stage.
#* Mark all the healthy streamers as being in the external error state.
#* Call updateBlockForPipeline and get the new GS.
#* Wait until all the healthy streamers have fetched the new block from the queue and created new block streams.
#* If a new failure happens while the new block streams are being created, notify all the remaining streamers of the failure and keep them in the external error state. Repeat the above steps.
#* Otherwise reset all the external error states and make the updatePipeline RPC call. Then notify all the streamers that this failure handling session has succeeded.
# The DataStreamer side:
#* When it finds itself in the external error state, wait and take the new block from the blocking queue.
#* Create a new datanode connection using the new block.
#* Notify the coordinator of the result of the new datanode connection creation.
#* If the connection creation succeeded, wait for the coordinator to report the overall result.
#* If all the involved streamers succeeded, update its block based on the new GS.
#* Otherwise repeat the steps.
#* And instead of overriding updateBlockForPipeline and updatePipeline, it may be easier to implement the above logic by overriding {{setupPipelineForAppendOrRecovery}}.
# Obviously the above proposal may still have some holes. But the direction here is to make sure there is no overlap between different error handling efforts and the new block allocation. Please see if this makes sense to you.
# Also I think it is easier to implement the above logic in StripedOutputStream: 1) it's easier to determine when to start block allocation and the failure check, 2) it's easier to handle exceptions during the NN RPCs since we do not need to pass the exception from a separate coordinator thread. But we can discuss this further, and please let me know if I missed something.

Currently I have an in-progress patch implementing the above proposal. I will try to get it into better shape and post it as a demo soon.
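To make the handshake in the proposal above concrete, here is a rough standalone simulation of one failure handling session. It is only a sketch and not code from any patch: {{FailureHandlingSketch}}, {{Streamer}}, {{handleFailure}}, {{reconnect}} and {{brokenPerRound}} are hypothetical names, the GS increment stands in for the updateBlockForPipeline call, and the commit loop stands in for the updatePipeline RPC. The assumption is that the coordinator makes no decision until every healthy streamer has reported its reconnect result.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch only: every class and method name here is hypothetical.
class FailureHandlingSketch {

  /** Stand-in for one streamer in a block group (not the real DataStreamer). */
  static class Streamer {
    final int id;
    long generationStamp;
    long pendingGs;
    boolean externalError;
    final BlockingQueue<Long> newBlocks = new LinkedBlockingQueue<>();

    Streamer(int id, long gs) { this.id = id; this.generationStamp = gs; }

    /** Streamer side: wait for the new block, then try to reconnect. */
    boolean reconnect(Set<Integer> brokenIds) throws InterruptedException {
      pendingGs = newBlocks.take();    // blocks until the coordinator publishes
      return !brokenIds.contains(id);  // simulated datanode connection result
    }
  }

  /**
   * Coordinator side. brokenPerRound.get(r) holds the ids whose reconnect
   * fails in round r (simulation input). Returns the streamers that finished
   * a clean round; they all carry the same new GS.
   */
  static List<Streamer> handleFailure(List<Streamer> healthy, long currentGs,
      List<Set<Integer>> brokenPerRound) {
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      long gs = currentGs;
      for (int round = 0; !healthy.isEmpty(); round++) {
        Set<Integer> broken = round < brokenPerRound.size()
            ? brokenPerRound.get(round) : Collections.<Integer>emptySet();
        gs++;  // stands in for updateBlockForPipeline returning a new GS
        List<Future<Boolean>> results = new ArrayList<>();
        for (Streamer s : healthy) {
          s.externalError = true;  // mark, then publish the new block
          s.newBlocks.add(gs);
          results.add(pool.submit(() -> s.reconnect(broken)));
        }
        // Barrier: no decision is made until every streamer has reported.
        List<Streamer> survivors = new ArrayList<>();
        for (int i = 0; i < healthy.size(); i++) {
          if (results.get(i).get()) survivors.add(healthy.get(i));
        }
        if (survivors.size() == healthy.size()) {
          for (Streamer s : healthy) {  // stands in for the updatePipeline RPC
            s.generationStamp = s.pendingGs;
            s.externalError = false;
          }
          return healthy;
        }
        healthy = survivors;  // new failure: stay in error state and repeat
      }
      return healthy;  // no healthy streamer left
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

Calling {{handleFailure}} with four streamers and a single failing reconnect in the first round runs two rounds: the failed streamer is dropped while still in the external error state, and the remaining three commit the GS of the second round together. Because a streamer only applies its pending GS after a round in which everyone succeeded, a new failure can never interleave with a half-finished update.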
> Erasure coding: A BlockGroupDataStreamer to rule all internal blocks streamers
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-9040
>                 URL: https://issues.apache.org/jira/browse/HDFS-9040
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Walter Su
>            Assignee: Walter Su
>         Attachments: HDFS-9040.00.patch
>
>
> A {{BlockGroupDataStreamer}} to communicate with NN to allocate/update block,
> and {{StripedDataStreamer}} s only have to stream blocks to DNs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)