[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740321#comment-14740321 ]
Zhe Zhang commented on HDFS-8704:
---------------------------------

Thanks for updating the patch, Bo. My main concern is still the nested {{run()}} structure in {{StripedDataStreamer}}.
{code}
   @Override
+  public void run() {
+
+    while (!toTerminate && !streamerClosed &&
+        dfsClient.clientRunning && !errorState.hasError()) {
+      super.run();
{code}

[~walter.k.su] is exploring the idea of a group streamer in HDFS-9040, and [~jingzhao] is trying to move {{locateFollowBlock}} to the {{DFSOutputStream}} level. If either of the two directions works, the role of a streamer will be limited to transferring a single internal block, which will solve this problem. So I suggest we keep this JIRA open and wait for a conclusion on these 2 efforts.

> Erasure Coding: client fails to write large file when one datanode fails
> ------------------------------------------------------------------------
>
>                 Key: HDFS-8704
>                 URL: https://issues.apache.org/jira/browse/HDFS-8704
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Li Bo
>            Assignee: Li Bo
>         Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch,
> HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch,
> HDFS-8704-HDFS-7285-005.patch, HDFS-8704-HDFS-7285-006.patch,
> HDFS-8704-HDFS-7285-007.patch, HDFS-8704-HDFS-7285-008.patch
>
>
> I tested the current code on a 5-node cluster using RS(3,2). When a datanode
> is corrupt, the client succeeds in writing a file smaller than a block group
> but fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}}
> only tests files smaller than a block group; this JIRA will add more test
> cases.
> A streamer may encounter bad datanodes when writing the blocks allocated to
> it. When it fails to connect to a datanode or to send a packet, the streamer
> needs to prepare for the next block. First it removes the packets of the
> current block from its data queue. If the first packet of the next block is
> already in the data queue, the streamer resets its state and starts waiting
> for the next block to be allocated to it; otherwise it just waits for the
> first packet of the next block. While waiting, the streamer periodically
> checks whether it has been asked to terminate.
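
For readers following the issue description, the recovery steps it lists (drop the failed block's packets, decide what to wait for, poll for termination while waiting) could be sketched roughly as below. This is only an illustration under stated assumptions, not code from any HDFS-8704 patch: {{StreamerRecoverySketch}}, {{Packet}}, {{NextStep}}, {{dropPacketsOfBlock}}, {{nextStep}} and {{waitUntil}} are all hypothetical names, and a plain {{ArrayDeque}} stands in for the streamer's real data queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BooleanSupplier;

/** Illustrative sketch only; none of these names come from the HDFS-8704 patch. */
class StreamerRecoverySketch {

  /** Hypothetical packet: the block it belongs to and its position within that block. */
  static final class Packet {
    final int blockIndex;
    final int seqInBlock;

    Packet(int blockIndex, int seqInBlock) {
      this.blockIndex = blockIndex;
      this.seqInBlock = seqInBlock;
    }
  }

  /** What the streamer waits for after discarding the failed block. */
  enum NextStep { WAIT_FOR_BLOCK_ALLOCATION, WAIT_FOR_FIRST_PACKET }

  /** Step 1: remove every queued packet of the failed block; returns how many were dropped. */
  static int dropPacketsOfBlock(Deque<Packet> dataQueue, int failedBlock) {
    int before = dataQueue.size();
    dataQueue.removeIf(p -> p.blockIndex == failedBlock);
    return before - dataQueue.size();
  }

  /** Step 2: if the next block's first packet is already queued, wait for its allocation. */
  static NextStep nextStep(Deque<Packet> dataQueue, int failedBlock) {
    Packet head = dataQueue.peekFirst();
    boolean nextBlockStarted = head != null
        && head.blockIndex == failedBlock + 1
        && head.seqInBlock == 0;
    return nextBlockStarted
        ? NextStep.WAIT_FOR_BLOCK_ALLOCATION
        : NextStep.WAIT_FOR_FIRST_PACKET;
  }

  /**
   * Step 3: poll until the awaited condition holds, checking the terminate flag
   * on every round. Returns true if the condition was met, false if the wait
   * was abandoned because of termination or interruption.
   */
  static boolean waitUntil(BooleanSupplier ready, BooleanSupplier toTerminate, long pollMillis) {
    while (!ready.getAsBoolean()) {
      if (toTerminate.getAsBoolean()) {
        return false;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Deque<Packet> queue = new ArrayDeque<>();
    queue.add(new Packet(0, 7)); // leftover packet of the failed block 0
    queue.add(new Packet(1, 0)); // first packet of the next block
    int dropped = dropPacketsOfBlock(queue, 0);
    System.out.println("dropped=" + dropped + ", next=" + nextStep(queue, 0));
  }
}
```

Keeping the decision in {{nextStep}} separate from the polling in {{waitUntil}} is a choice made for this sketch so that each step can be exercised on its own; how the actual streamer interleaves them is not stated in the description.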