[ https://issues.apache.org/jira/browse/HDFS-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kihwal Lee updated HDFS-10178:
------------------------------
    Attachment: HDFS-10178.patch

The following is from {{BlockSender}}, added by HDFS-6934.
{code:java}
// The meta file will contain only the header if the NULL checksum
// type was used, or if the replica was written to transient storage.
// Checksum verification is not performed for replicas on transient
// storage. The header is important for determining the checksum
// type later when lazy persistence copies the block to non-transient
// storage and computes the checksum.
if (metaIn.getLength() > BlockMetadataHeader.getHeaderSize()) {
{code}
The code in {{BlockSender}} makes a wrong assumption. If I simply change {{>}} to {{>=}}, my test passes, but some of the lazy persist test cases fail. So I added another argument to the constructor.

[~cnauroth], can you take a look at my patch? I am not familiar with the lazy persist feature. There might be a better way.

> Permanent write failures can happen if pipeline recoveries occur for the first packet
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-10178
>                 URL: https://issues.apache.org/jira/browse/HDFS-10178
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-10178.patch
>
>
> We have observed that a write fails permanently if the first packet doesn't go
> through properly and pipeline recovery happens. If the packet header is sent
> out, but the data portion of the packet does not reach one or more datanodes
> in time, the pipeline recovery will be done against the 0-byte partial block.
>
> If additional datanodes are added, the block is transferred to the new nodes.
> After the transfer, each node will have a meta file containing the header
> and a 0-length data block file. The pipeline recovery seems to work correctly
> up to this point, but the write fails when the actual data packet is resent.
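
To make the boundary case in the quoted {{BlockSender}} check concrete, here is a minimal standalone sketch. It is not the Hadoop code: the class {{MetaHeaderCheck}}, both method names, and the 7-byte {{HEADER_SIZE}} are assumptions for illustration only (the real value comes from {{BlockMetadataHeader.getHeaderSize()}}). It models only the length comparison, not the rest of the checksum handling or the attached patch.

```java
// Hypothetical stand-in for the length check quoted from BlockSender.
// Nothing here is Hadoop API; names and sizes are illustrative assumptions.
class MetaHeaderCheck {
    // Assumed meta file header size in bytes, for illustration only.
    static final long HEADER_SIZE = 7;

    // Mirrors the quoted condition: enter the branch only when the meta
    // file is strictly longer than the header.
    static boolean entersBranchStrict(long metaLength) {
        return metaLength > HEADER_SIZE;
    }

    // The variant tried in the comment above: also enter the branch for a
    // header-only meta file.
    static boolean entersBranchInclusive(long metaLength) {
        return metaLength >= HEADER_SIZE;
    }

    public static void main(String[] args) {
        // Header-only meta file, as left behind for a 0-byte partial block
        // after the transfer described in the issue.
        long headerOnly = HEADER_SIZE;
        // Meta file that holds some checksum bytes after the header.
        long withChecksums = HEADER_SIZE + 4;

        System.out.println("strict, header only:       " + entersBranchStrict(headerOnly));
        System.out.println("inclusive, header only:    " + entersBranchInclusive(headerOnly));
        System.out.println("strict, with checksums:    " + entersBranchStrict(withChecksums));
        System.out.println("inclusive, with checksums: " + entersBranchInclusive(withChecksums));
    }
}
```

The two variants differ only for a meta file whose length equals the header size. That is the state of the 0-byte partial block in the issue, and per the code comment it is also the state of replicas written with the NULL checksum type or to transient storage. A length comparison alone cannot tell these cases apart, which fits the lazy persist test failures seen with {{>=}}.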
-- This message was sent by Atlassian JIRA (v6.3.4#6332)