The following additional JIRAs have been committed in CDH for 18 months or so, in support of HBase. You may want to consider backporting them as well; many were never committed to 0.20-append for lack of reviews by HDFS committers at the time.

HDFS-1056. Fix possible multinode deadlocks during block recovery when using ephemeral dataxceiv
Description: Fixes the logic by which datanodes identify local RPC targets during block recovery for the case when the datanode is configured with an ephemeral data transceiver port.
Reason: Potential internode deadlock for clusters using ephemeral ports

HADOOP-6722. Workaround a TCP spec quirk by not allowing NetUtils.connect to connect to itself
Description: TCP's ephemeral port assignment results in the possibility that a client can connect back to its own outgoing socket, resulting in failed RPCs or datanode transfers.
Reason: Fixes intermittent errors in cluster testing with ephemeral IPC/transceiver ports on datanodes.

HDFS-1122. Don't allow client verification to prematurely add inprogress blocks to DataBlockScanner
Description: When a client reads a block that is also open for writing, it should not add it to the datanode block scanner. If it does, the block scanner can incorrectly mark the block as corrupt, causing data loss.
Reason: Potential dataloss with concurrent writer-reader case.

HDFS-1248. Miscellaneous cleanup and improvements on 0.20 append branch
Description: Miscellaneous code cleanup and logging changes, including:
 - Slight cleanup to recoverFile() function in TestFileAppend4
 - Improve error messages on OP_READ_BLOCK
 - Some comment cleanup in FSNamesystem
 - Remove toInodeUnderConstruction (was not used)
 - Add some checks for null blocks in FSNamesystem to avoid a possible NPE
 - Only log "inconsistent size" warnings at WARN level for non-under-construction blocks.
 - Redundant addStoredBlock calls are also not worthy of WARN level
 - Add some extra information to a warning in ReplicationTargetChooser
Reason: Improves diagnosis of error cases and clarity of code

HDFS-1242. Add unit test for the appendFile race condition / synchronization bug fixed in HDFS-142
Reason: Test coverage for previously applied patch.

HDFS-1218. Replicas that are recovered during DN startup should not be allowed to truncate better replicas.
Description: If a datanode loses power and then recovers, its replicas may be truncated due to the recovery of the local FS journal. This patch ensures that a replica truncated by a power loss does not truncate the block on HDFS.
Reason: Potential dataloss bug uncovered by power failure simulation

HDFS-915. Write pipeline hangs for too long when ResponseProcessor hits timeout
Description: Previously, the write pipeline would hang for the entire write timeout when it encountered a read timeout (eg due to a network connectivity issue). This patch interrupts the writing thread when a read error occurs.
Reason: Faster recovery from pipeline failure for HBase and other interactive applications.

HDFS-1186. Writers should be interrupted when recovery is started, not when it's completed.
Description: When the write pipeline recovery process is initiated, this interrupts any concurrent writers to the block under recovery. This prevents a case where some edits may be lost if the writer has lost its lease but continues to write (eg due to a garbage collection pause)
Reason: Fixes a potential dataloss bug

commit a960eea40dbd6a4e87072bdf73ac3b62e772f70a
Author: Todd Lipcon <t...@lipcon.org>
Date: Sun Jun 13 23:02:38 2010 -0700

HDFS-1197. Received blocks should not be added to block map prematurely for under construction files
Description: Fixes a possible dataloss scenario when using append() on real-life clusters. Also augments unit tests to uncover similar bugs in the future by simulating latency when reporting blocks received by datanodes.
Reason: Append support dataloss bug
Author: Todd Lipcon

HDFS-1260. tryUpdateBlock should do validation before renaming meta file
Description: Solves bug where block became inaccessible in certain failure conditions (particularly network partitions). Observed under HBase workload at user site.
Reason: Potential loss of synced data when write pipeline fails

On Fri, Sep 2, 2011 at 11:20 AM, Suresh Srinivas <sur...@hortonworks.com> wrote:

> I also propose following jiras, which are non append related bug fixes from
> 0.20-append branch:
>
> - HDFS-1164. TestHdfsProxy is failing.
> - HDFS-1211. Block receiver should not log "rewind" packets at INFO
>   level.
> - HDFS-1118. Fix socketleak on DFSClient.
> - HDFS-1210. DFSClient should log exception when block recovery fails.
> - HDFS-606. Fix ConcurrentModificationException in
>   invalidateCorruptReplicas.
> - HDFS-561. Fix write pipeline READ_TIMEOUT.
> - HDFS-1202. DataBlockScanner throws NPE when updated before
>   initialized.
>
> Risk Level:
> These are useful bugfixes from append branch and are not big changes to the
> code base.
>
> These jiras have already been merged into 0.20-security branch.
>

--
Todd Lipcon
Software Engineer, Cloudera