HDFS VFS Driver
We have an open source ETL tool (Kettle) that uses VFS for many input/output steps and jobs, and we would like to be able to read and write HDFS from Kettle through VFS. I haven't been able to find an existing driver, only remarks that one would be nice to have. A few weeks ago I had some time to begin writing a VFS driver for HDFS, and we (Pentaho) would like to contribute it. I believe it supports all the major file and folder operations, and I have written unit tests for all of them. The code is currently checked into an open Pentaho SVN repository under the Apache 2.0 license. There are some current limitations, such as the lack of authentication (Kerberos), which appears to be coming in 0.22.0. The driver accepts a username and password, but I just can't use them yet. Please let me know how to proceed with the contribution process. Thank you. -Mike
Re: HDFS VFS Driver
Michael, Please open a jira (new feature) and attach your patch there: http://wiki.apache.org/hadoop/HowToContribute thanks, Arun (In reply to Michael D'Amour's message of Jun 16, 2010, 8:55 AM.)
Re: HDFS VFS Driver
hi mike, it will be nice to get a high level doc on what/how it is implemented. also, you might want to compare it with fuse-dfs http://wiki.apache.org/hadoop/MountableHDFS thanks, dhruba (In reply to Michael D'Amour's message of Wed, Jun 16, 2010, 8:55 AM.)
[jira] Created: (HDFS-1213) Implement a VFS Driver for HDFS
Implement a VFS Driver for HDFS
-------------------------------

Key: HDFS-1213
URL: https://issues.apache.org/jira/browse/HDFS-1213
Project: Hadoop HDFS
Issue Type: New Feature
Components: hdfs client
Reporter: Michael D'Amour

We have an open source ETL tool (Kettle) that uses VFS for many input/output steps and jobs, and we would like to be able to read and write HDFS from Kettle through VFS. I haven't been able to find an existing driver, only remarks that one would be nice to have. A few weeks ago I had some time to begin writing a VFS driver for HDFS, and we (Pentaho) would like to contribute it. I believe it supports all the major file and folder operations, and I have written unit tests for all of them. The code is currently checked into an open Pentaho SVN repository under the Apache 2.0 license. There are some current limitations, such as the lack of authentication (Kerberos), which appears to be coming in 0.22.0. The driver accepts a username and password, but I just can't use them yet.

I will be attaching the code for the driver once the case is created. The project does not modify existing hadoop/hdfs source. Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1214) hdfs client metadata cache
hdfs client metadata cache
--------------------------

Key: HDFS-1214
URL: https://issues.apache.org/jira/browse/HDFS-1214
Project: Hadoop HDFS
Issue Type: New Feature
Components: hdfs client
Reporter: Joydeep Sen Sarma

In some applications, latency is affected by the cost of making RPC calls to the namenode to fetch metadata. The most obvious case is calls to fetch file/directory status. Applications like Hive like to make optimizations based on file size, file count, etc., and for such optimizations 'recent' status data (as opposed to the most up-to-date) is acceptable. In such cases, a cache on the DFS client that transparently caches metadata would greatly benefit applications.
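Editorial sketch: the idea in HDFS-1214 can be illustrated with a small time-bounded cache. This is only a minimal Python model, not code from the issue; `StatusCache`, `fetch`, and `ttl_seconds` are hypothetical names, and `fetch` stands in for the namenode RPC.

```python
import time
from typing import Callable, Dict, Tuple


class StatusCache:
    """Client-side cache that serves 'recent' file status within a TTL."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch            # hypothetical stand-in for the namenode RPC
        self._ttl = ttl_seconds        # how stale an answer may be
        self._clock = clock            # injectable so the sketch is testable
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get_status(self, path: str) -> dict:
        now = self._clock()
        hit = self._entries.get(path)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]              # recent enough: no RPC is made
        status = self._fetch(path)     # miss or stale entry: one RPC
        self._entries[path] = (now, status)
        return status
```

A caller that tolerates slightly stale sizes would wrap its status lookups in `get_status`; callers that need the latest state would bypass the cache.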
[jira] Resolved: (HDFS-1215) TestNodeCount infinite loops on branch-20-append
[ https://issues.apache.org/jira/browse/HDFS-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1215.
------------------------------
Assignee: Todd Lipcon
Resolution: Fixed

Dhruba committed to 20-append branch

TestNodeCount infinite loops on branch-20-append
------------------------------------------------

Key: HDFS-1215
URL: https://issues.apache.org/jira/browse/HDFS-1215
Project: Hadoop HDFS
Issue Type: Bug
Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Fix For: 0.20-append
Attachments: 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch

HDFS-409 made some minicluster changes, which got incorporated into one of the earlier 20-append patches. This breaks TestNodeCount so it infinite loops on the branch. This patch fixes it.
[jira] Resolved: (HDFS-1216) Update to JUnit 4 in branch 20 append
[ https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HDFS-1216.
-----------------------------------
Resolution: Fixed

I just committed this. Thanks Todd!

Update to JUnit 4 in branch 20 append
-------------------------------------

Key: HDFS-1216
URL: https://issues.apache.org/jira/browse/HDFS-1216
Project: Hadoop HDFS
Issue Type: Task
Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Fix For: 0.20-append
Attachments: junit-4.5.txt

A lot of the append tests are JUnit 4 style. We should upgrade in branch - JUnit 4 is entirely backward compatible.
[jira] Created: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
---------------------------------------------------------------------------------------------------------

Key: HDFS-1218
URL: https://issues.apache.org/jira/browse/HDFS-1218
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
Fix For: 0.20-append

When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted.
[jira] Resolved: (HDFS-142) In 0.20, move blocks being written into a blocksBeingWritten directory
[ https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HDFS-142.
----------------------------------
Resolution: Fixed

I have committed this. Thanks Sam, Nicolas and Todd.

In 0.20, move blocks being written into a blocksBeingWritten directory
----------------------------------------------------------------------

Key: HDFS-142
URL: https://issues.apache.org/jira/browse/HDFS-142
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Raghu Angadi
Assignee: dhruba borthakur
Priority: Blocker
Fix For: 0.20-append
Attachments: appendFile-recheck-lease.txt, appendQuestions.txt, deleteTmp.patch, deleteTmp2.patch, deleteTmp5_20.txt, deleteTmp5_20.txt, deleteTmp_0.18.patch, dont-recover-rwr-when-rbw-available.txt, handleTmp1.patch, hdfs-142-commitBlockSynchronization-unknown-datanode.txt, HDFS-142-deaddn-fix.patch, HDFS-142-finalize-fix.txt, hdfs-142-minidfs-fix-from-409.txt, HDFS-142-multiple-blocks-datanode-exception.patch, hdfs-142-recovery-reassignment-and-bbw-cleanup.txt, hdfs-142-testcases.txt, hdfs-142-testleaserecovery-fix.txt, HDFS-142_20-append2.patch, HDFS-142_20.patch, recentInvalidateSets-assertion-fix.txt, recover-rbw-v2.txt, testfileappend4-deaddn.txt, validateBlockMetaData-synchronized.txt

Before 0.18, when a Datanode restarts, it deletes files under the data-dir/tmp directory since these files are not valid anymore. But in 0.18 it incorrectly moves these files to the normal directory, making them valid blocks. One of the following would work:
- remove the tmp files during upgrade, or
- if the files under /tmp are in pre-18 format (i.e. no generation), delete them.

Currently the effect of this bug is that these files end up failing block verification and eventually get deleted, but they cause incorrect over-replication at the namenode before that.

Also, it looks like our policy regarding the treatment of files under tmp needs to be defined better. Right now there are probably one or two more bugs with it. Dhruba, please file them if you remember.
[jira] Resolved: (HDFS-1141) completeFile does not check lease ownership
[ https://issues.apache.org/jira/browse/HDFS-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HDFS-1141.
-----------------------------------
Resolution: Fixed

Pulled into hadoop-0.20-append

completeFile does not check lease ownership
-------------------------------------------

Key: HDFS-1141
URL: https://issues.apache.org/jira/browse/HDFS-1141
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker
Fix For: 0.20-append, 0.22.0
Attachments: hdfs-1141-branch20.txt, hdfs-1141.txt, hdfs-1141.txt

completeFile should check that the caller still owns the lease of the file that it's completing. This is for the 'testCompleteOtherLeaseHoldersFile' case in HDFS-1139.
[jira] Resolved: (HDFS-1207) 0.20-append: stallReplicationWork should be volatile
[ https://issues.apache.org/jira/browse/HDFS-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HDFS-1207.
-----------------------------------
Fix Version/s: 0.20-append
Resolution: Fixed

I just committed this. Thanks Todd!

0.20-append: stallReplicationWork should be volatile
----------------------------------------------------

Key: HDFS-1207
URL: https://issues.apache.org/jira/browse/HDFS-1207
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Fix For: 0.20-append
Attachments: hdfs-1207.txt

The stallReplicationWork member in FSNamesystem is accessed by multiple threads without synchronization, but isn't marked volatile. I believe this is responsible for about a 1% failure rate on TestFileAppend4.testAppendSyncChecksum* on my 8-core test boxes (looking at logs I see replication happening even though we've supposedly disabled it).
[jira] Resolved: (HDFS-1210) DFSClient should log exception when block recovery fails
[ https://issues.apache.org/jira/browse/HDFS-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HDFS-1210.
-----------------------------------
Fix Version/s: 0.20-append
Resolution: Fixed

I just committed this. Thanks Todd.

DFSClient should log exception when block recovery fails
--------------------------------------------------------

Key: HDFS-1210
URL: https://issues.apache.org/jira/browse/HDFS-1210
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs client
Affects Versions: 0.20-append, 0.20.2
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Trivial
Fix For: 0.20-append
Attachments: hdfs-1210.txt

Right now we just retry without necessarily showing the exception. It can be useful to see what the error was that prevented the recovery RPC from succeeding. (I believe this only applies in 0.20 style of block recovery)
[jira] Resolved: (HDFS-1211) 0.20 append: Block receiver should not log rewind packets at INFO level
[ https://issues.apache.org/jira/browse/HDFS-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur resolved HDFS-1211.
-----------------------------------
Resolution: Fixed

I just committed this. Thanks Todd!

0.20 append: Block receiver should not log rewind packets at INFO level
-----------------------------------------------------------------------

Key: HDFS-1211
URL: https://issues.apache.org/jira/browse/HDFS-1211
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
Fix For: 0.20-append
Attachments: hdfs-1211.txt

In the 0.20 append implementation, it logs an INFO level message for every packet that rewinds the end of the block file. This is really noisy for applications like HBase which sync every edit.
[jira] Created: (HDFS-1219) Data Loss due to edits log truncation
Data Loss due to edits log truncation
-------------------------------------

Key: HDFS-1219
URL: https://issues.apache.org/jira/browse/HDFS-1219
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20.2
Reporter: Thanh Do

We found this problem at almost the same time as the HDFS developers. Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage. Hence, any crash that happens after the truncation but before the renaming will lead to data loss. A detailed description can be found here: https://issues.apache.org/jira/browse/HDFS-955

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Created: (HDFS-1221) NameNode unable to start due to stale edits log after a crash
NameNode unable to start due to stale edits log after a crash
-------------------------------------------------------------

Key: HDFS-1221
URL: https://issues.apache.org/jira/browse/HDFS-1221
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: If a crash happens during FSEditLog.createEditLogFile(), the edits log file on disk may be stale. During the next reboot, the NameNode will get an exception when parsing the edits file because of the stale data, leading to an unsuccessful reboot. Note: this is just one example. Since the edits log (and fsimage) has no checksum, it is vulnerable to corruption too.

- Details: The steps to create a new edits log (which we infer from the HDFS code) are:
1) truncate the file to zero size
2) write FSConstants.LAYOUT_VERSION to the buffer
3) insert the end-of-file marker OP_INVALID at the end of the buffer
4) preallocate 1MB of data, and fill the data with 0
5) flush the buffer to disk

Note that only in steps 1, 4, and 5 is the data on disk actually changed. Now, suppose a crash happens after step 4 but before step 5. On the next reboot, the NameNode will fetch this edits log file (which contains all 0). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK, because the NameNode has code to handle that case (but we expect LAYOUT_VERSION to be -18, don't we). Now it parses the operation code, which happens to be 0. Unfortunately, since 0 is the value for OP_ADD, the NameNode expects some parameters corresponding to that operation. Now the NameNode calls readString to read the path, which throws an exception, leading to a failed reboot.

We found this problem at almost the same time as the HDFS developers. Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage. Hence, any crash that happens after the truncation but before the renaming will lead to data loss. A detailed description can be found here: https://issues.apache.org/jira/browse/HDFS-955

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
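Editorial sketch: the parsing problem described in HDFS-1221 can be modeled in a few lines of Python. This is an illustration under stated assumptions, not the NameNode's code: the on-disk layout (a 4-byte big-endian layout version followed by 1-byte opcodes) and the value -1 for OP_INVALID are assumptions; the values -18 and OP_ADD = 0 come from the report.

```python
import struct

LAYOUT_VERSION = -18          # expected value, per the report
OP_ADD = 0                    # per the report
OP_INVALID = -1               # assumed value of the end-of-file marker
PREALLOC_BYTES = 1024 * 1024  # step 4: 1MB of zero fill


def parse_header(data: bytes):
    """Return (layout_version, first_opcode) read from the start of the file."""
    (version,) = struct.unpack_from(">i", data, 0)  # assumed 4-byte int
    (opcode,) = struct.unpack_from(">b", data, 4)   # assumed 1-byte opcode
    return version, opcode


# File contents after step 4 but before step 5: only the zero fill is on disk.
crashed = bytes(PREALLOC_BYTES)

# File contents after a completed step 5: version, end marker, zero fill.
healthy = struct.pack(">ib", LAYOUT_VERSION, OP_INVALID) + bytes(PREALLOC_BYTES)

crashed_version, crashed_opcode = parse_header(crashed)
healthy_version, healthy_opcode = parse_header(healthy)
```

In the crashed file the first opcode reads as `OP_ADD`, so a parser would go on to read a path that was never written, which matches the failure in the report.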
[jira] Created: (HDFS-1225) Block lost when primary crashes in recoverBlock
Block lost when primary crashes in recoverBlock
-----------------------------------------------

Key: HDFS-1225
URL: https://issues.apache.org/jira/browse/HDFS-1225
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: The block is lost if the primary datanode crashes in the middle of tryUpdateBlock.

- Setup:
# available datanode = 2
# replica = 2
# disks / datanode = 1
# failures = 1
# failure type = crash
When/where failure happens = (see below)

- Details: Suppose we have 2 datanodes, dn1 and dn2, and dn1 is primary. The client appends to blk_X_1001, and a crash happens during dn1.recoverBlock, at the point after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.

Interestingly, in this case the block X is eventually lost. Why? After dn1.recoverBlock crashes at the rename, what is left in dn1's current directory is:
1) blk_X
2) blk_X_1001.meta_tmp1002
=> this is an invalid block, because it has no meta file associated with it.

dn2 (after the dn1 crash) now contains:
1) blk_X
2) blk_X_1002.meta
(note that the rename at dn2 is completed, because dn1 called dn2.updateBlock() before calling its own updateBlock())

But namenode.commitBlockSynchronization is never reported to the namenode, because dn1 has crashed. Therefore, from the namenode's point of view, the block X has GS 1001. Hence, the block is lost.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Created: (HDFS-1227) UpdateBlock fails due to unmatched file length
UpdateBlock fails due to unmatched file length
----------------------------------------------

Key: HDFS-1227
URL: https://issues.apache.org/jira/browse/HDFS-1227
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: Client append is not atomic. Hence, when retrying during append, it is possible to get an exception in updateBlock indicating an unmatched file length, which makes the append fail.

- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 2
+ failure type = bad disk
+ When/where failure happens = (see below)
+ This bug is non-deterministic. To reproduce it, add a sufficient sleep before out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3

- Details: Suppose the client appends 16 bytes to block X, which has a length of 16 bytes at dn1, dn2, and dn3. Dn1 is primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds. The client starts sending data to dn3, the first datanode in the pipeline. dn3 forwards the packet to the downstream datanodes and starts writing data to its disk. Suppose there is an exception in dn3 when writing to disk. The client gets the exception and starts the recovery code by calling dn1.recoverBlock() again. dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build the syncList.

Suppose that at the time getMetadataInfo() is called at both datanodes (dn1 and dn2), the previous packet (which was sent from dn3) has not reached the disk yet. Hence, the block info given by getMetaDataInfo contains a length of 16 bytes. But after that, the packet reaches the disk, so the block file length becomes 32 bytes. Using the syncList (which contains block info with a length of 16 bytes), dn1 calls updateBlock at dn2 and dn1, which will fail, because the length in the new block info (given to updateBlock, 16 bytes) does not match the actual length on disk (32 bytes).

Note that this bug is non-deterministic. It depends on the thread interleaving at the datanodes.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
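Editorial sketch: the interleaving in HDFS-1227 can be reproduced in a single-threaded Python model. The class and function names below are hypothetical and only mirror the roles in the report (a length snapshot taken for the syncList, a delayed packet reaching disk, then a length check in updateBlock); real datanode code is more involved.

```python
class Replica:
    """One datanode's copy of block X (lengths in bytes)."""

    def __init__(self, length: int) -> None:
        self.on_disk_length = length

    def get_metadata_info(self) -> int:
        # Reports whatever has reached the disk at the moment of the call.
        return self.on_disk_length

    def flush_packet(self, size: int) -> None:
        # The delayed packet forwarded by dn3 finally reaches the disk.
        self.on_disk_length += size

    def update_block(self, expected_length: int) -> None:
        # Recovery insists that the recorded length matches the disk.
        if expected_length != self.on_disk_length:
            raise IOError(
                f"length mismatch: expected {expected_length}, "
                f"found {self.on_disk_length}")


def recover(packet_lands_during_recovery: bool) -> str:
    replica = Replica(16)
    sync_length = replica.get_metadata_info()  # syncList snapshot: 16 bytes
    if packet_lands_during_recovery:
        replica.flush_packet(16)               # disk now holds 32 bytes
    try:
        replica.update_block(sync_length)
    except IOError as err:
        return str(err)
    return "ok"
```

`recover(False)` models the lucky interleaving and `recover(True)` the failing one, which is why the report calls the bug non-deterministic.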
[jira] Created: (HDFS-1228) CRC does not match when retrying appending a partial block
CRC does not match when retrying appending a partial block
----------------------------------------------------------

Key: HDFS-1228
URL: https://issues.apache.org/jira/browse/HDFS-1228
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: When appending to a partial block, it is possible that the retry after an exception fails due to a checksum mismatch. The append operation is not atomic (it neither completes nor fails completely).

- Setup:
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = bad disk
+ When/where failure happens = (see below)

- Details: The client writes 16 bytes to dn1 and dn2. The write completes. So far so good. The meta file now contains: 7 bytes header + 4 byte checksum (CK1 - checksum for 16 bytes). The client then appends 16 more bytes, and let us assume there is an exception at BlockReceiver.receivePacket() at dn2. So the client knows dn2 is bad. BUT the append at dn1 is complete (i.e. the data portion and checksum portion have made it to disk in the corresponding block file and meta file), meaning that the checksum file at dn1 now contains 7 bytes header + 4 byte checksum (CK2 - the checksum for 32 bytes of data). Because dn2 has an exception, the client calls recoverBlock and starts the append again to dn1. dn1 receives 16 bytes of data and verifies whether the pre-computed CRC (CK2) matches what we recalculate just now (CK1), which obviously does not match. Hence an exception, and the retry fails.

- A similar bug has been reported at https://issues.apache.org/jira/browse/HDFS-679 but here it manifests in a different context.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
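Editorial sketch: the CK1/CK2 mismatch in HDFS-1228 can be shown with a plain CRC-32. This Python model uses `zlib.crc32` as a stand-in for the datanode's checksum and hypothetical payloads; the function name `partial_chunk_matches` is illustrative and not an HDFS API.

```python
import zlib


def partial_chunk_matches(stored_crc: int, chunk: bytes) -> bool:
    """Check a stored checksum against a freshly computed one."""
    return zlib.crc32(chunk) == stored_crc


first_write = b"A" * 16   # hypothetical 16-byte initial write
append = b"B" * 16        # hypothetical 16-byte append

ck1 = zlib.crc32(first_write)            # checksum for 16 bytes
ck2 = zlib.crc32(first_write + append)   # checksum for 32 bytes

# Before the failed append, the meta file holds CK1 and the check passes.
check_before_append = partial_chunk_matches(ck1, first_write)

# After the failed append, dn1's meta file already holds CK2, but the retry
# still reasons about the original 16 bytes, so CK1 is recomputed.
check_on_retry = partial_chunk_matches(ck2, first_write)
```

The retry check fails because the stored checksum already covers the appended bytes, which is the non-atomicity the report describes.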
[jira] Created: (HDFS-1230) BlocksMap.blockinfo is not getting cleared immediately after deleting a block.This will be cleared only after block report comes from the datanode.Why we need to maintain t
BlocksMap.blockinfo is not getting cleared immediately after deleting a block. This will be cleared only after block report comes from the datanode. Why we need to maintain the blockinfo till that time.

Key: HDFS-1230
URL: https://issues.apache.org/jira/browse/HDFS-1230
Project: Hadoop HDFS
Issue Type: Improvement
Components: name-node
Affects Versions: 0.20.1
Reporter: Gokul

BlocksMap.blockinfo is not cleared immediately after a block is deleted. It is cleared only after the block report comes from the datanode. Why do we need to maintain the blockinfo until that time? It increases namenode memory usage unnecessarily.
[jira] Created: (HDFS-1231) Generation Stamp mismatches, leading to failed append
Generation Stamp mismatches, leading to failed append
-----------------------------------------------------

Key: HDFS-1231
URL: https://issues.apache.org/jira/browse/HDFS-1231
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: recoverBlock is not atomic, so the retry fails when a failure occurs.

- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 2
+ failure type = crash
+ When/where failure happens = (see below)

- Details: Suppose there are 3 datanodes in the pipeline: dn3, dn2, and dn1. Dn1 is primary. When appending, the client first calls dn1.recoverBlock to make all the datanodes in the pipeline agree on the new Generation Stamp (GS1) and the length of the block. The client then sends a data packet to dn3. dn3 in turn forwards this packet to the downstream dns (dn2 and dn1) and starts writing to its own disk, then it crashes AFTER writing to the block file but BEFORE writing to the meta file. The client notices the crash and calls dn1.recoverBlock(). dn1.recoverBlock() first creates a syncList (by calling getMetadataInfo at dn2 and dn1). Then dn1 calls NameNode.getNextGS() to get a new Generation Stamp (GS2). Then it calls dn2.updateBlock(), which returns successfully. Now it starts calling its own updateBlock and crashes after renaming blk_X_GS1.meta to blk_X_GS1.meta_tmpGS2. Therefore, from the client's point of view, dn1.recoverBlock() fails, but the GS for the corresponding block has been incremented in the namenode (GS2). The client retries by calling dn2.recoverBlock with the old GS (GS1), which does not match the new GS at the NameNode (GS1) --exception, leading to a failed append.

Now, after all, we have:
- in dn3 (which is crashed): tmp/blk_X tmp/blk_X_GS1.meta
- in dn2: current/blk_X current/blk_X_GS2
- in dn1: current/blk_X current/blk_X_GS1.meta_tmpGS2
- in NameNode, the block X has generation stamp GS1 (because dn1 has not called commitSyncronization yet).

Therefore, when the crashed datanodes restart, at dn1 the block is invalid because there is no meta file. In dn3, the block file and meta file are finalized; however, the block is corrupted because of a CRC mismatch. In dn2, the GS of the block is GS2, which is not equal to the generation stamp of the block maintained in the NameNode. Hence, the block blk_X is inaccessible.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
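Editorial sketch: the end state described in HDFS-1231 can be tabulated with a small Python classifier. It is only a model of the reasoning in the report; `classify_replica` and the numeric stamps 1001/1002 are hypothetical, and the order of the checks is an assumption.

```python
GS1, GS2 = 1001, 1002  # hypothetical numeric generation stamps


def classify_replica(has_meta_file: bool, crc_matches: bool,
                     replica_gs: int, namenode_gs: int) -> str:
    """Return why a replica is unusable, or 'usable'."""
    if not has_meta_file:
        return "invalid: no meta file"
    if not crc_matches:
        return "corrupt: CRC mismatch"
    if replica_gs != namenode_gs:
        return "rejected: generation stamp mismatch"
    return "usable"


namenode_gs = GS1  # commit was never reported, so the namenode still has GS1

replicas = {
    # dn1 crashed mid-rename: only blk_X_GS1.meta_tmpGS2 remains.
    "dn1": classify_replica(False, True, GS1, namenode_gs),
    # dn2 completed updateBlock and now carries GS2.
    "dn2": classify_replica(True, True, GS2, namenode_gs),
    # dn3 wrote data but not the checksum before crashing.
    "dn3": classify_replica(True, False, GS1, namenode_gs),
}

block_accessible = any(state == "usable" for state in replicas.values())
```

Each replica fails for a different reason, so no single datanode can serve blk_X, which is the report's conclusion.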
[jira] Created: (HDFS-1233) Bad retry logic at DFSClient
Bad retry logic at DFSClient
----------------------------

Key: HDFS-1233
URL: https://issues.apache.org/jira/browse/HDFS-1233
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: failover bug, bad retry logic at DFSClient, cannot fail over to the 2nd disk

- Setups:
+ # available datanodes = 1
+ # disks / datanode = 2
+ # failures = 1
+ failure type = bad disk
+ When/where failure happens = (see below)

- Details: The setup is: 1 datanode, 1 replica, and each datanode has 2 disks (Disk1 and Disk2). We injected a single disk failure to see whether we can fail over to the second disk or not.

If a persistent disk failure happens during createBlockOutputStream (the first phase of the pipeline creation) (e.g. say DN1-Disk1 is bad), then createBlockOutputStream (cbos) will get an exception and it will retry! When it retries it will get the same DN1 from the namenode, and then DN1 will call DN.writeBlock(), FSVolume.createTmpFile, and finally getNextVolume(), which returns a moving volume number. Thus, on the second try, the write will successfully go to the second disk. So essentially createBlockOutputStream is wrapped in a do/while(retry && --count >= 0). The first cbos will fail; the second will be successful in this particular scenario.

NOW, say cbos is successful, but the failure is persistent. Then the retry is in a different while loop. First, hasError is set to true in RP.run (responder packet). Thus, DataStreamer.run() will go back to the loop: while(!closed && clientRunning && !lastPacketInBlock). Now this second iteration of the loop will call processDatanodeError because hasError has been set to true. In processDatanodeError (pde), the client sees that this is the only datanode in the pipeline, and hence it considers the node bad, although actually only 1 disk is bad! Hence, pde throws an IOException suggesting that all the datanodes (in this case, only DN1) in the pipeline are bad. Hence, in this error, the exception is thrown to the client. But if the exception were, say, caught by the outermost loop, do-while(retry && --count >= 0), then this outer retry would be successful (as suggested in the previous paragraph).

In summary, if in a deployment scenario we have only one datanode that has multiple disks, and one disk goes bad, then the current retry logic at the DFSClient side is not robust enough to mask the failure from the client.

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
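Editorial sketch: the first half of HDFS-1233 (a retry succeeding because the datanode rotates volumes) can be modeled as below. This Python code is an assumption-laden illustration: `Datanode`, `get_next_volume`, and `attempt_write` are hypothetical names, and plain round-robin selection is assumed for the "moving volume number".

```python
from typing import List, Optional


class Datanode:
    """A datanode that picks volumes round-robin (assumed policy)."""

    def __init__(self, disks_ok: List[bool]) -> None:
        self.disks_ok = disks_ok
        self._next = 0

    def get_next_volume(self) -> int:
        volume = self._next
        self._next = (self._next + 1) % len(self.disks_ok)
        return volume

    def write_block(self) -> int:
        volume = self.get_next_volume()
        if not self.disks_ok[volume]:
            raise IOError(f"disk {volume} is bad")
        return volume


def attempt_write(disks_ok: List[bool], max_attempts: int) -> Optional[int]:
    """Retry against the same datanode; return the disk used, or None."""
    datanode = Datanode(disks_ok)
    for _ in range(max_attempts):
        try:
            return datanode.write_block()
        except IOError:
            continue  # same node again, but the next call rotates volumes
    return None
```

With one bad disk out of two, a single attempt fails while a retried attempt lands on the healthy disk. The second retry path in the report gives up on the whole node instead, so it never gets this chance.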
[jira] Created: (HDFS-1232) Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
---------------------------------------------------------------------------------------------

Key: HDFS-1232
URL: https://issues.apache.org/jira/browse/HDFS-1232
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: The block is corrupted if a crash happens before writing to checksumOut but after writing to dataOut.

- Setup:
+ # available datanodes = 1
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = (see below)

- Details: The order of processing a packet during a client write/append at the datanode is: first forward the packet downstream, then write the data to the block file, and finally write to the checksum file. Hence, if a crash happens BEFORE the write to the checksum file but AFTER the write to the data file, the block is corrupted. Worse, if this is the only available replica, the block is lost. We also found this problem in the case where there are 3 replicas for a particular block and, during append, there are two failures (see HDFS-1231).

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
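Editorial sketch: the write ordering in HDFS-1232 can be modeled with one checksum per replica. This Python code simplifies heavily (a single whole-file CRC-32 via `zlib.crc32`, with the forwarding step omitted) and all names are hypothetical.

```python
import zlib


class Replica:
    """Block file contents plus the checksum recorded in the meta file."""

    def __init__(self) -> None:
        self.block_file = b""
        self.stored_crc = zlib.crc32(b"")


def receive_packet(replica: Replica, payload: bytes,
                   crash_after_data: bool = False) -> None:
    # Forwarding the packet downstream is omitted in this model.
    replica.block_file += payload              # write data first
    if crash_after_data:
        return                                 # simulated crash: no checksum
    replica.stored_crc = zlib.crc32(replica.block_file)  # then the checksum


def is_corrupt(replica: Replica) -> bool:
    return zlib.crc32(replica.block_file) != replica.stored_crc
```

A replica that crashes between the two writes carries data its checksum does not cover, so verification fails after restart.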
[jira] Created: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive
Datanode 'alive' but with its disk failed, Namenode thinks it's alive
---------------------------------------------------------------------

Key: HDFS-1234
URL: https://issues.apache.org/jira/browse/HDFS-1234
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: Datanode 'alive' but with its disk failed, Namenode still thinks it's alive

- Setups:
+ Replication = 1
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ Failure type = bad disk
+ When/where failure happens = first phase of the pipeline

- Details: In this experiment we have two datanodes. Each node has 1 disk. However, if one datanode has a failed disk (but the node is still alive), the datanode does not keep track of this. From the perspective of the namenode, that datanode is still alive, and thus the namenode gives back the same datanode to the client. The client will retry 3 times by asking the namenode to give a new set of datanodes, and always get the same datanode. And every time the client wants to write there, it gets an exception.
[jira] Created: (HDFS-1235) Namenode returning the same Datanode to client, due to infrequent heartbeat
Namenode returning the same Datanode to client, due to infrequent heartbeat
Key: HDFS-1235
URL: https://issues.apache.org/jira/browse/HDFS-1235
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Reporter: Thanh Do

This bug has been reported. Basically, since the datanode's heartbeat messages are infrequent (~ every 10 minutes), the NameNode always gives the client the same datanode even if that datanode is dead. We want to point out that the client waits 6 seconds before retrying, which could be considered a long and useless retry in this scenario, because within 6 seconds the namenode has not yet declared the datanode dead.

Overall, this happens when a datanode dies during the first phase of the pipeline (file setup). If a datanode dies during the second phase (byte transfer), the DFSClient can still proceed with the surviving datanodes (which is consistent with what Hadoop books always say: the write should proceed as long as at least one good datanode remains). Unfortunately, this specification does not hold during the first phase of the pipeline.

We suggest that the namenode take into consideration the client's view of unreachable datanodes. That is, if a client says that it cannot reach DN-X, then the namenode might give the client a node other than DN-X (but the namenode does not have to declare DN-X dead).

This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
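The suggestion in HDFS-1235, that the namenode take the client's view of unreachable datanodes into account, could look roughly like the sketch below. It is an illustration only, not the namenode's block placement code: ExcludeAwareChooser, chooseTarget, and writeWithRetries are hypothetical names, datanodes are plain strings, and "writing" to a node is modeled as a membership check against a set of nodes that are actually dead.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class ExcludeAwareChooser {

    /**
     * Namenode side (hypothetical): returns the first node the namenode still
     * believes alive that the client has not reported as unreachable. The
     * namenode does not need to declare an excluded node dead.
     */
    public static Optional<String> chooseTarget(List<String> believedAlive, Set<String> excludedByClient) {
        for (String node : believedAlive) {
            if (!excludedByClient.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }

    /**
     * Client side (hypothetical): on each failed attempt, remembers the
     * unreachable node and passes the exclusion set on the next request.
     */
    public static Optional<String> writeWithRetries(List<String> believedAlive,
                                                    Set<String> actuallyDead,
                                                    int maxRetries) {
        Set<String> excluded = new HashSet<>();
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            Optional<String> target = chooseTarget(believedAlive, excluded);
            if (!target.isPresent()) {
                return Optional.empty(); // nothing left to try
            }
            if (!actuallyDead.contains(target.get())) {
                return target; // write succeeds on a reachable node
            }
            excluded.add(target.get()); // report this node as unreachable
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> believedAlive = Arrays.asList("DN-X", "DN-Y");
        Set<String> dead = Collections.singleton("DN-X");
        System.out.println(writeWithRetries(believedAlive, dead, 3)); // prints "Optional[DN-Y]"
    }
}
```

Without the exclusion set, chooseTarget would return DN-X on every attempt, which is the behavior the report describes.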
[jira] Created: (HDFS-1236) Client uselessly retries recoverBlock 5 times
Client uselessly retries recoverBlock 5 times
Key: HDFS-1236
URL: https://issues.apache.org/jira/browse/HDFS-1236
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Thanh Do

Summary: Client uselessly retries recoverBlock 5 times. The same behavior is also seen in the append protocol (HDFS-1229).

The setup:
# available datanodes = 4
Replication factor = 2 (hence there are 2 datanodes in the pipeline)
Failure type = Bad disk at datanode (not crashes)
# failures = 2
# disks / datanode = 1

Where/when the failures happen: In this scenario, the disks of both datanodes in the pipeline go bad at the same time during the 2nd phase of the pipeline (the data transfer phase).

Details: In this case, the client calls processDatanodeError, which calls datanode.recoverBlock on those two datanodes. But since these two datanodes have bad disks (although they are still alive), recoverBlock() fails. The client's retry logic ends when the streamer is closed (close == true). Before this happens, the client retries 5 times (maxRecoveryErrorCount) and fails every time. What is interesting is that during each retry there is a wait of 1 second in DataStreamer.run (i.e. dataQueue.wait(1000)), so the total wait should be 5 seconds before the client declares failure.

This is a different bug than HDFS-1235, where the client retries 3 times for 6 seconds (resulting in 25 seconds wait time). In this experiment, the total wait time we observe is only 12 seconds (not sure why it is 12). The DFSClient then quits without contacting the namenode again (say, to ask for a new set of two datanodes). So, interestingly, we find another bug showing that the client retry logic is complex and not deterministic, depending on where and when failures happen.
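The expected 5-second figure in HDFS-1236 follows from a bounded retry loop with a fixed wait per failed attempt. The sketch below reproduces only that arithmetic; it is not the DFSClient code. RecoveryRetrySketch and totalWaitMillis are hypothetical names, the recoverBlock call is modeled as a BooleanSupplier, and the wait is accumulated rather than actually slept. It does not explain the 12 seconds observed in the experiment.

```java
import java.util.function.BooleanSupplier;

public class RecoveryRetrySketch {

    /**
     * Retries recoverBlock up to maxRecoveryErrorCount times, adding
     * waitPerRetryMillis after each failure (standing in for the
     * dataQueue.wait(1000) mentioned in the report). Returns the total
     * time spent waiting before success or before giving up.
     */
    public static long totalWaitMillis(int maxRecoveryErrorCount,
                                       long waitPerRetryMillis,
                                       BooleanSupplier recoverBlock) {
        long waited = 0;
        int errors = 0;
        while (errors < maxRecoveryErrorCount) {
            if (recoverBlock.getAsBoolean()) {
                return waited; // recovery succeeded, stop retrying
            }
            errors++;
            waited += waitPerRetryMillis;
        }
        return waited; // every attempt failed
    }

    public static void main(String[] args) {
        // Both datanodes have bad disks, so recoverBlock always fails.
        long waited = totalWaitMillis(5, 1000L, () -> false);
        System.out.println(waited + " ms"); // prints "5000 ms"
    }
}
```

Since every attempt targets the same two datanodes with bad disks, none of the retries can succeed, which is why the report calls them useless.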