[jira] Created: (HDFS-1239) All datanodes are bad in 2nd phase
All datanodes are bad in 2nd phase
----------------------------------

Key: HDFS-1239
URL: https://issues.apache.org/jira/browse/HDFS-1239
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Setup:
+ # datanodes = 2
+ replication factor = 2
+ failure type = transient fault (a Java I/O call throws an exception or returns false)
+ # failures = 2
+ when/where failures happen = during the 2nd phase of the pipeline; one at each datanode while performing I/O (e.g. DataOutputStream.flush())

- Details:
This is similar to HDFS-1237. Here, node1 throws an exception, which makes the client create a pipeline containing only node2 and redo the whole transfer; that retry then hits the second failure. At this point the client considers all datanodes bad and never retries from the beginning (i.e. it never goes back to the namenode to ask for a new set of datanodes). In HDFS-1237 the bug is triggered by a permanent disk fault; here it is triggered by transient errors.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
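The give-up behavior described above can be sketched as follows. This is a minimal, hypothetical simulation (not DFSClient's actual code): each node fails only transiently, yet once every node in the current pipeline has been blacklisted, the client stops instead of asking the namenode for a fresh pipeline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the reported retry behavior: transient faults get
// nodes blacklisted permanently, and an empty pipeline means giving up.
public class PipelineRetrySketch {

    // A datanode that fails transiently on its first write only.
    static class FlakyNode {
        private boolean failedOnce = false;
        void write() throws Exception {
            if (!failedOnce) {
                failedOnce = true;
                throw new Exception("transient I/O error");
            }
        }
    }

    // Returns true if the write eventually succeeds.
    static boolean writeWithPipelineRetry(List<FlakyNode> pipeline) {
        List<FlakyNode> alive = new ArrayList<>(pipeline);
        while (!alive.isEmpty()) {
            FlakyNode target = alive.get(0);
            try {
                target.write();
                return true;              // write went through
            } catch (Exception e) {
                alive.remove(target);     // node declared bad forever,
            }                             // even for a transient fault
        }
        // BUG (as reported): no fallback to the namenode here.
        return false;
    }

    public static void main(String[] args) {
        List<FlakyNode> pipeline = new ArrayList<>();
        pipeline.add(new FlakyNode());
        pipeline.add(new FlakyNode());
        // Both nodes fail once; because both get blacklisted, the write
        // fails even though a retry on either node would have succeeded.
        System.out.println(writeWithPipelineRetry(pipeline));
    }
}
```

A second attempt on either node would succeed, which is what makes the permanent blacklisting of transiently failing nodes wasteful.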
[jira] Created: (HDFS-1238) A block is stuck in ongoingRecovery due to exception not propagated
A block is stuck in ongoingRecovery due to an exception not being propagated
----------------------------------------------------------------------------

Key: HDFS-1238
URL: https://issues.apache.org/jira/browse/HDFS-1238
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Setup:
+ # datanodes = 2
+ replication factor = 2
+ failure type = transient (i.e. a Java I/O call that throws an IOException or returns false)
+ # failures = 2
+ when/where failures happen: (this is a subtle bug) the first failure is a transient failure at a datanode during the second phase. Because of it, the DFSClient calls recoverBlock. The second failure is injected during that block recovery (i.e. another failure during the recovery process itself).

- Details:
The expectation is that, since the DFSClient performs lots of retries, two transient failures should be masked properly by the retries. We found one case where the failures are not transparent to the user. Here are the stack traces of when/where the two failures happen (please ignore the line numbers).

1. The first failure: an exception is thrown at call(void java.io.DataOutputStream.flush())
SourceLoc: org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java(252)
Stack Trace:
[0] datanode.BlockReceiver (flush:252)
[1] datanode.BlockReceiver (receivePacket:660)
[2] datanode.BlockReceiver (receiveBlock:743)
[3] datanode.DataXceiver (writeBlock:468)
[4] datanode.DataXceiver (run:119)

2. The second failure: false is returned at call(boolean java.io.File.renameTo(File))
SourceLoc: org/apache/hadoop/hdfs/server/datanode/FSDataset.java(105)
Stack Trace:
[0] datanode.FSDataset (tryUpdateBlock:1008)
[1] datanode.FSDataset (updateBlock:859)
[2] datanode.DataNode (updateBlock:1780)
[3] datanode.DataNode (syncBlock:2032)
[4] datanode.DataNode (recoverBlock:1962)
[5] datanode.DataNode (recoverBlock:2101)

This is what we found: the first failure causes the DFSClient to call recoverBlock, which then runs into the second failure.
The second failure makes renameTo return false, which causes an IOException to be thrown from the function that calls renameTo. But this IOException is not propagated properly! It is dropped inside DN.syncBlock(): syncBlock calls DN.updateBlock(), which gets the exception, but syncBlock only catches it and prints a warning without rethrowing. Thus syncBlock returns without any exception, and recoverBlock returns without executing the finally{} block (see below).

Now the client retries recoverBlock 3-5 more times, but every retry sees an exception! The reason is that the first call to recoverBlock(blk) put blk into the ongoingRecovery list inside DN.recoverBlock(). Normally blk is only removed (ongoingRecovery.remove(blk)) inside the finally{} block; since the exception is not propagated properly, that finally{} block is never executed, so blk is stuck forever in the ongoingRecovery list. The next time the client retries, it gets the error message "Block ... is already being recovered" and recoverBlock() throws an IOException. As a result, the client, which drives this whole process from processDatanodeError, returns from that function with closed = true, never retries the whole thing from the beginning, and simply returns an error.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)
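The swallowed-exception pattern above can be modeled in a few lines. This is a simplified, hypothetical sketch (the class and method names mirror the report, but the real DataNode code is more involved): because syncBlock() only logs the failure, recoverBlock() looks successful, the cleanup path that would remove the block from ongoingRecovery is never reached, and every retry is rejected.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the HDFS-1238 failure mode: a swallowed exception
// leaves a block permanently registered as "being recovered".
public class StuckRecoverySketch {
    private final Set<String> ongoingRecovery = new HashSet<>();

    void updateBlock() throws IOException {
        // Simulates File.renameTo() returning false -> IOException.
        throw new IOException("rename failed");
    }

    void syncBlock() {
        try {
            updateBlock();
        } catch (IOException e) {
            // BUG: only a warning; the caller never learns recovery failed.
            System.err.println("WARN: " + e.getMessage());
        }
    }

    void recoverBlock(String blk) throws IOException {
        synchronized (ongoingRecovery) {
            if (ongoingRecovery.contains(blk)) {
                throw new IOException(
                    "Block " + blk + " is already being recovered");
            }
            ongoingRecovery.add(blk);
        }
        syncBlock();   // returns "successfully" even though rename failed
        // The cleanup that removes blk from ongoingRecovery lives on a path
        // only reached when syncBlock signals failure (per the report, a
        // finally{} block that is bypassed here); blk stays stuck forever.
    }

    public static void main(String[] args) {
        StuckRecoverySketch dn = new StuckRecoverySketch();
        try {
            dn.recoverBlock("blk_1");  // first attempt: failure swallowed
            dn.recoverBlock("blk_1");  // retry: rejected forever after
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The fix direction implied by the report is simply to rethrow (or return an error from) syncBlock so that recoverBlock's cleanup actually runs.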
[jira] Created: (HDFS-1237) Client logic for 1st phase and 2nd phase failover are different
Client logic for 1st phase and 2nd phase failover are different
---------------------------------------------------------------

Key: HDFS-1237
URL: https://issues.apache.org/jira/browse/HDFS-1237
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Setup:
+ # datanodes = 4
+ replication factor = 2 (2 datanodes in the pipeline)
+ # failures injected = 2
+ failure type = crash
+ where/when failures happen: there are two scenarios. First, two datanodes crash at the same time during the first phase of the pipeline. Second, two datanodes crash during the second phase.

- Details:
In this setting we set the datanode's heartbeat interval to the namenode to 1 second. This is just to show that once the NN has declared a datanode dead, the DFSClient will not get that dead datanode back from the server. Our observations:

1. If the two crashes happen during the first phase, the client waits for 6 seconds (enough time for the NN to detect the dead datanodes in this setting). After waiting, the client asks the NN again, the NN hands back two fresh healthy datanodes, and the experiment succeeds!

2. BUT if the two crashes happen during the second phase (e.g. during renameTo), the client *never waits for 6 seconds*, which implies that the client logic for the 1st and 2nd phases is different. Here the DFSClient gives up and (we believe) never falls back to the outer while loop to contact the NN again. So the two crashes in the second phase are not masked properly, and the write operation fails.

In summary, scenario (1) succeeds but scenario (2) does not, which shows bad retry logic during the second phase. (Note again that we changed the setup slightly by setting the DN heartbeat interval to 1 second. With the default interval, scenario (1) would fail too, because the NN would give the client the same dead datanodes.)
This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)
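The asymmetry between the two phases can be sketched side by side. This is a hypothetical illustration (the interface and method names are invented, not the real DFSClient/NameNode API): the first-phase path backs off and then asks the namenode for a fresh pipeline, while the second-phase path never contacts the namenode again.

```java
// Hypothetical sketch contrasting the two failover paths in the report:
// phase 1 waits then re-asks the namenode; phase 2 just gives up.
public class PhaseRetrySketch {
    interface Namenode { String[] chooseTargets(); }

    // Phase 1 (pipeline setup): wait, then ask the NN for a fresh pipeline.
    static String[] retryFirstPhase(Namenode nn, long backoffMillis)
            throws InterruptedException {
        Thread.sleep(backoffMillis);   // ~6 s in the reported setting
        return nn.chooseTargets();     // NN has now detected the dead nodes
    }

    // Phase 2 (data transfer): no wait, no fresh pipeline -- just fail.
    static String[] retrySecondPhase(Namenode nn) {
        return null;                   // BUG: the NN is never contacted again
    }

    public static void main(String[] args) throws InterruptedException {
        Namenode nn = () -> new String[] {"dn3", "dn4"};
        System.out.println(retryFirstPhase(nn, 10).length);  // fresh pipeline
        System.out.println(retrySecondPhase(nn));            // gives up
    }
}
```

A symmetric design would route the second-phase failure back through the same "wait, then re-ask the namenode" path.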
[jira] Updated: (HDFS-1236) Client uselessly retries recoverBlock 5 times
[ https://issues.apache.org/jira/browse/HDFS-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thanh Do updated HDFS-1236:
---------------------------
Description:

Summary: the client uselessly retries recoverBlock 5 times. The same behavior is also seen in the append protocol (HDFS-1229).

- Setup:
+ # available datanodes = 4
+ replication factor = 2 (hence 2 datanodes in the pipeline)
+ failure type = bad disk at a datanode (not a crash)
+ # failures = 2
+ # disks / datanode = 1
+ where/when the failures happen: the disks of the two datanodes in the pipeline go bad at the same time during the 2nd phase of the pipeline (the data transfer phase).

- Details:
In this case the client calls processDatanodeError, which calls datanode.recoverBlock on those two datanodes. But since the two datanodes have bad disks (although they are still alive), recoverBlock() fails. The client's retry logic ends when the streamer is closed (closed == true); before that happens, the client retries 5 times (maxRecoveryErrorCount), failing every time. What is interesting is that each retry waits 1 second in DataStreamer.run (i.e. dataQueue.wait(1000)), so there is a 5-second total wait before the client declares failure. This is a different bug from HDFS-1235, where the client retries 3 times with 6-second waits (resulting in about 25 seconds of wait time); in this experiment the total wait time we measured is only 12 seconds (we are not sure why it is 12). The DFSClient then quits without contacting the namenode again (say, to ask for a new set of two datanodes). So, interestingly, we find another bug showing that the client retry logic is complex and nondeterministic, depending on where and when failures happen.
This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Updated: (HDFS-1236) Client uselessly retries recoverBlock 5 times
[ https://issues.apache.org/jira/browse/HDFS-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thanh Do updated HDFS-1236:
---------------------------
Component/s: hdfs client
[jira] Created: (HDFS-1236) Client uselessly retries recoverBlock 5 times
Client uselessly retries recoverBlock 5 times
---------------------------------------------

Key: HDFS-1236
URL: https://issues.apache.org/jira/browse/HDFS-1236
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Thanh Do

Summary: the client uselessly retries recoverBlock 5 times. The same behavior is also seen in the append protocol (HDFS-1229).

- Setup:
+ # available datanodes = 4
+ replication factor = 2 (hence 2 datanodes in the pipeline)
+ failure type = bad disk at a datanode (not a crash)
+ # failures = 2
+ # disks / datanode = 1
+ where/when the failures happen: the disks of the two datanodes in the pipeline go bad at the same time during the 2nd phase of the pipeline (the data transfer phase).

- Details:
In this case the client calls processDatanodeError, which calls datanode.recoverBlock on those two datanodes. But since the two datanodes have bad disks (although they are still alive), recoverBlock() fails. The client's retry logic ends when the streamer is closed (closed == true); before that happens, the client retries 5 times (maxRecoveryErrorCount), failing every time. What is interesting is that each retry waits 1 second in DataStreamer.run (i.e. dataQueue.wait(1000)), so there is a 5-second total wait before the client declares failure. This is a different bug from HDFS-1235, where the client retries 3 times with 6-second waits (resulting in about 25 seconds of wait time); in this experiment the total wait time we measured is only 12 seconds (we are not sure why it is 12). The DFSClient then quits without contacting the namenode again (say, to ask for a new set of two datanodes). So, interestingly, we find another bug showing that the client retry logic is complex and nondeterministic, depending on where and when failures happen.
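The bounded retry described above can be sketched as follows. This is a hypothetical model (not DataStreamer's real code; the constant name follows the report): up to maxRecoveryErrorCount attempts, each preceded by a wait standing in for dataQueue.wait(1000), with no fallback to the namenode when the count runs out.

```java
// Hypothetical sketch of the HDFS-1236 retry loop: 5 attempts, each with a
// fixed wait, against datanodes whose bad disks make recovery always fail.
public class RecoverBlockRetrySketch {
    static final int MAX_RECOVERY_ERROR_COUNT = 5;

    interface Recovery { boolean recoverBlock(); }

    // Returns the number of failed attempts before success or giving up.
    static int runRecovery(Recovery r, long waitMillis)
            throws InterruptedException {
        int errors = 0;
        while (errors < MAX_RECOVERY_ERROR_COUNT) {
            Thread.sleep(waitMillis);  // stands in for dataQueue.wait(1000)
            if (r.recoverBlock()) {
                return errors;         // recovery finally succeeded
            }
            errors++;                  // both disks are bad: always fails
        }
        return errors;                 // gives up; namenode never re-asked
    }

    public static void main(String[] args) throws InterruptedException {
        // Both pipeline datanodes have bad disks, so recovery never succeeds;
        // the client burns all 5 attempts (a 1 ms wait keeps the demo fast).
        System.out.println(runRecovery(() -> false, 1));
    }
}
```

Since the same two bad-disk datanodes are contacted on every attempt, all five retries are wasted by construction, which is the "useless" part of the report.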
[jira] Updated: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive
[ https://issues.apache.org/jira/browse/HDFS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thanh Do updated HDFS-1234:
---------------------------
Description:

- Summary: a datanode is 'alive' but its disk has failed; the namenode still thinks the datanode is alive.

- Setup:
+ replication = 1
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = bad disk
+ when/where failure happens = first phase of the pipeline

- Details:
In this experiment we have two datanodes, each with 1 disk. If one datanode has a failed disk (but the node is still alive), the datanode does not keep track of this. From the namenode's perspective that datanode is still alive, so the namenode gives the same datanode back to the client. The client retries 3 times by asking the namenode for a new set of datanodes, always gets the same datanode, and every time it tries to write there it gets an exception.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)
[jira] Created: (HDFS-1235) Namenode returning the same Datanode to client, due to infrequent heartbeat
Namenode returning the same Datanode to client, due to infrequent heartbeat
---------------------------------------------------------------------------

Key: HDFS-1235
URL: https://issues.apache.org/jira/browse/HDFS-1235
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Reporter: Thanh Do

This bug has been reported before. Basically, since a datanode's heartbeat messages are infrequent (roughly every 10 minutes), the namenode keeps giving the client the same datanode even if that datanode is dead. We want to point out that the client waits 6 seconds before retrying, which amounts to long and useless retries in this scenario, because within 6 seconds the namenode has not yet declared the datanode dead.

Overall, this happens when a datanode dies during the first phase of the pipeline (file setup). If a datanode dies during the second phase (byte transfer), the DFSClient can still proceed with the other surviving datanodes (which is consistent with what the Hadoop books always say: the write should proceed as long as at least one datanode is good). Unfortunately, that guarantee does not hold during the first phase of the pipeline.

Overall, we suggest that the namenode take the client's view of unreachable datanodes into consideration. That is, if a client reports that it cannot reach DN-X, the namenode could give the client a node other than X (without the namenode having to declare X dead).

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)
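The suggestion above amounts to an exclude-list parameter on block-target selection. This is a hypothetical sketch of that idea (the class and method names are invented, not the real NameNode API): the chooser skips nodes the client has reported as unreachable, without marking those nodes dead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of exclude-aware target selection: the namenode honors
// a client-supplied set of unreachable nodes without declaring them dead.
public class ExcludeAwareChooser {
    private final List<String> liveNodes;  // nodes the NN believes are alive

    ExcludeAwareChooser(List<String> liveNodes) {
        this.liveNodes = liveNodes;
    }

    // Pick up to 'count' targets, skipping client-reported unreachable nodes.
    List<String> chooseTargets(int count, Set<String> clientExcluded) {
        List<String> chosen = new ArrayList<>();
        for (String dn : liveNodes) {
            if (chosen.size() == count) break;
            if (!clientExcluded.contains(dn)) chosen.add(dn);
        }
        return chosen;  // may be shorter than count if too few nodes remain
    }

    public static void main(String[] args) {
        ExcludeAwareChooser nn =
            new ExcludeAwareChooser(List.of("dn1", "dn2", "dn3"));
        // The client could not reach dn1, so the namenode hands back other
        // nodes, even though dn1 is still considered alive.
        System.out.println(nn.chooseTargets(2, Set.of("dn1")));
    }
}
```

The key property is that the exclusion is per-request: dn1 stays in the namenode's live set and can be handed to other clients, so no heartbeat timeout needs to elapse.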
[jira] Created: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive
Datanode 'alive' but with its disk failed, Namenode thinks it's alive
---------------------------------------------------------------------

Key: HDFS-1234
URL: https://issues.apache.org/jira/browse/HDFS-1234
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: a datanode is 'alive' but its disk has failed; the namenode still thinks the datanode is alive.

- Setup:
+ replication = 1
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = bad disk
+ when/where failure happens = first phase of the pipeline

- Details:
In this experiment we have two datanodes, each with 1 disk. If one datanode has a failed disk (but the node is still alive), the datanode does not keep track of this. From the namenode's perspective that datanode is still alive, so the namenode gives the same datanode back to the client. The client retries 3 times by asking the namenode for a new set of datanodes, always gets the same datanode, and every time it tries to write there it gets an exception.
[jira] Created: (HDFS-1232) Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
--------------------------------------------------------------------------------------------

Key: HDFS-1232
URL: https://issues.apache.org/jira/browse/HDFS-1232
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: a block is corrupted if a crash happens before the write to the checksum file but after the write to the data file.

- Setup:
+ # available datanodes = 1
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ when/where failure happens = (see below)

- Details:
When a datanode processes a packet during a client write/append, it first forwards the packet downstream, then writes the packet to the block (data) file, and finally writes to the checksum file. Hence, if a crash happens BEFORE the write to the checksum file but AFTER the write to the data file, the block is corrupted. Worse, if this is the only available replica, the block is lost. We also found this problem in a case where there are 3 replicas for a particular block and two failures happen during an append (see HDFS-1231).

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)
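The crash window follows directly from the write ordering. Below is a minimal, hypothetical sketch (loosely modeled on BlockReceiver's two output streams, with a simulated crash between the two writes): the data bytes land on disk but the matching checksum bytes never do, so on restart the block fails checksum verification.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the HDFS-1232 ordering: data before checksum, with
// a crash injected between the two writes.
public class WriteOrderSketch {
    static class CrashException extends RuntimeException {}

    static void receivePacket(byte[] data, byte[] checksum,
                              OutputStream dataOut, OutputStream checksumOut,
                              boolean crashBetweenWrites) throws IOException {
        // (1) forwarding downstream is omitted in this single-replica sketch
        dataOut.write(data);             // (2) block file updated first
        if (crashBetweenWrites) {
            throw new CrashException();  // crash window: data, no checksum
        }
        checksumOut.write(checksum);     // (3) checksum file updated last
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream dataOut = new ByteArrayOutputStream();
        ByteArrayOutputStream checksumOut = new ByteArrayOutputStream();
        try {
            receivePacket(new byte[512], new byte[4],
                          dataOut, checksumOut, true);
        } catch (CrashException e) {
            // After the "crash": block bytes exist, checksum bytes do not,
            // so the block would be flagged corrupt on restart.
            System.out.println(dataOut.size() + " " + checksumOut.size());
        }
    }
}
```

Reversing the order would not help by itself (a checksum without data is also inconsistent); what matters is that the two files can be mutually inconsistent after a crash at this exact point.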
[jira] Created: (HDFS-1233) Bad retry logic at DFSClient
Bad retry logic at DFSClient
----------------------------

Key: HDFS-1233
URL: https://issues.apache.org/jira/browse/HDFS-1233
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: failover bug; bad retry logic at the DFSClient means it cannot fail over to the 2nd disk.

- Setup:
+ # available datanodes = 1
+ # disks / datanode = 2
+ # failures = 1
+ failure type = bad disk
+ when/where failure happens = (see below)

- Details:
The setup is 1 datanode, 1 replica, and each datanode has 2 disks (Disk1 and Disk2). We injected a single disk failure to see whether the client can fail over to the second disk.

If a persistent disk failure happens during createBlockOutputStream (the first phase of pipeline creation), say DN1-Disk1 is bad, then createBlockOutputStream (cbos) gets an exception and retries! On the retry it gets the same DN1 from the namenode; DN1 then calls DN.writeBlock(), FSVolume.createTmpFile, and finally getNextVolume(), which advances to the next volume. Thus, on the second try, the write successfully goes to the second disk. Essentially, cbos is wrapped in a do/while(retry && --count >= 0): the first cbos fails, and in this particular scenario the second succeeds.

NOW, suppose cbos is successful but the failure is persistent. Then the "retry" lives in a different loop. First, hasError is set to true in RP.run (the response processor). DataStreamer.run() then goes back to the loop while(!closed && clientRunning && !lastPacketInBlock), and this second iteration of the loop calls processDatanodeError because hasError has been set to true. In processDatanodeError (pde), the client sees that this is the only datanode in the pipeline and therefore considers the node bad, although actually only 1 disk is bad! Hence pde throws an IOException saying that all the datanodes in the pipeline (in this case, only DN1) are bad, and in this error path the exception is thrown up to the client.
If that exception were instead caught by the outermost do/while(retry && --count >= 0) loop, the outer retry would succeed (as suggested in the previous paragraph). In summary, in a deployment scenario with only one datanode that has multiple disks, if one disk goes bad, the current retry logic on the DFSClient side is not robust enough to mask the failure from the client.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)
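The first-phase success story above hinges on the round-robin volume selection. This is a hypothetical sketch (loosely modeled on the getNextVolume behavior named in the report, not the real FSDataset code): each attempt advances the volume index, so a retry after hitting the bad Disk1 lands on the healthy Disk2.

```java
// Hypothetical sketch of round-robin volume failover: the first-phase retry
// succeeds precisely because each attempt advances to the next volume.
public class VolumeFailoverSketch {
    private final boolean[] volumeBad;  // per-volume health, index = disk
    private int curVolume = 0;

    VolumeFailoverSketch(boolean... volumeBad) {
        this.volumeBad = volumeBad;
    }

    // Returns the index of the volume used, or throws if that volume is bad.
    int createTmpFile() {
        int v = curVolume;
        curVolume = (curVolume + 1) % volumeBad.length;  // advance round-robin
        if (volumeBad[v]) {
            throw new RuntimeException("bad disk: volume " + v);
        }
        return v;
    }

    public static void main(String[] args) {
        // Disk1 (index 0) is bad, Disk2 (index 1) is healthy.
        VolumeFailoverSketch dn = new VolumeFailoverSketch(true, false);
        int retries = 2;   // the do/while(retry && --count >= 0) in the report
        int used = -1;
        while (retries-- > 0) {
            try {
                used = dn.createTmpFile();
                break;
            } catch (RuntimeException e) {
                // first attempt hits the bad disk; retry advances the volume
            }
        }
        System.out.println(used);  // the write landed on the second disk
    }
}
```

The second-phase path never reaches this loop: processDatanodeError condemns the whole datanode before another volume can be tried, which is the asymmetry the report complains about.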
[jira] Created: (HDFS-1231) Generation Stamp mismatches, leading to failed append
Generation Stamp mismatches, leading to failed append
-----------------------------------------------------

Key: HDFS-1231
URL: https://issues.apache.org/jira/browse/HDFS-1231
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

- Summary: recoverBlock is not atomic, so a retry fails when it runs into a failure.

- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 2
+ failure type = crash
+ when/where failure happens = (see below)

- Details:
Suppose there are 3 datanodes in the pipeline: dn3, dn2, and dn1, with dn1 the primary. When appending, the client first calls dn1.recoverBlock to make all the datanodes in the pipeline agree on a new generation stamp (GS1) and the length of the block. The client then sends a data packet to dn3, which forwards the packet downstream (to dn2 and dn1) and starts writing to its own disk, then crashes AFTER writing to the block file but BEFORE writing to the meta file.

The client notices the crash and calls dn1.recoverBlock(). dn1.recoverBlock() first creates a syncList (by calling getMetadataInfo on dn2 and dn1), then calls NameNode.getNextGS() to get a new generation stamp (GS2), then calls dn2.updateBlock(), which returns successfully. Now dn1 starts its own updateBlock and crashes after renaming blk_X_GS1.meta to blk_X_GS1.meta_tmpGS2. Therefore, from the client's point of view, dn1.recoverBlock() fails, but the generation stamp for the corresponding block has already been incremented (GS2 handed out). The client retries by calling dn2.recoverBlock with the old GS (GS1), which no longer matches the new GS (GS2) --> an exception, and the append fails.

Now, after all this, we have:
- in dn3 (which crashed): tmp/blk_X and tmp/blk_X_GS1.meta
- in dn2: current/blk_X and current/blk_X_GS2.meta
- in dn1 (which crashed): current/blk_X and current/blk_X_GS1.meta_tmpGS2
- in the NameNode, block X has generation stamp GS1 (because dn1 never called commitSynchronization).
Therefore, when the crashed datanodes restart: at dn1, the block is invalid because there is no meta file. At dn3, the block file and meta file are finalized, but the block is corrupted because of a CRC mismatch. At dn2, the GS of the block is GS2, which does not match the generation stamp recorded for the block at the NameNode. Hence, block blk_X is inaccessible. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
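The crash window in dn1's updateBlock can be sketched as a minimal simulation. The file names follow the report; the method, the directory layout, and the simulated-crash flag are illustrative, not the real FSDataset code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// A minimal simulation of the crash window described above: updateBlock
// renames the old meta file to a tmp name carrying the new GS, then renames
// it to its final name. A crash between the two renames leaves only the
// _tmpGS2 leftover on disk.
public class MetaRenameSketch {
    public static Path updateMeta(Path dir, boolean crashBetweenSteps) throws IOException {
        Path oldMeta = dir.resolve("blk_X_GS1.meta");
        Path tmpMeta = dir.resolve("blk_X_GS1.meta_tmpGS2");
        Path newMeta = dir.resolve("blk_X_GS2.meta");
        Files.move(oldMeta, tmpMeta);        // step 1: park under the tmp name
        if (crashBetweenSteps) {
            return tmpMeta;                  // simulated crash: step 2 never runs
        }
        Files.move(tmpMeta, newMeta);        // step 2: final rename
        return newMeta;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dn1");
        Files.createFile(dir.resolve("blk_X_GS1.meta"));
        Path leftover = updateMeta(dir, true);
        // After the "crash" the only meta-like file is the _tmpGS2 leftover,
        // which a restarting datanode does not recognize as a valid meta file.
        System.out.println("leftover: " + leftover.getFileName());
    }
}
```

Run against a temp directory, the simulated crash leaves exactly the dn1 state listed above: a block file plus an unrecognized `blk_X_GS1.meta_tmpGS2`.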
[jira] Resolved: (HDFS-1219) Data Loss due to edits log truncation
[ https://issues.apache.org/jira/browse/HDFS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon resolved HDFS-1219. --- Resolution: Duplicate Why file this bug if it's the same as 955? > Data Loss due to edits log truncation > - > > Key: HDFS-1219 > URL: https://issues.apache.org/jira/browse/HDFS-1219 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.2 >Reporter: Thanh Do > > We found this problem almost at the same time as the HDFS developers. > Basically, the edits log is truncated before fsimage.ckpt is renamed to > fsimage. > Hence, any crash that happens after the truncation but before the renaming will > lead > to data loss. A detailed description can be found here: > https://issues.apache.org/jira/browse/HDFS-955 > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879666#action_12879666 ] Todd Lipcon commented on HDFS-1220: --- I believe we fixed this in trunk by saving to an fsimage_ckpt dir and then moving it into place atomically once all the files are on disk. See HDFS-955? > Namenode unable to start due to truncated fstime > > > Key: HDFS-1220 > URL: https://issues.apache.org/jira/browse/HDFS-1220 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: updating the fstime file on disk is not atomic, so it is possible that > if a crash happens in the middle, the NameNode will read a stale fstime on its > next reboot and be unable to start successfully. > > - Details: > Basically, this involves 3 steps: > 1) delete fstime file (timeFile.delete()) > 2) truncate fstime file (new FileOutputStream(timeFile)) > 3) write new time to fstime file (out.writeLong(checkpointTime)) > If a crash happens after step 2 and before step 3, on the next reboot, the > NameNode > gets an exception when reading the time (8 bytes) from an empty fstime file. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
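The three fragile steps above, and the usual write-to-temp-then-atomic-rename pattern that closes the crash window, can be sketched as follows. This is a simplification assuming only the 8-byte fstime format described in the report, not the real FSImage code:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Sketch of the fragile in-place fstime update vs. an atomic alternative.
public class FsTimeSketch {
    // Fragile: a crash after the truncate (step 2) but before writeLong
    // (step 3) leaves an empty fstime that the next boot cannot parse.
    static void writeInPlace(File timeFile, long checkpointTime) throws IOException {
        timeFile.delete();                                            // step 1
        try (DataOutputStream out =
                 new DataOutputStream(new FileOutputStream(timeFile))) { // step 2 truncates
            out.writeLong(checkpointTime);                            // step 3
        }
    }

    // Safer: write the full 8 bytes to a temp file, then atomically rename
    // over the old one. A crash at any point leaves a complete old or new file.
    static void writeAtomically(File timeFile, long checkpointTime) throws IOException {
        File tmp = new File(timeFile.getPath() + ".tmp");
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(tmp))) {
            out.writeLong(checkpointTime);
        }
        Files.move(tmp.toPath(), timeFile.toPath(), StandardCopyOption.ATOMIC_MOVE);
    }

    static long read(File timeFile) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(timeFile))) {
            return in.readLong(); // throws EOFException on an empty/truncated file
        }
    }
}
```

The same pattern is what Todd's comment describes at the directory level: assemble fsimage_ckpt completely, then move it into place in one step.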
[jira] Commented: (HDFS-1223) DataNode fails stop due to a bad disk (or storage directory)
[ https://issues.apache.org/jira/browse/HDFS-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879665#action_12879665 ] Todd Lipcon commented on HDFS-1223: --- Already fixed by HDFS-457 > DataNode fails stop due to a bad disk (or storage directory) > > > Key: HDFS-1223 > URL: https://issues.apache.org/jira/browse/HDFS-1223 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > A datanode can store block files in multiple volumes. > If a datanode sees a bad volume during startup (i.e., it faces an exception > when accessing that volume), it simply fail-stops, making all block files > stored in the other, healthy volumes inaccessible. Consequently, these lost > replicas will have to be regenerated later on other datanodes. > If a datanode were able to mark the bad disk and continue working with the > healthy ones, this would increase availability and avoid unnecessary > regeneration. As an extreme example, consider one datanode which has > 2 volumes, V1 and V2, each containing one 64MB block file. > During startup, the datanode gets an exception when accessing V1; it then > fail-stops, causing 2 block files to be regenerated later. > If the datanode instead marked V1 as bad and continued working with V2, the > number of replicas that need to be regenerated would be cut in half. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
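The suggested skip-bad-volumes behavior might look like the following sketch. It is illustrative only: the health check and names are assumptions, and HDFS-457 implements the real fix with its own volume-checking logic:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: scan all configured volumes at startup, keep the healthy ones, and
// fail-stop only if none survive, instead of aborting on the first bad volume.
public class VolumeScanSketch {
    static List<File> usableVolumes(List<File> configured) throws IOException {
        List<File> healthy = new ArrayList<>();
        for (File vol : configured) {
            // A stand-in health check; a real datanode would probe the
            // directory by actually reading and writing under it.
            if (vol.isDirectory() && vol.canRead() && vol.canWrite()) {
                healthy.add(vol);
            } // else: mark the volume bad and keep going with the rest
        }
        if (healthy.isEmpty()) {
            throw new IOException("no usable volumes");
        }
        return healthy;
    }
}
```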
[jira] Commented: (HDFS-1224) Stale connection makes node miss append
[ https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879664#action_12879664 ] Todd Lipcon commented on HDFS-1224: --- If the node has crashed, the TCP connection should be broken and thus it won't re-use an existing connection, no? Even so, does this cause any actual problems aside from a shorter pipeline? Given we only cache IPC connections for a short amount of time, the likelihood of a DN restart while a connection is cached is very small > Stale connection makes node miss append > --- > > Key: HDFS-1224 > URL: https://issues.apache.org/jira/browse/HDFS-1224 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: if a datanode crashes and restarts, it may miss an append. > > - Setup: > + # available datanodes = 3 > + # replica = 3 > + # disks / datanode = 1 > + # failures = 1 > + failure type = crash > + When/where failure happens = after the first append succeeds > > - Details: > Since each datanode maintains a pool of IPC connections, whenever it wants > to make an IPC call, it first looks into the pool. If the connection is not > there, > it is created and put into the pool. Otherwise the existing connection is > used. > Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the > primary. > After the client appends to block X successfully, dn2 crashes and restarts. > Now the client writes a new block Y to dn1, dn2 and dn3. The write is successful. > The client starts appending to block Y. It first calls dn1.recoverBlock(). > Dn1 will first create a proxy for each of the datanodes in the > pipeline > (in order to make RPC calls like getMetadataInfo( ) or updateBlock( )). > However, because > dn2 has just crashed and restarted, its connection in dn1's pool has become stale. > Dn1 uses > this connection to make a call to dn2, hence an exception. 
Therefore, append > will be > made only to dn1 and dn3, although dn2 is alive and the write of block Y to > dn2 has > been successful. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
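The get-or-create pool behavior described above, and a drop-stale-entry-and-retry fix, can be sketched with a toy model. Connection "epochs" stand in for TCP connections to a restarted peer; all names are illustrative, not Hadoop's IPC classes:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a per-peer IPC connection pool. A cached connection becomes
// stale when its epoch no longer matches the peer's (i.e., the peer restarted).
public class IpcPoolSketch {
    static final Map<String, Integer> peerEpoch = new ConcurrentHashMap<>();

    static class Conn {
        final String addr;
        final int epoch;
        Conn(String addr) {
            this.addr = addr;
            this.epoch = peerEpoch.getOrDefault(addr, 0);
        }
        String call(String method) throws IOException {
            if (epoch != peerEpoch.getOrDefault(addr, 0)) {
                throw new IOException("stale connection to " + addr);
            }
            return method + " ok";
        }
    }

    final Map<String, Conn> pool = new ConcurrentHashMap<>();

    // As described above: reuse the pooled connection if present, else create one.
    String callNaive(String addr, String method) throws IOException {
        return pool.computeIfAbsent(addr, Conn::new).call(method);
    }

    // Fix sketch: on failure, evict the pooled entry and retry once with a
    // fresh connection, so a restarted-but-alive peer is not dropped.
    String callWithRetry(String addr, String method) throws IOException {
        try {
            return callNaive(addr, method);
        } catch (IOException e) {
            pool.remove(addr);
            return callNaive(addr, method);
        }
    }

    static void restart(String addr) {
        peerEpoch.merge(addr, 1, Integer::sum);
    }
}
```

In the report's terms: dn1's `callNaive` to the restarted dn2 throws, so dn2 is excluded from the append even though a `callWithRetry`-style evict-and-retry would have reached it.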
[jira] Commented: (HDFS-1227) UpdateBlock fails due to unmatched file length
[ https://issues.apache.org/jira/browse/HDFS-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879662#action_12879662 ] Todd Lipcon commented on HDFS-1227: --- Believe this is addressed by HDFS-1186 in the 20-append branch > UpdateBlock fails due to unmatched file length > -- > > Key: HDFS-1227 > URL: https://issues.apache.org/jira/browse/HDFS-1227 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: client append is not atomic; hence, it is possible that > when retrying during append, there is an exception in updateBlock > indicating an unmatched file length, causing the append to fail. > > - Setup: > + # available datanodes = 3 > + # disks / datanode = 1 > + # failures = 2 > + failure type = bad disk > + When/where failure happens = (see below) > + This bug is non-deterministic; to reproduce it, add a sufficient sleep > before out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3 > > - Details: > Suppose the client appends 16 bytes to block X, which has length 16 bytes, at dn1, > dn2, dn3. > Dn1 is primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds. > The client starts sending data to dn3 - the first datanode in the pipeline. > dn3 forwards the packet to the downstream datanodes, and starts writing > data to its disk. Suppose there is an exception in dn3 when writing to disk. > The client gets the exception and starts the recovery code by calling > dn1.recoverBlock() again. > dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build > the syncList. > Suppose at the time getMetadataInfo() is called at both datanodes (dn1 and > dn2), > the previous packet (which was sent from dn3) has not reached disk yet. > Hence, the block info given by getMetaDataInfo contains a length of 16 > bytes. > But after that, the packet "reaches" disk, and the block file length now > becomes 32 bytes. 
> Using the syncList (which contains block info with length 16 bytes), dn1 calls > updateBlock at > dn2 and dn1, which will fail, because the length of the new block info (given > by updateBlock, > which is 16 bytes) does not match the actual length on disk (which is 32 > bytes) > > Note that this bug is non-deterministic. It depends on the thread > interleaving > at the datanodes. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
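The interleaving can be reduced to a toy model: recoverBlock snapshots the replica length for the syncList, a delayed packet then reaches disk, and updateBlock rejects the now-stale length. Method names mirror the report; the classes are stand-ins, not HDFS code:

```java
import java.io.IOException;

// Toy model of the length race described above.
public class LengthRaceSketch {
    static class Replica {
        long onDiskLength;

        Replica(long len) { onDiskLength = len; }

        long getMetadataInfo() { return onDiskLength; }      // length snapshot for syncList

        void latePacketReachesDisk(long bytes) { onDiskLength += bytes; }

        void updateBlock(long syncedLength) throws IOException {
            if (syncedLength != onDiskLength) {              // the check that fails
                throw new IOException("unmatched file length: synced=" + syncedLength
                        + " onDisk=" + onDiskLength);
            }
        }
    }
}
```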
[jira] Commented: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender
[ https://issues.apache.org/jira/browse/HDFS-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879663#action_12879663 ] Todd Lipcon commented on HDFS-1226: --- This is addressed by combination of HDFS-142, HDFS-200, HDFS-1057 in 20 append > Last block is temporary unavailable for readers because of crashed appender > --- > > Key: HDFS-1226 > URL: https://issues.apache.org/jira/browse/HDFS-1226 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: the last block is unavailable to subsequent readers if the appender > crashes in the > middle of an appending workload. > > - Setup: > + # available datanodes = 3 > + # disks / datanode = 1 > + # failures = 1 > + failure type = crash > + When/where failure happens = (see below) > > - Details: > Say a client is appending to block X at 3 datanodes: dn1, dn2 and dn3. After a > successful > recoverBlock at the primary datanode, the client calls createOutputStream, which makes > all datanodes > move the block file and the meta file from the current directory to the tmp > directory. Now suppose > the client crashes. Since all replicas of block X are in the tmp folders of the > corresponding datanodes, > subsequent readers cannot read block X. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1230) BlocksMap.blockinfo is not getting cleared immediately after deleting a block.This will be cleared only after block report comes from the datanode.Why we need to maintain t
BlocksMap.blockinfo is not getting cleared immediately after deleting a block. This will be cleared only after a block report comes from the datanode. Why do we need to maintain the blockinfo till that time? Key: HDFS-1230 URL: https://issues.apache.org/jira/browse/HDFS-1230 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Affects Versions: 0.20.1 Reporter: Gokul BlocksMap.blockinfo is not getting cleared immediately after deleting a block. This will be cleared only after a block report comes from the datanode. Why do we need to maintain the blockinfo till that time? It increases namenode memory unnecessarily. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock
[ https://issues.apache.org/jira/browse/HDFS-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1229: --- Description: Setup: + # available datanodes = 2 + # disks / datanode = 1 + # failures = 1 + failure type = crash + When/where failure happens = during primary's recoverBlock Details: -- Say the client is appending to block X1 at 2 datanodes: dn1 and dn2. First it needs to make sure both dn1 and dn2 agree on the new GS of the block. 1) The client first creates a DFSOutputStream by calling >OutputStream result = new DFSOutputStream(src, buffersize, progress, >lastBlock, stat, > conf.getInt("io.bytes.per.checksum", 512)); in DFSClient.append() 2) The above DFSOutputStream constructor in turn calls processDatanodeError(true, true) (i.e., hasError = true, isAppend = true), and starts the DataStreamer > processDatanodeError(true, true); /* let's call this PDNE 1 */ > streamer.start(); Note that DataStreamer.run() also calls processDatanodeError() > while (!closed && clientRunning) { > ... > boolean doSleep = processDatanodeError(hasError, false); /* let's call > this PDNE 2 */ 3) Now in PDNE 1, we have the following code: > blockStream = null; > blockReplyStream = null; > ... > while (!success && clientRunning) { > ... >try { > primary = createClientDatanodeProtocolProxy(primaryNode, conf); > newBlock = primary.recoverBlock(block, isAppend, newnodes); > /*exception here*/ > ... >catch (IOException e) { > ... > if (recoveryErrorCount > maxRecoveryErrorCount) { > // this condition is false > } > ... > return true; >} // end catch >finally {...} > >this.hasError = false; >lastException = null; >errorIndex = 0; >success = createBlockOutputStream(nodes, clientName, true); >} >... Because dn1 crashes during the client's call to recoverBlock, we get an exception. Hence, we go to the catch block, in which processDatanodeError returns true before reaching the code that sets hasError to false. 
Also, because createBlockOutputStream() is not called (due to the early return), blockStream is still null. 4) Now PDNE 1 has finished; we come to streamer.start(), which leads to PDNE 2. Because hasError = false, PDNE 2 returns false immediately without doing anything > if (!hasError) { return false; } 5) Still in DataStreamer.run(), after returning false from PDNE 2, we still have blockStream = null, hence the following code is executed: if (blockStream == null) { nodes = nextBlockOutputStream(src); this.setName("DataStreamer for file " + src + " block " + block); response = new ResponseProcessor(nodes); response.start(); } nextBlockOutputStream, which asks the namenode to allocate a new block, is called. (This is not good, because we are appending, not writing). The namenode gives it a new block ID and a set of datanodes, including the crashed dn1. This makes createBlockOutputStream() fail because it tries to contact dn1 (which has crashed) first. The client retries 5 times without any success, because every time it asks the namenode for a new block! Again we see that the retry logic at the client is weird! *This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)* 
[jira] Created: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock
DFSClient incorrectly asks for new block if primary crashes during first recoverBlock - Key: HDFS-1229 URL: https://issues.apache.org/jira/browse/HDFS-1229 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client Affects Versions: 0.20.1 Reporter: Thanh Do - Setup: + # available datanodes = 2 + # disks / datanode = 1 + # failures = 1 + failure type = crash + When/where failure happens = during primary's recoverBlock - Details: Say the client is appending to block X1 at 2 datanodes: dn1 and dn2. First it needs to make sure both dn1 and dn2 agree on the new GS of the block. 1) The client first creates a DFSOutputStream by calling >OutputStream result = new DFSOutputStream(src, buffersize, progress, >lastBlock, stat, > conf.getInt("io.bytes.per.checksum", 512)); in DFSClient.append() 2) The above DFSOutputStream constructor in turn calls processDatanodeError(true, true) (i.e., hasError = true, isAppend = true), and starts the DataStreamer > processDatanodeError(true, true); /* let's call this PDNE 1 */ > streamer.start(); Note that DataStreamer.run() also calls processDatanodeError() > while (!closed && clientRunning) { > ... > boolean doSleep = processDatanodeError(hasError, false); /* let's call > this PDNE 2 */ 3) Now in PDNE 1, we have the following code: > blockStream = null; > blockReplyStream = null; > ... > while (!success && clientRunning) { > ... >try { > primary = createClientDatanodeProtocolProxy(primaryNode, conf); > newBlock = primary.recoverBlock(block, isAppend, newnodes); > /*exception here*/ > ... >catch (IOException e) { > ... > if (recoveryErrorCount > maxRecoveryErrorCount) { > /* this condition is false */ > } > ... > return true; >} // end catch >finally {...} > >this.hasError = false; >lastException = null; >errorIndex = 0; >success = createBlockOutputStream(nodes, clientName, true); >} >... Because dn1 crashes during the client's call to recoverBlock, we get an exception. 
Hence, we go to the catch block, in which processDatanodeError returns true before reaching the code that sets hasError to false. Also, because createBlockOutputStream() is not called (due to the early return), blockStream is still null. 4) Now PDNE 1 has finished; we come to streamer.start(), which leads to PDNE 2. Because hasError = false, PDNE 2 returns false immediately without doing anything >if (!hasError) { > return false; >} 5) Still in DataStreamer.run(), after returning false from PDNE 2, we still have blockStream = null, hence the following code is executed: > if (blockStream == null) { > nodes = nextBlockOutputStream(src); > this.setName("DataStreamer for file " + src + > " block " + block); >response = new ResponseProcessor(nodes); >response.start(); > } nextBlockOutputStream, which asks the namenode to allocate a new block, is called. (This is not good, because we are appending, not writing). The namenode gives it a new block ID and a set of datanodes, including the crashed dn1. This makes createBlockOutputStream() fail because it tries to contact dn1 (which has crashed) first. The client retries 5 times without any success, because every time it asks the namenode for a new block! Again we see that the retry logic at the client is weird! *This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)* -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
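The control flow above can be boiled down to a skeleton. As in the excerpts, hasError is both a field (initially false, since PDNE 1 receives `true` as an argument) and a parameter; the early return from the catch block is what leaves blockStream null. This is a sketch of the reported logic, not the real DFSClient code:

```java
import java.io.IOException;

// Skeleton of the bug: the catch block returns before the cleanup below it
// runs, so createBlockOutputStream() is never called, blockStream stays null,
// and the streamer later takes the allocate-a-new-block path during an append.
public class PdneSketch {
    boolean hasError = false;      // the field; PDNE 1 is passed `true` as an argument
    Object blockStream = null;
    boolean primaryCrashed = true; // dn1 crashes during recoverBlock

    boolean processDatanodeError(boolean hasError, boolean isAppend) {
        if (!hasError) {
            return false;          // PDNE 2 exits here: the field is still false
        }
        blockStream = null;
        try {
            if (primaryCrashed) {
                throw new IOException("recoverBlock: primary is down");
            }
        } catch (IOException e) {
            return true;           // early return: the lines below never run
        }
        this.hasError = false;
        blockStream = new Object(); // stands in for createBlockOutputStream()
        return false;
    }

    // The streamer then sees blockStream == null and asks the namenode for a
    // brand new block, which is wrong for an append.
    String nextStreamerAction() {
        processDatanodeError(hasError, false);  // PDNE 2
        return (blockStream == null) ? "nextBlockOutputStream" : "writePacket";
    }
}
```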
[jira] Created: (HDFS-1228) CRC does not match when retrying appending a partial block
CRC does not match when retrying appending a partial block -- Key: HDFS-1228 URL: https://issues.apache.org/jira/browse/HDFS-1228 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: when appending to a partial block, it is possible that a retry after an exception fails due to a checksum mismatch. The append operation is not atomic (it neither completes fully nor fails completely). - Setup: + # available datanodes = 2 + # disks / datanode = 1 + # failures = 1 + failure type = bad disk + When/where failure happens = (see below) - Details: The client writes 16 bytes to dn1 and dn2. The write completes. So far so good. The meta file now contains: 7-byte header + 4-byte checksum (CK1, the checksum for the 16 bytes). The client then appends 16 more bytes, and let's assume there is an exception at BlockReceiver.receivePacket() at dn2. So the client knows dn2 is bad. BUT the append at dn1 is complete (i.e., the data portion and the checksum portion have been written to disk in the corresponding block file and meta file), meaning that the checksum file at dn1 now contains: 7-byte header + 4-byte checksum (CK2, the checksum for the 32 bytes of data). Because dn2 had an exception, the client calls recoverBlock and starts the append to dn1 again. dn1 receives the 16 bytes of data and verifies whether the pre-computed CRC on disk (CK2) matches what it just recalculated (CK1), which obviously does not match. Hence an exception, and the retry fails. - A similar bug has been reported at https://issues.apache.org/jira/browse/HDFS-679, but here it manifests in a different context. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
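The CK1/CK2 mismatch can be illustrated numerically with java.util.zip.CRC32, used here only as a representative checksum over made-up block contents: the checksum of the first 16 bytes (what the retrying client expects) differs from the checksum of all 32 bytes (what dn1 already has on disk).

```java
import java.util.zip.CRC32;

// Illustration of the mismatch: a checksum over a 16-byte prefix (CK1) vs.
// the same block after the completed append (32 bytes, CK2).
public class CrcMismatchSketch {
    static long crcOf(byte[] data, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, len);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = new byte[32];             // 16 original + 16 appended bytes
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) i;
        }
        long ck1 = crcOf(block, 16);             // what the retrying client recomputes
        long ck2 = crcOf(block, 32);             // what dn1 already stored on disk
        System.out.println("CK1=" + ck1 + ", CK2=" + ck2);
    }
}
```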
[jira] Commented: (HDFS-1219) Data Loss due to edits log truncation
[ https://issues.apache.org/jira/browse/HDFS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879655#action_12879655 ] Tsz Wo (Nicholas), SZE commented on HDFS-1219: -- Then, is this the same as HDFS-955? > Data Loss due to edits log truncation > - > > Key: HDFS-1219 > URL: https://issues.apache.org/jira/browse/HDFS-1219 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.2 >Reporter: Thanh Do > > We found this problem almost at the same time as the HDFS developers. > Basically, the edits log is truncated before fsimage.ckpt is renamed to > fsimage. > Hence, any crash that happens after the truncation but before the renaming will > lead > to data loss. A detailed description can be found here: > https://issues.apache.org/jira/browse/HDFS-955 > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1227) UpdateBlock fails due to unmatched file length
UpdateBlock fails due to unmatched file length -- Key: HDFS-1227 URL: https://issues.apache.org/jira/browse/HDFS-1227 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: client append is not atomic; hence, it is possible that when retrying during append, there is an exception in updateBlock indicating an unmatched file length, causing the append to fail. - Setup: + # available datanodes = 3 + # disks / datanode = 1 + # failures = 2 + failure type = bad disk + When/where failure happens = (see below) + This bug is non-deterministic; to reproduce it, add a sufficient sleep before out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3 - Details: Suppose the client appends 16 bytes to block X, which has length 16 bytes, at dn1, dn2, dn3. Dn1 is primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds. The client starts sending data to dn3 - the first datanode in the pipeline. dn3 forwards the packet to the downstream datanodes, and starts writing data to its disk. Suppose there is an exception in dn3 when writing to disk. The client gets the exception and starts the recovery code by calling dn1.recoverBlock() again. dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build the syncList. Suppose at the time getMetadataInfo() is called at both datanodes (dn1 and dn2), the previous packet (which was sent from dn3) has not reached disk yet. Hence, the block info given by getMetaDataInfo contains a length of 16 bytes. But after that, the packet "reaches" disk, and the block file length now becomes 32 bytes. Using the syncList (which contains block info with length 16 bytes), dn1 calls updateBlock at dn2 and dn1, which will fail, because the length of the new block info (given by updateBlock, which is 16 bytes) does not match the actual length on disk (which is 32 bytes). Note that this bug is non-deterministic. It depends on the thread interleaving at the datanodes. 
This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender
[ https://issues.apache.org/jira/browse/HDFS-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1226: --- Description: - Summary: the last block is unavailable to subsequent readers if appender crashes in the middle of appending workload. - Setup: + # available datanodes = 3 + # disks / datanode = 1 + # failures = 1 + failure type = crash + When/where failure happens = (see below) - Details: Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After successful recoverBlock at primary datanode, client calls createOutputStream, which make all datanodes move the block file and the meta file from current directory to tmp directory. Now suppose the client crashes. Since all replicas of block X are in tmp folders of corresponding datanode, subsequent readers cannot read block X. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: the last block is unavailable to subsequent readers if appender crashes in the middle of appending workload. - Setup: # available datanodes = 3 # disks / datanode = 1 # failures = 1 failure type = crash When/where failure happens = (see below) - Details: Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After successful recoverBlock at primary datanode, client calls createOutputStream, which make all datanodes move the block file and the meta file from current directory to tmp directory. Now suppose the client crashes. Since all replicas of block X are in tmp folders of corresponding datanode, subsequent readers cannot read block X. 
This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) > Last block is temporary unavailable for readers because of crashed appender > --- > > Key: HDFS-1226 > URL: https://issues.apache.org/jira/browse/HDFS-1226 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: the last block is unavailable to subsequent readers if appender > crashes in the > middle of appending workload. > > - Setup: > + # available datanodes = 3 > + # disks / datanode = 1 > + # failures = 1 > + failure type = crash > + When/where failure happens = (see below) > > - Details: > Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After > successful > recoverBlock at primary datanode, client calls createOutputStream, which make > all datanodes > move the block file and the meta file from current directory to tmp > directory. Now suppose > the client crashes. Since all replicas of block X are in tmp folders of > corresponding datanode, > subsequent readers cannot read block X. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender
[ https://issues.apache.org/jira/browse/HDFS-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1226: --- Description: - Summary: the last block is unavailable to subsequent readers if appender crashes in the middle of appending workload. - Setup: # available datanodes = 3 # disks / datanode = 1 # failures = 1 failure type = crash When/where failure happens = (see below) - Details: Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After successful recoverBlock at primary datanode, client calls createOutputStream, which make all datanodes move the block file and the meta file from current directory to tmp directory. Now suppose the client crashes. Since all replicas of block X are in tmp folders of corresponding datanode, subsequent readers cannot read block X. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: the last block is unavailable to subsequent readers if appender crashes in the middle of appending workload. - Setup: # available datanodes = 3 # disks / datanode = 1 # failures = 1 failure type = crash When/where failure happens = (see below) - Details: Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After successful recoverBlock at primary datanode, client calls createOutputStream, which make all datanodes move the block file and the meta file from current directory to tmp directory. Now suppose the client crashes. Since all replicas of block X are in tmp folders of corresponding datanode, subsequent readers cannot read block X. 
> Last block is temporary unavailable for readers because of crashed appender > --- > > Key: HDFS-1226 > URL: https://issues.apache.org/jira/browse/HDFS-1226 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: the last block is unavailable to subsequent readers if appender > crashes in the > middle of appending workload. > > - Setup: > # available datanodes = 3 > # disks / datanode = 1 > # failures = 1 > failure type = crash > When/where failure happens = (see below) > > - Details: > Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After > successful > recoverBlock at primary datanode, client calls createOutputStream, which make > all datanodes > move the block file and the meta file from current directory to tmp > directory. Now suppose > the client crashes. Since all replicas of block X are in tmp folders of > corresponding datanode, > subsequent readers cannot read block X. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender
Last block is temporary unavailable for readers because of crashed appender --- Key: HDFS-1226 URL: https://issues.apache.org/jira/browse/HDFS-1226 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: the last block is unavailable to subsequent readers if the appender crashes in the middle of an append workload. - Setup: # available datanodes = 3 # disks / datanode = 1 # failures = 1 failure type = crash When/where failure happens = (see below) - Details: Say a client is appending to block X at 3 datanodes: dn1, dn2 and dn3. After a successful recoverBlock at the primary datanode, the client calls createOutputStream, which makes all datanodes move the block file and the meta file from the current directory to the tmp directory. Now suppose the client crashes. Since all replicas of block X are in the tmp folders of the corresponding datanodes, subsequent readers cannot read block X. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1225) Block lost when primary crashes in recoverBlock
Block lost when primary crashes in recoverBlock --- Key: HDFS-1225 URL: https://issues.apache.org/jira/browse/HDFS-1225 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: a block is lost if the primary datanode crashes in the middle of tryUpdateBlock. - Setup: # available datanodes = 2 # replica = 2 # disks / datanode = 1 # failures = 1 # failure type = crash When/where failure happens = (see below) - Details: Suppose we have 2 datanodes, dn1 and dn2, and dn1 is the primary. The client appends to blk_X_1001, and a crash happens during dn1.recoverBlock, at the point after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002. Interestingly, in this case block X is eventually lost. Why? After dn1.recoverBlock crashes at the rename, what is left in dn1's current directory is: 1) blk_X 2) blk_X_1001.meta_tmp1002 ==> an invalid block, because it has no meta file associated with it. dn2 (after the dn1 crash) now contains: 1) blk_X 2) blk_X_1002.meta (note that the rename at dn2 completed, because dn1 called dn2.updateBlock() before calling its own updateBlock()). But namenode.commitBlockSynchronization is never reported to the namenode, because dn1 crashed. Therefore, from the namenode's point of view, block X has GS 1001. Hence, the block is lost. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
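The lost-block state described above can be modeled in a few lines: a replica is usable only if the datanode holds a meta file carrying the generation stamp (GS) the namenode recorded for the block. This is an illustrative sketch, not the datanode's actual validation code; the file names are copied from the report and usable() is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.List;

public class BlockValidity {
    // A replica of `block` is usable only if the datanode holds both the
    // block file and a meta file carrying the generation stamp (GS) that
    // the namenode has recorded for the block.
    static boolean usable(List<String> files, String block, long namenodeGS) {
        return files.contains(block)
            && files.contains(block + "_" + namenodeGS + ".meta");
    }

    public static void main(String[] args) {
        // dn1 crashed before commitBlockSynchronization, so the namenode
        // still records GS 1001 for block X.
        long namenodeGS = 1001L;
        // dn1: the rename crashed halfway, leaving a _tmp meta file.
        List<String> dn1 = Arrays.asList("blk_X", "blk_X_1001.meta_tmp1002");
        // dn2: the rename completed, so its meta file carries GS 1002.
        List<String> dn2 = Arrays.asList("blk_X", "blk_X_1002.meta");
        System.out.println("dn1 usable: " + usable(dn1, "blk_X", namenodeGS));
        System.out.println("dn2 usable: " + usable(dn2, "blk_X", namenodeGS));
        // Both are false: no replica matches GS 1001, so block X is lost.
    }
}
```

Neither replica passes the check, which is exactly why no recovery path can bring block X back without the namenode learning about GS 1002.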
[jira] Updated: (HDFS-1224) Stale connection makes node miss append
[ https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1224: --- Description: - Summary: if a datanode crashes and restarts, it may miss an append. - Setup: + # available datanodes = 3 + # replica = 3 + # disks / datanode = 1 + # failures = 1 + failure type = crash + When/where failure happens = after the first append succeeds - Details: Each datanode maintains a pool of IPC connections; whenever it wants to make an IPC call, it first looks in the pool. If the connection is not there, it is created and put into the pool; otherwise the existing connection is used. Suppose that the append pipeline contains dn1, dn2, and dn3, with dn1 the primary. After the client appends to block X successfully, dn2 crashes and restarts. Now the client writes a new block Y to dn1, dn2 and dn3. The write succeeds. The client then starts appending to block Y. It first calls dn1.recoverBlock(). Dn1 will first create a proxy corresponding to each datanode in the pipeline (in order to make RPC calls like getMetadataInfo() or updateBlock()). However, because dn2 has just crashed and restarted, its connection in dn1's pool has become stale. Dn1 uses this connection to make a call to dn2, hence an exception. Therefore, the append is made only to dn1 and dn3, although dn2 is alive and the write of block Y to dn2 was successful. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: if a datanode crashes and restarts, it may miss an append. 
- Setup: + available datanodes = 3 + replica = 3 + disks / datanode = 1 + failures = 1 + failure type = crash + When/where failure happens = after the first append succeed - Details: Since each datanode maintains a pool of IPC connections, whenever it wants to make an IPC call, it first looks into the pool. If the connection is not there, it is created and put in to the pool. Otherwise the existing connection is used. Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the primary. After the client appends to block X successfully, dn2 crashes and restarts. Now client writes a new block Y to dn1, dn2 and dn3. The write is successful. Client starts appending to block Y. It first calls dn1.recoverBlock(). Dn1 will first create a proxy corresponding with each of the datanode in the pipeline (in order to make RPC call like getMetadataInfo( ) or updateBlock( )). However, because dn2 has just crashed and restarts, its connection in dn1's pool become stale. Dn1 uses this connection to make a call to dn2, hence an exception. Therefore, append will be made only to dn1 and dn3, although dn2 is alive and the write of block Y to dn2 has been successful. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) > Stale connection makes node miss append > --- > > Key: HDFS-1224 > URL: https://issues.apache.org/jira/browse/HDFS-1224 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: if a datanode crashes and restarts, it may miss an append. 
> > - Setup: > + # available datanodes = 3 > + # replica = 3 > + # disks / datanode = 1 > + # failures = 1 > + failure type = crash > + When/where failure happens = after the first append succeed > > - Details: > Since each datanode maintains a pool of IPC connections, whenever it wants > to make an IPC call, it first looks into the pool. If the connection is not > there, > it is created and put in to the pool. Otherwise the existing connection is > used. > Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the > primary. > After the client appends to block X successfully, dn2 crashes and restarts. > Now client writes a new block Y to dn1, dn2 and dn3. The write is successful. > Client starts appending to block Y. It first calls dn1.recoverBlock(). > Dn1 will first create a proxy corresponding with each of the datanode in the > pipeline > (in order to make RPC call like getMetadataInfo( ) or updateBlock( )). > However, because > dn2 has just crashed and restarts, its connection in dn1's pool become stale. > Dn1 uses > this connection to make a call to dn2, hence an exception. Therefore, append > will be > made only to dn1 and dn3, although dn2 is alive and the write of block Y to > dn2 has > been successful. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechR
[jira] Updated: (HDFS-1224) Stale connection makes node miss append
[ https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1224: --- Affects Version/s: 0.20.1 Component/s: data-node > Stale connection makes node miss append > --- > > Key: HDFS-1224 > URL: https://issues.apache.org/jira/browse/HDFS-1224 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: if a datanode crashes and restarts, it may miss an append. > > - Setup: > + # available datanodes = 3 > + # replica = 3 > + # disks / datanode = 1 > + # failures = 1 > + failure type = crash > + When/where failure happens = after the first append succeed > > - Details: > Since each datanode maintains a pool of IPC connections, whenever it wants > to make an IPC call, it first looks into the pool. If the connection is not > there, > it is created and put in to the pool. Otherwise the existing connection is > used. > Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the > primary. > After the client appends to block X successfully, dn2 crashes and restarts. > Now client writes a new block Y to dn1, dn2 and dn3. The write is successful. > Client starts appending to block Y. It first calls dn1.recoverBlock(). > Dn1 will first create a proxy corresponding with each of the datanode in the > pipeline > (in order to make RPC call like getMetadataInfo( ) or updateBlock( )). > However, because > dn2 has just crashed and restarts, its connection in dn1's pool become stale. > Dn1 uses > this connection to make a call to dn2, hence an exception. Therefore, append > will be > made only to dn1 and dn3, although dn2 is alive and the write of block Y to > dn2 has > been successful. 
> This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1224) Stale connection makes node miss append
Stale connection makes node miss append --- Key: HDFS-1224 URL: https://issues.apache.org/jira/browse/HDFS-1224 Project: Hadoop HDFS Issue Type: Bug Reporter: Thanh Do - Summary: if a datanode crashes and restarts, it may miss an append. - Setup: + available datanodes = 3 + replica = 3 + disks / datanode = 1 + failures = 1 + failure type = crash + When/where failure happens = after the first append succeeds - Details: Each datanode maintains a pool of IPC connections; whenever it wants to make an IPC call, it first looks in the pool. If the connection is not there, it is created and put into the pool; otherwise the existing connection is used. Suppose that the append pipeline contains dn1, dn2, and dn3, with dn1 the primary. After the client appends to block X successfully, dn2 crashes and restarts. Now the client writes a new block Y to dn1, dn2 and dn3. The write succeeds. The client then starts appending to block Y. It first calls dn1.recoverBlock(). Dn1 will first create a proxy corresponding to each datanode in the pipeline (in order to make RPC calls like getMetadataInfo() or updateBlock()). However, because dn2 has just crashed and restarted, its connection in dn1's pool has become stale. Dn1 uses this connection to make a call to dn2, hence an exception. Therefore, the append is made only to dn1 and dn3, although dn2 is alive and the write of block Y to dn2 was successful. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
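A conventional mitigation for this class of bug is to evict the pooled connection on an RPC failure and retry once over a fresh one, so a datanode that merely restarted is not dropped from the pipeline. The sketch below models that policy; Conn, ConnFactory, and callWithRetry are illustrative stand-ins, not Hadoop's actual IPC classes.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class RpcPoolRetry {
    // Illustrative stand-ins for the IPC machinery (not the real Hadoop API).
    interface Conn { String call(String method) throws IOException; }
    interface ConnFactory { Conn connect(String node); }

    final Map<String, Conn> pool = new HashMap<String, Conn>();
    private final ConnFactory factory;

    RpcPoolRetry(ConnFactory factory) { this.factory = factory; }

    // On an RPC failure, evict the pooled (possibly stale) connection and
    // retry once over a fresh one. A second failure propagates, so a node
    // that is truly down is still treated as failed.
    String callWithRetry(String node, String method) throws IOException {
        Conn c = pool.get(node);
        if (c == null) {
            c = factory.connect(node);
            pool.put(node, c);
        }
        try {
            return c.call(method);
        } catch (IOException stale) {
            pool.remove(node);                  // evict the stale entry
            Conn fresh = factory.connect(node); // reconnect
            pool.put(node, fresh);
            return fresh.call(method);          // single retry
        }
    }

    public static void main(String[] args) throws IOException {
        RpcPoolRetry rpc = new RpcPoolRetry(new ConnFactory() {
            public Conn connect(String node) {
                return new Conn() {
                    public String call(String m) { return "ok"; }
                };
            }
        });
        // Simulate the pooled connection to dn2 going stale after its restart.
        rpc.pool.put("dn2", new Conn() {
            public String call(String m) throws IOException {
                throw new IOException("connection reset by peer");
            }
        });
        System.out.println(rpc.callWithRetry("dn2", "getMetadataInfo"));
    }
}
```

With this policy, dn1's call to the restarted dn2 would succeed on the retry, and the append pipeline would keep all three datanodes.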
[jira] Created: (HDFS-1223) DataNode fails stop due to a bad disk (or storage directory)
DataNode fails stop due to a bad disk (or storage directory) Key: HDFS-1223 URL: https://issues.apache.org/jira/browse/HDFS-1223 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20.1 Reporter: Thanh Do A datanode can store block files in multiple volumes. If a datanode sees a bad volume during startup (i.e., it faces an exception when accessing that volume), it simply fail-stops, making all block files stored in the other, healthy volumes inaccessible. Consequently, these lost replicas will later be regenerated on other datanodes. If a datanode were able to mark the bad disk and continue working with the healthy ones, this would increase availability and avoid unnecessary regeneration. As an extreme example, consider one datanode which has 2 volumes, V1 and V2, each containing about 1 block file of 64MB. During startup, the datanode gets an exception when accessing V1; it then fail-stops, so both block files must be regenerated later. If the datanode instead marked V1 as bad and continued working with V2, the number of replicas that need to be regenerated would be cut in half. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
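The suggested behavior (mask the bad volume, keep serving from the healthy ones) can be sketched as follows. Volume and check() are hypothetical stand-ins for the datanode's storage abstraction, not the actual FSVolume API:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class VolumeScan {
    // Hypothetical stand-in for a storage volume; check() throws if the
    // underlying disk is bad (the startup exception the report describes).
    interface Volume {
        String name();
        void check() throws IOException;
    }

    // Instead of fail-stopping on the first bad volume at startup, collect
    // the healthy volumes and keep serving their blocks; only the bad
    // volume's replicas then need regeneration elsewhere.
    static List<Volume> healthyVolumes(List<Volume> all) {
        List<Volume> good = new ArrayList<Volume>();
        for (Volume v : all) {
            try {
                v.check();
                good.add(v);
            } catch (IOException e) {
                // mark v as failed; schedule only its replicas for re-replication
            }
        }
        return good;
    }

    public static void main(String[] args) {
        List<Volume> vols = new ArrayList<Volume>();
        vols.add(new Volume() { // V1: the bad disk from the example
            public String name() { return "V1"; }
            public void check() throws IOException { throw new IOException("bad disk"); }
        });
        vols.add(new Volume() { // V2: healthy
            public String name() { return "V2"; }
            public void check() {}
        });
        List<Volume> good = healthyVolumes(vols);
        // Only V1's block file needs regeneration, not V2's as well.
        System.out.println(good.size() + " healthy volume(s): " + good.get(0).name());
    }
}
```

In the two-volume example above, this cuts the regeneration work in half, exactly as the report argues.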
[jira] Created: (HDFS-1222) NameNode fail stop in spite of multiple metadata directories
NameNode fail stop in spite of multiple metadata directories Key: HDFS-1222 URL: https://issues.apache.org/jira/browse/HDFS-1222 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do Despite the ability to configure multiple name directories (to store the fsimage) and edits directories, the NameNode fail-stops most of the time it faces an exception when accessing these directories. The NameNode fail-stops if an exception happens when loading the fsimage, reading fstime, loading the edits log, writing fsimage.ckpt, etc., even though good replicas remain. The NameNode could instead try to work with the other replicas and mark the faulty one. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1221) NameNode unable to start due to stale edits log after a crash
NameNode unable to start due to stale edits log after a crash - Key: HDFS-1221 URL: https://issues.apache.org/jira/browse/HDFS-1221 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: If a crash happens during FSEditLog.createEditLogFile(), the edits log file on disk may be stale. During the next reboot, the NameNode will get an exception when parsing the edits file because of the stale data, leading to an unsuccessful reboot. Note: This is just one example. Since the edits log (and fsimage) carries no checksum, it is vulnerable to corruption too. - Details: The steps to create a new edits log (which we infer from the HDFS code) are: 1) truncate the file to zero size 2) write FSConstants.LAYOUT_VERSION to a buffer 3) insert the end-of-file marker OP_INVALID at the end of the buffer 4) preallocate 1MB of data, filled with zeros 5) flush the buffer to disk. Note that only steps 1, 4, and 5 actually change the data on disk. Now, suppose a crash happens after step 4 but before step 5. On the next reboot, the NameNode will fetch this edits log file (which contains all zeros). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK, because the NameNode has code to handle that case (but we expect LAYOUT_VERSION to be -18, don't we?). Next it parses the operation code, which happens to be 0. Unfortunately, since 0 is the value of OP_ADD, the NameNode expects the parameters corresponding to that operation. It then calls readString to read the path, which throws an exception, leading to a failed reboot. We found this problem at almost the same time as the HDFS developers. Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage; hence, any crash that happens after the truncation but before the renaming leads to data loss. 
Detailed description can be found here: https://issues.apache.org/jira/browse/HDFS-955 This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
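The parsing failure described above can be reproduced in miniature: a zero-filled edits file yields layout version 0 and a first opcode of 0, which collides with OP_ADD. This is a minimal sketch assuming the constants stated in the report (LAYOUT_VERSION = -18, OP_ADD = 0); parseHeader is an illustrative helper, not the actual FSEditLog parser.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class StaleEditsDemo {
    // Values taken from the report; OP_ADD being opcode 0 is what makes
    // a zero-filled edits file so dangerous.
    static final int EXPECTED_LAYOUT_VERSION = -18;
    static final byte OP_ADD = 0;

    // Parse just the layout version and first opcode of an edits file image.
    static int[] parseHeader(byte[] edits) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(edits));
        int layoutVersion = in.readInt(); // 0 for a zero-filled file
        byte opcode = in.readByte();      // also 0, i.e. OP_ADD
        in.close();
        return new int[] { layoutVersion, opcode };
    }

    public static void main(String[] args) throws IOException {
        // Crash between step 4 (preallocate zeros) and step 5 (flush the
        // real header): the on-disk file is 1MB of zeros.
        byte[] staleEdits = new byte[1024 * 1024];
        int[] h = parseHeader(staleEdits);
        System.out.println("layout version = " + h[0]
            + " (expected " + EXPECTED_LAYOUT_VERSION + ")");
        System.out.println("first opcode = " + h[1]
            + (h[1] == OP_ADD ? ", which equals OP_ADD" : ""));
        // The parser now expects OP_ADD's arguments (a path string, etc.)
        // and throws when it tries to read them out of zeros.
    }
}
```

A checksum over the header, or the write-then-rename pattern, would let the NameNode distinguish "never written" from "valid but empty log" here.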
[jira] Updated: (HDFS-1221) NameNode unable to start due to stale edits log after a crash
[ https://issues.apache.org/jira/browse/HDFS-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1221: --- Description: - Summary: If a crash happens during FSEditLog.createEditLogFile(), the edits log file on disk may be stale. During next reboot, NameNode will get an exception when parsing the edits file, because of stale data, leading to unsuccessful reboot. Note: This is just one example. Since we see that edits log (and fsimage) does not have checksum, they are vulnerable to corruption too. - Details: The steps to create new edits log (which we infer from HDFS code) are: 1) truncate the file to zero size 2) write FSConstants.LAYOUT_VERSION to buffer 3) insert the end-of-file marker OP_INVALID to the end of the buffer 4) preallocate 1MB of data, and fill the data with 0 5) flush the buffer to disk Note that only in step 1, 4, 5, the data on disk is actually changed. Now, suppose a crash happens after step 4, but before step 5. In the next reboot, NameNode will fetch this edits log file (which contains all 0). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK, because NameNode has code to handle that case. (but we expect LAYOUT_VERSION to be -18, don't we). Now it parses the operation code, which happens to be 0. Unfortunately, since 0 is the value for OP_ADD, the NameNode expects some parameters corresponding to that operation. Now NameNode calls readString to read the path, which throws an exception leading to a failed reboot. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: If a crash happens during FSEditLog.createEditLogFile(), the edits log file on disk may be stale. 
During next reboot, NameNode will get an exception when parsing the edits file, because of stale data, leading to unsuccessful reboot. Note: This is just one example. Since we see that edits log (and fsimage) does not have checksum, they are vulnerable to corruption too. - Details: The steps to create new edits log (which we infer from HDFS code) are: 1) truncate the file to zero size 2) write FSConstants.LAYOUT_VERSION to buffer 3) insert the end-of-file marker OP_INVALID to the end of the buffer 4) preallocate 1MB of data, and fill the data with 0 5) flush the buffer to disk Note that only in step 1, 4, 5, the data on disk is actually changed. Now, suppose a crash happens after step 4, but before step 5. In the next reboot, NameNode will fetch this edits log file (which contains all 0). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK, because NameNode has code to handle that case. (but we expect LAYOUT_VERSION to be -18, don't we). Now it parses the operation code, which happens to be 0. Unfortunately, since 0 is the value for OP_ADD, the NameNode expects some parameters corresponding to that operation. Now NameNode calls readString to read the path, which throws an exception leading to a failed reboot. We found this problem almost at the same time as HDFS developers. Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage. Hence, any crash happens after the truncation but before the renaming will lead to a data loss. 
Detailed description can be found here: https://issues.apache.org/jira/browse/HDFS-955 This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) Component/s: name-node > NameNode unable to start due to stale edits log after a crash > - > > Key: HDFS-1221 > URL: https://issues.apache.org/jira/browse/HDFS-1221 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: > If a crash happens during FSEditLog.createEditLogFile(), the > edits log file on disk may be stale. During next reboot, NameNode > will get an exception when parsing the edits file, because of stale data, > leading to unsuccessful reboot. > Note: This is just one example. Since we see that edits log (and fsimage) > does not have checksum, they are vulnerable to corruption too. > > - Details: > The steps to create new edits log (which we infer from HDFS code) are: > 1) truncate the file to zero size > 2) write FSConstants.LAYOUT_VERSION to buffer > 3) insert the end-of-file marker OP_INVALID to the end of the buffer > 4) preallocate 1MB of data, and fill the data with 0 > 5) flush the buffer to disk > > Note that only in step 1, 4, 5, the data on disk is actually changed. > Now, suppose a crash happ
[jira] Updated: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1220: --- Description: - Summary: updating the fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, the next time the NameNode reboots it will read a stale fstime and hence be unable to start successfully. - Details: Below is the code for updating the fstime file on disk:

void writeCheckpointTime(StorageDirectory sd) throws IOException {
  if (checkpointTime < 0L)
    return; // do not write negative time
  File timeFile = getImageFile(sd, NameNodeFile.TIME);
  if (timeFile.exists()) { timeFile.delete(); }
  DataOutputStream out = new DataOutputStream(new FileOutputStream(timeFile));
  try {
    out.writeLong(checkpointTime);
  } finally {
    out.close();
  }
}

Basically, this involves 3 steps: 1) delete the fstime file (timeFile.delete()) 2) truncate the fstime file (new FileOutputStream(timeFile)) 3) write the new time to the fstime file (out.writeLong(checkpointTime)). If a crash happens after step 2 and before step 3, on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: updating fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, next time when NameNode reboots, it will read stale fstime, hence unable to start successfully. - Details: Basically, this involve 3 steps: 1) delete fstime file (timeFile.delete()) 2) truncate fstime file (new FileOutputStream(timeFile)) 3) write new time to fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, in the next reboot, NameNode got an exception when reading the time (8 byte) from an empty fstime file. 
This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu > Namenode unable to start due to truncated fstime > > > Key: HDFS-1220 > URL: https://issues.apache.org/jira/browse/HDFS-1220 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20.1 >Reporter: Thanh Do > > - Summary: updating fstime file on disk is not atomic, so it is possible that > if a crash happens in the middle, next time when NameNode reboots, it will > read stale fstime, hence unable to start successfully. > > - Details: > Below is the code for updating fstime file on disk > void writeCheckpointTime(StorageDirectory sd) throws IOException { > if (checkpointTime < 0L) > return; // do not write negative time > File timeFile = getImageFile(sd, NameNodeFile.TIME); > if (timeFile.exists()) { timeFile.delete(); } > DataOutputStream out = new DataOutputStream( > new > FileOutputStream(timeFile)); > try { > out.writeLong(checkpointTime); > } finally { > out.close(); > } > } > Basically, this involve 3 steps: > 1) delete fstime file (timeFile.delete()) > 2) truncate fstime file (new FileOutputStream(timeFile)) > 3) write new time to fstime file (out.writeLong(checkpointTime)) > If a crash happens after step 2 and before step 3, in the next reboot, > NameNode > got an exception when reading the time (8 byte) from an empty fstime file. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
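The standard way to close the crash window between steps 2 and 3 is write-then-rename: write the new time to a temporary file, then rename it over fstime. The sketch below illustrates that pattern; it is a hypothetical fix with illustrative names, not the actual HDFS patch.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class AtomicTimeWrite {

    // Hypothetical crash-safe variant of writeCheckpointTime: the new
    // value is fully written to a temp file before fstime is touched, so
    // fstime is never observed empty. A crash before the rename leaves
    // the old, complete fstime; a crash between delete and rename leaves
    // no fstime but a complete .tmp that recovery code could fall back to.
    static void writeCheckpointTimeAtomically(File timeFile, long checkpointTime)
            throws IOException {
        if (checkpointTime < 0L) {
            return; // do not write negative time
        }
        File tmp = new File(timeFile.getParentFile(), timeFile.getName() + ".tmp");
        DataOutputStream out = new DataOutputStream(new FileOutputStream(tmp));
        try {
            out.writeLong(checkpointTime);
        } finally {
            out.close();
        }
        // File.renameTo may refuse to overwrite on some platforms, so
        // remove the target first; the tmp file still holds the data.
        if (timeFile.exists() && !timeFile.delete()) {
            throw new IOException("cannot delete " + timeFile);
        }
        if (!tmp.renameTo(timeFile)) {
            throw new IOException("cannot rename " + tmp + " to " + timeFile);
        }
    }

    static long readCheckpointTime(File timeFile) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(timeFile));
        try {
            return in.readLong();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("fstime", null);
        writeCheckpointTimeAtomically(f, 42L);
        System.out.println(readCheckpointTime(f));
    }
}
```

On POSIX file systems the rename itself is atomic, so the 8-byte read at startup always sees either the old time or the new one, never a truncated file.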
[jira] Updated: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thanh Do updated HDFS-1220: --- Description: - Summary: updating the fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, the next time the NameNode reboots it will read a stale fstime and hence be unable to start successfully. - Details: Basically, this involves 3 steps: 1) delete the fstime file (timeFile.delete()) 2) truncate the fstime file (new FileOutputStream(timeFile)) 3) write the new time to the fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, then on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) was: - Summary: updating the fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, the next time the NameNode reboots it will read a stale fstime and hence be unable to start successfully. - Details: Below is the code for updating the fstime file on disk:

void writeCheckpointTime(StorageDirectory sd) throws IOException {
  if (checkpointTime < 0L)
    return; // do not write negative time
  File timeFile = getImageFile(sd, NameNodeFile.TIME);
  if (timeFile.exists()) { timeFile.delete(); }
  DataOutputStream out = new DataOutputStream(
      new FileOutputStream(timeFile));
  try {
    out.writeLong(checkpointTime);
  } finally {
    out.close();
  }
}

Basically, this involves 3 steps: 1) delete the fstime file (timeFile.delete()) 2) truncate the fstime file (new FileOutputStream(timeFile)) 3) write the new time to the fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, then on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file. 
This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) > Namenode unable to start due to truncated fstime > > > Key: HDFS-1220 > URL: https://issues.apache.org/jira/browse/HDFS-1220 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node > Affects Versions: 0.20.1 > Reporter: Thanh Do > > - Summary: updating the fstime file on disk is not atomic, so it is possible that > if a crash happens in the middle, the next time the NameNode reboots it will > read a stale fstime and hence be unable to start successfully. > > - Details: > Basically, this involves 3 steps: > 1) delete the fstime file (timeFile.delete()) > 2) truncate the fstime file (new FileOutputStream(timeFile)) > 3) write the new time to the fstime file (out.writeLong(checkpointTime)) > If a crash happens after step 2 and before step 3, then on the next reboot the NameNode > gets an exception when reading the time (8 bytes) from an empty fstime file. > This bug was found by our Failure Testing Service framework: > http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html > For questions, please email us: Thanh Do (than...@cs.wisc.edu) and > Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1220) Namenode unable to start due to truncated fstime
Namenode unable to start due to truncated fstime Key: HDFS-1220 URL: https://issues.apache.org/jira/browse/HDFS-1220 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.1 Reporter: Thanh Do - Summary: updating the fstime file on disk is not atomic, so it is possible that if a crash happens in the middle, the next time the NameNode reboots it will read a stale fstime and hence be unable to start successfully. - Details: Below is the code for updating the fstime file on disk:

void writeCheckpointTime(StorageDirectory sd) throws IOException {
  if (checkpointTime < 0L)
    return; // do not write negative time
  File timeFile = getImageFile(sd, NameNodeFile.TIME);
  if (timeFile.exists()) { timeFile.delete(); }
  DataOutputStream out = new DataOutputStream(
      new FileOutputStream(timeFile));
  try {
    out.writeLong(checkpointTime);
  } finally {
    out.close();
  }
}

Basically, this involves 3 steps: 1) delete the fstime file (timeFile.delete()) 2) truncate the fstime file (new FileOutputStream(timeFile)) 3) write the new time to the fstime file (out.writeLong(checkpointTime)) If a crash happens after step 2 and before step 3, then on the next reboot the NameNode gets an exception when reading the time (8 bytes) from an empty fstime file. This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
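The three-step sequence above leaves a window in which fstime exists but is empty. A conventional way to close that window is to write the new value to a temporary file and rename it into place; the sketch below illustrates that idea (the method name, the .tmp naming, and the delete-then-rename fallback are assumptions for illustration, not the actual HDFS fix):

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class AtomicTimeWrite {

    // Sketch of an atomic variant of writeCheckpointTime: write the new value
    // to a temporary file, then rename it over fstime. A crash at any point
    // leaves either the old complete file or the new complete file on disk,
    // never an empty one, because the truncating FileOutputStream is only
    // ever opened on the temporary file.
    static void writeCheckpointTimeAtomically(File timeFile, long checkpointTime)
            throws IOException {
        if (checkpointTime < 0L) {
            return; // do not write negative time (mirrors the original guard)
        }
        File tmp = new File(timeFile.getParentFile(), timeFile.getName() + ".tmp");
        DataOutputStream out = new DataOutputStream(new FileOutputStream(tmp));
        try {
            out.writeLong(checkpointTime);
            out.flush();
        } finally {
            out.close();
        }
        // On POSIX filesystems rename atomically replaces the target; on
        // platforms where renameTo cannot replace an existing file, fall back
        // to delete-then-rename (which reintroduces a small window there).
        if (!tmp.renameTo(timeFile)) {
            if (!timeFile.delete() || !tmp.renameTo(timeFile)) {
                throw new IOException("failed to rename " + tmp + " to " + timeFile);
            }
        }
    }
}
```

With this ordering, a reboot-time reader only ever observes a complete 8-byte fstime, old or new, since the empty file produced by `new FileOutputStream` never carries the final name.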
[jira] Created: (HDFS-1219) Data Loss due to edits log truncation
Data Loss due to edits log truncation - Key: HDFS-1219 URL: https://issues.apache.org/jira/browse/HDFS-1219 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.2 Reporter: Thanh Do We found this problem almost at the same time as the HDFS developers. Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage. Hence, any crash that happens after the truncation but before the renaming will lead to data loss. A detailed description can be found here: https://issues.apache.org/jira/browse/HDFS-955 This bug was found by our Failure Testing Service framework: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879633#action_12879633 ] Scott Carey commented on HDFS-1114: --- bq. # Capacity should be divided by a reference size 8 or 4 depending on the 64bit or 32bit java version What about -XX:+UseCompressedOops ? All users should be using this flag on a 64-bit JVM to save a lot of space. It only works up to -Xmx32G though; beyond that it's large pointers again. > Reducing NameNode memory usage by an alternate hash table > - > > Key: HDFS-1114 > URL: https://issues.apache.org/jira/browse/HDFS-1114 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Reporter: Tsz Wo (Nicholas), SZE >Assignee: Tsz Wo (Nicholas), SZE > Attachments: GSet20100525.pdf, gset20100608.pdf, > h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, > h1114_20100616b.patch > > > NameNode uses a java.util.HashMap to store BlockInfo objects. When there are > many blocks in HDFS, this map uses a lot of memory in the NameNode. We may > optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HDFS-1211) 0.20 append: Block receiver should not log "rewind" packets at INFO level
[ https://issues.apache.org/jira/browse/HDFS-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur resolved HDFS-1211. Resolution: Fixed I just committed this. Thanks Todd! > 0.20 append: Block receiver should not log "rewind" packets at INFO level > - > > Key: HDFS-1211 > URL: https://issues.apache.org/jira/browse/HDFS-1211 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Minor > Fix For: 0.20-append > > Attachments: hdfs-1211.txt > > > In the 0.20 append implementation, it logs an INFO level message for every > packet that "rewinds" the end of the block file. This is really noisy for > applications like HBase which sync every edit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1209) Add conf dfs.client.block.recovery.retries to configure number of block recovery attempts
[ https://issues.apache.org/jira/browse/HDFS-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879630#action_12879630 ] dhruba borthakur commented on HDFS-1209: It still does not apply cleanly, can you pl post a new patch > Add conf dfs.client.block.recovery.retries to configure number of block > recovery attempts > - > > Key: HDFS-1209 > URL: https://issues.apache.org/jira/browse/HDFS-1209 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Attachments: hdfs-1209.txt > > > This variable is referred to in the TestFileAppend4 tests, but it isn't > actually looked at by DFSClient (I'm betting this is in FB's branch). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HDFS-1210) DFSClient should log exception when block recovery fails
[ https://issues.apache.org/jira/browse/HDFS-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur resolved HDFS-1210. Fix Version/s: 0.20-append Resolution: Fixed I just committed this. Thanks Todd. > DFSClient should log exception when block recovery fails > > > Key: HDFS-1210 > URL: https://issues.apache.org/jira/browse/HDFS-1210 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.20-append, 0.20.2 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Trivial > Fix For: 0.20-append > > Attachments: hdfs-1210.txt > > > Right now we just retry without necessarily showing the exception. It can be > useful to see what the error was that prevented the recovery RPC from > succeeding. > (I believe this only applies in 0.20 style of block recovery) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1118) DFSOutputStream socket leak when cannot connect to DataNode
[ https://issues.apache.org/jira/browse/HDFS-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879626#action_12879626 ] dhruba borthakur commented on HDFS-1118: Code looks good to me. I will commit this to trunk. > DFSOutputStream socket leak when cannot connect to DataNode > --- > > Key: HDFS-1118 > URL: https://issues.apache.org/jira/browse/HDFS-1118 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append, 0.20.1, 0.20.2 >Reporter: Zheng Shao >Assignee: Zheng Shao > Fix For: 0.20-append > > Attachments: HDFS-1118.1.patch, HDFS-1118.2.patch > > > The offending code is in {{DFSOutputStream.nextBlockOutputStream}} > This function retries several times to call {{createBlockOutputStream}}. Each > time when it fails, it leaves a {{Socket}} object in {{DFSOutputStream.s}}. > That object is never closed, but overwritten the next time > {{createBlockOutputStream}} is called. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
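The leak described in HDFS-1118 is a generic retry-loop pattern: a socket field is overwritten on each attempt while the previous object is never closed. A minimal sketch of the fixed shape (class and method names are simplified stand-ins, not the actual `DFSOutputStream.nextBlockOutputStream` code):

```java
import java.io.Closeable;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class RetryWithCleanup {

    private Socket s; // plays the role of DFSOutputStream.s

    // The buggy shape overwrites 's' on every attempt without closing the
    // previous Socket, leaking one file descriptor per failed attempt.
    // The fix is to close the socket from the prior attempt before
    // reassigning the field, and to close the last one on final failure.
    boolean connectWithRetries(String host, int port, int retries) {
        for (int i = 0; i < retries; i++) {
            closeQuietly(s); // release the socket left over from the last attempt
            s = new Socket();
            try {
                s.connect(new InetSocketAddress(host, port), 1000);
                return true; // success: caller now owns the open socket
            } catch (IOException e) {
                // fall through and retry; 's' is closed at the top of the loop
            }
        }
        closeQuietly(s); // final failure: do not leak the last socket either
        s = null;
        return false;
    }

    static void closeQuietly(Closeable c) {
        if (c != null) {
            try {
                c.close();
            } catch (IOException ignored) {
                // best-effort cleanup
            }
        }
    }
}
```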
[jira] Resolved: (HDFS-1204) 0.20: Lease expiration should recover single files, not entire lease holder
[ https://issues.apache.org/jira/browse/HDFS-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur resolved HDFS-1204. Resolution: Fixed > 0.20: Lease expiration should recover single files, not entire lease holder > --- > > Key: HDFS-1204 > URL: https://issues.apache.org/jira/browse/HDFS-1204 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: sam rash > Fix For: 0.20-append > > Attachments: hdfs-1204.txt, hdfs-1204.txt > > > This was brought up in HDFS-200 but didn't make it into the branch on Apache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HDFS-1207) 0.20-append: stallReplicationWork should be volatile
[ https://issues.apache.org/jira/browse/HDFS-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur resolved HDFS-1207. Fix Version/s: 0.20-append Resolution: Fixed I just committed this. Thanks Todd! > 0.20-append: stallReplicationWork should be volatile > > > Key: HDFS-1207 > URL: https://issues.apache.org/jira/browse/HDFS-1207 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: hdfs-1207.txt > > > the stallReplicationWork member in FSNamesystem is accessed by multiple > threads without synchronization, but isn't marked volatile. I believe this is > responsible for about 1% failure rate on > TestFileAppend4.testAppendSyncChecksum* on my 8-core test boxes (looking at > logs I see replication happening even though we've supposedly disabled it) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
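The hazard behind HDFS-1207 is the Java memory model's visibility rule: without `volatile` or synchronization, a write to a plain boolean by one thread may never become visible to another. A minimal sketch of the pattern (only the field name comes from the issue; the surrounding class is illustrative):

```java
public class ReplicationToggle {

    // Without 'volatile' there is no happens-before edge between the writer
    // and the replication threads, so a worker may cache 'false' indefinitely
    // and keep replicating even after a test has stalled replication.
    private volatile boolean stallReplicationWork = false;

    void stall() {
        stallReplicationWork = true;
    }

    void resume() {
        stallReplicationWork = false;
    }

    boolean shouldSkipReplication() {
        return stallReplicationWork; // read by worker threads without locking
    }
}
```

Marking the field `volatile` is enough here because it is a single independent flag; compound read-modify-write state would instead need `synchronized` or an atomic class.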
[jira] Commented: (HDFS-1209) Add conf dfs.client.block.recovery.retries to configure number of block recovery attempts
[ https://issues.apache.org/jira/browse/HDFS-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879620#action_12879620 ] Todd Lipcon commented on HDFS-1209: --- This should apply after HDFS-1210 - can you commit that one first? > Add conf dfs.client.block.recovery.retries to configure number of block > recovery attempts > - > > Key: HDFS-1209 > URL: https://issues.apache.org/jira/browse/HDFS-1209 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Attachments: hdfs-1209.txt > > > This variable is referred to in the TestFileAppend4 tests, but it isn't > actually looked at by DFSClient (I'm betting this is in FB's branch). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1204) 0.20: Lease expiration should recover single files, not entire lease holder
[ https://issues.apache.org/jira/browse/HDFS-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879618#action_12879618 ] Todd Lipcon commented on HDFS-1204: --- I think it does not - it looks like it was a regression caused by HDFS-200 in branch 20 append. > 0.20: Lease expiration should recover single files, not entire lease holder > --- > > Key: HDFS-1204 > URL: https://issues.apache.org/jira/browse/HDFS-1204 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: sam rash > Fix For: 0.20-append > > Attachments: hdfs-1204.txt, hdfs-1204.txt > > > This was brought up in HDFS-200 but didn't make it into the branch on Apache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1209) Add conf dfs.client.block.recovery.retries to configure number of block recovery attempts
[ https://issues.apache.org/jira/browse/HDFS-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879617#action_12879617 ] dhruba borthakur commented on HDFS-1209: This one does not apply cleanly to 0.20-append. Can you pl post a new patch? > Add conf dfs.client.block.recovery.retries to configure number of block > recovery attempts > - > > Key: HDFS-1209 > URL: https://issues.apache.org/jira/browse/HDFS-1209 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Attachments: hdfs-1209.txt > > > This variable is referred to in the TestFileAppend4 tests, but it isn't > actually looked at by DFSClient (I'm betting this is in FB's branch). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1204) 0.20: Lease expiration should recover single files, not entire lease holder
[ https://issues.apache.org/jira/browse/HDFS-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879616#action_12879616 ] dhruba borthakur commented on HDFS-1204: Sam/Todd : can you pl comment on whether this bug exists in Hadoop trunk? > 0.20: Lease expiration should recover single files, not entire lease holder > --- > > Key: HDFS-1204 > URL: https://issues.apache.org/jira/browse/HDFS-1204 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: sam rash > Fix For: 0.20-append > > Attachments: hdfs-1204.txt, hdfs-1204.txt > > > This was brought up in HDFS-200 but didn't make it into the branch on Apache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1206) TestFiHFlush depends on BlocksMap implementation
[ https://issues.apache.org/jira/browse/HDFS-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879615#action_12879615 ] Tsz Wo (Nicholas), SZE commented on HDFS-1206: -- Saw it failing again. {noformat} Testcase: hFlushFi01_a took 4.553 sec FAILED junit.framework.AssertionFailedError: at org.apache.hadoop.hdfs.TestFiHFlush.runDiskErrorTest(TestFiHFlush.java:56) at org.apache.hadoop.hdfs.TestFiHFlush.hFlushFi01_a(TestFiHFlush.java:72) {noformat} > TestFiHFlush depends on BlocksMap implementation > > > Key: HDFS-1206 > URL: https://issues.apache.org/jira/browse/HDFS-1206 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Tsz Wo (Nicholas), SZE > > When I was testing HDFS-1114, the patch passed all tests except TestFiHFlush. > Then, I tried to print out some debug messages, however, TestFiHFlush > succeeded after added the messages. > TestFiHFlush probably depends on the speed of BlocksMap. If BlocksMap is > slow enough, then it will pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-1114: - Status: Patch Available (was: Open) Hudson does not seem to be working. It did not pick up my previous patch for a long time. Re-submitting. > Reducing NameNode memory usage by an alternate hash table > - > > Key: HDFS-1114 > URL: https://issues.apache.org/jira/browse/HDFS-1114 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Reporter: Tsz Wo (Nicholas), SZE >Assignee: Tsz Wo (Nicholas), SZE > Attachments: GSet20100525.pdf, gset20100608.pdf, > h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, > h1114_20100616b.patch > > > NameNode uses a java.util.HashMap to store BlockInfo objects. When there are > many blocks in HDFS, this map uses a lot of memory in the NameNode. We may > optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-1114: - Status: Open (was: Patch Available) Thanks for the detailed review, Suresh. 1. BlocksMap.java done. 2. LightWeightGSet.java done, all except the following. * remove() - for better readability ... Is the implicit else better than an explicit else? 3. TestGSet.java * In exception tests, ... Catching specific exceptions, but I did not change the messages. * println should use Log.info instead of System.out.println? No, print(..) and println(..) work together. * add some comments to ... * add some comments to ... * add comments to ... Added some more comments. > Reducing NameNode memory usage by an alternate hash table > - > > Key: HDFS-1114 > URL: https://issues.apache.org/jira/browse/HDFS-1114 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Reporter: Tsz Wo (Nicholas), SZE >Assignee: Tsz Wo (Nicholas), SZE > Attachments: GSet20100525.pdf, gset20100608.pdf, > h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, > h1114_20100616b.patch > > > NameNode uses a java.util.HashMap to store BlockInfo objects. When there are > many blocks in HDFS, this map uses a lot of memory in the NameNode. We may > optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-1114: - Attachment: h1114_20100616b.patch h1114_20100616b: - Rewrote the code for capacity computation - Following Java convention, throwing NPE instead of IllegalArgumentException when the parameter is null. - Split toString() into two methods. - Catching specific exceptions and added more comments in the test. > Reducing NameNode memory usage by an alternate hash table > - > > Key: HDFS-1114 > URL: https://issues.apache.org/jira/browse/HDFS-1114 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Reporter: Tsz Wo (Nicholas), SZE >Assignee: Tsz Wo (Nicholas), SZE > Attachments: GSet20100525.pdf, gset20100608.pdf, > h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, > h1114_20100616b.patch > > > NameNode uses a java.util.HashMap to store BlockInfo objects. When there are > many blocks in HDFS, this map uses a lot of memory in the NameNode. We may > optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1000) libhdfs needs to be updated to use the new UGI
[ https://issues.apache.org/jira/browse/HDFS-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879584#action_12879584 ] Suresh Srinivas commented on HDFS-1000: --- +1 patch looks good. > libhdfs needs to be updated to use the new UGI > -- > > Key: HDFS-1000 > URL: https://issues.apache.org/jira/browse/HDFS-1000 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.21.0, 0.22.0 >Reporter: Devaraj Das >Assignee: Devaraj Das >Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > Attachments: fs-javadoc.patch, hdfs-1000-bp20.3.patch, > hdfs-1000-bp20.4.patch, hdfs-1000-bp20.patch, hdfs-1000-trunk.1.patch > > > libhdfs needs to be updated w.r.t the APIs in the new UGI. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879582#action_12879582 ] Suresh Srinivas commented on HDFS-1114: --- # BlocksMap.java #* typo exponient. Should be exponent? #* Capacity should be divided by a reference size, 8 or 4 bytes depending on the 64-bit or 32-bit java version #* Current capacity calculation seems quite complex. Add more explanation on why it is implemented that way. # LightWeightGSet.java #* "which uses a hash table for storing the elements" should this say "uses an array"? #* Add a comment that the size of entries is a power of two #* Throw HadoopIllegalArgumentException instead of IllegalArgumentException (for the 20 version of the patch it could remain IllegalArgumentException) #* remove() - for better readability no need for else if and else since the previous block returns #* toString() - prints all the entries. This is a bad idea if someone passes this object to Log unknowingly. If all the details of the HashMap are needed, we should have some other method such as dump() or printDetails() to do the same. # TestGSet.java #* In exception tests, instead of printing a log when the expected exception happens, print a log in Assert.fail(), like Assert.fail("Expected exception was not thrown"). Checks for exceptions should be more specific, instead of catching Exception. It is also a good idea to document these exceptions in the javadoc for methods in GSet. #* println should use Log.info instead of System.out.println? #* add some comments to classes on what they do/how they are used #* add some comments to GSetTestCase members, denominator etc. 
and constructor #* add comments to testGSet() on what each of the cases is accomplishing > Reducing NameNode memory usage by an alternate hash table > - > > Key: HDFS-1114 > URL: https://issues.apache.org/jira/browse/HDFS-1114 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Reporter: Tsz Wo (Nicholas), SZE >Assignee: Tsz Wo (Nicholas), SZE > Attachments: GSet20100525.pdf, gset20100608.pdf, > h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch > > > NameNode uses a java.util.HashMap to store BlockInfo objects. When there are > many blocks in HDFS, this map uses a lot of memory in the NameNode. We may > optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
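Two of the review points above, dividing the memory budget by the reference size (4 bytes under -XX:+UseCompressedOops or on a 32-bit JVM, 8 bytes otherwise, per the follow-up comment) and keeping the entry count a power of two so that `hash & (capacity - 1)` replaces a modulo, can be sketched as follows. The method name and rounding policy here are assumptions for illustration, not the committed HDFS-1114 code:

```java
public class GSetCapacity {

    // Estimate how many hash-table slots fit in a given byte budget:
    // budgetBytes / referenceSize references, rounded DOWN to a power of two
    // so that (hash & (capacity - 1)) can replace a modulo.
    static int computeCapacity(long budgetBytes, boolean compressedOr32BitRefs) {
        int referenceSize = compressedOr32BitRefs ? 4 : 8; // bytes per object reference
        long slots = budgetBytes / referenceSize;
        // cap at 2^30 to stay within a positive int, then round down to a power of two
        int capacity = Integer.highestOneBit((int) Math.min(slots, 1L << 30));
        return Math.max(capacity, 1); // never return zero slots
    }
}
```

A power-of-two capacity also makes resizing and index masking cheap, which is why most open-addressing tables (including `java.util.HashMap`) use it.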
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879563#action_12879563 ] Todd Lipcon commented on HDFS-1057: --- [for branch 0.20 append, +1 -- I've been running with this for 6 weeks, it works, and looks good!] > Concurrent readers hit ChecksumExceptions if following a writer to very end > of file > --- > > Key: HDFS-1057 > URL: https://issues.apache.org/jira/browse/HDFS-1057 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: data-node >Affects Versions: 0.20-append, 0.21.0, 0.22.0 >Reporter: Todd Lipcon >Assignee: sam rash >Priority: Blocker > Attachments: conurrent-reader-patch-1.txt, > conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, > hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt > > > In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before > calling flush(). Therefore, if there is a concurrent reader, it's possible to > race here - the reader will see the new length while those bytes are still in > the buffers of BlockReceiver. Thus the client will potentially see checksum > errors or EOFs. Additionally, the last checksum chunk of the file is made > accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
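The race in HDFS-1057 is an ordering bug: the reader-visible length is advanced before the bytes leave the BlockReceiver's buffers. A toy illustration of the corrected ordering, with an in-memory stream standing in for the block file (class and method names are illustrative, not the actual patch):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class PublishAfterFlush {

    private final ByteArrayOutputStream disk = new ByteArrayOutputStream(); // stands in for the block file
    private final BufferedOutputStream out = new BufferedOutputStream(disk);

    // What a concurrent reader is allowed to read up to. Advancing this
    // BEFORE flush() lets a reader chase a length whose bytes are still in
    // 'out', which is exactly the EOF/ChecksumException race in the report.
    private volatile long visibleLength = 0;

    void receivePacket(byte[] data) throws IOException {
        out.write(data);
        out.flush();                  // 1) push the bytes to the block file first
        visibleLength += data.length; // 2) only then advertise the new length
    }

    long getVisibleLength() {
        return visibleLength;
    }

    int bytesOnDisk() {
        return disk.size();
    }
}
```

With this ordering a reader that trusts `getVisibleLength()` can never read past what has actually been written out, which is the invariant the setBytesOnDisk-before-flush code violated.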
[jira] Resolved: (HDFS-1141) completeFile does not check lease ownership
[ https://issues.apache.org/jira/browse/HDFS-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur resolved HDFS-1141. Resolution: Fixed Pulled into hadoop-0.20-append > completeFile does not check lease ownership > --- > > Key: HDFS-1141 > URL: https://issues.apache.org/jira/browse/HDFS-1141 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.20-append, 0.22.0 > > Attachments: hdfs-1141-branch20.txt, hdfs-1141.txt, hdfs-1141.txt > > > completeFile should check that the caller still owns the lease of the file > that it's completing. This is for the 'testCompleteOtherLeaseHoldersFile' > case in HDFS-1139. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HDFS-142) In 0.20, move blocks being written into a blocksBeingWritten directory
[ https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur resolved HDFS-142. --- Resolution: Fixed I have committed this. Thanks Sam, Nicolas and Todd. > In 0.20, move blocks being written into a blocksBeingWritten directory > -- > > Key: HDFS-142 > URL: https://issues.apache.org/jira/browse/HDFS-142 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append >Reporter: Raghu Angadi >Assignee: dhruba borthakur >Priority: Blocker > Fix For: 0.20-append > > Attachments: appendFile-recheck-lease.txt, appendQuestions.txt, > deleteTmp.patch, deleteTmp2.patch, deleteTmp5_20.txt, deleteTmp5_20.txt, > deleteTmp_0.18.patch, dont-recover-rwr-when-rbw-available.txt, > handleTmp1.patch, hdfs-142-commitBlockSynchronization-unknown-datanode.txt, > HDFS-142-deaddn-fix.patch, HDFS-142-finalize-fix.txt, > hdfs-142-minidfs-fix-from-409.txt, > HDFS-142-multiple-blocks-datanode-exception.patch, > hdfs-142-recovery-reassignment-and-bbw-cleanup.txt, hdfs-142-testcases.txt, > hdfs-142-testleaserecovery-fix.txt, HDFS-142_20-append2.patch, > HDFS-142_20.patch, recentInvalidateSets-assertion-fix.txt, > recover-rbw-v2.txt, testfileappend4-deaddn.txt, > validateBlockMetaData-synchronized.txt > > > Before 0.18, when Datanode restarts, it deletes files under data-dir/tmp > directory since these files are not valid anymore. But in 0.18 it moves these > files to normal directory incorrectly making them valid blocks. One of the > following would work : > - remove the tmp files during upgrade, or > - if the files under /tmp are in pre-18 format (i.e. no generation), delete > them. > Currently effect of this bug is that, these files end up failing block > verification and eventually get deleted. But cause incorrect over-replication > at the namenode before that. > Also it looks like our policy regd treating files under tmp needs to be > defined better. Right now there are probably one or two more bugs with it. 
> Dhruba, please file them if you remember.
[jira] Updated: (HDFS-1207) 0.20-append: stallReplicationWork should be volatile
[ https://issues.apache.org/jira/browse/HDFS-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1207: -- Attachment: hdfs-1207.txt > 0.20-append: stallReplicationWork should be volatile > > > Key: HDFS-1207 > URL: https://issues.apache.org/jira/browse/HDFS-1207 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Attachments: hdfs-1207.txt > > > the stallReplicationWork member in FSNamesystem is accessed by multiple > threads without synchronization, but isn't marked volatile. I believe this is > responsible for about 1% failure rate on > TestFileAppend4.testAppendSyncChecksum* on my 8-core test boxes (looking at > logs I see replication happening even though we've supposedly disabled it) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
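The fix here is the standard Java Memory Model rule: a plain boolean mutated by one thread and polled by another has no visibility guarantee, while a volatile one does. A minimal sketch of the pattern (the field name is borrowed from the report; this is not the FSNamesystem code):

```java
// Sketch of the visibility fix in HDFS-1207 (not the FSNamesystem code):
// without volatile, a write by one thread to a plain boolean may never be
// seen by a reader thread; declaring the flag volatile guarantees visibility.
public class StallFlag {
    private volatile boolean stallReplicationWork = false;

    void stall() { stallReplicationWork = true; }

    boolean isStalled() { return stallReplicationWork; }

    public static void main(String[] args) throws InterruptedException {
        StallFlag f = new StallFlag();
        Thread writer = new Thread(f::stall); // mutation happens on another thread
        writer.start();
        writer.join();                        // join() plus volatile: write is visible
        System.out.println(f.isStalled());    // true
    }
}
```

Without the volatile keyword the replication monitor can keep running on a stale cached value, which matches the symptom in the report (replication proceeding although it was supposedly disabled).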
[jira] Updated: (HDFS-1194) Secondary namenode fails to fetch the image from the primary
[ https://issues.apache.org/jira/browse/HDFS-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1194: -- Affects Version/s: (was: 0.20-append) Removing append tag, since it's unrelated. > Secondary namenode fails to fetch the image from the primary > > > Key: HDFS-1194 > URL: https://issues.apache.org/jira/browse/HDFS-1194 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: Java(TM) SE Runtime Environment (build 1.6.0_14-b08) > Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode) > CentOS 5 >Reporter: Dmytro Molkov >Assignee: Dmytro Molkov > > We just hit the problem described in HDFS-1024 again. > After more investigation of the underlying problems with > CancelledKeyException there are some findings: > One of the symptoms: the transfer becomes really slow (it does 700 kb/s) when > I am doing the fetch using wget. At the same time disk and network are OK > since I can copy at 50 mb/s using scp. > I was taking jstacks of the namenode while the transfer is in process and we > found that every stack trace has one thread of jetty sitting in this place: > {code} >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:452) > at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185) > at > org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124) > at > org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707) > at > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522) > {code} > Here is the jetty code that corresponds to this: > {code} > // Look for JVM bug > if (selected==0 && wait>0 && (now-before)<wait/2 && _selector.selectedKeys().size()==0) > { > if (_jvmBug++>5) // TODO tune or configure this > { > // Probably JVM BUG!
> > Iterator iter = _selector.keys().iterator(); > while(iter.hasNext()) > { > key = (SelectionKey) iter.next(); > if (key.isValid()&&key.interestOps()==0) > { > key.cancel(); > } > } > try > { > Thread.sleep(20); // tune or configure this > } > catch (InterruptedException e) > { > Log.ignore(e); > } > } > } > {code} > Based on this it is obvious we are hitting a jetty workaround for a JVM bug > that doesn't handle select() properly. > There is a jetty JIRA for this http://jira.codehaus.org/browse/JETTY-937 (it > actually introduces the workaround for the JVM bug that we are hitting). > They say that the problem was fixed in 6.1.22; there is a person on that JIRA > also saying that switching to using SocketConnector instead of > SelectChannelConnector helped in their case. > Since we are hitting the same bug in our world, we should either adopt the newer Jetty version with its improved workaround (it might not help if we are still hitting the bug constantly, but the workaround should be better), > or switch to using SocketConnector, which would eliminate the problem completely, although I am not sure what problems that will bring. > The Java version we are running is in Environment. > Any thoughts?
[jira] Updated: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
[ https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1218: -- Attachment: hdfs-1281.txt Here's a patch, but won't apply on top of the branch currently. Requires HDFS-1057 and possibly some other FSDataset patches first to apply without conflict (possibly HDFS-1056) > 20 append: Blocks recovered on startup should be treated with lower priority > during block synchronization > - > > Key: HDFS-1218 > URL: https://issues.apache.org/jira/browse/HDFS-1218 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Fix For: 0.20-append > > Attachments: hdfs-1281.txt > > > When a datanode experiences power loss, it can come back up with truncated > replicas (due to local FS journal replay). Those replicas should not be > allowed to truncate the block during block synchronization if there are other > replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1141) completeFile does not check lease ownership
[ https://issues.apache.org/jira/browse/HDFS-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-1141: --- Fix Version/s: 0.20-append > completeFile does not check lease ownership > --- > > Key: HDFS-1141 > URL: https://issues.apache.org/jira/browse/HDFS-1141 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.20-append, 0.22.0 > > Attachments: hdfs-1141-branch20.txt, hdfs-1141.txt, hdfs-1141.txt > > > completeFile should check that the caller still owns the lease of the file > that it's completing. This is for the 'testCompleteOtherLeaseHoldersFile' > case in HDFS-1139. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
20 append: Blocks recovered on startup should be treated with lower priority during block synchronization - Key: HDFS-1218 URL: https://issues.apache.org/jira/browse/HDFS-1218 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.20-append When a datanode experiences power loss, it can come back up with truncated replicas (due to local FS journal replay). Those replicas should not be allowed to truncate the block during block synchronization if there are other replicas from DNs that have _not_ restarted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
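The policy requested above can be sketched as follows (hypothetical names, not the actual HDFS-1218 patch): replicas recovered on datanode startup may have been truncated by journal replay, so they only take part in choosing the synchronization length when no replica from a non-restarted datanode exists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the requested block-synchronization policy (names
// are illustrative, not the actual HDFS-1218 patch): prefer replicas from
// datanodes that have NOT restarted; fall back to recovered-on-startup
// replicas only when nothing else is available.
public class SyncPolicySketch {
    static class Replica {
        final long length;
        final boolean recoveredOnStartup;
        Replica(long length, boolean recoveredOnStartup) {
            this.length = length;
            this.recoveredOnStartup = recoveredOnStartup;
        }
    }

    static long chooseSyncLength(List<Replica> replicas) {
        List<Replica> preferred = new ArrayList<>();
        for (Replica r : replicas) {
            if (!r.recoveredOnStartup) preferred.add(r); // trust non-restarted DNs first
        }
        List<Replica> candidates = preferred.isEmpty() ? replicas : preferred;
        long min = Long.MAX_VALUE; // synchronization truncates to the shortest candidate
        for (Replica r : candidates) min = Math.min(min, r.length);
        return min;
    }

    public static void main(String[] args) {
        List<Replica> rs = new ArrayList<>();
        rs.add(new Replica(1000, false)); // healthy datanode
        rs.add(new Replica(512, true));   // restarted datanode, truncated replica
        System.out.println(chooseSyncLength(rs)); // 1000: truncated replica ignored
    }
}
```

The point of the lower priority is exactly the last line of the example: a truncated replica must not drag the block length down while an intact replica survives.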
[jira] Commented: (HDFS-1216) Update to JUnit 4 in branch 20 append
[ https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879524#action_12879524 ] Tom White commented on HDFS-1216: - HADOOP-6800 will upgrade to JUnit 4.8.1, so perhaps you'd like to use that. > Update to JUnit 4 in branch 20 append > - > > Key: HDFS-1216 > URL: https://issues.apache.org/jira/browse/HDFS-1216 > Project: Hadoop HDFS > Issue Type: Task > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: junit-4.5.txt > > > A lot of the append tests are JUnit 4 style. We should upgrade in branch - > Junit 4 is entirely backward compatible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HDFS-1216) Update to JUnit 4 in branch 20 append
[ https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur resolved HDFS-1216. Resolution: Fixed I just committed this. Thanks Todd! > Update to JUnit 4 in branch 20 append > - > > Key: HDFS-1216 > URL: https://issues.apache.org/jira/browse/HDFS-1216 > Project: Hadoop HDFS > Issue Type: Task > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: junit-4.5.txt > > > A lot of the append tests are JUnit 4 style. We should upgrade in branch - > Junit 4 is entirely backward compatible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1217) Some methods in the NameNode should not be public
[ https://issues.apache.org/jira/browse/HDFS-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879517#action_12879517 ] Tsz Wo (Nicholas), SZE commented on HDFS-1217: -- Here is some suggestions: {code} +++ src/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java (working copy) @@ -1137,7 +1137,7 @@ * @param nodeReg data node registration * @throws IOException */ - public void verifyRequest(NodeRegistration nodeReg) throws IOException { + void verifyRequest(NodeRegistration nodeReg) throws IOException { verifyVersion(nodeReg.getVersion()); if (!namesystem.getRegistrationID().equals(nodeReg.getRegistrationID())) throw new UnregisteredNodeException(nodeReg); @@ -1149,19 +1149,12 @@ * @param version * @throws IOException */ - public void verifyVersion(int version) throws IOException { + private static void verifyVersion(int version) throws IOException { if (version != LAYOUT_VERSION) throw new IncorrectVersionException(version, "data node"); } - /** - * Returns the name of the fsImage file - */ - public File getFsImageName() throws IOException { -return getFSImage().getFsImageName(); - } - - public FSImage getFSImage() { + FSImage getFSImage() { return namesystem.dir.fsImage; } @@ -1169,7 +1162,7 @@ * Returns the name of the fsImage file uploaded by periodic * checkpointing */ - public File[] getFsImageNameCheckpoint() throws IOException { + File[] getFsImageNameCheckpoint() throws IOException { return getFSImage().getFsImageNameCheckpoint(); } @@ -1187,7 +1180,7 @@ * * @return the http address. 
*/ - public InetSocketAddress getHttpAddress() { + InetSocketAddress getHttpAddress() { return httpAddress; } {code} > Some methods in the NameNode should not be public > - > > Key: HDFS-1217 > URL: https://issues.apache.org/jira/browse/HDFS-1217 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Reporter: Tsz Wo (Nicholas), SZE > > There are quite a few NameNode methods which are not required to be public.
[jira] Updated: (HDFS-1216) Update to JUnit 4 in branch 20 append
[ https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1216: -- Attachment: (was: junit-4.5.txt) > Update to JUnit 4 in branch 20 append > - > > Key: HDFS-1216 > URL: https://issues.apache.org/jira/browse/HDFS-1216 > Project: Hadoop HDFS > Issue Type: Task > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: junit-4.5.txt > > > A lot of the append tests are JUnit 4 style. We should upgrade in branch - > Junit 4 is entirely backward compatible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1216) Update to JUnit 4 in branch 20 append
[ https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1216: -- Attachment: junit-4.5.txt Ah, uploaded wrong file. Take 2. > Update to JUnit 4 in branch 20 append > - > > Key: HDFS-1216 > URL: https://issues.apache.org/jira/browse/HDFS-1216 > Project: Hadoop HDFS > Issue Type: Task > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: junit-4.5.txt > > > A lot of the append tests are JUnit 4 style. We should upgrade in branch - > Junit 4 is entirely backward compatible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1216) Update to JUnit 4 in branch 20 append
[ https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1216: -- Attachment: junit-4.5.txt Update to junit 4.5 (it's not the newest, but it's what we use in trunk, so we should be consistent) > Update to JUnit 4 in branch 20 append > - > > Key: HDFS-1216 > URL: https://issues.apache.org/jira/browse/HDFS-1216 > Project: Hadoop HDFS > Issue Type: Task > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: junit-4.5.txt > > > A lot of the append tests are JUnit 4 style. We should upgrade in branch - > Junit 4 is entirely backward compatible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1217) Some methods in the NameNode should not be public
Some methods in the NameNode should not be public - Key: HDFS-1217 URL: https://issues.apache.org/jira/browse/HDFS-1217 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Tsz Wo (Nicholas), SZE There are quite a few NameNode methods which are not required to be public.
[jira] Commented: (HDFS-1216) Update to JUnit 4 in branch 20 append
[ https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879513#action_12879513 ] dhruba borthakur commented on HDFS-1216: +1 > Update to JUnit 4 in branch 20 append > - > > Key: HDFS-1216 > URL: https://issues.apache.org/jira/browse/HDFS-1216 > Project: Hadoop HDFS > Issue Type: Task > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: junit-4.5.txt > > > A lot of the append tests are JUnit 4 style. We should upgrade in branch - > Junit 4 is entirely backward compatible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1216) Update to JUnit 4 in branch 20 append
Update to JUnit 4 in branch 20 append - Key: HDFS-1216 URL: https://issues.apache.org/jira/browse/HDFS-1216 Project: Hadoop HDFS Issue Type: Task Components: test Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append A lot of the append tests are JUnit 4 style. We should upgrade in branch - Junit 4 is entirely backward compatible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1054) Remove unnecessary sleep after failure in nextBlockOutputStream
[ https://issues.apache.org/jira/browse/HDFS-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1054: -- Fix Version/s: 0.20-append > Remove unnecessary sleep after failure in nextBlockOutputStream > --- > > Key: HDFS-1054 > URL: https://issues.apache.org/jira/browse/HDFS-1054 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.20-append, 0.20.3, 0.21.0, 0.22.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append, 0.21.0 > > Attachments: hdfs-1054-0.20-append.txt, hdfs-1054.txt, hdfs-1054.txt > > > If DFSOutputStream fails to create a pipeline, it currently sleeps 6 seconds > before retrying. I don't see a great reason to wait at all, much less 6 > seconds (especially now that HDFS-630 ensures that a retry won't go back to > the bad node). We should at least make it configurable, and perhaps something > like backoff makes some sense. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
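The "something like backoff" idea mentioned at the end of HDFS-1054 can be sketched as capped exponential delays (a standalone illustration, not part of the actual patch): the delay doubles per attempt from a configurable base instead of a hard-coded 6-second sleep.

```java
// Sketch of configurable capped exponential backoff (not the actual patch):
// retry delay doubles per attempt from a configurable base, up to a cap,
// replacing a fixed 6-second sleep.
public class BackoffSketch {
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long d = baseMillis << Math.min(attempt, 16); // double per attempt; cap the shift
        return Math.min(d, maxMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println(delayMillis(attempt, 100, 5000)); // 100 200 400 800 1600
        }
    }
}
```

With HDFS-630 excluding the bad node on retry, even the base delay is arguably unnecessary, which is the point of the issue.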
[jira] Resolved: (HDFS-1215) TestNodeCount infinite loops on branch-20-append
[ https://issues.apache.org/jira/browse/HDFS-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon resolved HDFS-1215. --- Assignee: Todd Lipcon Resolution: Fixed Dhruba committed to 20-append branch > TestNodeCount infinite loops on branch-20-append > > > Key: HDFS-1215 > URL: https://issues.apache.org/jira/browse/HDFS-1215 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.20-append > > Attachments: > 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch > > > HDFS-409 made some minicluster changes, which got incorporated into one of > the earlier 20-append patches. This breaks TestNodeCount so it infinite loops > on the branch. This patch fixes it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-988) saveNamespace can corrupt edits log
[ https://issues.apache.org/jira/browse/HDFS-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-988: - Fix Version/s: 0.20-append Marking this as fixed for the append branch (it's committed there, but not resolved for trunk yet) > saveNamespace can corrupt edits log > --- > > Key: HDFS-988 > URL: https://issues.apache.org/jira/browse/HDFS-988 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.20-append, 0.21.0, 0.22.0 >Reporter: dhruba borthakur >Assignee: Todd Lipcon >Priority: Blocker > Fix For: 0.20-append > > Attachments: hdfs-988.txt, saveNamespace.txt, > saveNamespace_20-append.patch > > > The administrator puts the namenode in safemode and then issues the > savenamespace command. This can corrupt the edits log. The problem is that > when the NN enters safemode, there could still be pending logSyncs occurring > from other threads. Now, the saveNamespace command, when executed, would save > an edits log with partial writes. I have seen this happen on 0.20. > https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12828853&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12828853
[jira] Updated: (HDFS-1215) TestNodeCount infinite loops on branch-20-append
[ https://issues.apache.org/jira/browse/HDFS-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1215: -- Attachment: 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch Here's a -p1 patch that fixes this issue. > TestNodeCount infinite loops on branch-20-append > > > Key: HDFS-1215 > URL: https://issues.apache.org/jira/browse/HDFS-1215 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.20-append >Reporter: Todd Lipcon > Fix For: 0.20-append > > Attachments: > 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch > > > HDFS-409 made some minicluster changes, which got incorporated into one of > the earlier 20-append patches. This breaks TestNodeCount so it infinite loops > on the branch. This patch fixes it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1215) TestNodeCount infinite loops on branch-20-append
TestNodeCount infinite loops on branch-20-append Key: HDFS-1215 URL: https://issues.apache.org/jira/browse/HDFS-1215 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 0.20-append Reporter: Todd Lipcon Fix For: 0.20-append HDFS-409 made some minicluster changes, which got incorporated into one of the earlier 20-append patches. This breaks TestNodeCount so it infinite loops on the branch. This patch fixes it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-826) Allow a mechanism for an application to detect that datanode(s) have died in the write pipeline
[ https://issues.apache.org/jira/browse/HDFS-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-826: -- Fix Version/s: 0.20-append > Allow a mechanism for an application to detect that datanode(s) have died in > the write pipeline > > > Key: HDFS-826 > URL: https://issues.apache.org/jira/browse/HDFS-826 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client >Affects Versions: 0.20-append >Reporter: dhruba borthakur >Assignee: dhruba borthakur > Fix For: 0.20-append, 0.21.0 > > Attachments: HDFS-826-0.20-v2.patch, HDFS-826-0.20.patch, > Replicable4.txt, ReplicableHdfs.txt, ReplicableHdfs2.txt, ReplicableHdfs3.txt > > > HDFS does not replicate the last block of the file that is being currently > written to by an application. Every datanode death in the write pipeline > decreases the reliability of the last block of the currently-being-written > block. This situation can be improved if the application can be notified of a > datanode death in the write pipeline. Then, the application can decide what > is the right course of action to be taken on this event. > In our use-case, the application can close the file on the first datanode > death, and start writing to a newly created file. This ensures that the > reliability guarantee of a block is close to 3 at all time. > One idea is to make DFSOutoutStream. write() throw an exception if the number > of datanodes in the write pipeline fall below minimum.replication.factor that > is set on the client (this is backward compatible). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
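The final idea in HDFS-826, making write() fail when the pipeline falls below a client-side minimum, can be sketched like this (class and method names are illustrative, not the shipped API):

```java
import java.io.IOException;

// Sketch of the backward-compatible proposal (illustrative names, not the
// shipped API): fail the write when the number of live datanodes in the
// pipeline drops below the client's minimum replication.
public class PipelineGuard {
    private final int minReplication;
    private int liveDatanodes;

    PipelineGuard(int minReplication, int liveDatanodes) {
        this.minReplication = minReplication;
        this.liveDatanodes = liveDatanodes;
    }

    void datanodeDied() { liveDatanodes--; } // notification of a pipeline death

    boolean writeAllowed() { return liveDatanodes >= minReplication; }

    void checkBeforeWrite() throws IOException {
        if (!writeAllowed()) {
            throw new IOException("pipeline has " + liveDatanodes
                + " datanodes, below minimum replication " + minReplication);
        }
    }

    public static void main(String[] args) {
        PipelineGuard g = new PipelineGuard(2, 3);
        g.datanodeDied();                       // 2 left: still at the minimum
        boolean okAtMinimum = g.writeAllowed();
        g.datanodeDied();                       // 1 left: below the minimum
        System.out.println(okAtMinimum + " " + g.writeAllowed()); // true false
    }
}
```

The application then decides the course of action on the exception, e.g. closing the file and writing to a fresh one as in the use case above.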
[jira] Updated: (HDFS-793) DataNode should first receive the whole packet ack message before it constructs and sends its own ack message for the packet
[ https://issues.apache.org/jira/browse/HDFS-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-793: - Affects Version/s: (was: 0.20-append) Removing 0.20-append, since it was already applied to the 20 branch in the form of HDFS-872. > DataNode should first receive the whole packet ack message before it > constructs and sends its own ack message for the packet > > > Key: HDFS-793 > URL: https://issues.apache.org/jira/browse/HDFS-793 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Reporter: Hairong Kuang >Assignee: Hairong Kuang >Priority: Blocker > Fix For: 0.20.2, 0.21.0 > > Attachments: separateSendRcvAck-0.20-yahoo.patch, > separateSendRcvAck-0.20.patch, separateSendRcvAck.patch, > separateSendRcvAck1.patch, separateSendRcvAck2.patch > > > Currently BlockReceiver#PacketResponder interleaves receiving the ack message and > sending its own ack message for the same packet. It reads a portion of the message, > sends a portion of its ack, and continues like this until it hits the end of > the message. The problem is that if it gets an error receiving the ack, it is > not able to send an ack that reflects the source of the error. > The PacketResponder should receive the whole packet ack message first and > then construct and send out its ack.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-630: -- Fix Version/s: 0.20-append > In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific > datanodes when locating the next block. > --- > > Key: HDFS-630 > URL: https://issues.apache.org/jira/browse/HDFS-630 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs client, name-node >Affects Versions: 0.20-append >Reporter: Ruyue Ma >Assignee: Cosmin Lehene > Fix For: 0.20-append, 0.21.0 > > Attachments: 0001-Fix-HDFS-630-0.21-svn-1.patch, > 0001-Fix-HDFS-630-0.21-svn-2.patch, 0001-Fix-HDFS-630-0.21-svn.patch, > 0001-Fix-HDFS-630-for-0.21-and-trunk-unified.patch, > 0001-Fix-HDFS-630-for-0.21.patch, 0001-Fix-HDFS-630-svn.patch, > 0001-Fix-HDFS-630-svn.patch, 0001-Fix-HDFS-630-trunk-svn-1.patch, > 0001-Fix-HDFS-630-trunk-svn-2.patch, 0001-Fix-HDFS-630-trunk-svn-3.patch, > 0001-Fix-HDFS-630-trunk-svn-3.patch, 0001-Fix-HDFS-630-trunk-svn-4.patch, > hdfs-630-0.20-append.patch, hdfs-630-0.20.txt, HDFS-630.patch > > > created from HDFS-200. > If, during a write, the dfsclient sees that a block replica location for a > newly allocated block is not connectable, it re-requests the NN to get a > fresh set of replica locations for the block. It tries this > dfs.client.block.write.retries times (default 3), sleeping 6 seconds between > each retry (see DFSClient.nextBlockOutputStream). > This setting works well when you have a reasonably sized cluster; if you have > few datanodes in the cluster, every retry may pick the dead datanode and > the above logic bails out. > Our solution: when getting block locations from the namenode, we give the NN the > excluded datanodes. The list of dead datanodes is only for one block > allocation.
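The solution described in HDFS-630, accumulating per-block excluded datanodes across retries, can be sketched as follows (hypothetical names, not the actual DFSClient/NameNode code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the exclusion-list retry loop (hypothetical names, not the
// actual DFSClient/NameNode code): each unreachable datanode is added to a
// per-block exclusion list sent back on the next allocation request, so a
// small cluster cannot keep handing out the same dead node.
public class ExcludeNodesSketch {
    interface Allocator {
        List<String> allocate(List<String> excluded); // stand-in for the NN call
    }

    static List<String> nextBlockNodes(Allocator nn, List<String> deadNodes, int maxRetries) {
        List<String> excluded = new ArrayList<>(); // scoped to this one block
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            List<String> pipeline = nn.allocate(excluded);
            String bad = firstDead(pipeline, deadNodes);
            if (bad == null) return pipeline;  // every node connectable
            excluded.add(bad);                 // remember the bad node and re-request
        }
        throw new IllegalStateException("could not build a pipeline");
    }

    private static String firstDead(List<String> pipeline, List<String> dead) {
        for (String n : pipeline) {
            if (dead.contains(n)) return n;
        }
        return null;
    }

    public static void main(String[] args) {
        Allocator nn = excluded -> {
            List<String> out = new ArrayList<>(Arrays.asList("dn1", "dn2"));
            out.removeAll(excluded);
            return out;
        };
        System.out.println(nextBlockNodes(nn, Arrays.asList("dn1"), 3)); // [dn2]
    }
}
```

Scoping the exclusion list to a single block allocation, as the reporters propose, avoids permanently blacklisting a node that was only transiently unreachable.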
[jira] Updated: (HDFS-457) better handling of volume failure in Data Node storage
[ https://issues.apache.org/jira/browse/HDFS-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-457: - Affects Version/s: (was: 0.20-append) Removing 0.20-append tag - this isn't append-specific. > better handling of volume failure in Data Node storage > -- > > Key: HDFS-457 > URL: https://issues.apache.org/jira/browse/HDFS-457 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node >Reporter: Boris Shkolnik >Assignee: Boris Shkolnik > Fix For: 0.21.0 > > Attachments: HDFS-457-1.patch, HDFS-457-2.patch, HDFS-457-2.patch, > HDFS-457-2.patch, HDFS-457-3.patch, HDFS-457.patch, HDFS-457_20-append.patch, > jira.HDFS-457.branch-0.20-internal.patch, TestFsck.zip > > > Current implementation shuts DataNode down completely when one of the > configured volumes of the storage fails. > This is rather wasteful behavior because it decreases utilization (good > storage becomes unavailable) and imposes extra load on the system > (replication of the blocks from the good volumes). These problems will become > even more prominent when we move to mixed (heterogeneous) clusters with many > more volumes per Data Node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-561) Fix write pipeline READ_TIMEOUT
[ https://issues.apache.org/jira/browse/HDFS-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-561: -- Fix Version/s: 0.20-append > Fix write pipeline READ_TIMEOUT > --- > > Key: HDFS-561 > URL: https://issues.apache.org/jira/browse/HDFS-561 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: data-node, hdfs client >Affects Versions: 0.20-append >Reporter: Kan Zhang >Assignee: Kan Zhang > Fix For: 0.20-append, 0.21.0 > > Attachments: h561-01.patch, h561-02.patch, hdfs-561-0.20-append.txt > > > When writing a file, the pipeline status read timeouts for datanodes are not > set up properly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-445) pread() fails when cached block locations are no longer valid
[ https://issues.apache.org/jira/browse/HDFS-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-445: -- Fix Version/s: 0.20-append > pread() fails when cached block locations are no longer valid > - > > Key: HDFS-445 > URL: https://issues.apache.org/jira/browse/HDFS-445 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append >Reporter: Kan Zhang >Assignee: Kan Zhang > Fix For: 0.20-append, 0.21.0 > > Attachments: 445-06.patch, 445-08.patch, hdfs-445-0.20-append.txt, > HDFS-445-0_20.2.patch > > > when cached block locations are no longer valid (e.g., datanodes restart on > different ports), pread() will fail, whereas normal read() still succeeds > through re-fetching of block locations from namenode (up to a max number of > times). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-127) DFSClient block read failures cause open DFSInputStream to become unusable
[ https://issues.apache.org/jira/browse/HDFS-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Gray updated HDFS-127: --- Affects Version/s: (was: 0.20-append) > DFSClient block read failures cause open DFSInputStream to become unusable > -- > > Key: HDFS-127 > URL: https://issues.apache.org/jira/browse/HDFS-127 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs client >Reporter: Igor Bolotin >Assignee: Igor Bolotin > Fix For: 0.21.0 > > Attachments: 4681.patch, h127_20091016.patch, h127_20091019.patch, > h127_20091019b.patch, hdfs-127-branch20-redone-v2.txt, > hdfs-127-branch20-redone.txt, hdfs-127-regression-test.txt > > > We are using some Lucene indexes directly from HDFS and for quite long time > we were using Hadoop version 0.15.3. > When tried to upgrade to Hadoop 0.19 - index searches started to fail with > exceptions like: > 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: > java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 > file=/usr/collarity/data/urls-new/part-0/20081110-163426/_0.tis > at > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708) > at > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536) > at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663) > at java.io.DataInputStream.read(DataInputStream.java:132) > at > org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174) > at > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152) > at > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38) > at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76) > at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63) > at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131) > at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162) > at 
org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223) > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217) > at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54) > ... > The investigation showed that the root of this issue is that we exceeded # of > xcievers in the data nodes and that was fixed by changing configuration > settings to 2k. > However - one thing that bothered me was that even after datanodes recovered > from overload and most of client servers had been shut down - we still > observed errors in the logs of running servers. > Further investigation showed that fix for HADOOP-1911 introduced another > problem - the DFSInputStream instance might become unusable once number of > failures over lifetime of this instance exceeds configured threshold. > The fix for this specific issue seems to be trivial - just reset failure > counter before reading next block (patch will be attached shortly). > This seems to be also related to HADOOP-3185, but I'm not sure I really > understand necessity of keeping track of failed block accesses in the DFS > client.
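The trivial fix Igor describes (resetting the failure counter before reading the next block, so failures no longer accumulate over the lifetime of the stream) can be sketched like this. All names below are hypothetical stand-ins, not the real DFSInputStream internals or the attached patch:

```java
// Sketch of the fix described above: the per-stream failure counter
// is reset at the start of each block read, so transient failures on
// earlier blocks cannot permanently disable an open input stream.
// Class and field names are illustrative, not actual DFSClient code.
public class InputStreamFailureTracking {
    static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

    int failures = 0; // failures while reading the *current* block

    // Called at the start of each blockSeekTo()-style operation.
    void startNewBlock() {
        failures = 0; // the fix: do not accumulate across blocks
    }

    // Returns true if the caller should retry (threshold not yet hit).
    boolean recordFailureAndCheckRetry() {
        failures++;
        return failures < MAX_BLOCK_ACQUIRE_FAILURES;
    }

    public static void main(String[] args) {
        InputStreamFailureTracking s = new InputStreamFailureTracking();
        s.recordFailureAndCheckRetry(); // two transient failures on block 1
        s.recordFailureAndCheckRetry();
        s.startNewBlock();              // moving to block 2 resets the count
        System.out.println(s.failures); // prints 0
    }
}
```

Without the reset, the counter crosses the threshold after enough transient errors anywhere in the file, which is exactly the "stream becomes unusable" symptom reported.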
[jira] Updated: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhruba borthakur updated HDFS-101: -- Fix Version/s: 0.20-append > DFS write pipeline : DFSClient sometimes does not detect second datanode > failure > - > > Key: HDFS-101 > URL: https://issues.apache.org/jira/browse/HDFS-101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append, 0.20.1 >Reporter: Raghu Angadi >Assignee: Hairong Kuang >Priority: Blocker > Fix For: 0.20-append, 0.20.2, 0.21.0 > > Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, > detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, > detectDownDN3-0.20.patch, detectDownDN3.patch, > hdfs-101-branch-0.20-append-cdh3.txt, hdfs-101.tar.gz, > HDFS-101_20-append.patch, pipelineHeartbeat.patch, > pipelineHeartbeat_yahoo.patch > > > When the first datanode's write to second datanode fails or times out > DFSClient ends up marking first datanode as the bad one and removes it from > the pipeline. Similar problem exists on DataNode as well and it is fixed in > HADOOP-3339. From HADOOP-3339 : > "The main issue is that BlockReceiver thread (and DataStreamer in the case of > DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty > coarse control. We don't know what state the responder is in and interrupting > has different effects depending on responder state. To fix this properly we > need to redesign how we handle these interactions." > When the first datanode closes its socket from DFSClient, DFSClient should > properly read all the data left in the socket. Also, DataNode's closing of > the socket should not result in a TCP reset, otherwise I think DFSClient will > not be able to read from the socket.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879479#action_12879479 ] Hairong Kuang commented on HDFS-1057: - > they aren't guaranteed to be since there are methods to change the > bytesOnDisk separate from the lastCheckSum bytes. I do not see any place that updates bytesOnDisk except for BlockReceiver. That's why I suggested removing setBytesOnDisk from ReplicaInPipeline etc. > Concurrent readers hit ChecksumExceptions if following a writer to very end > of file > --- > > Key: HDFS-1057 > URL: https://issues.apache.org/jira/browse/HDFS-1057 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: data-node >Affects Versions: 0.20-append, 0.21.0, 0.22.0 >Reporter: Todd Lipcon >Assignee: sam rash >Priority: Blocker > Attachments: conurrent-reader-patch-1.txt, > conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, > hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt > > > In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before > calling flush(). Therefore, if there is a concurrent reader, it's possible to > race here - the reader will see the new length while those bytes are still in > the buffers of BlockReceiver. Thus the client will potentially see checksum > errors or EOFs. Additionally, the last checksum chunk of the file is made > accessible to readers even though it is not stable.
[jira] Commented: (HDFS-1214) hdfs client metadata cache
[ https://issues.apache.org/jira/browse/HDFS-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879466#action_12879466 ] Joydeep Sen Sarma commented on HDFS-1214: - While a cache can be maintained on the application side, that is harder and seems like the wrong place to implement it. In the case of a query compiler, different compilation stages may fetch metadata to estimate query cost. Furthermore, different queries compiled in the same JVM may end up requesting metadata for the same objects. The application can identify the calls that can tolerate out-of-date metadata (so a separate API, an overlaid filesystem driver, or additional flags in the current API are all acceptable). Ideally the cache should be write-through (it is very common for a single JVM to read and write the same fs object repeatedly). > hdfs client metadata cache > -- > > Key: HDFS-1214 > URL: https://issues.apache.org/jira/browse/HDFS-1214 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs client >Reporter: Joydeep Sen Sarma > > In some applications, latency is affected by the cost of making rpc calls to > namenode to fetch metadata. the most obvious case is calls to fetch > file/directory status. applications like hive like to make optimizations > based on file size/number etc. - and for such optimizations - 'recent' status > data (as opposed to most up-to-date) is acceptable. in such cases, a cache on > the DFS client that transparently caches metadata would greatly benefit > applications.
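A write-through client-side metadata cache of the kind this comment asks for can be sketched as follows. The classes and methods are hypothetical (no such API exists in the HDFS client); the point is that reads tolerating slightly stale data are served from the cache, while this client's own writes update the cached entry:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a write-through client-side metadata cache: reads that can
// accept 'recent' (possibly stale) status hit the cache; writes made
// through the same client update the cached entry, so a JVM that reads
// and writes the same path repeatedly sees its own changes.
// FileStatusLite and this cache API are hypothetical, not HDFS code.
public class MetadataCacheSketch {
    static final class FileStatusLite {
        final long length;
        FileStatusLite(long length) { this.length = length; }
    }

    private final Map<String, FileStatusLite> cache = new ConcurrentHashMap<>();

    // Read path: return a cached (possibly stale) status, else fetch once.
    FileStatusLite getStatus(String path,
                             Function<String, FileStatusLite> fetchFromNameNode) {
        return cache.computeIfAbsent(path, fetchFromNameNode);
    }

    // Write path: write-through keeps the cache consistent with this
    // client's own modifications.
    void onWriteCompleted(String path, long newLength) {
        cache.put(path, new FileStatusLite(newLength));
    }

    public static void main(String[] args) {
        MetadataCacheSketch c = new MetadataCacheSketch();
        c.getStatus("/user/hive/t1", p -> new FileStatusLite(100)); // miss: fetch
        c.onWriteCompleted("/user/hive/t1", 200);                   // write-through
        // Served from cache, reflecting our own write; prints 200.
        System.out.println(c.getStatus("/user/hive/t1", p -> new FileStatusLite(-1)).length);
    }
}
```

A real implementation would also need expiry or invalidation for other clients' writes, which is exactly why the comment restricts the cache to calls that can tolerate out-of-date metadata.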
[jira] Updated: (HDFS-1212) Harmonize HDFS JAR library versions with Common
[ https://issues.apache.org/jira/browse/HDFS-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom White updated HDFS-1212: Attachment: HDFS-1212.patch > Harmonize HDFS JAR library versions with Common > --- > > Key: HDFS-1212 > URL: https://issues.apache.org/jira/browse/HDFS-1212 > Project: Hadoop HDFS > Issue Type: Bug > Components: build >Reporter: Tom White >Assignee: Tom White >Priority: Blocker > Fix For: 0.21.0 > > Attachments: HDFS-1212.patch > > > HDFS part of HADOOP-6800.
[jira] Created: (HDFS-1214) hdfs client metadata cache
hdfs client metadata cache -- Key: HDFS-1214 URL: https://issues.apache.org/jira/browse/HDFS-1214 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs client Reporter: Joydeep Sen Sarma In some applications, latency is affected by the cost of making RPC calls to the namenode to fetch metadata. The most obvious case is calls to fetch file/directory status. Applications like Hive like to make optimizations based on file size/number etc., and for such optimizations 'recent' status data (as opposed to most up-to-date) is acceptable. In such cases, a cache on the DFS client that transparently caches metadata would greatly benefit applications.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879460#action_12879460 ] sam rash commented on HDFS-1057: 1. they aren't guaranteed to be since there are methods to change the bytesOnDisk separate from the lastCheckSum bytes. It's entirely conceivable that something could update the bytes on disk w/o updating the lastChecksum with the current set of methods If we are ok with a loosely coupled guarantee, then we can use bytesOnDisk and be careful never to call setBytesOnDisk() for any RBW 2. oh--your previous comments indicated we shouldn't change either ReplicaInPipelineInterface or ReplicaInPipeline. If that's not the case and we can do this, then my comment above doesn't hold. we use bytesOnDisk and guarantee it's in sync with the checksum in a single synchronized method (I like this) 3. will make the update to treat missing last blocks as 0-length and re-instate the unit test. thanks for all the help on this > Concurrent readers hit ChecksumExceptions if following a writer to very end > of file > --- > > Key: HDFS-1057 > URL: https://issues.apache.org/jira/browse/HDFS-1057 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: data-node >Affects Versions: 0.20-append, 0.21.0, 0.22.0 >Reporter: Todd Lipcon >Assignee: sam rash >Priority: Blocker > Attachments: conurrent-reader-patch-1.txt, > conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, > hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt > > > In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before > calling flush(). Therefore, if there is a concurrent reader, it's possible to > race here - the reader will see the new length while those bytes are still in > the buffers of BlockReceiver. Thus the client will potentially see checksum > errors or EOFs. Additionally, the last checksum chunk of the file is made > accessible to readers even though it is not stable. 
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879454#action_12879454 ] Hairong Kuang commented on HDFS-1057: - Sam, the patch is in good shape. Thanks for working on this. A few minor comments: 1. ReplicaBeingWritten.java: dataLength and bytesOnDisk are the same, right? We do not need to introduce another field dataLength. I am also hesitant to declare dataLength & lastChecksum as volatile. Accesses to them are already synchronized, and the normal case is writing without reading. 2. We probably should remove setBytesOnDisk from ReplicaInPipelineInterface & ReplicaInPipeline. > In 0.20, I made it so that client just treats this as a 0-length file. one of > our internal tools saw this rather frequently in 0.20. Good to know this. Then could you please handle this case the same way in trunk as well? Thanks again, Sam. > Concurrent readers hit ChecksumExceptions if following a writer to very end > of file > --- > > Key: HDFS-1057 > URL: https://issues.apache.org/jira/browse/HDFS-1057 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: data-node >Affects Versions: 0.20-append, 0.21.0, 0.22.0 >Reporter: Todd Lipcon >Assignee: sam rash >Priority: Blocker > Attachments: conurrent-reader-patch-1.txt, > conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, > hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt > > > In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before > calling flush(). Therefore, if there is a concurrent reader, it's possible to > race here - the reader will see the new length while those bytes are still in > the buffers of BlockReceiver. Thus the client will potentially see checksum > errors or EOFs. Additionally, the last checksum chunk of the file is made > accessible to readers even though it is not stable.
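The ordering and atomicity under discussion in this thread can be sketched in isolation: write (and, in the real code, flush) the packet data before advancing the reader-visible length, and publish the length together with the last partial-chunk checksum in a single synchronized method. The classes below are illustrative toys, not the real BlockReceiver or replica classes:

```java
import java.io.ByteArrayOutputStream;

// Toy model of the HDFS-1057 discussion: the reader-visible length is
// advanced only AFTER the packet data has been written, and the length
// and last checksum are published together in one synchronized method,
// so a concurrent reader can never observe a length without matching
// data and checksum. All names here are illustrative, not HDFS classes.
class ReplicaSketch {
    private long bytesOnDisk;
    private byte[] lastChecksum;

    // Single synchronized update keeps length and checksum in sync.
    synchronized void setBytesOnDiskAndLastChecksum(long len, byte[] checksum) {
        bytesOnDisk = len;
        lastChecksum = checksum.clone();
    }
    synchronized long getBytesOnDisk() { return bytesOnDisk; }
    synchronized byte[] getLastChecksum() {
        return lastChecksum == null ? null : lastChecksum.clone();
    }
}

public class ReceiverSketch {
    static void receivePacket(ByteArrayOutputStream blockFile, ReplicaSketch replica,
                              byte[] data, byte[] checksum) {
        blockFile.write(data, 0, data.length);
        // (the real code would flush() here, still BEFORE publishing the length)
        replica.setBytesOnDiskAndLastChecksum(
                replica.getBytesOnDisk() + data.length, checksum);
    }

    public static void main(String[] args) {
        ReplicaSketch r = new ReplicaSketch();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        receivePacket(out, r, new byte[512], new byte[4]);
        // The visible length never exceeds the bytes actually written.
        System.out.println(r.getBytesOnDisk() <= out.size()); // prints true
    }
}
```

Doing the update before the flush, as the original report describes, is exactly what lets a reader see a length whose bytes are still sitting in BlockReceiver's buffers.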
[jira] Updated: (HDFS-1213) Implement an Apache Commons VFS Driver for HDFS
[ https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-1213: --- Summary: Implement an Apache Commons VFS Driver for HDFS (was: Implement a VFS Driver for HDFS) > Implement an Apache Commons VFS Driver for HDFS > --- > > Key: HDFS-1213 > URL: https://issues.apache.org/jira/browse/HDFS-1213 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs client >Reporter: Michael D'Amour > Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, > pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar > > > We have an open source ETL tool (Kettle) which uses VFS for many input/output > steps/jobs. We would like to be able to read/write HDFS from Kettle using > VFS. > > I haven't been able to find anything out there other than "it would be nice." > > I had some time a few weeks ago to begin writing a VFS driver for HDFS and we > (Pentaho) would like to be able to contribute this driver. I believe it > supports all the major file/folder operations and I have written unit tests > for all of these operations. The code is currently checked into an open > Pentaho SVN repository under the Apache 2.0 license. There are some current > limitations, such as a lack of authentication (kerberos), which appears to be > coming in 0.22.0, however, the driver supports username/password, but I just > can't use them yet. > I will be attaching the code for the driver once the case is created. The > project does not modify existing hadoop/hdfs source. > Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146
[jira] Updated: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-101: - Attachment: hdfs-101-branch-0.20-append-cdh3.txt Hey Nicolas, I just compared our two patches side by side. The one I've been testing with (and made a noticeable improvement in recovery detecting the correct down node in cluster failure testing) is attached. Here are a few differences I noticed (though maybe because the diffs are against different trees): - Looks like your patch doesn't maintain wire compat when mirrorError is true, since it constructs a "replies" list with only 2 elements (not based on the number of downstream nodes) - When receiving packets in BlockReceiver, I am explicitly forwarding HEART_BEAT packets where it looks like you're not checking for them. Have you verified by leaving a connection open with no data flowing that heartbeats are handled properly in BlockReceiver? > DFS write pipeline : DFSClient sometimes does not detect second datanode > failure > - > > Key: HDFS-101 > URL: https://issues.apache.org/jira/browse/HDFS-101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append, 0.20.1 >Reporter: Raghu Angadi >Assignee: Hairong Kuang >Priority: Blocker > Fix For: 0.20.2, 0.21.0 > > Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, > detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, > detectDownDN3-0.20.patch, detectDownDN3.patch, > hdfs-101-branch-0.20-append-cdh3.txt, hdfs-101.tar.gz, > HDFS-101_20-append.patch, pipelineHeartbeat.patch, > pipelineHeartbeat_yahoo.patch > > > When the first datanode's write to second datanode fails or times out > DFSClient ends up marking first datanode as the bad one and removes it from > the pipeline. Similar problem exists on DataNode as well and it is fixed in > HADOOP-3339. From HADOOP-3339 : > "The main issue is that BlockReceiver thread (and DataStreamer in the case of > DFSClient) interrupt() the 'responder' thread. 
But interrupting is a pretty > coarse control. We don't know what state the responder is in and interrupting > has different effects depending on responder state. To fix this properly we > need to redesign how we handle these interactions." > When the first datanode closes its socket from DFSClient, DFSClient should > properly read all the data left in the socket. Also, DataNode's closing of > the socket should not result in a TCP reset, otherwise I think DFSClient will > not be able to read from the socket.
[jira] Commented: (HDFS-1213) Implement a VFS Driver for HDFS
[ https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879448#action_12879448 ] Michael D'Amour commented on HDFS-1213: --- Allen- Sorry for any confusion, I am referring to Apache VFS (http://commons.apache.org/vfs/) > Implement a VFS Driver for HDFS > --- > > Key: HDFS-1213 > URL: https://issues.apache.org/jira/browse/HDFS-1213 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs client >Reporter: Michael D'Amour > Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, > pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar > > > We have an open source ETL tool (Kettle) which uses VFS for many input/output > steps/jobs. We would like to be able to read/write HDFS from Kettle using > VFS. > > I haven't been able to find anything out there other than "it would be nice." > > I had some time a few weeks ago to begin writing a VFS driver for HDFS and we > (Pentaho) would like to be able to contribute this driver. I believe it > supports all the major file/folder operations and I have written unit tests > for all of these operations. The code is currently checked into an open > Pentaho SVN repository under the Apache 2.0 license. There are some current > limitations, such as a lack of authentication (kerberos), which appears to be > coming in 0.22.0, however, the driver supports username/password, but I just > can't use them yet. > I will be attaching the code for the driver once the case is created. The > project does not modify existing hadoop/hdfs source. > Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146
[jira] Commented: (HDFS-1213) Implement a VFS Driver for HDFS
[ https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879447#action_12879447 ] Arun C Murthy commented on HDFS-1213: - Michael, could you please upload this as a patch rather than a tarball? > Implement a VFS Driver for HDFS > --- > > Key: HDFS-1213 > URL: https://issues.apache.org/jira/browse/HDFS-1213 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs client >Reporter: Michael D'Amour > Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, > pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar > > > We have an open source ETL tool (Kettle) which uses VFS for many input/output > steps/jobs. We would like to be able to read/write HDFS from Kettle using > VFS. > > I haven't been able to find anything out there other than "it would be nice." > > I had some time a few weeks ago to begin writing a VFS driver for HDFS and we > (Pentaho) would like to be able to contribute this driver. I believe it > supports all the major file/folder operations and I have written unit tests > for all of these operations. The code is currently checked into an open > Pentaho SVN repository under the Apache 2.0 license. There are some current > limitations, such as a lack of authentication (kerberos), which appears to be > coming in 0.22.0, however, the driver supports username/password, but I just > can't use them yet. > I will be attaching the code for the driver once the case is created. The > project does not modify existing hadoop/hdfs source. > Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879439#action_12879439 ] Nicolas Spiegelberg commented on HDFS-101: -- Todd, your assumption is correct. I needed a couple small things from the HDFS-793 patch (namely, getNumOfReplies) to make HDFS-101 compatible with HDFS-872. > DFS write pipeline : DFSClient sometimes does not detect second datanode > failure > - > > Key: HDFS-101 > URL: https://issues.apache.org/jira/browse/HDFS-101 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20-append, 0.20.1 >Reporter: Raghu Angadi >Assignee: Hairong Kuang >Priority: Blocker > Fix For: 0.20.2, 0.21.0 > > Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, > detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, > detectDownDN3-0.20.patch, detectDownDN3.patch, hdfs-101.tar.gz, > HDFS-101_20-append.patch, pipelineHeartbeat.patch, > pipelineHeartbeat_yahoo.patch > > > When the first datanode's write to second datanode fails or times out > DFSClient ends up marking first datanode as the bad one and removes it from > the pipeline. Similar problem exists on DataNode as well and it is fixed in > HADOOP-3339. From HADOOP-3339 : > "The main issue is that BlockReceiver thread (and DataStreamer in the case of > DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty > coarse control. We don't know what state the responder is in and interrupting > has different effects depending on responder state. To fix this properly we > need to redesign how we handle these interactions." > When the first datanode closes its socket from DFSClient, DFSClient should > properly read all the data left in the socket. Also, DataNode's closing of > the socket should not result in a TCP reset, otherwise I think DFSClient will > not be able to read from the socket.
[jira] Commented: (HDFS-1213) Implement a VFS Driver for HDFS
[ https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879437#action_12879437 ] Allen Wittenauer commented on HDFS-1213: Do you mean VFS as in the Linux virtual file system kernel API or some other VFS? > Implement a VFS Driver for HDFS > --- > > Key: HDFS-1213 > URL: https://issues.apache.org/jira/browse/HDFS-1213 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs client >Reporter: Michael D'Amour > Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, > pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar > > > We have an open source ETL tool (Kettle) which uses VFS for many input/output > steps/jobs. We would like to be able to read/write HDFS from Kettle using > VFS. > > I haven't been able to find anything out there other than "it would be nice." > > I had some time a few weeks ago to begin writing a VFS driver for HDFS and we > (Pentaho) would like to be able to contribute this driver. I believe it > supports all the major file/folder operations and I have written unit tests > for all of these operations. The code is currently checked into an open > Pentaho SVN repository under the Apache 2.0 license. There are some current > limitations, such as a lack of authentication (kerberos), which appears to be > coming in 0.22.0, however, the driver supports username/password, but I just > can't use them yet. > I will be attaching the code for the driver once the case is created. The > project does not modify existing hadoop/hdfs source. > Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146