[jira] Updated: (HDFS-1597) Batched edit log syncs can reset synctxid throw assertions
[ https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1597: -- Status: Patch Available (was: Open) > Batched edit log syncs can reset synctxid throw assertions > -- > > Key: HDFS-1597 > URL: https://issues.apache.org/jira/browse/HDFS-1597 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Fix For: 0.22.0 > > Attachments: hdfs-1597.txt, illustrate-test-failure.txt > > > The top of FSEditLog.logSync has the following assertion: > {code} > assert editStreams.size() > 0 : "no editlog streams"; > {code} > which should actually come after checking to see if the sync was already > batched in by another thread. > This is related to a second bug in which the same case causes synctxid to be > reset to 0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986861#action_12986861 ]

dhruba borthakur commented on HDFS-1595:
----------------------------------------

It appears that Todd's proposal could work well to avoid this issue. Do you agree, Nicholas?

It appears to me (please correct me if I am wrong) to be a data availability problem. The replica on F is actually still intact and the data there is good; it is just that clients are unable to read that data. Is it true that if the network card on F gets fixed, the data becomes available once again?

> DFSClient may incorrectly detect datanode failure
> -------------------------------------------------
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node, hdfs client
> Affects Versions: 0.20.4
> Reporter: Tsz Wo (Nicholas), SZE
> Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take out D from the pipeline, reconstruct the write pipeline with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D. In the case we found, F has a faulty network interface (or a faulty switch port) in such a way that the faulty network interface works fine when sending out a small amount of data, say 1MB, but it fails when sending out a large amount of data, say 100MB. Reading is working fine for any data size.
> It is even worse if F is the first datanode in the pipeline. Consider the following:
> # DFSClient creates a pipeline with three datanodes. The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F reports the second datanode has error.
> # DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline. DFSClient continues writing with one replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under replicated. The NameNode schedules replication from F to some other datanode D.
> # The replication fails for the same reason. D reports to the NameNode that the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.
[jira] Commented: (HDFS-1593) Allow a datanode to copy a block to a datanode on a foreign HDFS cluster.
[ https://issues.apache.org/jira/browse/HDFS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986852#action_12986852 ]

Sanjay Radia commented on HDFS-1593:
------------------------------------

In the case of the NN issuing a copy operation, the data does not go via the NN but is transferred directly; the two DNs do have to ensure that the other peer is indeed a DN in the cluster. I need to look at the code more closely, but I believe the DN generates an access token since it shares the access token secret with the NN.

Dhruba, you are asserting that the two clusters have the same principals; while this may be true in many cases, it may not be true in all environments. Further, the secret that is used to generate access tokens is not the same in two different clusters (even if they use the same principal). BTW, we have been looking at the same problem here at Yahoo and are trying to figure out the best secure solution.

There is another issue you are missing: the block sizes on the two clusters may not be the same, since their default block sizes may be different. Hence I am not sure if one can simply copy a block across.

Q. Are you trying to push or pull the data? In order to handle different block sizes it seems easier to pull the data. One choice is to access a byte range in a file via the DfsClient; the other choice is to get the block's bytes from multiple DNs in the remote src cluster.

I do, however, agree that transferring data directly from one or more datanodes to another is desirable.

> Allow a datanode to copy a block to a datanode on a foreign HDFS cluster.
> -------------------------------------------------------------------------
>
> Key: HDFS-1593
> URL: https://issues.apache.org/jira/browse/HDFS-1593
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
> Attachments: copyBlockTrunk1.txt
>
>
> This patch introduces an RPC to the datanode to allow it to copy a block to a datanode on a remote HDFS cluster.
[jira] Commented: (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms
[ https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986815#action_12986815 ] Jitendra Nath Pandey commented on HDFS-1580: > Is HDFS-1557 almost ready to go? (should I take a look at it?) Yes, almost! I am done with my review and Suresh is also taking a look at the patch, once his review is done I will proceed to commit it. You are welcome to take a look. > Add interface for generic Write Ahead Logging mechanisms > > > Key: HDFS-1580 > URL: https://issues.apache.org/jira/browse/HDFS-1580 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ivan Kelly > Attachments: generic_wal_iface.txt > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1597) Batched edit log syncs can reset synctxid throw assertions
[ https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1597: -- Attachment: hdfs-1597.txt Here's a patch containing a fix and also two new unit tests that verify the edit-batching behavior. > Batched edit log syncs can reset synctxid throw assertions > -- > > Key: HDFS-1597 > URL: https://issues.apache.org/jira/browse/HDFS-1597 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Fix For: 0.22.0 > > Attachments: hdfs-1597.txt, illustrate-test-failure.txt > > > The top of FSEditLog.logSync has the following assertion: > {code} > assert editStreams.size() > 0 : "no editlog streams"; > {code} > which should actually come after checking to see if the sync was already > batched in by another thread. > This is related to a second bug in which the same case causes synctxid to be > reset to 0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1597) Batched edit log syncs can reset synctxid throw assertions
[ https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-1597: -- Description: The top of FSEditLog.logSync has the following assertion: {code} assert editStreams.size() > 0 : "no editlog streams"; {code} which should actually come after checking to see if the sync was already batched in by another thread. This is related to a second bug in which the same case causes synctxid to be reset to 0 was: The top of FSEditLog.logSync has the following assertion: {code} assert editStreams.size() > 0 : "no editlog streams"; {code} which should actually come after checking to see if the sync was already batched in by another thread. Will describe the race in a comment. Summary: Batched edit log syncs can reset synctxid throw assertions (was: Misplaced assertion in FSEditLog.logSync) Updated description to reflect the other problem as well > Batched edit log syncs can reset synctxid throw assertions > -- > > Key: HDFS-1597 > URL: https://issues.apache.org/jira/browse/HDFS-1597 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Fix For: 0.22.0 > > Attachments: illustrate-test-failure.txt > > > The top of FSEditLog.logSync has the following assertion: > {code} > assert editStreams.size() > 0 : "no editlog streams"; > {code} > which should actually come after checking to see if the sync was already > batched in by another thread. > This is related to a second bug in which the same case causes synctxid to be > reset to 0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986788#action_12986788 ]

Koji Noguchi commented on HDFS-1595:
------------------------------------

bq. So this faulty node F has no problem receiving large amount of data?

This faulty node had problems sending and receiving large amounts of data and was failing most of the time. The bigger the data, the higher the chance of failure. I think smaller data (say, less than 1MB) was going through 99% of the time, so heartbeats, acks and so forth were probably working.

When I tried to scp some blocks out from this node for data recovery, it kept on failing with
===
blk_-1131935611740137990% 0 0.0KB/s --:-- ETA
Corrupted MAC on input.
Finished discarding for aa.bb.cc.dd
lost connection
===

So I believe *most* of the dfsclient writes were failing when going through this node. And when one successfully went through (after hundreds of write attempts for different blocks), it would then fail on all the following replications but succeed on 'close' with 1 replica, leading to this bug.

> DFSClient may incorrectly detect datanode failure
> -------------------------------------------------
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node, hdfs client
> Affects Versions: 0.20.4
> Reporter: Tsz Wo (Nicholas), SZE
> Priority: Critical
> Attachments: hdfs-1595-idea.txt
[jira] Commented: (HDFS-1158) HDFS-457 increases the chances of losing blocks
[ https://issues.apache.org/jira/browse/HDFS-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986783#action_12986783 ]

Eli Collins commented on HDFS-1158:
-----------------------------------

Good suggestion Owen. How about the following?
* A dn should decommission itself rather than shut down whenever (a) the configured threshold of disk failures has been reached or (b) a critical volume (specified in the config, e.g. the volume(s) that host the logs, pid, tmp etc.) has failed. In practice an admin would specify the number of volume failures that should be tolerated and specify the root volume as critical.
* The configured failed.volumes.tolerated should be respected on startup. The datanode should only refuse to start up if more than failed.volumes.tolerated volumes have failed, or if a configured critical volume has failed (which is probably not an issue in practice since dn startup probably fails anyway, e.g. if the root volume has gone read-only).

> HDFS-457 increases the chances of losing blocks
> -----------------------------------------------
>
> Key: HDFS-1158
> URL: https://issues.apache.org/jira/browse/HDFS-1158
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 0.21.0
> Reporter: Koji Noguchi
> Attachments: rev-HDFS-457.patch
>
>
> Whenever we restart a cluster, there's a chance of losing some blocks if more than three datanodes don't come up.
> HDFS-457 increases this chance by keeping the datanodes up even when
> # /tmp disk goes read-only
> # /disk0 that is used for storing PID goes read-only
> and probably more.
> In our environment, /tmp and /disk0 are from the same device.
> When trying to restart a datanode, it would fail with
> 1)
> {noformat}
> 2010-05-15 05:45:45,575 WARN org.mortbay.log: tmpdir
> java.io.IOException: Read-only file system
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.checkAndCreate(File.java:1704)
> at java.io.File.createTempFile(File.java:1792)
> at java.io.File.createTempFile(File.java:1828)
> at org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
> {noformat}
> or
> 2)
> {noformat}
> hadoop-daemon.sh: line 117: /disk/0/hadoop-datanodecom.out: Read-only file system
> hadoop-daemon.sh: line 118: /disk/0/hadoop-datanode.pid: Read-only file system
> {noformat}
> I can recover the missing blocks but it takes some time.
> Also, we are losing track of block movements since the log directory can also go read-only while the datanode continues running.
> For 0.21 release, can we revert HDFS-457 or make it configurable?
[jira] Commented: (HDFS-1469) TestBlockTokenWithDFS fails on trunk
[ https://issues.apache.org/jira/browse/HDFS-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986781#action_12986781 ] Konstantin Boudnik commented on HDFS-1469: -- Forgot to mention, that the timeout happens on 0.20.2 based release. > TestBlockTokenWithDFS fails on trunk > > > Key: HDFS-1469 > URL: https://issues.apache.org/jira/browse/HDFS-1469 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.22.0 >Reporter: Todd Lipcon >Priority: Blocker > Attachments: failed-TestBlockTokenWithDFS.txt, log.gz > > > TestBlockTokenWithDFS is failing on trunk: > Testcase: testAppend took 31.569 sec > FAILED > null > junit.framework.AssertionFailedError: null > at > org.apache.hadoop.hdfs.server.namenode.TestBlockTokenWithDFS.testAppend(TestBlockTokenWithDFS.java:223) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1469) TestBlockTokenWithDFS fails on trunk
[ https://issues.apache.org/jira/browse/HDFS-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Boudnik updated HDFS-1469:
-------------------------------------

Attachment: log.gz

I ran a slightly modified test (converted to JUnit 4 with better assertion messages, etc.) in a loop with full test output turned on and got the timeout (attaching the log).

> TestBlockTokenWithDFS fails on trunk
> ------------------------------------
>
> Key: HDFS-1469
> URL: https://issues.apache.org/jira/browse/HDFS-1469
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Priority: Blocker
> Attachments: failed-TestBlockTokenWithDFS.txt, log.gz
>
>
> TestBlockTokenWithDFS is failing on trunk:
> Testcase: testAppend took 31.569 sec
> FAILED
> null
> junit.framework.AssertionFailedError: null
> at org.apache.hadoop.hdfs.server.namenode.TestBlockTokenWithDFS.testAppend(TestBlockTokenWithDFS.java:223)
[jira] Commented: (HDFS-1597) Misplaced assertion in FSEditLog.logSync
[ https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986773#action_12986773 ]

Todd Lipcon commented on HDFS-1597:
-----------------------------------

bq. As of HDFS-119 synctxid is set in the finally block, even if the sync was batched

Sorry, didn't say that very clearly. If the sync is batched, it will set {{synctxid}} to *0* in the {{finally}} block! So the next thread comes along, doesn't think it has been batched (though it has) and does yet another sync.

> Misplaced assertion in FSEditLog.logSync
> ----------------------------------------
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: 0.22.0
>
> Attachments: illustrate-test-failure.txt
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already batched in by another thread.
> Will describe the race in a comment.
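
The synctxid reset described in this comment can be illustrated with a small, single-threaded sketch. Everything here is hypothetical: SyncTxidSketch is not FSEditLog, the locking and waiting of the real method are left out, and the guarded assignment in logSyncFixed is only an assumption about what a fix could look like, not a copy of the attached patch.

```java
/** Hypothetical, simplified model of the synctxid bookkeeping (not the Hadoop class). */
class SyncTxidSketch {
    private long txid = 0;      // last transaction id handed out
    private long synctxid = 0;  // highest transaction id believed to be synced
    private int flushes = 0;    // number of physical syncs performed

    long logEdit() {
        return ++txid;
    }

    int flushes() {
        return flushes;
    }

    /** Buggy shape: the finally block assigns synctxid even on the batched early return. */
    void logSyncBuggy(long mytxid) {
        long syncStart = 0;
        try {
            if (mytxid <= synctxid) {
                return;                 // batched: some earlier sync covered this edit
            }
            syncStart = txid;
            flushes++;                  // stands in for flushing the edit streams
        } finally {
            synctxid = syncStart;       // on the batched path this writes 0
        }
    }

    /** Assumed fix: only publish synctxid when this call actually synced. */
    void logSyncFixed(long mytxid) {
        long syncStart = 0;
        boolean synced = false;
        try {
            if (mytxid <= synctxid) {
                return;
            }
            syncStart = txid;
            flushes++;
            synced = true;
        } finally {
            if (synced) {
                synctxid = syncStart;
            }
        }
    }
}
```

Calling the buggy variant three times for the same transaction id performs two physical syncs, because the second (batched) call resets synctxid to 0; the guarded variant performs one.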
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986774#action_12986774 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1595:
----------------------------------------------

> Resurrecting the pipeline with more replicas is a nice idea but I imagine it will be super-complicated, no?

Yes, it is complicated. Nonetheless, it is invaluable for this JIRA, append and other applications like the HBase use case you provided.

> DFSClient may incorrectly detect datanode failure
> -------------------------------------------------
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node, hdfs client
> Affects Versions: 0.20.4
> Reporter: Tsz Wo (Nicholas), SZE
> Priority: Critical
> Attachments: hdfs-1595-idea.txt
[jira] Commented: (HDFS-1597) Misplaced assertion in FSEditLog.logSync
[ https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986770#action_12986770 ]

Todd Lipcon commented on HDFS-1597:
-----------------------------------

Actually in trunk there's a second bug that affects this area of the code. As of HDFS-119 {{synctxid}} is set in the {{finally}} block, even if the sync was batched. So, in this case if there are two threads acting like "Thread A" in my example, then the second one will actually fall past the batching check and trigger the assertion later as well (or even lose edits!)

> Misplaced assertion in FSEditLog.logSync
> ----------------------------------------
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: 0.22.0
>
> Attachments: illustrate-test-failure.txt
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already batched in by another thread.
> Will describe the race in a comment.
[jira] Updated: (HDFS-1597) Misplaced assertion in FSEditLog.logSync
[ https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1597:
------------------------------

Attachment: illustrate-test-failure.txt

Here's a little hack I did that makes the test fail reliably with this error:

Caused by: java.lang.AssertionError: no editlog streams
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:485)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2071)
at org.apache.hadoop.hdfs.server.namenode.TestEditLogRace$Transactions.run(TestEditLogRace.java:115)
at java.lang.Thread.run(Thread.java:662)

(obviously not for commit, just to trigger the race)

> Misplaced assertion in FSEditLog.logSync
> ----------------------------------------
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: 0.22.0
>
> Attachments: illustrate-test-failure.txt
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already batched in by another thread.
> Will describe the race in a comment.
[jira] Commented: (HDFS-1597) Misplaced assertion in FSEditLog.logSync
[ https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986762#action_12986762 ]

Todd Lipcon commented on HDFS-1597:
-----------------------------------

The race is the following:

||Thread A||Thread B||
|mkdirs() | - |
| take FSN lock | - |
| ..logEdit() | - |
| drop FSN lock | - |
| - | enterSafeMode() |
| - | saveNamespace() |
| - | ..logSyncAll() |
| - | ..editLog.close() |
| logSync() | - |

In this case, because Thread A's transaction has already been synced in logSyncAll, it doesn't actually have any work to sync, i.e. it got batched. Accordingly, it's fine that the edit log is closed. But the assertion comes before the check that the sync was already batched, so it fires.

This causes occasional failures of TestEditLog on one of our Hudson builds now that assertions are enabled.

> Misplaced assertion in FSEditLog.logSync
> ----------------------------------------
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: 0.22.0
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already batched in by another thread.
> Will describe the race in a comment.
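
The interleaving in the table above can be replayed deterministically on a toy model. The class below is a hypothetical stand-in, not FSEditLog: the method names mirror the ones in the comment, but the bodies are assumptions, and an explicit exception replaces the assert so the sketch behaves the same with or without -ea. It shows the ordering the issue asks for, with the batching check first and the stream check second.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, heavily simplified stand-in for FSEditLog (not the Hadoop class). */
class EditLogSketch {
    private final List<String> editStreams = new ArrayList<String>();
    private long txid = 0;      // last transaction id handed out
    private long synctxid = 0;  // highest transaction id already synced

    EditLogSketch() {
        editStreams.add("edits");
    }

    /** Thread A: record an edit and remember its transaction id. */
    synchronized long logEdit() {
        return ++txid;
    }

    /** Thread B: sync everything logged so far. */
    synchronized void logSyncAll() {
        synctxid = txid;
    }

    /** Thread B: close the log, leaving no open streams. */
    synchronized void close() {
        editStreams.clear();
    }

    /**
     * Returns true if this call had to sync, false if the transaction was
     * already covered by an earlier sync (batched).
     */
    synchronized boolean logSync(long mytxid) {
        // Batching check first: a batched caller has no work, so it does
        // not need any open stream.
        if (mytxid <= synctxid) {
            return false;
        }
        // Only a caller with real work requires an open stream.
        if (editStreams.isEmpty()) {
            throw new IllegalStateException("no editlog streams");
        }
        synctxid = txid;
        return true;
    }
}
```

Replaying logEdit, logSyncAll, close and then logSync for the first transaction returns without error in this ordering; an edit logged after close still fails the stream check, which is the case the assertion is meant to catch.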
[jira] Created: (HDFS-1597) Misplaced assertion in FSEditLog.logSync
Misplaced assertion in FSEditLog.logSync Key: HDFS-1597 URL: https://issues.apache.org/jira/browse/HDFS-1597 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.22.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: 0.22.0 The top of FSEditLog.logSync has the following assertion: {code} assert editStreams.size() > 0 : "no editlog streams"; {code} which should actually come after checking to see if the sync was already batched in by another thread. Will describe the race in a comment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1582) Remove auto-generated native build files
[ https://issues.apache.org/jira/browse/HDFS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986743#action_12986743 ] Eli Collins commented on HDFS-1582: --- Patch looks good. What testing has been done to check the native part of the build, eg run libhdfs or fuse-dfs? > Remove auto-generated native build files > > > Key: HDFS-1582 > URL: https://issues.apache.org/jira/browse/HDFS-1582 > Project: Hadoop HDFS > Issue Type: Improvement > Components: contrib/libhdfs >Reporter: Roman Shaposhnik >Assignee: Roman Shaposhnik > Fix For: 0.23.0 > > Attachments: HADOOP-6436.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > The repo currently includes the automake and autoconf generated files for the > native build. Per discussion on HADOOP-6421 let's remove them and use the > host's automake and autoconf. We should also do this for libhdfs and > fuse-dfs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms
[ https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986718#action_12986718 ] Todd Lipcon commented on HDFS-1580: --- Above sounds reasonable with respect to 1073. Is HDFS-1557 almost ready to go? (should I take a look at it?) > Add interface for generic Write Ahead Logging mechanisms > > > Key: HDFS-1580 > URL: https://issues.apache.org/jira/browse/HDFS-1580 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ivan Kelly > Attachments: generic_wal_iface.txt > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms
[ https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986709#action_12986709 ]

Jitendra Nath Pandey commented on HDFS-1580:
--------------------------------------------

The interface also needs a counterpart of the roll-edits method. Currently, the first thing done for checkpointing is to roll the edit logs, i.e. an edits.new is created. As I understand it, in HDFS-1073 the edit files will be numbered instead of using edits.new. At least FileWriteAheadLog will need to roll to keep edit files from getting too big, even if it is not required for checkpointing.

The interface should also provide methods to get all previously rotated edit log files (or ledgers) and also the current "in-progress" edit log file or ledger.

As a suggestion, the interface could have a concept of log handles, where each handle uniquely corresponds to a single edit log file or ledger. Thus, we could have a method getAllLogs that returns a list of log handles. I think ordered handles will fit with the HDFS-1073 model (need to confirm). A LogHandle can also carry some metadata, for example the first transaction id, whether it is current or old, etc. HDFS-1073 proposes to store the first transaction id in the edit file name itself, which could be used to populate the log handle in the case of FileWriteAheadLog. The input and output streams should be in the LogHandle, so that any log file can be read. A LogHandle for an older file should not let one create an output stream.

A method to purge the edit logs might also be needed, i.e. given a handle, remove the corresponding log file (or ledger).

> Add interface for generic Write Ahead Logging mechanisms
> --------------------------------------------------------
>
> Key: HDFS-1580
> URL: https://issues.apache.org/jira/browse/HDFS-1580
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Ivan Kelly
> Attachments: generic_wal_iface.txt
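
One possible reading of the log-handle suggestion, as a rough sketch only. Apart from getAllLogs and the LogHandle name, which come from the comment, every method name and signature below is an assumption, and the in-memory implementation exists purely to make the sketch runnable; it is not taken from generic_wal_iface.txt or from any Hadoop code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handle for exactly one edit log file or ledger. */
interface LogHandle {
    long getFirstTxId();          // metadata: first transaction id in this log
    boolean isInProgress();       // metadata: current log or an older one
    InputStream getInputStream() throws IOException;
    OutputStream getOutputStream() throws IOException; // refused for older logs
}

/** Hypothetical write-ahead-log interface built around handles. */
interface WriteAheadLogSketch {
    List<LogHandle> getAllLogs();              // ordered, oldest first
    LogHandle roll(long firstTxIdOfNewLog);    // counterpart of roll-edits
    void purge(LogHandle handle);              // remove one log file or ledger
}

/** Toy in-memory implementation, only here to exercise the interfaces. */
class InMemoryWalSketch implements WriteAheadLogSketch {
    private static final class Segment implements LogHandle {
        private final long firstTxId;
        private boolean inProgress = true;
        private final ByteArrayOutputStream data = new ByteArrayOutputStream();

        Segment(long firstTxId) { this.firstTxId = firstTxId; }

        public long getFirstTxId() { return firstTxId; }
        public boolean isInProgress() { return inProgress; }
        public InputStream getInputStream() {
            return new ByteArrayInputStream(data.toByteArray());
        }
        public OutputStream getOutputStream() throws IOException {
            if (!inProgress) {
                throw new IOException("log starting at txid " + firstTxId + " is read-only");
            }
            return data;
        }
    }

    private final List<Segment> segments = new ArrayList<Segment>();

    InMemoryWalSketch() { segments.add(new Segment(1)); }

    public List<LogHandle> getAllLogs() {
        return new ArrayList<LogHandle>(segments);
    }

    public LogHandle roll(long firstTxIdOfNewLog) {
        if (!segments.isEmpty()) {
            segments.get(segments.size() - 1).inProgress = false; // finalize the old log
        }
        Segment next = new Segment(firstTxIdOfNewLog);
        segments.add(next);
        return next;
    }

    public void purge(LogHandle handle) {
        segments.remove(handle);
    }
}
```

In this sketch a rolled log stays readable through its handle but refuses to hand out an output stream, which is one way to express "Log-Handle for older files should not let one create an output stream".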
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986634#action_12986634 ]

Todd Lipcon commented on HDFS-1595:
-----------------------------------

Another option not mentioned above, which we use in HBase, is to periodically poll the pipeline from the "application code" to find out how many replicas are in it (there's an API to do that in trunk). If we see the pipeline drop below 3 replicas, we roll the output to a new file (hence a new pipeline). For something like a commit log, where we really just care about a stream of records, the actual file boundaries have no semantic meaning, so rolling is "free" and we get a full pipeline.

For MR it's a bit trickier to do that in general, but it might be useful for certain output formats if we can figure out how to make the APIs non-gross. E.g. if a reducer is writing part-0 and gets a pipeline failure, it could continue writing to part-0.1 or something.

> DFSClient may incorrectly detect datanode failure
> -------------------------------------------------
>
>                 Key: HDFS-1595
>                 URL: https://issues.apache.org/jira/browse/HDFS-1595
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 0.20.4
>            Reporter: Tsz Wo (Nicholas), SZE
>            Priority: Critical
>         Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take D out of the pipeline, reconstruct the write pipeline with the remaining datanodes, and then continue writing.
> However, we found a case where the faulty machine F is actually S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) such that the interface works fine when sending out a small amount of data, say 1MB, but fails when sending out a large amount of data, say 100MB. Reading is working fine for any data size.
> It is even worse if F is the first datanode in the pipeline. Consider the following:
> # DFSClient creates a pipeline with three datanodes. The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F reports that the second datanode has an error.
> # DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
> # The pipeline now has two datanodes, but (2) and (3) repeat.
> # Now, only F remains in the pipeline. DFSClient continues writing with one replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
> # The replication fails for the same reason. D reports to the NameNode that the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
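The poll-and-roll idea from Todd's comment can be sketched roughly as follows. This is only an illustration: `ReplicatedOutput`, `OutputFactory`, `currentReplicaCount()` and the `commitlog.N` file names are hypothetical stand-ins, not the HDFS or HBase API the comment refers to.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: roll a record log to a new file when its write pipeline shrinks. */
public class RollingLogSketch {

  /** Stand-in for an output stream that can report its live replica count (not a real HDFS type). */
  interface ReplicatedOutput {
    int currentReplicaCount();
    void append(String record);
  }

  /** Stand-in factory: opening a new file is assumed to give a fresh, full pipeline. */
  interface OutputFactory {
    ReplicatedOutput open(String fileName);
  }

  private final OutputFactory factory;
  private final int minReplicas;
  private int generation = 0;
  private ReplicatedOutput current;

  RollingLogSketch(OutputFactory factory, int minReplicas) {
    this.factory = factory;
    this.minReplicas = minReplicas;
    this.current = factory.open(fileName());
  }

  private String fileName() {
    return "commitlog." + generation;
  }

  /** Appends a record, first rolling to a new file if the pipeline is degraded. */
  void append(String record) {
    if (current.currentReplicaCount() < minReplicas) {
      generation++;
      current = factory.open(fileName());
    }
    current.append(record);
  }

  /** Demo: the first file loses replicas after two appends; later files stay healthy. */
  static List<String> demo() {
    List<String> written = new ArrayList<>();
    OutputFactory factory = name -> new ReplicatedOutput() {
      private int appended = 0;

      public int currentReplicaCount() {
        return name.endsWith(".0") && appended >= 2 ? 1 : 3;
      }

      public void append(String record) {
        appended++;
        written.add(name + ":" + record);
      }
    };
    RollingLogSketch log = new RollingLogSketch(factory, 3);
    for (String r : new String[] {"a", "b", "c", "d"}) {
      log.append(r);
    }
    return written;
  }
}
```

Because the records form a plain stream, the reader only needs to replay `commitlog.0`, `commitlog.1`, ... in order; the file boundary carries no meaning, which is why the comment calls rolling "free".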
[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks
[ https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Goodhope updated HDFS-863:
------------------------------

    Attachment: HDFS-863.patch

Finally got my auto format set up right and it found a couple more issues my eyes missed. Only formatted the sections I worked on.

> Potential deadlock in TestOverReplicatedBlocks
> ----------------------------------------------
>
>                 Key: HDFS-863
>                 URL: https://issues.apache.org/jira/browse/HDFS-863
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>            Reporter: Todd Lipcon
>            Assignee: Ken Goodhope
>             Fix For: 0.23.0
>
>         Attachments: cycle.png, HDFS-863.patch, HDFS-863.patch, HDFS-863.patch, HDFS-863.patch, TestNodeCount.png
>
>
> TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on namesystem.heartbeats without synchronizing on namesystem first. Other places in the code synchronize namesystem, then heartbeats. It's probably unlikely to occur in this test case, but it's a simple fix.
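The lock-ordering rule behind this issue (take namesystem first, then heartbeats, on every code path) can be illustrated with a small sketch. The two `Object` monitors and the method names are hypothetical stand-ins for the real HDFS objects; the point is only that two threads which always acquire the locks in the same order cannot block each other in a cycle.

```java
/** Hypothetical sketch of consistent lock ordering; not the actual HDFS-863 patch. */
public class LockOrderSketch {
  // Stand-ins for the two monitors named in the issue.
  private final Object namesystem = new Object();
  private final Object heartbeats = new Object();
  private int updates = 0;

  /** Production-style path: namesystem first, then heartbeats. */
  void processReport() {
    synchronized (namesystem) {
      synchronized (heartbeats) {
        updates++;
      }
    }
  }

  /** Test-style path, written to use the same order rather than taking heartbeats alone. */
  void adjustHeartbeatInTest() {
    synchronized (namesystem) {
      synchronized (heartbeats) {
        updates++;
      }
    }
  }

  /** Runs both paths concurrently and returns the total number of updates. */
  static int demo() {
    LockOrderSketch s = new LockOrderSketch();
    Thread a = new Thread(() -> {
      for (int i = 0; i < 10_000; i++) s.processReport();
    });
    Thread b = new Thread(() -> {
      for (int i = 0; i < 10_000; i++) s.adjustHeartbeatInTest();
    });
    a.start();
    b.start();
    try {
      a.join();
      b.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return s.updates;
  }
}
```

If one path instead held heartbeats and then needed namesystem while another thread held namesystem and waited for heartbeats, each would wait on the other indefinitely; that is the cycle the issue title refers to.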
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986624#action_12986624 ]

Kan Zhang commented on HDFS-1595:
---------------------------------

Looks like we have 3 options when the pipeline is reduced to a single datanode F:
# Stop writing and fail fast.
# Recruit a new set of datanodes to be added to F, with F still being the first datanode in the pipeline.
# Finish writing the block to F and ask the NN to wait for 2 replicas before closing the block.

1) has the drawback of failing even when F is healthy, which is undesirable. Both 2) and 3) will fail when F is bad, and both will likely succeed when F is healthy. Of the two, I'd prefer 2) over 3), since how soon a replica can be replicated from one datanode to another depends on many factors. 3) is less predictable than 2), where the client actively sets up a new pipeline and resumes writing. One long-tail task that takes much longer to finish than the others impacts the total running time of the whole job.
[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks
[ https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Goodhope updated HDFS-863:
------------------------------

    Attachment: HDFS-863.patch

Thought I caught all of those, but obviously not. Found a few more and removed them as well. Thanks.
[jira] Updated: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1595:
-----------------------------------------

    Description:
Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take D out of the pipeline, reconstruct the write pipeline with the remaining datanodes, and then continue writing.

However, we found a case where the faulty machine F is actually S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) such that the interface works fine when sending out a small amount of data, say 1MB, but fails when sending out a large amount of data, say 100MB. Reading is working fine for any data size.

It is even worse if F is the first datanode in the pipeline. Consider the following:
# DFSClient creates a pipeline with three datanodes. The first datanode is F.
# F catches an IOException when writing to the second datanode. Then, F reports that the second datanode has an error.
# DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
# The pipeline now has two datanodes, but (2) and (3) repeat.
# Now, only F remains in the pipeline. DFSClient continues writing with one replica in F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
# The replication fails for the same reason. D reports to the NameNode that the replica in F is corrupted.
# The NameNode marks the replica in F as corrupted.
# The block is corrupted since no replica is available.

This is a *data loss* scenario.

  was:
Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take D out of the pipeline, reconstruct the write pipeline with the remaining datanodes, and then continue writing.

However, we found a case where the faulty machine F is actually S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) such that the interface works fine when sending out a small amount of data, say 1MB, but fails when sending out a large amount of data, say 100MB.

It is even worse if F is the first datanode in the pipeline. Consider the following:
# DFSClient creates a pipeline with three datanodes. The first datanode is F.
# F catches an IOException when writing to the second datanode. Then, F reports that the second datanode has an error.
# DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
# The pipeline now has two datanodes, but (2) and (3) repeat.
# Now, only F remains in the pipeline. DFSClient continues writing with one replica in F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
# The replication fails from the same reason. D reports to the NameNode that the replica in F is corrupted.
# The NameNode marks the replica in F as corrupted.
# The block is corrupted since no replica is available.

This is a *data loss* scenario.


Yes, reading is working fine for any data size. (I also updated the description.)
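The cascade in the numbered steps of the description can be reproduced with a toy model. This is a sketch of the blame rule only (a failed send removes the receiver, never the sender); the class and method names are made up for illustration and nothing here is DFSClient code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "blame the downstream node" rule described in HDFS-1595. */
public class PipelineBlameSketch {

  /**
   * Returns the pipeline that remains when faultySender cannot forward data.
   * Each failed send is attributed to the next node, which is then removed,
   * so the faulty sender itself always survives.
   */
  static List<String> shrink(List<String> pipeline, String faultySender) {
    List<String> nodes = new ArrayList<>(pipeline);
    int i = nodes.indexOf(faultySender);
    // While the faulty node still has a downstream neighbour, its send fails
    // and the neighbour is blamed and dropped from the pipeline.
    while (i >= 0 && i + 1 < nodes.size()) {
      nodes.remove(i + 1);
    }
    return nodes;
  }

  public static void main(String[] args) {
    // F first in the pipeline: both healthy nodes are removed, only F is left.
    System.out.println(shrink(List.of("F", "D1", "D2"), "F"));
    // F in the middle: the upstream node survives, the downstream one does not.
    System.out.println(shrink(List.of("D1", "F", "D2"), "F"));
  }
}
```

With F at the head, the only remaining replica lives on the one machine that cannot send it anywhere, which is what turns a network fault into the data loss scenario above.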
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986612#action_12986612 ]

Hairong Kuang commented on HDFS-1595:
-------------------------------------

> In the case we found, F has a faulty network interface (or a faulty switch port) in such a way that the faulty network interface works fine when sending out a small amount of data, say 1MB, but it fails when sending out a large amount of data, say 100MB.

So this faulty node F has no problem receiving a large amount of data? That seems an extreme case. Normally the client should get an error when sending data to the faulty node F and thus remove F from the write pipeline.
[jira] Commented: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks
[ https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986611#action_12986611 ]

Todd Lipcon commented on HDFS-863:
----------------------------------

Hi Ken. Looks good except for one nit: there are some hard tab characters (e.g. TestNodeCount.java:117). The style guide is 2-space indentation. Mind reformatting those?
[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks
[ https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Goodhope updated HDFS-863:
------------------------------

    Attachment: HDFS-863.patch
[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks
[ https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Goodhope updated HDFS-863:
------------------------------

    Attachment:     (was: HDFS-863.patch)
[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks
[ https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Goodhope updated HDFS-863:
------------------------------

    Attachment: HDFS-863.patch

Agreed, and done. Reran tests with the following results:

[junit] Test org.apache.hadoop.hdfs.server.namenode.TestStorageRestore FAILED
[junit] Test org.apache.hadoop.hdfs.TestFileConcurrentReader FAILED
[junit] Test org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark FAILED

Skipped src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestLargeDirectoryDelete.java since that test stalled last time.
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986589#action_12986589 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1595:
----------------------------------------------

> I'll be honest: I don't know the new Append code in trunk very well. I thought the client called nn.updatePipeline() whenever a node was removed from the pipeline. That's not the case?

You are actually right about the new append code. I am sorry, I was mostly looking at the 0.20 code. It will work for 0.22, but we need to change the protocol for 0.20. Nice.
[jira] Commented: (HDFS-1295) Improve namenode restart times by short-circuiting the first block reports from datanodes
[ https://issues.apache.org/jira/browse/HDFS-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986564#action_12986564 ]

Matt Foley commented on HDFS-1295:
----------------------------------

Hi Dhruba, I think this is a really important improvement to startup time, so I will try to get some contributors here to review it. Regarding this dangling issue, can you please describe any risks you see from corrupt replicas, with this shortcut in place? Thanks.

> Improve namenode restart times by short-circuiting the first block reports from datanodes
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-1295
>                 URL: https://issues.apache.org/jira/browse/HDFS-1295
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: shortCircuitBlockReport_1.txt
>
>
> The namenode restart is dominated by the performance of processing block reports. On a 2000 node cluster with 90 million blocks, block report processing takes 30 to 40 minutes. The namenode "diffs" the contents of the incoming block report with the contents of the blocks map, and then applies these diffs to the blocksMap, but in reality there is no need to compute the "diff" because this is the first block report from the datanode.
> This code change improves block report processing time by 300%.
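The short-circuit described in the issue can be sketched with a toy blocks map. The attached patch is not shown in this thread, so everything below (class name, the per-node `Set<Long>` representation, the `diffsComputed` counter) is an assumption made for illustration; it only shows the idea that a first report has nothing to be diffed against.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy sketch: skip the diff when a datanode reports for the first time. */
public class BlockReportSketch {
  // datanode id -> block ids the namenode believes it holds (stand-in for the blocks map)
  private final Map<String, Set<Long>> blocksByNode = new HashMap<>();
  int diffsComputed = 0;

  void processReport(String node, Set<Long> reported) {
    Set<Long> known = blocksByNode.get(node);
    if (known == null) {
      // First report since restart: nothing to diff against, so add everything directly.
      blocksByNode.put(node, new HashSet<>(reported));
      return;
    }
    // Later reports: compute what was added and what was removed, then apply both.
    diffsComputed++;
    Set<Long> toAdd = new HashSet<>(reported);
    toAdd.removeAll(known);
    Set<Long> toRemove = new HashSet<>(known);
    toRemove.removeAll(reported);
    known.addAll(toAdd);
    known.removeAll(toRemove);
  }

  Set<Long> blocksOf(String node) {
    return blocksByNode.get(node);
  }
}
```

Matt's question about corrupt replicas applies to the first branch: whatever validation the diff path performs on individual replicas would also have to happen, or be shown unnecessary, when the report is applied wholesale.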
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986560#action_12986560 ]

Todd Lipcon commented on HDFS-1595:
-----------------------------------

I'll be honest: I don't know the new Append code in trunk very well. I thought the client called nn.updatePipeline() whenever a node was removed from the pipeline. That's not the case?
[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure
[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986554#action_12986554 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1595:
----------------------------------------------

{code}
+int pipelineReplication = biuc.getNumExpectedLocations();
{code}
Hi Todd, I might be wrong, but I think it is not as simple as in [hdfs-1595-idea.txt|https://issues.apache.org/jira/secure/attachment/12469243/hdfs-1595-idea.txt]. The value of pipelineReplication above is not equal to the actual number of datanodes in the pipeline. In other words, it won't be updated when datanodes are removed from the pipeline. We need to change some protocol in order to implement this idea.
[jira] Created: (HDFS-1596) Move secondary namenode checkpoint configs from core-default.xml to hdfs-default.xml
Move secondary namenode checkpoint configs from core-default.xml to hdfs-default.xml
------------------------------------------------------------------------------------

                 Key: HDFS-1596
                 URL: https://issues.apache.org/jira/browse/HDFS-1596
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: name-node
            Reporter: Patrick Angeles

The following configs are in core-default.xml, but are really read by the Secondary Namenode. These should be moved to hdfs-default.xml for consistency.

fs.checkpoint.dir
  Default: ${hadoop.tmp.dir}/dfs/namesecondary
  Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.

fs.checkpoint.edits.dir
  Default: ${fs.checkpoint.dir}
  Determines where on the local filesystem the DFS secondary name node should store the temporary edits to merge. If this is a comma-delimited list of directories then the edits are replicated in all of the directories for redundancy. Default value is same as fs.checkpoint.dir.

fs.checkpoint.period
  Default: 3600
  The number of seconds between two periodic checkpoints.

fs.checkpoint.size
  Default: 67108864
  The size of the current edit log (in bytes) that triggers a periodic checkpoint even if the fs.checkpoint.period hasn't expired.
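How fs.checkpoint.period and fs.checkpoint.size interact, as described above, can be shown with a small sketch. It uses `java.util.Properties` as a stand-in for the Hadoop configuration object, and the method name and the inclusive comparisons are assumptions; the Secondary Namenode's actual check may differ in detail.

```java
import java.util.Properties;

/** Sketch of the checkpoint trigger implied by the two settings; not Secondary Namenode code. */
public class CheckpointTriggerSketch {

  /**
   * A checkpoint is due when the period has elapsed, or earlier if the
   * current edit log has grown to the configured size.
   */
  static boolean shouldCheckpoint(Properties conf, long secondsSinceLast, long editLogBytes) {
    // Defaults mirror the values listed in the issue.
    long period = Long.parseLong(conf.getProperty("fs.checkpoint.period", "3600"));
    long size = Long.parseLong(conf.getProperty("fs.checkpoint.size", "67108864"));
    return secondsSinceLast >= period || editLogBytes >= size;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(shouldCheckpoint(conf, 10, 100));         // neither limit reached
    System.out.println(shouldCheckpoint(conf, 3600, 0));         // period elapsed
    System.out.println(shouldCheckpoint(conf, 10, 67_108_864L)); // edit log reached 64 MB
  }
}
```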
[jira] Updated: (HDFS-1594) When the disk becomes full Namenode is getting shutdown and not able to recover
[ https://issues.apache.org/jira/browse/HDFS-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated HDFS-1594:

    Status: Open (was: Patch Available)

> When the disk becomes full Namenode is getting shutdown and not able to recover
> ---
>
>                 Key: HDFS-1594
>                 URL: https://issues.apache.org/jira/browse/HDFS-1594
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.21.0, 0.21.1, 0.22.0
>         Environment: Linux linux124 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Devaraj K
>         Attachments: hadoop-root-namenode-linux124.log, HDFS-1594.patch
>
>
> When the disk becomes full, the NameNode shuts down. If we then try to
> restart it after freeing up space, it does not start and throws the
> exception below.
> {code:xml}
> 2011-01-24 23:23:33,727 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
> java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:180)
>   at org.apache.hadoop.io.UTF8.readFields(UTF8.java:117)
>   at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.readString(FSImageSerialization.java:201)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:185)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:60)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1089)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1041)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:487)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:149)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:306)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:284)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:328)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:356)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:577)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:570)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1529)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1538)
> 2011-01-24 23:23:33,729 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:180)
>   at org.apache.hadoop.io.UTF8.readFields(UTF8.java:117)
>   at org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.readString(FSImageSerialization.java:201)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:185)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:60)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1089)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1041)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:487)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:149)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:306)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:284)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:328)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:356)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:577)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:570)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1529)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1538)
> 2011-01-24 23:23:33,730 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at linux124/10.18.52.124
> /
> {code}
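
The trace for HDFS-1594 ends in DataInputStream.readFully, called from FSImageSerialization.readString while edit records are being loaded. One plausible reading, not stated in the report itself, is that the full disk left a partially written record at the end of the edit log. The sketch below is not Hadoop code: the class TruncatedRecordDemo, its methods, and the record layout (a 2-byte length prefix followed by the string bytes) are all hypothetical. It only shows how a reader that trusts a length prefix hits EOFException when the payload is cut short.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncatedRecordDemo {

    // Hypothetical record layout: 2-byte big-endian length, then the UTF-8 bytes.
    static byte[] writeRecord(String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + payload.length);
        buf.putShort((short) payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Reads one record. readFully throws EOFException when fewer bytes
    // remain in the stream than the length prefix promised.
    static String readRecord(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int length = in.readUnsignedShort();
            byte[] payload = new byte[length];
            in.readFully(payload);
            return new String(payload, StandardCharsets.UTF_8);
        }
    }

    // Returns true if the record loads completely, false if it is truncated.
    static boolean loads(byte[] data) {
        try {
            readRecord(data);
            return true;
        } catch (EOFException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] complete = writeRecord("/user/example/file");
        // Simulate a write that stopped before the whole record reached disk.
        byte[] truncated = Arrays.copyOf(complete, complete.length - 4);
        System.out.println("complete record loads:  " + loads(complete));
        System.out.println("truncated record loads: " + loads(truncated));
    }
}
```

In this sketch the caller decides what a truncated trailing record means; here loads simply reports it rather than letting the exception abort startup.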