[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196761#comment-15196761 ]

Lin Yiqun commented on HDFS-9904:

Thanks [~kihwal] for the commit!

> testCheckpointCancellationDuringUpload occasionally fails
> ---------------------------------------------------------
>
>                 Key: HDFS-9904
>                 URL: https://issues.apache.org/jira/browse/HDFS-9904
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.7.3
>            Reporter: Kihwal Lee
>            Assignee: Lin Yiqun
>             Fix For: 2.7.3
>
>         Attachments: HDFS-9904.001.patch, HDFS-9904.002.patch
>
>
> The failure was at the end of the test case where the txid of the standby
> (former active) is checked. Since the checkpoint/uploading was canceled, it
> is not supposed to have the new checkpoint. Looking at the test log, that was
> still the case, but the standby then did checkpoint on its own and bumped up
> the txid, right before the check was performed.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195578#comment-15195578 ]

Hudson commented on HDFS-9904:

FAILURE: Integrated in Hadoop-trunk-Commit #9464 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9464/])
HDFS-9904. testCheckpointCancellationDuringUpload occasionally fails. (kihwal: rev d4574017845cfa7521e703f80efd404afd09b8c4)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195554#comment-15195554 ]

Kihwal Lee commented on HDFS-9904:

I've committed this to trunk through branch-2.7. Thanks for working on this, Lin Yiqun.
[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195531#comment-15195531 ]

Kihwal Lee commented on HDFS-9904:

+1. I've verified that the config is only set for the specific test case.
[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184764#comment-15184764 ]

Lin Yiqun commented on HDFS-9904:

Sorry about my last comment. The test case {{testNonPrimarySBNUploadFSImage}} has no problem; I overlooked that the last parameter, txid, has changed. Please ignore that part of my comment.
[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184333#comment-15184333 ]

Lin Yiqun commented on HDFS-9904:

Thanks [~kihwal] for the detailed analysis. I had overlooked that.
{quote}
Also, it should be set before the namenode is started and should be reset for other test cases.
{quote}
The method {{testCheckpointCancellationDuringUpload}} already restarts all namenodes after changing the configuration, so setting the configuration at this point is OK.
{code}
    // don't compress, we want a big image
    for (int i = 0; i < NUM_NNS; i++) {
      cluster.getConfiguration(i).setBoolean(
          DFSConfigKeys.DFS_IMAGE_COMPRESS_KEY, false);
    }

    // Throttle SBN upload to make it hang during upload to ANN
    for (int i = 1; i < NUM_NNS; i++) {
      cluster.getConfiguration(i).setLong(
          DFSConfigKeys.DFS_IMAGE_TRANSFER_RATE_KEY, 100);
    }
    for (int i = 0; i < NUM_NNS; i++) {
      cluster.restartNameNode(i);
    }
{code}
There seems to be a similar problem in {{testNonPrimarySBNUploadFSImage}}. When the first namenode transitions to standby, it will also do a checkpoint, because 10 is bigger than 5 (the configured value). But the checkpoint should actually be uploaded by one of the standby nodes.
{code}
    doEdits(0, 10);
    cluster.transitionToStandby(0);
{code}
Am I right? If so, we can solve both problems in this JIRA. I have updated the patch to address your comments.
[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183248#comment-15183248 ]

Kihwal Lee commented on HDFS-9904:

Thanks for working on the fix. The config is used to determine whether to create a new checkpoint. After loading/replaying edits, a standby checks how many transactions have gone by since the last checkpoint. If the number is greater than the configured limit, it does a checkpoint. As you can see from the test output, there are around 106 transactions at the end. To prevent the standby from checkpointing, the config value should be bigger than that, e.g. 1000. Also, it should be set before the namenode is started and should be reset for other test cases.
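The threshold check described in the comment above can be illustrated with a minimal, self-contained Java sketch. It is not the actual Hadoop standby checkpointer code: the class name {{CheckpointThresholdSketch}}, the method {{shouldCheckpoint}}, and its parameters are hypothetical names chosen for illustration, and the strict "greater than" comparison simply follows the wording of the comment. The values 106, 5, and 1000 are the ones mentioned in this thread.

```java
// Illustrative sketch only; NOT the real Hadoop implementation.
// All names here are hypothetical and chosen for this example.
final class CheckpointThresholdSketch {

    /**
     * Returns true when the number of transactions applied since the last
     * checkpoint is greater than the configured limit (the role played by
     * DFS_NAMENODE_CHECKPOINT_TXNS_KEY in the discussion).
     */
    static boolean shouldCheckpoint(long lastAppliedTxId,
                                    long lastCheckpointTxId,
                                    long txnLimit) {
        long uncheckpointedTxns = lastAppliedTxId - lastCheckpointTxId;
        return uncheckpointedTxns > txnLimit;
    }

    public static void main(String[] args) {
        // About 106 transactions since checkpoint txid 0, as in the test output.
        long lastApplied = 106L;
        long lastCheckpoint = 0L;

        // With a small limit such as 5, the standby decides to checkpoint.
        System.out.println(shouldCheckpoint(lastApplied, lastCheckpoint, 5L));

        // With a limit of 1000, as suggested, it does not.
        System.out.println(shouldCheckpoint(lastApplied, lastCheckpoint, 1000L));
    }
}
```

Under these assumptions, a limit of 5 makes the former active checkpoint on its own as soon as it becomes a standby, which is the behavior that broke the txid assertion, while a limit of 1000 leaves the checkpoint txid unchanged.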
[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180154#comment-15180154 ]

Kihwal Lee commented on HDFS-9904:

The stack trace from the test failure:
{noformat}
java.lang.AssertionError: expected:<0> but was:<106>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testCheckpointCancellationDuringUpload(TestStandbyCheckpoints.java:328)
{noformat}
We could set DFS_NAMENODE_CHECKPOINT_TXNS_KEY differently on the first NN to keep it from doing a checkpoint when it becomes a standby.