[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-15 Thread Lin Yiqun (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196761#comment-15196761
 ] 

Lin Yiqun commented on HDFS-9904:
-

Thanks [~kihwal] for the commit!

> testCheckpointCancellationDuringUpload occasionally fails 
> --
>
> Key: HDFS-9904
> URL: https://issues.apache.org/jira/browse/HDFS-9904
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.3
>Reporter: Kihwal Lee
>Assignee: Lin Yiqun
> Fix For: 2.7.3
>
> Attachments: HDFS-9904.001.patch, HDFS-9904.002.patch
>
>
> The failure was at the end of the test case, where the txid of the standby 
> (former active) is checked. Since the checkpoint/uploading was canceled, it 
> is not supposed to have the new checkpoint. Looking at the test log, that was 
> still the case, but the standby then did a checkpoint on its own and bumped up 
> the txid right before the check was performed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195578#comment-15195578
 ] 

Hudson commented on HDFS-9904:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9464 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9464/])
HDFS-9904. testCheckpointCancellationDuringUpload occasionally fails. (kihwal: 
rev d4574017845cfa7521e703f80efd404afd09b8c4)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java




[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-15 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195554#comment-15195554
 ] 

Kihwal Lee commented on HDFS-9904:
--

I've committed this to trunk through branch-2.7. Thanks for working on this, 
Lin Yiqun.



[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-15 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195531#comment-15195531
 ] 

Kihwal Lee commented on HDFS-9904:
--

+1 I've verified that the config is only set for the specific test case.



[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-08 Thread Lin Yiqun (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184764#comment-15184764
 ] 

Lin Yiqun commented on HDFS-9904:
-

Sorry about my last comment. The test case {{testNonPrimarySBNUploadFSImage}} has 
no problem; I overlooked that the last param, txid, has changed. Please 
disregard that part of my comment.



[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-07 Thread Lin Yiqun (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184333#comment-15184333
 ] 

Lin Yiqun commented on HDFS-9904:
-

Thanks [~kihwal] for the concrete analysis. I had overlooked that.
{quote}
Also, it should be set before the namenode is started and should be reset for 
other test cases.
{quote}
In the method {{testCheckpointCancellationDuringUpload}}, all namenodes are 
already restarted afterwards, so I think setting the configuration here is OK.
{code}
// don't compress, we want a big image
for (int i = 0; i < NUM_NNS; i++) {
  cluster.getConfiguration(i).setBoolean(
  DFSConfigKeys.DFS_IMAGE_COMPRESS_KEY, false);
}

// Throttle SBN upload to make it hang during upload to ANN
for (int i = 1; i < NUM_NNS; i++) {
  cluster.getConfiguration(i).setLong(
  DFSConfigKeys.DFS_IMAGE_TRANSFER_RATE_KEY, 100);
}
for (int i = 0; i < NUM_NNS; i++) {
  cluster.restartNameNode(i);
}
{code}
It seems that there is a similar problem in 
{{testNonPrimarySBNUploadFSImage}}. If the first namenode changes to standby, 
it will also do a checkpoint, because 10 is bigger than 5 (the configured 
value). But the checkpoint should actually be uploaded by one of the standby nodes.
{code}
doEdits(0, 10);
cluster.transitionToStandby(0);
{code}
Am I right? If so, we can solve both in this JIRA. Finally, I have updated the 
patch to address your comments.





[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-07 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183248#comment-15183248
 ] 

Kihwal Lee commented on HDFS-9904:
--

Thanks for working on the fix. The config is used to determine whether to 
create a new checkpoint. After loading/replaying edits, a standby checks 
how many transactions have gone by since the last checkpoint. If the number is 
greater than the configured limit, it does a checkpoint. As you can see from 
the test output, there are around 106 transactions at the end. To 
prevent the standby from checkpointing, the config value should be bigger than 
that, e.g. 1000. Also, it should be set before the namenode is started and 
should be reset for other test cases.
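
To make the threshold check above concrete, here is a minimal, self-contained sketch of the decision. It is an illustrative model only, not the actual standby checkpointer code: the class {{CheckpointTrigger}}, its method, and the last-checkpoint txid of 0 are hypothetical, and the exact boundary comparison in Hadoop may differ from the plain "greater than" used here.

```java
// Illustrative model only: CheckpointTrigger is a hypothetical class,
// not part of Hadoop. It mirrors the rule described in the comment above.
final class CheckpointTrigger {
    // Stands in for the configured transaction limit
    // (DFS_NAMENODE_CHECKPOINT_TXNS_KEY in the test).
    private final long txnLimit;

    CheckpointTrigger(long txnLimit) {
        this.txnLimit = txnLimit;
    }

    /**
     * Returns true when the number of transactions applied since the last
     * checkpoint is greater than the configured limit.
     */
    boolean shouldCheckpoint(long lastAppliedTxId, long lastCheckpointTxId) {
        long uncheckpointed = lastAppliedTxId - lastCheckpointTxId;
        return uncheckpointed > txnLimit;
    }

    public static void main(String[] args) {
        // Assumption: the last checkpoint is at txid 0 and about 106
        // transactions have been applied, as in the failing run.
        long lastApplied = 106;
        long lastCheckpoint = 0;

        // With a small limit (5), the standby checkpoints on its own.
        System.out.println(new CheckpointTrigger(5)
            .shouldCheckpoint(lastApplied, lastCheckpoint));    // true

        // With a limit above the transaction count (1000), it does not.
        System.out.println(new CheckpointTrigger(1000)
            .shouldCheckpoint(lastApplied, lastCheckpoint));    // false
    }
}
```

With a limit of 5 the model checkpoints at 106 transactions, while a limit of 1000 leaves the txid untouched, which is what the assertion at the end of the test expects.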



[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-04 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180154#comment-15180154
 ] 

Kihwal Lee commented on HDFS-9904:
--

The stack trace from the test failure.
{noformat}
java.lang.AssertionError: expected:<0> but was:<106>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testCheckpointCancellationDuringUpload(TestStandbyCheckpoints.java:328)
{noformat}

We could set DFS_NAMENODE_CHECKPOINT_TXNS_KEY differently on the first NN so 
that it does not do a checkpoint when it becomes a standby.
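
A rough sketch of that idea, using {{java.util.Properties}} as a stand-in for the per-NameNode Hadoop configuration so that it runs without a cluster. The class and method names are hypothetical, and the values 5 and 1000 are illustrative assumptions rather than what any patch uses; only the key name {{dfs.namenode.checkpoint.txns}} is the real property behind DFS_NAMENODE_CHECKPOINT_TXNS_KEY.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative model only: Properties stands in for the per-NameNode
// Hadoop Configuration objects; this class is hypothetical.
final class PerNameNodeConfigSketch {
    // Property name behind DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_TXNS_KEY.
    static final String CHECKPOINT_TXNS_KEY = "dfs.namenode.checkpoint.txns";

    /**
     * Builds one configuration per NameNode. The first NameNode (index 0)
     * gets its own limit; all others share the default limit.
     */
    static List<Properties> buildConfigs(int numNameNodes,
                                         long defaultTxns,
                                         long firstNnTxns) {
        List<Properties> configs = new ArrayList<>();
        for (int i = 0; i < numNameNodes; i++) {
            Properties conf = new Properties();
            long txns = (i == 0) ? firstNnTxns : defaultTxns;
            conf.setProperty(CHECKPOINT_TXNS_KEY, Long.toString(txns));
            configs.add(conf);
        }
        return configs;
    }

    public static void main(String[] args) {
        // Assumed values: a low limit for the standbys, and a high limit on
        // the first NN so it does not checkpoint after becoming a standby.
        List<Properties> configs = buildConfigs(3, 5, 1000);
        for (int i = 0; i < configs.size(); i++) {
            System.out.println("NN" + i + ": " + CHECKPOINT_TXNS_KEY + "="
                + configs.get(i).getProperty(CHECKPOINT_TXNS_KEY));
        }
    }
}
```

In the real test the same effect would come from setting the key on the first NameNode's configuration before that NameNode is (re)started.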
