[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-22 Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-3519:

   Resolution: Fixed
Fix Version/s: 2.7.0
   Status: Resolved  (was: Patch Available)

+1 for the branch-2 patch as well.  I have committed this to both trunk and 
branch-2.  Ming, thank you for contributing this patch.

 Checkpoint upload may interfere with a concurrent saveNamespace
 ---

 Key: HDFS-3519
 URL: https://issues.apache.org/jira/browse/HDFS-3519
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Todd Lipcon
Assignee: Ming Ma
Priority: Critical
 Fix For: 2.7.0

 Attachments: HDFS-3519-2.patch, HDFS-3519-3.patch, 
 HDFS-3519-branch-2.patch, HDFS-3519.patch, test-output.txt


 TestStandbyCheckpoints failed in [precommit build 
 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/] 
 due to the following issue:
 - Both nodes were in Standby state and configured to checkpoint as fast as 
 possible.
 - NN1 starts to save its own namespace.
 - NN2 starts to upload a checkpoint for the same txid. Both threads are now 
 writing to the same file, fsimage.ckpt_12, but the actual file contents 
 correspond to the uploading thread's data.
 - NN1 finishes its saveNamespace operation while NN2 is still uploading, so 
 it renames the ckpt file. However, the contents of the file are still empty, 
 since NN2 hasn't sent any bytes yet.
 - NN2 finishes the upload, and its rename() call fails, which causes the 
 storage directory to be marked failed, etc.
 The result is a file fsimage_12 that appears to be a finalized image but is 
 in fact incompletely transferred. When the transfer completes, the problem 
 heals itself, so there wouldn't be persistent corruption unless the machine 
 crashes at the same time; even then, we'd still have the earlier checkpoint 
 to restore from.
 I believe this same race could occur in a non-HA setup if a user puts the NN 
 in safe mode and issues saveNamespace operations concurrently with a 2NN 
 checkpoint.
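
For illustration only, here is a standalone sketch (plain JDK file operations, 
not HDFS code) of the interleaving described above. The fsimage.ckpt_12 / 
fsimage_12 names mirror the NameNode convention; everything else is invented 
for the demo, the sleeps merely force the ordering, and it assumes POSIX 
rename semantics.

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CheckpointRenameRace {
  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("ckpt-race");
    Path ckpt = dir.resolve("fsimage.ckpt_12");   // shared temporary name
    Path finalImg = dir.resolve("fsimage_12");    // finalized name

    // "NN2": opens the ckpt file, then slowly transfers the image bytes.
    Thread uploader = new Thread(() -> {
      try (OutputStream out = Files.newOutputStream(ckpt)) {
        Thread.sleep(200);                        // still transferring...
        out.write("real image bytes".getBytes()); // lands in the already-renamed file
      } catch (IOException | InterruptedException e) {
        System.out.println("uploader write failed: " + e);
      }
      try {
        // The local saveNamespace already renamed the ckpt file away,
        // so this rename fails and the storage directory gets marked failed.
        Files.move(ckpt, finalImg, StandardCopyOption.ATOMIC_MOVE);
      } catch (IOException e) {
        System.out.println("uploader rename failed, as in the bug report: " + e);
      }
    });

    // "NN1": local saveNamespace finishes first and finalizes the (empty) image.
    Thread saver = new Thread(() -> {
      try {
        Thread.sleep(50);                         // finishes while NN2 is mid-upload
        Files.move(ckpt, finalImg, StandardCopyOption.ATOMIC_MOVE);
        System.out.println("fsimage_12 finalized with " + Files.size(finalImg)
            + " bytes (0 bytes means a truncated image)");
      } catch (IOException | InterruptedException e) {
        System.out.println("saver failed: " + e);
      }
    });

    uploader.start();
    saver.start();
    uploader.join();
    saver.join();
    // Once the upload's write completes, the finalized file holds the full
    // bytes again, which is the "heals itself" behavior noted above.
    System.out.println("final fsimage_12 size: " + Files.size(finalImg) + " bytes");
  }
}
{code}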



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-21 Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-3519:

Target Version/s: 2.7.0



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-21 Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-3519:

Hadoop Flags: Reviewed

+1 for the patch.  Ming, thank you for incorporating the feedback.  Could you 
also please provide a branch-2 patch?  There is a small difference in 
{{FSImage}} on branch-2 that prevents me from applying the trunk patch.



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-21 Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-3519:
--
Attachment: HDFS-3519-branch-2.patch

Thanks, Chris. Here is the patch for branch-2.



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-20 Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-3519:
--
Attachment: HDFS-3519-3.patch

Thanks, Chris. Good point. Here is the updated patch with your suggestions.



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-12 Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-3519:
--
Attachment: HDFS-3519-2.patch

Changing the slowness parameter from 2 to 20 causes the test case 
testReadsAllowedDuringCheckpoint to time out, because of the large number of 
edits used in that test. The motivation for adjusting this parameter is to 
make sure both NNs are checkpointing at the same time in the test case 
testBothNodesInStandbyState. That doesn't seem to be an issue in trunk, given 
that the OIV image checkpoint adds extra delay before the image upload. 
Still, we can adjust the value from 2 to 5, just to be safe.
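
For context, here is a minimal sketch of the kind of write throttling this 
slowness parameter controls. The real SlowCodec in TestStandbyCheckpoints 
wraps a compression codec and its exact mechanism may differ, so treat this 
purely as an illustration of the trade-off: too small a delay and the first 
checkpoint can finish before the second starts, too large and edit-heavy 
tests time out.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Illustrative stand-in for a "slow" stream; not the actual test code. */
public class ThrottledOutputStream extends FilterOutputStream {
  private final long delayMs;   // the "slowness parameter" under discussion

  public ThrottledOutputStream(OutputStream out, long delayMs) {
    super(out);
    this.delayMs = delayMs;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    try {
      // 2 ms may not keep both NNs in the upload window, 20 ms times out
      // the edit-heavy test, 5 ms splits the difference.
      Thread.sleep(delayMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException(e);
    }
    out.write(b, off, len);
  }

  public static void main(String[] args) throws IOException {
    try (OutputStream out = new ThrottledOutputStream(new ByteArrayOutputStream(), 5)) {
      out.write(new byte[1024]);   // each buffered write now pauses ~5 ms
    }
  }
}
{code}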



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-09 Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-3519:
--
Affects Version/s: (was: 2.0.0-alpha)
   (was: 1.0.3)
   Status: Patch Available  (was: Open)



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-09 Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-3519:
--
Attachment: HDFS-3519.patch

To follow up on this, https://issues.apache.org/jira/browse/HDFS-4811 
discussed adding a timestamp to the checkpoint file name to manage the 
concurrency. Alternatively, we can move the code in {{ImageServlet}} that 
handles duplicate download requests into {{FSImage}} so that it also covers 
the regular checkpoint path. Any thoughts on this?

In addition, {{currentlyDownloadingCheckpoints}} is declared static. That is 
problematic in unit tests, where two NameNodes running in the same JVM end up 
sharing the same list. It doesn't need to be static, since {{ImageServlet}} 
gets the global FSImage object from {{NameNodeHttpServer}}.
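
To make that concrete, here is a rough sketch of the direction I have in 
mind: per-FSImage-instance bookkeeping instead of a static set in 
{{ImageServlet}}. The class and method names below are made up for 
illustration and are not the actual patch.

{code:java}
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; the real change would live inside FSImage. */
public class CheckpointCoordinator {
  // Per-instance state, so two NameNodes in the same test JVM no longer share it.
  private final Set<Long> checkpointsInProgress = new HashSet<>();

  /** Returns false if another writer (local save or HTTP upload) owns this txid. */
  public synchronized boolean tryStartCheckpoint(long txid) {
    return checkpointsInProgress.add(txid);
  }

  public synchronized void finishCheckpoint(long txid) {
    checkpointsInProgress.remove(txid);
  }
}
{code}

In this sketch, both the local saveNamespace path and the servlet upload path 
would call tryStartCheckpoint(txid) before writing fsimage.ckpt_<txid>, so 
the loser backs off instead of racing on the rename.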

Another thing about the test case: it is possible that the first standby NN 
checked by the test code finishes its checkpoint and upload before the second 
standby NN even kicks off its own checkpoint, which doesn't appear to be the 
intention of the test. We can increase the {{SlowCodec}} sleep interval to 
address that.

Chris, sorry, I didn't check the JIRA owner until after I had written up the 
patch.



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2015-01-09 Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-3519:

Target Version/s:   (was: )
Assignee: Ming Ma  (was: Chris Nauroth)

[~mingma], I'm not actively working on this, so thank you for posting a patch.  
I have reassigned it to you.



[jira] [Updated] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

2012-06-08 Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-3519:
--

Attachment: test-output.txt

Here's the test output from the above-referenced Jenkins build.
