[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Status: Patch Available  (was: Open)

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch


 When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
 mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
 NN does not apply these values to the in-memory FS data structure. Because of 
 this, a file's mtime or atime may appear to go back in time after an NN 
 restart or an HA failover.
 Most of the time this will be harmless and folks won't notice, but in the 
 event one of these files is being used in the distributed cache of an MR job 
 when an HA failover occurs, the job might notice that the mtime of a cache 
 file has changed, which in MR2 will cause the job to fail with an exception 
 like the following:
 {noformat}
 java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}
 Credit to Sujay Rau for discovering this issue.
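
 For illustration only, the kind of check that produces the exception above 
 can be sketched as follows. This is not the actual FSDownload code; the 
 class name CacheTimestampCheckSketch and the method verifyUnchanged are 
 hypothetical. The sketch assumes the job records a file's mtime at 
 submission time and compares it with the mtime reported when the file is 
 later localized.

```java
import java.io.IOException;

/** Hypothetical sketch of a distributed-cache timestamp check; not Hadoop code. */
final class CacheTimestampCheckSketch {

    /**
     * Fails if the mtime seen now differs from the mtime recorded earlier.
     * expectedMtime: value recorded when the job was submitted (assumption).
     * actualMtime: value the file system reports at localization time.
     */
    static void verifyUnchanged(String resource, long expectedMtime, long actualMtime)
            throws IOException {
        if (expectedMtime != actualMtime) {
            throw new IOException("Resource " + resource
                    + " changed on src filesystem (expected " + expectedMtime
                    + ", was " + actualMtime + ")");
        }
    }

    public static void main(String[] args) {
        String jar = "hdfs://ha-nn-uri/example.jar"; // illustrative path
        try {
            // Same mtime before and after: the check passes silently.
            verifyUnchanged(jar, 1342137814599L, 1342137814599L);
            // mtime "went back in time" after a failover: the check fails.
            verifyUnchanged(jar, 1342137814599L, 1342137814473L);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

 An exact-equality check like this is why even a small backward jump in 
 mtime is enough to fail a job, although the file contents never changed.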

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Attachment: HDFS-3864.patch

Here's a patch which addresses the issue. Fortunately, the fix is quite 
simple: just apply the values that we read in from the edit log.
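
As a rough illustration of the idea (not the patch itself), replaying a close 
operation might look like the sketch below. All types here (FileNode, CloseOp, 
the namespace map) are hypothetical stand-ins rather than Hadoop classes, and 
the starting timestamp is assumed to be the one set when the file was created.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of edit-log replay; none of these types are Hadoop classes. */
final class EditLogReplaySketch {

    /** Hypothetical stand-in for an in-memory file entry. */
    static final class FileNode {
        long mtime;
        long atime;
        boolean underConstruction = true;
    }

    /** Hypothetical stand-in for a logged OP_CLOSE record. */
    static final class CloseOp {
        final String path;
        final long mtime;
        final long atime;

        CloseOp(String path, long mtime, long atime) {
            this.path = path;
            this.mtime = mtime;
            this.atime = atime;
        }
    }

    /** Replays one close op against the in-memory namespace. */
    static void applyClose(Map<String, FileNode> namespace, CloseOp op) {
        FileNode node = namespace.get(op.path);
        if (node == null) {
            throw new IllegalStateException("No such file: " + op.path);
        }
        // Skipping these two assignments reproduces the behavior described in
        // the issue: the node keeps the older timestamps after replay.
        node.mtime = op.mtime;
        node.atime = op.atime;
        node.underConstruction = false;
    }

    public static void main(String[] args) {
        Map<String, FileNode> namespace = new HashMap<>();
        FileNode node = new FileNode();
        node.mtime = 1342137814473L; // assumed creation-time value
        node.atime = 1342137814473L;
        namespace.put("/user/jenkins/example.jar", node);

        applyClose(namespace,
                new CloseOp("/user/jenkins/example.jar", 1342137814599L, 1342137814599L));
        System.out.println("mtime after replay: " + node.mtime);
    }
}
```

With the timestamps applied during replay, a restarted or newly active NN 
reports the same mtime that the previously active NN handed out at close time.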

In addition to the automated test provided in the patch, I also tested this 
manually on an HA cluster and confirmed that MR jobs no longer experience the 
"distributed cache object changed" errors which caused this issue to be 
discovered.



[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Attachment: HDFS-3864.patch

Thanks a lot for the quick review, Todd.

Here's an updated patch which lowers the sleep time to 10 milliseconds.

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch, HDFS-3864.patch




[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

   Resolution: Fixed
Fix Version/s: 2.2.0-alpha
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2. Thanks a lot for the review, 
Todd.

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3864.patch, HDFS-3864.patch

