[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

   Resolution: Fixed
Fix Version/s: 2.2.0-alpha
 Hadoop Flags: Reviewed
       Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2. Thanks a lot for the review, 
Todd.

> NN does not update internal file mtime for OP_CLOSE when reading from the 
> edit log
> --
>
>                 Key: HDFS-3864
>                 URL: https://issues.apache.org/jira/browse/HDFS-3864
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>             Fix For: 2.2.0-alpha
>
>         Attachments: HDFS-3864.patch, HDFS-3864.patch
>
>
> When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
> mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
> NN does not apply these values to the in-memory FS data structure. Because of 
> this, a file's mtime or atime may appear to go back in time after an NN 
> restart, or an HA failover.
> Most of the time this will be harmless and folks won't notice, but in the 
> event one of these files is being used in the distributed cache of an MR job 
> when an HA failover occurs, the job might notice that the mtime of a cache 
> file has changed, which in MR2 will cause the job to fail with an exception 
> like the following:
> {noformat}
> java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
>   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
>   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> {noformat}
> Credit to Sujay Rau for discovering this issue.
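
The failure mode in the description can be pictured with a small Java sketch. This is only an illustration of a timestamp equality check of the kind the exception message implies, not the actual FSDownload source; the class name `CacheTimestampCheckSketch` and the method `verifyUnchanged` are hypothetical. The two timestamps are the ones from the exception quoted above.

```java
import java.io.IOException;

// Hypothetical illustration only; this is not the real FSDownload code.
class CacheTimestampCheckSketch {

    /**
     * Compares the mtime recorded when the job was submitted with the mtime
     * the filesystem reports now, and fails if they differ in either direction.
     */
    static void verifyUnchanged(String resource, long expectedMtime, long actualMtime)
            throws IOException {
        if (actualMtime != expectedMtime) {
            throw new IOException("Resource " + resource
                + " changed on src filesystem (expected " + expectedMtime
                + ", was " + actualMtime + ")");
        }
    }

    public static void main(String[] args) {
        try {
            // After a failover the NN reports an older mtime than the one
            // the job recorded, so the equality check fails.
            verifyUnchanged("hdfs://ha-nn-uri/example.jar",
                1342137814599L, 1342137814473L);
            System.out.println("unchanged");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the comparison is for exact equality, an mtime that moves backwards by even a few milliseconds is enough to fail the job, even though the file contents are the same.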

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Attachment: HDFS-3864.patch

Thanks a lot for the quick review, Todd.

Here's an updated patch which lowers the sleep time to 10 milliseconds.




[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Status: Patch Available  (was: Open)




[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Attachment: HDFS-3864.patch

Here's a patch which addresses the issue. Fortunately, the fix is quite simple: 
just apply the values that we read in from the edit log.

In addition to the automated test provided in the patch, I also tested this 
manually on an HA cluster and confirmed that MR jobs no longer experience the 
"distributed cache object changed" errors which caused this issue to be 
discovered.
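
For readers without the patch at hand, the idea can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not the patch itself: `EditLogReplaySketch`, `FileRecord`, `CloseOp` and `applyClose` are hypothetical stand-ins for the NN's in-memory structures and edit log replay path, and the timestamps are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the real HDFS classes.
class EditLogReplaySketch {

    /** In-memory record for one file, as a NN-like process might hold it. */
    static final class FileRecord {
        long mtime;
        long atime;
        boolean underConstruction = true;

        FileRecord(long mtime, long atime) {
            this.mtime = mtime;
            this.atime = atime;
        }
    }

    /** A close operation as read back from the edit log. */
    static final class CloseOp {
        final String path;
        final long mtime;
        final long atime;

        CloseOp(String path, long mtime, long atime) {
            this.path = path;
            this.mtime = mtime;
            this.atime = atime;
        }
    }

    private final Map<String, FileRecord> namespace = new HashMap<>();

    /** Registers a new file whose mtime and atime start at creation time. */
    void create(String path, long now) {
        namespace.put(path, new FileRecord(now, now));
    }

    /** Replays a close op against the in-memory namespace. */
    void applyClose(CloseOp op) {
        FileRecord file = namespace.get(op.path);
        if (file == null) {
            throw new IllegalStateException("No such file: " + op.path);
        }
        file.underConstruction = false;
        // The point of the sketch: copy the logged timestamps into memory.
        // Without these two assignments the file keeps its creation-time values.
        file.mtime = op.mtime;
        file.atime = op.atime;
    }

    FileRecord get(String path) {
        return namespace.get(path);
    }

    public static void main(String[] args) {
        EditLogReplaySketch nn = new EditLogReplaySketch();
        nn.create("/user/example/lib.jar", 1000L);
        nn.applyClose(new CloseOp("/user/example/lib.jar", 1126L, 1126L));
        System.out.println(nn.get("/user/example/lib.jar").mtime);
    }
}
```

In this model, a process that replays the log ends up with the same timestamps as the process that wrote it, which is the property the distributed cache check depends on.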

