[jira] [Comment Edited] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

Aaron T. Myers (JIRA) Tue, 28 Aug 2012 15:23:08 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443586#comment-13443586
 ]


Aaron T. Myers edited comment on HDFS-3864 at 8/29/12 9:21 AM:
---------------------------------------------------------------

Here's a patch which addresses the issue. Fortunately, the fix is quite simple: 
just apply the mtime and atime values that we read in from the edit log.

In addition to the automated test provided in the patch, I also tested this 
manually on an HA cluster and confirmed that MR jobs no longer experience the 
"distributed cache object changed" errors which caused this issue to be 
discovered.
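
For readers without the patch at hand, the idea can be sketched roughly as 
follows. This is not the patch itself: FileRecord, CloseOp, applyClose and the 
example path are hypothetical stand-ins rather than real HDFS classes, and the 
sketch assumes the stale value is the time recorded before the close.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the real HDFS classes.
final class EditLogReplaySketch {

    /** Simplified in-memory file record (stand-in for the NN's in-memory state). */
    static final class FileRecord {
        long mtime;
        long atime;
        boolean underConstruction = true;
    }

    /** Simplified OP_CLOSE record as it might be read back from the edit log. */
    static final class CloseOp {
        final String path;
        final long mtime;
        final long atime;

        CloseOp(String path, long mtime, long atime) {
            this.path = path;
            this.mtime = mtime;
            this.atime = atime;
        }
    }

    /** Replays an OP_CLOSE against the in-memory namespace. */
    static void applyClose(Map<String, FileRecord> namespace, CloseOp op) {
        FileRecord file = namespace.get(op.path);
        if (file == null) {
            throw new IllegalStateException("OP_CLOSE for unknown file: " + op.path);
        }
        // The step the issue describes as missing: copy the logged times
        // into the in-memory record instead of keeping the stale ones.
        file.mtime = op.mtime;
        file.atime = op.atime;
        file.underConstruction = false;
    }

    public static void main(String[] args) {
        Map<String, FileRecord> namespace = new HashMap<>();
        FileRecord file = new FileRecord();
        file.mtime = 1342137814473L; // assumed stale value from before the close
        file.atime = 1342137814473L;
        namespace.put("/user/example/lib.jar", file); // hypothetical path

        applyClose(namespace, new CloseOp("/user/example/lib.jar",
                1342137814599L, 1342137814599L));
        System.out.println("mtime after replay: " + file.mtime);
    }
}
```

With the times applied during replay, a restarted or newly active NN reports 
the same mtime that was written to the edit log at close time.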
                
      was (Author: atm):
    Here's a patch which addresses the issue. Fortunately, the fix is quite 
simply - just apply the values that we read in from the edit log.

In addition to the automated test provided in the patch, I also tested this 
manually on an HA cluster and confirmed that MR jobs no longer experience the 
:distributed cache object changed" errors which caused this issue to be 
discovered.
                  
> NN does not update internal file mtime for OP_CLOSE when reading from the 
> edit log
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-3864
>                 URL: https://issues.apache.org/jira/browse/HDFS-3864
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>         Attachments: HDFS-3864.patch, HDFS-3864.patch
>
>
> When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
> mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
> NN does not apply these values to the in-memory FS data structure. Because of 
> this, a file's mtime or atime may appear to go back in time after an NN 
> restart or an HA failover.
> Most of the time this will be harmless and folks won't notice, but in the 
> event one of these files is being used in the distributed cache of an MR job 
> when an HA failover occurs, the job might notice that the mtime of a cache 
> file has changed, which in MR2 will cause the job to fail with an exception 
> like the following:
> {noformat}
> java.io.IOException: Resource hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar changed on src filesystem (expected 1342137814599, was 1342137814473
>       at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
>       at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
>       at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
>       at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
>       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> {noformat}
> Credit to Sujay Rau for discovering this issue.
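
The failure mode in the quoted description can be illustrated with a small 
sketch of the kind of check the exception message implies: a timestamp 
recorded earlier is compared against the file's current mtime. This is not the 
FSDownload source; CacheTimestampCheckSketch and verifyUnchanged are 
hypothetical names, and only the message text and the two timestamps are taken 
from the trace above.

```java
import java.io.IOException;

// Hypothetical illustration; not the actual org.apache.hadoop.yarn.util.FSDownload code.
final class CacheTimestampCheckSketch {

    /**
     * Fails if the file's current mtime differs from the mtime that was
     * recorded earlier (for example, at job submission).
     */
    static void verifyUnchanged(String resource, long expectedMtime, long currentMtime)
            throws IOException {
        if (currentMtime != expectedMtime) {
            // Message shape mirrors the exception quoted in the issue.
            throw new IOException("Resource " + resource
                    + " changed on src filesystem (expected " + expectedMtime
                    + ", was " + currentMtime);
        }
    }

    public static void main(String[] args) {
        String resource = "hdfs://ha-nn-uri/user/example/lib.jar"; // hypothetical path
        long recordedBeforeFailover = 1342137814599L; // mtime written at close
        long reportedAfterFailover = 1342137814473L;  // stale mtime after replay
        try {
            verifyUnchanged(resource, recordedBeforeFailover, reportedAfterFailover);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the comparison is an exact equality, even the 126 ms difference 
between the two values in the trace is enough to fail the job.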

