[ https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621128#comment-13621128 ]
Omkar Vinit Joshi commented on YARN-99:
---------------------------------------

Rebasing the patch now that YARN-467 is committed. This issue is related to YARN-467; detailed information can be found here: [underlying problem and proposed/implemented solution | https://issues.apache.org/jira/browse/YARN-467?focusedCommentId=13615894&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13615894]

The only difference here is that the same problem is present in <local-dir>/usercache/<user-name>/filecache (the private user cache). We use LocalCacheDirectoryManager for the user cache but not for the app cache, since it is highly unlikely for a single application to localize that many files. The earlier private-cache implementation computed the localized path inside ContainerLocalizer, i.e. in separate processes. To centralize this, the path computation has been moved to ResourceLocalizationService.LocalizerRunner, and the result is communicated to each ContainerLocalizer as part of the heartbeat. The local cache directories can therefore be managed in one place.

> Jobs fail during resource localization when private distributed-cache hits
> unix directory limits
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-99
>                 URL: https://issues.apache.org/jira/browse/YARN-99
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.0.0-alpha
>            Reporter: Devaraj K
>            Assignee: Omkar Vinit Joshi
>         Attachments: yarn-99-20130324.patch
>
>
> If we have multiple jobs that use the distributed cache with many small
> files, the per-directory entry limit is reached before the cache-size limit,
> and no further directories can be created in the file cache. The jobs then
> start failing with the exception below.
> {code:xml}
> java.io.IOException: mkdir of /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed
>     at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
>     at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
>     at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>     at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
>     at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
>     at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
>     at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
>     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
>     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
> We should have a mechanism to clean the cache files once they cross a
> specified number of directories, just as we do for cache size.
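The fix described in the comment relies on spreading localized resources across a bounded directory hierarchy instead of one flat directory, so no single directory ever exceeds the filesystem's entry limit (roughly 32,000 sub-directories on ext3). Below is a minimal sketch of that idea. The class name `CacheDirAllocator`, its per-directory limit parameter, and the allocation strategy are all illustrative assumptions for this sketch, not Hadoop's actual `LocalCacheDirectoryManager` API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a hierarchical local-cache directory allocator, similar in
// spirit to YARN's LocalCacheDirectoryManager. All names and the numbering
// scheme here are hypothetical, not the real NodeManager implementation.
public class CacheDirAllocator {
    // Maximum entries to place in any one directory before opening a new one.
    private final int perDirLimit;
    // Directories that still have room, oldest first.
    private final Deque<Directory> nonFull = new ArrayDeque<>();
    // Global counter used to name newly created sub-directories.
    private int nextSubDir = 0;

    private static final class Directory {
        final String path;   // relative path; "" denotes the cache root
        int count = 0;       // entries already allocated in this directory
        Directory(String path) { this.path = path; }
    }

    public CacheDirAllocator(int perDirLimit) {
        this.perDirLimit = perDirLimit;
        nonFull.add(new Directory(""));  // start allocating at the cache root
    }

    /** Returns the relative directory for the next localized resource. */
    public String getRelativePath() {
        Directory d = nonFull.peek();
        d.count++;
        if (d.count >= perDirLimit) {
            // Directory is full: retire it and open a deeper sub-directory
            // so future allocations never overflow a single directory.
            nonFull.poll();
            String child = d.path.isEmpty()
                ? Integer.toString(nextSubDir++)
                : d.path + "/" + nextSubDir++;
            nonFull.add(new Directory(child));
        }
        return d.path;
    }
}
```

Because the allocator is the single source of relative paths, running it only inside ResourceLocalizationService.LocalizerRunner (and shipping the chosen path to each ContainerLocalizer over the heartbeat, as the comment describes) keeps the accounting consistent across all localizer processes.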