[jira] [Commented] (YARN-527) Local filecache mkdir fails

2013-04-04 Thread Knut O. Hellan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621878#comment-13621878 ]

Knut O. Hellan commented on YARN-527:
-

Yes, this is a duplicate of YARN-467, so you may close it. We will add cron jobs 
to delete old directories as a temporary workaround until we can test 
2.0.5-beta. Thanks!

 Local filecache mkdir fails
 ---------------------------

 Key: YARN-527
 URL: https://issues.apache.org/jira/browse/YARN-527
 Project: Hadoop YARN
 Issue Type: Bug
 Components: nodemanager
 Affects Versions: 2.0.0-alpha
 Environment: RHEL 6.3 with CDH4.1.3 Hadoop, HA with two name nodes and six worker nodes.
 Reporter: Knut O. Hellan
 Priority: Minor
 Attachments: yarn-site.xml


 Jobs failed with no other explanation than this stack trace:
 2013-03-29 16:46:02,671 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1364591875320_0017_m_00_0:
 java.io.IOException: mkdir of /disk3/yarn/local/filecache/-4230789355400878397 failed
 at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:932)
 at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
 at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
 at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333)
 at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Manually creating the directory worked. This behavior was common to at least several nodes in the cluster.
 The situation was resolved by removing and recreating all /disk?/yarn/local/filecache directories on all nodes.
 It is unclear whether Yarn struggled with the number of files or if there were corrupt files in the caches. The situation was triggered by a node dying.


[jira] [Commented] (YARN-527) Local filecache mkdir fails

2013-04-03 Thread Knut O. Hellan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13620711#comment-13620711 ]

Knut O. Hellan commented on YARN-527:
-

There is really no difference in how the directories are created. What probably 
happened under the hood is that the file system reached the maximum number of 
files in the filecache directory. That maximum is 32000 since we use ext3. 
I don't have the exact numbers for any of the disks from my checks, but I 
remember seeing above 30k in some places. The reason we were able to manually 
create directories might be that some automatic cleanup was happening. 
Does YARN clean the file cache?
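
For reference, a minimal sketch of the kind of check that would confirm this, counting the entries under each local filecache directory against the ext3 limit (the /disk1../disk6 numbering is just the example layout from this report):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Count entries per local filecache directory and compare against ext3's
// ~32000 links-per-directory limit. Paths and disk numbering are examples
// taken from this report, not a fixed YARN layout.
public class FilecacheEntryCount {
    public static void main(String[] args) throws IOException {
        final long ext3Limit = 32000;
        for (int disk = 1; disk <= 6; disk++) {
            Path cache = Paths.get("/disk" + disk + "/yarn/local/filecache");
            if (!Files.isDirectory(cache)) {
                continue; // skip disks that do not exist on this node
            }
            try (Stream<Path> entries = Files.list(cache)) {
                System.out.printf("%s: %d entries (ext3 limit ~%d)%n",
                        cache, entries.count(), ext3Limit);
            }
        }
    }
}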


[jira] [Commented] (YARN-527) Local filecache mkdir fails

2013-04-03 Thread Vinod Kumar Vavilapalli (JIRA)

[ https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621042#comment-13621042 ]

Vinod Kumar Vavilapalli commented on YARN-527:
--

If it is the 32K limit that caused it, the timing can't be more perfect. I just 
committed YARN-467, which addresses it for the public cache, and YARN-99, which 
takes care of the private cache, is in progress. These two JIRAs enforce a limit 
in YARN itself; the default is 8192.
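
For reference, once you are on a release with YARN-467, that cap should be tunable from yarn-site.xml. A sketch of what that could look like (the property name below is from memory and should be checked against the yarn-default.xml shipped with the release before relying on it):

<!-- Assumed property name from YARN-467; verify against the release's
     yarn-default.xml before relying on it. -->
<property>
  <name>yarn.nodemanager.local-cache.max-files-per-directory</name>
  <!-- Maximum number of entries per local cache directory; 8192 is the default. -->
  <value>8192</value>
</property>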

Looking back again at your stack trace, I agree that it is very likely you are 
hitting the 32K limit.

Can I close this as a duplicate of YARN-467? You can verify the fix on 
2.0.5-beta when it is out.


[jira] [Commented] (YARN-527) Local filecache mkdir fails

2013-04-02 Thread Knut O. Hellan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619797#comment-13619797 ]

Knut O. Hellan commented on YARN-527:
-

Digging through the code, it looks to me like the native Java File.mkdirs is 
used to actually create the directory, and it will not give any information 
about why it failed. If that is the case, then I guess this issue is really a 
feature request that YARN should be better at cleaning up old file caches so 
that this situation does not happen.
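
To make that concrete, here is a small sketch of the behaviour (the extra probes in the exception message are only a suggestion of what a more helpful error could report, not what the NodeManager does today):

import java.io.File;
import java.io.IOException;

// File.mkdirs() only reports success or failure as a boolean, so the caller
// has to reconstruct a reason itself. The parent-directory probes below are
// a hypothetical illustration, not the current NodeManager behaviour.
public class MkdirsDiagnostics {
    public static void main(String[] args) throws IOException {
        File dir = new File("/disk3/yarn/local/filecache/example-entry"); // example path
        if (!dir.mkdirs()) {
            File parent = dir.getParentFile();
            String[] siblings = (parent != null) ? parent.list() : null;
            throw new IOException("mkdir of " + dir + " failed"
                    + " (parent exists=" + (parent != null && parent.exists())
                    + ", parent writable=" + (parent != null && parent.canWrite())
                    + ", entries in parent=" + (siblings != null ? siblings.length : -1) + ")");
        }
        System.out.println("created " + dir);
    }
}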


[jira] [Commented] (YARN-527) Local filecache mkdir fails

2013-04-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13620239#comment-13620239 ]

Vinod Kumar Vavilapalli commented on YARN-527:
--

Is there any difference between how the NodeManager tried to create the dir and 
how you created it manually? For example, the user running the NM versus the 
user who created the dir by hand? Can you reproduce this? If we can find out 
exactly why the NM couldn't create it automatically, then we can do something 
about it.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira