Spilled shuffle files not being cleared

2014-06-09 Thread Michael Chang
Hi all,

I'm seeing exceptions like the one below in Spark 0.9.1.  It looks
like I'm running out of inodes on my machines (I have around 300k each in a
12-machine cluster).  I took a quick look and I'm seeing some shuffle spill
files that are still around even 12 minutes after they are created.  Can
someone help me understand when these shuffle spill files should be cleaned
up?  (Is it as soon as they are used?)

Thanks,
Michael


java.io.FileNotFoundException: /mnt/var/hadoop/1/yarn/local/usercache/ubuntu/appcache/application_1399886706975_13107/spark-local-20140609210947-19e1/1c/shuffle_41637_3_0 (No space left on device)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:118)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
14/06/09 22:07:36 WARN TaskSetManager: Lost TID 667432 (task 86909.0:7)
14/06/09 22:07:36 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
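
A "No space left on device" error with free disk space remaining is consistent with inode exhaustion, which can be confirmed from the shell. A quick check (the `find` path is illustrative only; substitute the spark-local directory from your own stack trace):

```shell
# Inode usage per filesystem: an IUse% of 100% while bytes are still free
# confirms the device is out of inodes, not out of space.
df -i

# Count lingering shuffle spill files older than ~12 minutes (illustrative
# path; substitute your own YARN local dir from the stack trace):
# find /mnt/var/hadoop/*/yarn/local -name 'shuffle_*' -mmin +12 | wc -l
```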


Re: Spilled shuffle files not being cleared

2014-06-12 Thread Michael Chang
Bump


On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang wrote:


RE: Spilled shuffle files not being cleared

2014-06-12 Thread Shao, Saisai
Hi Michael,

I think you can set spark.cleaner.ttl=xxx to enable the time-based metadata
cleaner, which will clean up old, unused shuffle data once it times out.

For Spark 1.0, another way is to clean shuffle data via weak references
(reference-tracking based; the configuration is spark.cleaner.referenceTracking),
and it is enabled by default.

Thanks
Saisai
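
For reference, one way to pass this setting in Spark 0.9.x is through JVM system properties; a minimal sketch (the one-hour value and the env-var route are illustrative, and where you set it depends on your deploy mode):

```shell
# spark.cleaner.ttl is in seconds; 3600 makes metadata and shuffle data
# older than one hour eligible for cleanup. Hedged sketch: Spark 0.9
# reads -D system properties from the JVM options at startup.
export SPARK_JAVA_OPTS="-Dspark.cleaner.ttl=3600"
```

Setting it in SparkConf on the driver (`set("spark.cleaner.ttl", "3600")`) should work as well.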

From: Michael Chang [mailto:m...@tellapart.com]
Sent: Friday, June 13, 2014 10:15 AM
To: user@spark.apache.org
Subject: Re: Spilled shuffle files not being cleared

Bump

On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang wrote:



Re: Spilled shuffle files not being cleared

2014-06-13 Thread Michael Chang
Thanks Saisai, I think I will just try lowering my spark.cleaner.ttl value;
I've set it to an hour.


On Thu, Jun 12, 2014 at 7:32 PM, Shao, Saisai wrote: