Hi,

I am using Spark 1.1.0 on a cluster. My job takes 30 files in a directory as
input (I am using sc.textFile("dir/*") to read them in). I am getting the
following warning:

WARN TaskSetManager: Lost task 99.0 in stage 1.0 (TID 99,
mesos12-dev.sccps.net): java.io.FileNotFoundException:
/tmp/spark-local-20140925215712-0319/12/shuffle_0_99_93138 (Too many open
files)

Basically, I think the problem is that a lot of shuffle files are being created.

1) The tasks eventually fail and the job just hangs (after running for a very
long time, more than an hour). If I instead read these 30 files in a for loop,
the same job completes in a few minutes; however, I then have to specify the
file names explicitly, which is not convenient. I am assuming that
sc.textFile("dir/*") creates one large RDD covering all 30 files. Is there a
way to make the operation on this large RDD efficient, so that it avoids
creating so many shuffle files? (A rough sketch of the two approaches is below.)
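
To make the question concrete, here is roughly what the two approaches look
like (a sketch only: the reduceByKey is just a stand-in for whatever shuffle my
real job triggers, and the file names are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey
import org.apache.spark.rdd.RDD

object ShuffleFilesQuestion {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-files-question"))

    // Approach A: one big RDD over the whole directory (what I do now).
    // textFile("dir/*") gives one partition per file split, and the shuffle
    // that follows seems to be where the huge number of intermediate files
    // comes from.
    val all: RDD[String] = sc.textFile("dir/*")
    all.map(line => (line, 1)).reduceByKey(_ + _).count()

    // Approach B: read the files one at a time in a loop (this finishes in
    // minutes for me, but I have to list every file name explicitly).
    val fileNames = (1 to 30).map(i => s"dir/part-$i.txt")  // placeholder names
    fileNames.foreach { f =>
      sc.textFile(f).map(line => (line, 1)).reduceByKey(_ + _).count()
    }

    sc.stop()
  }
}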


2) Also, I am finding that the shuffle files from my other, already completed
jobs are not automatically deleted, even after days. I thought that sc.stop()
cleared the intermediate files. Is there some way to programmatically delete
these temporary shuffle files when a job completes? (A sketch of what I mean
is below.)
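
By "programmatically" I mean something along these lines (again just a sketch;
the spark.cleaner.ttl setting is a guess on my part, and I do not know whether
it covers the shuffle files under /tmp/spark-local-*):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // for reduceByKey

object CleanupQuestion {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cleanup-question")
      // Guess: ask Spark to periodically clean old metadata/data; I am not
      // sure this removes the local shuffle files left on disk.
      .set("spark.cleaner.ttl", "3600")

    val sc = new SparkContext(conf)
    try {
      sc.textFile("dir/*").map(line => (line, 1)).reduceByKey(_ + _).count()
    } finally {
      // I expected this to clear the intermediate files, but the
      // spark-local-* directories are still there days later.
      sc.stop()
    }
  }
}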


Thanks