Hi, I am using Spark 1.1.0 on a cluster. My job takes as input 30 files in a directory; I am using sc.textFile("dir/*") to read them. I am getting the following warning:
WARN TaskSetManager: Lost task 99.0 in stage 1.0 (TID 99, mesos12-dev.sccps.net): java.io.FileNotFoundException: /tmp/spark-local-20140925215712-0319/12/shuffle_0_99_93138 (Too many open files)

Basically, I think a lot of shuffle files are being created.

1) The tasks eventually fail and the job just hangs after running for a very long time (more than an hour). If I instead read these 30 files one by one in a for loop, the same job completes in a few minutes, but then I have to list the file names explicitly, which is not convenient (a sketch of both read patterns is at the end of this message). I am assuming that sc.textFile("dir/*") creates one large RDD over all 30 files. Is there a way to make the operation on this large RDD efficient, so that it avoids creating too many shuffle files?

2) Also, I am finding that the shuffle files from my other, already completed jobs are not deleted automatically, even after days. I thought sc.stop() cleared the intermediate files. Is there some way to programmatically delete these temp shuffle files on job completion?

thanks
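For reference, here is roughly what the two read patterns look like. This is only a sketch: the directory and file names are placeholders, and someJob() stands in for the rest of my pipeline.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("shuffle-files-test")
val sc = new SparkContext(conf)

// Pattern 1: one RDD over the whole directory.
// This is the version that hits "Too many open files" during the shuffle.
val all = sc.textFile("dir/*")
// someJob(all)

// Pattern 2: read each file by name in a loop.
// This finishes in minutes, but I have to list all 30 file names myself.
val fileNames = Seq("dir/part-0001", "dir/part-0002" /* ... */)
for (name <- fileNames) {
  val one = sc.textFile(name)
  // someJob(one)
}

// I expected this to clean up the temp shuffle files, but they stay around.
sc.stop()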