Hi, I am having the same problem reported by Michael. I am trying to open 30 files. ulimit -n shows the limit is 1024, so I am not sure why the program is failing with a "Too many open files" error. The total size of the 30 files is 230 GB. I am running the job on a cluster with 10 nodes, each having 16 GB of RAM. The error appears to happen at the distinct() stage.
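(For reference, this is how I am checking the descriptor limits. I understand the soft and hard limits can differ, and that the executor JVMs may inherit a different value than my interactive shell, so this only shows what my login shell reports.)

```shell
# Soft limit: the value a process actually hits when opening files/sockets.
ulimit -Sn
# Hard limit: the ceiling a non-root user can raise the soft limit to.
ulimit -Hn
```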
Here is my program. In the following code, are all 10 nodes trying to open all of the 30 files, or are the files distributed among the 10 nodes?

    val baseFile = "/mapr/mapr_dir/files_2013apr*"
    val x = sc.textFile(baseFile).map { line =>
      val fields = line.split("\t")
      (fields(11), fields(6))
    }.distinct().countByKey()
    val xrdd = sc.parallelize(x.toSeq)
    xrdd.saveAsTextFile(...)

Instead of using the glob *, I guess I can try using a for loop to read the files one by one if that helps, but I am not sure if there is a more efficient solution. The following is the error transcript:

    Job aborted due to stage failure: Task 1.0:201 failed 4 times, most recent failure:
    Exception failure in TID 902 on host 192.168.13.11:
    java.io.FileNotFoundException: /tmp/spark-local-20140829131200-0bb7/08/shuffle_0_201_999 (Too many open files)
        java.io.FileOutputStream.open(Native Method)
        java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
        org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
        org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
        org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        org.apache.spark.util.collection.AppendOnlyMap$$anon$1.foreach(AppendOnlyMap.scala:159)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
    Driver stacktrace:

--
View this message in context:
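(For clarity, here is what the .distinct().countByKey() step computes, sketched over plain Scala collections rather than RDDs. The sample lines and field contents below are made up; only the field indices 11 and 6 match my job.)

```scala
// Plain-Scala sketch (no Spark) of the job's logic: for each value of
// fields(11), count the number of distinct values of fields(6).
def distinctCountByKey(lines: Seq[String]): Map[String, Long] =
  lines
    .map { line =>
      val fields = line.split("\t")
      (fields(11), fields(6))           // same (key, value) projection as the job
    }
    .distinct                            // one entry per unique (key, value) pair
    .groupBy { case (key, _) => key }    // gather the surviving pairs per key
    .map { case (key, pairs) => (key, pairs.size.toLong) }

// Made-up sample: lines with 12 tab-separated fields.
val sample = Seq(
  Seq("a","b","c","d","e","f","v1","h","i","j","k","user1").mkString("\t"),
  Seq("a","b","c","d","e","f","v1","h","i","j","k","user1").mkString("\t"), // duplicate pair
  Seq("a","b","c","d","e","f","v2","h","i","j","k","user1").mkString("\t")
)
println(distinctCountByKey(sample))     // user1 maps to 2 distinct values
```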
http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-tp1464p13144.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.