Re: "Too many open files" exception on reduceByKey
It turns out that Mesos can override the OS ulimit -n setting, so we have increased the ulimit -n setting for the Mesos slave.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p25019.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: "Too many open files" exception on reduceByKey
You are right, I did find that Mesos overrides this with a smaller number. So we will modify that and try to run again.

Thanks!
Tian

On Thursday, October 8, 2015 4:18 PM, DB Tsai wrote:
> Try to run to see actual ulimit. We found that mesos overrides the ulimit
> which causes the issue.
>
> import sys.process._
> val p = 1 to 100
> val rdd = sc.parallelize(p, 100)
> val a = rdd.map(x=> Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect
>
> [...]
Re: "Too many open files" exception on reduceByKey
Try running the following to see the actual ulimit. We found that Mesos overrides the ulimit, which causes the issue.

    import sys.process._
    val p = 1 to 100
    val rdd = sc.parallelize(p, 100)
    // Run `ulimit -n` in a shell from each of the 100 tasks and collect
    // the limits that the executors actually see.
    val a = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Thu, Oct 8, 2015 at 3:22 PM, Tian Zhang wrote:
> I hit this issue with spark 1.3.0 stateful application (with
> updateStateByKey) function on mesos. It will
> fail after running fine for about 24 hours.
> The error stack trace as below, I checked ulimit -n and we have very large
> numbers set on the machines.
> What else can be wrong?
> [...]
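The one-liner above throws if `ulimit -n` prints `unlimited`, and it returns a flat array that is awkward to read on a large cluster. A minimal sketch of a more forgiving variant follows; the object and method names (`UlimitCheck`, `parseUlimit`, `localUlimit`, `summarize`) are made up for illustration and are not part of Spark or Mesos.

```scala
import scala.sys.process._
import scala.util.Try

// Hypothetical helper, not part of Spark: names are illustrative only.
object UlimitCheck {
  /** Parses the output of `ulimit -n`; None means "unlimited" or unparseable. */
  def parseUlimit(raw: String): Option[Long] =
    Try(raw.trim.toLong).toOption

  /** Runs `ulimit -n` in a child shell of the current JVM. */
  def localUlimit(): Option[Long] =
    Try(Seq("sh", "-c", "ulimit -n").!!).toOption.flatMap(parseUlimit)

  /** Groups (host, limit) samples by limit so that outlier hosts stand out. */
  def summarize(samples: Seq[(String, Option[Long])]): Map[Option[Long], Set[String]] =
    samples.groupBy(_._2).map { case (limit, xs) => limit -> xs.map(_._1).toSet }
}
```

On a cluster, the samples could presumably be gathered with something like `sc.parallelize(1 to 100, 100).map(_ => (java.net.InetAddress.getLocalHost.getHostName, UlimitCheck.localUlimit())).collect()`, assuming a live SparkContext `sc`; that line is untested here.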
Re: "Too many open files" exception on reduceByKey
I hit this issue with a Spark 1.3.0 stateful application (using the updateStateByKey function) on Mesos. It fails after running fine for about 24 hours. The error stack trace is below. I checked ulimit -n and we have very large numbers set on the machines. What else can be wrong?

15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 113727.0 (TID 833758, ip-10-112-10-221.ec2.internal): java.io.FileNotFoundException: /media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index (Too many open files)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
    at org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
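Since the job runs for about 24 hours before failing, it may help to watch the descriptor count from inside the JVM rather than only checking `ulimit -n` in a login shell. The sketch below uses the JDK's `com.sun.management.UnixOperatingSystemMXBean`, which reports the open and maximum file descriptor counts for the current process on Unix-like JVMs. The object name `FdUsage`, the helper `nearLimit`, and the 0.8 threshold are assumptions made for illustration.

```scala
import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

// Hypothetical helper, not part of Spark: names and threshold are illustrative.
object FdUsage {
  /** Returns (open, max) file descriptor counts for this JVM, or None on non-Unix JVMs. */
  def snapshot(): Option[(Long, Long)] =
    ManagementFactory.getOperatingSystemMXBean match {
      case os: UnixOperatingSystemMXBean =>
        Some((os.getOpenFileDescriptorCount, os.getMaxFileDescriptorCount))
      case _ => None
    }

  /** True when the open count has reached the given fraction of the limit. */
  def nearLimit(open: Long, max: Long, threshold: Double = 0.8): Boolean =
    max > 0 && open.toDouble / max >= threshold
}
```

Logging `FdUsage.snapshot()` periodically from the executors would show whether the open count climbs steadily toward the limit or jumps during particular stages, and the reported maximum is the limit that the executor JVM itself is subject to.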
Re: "Too many open files" exception on reduceByKey
Sorry, I also have some follow-up questions.

"In general if a node in your cluster has C assigned cores and you run a job with X reducers then Spark will open C*X files in parallel and start writing."

Some questions came to mind just now:

1) It would be nice to have a brief overview of what these files are being used for.
2) Are these C*X files opened on each machine? Also, is C the total number of cores among all machines in the cluster?

Thanks,

-Matt Cheah

On Tue, Mar 11, 2014 at 4:35 PM, Matthew Cheah wrote:
> Thanks. Just curious, is there a default number of reducers that are used?
>
> -Matt Cheah
>
> [...]
Re: "Too many open files" exception on reduceByKey
Thanks. Just curious, is there a default number of reducers that are used?

-Matt Cheah

On Mon, Mar 10, 2014 at 7:22 PM, Patrick Wendell wrote:
> [...]
> This means you'll have to use fewer reducers (e.g. pass reduceByKey a
> number of reducers) or use fewer cores on each machine.
>
> - Patrick
>
> [...]
Re: "Too many open files" exception on reduceByKey
Hey Matt,

The best way is definitely just to increase the ulimit if possible. Spark more or less assumes that clusters will be able to adjust it.

You might be able to hack around this by decreasing the number of reducers, but this could have some performance implications for your job.

In general, if a node in your cluster has C assigned cores and you run a job with X reducers, then Spark will open C*X files in parallel and start writing. Shuffle consolidation will help decrease the total number of files created, but the number of file handles open at any time doesn't change, so it won't help the ulimit problem.

This means you'll have to use fewer reducers (e.g. pass reduceByKey a number of reducers) or use fewer cores on each machine.

- Patrick

On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah wrote:
> [...]
> My question is, why are so many files being created, and is there a way to
> configure the Spark context to avoid spawning that many files? I am already
> setting spark.shuffle.consolidateFiles to true.
> [...]
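To make the C*X rule of thumb from the message above concrete, here is a minimal arithmetic sketch. The object name `ShuffleFileBudget` and the idea of reserving some descriptors for everything else the process has open (jars, sockets, logs) are assumptions added for illustration; only the C*X product itself comes from the message.

```scala
// Hypothetical helper, not part of Spark: it only does the arithmetic.
object ShuffleFileBudget {
  /** Files opened in parallel on one node: C assigned cores times X reducers. */
  def concurrentFiles(coresPerNode: Int, reducers: Int): Long =
    coresPerNode.toLong * reducers

  /** Largest X that keeps C*X within the limit after setting aside reservedFds. */
  def maxReducers(ulimit: Long, coresPerNode: Int, reservedFds: Long): Long = {
    require(coresPerNode > 0, "coresPerNode must be positive")
    math.max(0L, (ulimit - reservedFds) / coresPerNode)
  }
}
```

With the 1024 limit reported in this thread and, purely as an example, 8 cores per node and 224 descriptors held back, `maxReducers(1024, 8, 224)` gives 100, whereas 200 reducers would need `concurrentFiles(8, 200)` = 1600 handles. The reducer count can then be passed as the second argument to `reduceByKey`, for example `rdd.reduceByKey(_ + _, 100)`.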
"Too many open files" exception on reduceByKey
Hi everyone,

My team (cc'ed in this e-mail) and I are running a Spark reduceByKey operation on a cluster of 10 slaves where I don't have the privileges to set "ulimit -n" to a higher number. I'm running on a cluster where "ulimit -n" returns 1024 on each machine.

When I attempt to run this job with the data originating from a text file, stored in an HDFS cluster running on the same nodes as the Spark cluster, the job crashes with the message, "Too many open files".

My question is, why are so many files being created, and is there a way to configure the Spark context to avoid spawning that many files? I am already setting spark.shuffle.consolidateFiles to true.

I want to repeat: I can't change the maximum number of open file descriptors on the machines. This cluster is not owned by me and the system administrator is responding quite slowly.

Thanks,

-Matt Cheah