Preferably, increase the ulimit on your machines. Spark needs to open a
large number of small files during shuffles, so the number of open file
handles is hard to keep under control.
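
For example (the value below is arbitrary; pick one that suits your
workload), you can check and raise the limit for the current shell as
shown here; going above the hard limit requires the
/etc/security/limits.conf change discussed further down this thread:

    ulimit -Sn          # current soft limit on open files
    ulimit -Hn          # current hard limit
    ulimit -n 100000    # raise the soft limit for this session only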



On Fri, Apr 18, 2014 at 3:59 AM, Ryan Compton <compton.r...@gmail.com>
wrote:

> Btw, I've got System.setProperty("spark.shuffle.consolidate.files",
> "true") and use ext3 (CentOS...)
> On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton <compton.r...@gmail.com> wrote:
>> Does this continue in newer versions? (I'm on 0.8.0 now)
>>
>> When I use .distinct() on moderately large datasets (224GB, 8.5B rows,
>> I'm guessing about 500M are distinct) my jobs fail with:
>>
>> 14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to
>> java.io.FileNotFoundException
>> java.io.FileNotFoundException:
>> /tmp/spark-local-20140417145643-a055/3c/shuffle_1_218_1157 (Too many
>> open files)
>>
>> ulimit -n tells me I can open 32000 files. Here's a plot of lsof on a
>> worker node during a failed .distinct():
>> http://i.imgur.com/wyBHmzz.png . You can see tasks fail when Spark
>> tries to open 32000 files.
>>
>> I never ran into this in 0.7.3. Is there a parameter I can set to tell
>> Spark to use fewer than 32000 files?
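>>
>> One knob worth noting (a sketch with an arbitrary partition count, not
>> a recommendation): distinct() takes a numPartitions argument, and with
>> the hash shuffle each map task writes one file per reduce partition,
>> so fewer reduce partitions means fewer shuffle files:
>>
>>     val distinctRows = rows.distinct(512)  // "rows" stands in for your RDD; 512 is illustrative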
>>
>> On Mon, Mar 24, 2014 at 10:23 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>> Look up setting ulimit, though note the distinction between soft and hard
>>> limits, and that updating your hard limit may require changing
>>> /etc/security/limits.conf and restarting each worker.
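>>>
>>> For example (hypothetical user name and limit values), the entries in
>>> /etc/security/limits.conf would look something like this, followed by
>>> a restart of each worker:
>>>
>>>     sparkuser  soft  nofile  100000
>>>     sparkuser  hard  nofile  100000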
>>>
>>>
>>> On Mon, Mar 24, 2014 at 1:39 AM, Kane <kane.ist...@gmail.com> wrote:
>>>>
>>>> Got a bit further. I think the out-of-memory error was caused by setting
>>>> spark.spill to false. Now I have this error; is there an easy way to
>>>> increase the file limit for Spark, cluster-wide?
>>>>
>>>> java.io.FileNotFoundException:
>>>> /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2 (Too many open files)
>>>>         at java.io.FileOutputStream.openAppend(Native Method)
>>>>         at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
>>>>         at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
>>>>         at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
>>>>         at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
>>>>         at org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
>>>>         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
>>>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>>>>         at org.apache.spark.scheduler.Task.run(Task.scala:53)
>>>>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>>>         at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>>>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>
>>>>
>>>>
>>>
>>>
