Fernando Pereira created SPARK-23029:
----------------------------------------

             Summary: Setting spark.shuffle.file.buffer will make the shuffle fail
                 Key: SPARK-23029
                 URL: https://issues.apache.org/jira/browse/SPARK-23029
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.1
            Reporter: Fernando Pereira


Setting spark.shuffle.file.buffer, even to its default value, makes shuffles fail.
This appears to affect small to medium-sized partitions. Strangely, the error is an OutOfMemoryError, yet the same job works with large partitions (at least 32 MB).
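
If the bare value is parsed as KiB rather than bytes when no unit suffix is given (an assumption on my part; I have not verified this against the config parser), then 32*1024 would mean a 32 MiB buffer instead of the intended 32 KiB default, and the stack trace below shows BypassMergeSortShuffleWriter opening one buffered DiskBlockObjectWriter per reduce partition. Rough arithmetic under that assumption:

{code}
# Sketch only: per-task buffer memory if "32768" is read as KiB (assumption).
buffer_bytes = 32 * 1024 * 1024       # 32*1024 KiB -> 32 MiB per open writer
reduce_partitions = 200               # default spark.sql.shuffle.partitions
per_task = buffer_bytes * reduce_partitions
print(per_task / 2**30, "GiB of write buffers per shuffle map task")  # 6.25 GiB
{code}

That would exhaust a typical executor heap as soon as a task opens its writers, which would match the OutOfMemoryError thrown from BufferedOutputStream.<init> below.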

{code}
pyspark --conf "spark.shuffle.file.buffer=$((32*1024))"
/gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768
version 2.2.1

>>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", mode="overwrite")

[Stage 1:>                                                        (0 + 10) / 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 11)
java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:75)
        at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.<init>(DiskBlockObjectWriter.scala:107)
        at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
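
For comparison, the same repro with an explicit unit suffix on the buffer size ("32k" is the form the documentation uses for the default) should keep the buffer at 32 KiB. Untested sketch:

{code}
pyspark --conf "spark.shuffle.file.buffer=32k"

>>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", mode="overwrite")
{code}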


