If you're using Spark SQL, that configuration setting causes a shuffle if the
number of input partitions going into the write is larger than that setting.
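As a side note on where a shuffle can come from: an explicit repartition before
the write always shuffles, since every row must be reassigned to one of the
target partitions. When repartitioning by a column, the assignment is by key
hash; a rough sketch of that assignment in plain Scala (mirroring the
non-negative-modulo behaviour of Spark's HashPartitioner — an illustrative
stand-in, not Spark's actual code):

```scala
// Sketch: how a hash-based repartition picks an output partition for a key.
// Spark's HashPartitioner takes key.hashCode modulo numPartitions and keeps
// the result non-negative; this stand-in does the same.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod // stay within [0, numPartitions)
}

val sample = Seq("order-1", "order-2", "order-3")
sample.foreach(k => println(s"$k -> partition ${partitionFor(k, 20)}"))
```

A straight read-then-write, by contrast, keeps the input partitioning and needs
no such reassignment.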
Is there anything in the executor logs or the Spark UI DAG that indicates a
shuffle? I don't expect a shuffle if it is a straight write. What's the input
partition size?

On Sat, 21 Dec 2019, 10:24 am Ruijing Li, <liruijin...@gmail.com> wrote:

> Could you explain why shuffle partitions might be a good starting point?
>
> Some more details: when I write the output the first time after the logic
> is complete, I repartition the files to 20 (having set
> spark.sql.shuffle.partitions = 2000) so we don't have too many small
> files. The data is small, about 130MB per file. When Spark reads it back,
> it reads in 40 partitions and tries to output that to the different
> cluster. Unfortunately, during that read-and-write stage, executors drop
> off.
>
> We keep the HDFS block size at 128MB.
>
> On Fri, Dec 20, 2019 at 3:01 PM Chris Teoh <chris.t...@gmail.com> wrote:
>
>> spark.sql.shuffle.partitions might be a start.
>>
>> Is there a difference between the number of partitions when the parquet
>> is read and spark.sql.shuffle.partitions? Is it much higher than
>> spark.sql.shuffle.partitions?
>>
>> On Fri, 20 Dec 2019, 7:34 pm Ruijing Li, <liruijin...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have encountered a strange executor OOM error. I have a data pipeline
>>> using Spark 2.3 and Scala 2.11.12. This pipeline writes the output to
>>> one HDFS location as parquet, then reads the files back in and writes
>>> them to multiple Hadoop clusters (all co-located in the same
>>> datacenter). It should be a very simple task, but executors are being
>>> killed off for exceeding container thresholds. From the logs, they are
>>> exceeding their given memory (using Mesos as the cluster manager).
>>>
>>> The ETL process works perfectly fine with the given resources, doing
>>> joins and adding columns. The output is written successfully the first
>>> time.
>>> *Only when the pipeline at the end reads the output from HDFS and
>>> writes it to different HDFS cluster paths does it fail.* (It does a
>>> spark.read.parquet(source).write.parquet(dest).)
>>>
>>> This doesn't really make sense, and I'm wondering what configurations I
>>> should start looking at.
>>>
>>> --
>>> Cheers,
>>> Ruijing Li
>>>
>>
> --
> Cheers,
> Ruijing Li
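One observation on the numbers in this thread: reading 20 files of ~130MB
against a 128MB block size would naturally yield the 40 input partitions
Ruijing sees, since each file is just over one block and gets split in two. A
back-of-the-envelope check in plain Scala (Spark's actual splitting also
involves spark.sql.files.maxPartitionBytes, which defaults to 128MB, so the
estimate lines up):

```scala
// Back-of-the-envelope: 20 parquet files of ~130MB each, split at a 128MB
// boundary (the HDFS block size here, also the default
// spark.sql.files.maxPartitionBytes), give 2 splits per file and therefore
// 40 read partitions in total.
val numFiles    = 20
val fileSizeMB  = 130.0
val splitSizeMB = 128.0

val splitsPerFile  = math.ceil(fileSizeMB / splitSizeMB).toInt
val readPartitions = numFiles * splitsPerFile

println(s"$splitsPerFile splits per file, $readPartitions partitions in total")
```

Because each file is only slightly over one block, every file produces a 128MB
split plus a tiny ~2MB remainder; writing files just under the block size would
halve the read-side partition count.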