Could you explain why shuffle partitions might be a good starting point?

Some more details: when I write the output the first time, after the logic is
complete, I repartition the data down to 20 partitions (having run with
spark.sql.shuffle.partitions = 2000) so we don't end up with too many small
files. The data is small, about 130MB per file. When Spark reads it back, it
reads it in 40 partitions and tries to write that output to the different
cluster. Unfortunately, during that read-and-write stage the executors drop off.

We keep the HDFS block size at 128MB.
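
For reference, here is a minimal sketch of the two stages (the paths and the
etlResult name are placeholders, not the actual pipeline code):

// Stage 1: after the ETL logic finishes, write the output once to HDFS.
// The joins run with 2000 shuffle partitions; we then repartition down
// to 20 so we don't produce thousands of tiny files (~130MB each).
spark.conf.set("spark.sql.shuffle.partitions", "2000")
etlResult                                        // hypothetical result DataFrame
  .repartition(20)
  .write.mode("overwrite")
  .parquet("hdfs://clusterA/path/output")        // placeholder path

// Stage 2: read the output back and copy it to the other clusters.
// This is the stage where executors get killed for exceeding memory.
val out = spark.read.parquet("hdfs://clusterA/path/output")
// out.rdd.getNumPartitions reports 40 here; my guess is that each ~130MB
// file spans two 128MB blocks, so each file becomes two input splits.
out.write.mode("overwrite")
  .parquet("hdfs://clusterB/path/output")        // one write per destination cluster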

On Fri, Dec 20, 2019 at 3:01 PM Chris Teoh <chris.t...@gmail.com> wrote:

> spark.sql.shuffle.partitions might be a start.
>
> Is there a difference between the number of partitions when the parquet is
> read and spark.sql.shuffle.partitions? Is it much higher than
> spark.sql.shuffle.partitions?
>
> On Fri, 20 Dec 2019, 7:34 pm Ruijing Li, <liruijin...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have encountered a strange executor OOM error. I have a data pipeline
>> using Spark 2.3 and Scala 2.11.12. This pipeline writes the output to one
>> HDFS location as parquet, then reads the files back in and writes them to
>> multiple Hadoop clusters (all co-located in the same datacenter). It should
>> be a very simple task, but executors are being killed off for exceeding
>> container thresholds. From the logs, they are exceeding the given memory
>> (we use Mesos as the cluster manager).
>>
>> The ETL process works perfectly fine with the given resources, doing
>> joins and adding columns. The output is written successfully the first
>> time. *Only when the pipeline at the end reads the output from HDFS and
>> writes it to different HDFS cluster paths does it fail.* (It does a
>> spark.read.parquet(source).write.parquet(dest))
>>
>> This doesn't really make sense and I'm wondering what configurations I
>> should start looking at.
>>
>> --
>> Cheers,
>> Ruijing Li
--
Cheers,
Ruijing Li
