If you're using Spark SQL, that configuration setting causes a shuffle if
the number of input partitions to the write is larger than that configured
value.
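To illustrate what that setting does, a minimal sketch (the session, the
input path, and the "key" column are placeholders, not from the actual
pipeline):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("shuffle-partitions-demo").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "2000")

    val df = spark.read.parquet("/path/to/input")  // placeholder path
    // Any wide operation (groupBy, join, etc.) shuffles, and on Spark 2.3
    // the result comes out with exactly spark.sql.shuffle.partitions partitions.
    val agg = df.groupBy("key").count()
    println(agg.rdd.getNumPartitions)  // prints 2000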
Is there anything in the executor logs or the Spark UI DAG that indicates a
shuffle? I don't expect a shuffle if it is a straight write.
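If the UI is inconvenient, the physical plan shows it too; df below stands
in for the DataFrame being written:

    // Look for Exchange nodes in the physical plan: a straight write has
    // none, while a repartition, aggregation, or join introduces one.
    df.explain()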
Could you explain why shuffle partitions might be a good starting point?
Some more details: when I write the output the first time after the logic
is complete, I repartition the files to 20 (after having
spark.sql.shuffle.partitions = 2000) so we don't have too many small files;
the write pattern is sketched below. Data is small, about
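A sketch of that write, with df and the HDFS path as stand-ins for the real
pipeline:

    // Collapse the 2000 shuffle partitions down to 20 output files.
    df.repartition(20)
      .write
      .mode("overwrite")
      .parquet("hdfs:///path/to/output")  // placeholder path

For what it's worth, repartition(20) adds a full shuffle of its own;
coalesce(20) would avoid that extra shuffle, at the cost of less evenly
sized files.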
spark.sql.shuffle.partitions might be a start.
Is there a difference between the number of partitions when the parquet is
read back and spark.sql.shuffle.partitions? Is it much higher than
spark.sql.shuffle.partitions?
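A quick way to check both numbers (the path and session are placeholders).
Note the read-side count is driven by file sizes and
spark.sql.files.maxPartitionBytes, not by the shuffle setting:

    val readBack = spark.read.parquet("hdfs:///path/to/output")  // placeholder path
    println(readBack.rdd.getNumPartitions)                   // partitions on read
    println(spark.conf.get("spark.sql.shuffle.partitions"))  // shuffle setting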
On Fri, 20 Dec 2019, 7:34 pm Ruijing Li wrote:
> Hi all,
>
> I have encountered a strange executor OOM error.
apparently the "withColumn" issue only apply for hundred or thousand of
calls. This was not the case here (twenty calls)
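For reference, the pattern the article describes, as a sketch; df, the
column x, and the generated names are made up for illustration:

    import org.apache.spark.sql.functions.col

    // Each withColumn call adds a Project node to the logical plan, so
    // hundreds or thousands of calls make plan analysis very expensive.
    val slow = (1 to 20).foldLeft(df)((d, i) => d.withColumn(s"c$i", col("x") * i))

    // A single select builds the same columns with one plan node.
    val newCols = (1 to 20).map(i => (col("x") * i).as(s"c$i"))
    val fast = df.select((col("*") +: newCols): _*)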
On Fri, Dec 20, 2019 at 08:53:16AM +0100, Enrico Minack wrote:
> The issue is explained in depth here:
> https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
>
Thank you very much for your help and your inputs.
I understood some things, but I finally understood my issue.
In this case my main issue was a virtualization problem: my VMs were running
on a small hypervisor. I split them across multiple hypervisors, and the
application now scales properly with the
Cool, thanks! Very helpful
On Fri, 20 Dec 2019 at 6:53 pm, Enrico Minack wrote:
> The issue is explained in depth here:
> https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
>
> On 19.12.19 at 23:33, Chris Teoh wrote:
>
> As far as I'm aware it isn't any better. The
Hi all,
I have encountered a strange executor OOM error. I have a data pipeline
using Spark 2.3 and Scala 2.11.12. This pipeline writes the output to one
HDFS location as parquet, then reads the files back in and writes to
multiple Hadoop clusters (all co-located in the same data center). It should be