Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Chris Teoh
If you're using Spark SQL, that configuration setting causes a shuffle if the number of your input partitions to the write is larger than that configuration. Is there anything in the executor logs or the Spark UI DAG that indicates a shuffle? I don't expect a shuffle if it is a straight write.

Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
Could you explain why shuffle partitions might be a good starting point? Some more details: when I write the output for the first time, after the logic is complete, I repartition the files to 20 (after having spark.sql.shuffle.partitions = 2000) so we don’t have too many small files. Data is small, about

Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Chris Teoh
spark.sql.shuffle.partitions might be a start. Is there a difference between the number of partitions when the parquet is read and spark.sql.shuffle.partitions? Is it much higher than spark.sql.shuffle.partitions? On Fri, 20 Dec 2019, 7:34 pm Ruijing Li, wrote: > Hi all, > > I have encountered a

Re: Identify bottleneck

2019-12-20 Thread Nicolas Paris
Apparently the "withColumn" issue only applies to hundreds or thousands of calls. That was not the case here (twenty calls). On Fri, Dec 20, 2019 at 08:53:16AM +0100, Enrico Minack wrote: > The issue is explained in depth here: https://medium.com/@manuzhang/ >

Re: Solved: Identify bottleneck

2019-12-20 Thread Antoine DUBOIS
Thank you very much for your help and your inputs. I understood some things, but I finally found my issue. My main problem was a virtualization one: my VM was running on a small hypervisor. I split the VMs across multiple hypervisors and the application now scales properly with the

Re: Identify bottleneck

2019-12-20 Thread ayan guha
Cool, thanks! Very helpful On Fri, 20 Dec 2019 at 6:53 pm, Enrico Minack wrote: > The issue is explained in depth here: > https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 > > Am 19.12.19 um 23:33 schrieb Chris Teoh: > > As far as I'm aware it isn't any better. The

Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
Hi all, I have encountered a strange executor OOM error. I have a data pipeline using Spark 2.3 Scala 2.11.12. This pipeline writes the output to one HDFS location as parquet then reads the files back in and writes to multiple hadoop clusters (all co-located in the same datacenter). It should be