[SPARK-SQL] Writing partitioned parquet requires huge amounts of memory

2018-11-14 Thread Lienhart, Pierre (DI IZ) - AF (ext)
Hi everyone, The team I am working with is currently plagued by storing its data in hundreds of thousands of tiny parquet files. I am trying 1) to reduce the number of files and 2) to reduce the number of partitions. I wrote a very simple (Py)Spark (Spark 2.1.1 packaged within HDP 2.6.2.0) application w…
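
A common way to attack both problems at once is to repartition by the output partition column before writing, so that all rows for a given partition value land in the same task and each partition directory is written as a handful of files rather than one file per upstream task; this also keeps the number of simultaneously open parquet writers per task low, which is what typically drives the memory blow-up named in the subject. A minimal PySpark sketch, assuming hypothetical paths and a partition column named event_date:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

    # Hypothetical input path: a dataset scattered across many tiny files.
    df = spark.read.parquet("/data/events_tiny_files")

    # Shuffling by the partition column first sends all rows of a given
    # event_date to the same task, so each output directory is written by
    # one task and each task holds only one parquet writer open at a time.
    (df.repartition("event_date")
       .write
       .partitionBy("event_date")
       .mode("overwrite")
       .parquet("/data/events_compacted"))

A variant such as df.repartition(200, "event_date") would additionally cap the shuffle at 200 partitions, which bounds write-side parallelism when there are very many distinct partition values.
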

RE: [Spark SQL] Does Spark group small files

2018-11-14 Thread Lienhart, Pierre (DI IZ) - AF (ext)
Hello Yann, From my understanding, when reading small files Spark will group them and load the content of each batch into the same partition, so you won’t end up with one partition per file and hence a huge number of very small partitions. This behavior is controlled by the spark.files.maxParti…
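
The preview is cut off mid-name; the setting being described is presumably spark.sql.files.maxPartitionBytes, which in Spark 2.x caps how many bytes of input (including several whole small files) are packed into a single read partition. A minimal sketch of tuning it, assuming a hypothetical input path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("small-file-grouping").getOrCreate()

    # spark.sql.files.maxPartitionBytes (default 128 MB) bounds the bytes
    # packed into one read partition; raising it groups more small files
    # into each partition, so the scan produces fewer, larger partitions.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

    df = spark.read.parquet("/data/many_small_files")  # hypothetical path
    print(df.rdd.getNumPartitions())  # far fewer partitions than input files

Note that this grouping applies to the file-based DataFrame sources (parquet, ORC, JSON, CSV), not to low-level RDD reads such as sc.textFile.
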