Hi everyone,
The team I am working with is currently plagued by its data being stored in hundreds
of thousands of tiny Parquet files. I am trying 1) to reduce the number of files
and 2) to reduce the number of partitions. I wrote a very simple (Py)Spark (Spark
2.1.1, packaged within HDP 2.6.2.0) application w…
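
Roughly, a minimal sketch of the kind of compaction job I mean; the HDFS paths
and the target of 64 partitions below are placeholders, not our real values:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("compact-tiny-parquet")
             .getOrCreate())

    # Read the many small Parquet files.
    df = spark.read.parquet("hdfs:///data/events/")  # placeholder input path

    # coalesce() narrows to fewer partitions without a full shuffle;
    # repartition() would shuffle but balances skewed inputs better.
    (df.coalesce(64)
       .write.mode("overwrite")
       .parquet("hdfs:///data/events_compacted/"))  # placeholder output path

    spark.stop()
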
Hello Yann,
From my understanding, when reading small files Spark will group them and load
the contents of each batch into the same partition, so you won't end up with one
partition per file and a huge number of very small partitions. This behavior is
controlled by the spark.files.maxPartitionBytes setting.
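
For Parquet read through the DataFrame API, I believe the equivalent SQL setting
spark.sql.files.maxPartitionBytes is the one that applies. A minimal sketch of
tuning it; the 256 MB value and the path are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("read-small-files")
             # Max bytes Spark packs into one input partition when reading
             # files; the default is 128 MB. Raising it groups more tiny
             # files into each partition.
             .config("spark.sql.files.maxPartitionBytes",
                     str(256 * 1024 * 1024))
             .getOrCreate())

    df = spark.read.parquet("hdfs:///data/events/")  # placeholder path
    # Should print far fewer partitions than the number of input files.
    print(df.rdd.getNumPartitions())
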