Yes, it does bin-packing for small files, which is a good thing: it avoids creating many tiny partitions, especially if you are writing the data back out (i.e. it is effectively compacted as you read). The default target partition size is 128 MB, with a 4 MB "cost" charged for opening each file. You can configure this via spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, documented here: http://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
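As a rough sketch of how you would tune this in Scala (the bucket and path below are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("small-file-grouping")
  // Target size of each file-based input partition (default 128 MB).
  .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
  // Estimated byte cost of opening a file; a higher value charges each
  // file more of the partition budget, so fewer files per partition
  // (default 4 MB).
  .config("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)
  .getOrCreate()

// The reader bin-packs the small files, so the partition count comes
// out far below the file count (e.g. the 278 tasks you saw for ~5,000 files).
val df = spark.read.parquet("s3://your-bucket/your-path/")  // hypothetical path
println(df.rdd.getNumPartitions)

Roughly speaking, each file is charged its own size plus the open cost and files are packed greedily into partitions of up to the max partition size, so the partition count comes out to about ceil((totalSize + numFiles * openCostInBytes) / maxPartitionBytes), subject to a lower per-partition target when the cluster's default parallelism is high.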
From: Yann Moisan <yam...@gmail.com>
Date: Tuesday, November 13, 2018 at 3:28 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [Spark SQL] Does Spark group small files

Hello,

I'm using Spark 2.3.1. I have a job that reads 5,000 small Parquet files from S3. When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5,000). Does Spark group small files? If yes, what is the threshold for grouping? Is it configurable? Any link to the corresponding source code?

Regards,
Yann.