Yes, it does bin-packing for small files, which is a good thing: you avoid 
ending up with many small partitions, especially if you're writing this data 
back out (i.e. it effectively compacts as you read). The default maximum 
partition size is 128 MB, with a 4 MB "cost" charged for opening each file. 
You can configure this using the settings described here: 
http://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
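For reference, a minimal sketch of tuning those two settings 
(spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes are the 
documented options; the app name, bucket path, and values below are just 
illustrative assumptions):

import org.apache.spark.sql.SparkSession

// Illustrative values; the defaults are 128 MB and 4 MB respectively.
val spark = SparkSession.builder()
  .appName("small-file-read")  // hypothetical app name
  .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)  // max bytes packed into one partition
  .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)      // estimated cost of opening a file
  .getOrCreate()

// When reading many small parquet files, Spark bin-packs them into
// partitions of roughly maxPartitionBytes, counting openCostInBytes per file.
val df = spark.read.parquet("s3a://my-bucket/small-files/")  // hypothetical path
println(df.rdd.getNumPartitions)  // far fewer partitions than files if the files are small

Lowering maxPartitionBytes (or raising openCostInBytes) yields more, smaller 
partitions; raising it packs more files per partition.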

From: Yann Moisan <yam...@gmail.com>
Date: Tuesday, November 13, 2018 at 3:28 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [Spark SQL] Does Spark group small files

Hello,

I'm using Spark 2.3.1.

I have a job that reads 5,000 small parquet files from S3.

When I do a mapPartitions followed by a collect, only 278 tasks are used (I 
would have expected 5,000). Does Spark group small files? If yes, what is the 
threshold for grouping? Is it configurable? Any link to the corresponding 
source code?

Rgds,

Yann.
