Spark group small files
Hello,
I'm using Spark 2.3.1.
I have a job that reads 5,000 small Parquet files from S3.
When I do a mapPartitions followed by a collect, only 278 tasks are used (I
would have expected 5,000). Does Spark group small files? If yes, what is the
threshold for grouping? Is it configurable?
See the settings defined here:
http://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
In particular, spark.sql.files.maxPartitionBytes (128 MB by default) and
spark.sql.files.openCostInBytes (4 MB by default) control how Spark SQL packs
multiple small files into a single read partition, which is why you see far
fewer tasks than files.
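To make the grouping concrete, here is a rough Python sketch of the packing logic Spark 2.3 applies when planning a file scan (simplified from FileSourceScanExec; it ignores the splitting of individual large files). The concrete numbers below (5,000 files of 100 KB each, a default parallelism of 200) are assumptions for illustration only, not taken from the original job:

```python
# Sketch of Spark 2.3's small-file packing (simplified; assumptions noted above).
MB = 1024 * 1024

def plan_partitions(file_sizes,
                    max_partition_bytes=128 * MB,  # spark.sql.files.maxPartitionBytes
                    open_cost=4 * MB,              # spark.sql.files.openCostInBytes
                    default_parallelism=200):      # assumed cluster parallelism
    # Target split size: capped at maxPartitionBytes, but never smaller than
    # the open cost, and scaled down so all cores get roughly equal work.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))

    # Greedily pack files (largest first) into partitions; each file is
    # charged its real size plus the fixed per-file open cost.
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_bytes + size > max_split:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size + open_cost
    if current:
        partitions.append(current)
    return partitions

parts = plan_partitions([100_000] * 5000)
print(len(parts))  # 200 partitions for 5,000 tiny files, not 5,000
```

So with the defaults, thousands of tiny files collapse into a few hundred partitions, which matches the behaviour described above. Lowering spark.sql.files.maxPartitionBytes (or raising spark.sql.files.openCostInBytes) produces more, smaller partitions.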
From: Yann Moisan
Date: Tuesday, November 13, 2018 at 3:28 PM
To: "user@spark.apache.org"
Subject: [Spark SQL] Does Spark group small files