RE: [Spark SQL] Does Spark group small files

2018-11-14 Thread Lienhart, Pierre (DI IZ) - AF (ext)
Hello, I'm using Spark 2.3.1. I have a job that reads 5,000 small Parquet files from S3. When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5000). Does Spark group small files? If yes, what is the threshold for grouping

Re: [Spark SQL] Does Spark group small files

2018-11-13 Thread Silvio Fiorito
the settings defined here: http://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options

From: Yann Moisan
Date: Tuesday, November 13, 2018 at 3:28 PM
To: "user@spark.apache.org"
Subject: [Spark SQL] Does Spark group small files

Hello, I'm using Spark 2.3.
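
For context on the grouping question: in Spark 2.x the file-based sources (Parquet included) pack several small files into one read partition, and the two settings on the linked page that govern this are spark.sql.files.maxPartitionBytes (default 128 MB) and spark.sql.files.openCostInBytes (default 4 MB). The sketch below is a simplified pure-Python model of that packing, not Spark's actual code. The function names, the default_parallelism value of 8 and the 100 KB file size are assumptions chosen for illustration; real task counts also depend on the actual file sizes and the cluster's parallelism.

```python
# Simplified model of how Spark 2.x packs small files into read partitions.
# All names here are hypothetical; only the two config defaults come from Spark.

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes default
OPEN_COST_BYTES = 4 * 1024 * 1024        # spark.sql.files.openCostInBytes default


def max_split_bytes(file_sizes, max_partition_bytes=MAX_PARTITION_BYTES,
                    open_cost=OPEN_COST_BYTES, default_parallelism=8):
    """Target partition size: capped by max_partition_bytes, floored by open_cost."""
    # Every file is charged its length plus a fixed "open cost".
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def pack_files(file_sizes, max_partition_bytes=MAX_PARTITION_BYTES,
               open_cost=OPEN_COST_BYTES, default_parallelism=8):
    """Greedily pack files (largest first) into partitions, one task per partition."""
    split = max_split_bytes(file_sizes, max_partition_bytes,
                            open_cost, default_parallelism)
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition when the next file would overflow it.
        if current and current_size + size > split:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    # Assumed workload: 5,000 files of 100 KB each (sizes are illustrative).
    files = [100 * 1024] * 5000
    print(len(pack_files(files)), "partitions for", len(files), "files")
```

In this model, lowering maxPartitionBytes or raising openCostInBytes produces more, smaller partitions. In a real job both can be set before the read, for example with spark.conf.set("spark.sql.files.maxPartitionBytes", ...).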

[Spark SQL] Does Spark group small files

2018-11-13 Thread Yann Moisan
Hello, I'm using Spark 2.3.1. I have a job that reads 5,000 small Parquet files from S3. When I do a mapPartitions followed by a collect, only *278* tasks are used (I would have expected 5000). Does Spark group small files? If yes, what is the threshold for grouping? Is it configurable? Any