In Spark 2.0 we bin-pack small files into a single task to avoid overloading the scheduler. If you want a specific number of partitions, you should repartition after reading. If you want to disable this optimization, you can set the file open cost very high via spark.sql.files.openCostInBytes.
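A minimal sketch of both approaches; the path and the 128 MB open-cost value are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-partition-count")
      // Make every file look expensive to open so Spark stops
      // bin-packing many small files into one read task.
      .config("spark.sql.files.openCostInBytes", (128L * 1024 * 1024).toString)
      .getOrCreate()

    // A multi-part Parquet "file" is really a directory of part files.
    val df = spark.read.parquet("/path/to/parquet_dir")
    println(df.rdd.getNumPartitions)

    // Or simply ask for the partition count you want.
    val df50 = df.repartition(50)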
On Tue, Oct 18, 2016 at 7:04 PM, shea.parkes <shea.par...@gmail.com> wrote:
> When reading a parquet ~file with >50 parts, Spark is giving me a DataFrame
> object with far fewer in-memory partitions.
>
> I'm happy to troubleshoot this further, but I don't know Scala well and
> could use some help pointing me in the right direction. Where should I be
> looking in the code base to understand how many partitions will result from
> reading a parquet ~file?
>
> Thanks,
>
> Shea