In Spark 2.0 we bin-pack small files into a single task to avoid overloading the scheduler. If you want a specific number of partitions, you should repartition after reading. If you want to disable this optimization, you can set the file open cost very high via spark.sql.files.openCostInBytes.
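A minimal sketch of both approaches; the path and the 128 MB open-cost value are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-partition-count")
      // Make every file look expensive to open so Spark stops
      // bin-packing many small files into one read task.
      .config("spark.sql.files.openCostInBytes", (128L * 1024 * 1024).toString)
      .getOrCreate()

    // A multi-part Parquet "file" is really a directory of part files.
    val df = spark.read.parquet("/path/to/parquet_dir")
    println(df.rdd.getNumPartitions)

    // Or simply ask for the partition count you want.
    val df50 = df.repartition(50)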
On Tue, Oct 18, 2016 at 7:04 PM, shea.parkes <shea.par...@gmail.com> wrote:
> When reading a parquet ~file with >50 parts, Spark is giving me a DataFrame
> object with far fewer in-memory partitions.
>
> I'm happy to troubleshoot this further, but I don't know Scala well and
> could use some help pointing me in the right direction. Where should I be
> looking in the code base to understand how many partitions will result from
> reading a parquet ~file?
>
> Thanks,
>
> Shea