Re: How does Spark determine in-memory partition count when reading Parquet files?
Thank you for the reply, tosaurabh85. We do tune our shuffle partition count, but that setting was not influencing the reading of the Parquet files (as I understand it, the data is not shuffled as it is read). Apologies -- I actually did receive an answer, but it was not captured on the mailing list here. I'm posting the exchange below so future readers can find the answer as well:

On Wed, Oct 19, 2016 at 9:33 PM Michael Armbrust wrote:
> In Spark 2.0 we bin-pack small files into a single task to avoid
> overloading the scheduler. If you want a specific number of partitions
> you should repartition. If you want to disable this optimization you can
> set the file open cost very high: spark.sql.files.openCostInBytes

My reply:

Thank you very much for that information, sir. It does make sense; I just did not find it in any release notes. I will work to tune that parameter appropriately for our workflow.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-determine-in-memory-partition-count-when-reading-Parquet-files-tp27921p27943.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
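For readers who want a feel for the bin-packing Michael describes: to the best of my understanding, Spark 2.x derives a maximum split size from `spark.sql.files.maxPartitionBytes`, `spark.sql.files.openCostInBytes`, and the default parallelism, then packs files (each padded by the open cost) into splits of at most that size. The following is a rough sketch of that arithmetic in plain Python -- the function and variable names are mine, not Spark's, and this is an approximation, not Spark's exact packing code:

```python
def max_split_bytes(total_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes default
                    open_cost_in_bytes=4 * 1024 * 1024,     # spark.sql.files.openCostInBytes default
                    default_parallelism=8):
    # Pad each file by the open cost, spread the total over the default
    # parallelism, then cap at maxPartitionBytes and floor at openCostInBytes.
    padded_total = total_bytes + num_files * open_cost_in_bytes
    bytes_per_core = padded_total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

def estimated_partitions(file_sizes,
                         open_cost_in_bytes=4 * 1024 * 1024,
                         **kwargs):
    # Greedily bin-pack padded files into splits of at most max_split_bytes.
    split = max_split_bytes(sum(file_sizes), len(file_sizes),
                            open_cost_in_bytes=open_cost_in_bytes, **kwargs)
    partitions, current = 0, 0
    for size in sorted(file_sizes, reverse=True):
        padded = size + open_cost_in_bytes
        if current + padded > split and current > 0:
            partitions += 1
            current = 0
        current += padded
    if current > 0:
        partitions += 1
    return partitions
```

With 50 files of 1 MB each and the defaults above, the open-cost padding dominates the real bytes, so the files get packed into far fewer partitions than 50 -- which matches the behavior Shea observed. Raising `open_cost_in_bytes` shrinks how many files fit in a split, which is exactly why setting it very high disables the coalescing.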
Re: How does Spark determine in-memory partition count when reading Parquet files?
In Spark 2.0 we bin-pack small files into a single task to avoid overloading the scheduler. If you want a specific number of partitions you should repartition. If you want to disable this optimization you can set the file open cost very high: spark.sql.files.openCostInBytes

On Tue, Oct 18, 2016 at 7:04 PM, shea.parkes wrote:
> When reading a parquet file with >50 parts, Spark is giving me a DataFrame
> object with far fewer in-memory partitions.
>
> I'm happy to troubleshoot this further, but I don't know Scala well and
> could use some help pointing me in the right direction. Where should I be
> looking in the code base to understand how many partitions will result from
> reading a parquet file?
>
> Thanks,
>
> Shea
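In PySpark terms, Michael's two suggestions might look like the sketch below. This is illustrative only (the path and variable names are mine), and it requires a running Spark session to actually execute:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: explicitly ask for a partition count after reading.
df = spark.read.parquet("/path/to/parquet").repartition(50)

# Option 2: make each file look "expensive" to open, so the bin-packing
# optimization stops coalescing many small files into one task.
spark.conf.set("spark.sql.files.openCostInBytes", str(128 * 1024 * 1024))
df2 = spark.read.parquet("/path/to/parquet")
print(df2.rdd.getNumPartitions())
```

Note that `repartition` triggers a shuffle, while raising the open cost only changes how the read itself is split, so the two approaches have different performance trade-offs.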
How does Spark determine in-memory partition count when reading Parquet files?
When reading a parquet file with >50 parts, Spark is giving me a DataFrame object with far fewer in-memory partitions.

I'm happy to troubleshoot this further, but I don't know Scala well and could use some help pointing me in the right direction. Where should I be looking in the code base to understand how many partitions will result from reading a parquet file?

Thanks,

Shea

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-determine-in-memory-partition-count-when-reading-Parquet-files-tp27921.html