[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

sam (JIRA) Wed, 10 Jan 2018 02:36:49 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320041#comment-16320041
 ]


sam commented on SPARK-17998:
-----------------------------

[~srowen] Thanks, no idea where I got that from, cursed weakly typed silently 
failing config builders!!

Will try `spark.files.maxPartitionBytes`

> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17998
>                 URL: https://issues.apache.org/jira/browse/SPARK-17998
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0, 2.0.1
>         Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>            Reporter: Shea Parkes
>
> Reading a parquet ~file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven in-adequate.  I'd be happy to dig 
> further if someone could point me in the right direction...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

Reply via email to