[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326195#comment-16326195 ]
Fernando Pereira edited comment on SPARK-17998 at 1/15/18 12:35 PM:
--------------------------------------------------------------------

[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{spark.sql.files.maxPartitionBytes}} worked.

> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17998
>                 URL: https://issues.apache.org/jira/browse/SPARK-17998
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0, 2.0.1
>        Environment: Spark Standalone Cluster (not "local mode")
>                     Windows 10 and Windows 7
>                     Python 3.x
>            Reporter: Shea Parkes
>            Priority: Major
>
> Reading a parquet file into a DataFrame is resulting in far too few
> in-memory partitions. In prior versions of Spark, the resulting DataFrame
> would have a number of partitions often equal to the number of parts in the
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by
> reading the Parquet back into memory for me. Why is it no longer just the
> number of part files in the Parquet folder? (Which is 13 in the example
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with
> the underlying RDD without first repartitioning the DataFrame, which is
> costly and wasteful. I really doubt this was the intended effect of moving
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is
> determined, but my Scala skills have proven inadequate. I'd be happy to dig
> further if someone could point me in the right direction...

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
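Editor's note on the behavior discussed above: in Spark 2.x the file scan packs Parquet parts into read partitions using a maximum split size derived from {{spark.sql.files.maxPartitionBytes}}, {{spark.sql.files.openCostInBytes}}, and the default parallelism, which is why 13 small part files can collapse into 7 in-memory partitions. Below is a rough pure-Python sketch of that split-size formula; the exact logic lives in Spark's `FileSourceScanExec`, and the helper name and example byte counts here are illustrative assumptions, not Spark API.

```python
def max_split_bytes(total_file_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes default
                    open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes default
                    default_parallelism=6):                 # matches the reproduction above
    """Approximate the maximum split size Spark 2.x uses when reading files."""
    # Each file is padded with an "open cost" so that many tiny files
    # still spread across the available cores.
    padded_total = total_file_bytes + num_files * open_cost
    bytes_per_core = padded_total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

# A very large single file is capped at maxPartitionBytes per split:
assert max_split_bytes(10**12, 1) == 128 * 1024 * 1024
# Tiny inputs never produce splits smaller than the open cost:
assert max_split_bytes(1_000, 1) == 4 * 1024 * 1024
```

Consistent with the edited comment, lowering the cap (e.g. `spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)` before the read) shrinks the split size and so yields more read partitions.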