[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326195#comment-16326195 ]
Fernando Pereira edited comment on SPARK-17998 at 1/15/18 12:35 PM:
--------------------------------------------------------------------

[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{spark.sql.files.maxPartitionBytes}} worked.

> Reading Parquet files coalesces parts into too few in-memory partitions
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17998
>                 URL: https://issues.apache.org/jira/browse/SPARK-17998
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0, 2.0.1
>        Environment: Spark Standalone Cluster (not "local mode")
>                     Windows 10 and Windows 7
>                     Python 3.x
>            Reporter: Shea Parkes
>            Priority: Major
>
> Reading a parquet file into a DataFrame is resulting in far too few
> in-memory partitions. In prior versions of Spark, the resulting DataFrame
> would have a number of partitions often equal to the number of parts in the
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=100000000, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by
> reading the Parquet back into memory for me. Why is it no longer just the
> number of part files in the Parquet folder? (Which is 13 in the example
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with
> the underlying RDD without first repartitioning the DataFrame, which is
> costly and wasteful. I really doubt this was the intended effect of moving
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is
> determined, but my Scala skills have proven inadequate. I'd be happy to dig
> further if someone could point me in the right direction...

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
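Editor's note on the behavior discussed above: in Spark 2.x the file scan packs Parquet parts into read partitions using a maximum split size derived from {{spark.sql.files.maxPartitionBytes}}, {{spark.sql.files.openCostInBytes}}, and the default parallelism, which is why 13 small part files can collapse into 7 in-memory partitions. Below is a rough pure-Python sketch of that split-size formula; the exact logic lives in Spark's `FileSourceScanExec`, and the helper name and example byte counts here are illustrative assumptions, not Spark API.

```python
def max_split_bytes(total_file_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes default
                    open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes default
                    default_parallelism=6):                 # matches the reproduction above
    """Approximate the maximum split size Spark 2.x uses when reading files."""
    # Each file is padded with an "open cost" so that many tiny files
    # still spread across the available cores.
    padded_total = total_file_bytes + num_files * open_cost
    bytes_per_core = padded_total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

# A very large single file is capped at maxPartitionBytes per split:
assert max_split_bytes(10**12, 1) == 128 * 1024 * 1024
# Tiny inputs never produce splits smaller than the open cost:
assert max_split_bytes(1_000, 1) == 4 * 1024 * 1024
```

Consistent with the edited comment, lowering the cap (e.g. `spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)` before the read) shrinks the split size and so yields more read partitions.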