[ https://issues.apache.org/jira/browse/SPARK-24425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24425.
----------------------------------
    Resolution: Incomplete

> Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24425
>                 URL: https://issues.apache.org/jira/browse/SPARK-24425
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>            Reporter: sam
>            Priority: Major
>              Labels: bulk-closed
>
> I think this is a regression. In 1.6 we could easily control the number of output files/tasks via the number of input files plus `coalesce`. Now I have to use `repartition` to get the desired number of files/partitions, which is unnecessarily expensive.
> I've tried playing with spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes to see if I can force the conventional behaviour:
> {code:java}
> val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
>   .config("spark.sql.files.maxPartitionBytes", 1)
>   .config("spark.sql.files.openCostInBytes", Long.MaxValue)
>   .getOrCreate()
> {code}
> This didn't work: Spark still squashes all my Parquet files into fewer partitions.
> I suggest a simple `option` on DataFrameReader that can disable this behaviour (or enable it; the default should match 1.6).
>
> This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if SPARK-5997 were implemented this ticket wouldn't really be necessary.
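For readers landing on this ticket: a minimal, self-contained sketch of the reported behaviour and the `repartition` workaround. The input/output paths and the target partition count of 1000 are illustrative, not from the ticket; only the two `spark.sql.files.*` settings come from the report above.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("uber-cp")
  .master("local[*]")
  // The reporter's attempt: these settings were expected to stop Spark
  // from packing multiple small files into one partition...
  .config("spark.sql.files.maxPartitionBytes", 1)
  .config("spark.sql.files.openCostInBytes", Long.MaxValue)
  .getOrCreate()

// Hypothetical input: a directory of many small Parquet files.
val df = spark.read.parquet("/data/in")

// ...but per the report, 2.x still coalesces the files, so this prints
// fewer partitions than there are input files.
println(s"partitions after read: ${df.rdd.getNumPartitions}")

// Workaround described in the ticket: force the desired partition count,
// at the cost of a full shuffle that 1.6 did not require.
df.repartition(1000).write.parquet("/data/out")
{code}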