[ 
https://issues.apache.org/jira/browse/SPARK-24425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494864#comment-16494864
 ] 

Unai Sarasola commented on SPARK-24425:
---------------------------------------

Totally agree with you, Sam. It should be up to the developer to decide whether 
to maintain or optimize the number of partitions.

This makes it impossible to back up files in HDFS using Spark (I know that's 
not Spark's main task, but it's a limitation we didn't have with previous 
versions). I'm also experimenting with the same properties mentioned by 
[~sams], without luck.
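For reference, this is roughly what I'm trying (a sketch only: the property 
names come from the ticket, while the master URL and input path here are 
placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

// Placeholder master and path; substitute your own cluster settings.
val spark = SparkSession.builder()
  .appName("uber-cp")
  .master("local[*]")
  // Attempt to force one partition per input file: make every split tiny
  // and make opening an extra file maximally "expensive" to pack together.
  .config("spark.sql.files.maxPartitionBytes", 1)
  .config("spark.sql.files.openCostInBytes", Long.MaxValue)
  .getOrCreate()

// As reported in this ticket, Spark 2.x may still coalesce small parquet
// files into fewer partitions than there are input files.
val df = spark.read.parquet("hdfs:///path/to/input")
println(df.rdd.getNumPartitions)
{code}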

> Regression from 1.6 to 2.x - Spark no longer respects input partitions, 
> unnecessary shuffle required
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24425
>                 URL: https://issues.apache.org/jira/browse/SPARK-24425
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>            Reporter: sam
>            Priority: Major
>
> I think this is a regression. We used to be able to easily control the number 
> of output files / tasks based on num files and coalesce. Now I have to use 
> `repartition` to get the desired num files / partitions which is 
> unnecessarily expensive.
> I've tried playing with spark.sql.files.maxPartitionBytes and 
> spark.sql.files.openCostInBytes to see if I can force the conventional 
> behaviour.
> {code:scala}
> val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
>              .config("spark.sql.files.maxPartitionBytes", 1)
>              .config("spark.sql.files.openCostInBytes", Long.MaxValue)
>              .getOrCreate()
> {code}
> This didn't work. Spark just squashes all my parquet files into fewer 
> partitions.
> I suggest a simple `option` on DataFrameReader that can disable this (or 
> enable it; the default behaviour should match 1.6).
>  
> This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if 
> SPARK-5997 was implemented this ticket wouldn't really be necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
