[ https://issues.apache.org/jira/browse/SPARK-22051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267169#comment-16267169 ]
Fernando Pereira commented on SPARK-22051:
------------------------------------------

Ideas anyone?

> Explicit control of number of partitions after dataframe operations (join, order...)
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-22051
>                 URL: https://issues.apache.org/jira/browse/SPARK-22051
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> At the moment, at least from PySpark, there is no obvious way to control the number of partitions resulting from a join, an order by, and also from spark.read, etc., so one often ends up with the (in)famous 200 partitions.
> Of course one can call df.repartition(), but most of the time that just reshuffles the data again.
> One workaround is to change the config variable spark.sql.shuffle.partitions at runtime (a short sketch follows at the end of this message). However, when tuning an application's performance we may want different values or partitioning columns depending on the sizes and structures of the DataFrames, and changing a global config variable several times simply doesn't feel right. Moreover, it doesn't apply to all operations (e.g. spark.read).
> Therefore I believe it would be a really nice feature to either:
> - Allow the user to specify the partitioning options in those operations, e.g. df.join(df2, partitions=N, partition_cols=[col1])
> - Optimize subsequent calls to repartition() so that they change the parameters of the latest partitioner in the execution plan, instead of instantiating and executing a new partitioner.
> My apologies if there is a better way of doing this, or if work in that direction is already in progress; I couldn't find anything satisfactory.
> If the community finds any of these ideas useful, I can try to help implement them.
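For illustration, here is a minimal PySpark sketch of the runtime-config workaround and of the repartition() approach described in the quoted report. It assumes Spark 2.x with a plain local SparkSession; the DataFrames, app name, and partition counts are made up, and broadcast joins are disabled here only so the shuffle behaviour is visible:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

# Disable broadcast joins so the join below actually shuffles and the effect
# of spark.sql.shuffle.partitions can be observed.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

df1 = spark.range(1000000).withColumnRenamed("id", "key")
df2 = spark.range(100000).withColumnRenamed("id", "key")

# With the default config, the shuffled join comes back with
# spark.sql.shuffle.partitions partitions (200 by default).
joined = df1.join(df2, "key")
print(joined.rdd.getNumPartitions())        # typically 200

# Workaround: change the global config before the shuffle-producing operation.
spark.conf.set("spark.sql.shuffle.partitions", "32")
joined_tuned = df1.join(df2, "key")
print(joined_tuned.rdd.getNumPartitions())  # 32

# Alternative: repartition() afterwards, at the cost of an extra shuffle.
repartitioned = joined.repartition(16, "key")
print(repartitioned.rdd.getNumPartitions()) # 16

Neither option lets the join itself take the target partitioning, which is what the first proposal asks for; df.join(df2, partitions=N, partition_cols=[col1]) is the suggested API and does not exist today.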