[ https://issues.apache.org/jira/browse/SPARK-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272802#comment-16272802 ]
Adrian Ionescu commented on SPARK-22665:
----------------------------------------

{code}
scala> spark.range(10).repartition(10).select('id, spark_partition_id()).show
+---+--------------------+
| id|SPARK_PARTITION_ID()|
+---+--------------------+
|  9|                   0|
|  0|                   1|
|  1|                   2|
|  2|                   3|
|  3|                   4|
|  4|                   5|
|  5|                   6|
|  6|                   7|
|  7|                   8|
|  8|                   9|
+---+--------------------+

scala> spark.range(10).repartition(10, Seq.empty: _*).select('id, spark_partition_id()).show
+---+--------------------+
| id|SPARK_PARTITION_ID()|
+---+--------------------+
|  0|                   2|
|  1|                   2|
|  2|                   2|
|  3|                   2|
|  4|                   2|
|  5|                   2|
|  6|                   2|
|  7|                   2|
|  8|                   2|
|  9|                   2|
+---+--------------------+
{code}

> Dataset API: .repartition() inconsistency / issue
> -------------------------------------------------
>
>                 Key: SPARK-22665
>                 URL: https://issues.apache.org/jira/browse/SPARK-22665
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Adrian Ionescu
>
> We currently have two functions for explicitly repartitioning a Dataset:
> {code}
> def repartition(numPartitions: Int)
> {code}
> and
> {code}
> def repartition(numPartitions: Int, partitionExprs: Column*)
> {code}
> The second function's signature allows it to be called with an empty list of
> expressions as well.
> However:
> * {{df.repartition(numPartitions)}} does RoundRobin partitioning
> * {{df.repartition(numPartitions, Seq.empty: _*)}} does HashPartitioning on a
> constant, effectively moving all tuples to a single partition
> Not only is this inconsistent, but the latter behavior is very undesirable:
> it may hide problems in small-scale prototype code, but will inevitably fail
> (or have terrible performance) in production.
> I suggest we should make it:
> - either throw an {{IllegalArgumentException}}
> - or do RoundRobin partitioning, just like {{df.repartition(numPartitions)}}
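
Until the behavior is settled, the two options suggested above can be approximated on the caller's side. The following is only a minimal sketch: the object {{RepartitionGuard}} and both method names are hypothetical and not part of Spark; the only Spark calls used are the two {{Dataset.repartition}} overloads quoted in the description.

```scala
import org.apache.spark.sql.{Column, Dataset}

// Hypothetical caller-side helpers (not part of Spark) that avoid passing an
// empty expression list to repartition(numPartitions, partitionExprs: _*).
object RepartitionGuard {

  // Option 1: reject an empty expression list.
  // `require` throws IllegalArgumentException when the condition is false.
  def repartitionStrict[T](
      ds: Dataset[T],
      numPartitions: Int,
      partitionExprs: Column*): Dataset[T] = {
    require(partitionExprs.nonEmpty,
      "partitionExprs must not be empty; call repartition(numPartitions) instead")
    ds.repartition(numPartitions, partitionExprs: _*)
  }

  // Option 2: fall back to the single-argument overload, which the
  // description above reports as doing RoundRobin partitioning.
  def repartitionOrRoundRobin[T](
      ds: Dataset[T],
      numPartitions: Int,
      partitionExprs: Column*): Dataset[T] =
    if (partitionExprs.isEmpty) ds.repartition(numPartitions)
    else ds.repartition(numPartitions, partitionExprs: _*)
}
```

In a spark-shell session, {{RepartitionGuard.repartitionOrRoundRobin(spark.range(10), 10, Seq.empty: _*)}} should then take the same code path as {{spark.range(10).repartition(10)}} in the first example above.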