[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-42760:
------------------------------------

    Assignee: Apache Spark

> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
>                 Key: SPARK-42760
>                 URL: https://issues.apache.org/jira/browse/SPARK-42760
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.3.2
>         Environment: standard Spark 3.0.3/3.3.2, used in a Jupyter notebook, local mode
>            Reporter: binyang
>            Assignee: Apache Spark
>            Priority: Major
>
> I am using PySpark. The number of partitions of the result DataFrame of a join is always 1.
> Here is my code, adapted from
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>
> print(spark.version)
>
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
>
> example_shuffle_partitions()
>
> In Spark 3.0.3, this prints:
>
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
>
> However, the latest 3.3.2 prints the following:
>
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
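Editor's note, not part of the report: the change in behavior between 3.0.3 and 3.3.2 is plausibly explained by Adaptive Query Execution, which is enabled by default since Spark 3.2.0. With AQE on, `spark.sql.adaptive.coalescePartitions.enabled` (default `true`) lets Spark merge small shuffle partitions after the exchange, so a join over a few thousand rows can legitimately end up in a single partition regardless of `spark.sql.shuffle.partitions`. A minimal configuration sketch for testing this hypothesis against the reporter's repro:

```python
# Assumption: `spark` is an active SparkSession, as in the report's notebook.
# Both settings below are documented Spark SQL configuration keys.

# Option 1: keep AQE but stop it from coalescing shuffle partitions,
# which should restore the 3.0.x output ("Data partitions after join : 4").
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

# Option 2: disable AQE entirely.
spark.conf.set("spark.sql.adaptive.enabled", "false")
```

If either setting restores the expected partition count, the observed "always 1" is AQE coalescing tiny partitions rather than a planner bug.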