Hi all,

I came across an unusual use case. I tried debugging this myself and also
checked Jira for existing tickets or contributions in this direction, but
without success.

I have the following snippet, running on a 2-worker cluster (16 CPUs each):

from pyspark.sql import functions as sqlf

d = spark.table(f"{project}.{t}")
# Number of partitions produced by the scan; I expect one per bucket.
print(d.rdd.getNumPartitions())
# Tag each row with the partition it was read into, then count distinct ids.
d = d.withColumn("pid", sqlf.spark_partition_id())
d = d.select("pid").distinct()
print(d.count())

Table t is bucketed into 6400 buckets. When this code runs on Spark 2.4.5 it
prints 6400 and uses 6400 tasks. When it runs on Spark 3.0.1 or 3.1.2 it
prints 424 and uses 424 tasks.
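To see which scan Spark chooses, I've been looking at the formatted physical
plan; this is a small check I'd run (assuming Spark 3.x, where explain accepts
a mode argument, and that the scan node reports its bucketing details):

# Sketch, assuming Spark 3.x: print the formatted plan and look at the file
# scan node, which should indicate whether the bucket spec is actually used.
spark.table(f"{project}.{t}").explain(mode="formatted")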

Is there a config which makes the buckets coalesce?

I have already tried the new Spark setting
spark.sql.sources.bucketing.autoBucketedScan.enabled. This only makes
d.rdd.getNumPartitions() report the correct number of buckets.
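For reference, this is roughly how I set it, as a session-level config before
reading the table (the exact placement is just my assumption about what
matters here):

# Disable the automatic bucketed-scan decision so the reader keeps producing
# one partition per bucket; in my tests this only fixes getNumPartitions().
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")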

- sebastian
