Shreyas created SPARK-47425:
-------------------------------

             Summary: spark-sql does not recognize expressions in repartition hint
                 Key: SPARK-47425
                 URL: https://issues.apache.org/jira/browse/SPARK-47425
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.1
            Reporter: Shreyas
In Scala it is possible to do the following to create a bucketed table without producing many small files:

{code:scala}
df.repartition(expr("pmod(hash(user_id), 200)"))
  .write
  .mode(SaveMode.Overwrite)
  .bucketBy(200, "user_id")
  .option("path", output_path)
  .saveAsTable("bucketed_table")
{code}

I found [this small trick|https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53] to produce the same number of files as buckets. However, the equivalent does not work in spark-sql (using the repartition hint):

{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+ repartition(pmod(hash(user_id), 200)) */ * from df_table
{code}

{{REPARTITION Hint parameter should include columns, but 'pmod('hash('user_id), 200) found.}}

When I instead compute the expression as a separate column and repartition on that, Spark no longer respects the repartition:

{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+ repartition(bkt) */ *, pmod(hash(user_id), 200) as bkt from df_table
{code}

{code:bash}
$ hdfs dfs -ls -h /user/spark/warehouse/bucket_test.db/bucketed_table | head
Found 101601 items
...
{code}

Can the behavior of the repartition hint be changed to work like the Scala/Python equivalent?

Thank you
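A possible workaround (a sketch only, which I have not verified on 3.4.1) is the {{DISTRIBUTE BY}} clause, which, unlike the {{REPARTITION}} hint, does accept arbitrary expressions:

{code:sql}
-- Unverified workaround: DISTRIBUTE BY accepts expressions, so all rows
-- with the same pmod(hash(user_id), 200) value should land in the same
-- shuffle partition, giving one file per bucket.
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select * from df_table
distribute by pmod(hash(user_id), 200)
{code}

Even if that works, it seems inconsistent that the hint rejects expressions that {{df.repartition()}} accepts.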