[ https://issues.apache.org/jira/browse/SPARK-47425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shreyas updated SPARK-47425:
----------------------------
    Description:

In Scala, it is possible to do the following to create a bucketed table without producing many small files:

{code:scala}
df.repartition(expr("pmod(hash(user_id), 200)"))
  .write
  .mode(SaveMode.Overwrite)
  .bucketBy(200, "user_id")
  .option("path", output_path)
  .saveAsTable("bucketed_table")
{code}

I found [this small trick|https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53] to get the same number of files as buckets.

However, the equivalent does not work in spark-sql (using the repartition hint):

{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+ repartition (pmod(hash(user_id),200)) */
-- I have the hint set up properly; Jira is removing the space when displaying
* from df_table
{code}

It fails with:

{{REPARTITION Hint parameter should include columns, but 'pmod('hash('user_id), 200) found.}}

When I instead add a derived column and repartition by that, Spark no longer respects the repartition:

{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+ repartition (bkt) */
*, pmod(hash(user_id),200) as bkt
from df_table
{code}

{code:java}
$ hdfs dfs -ls -h /user/spark/warehouse/bucket_test.db/bucketed_table | head
Found 101601 items
...{code}

Can the behavior of the repartition hint be changed to work like the Scala/Python equivalent?

Thank you


> spark-sql does not recognize expressions in repartition hint
> ------------------------------------------------------------
>
>                 Key: SPARK-47425
>                 URL: https://issues.apache.org/jira/browse/SPARK-47425
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1
>            Reporter: Shreyas
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)