Shreyas created SPARK-47425:
-------------------------------

             Summary: spark-sql does not recognize expressions in repartition hint
                 Key: SPARK-47425
                 URL: https://issues.apache.org/jira/browse/SPARK-47425
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.1
            Reporter: Shreyas


In Scala, it is possible to repartition a DataFrame by an expression before
writing a bucketed table, so that the table does not end up with many small
files:

{code:scala}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

// Pre-partition on the bucketing hash so each task holds exactly one
// bucket's rows, yielding one file per bucket instead of many small files.
df.repartition(expr("pmod(hash(user_id), 200)"))
  .write
  .mode(SaveMode.Overwrite)
  .bucketBy(200, "user_id")
  .option("path", output_path)
  .saveAsTable("bucketed_table")
{code}
I found [this small trick|https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53] for producing the same number of files as buckets.

However, the equivalent does not work in spark-sql using the REPARTITION hint:
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+repartition (pmod(hash(user_id),200)) */
* from df_table
{code}
This fails with:
{{REPARTITION Hint parameter should include columns, but 'pmod('hash('user_id), 200) found.}}
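
For reference, the hint does parse when given only column names (optionally preceded by a partition count). Note this is not equivalent to the Scala snippet, since it hashes user_id directly rather than partitioning by the pmod expression; a sketch against the same df_table, shown only to illustrate the accepted form:
{code:sql}
-- Accepted form: partition count and plain column names only.
-- Not equivalent to repartitioning by pmod(hash(user_id), 200).
select /*+ REPARTITION(200, user_id) */
* from df_table
{code}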

When I instead add a derived column and repartition by that, Spark no longer
respects the repartition; instead of roughly 200 files (one per bucket), the
table ends up with over 100k files:
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+repartition (bkt) */
*, pmod(hash(user_id),200) as bkt
from df_table
{code}
{code:bash}
$ hdfs dfs -ls -h /user/spark/warehouse/bucket_test.db/bucketed_table | head
Found 101601 items
...{code}
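
A possible interim workaround (a sketch on my part, untested here): the DISTRIBUTE BY clause accepts expressions, unlike the REPARTITION hint, so the same pre-partitioning might be expressible as:
{code:sql}
-- Sketch, untested: DISTRIBUTE BY takes expressions, so this may
-- achieve the same one-file-per-bucket layout as the Scala code.
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select * from df_table
distribute by pmod(hash(user_id), 200)
{code}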
Can the behavior of the REPARTITION hint be changed to accept expressions, matching the Scala/Python equivalent?

Thank you


