[ https://issues.apache.org/jira/browse/SPARK-47425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shreyas updated SPARK-47425:
----------------------------
    Description: 
In Scala, it is possible to do the following to create a bucketed table 
without producing many small files:
{code:scala}
df.repartition(expr("pmod(hash(user_id), 200)"))
  .write
  .mode(SaveMode.Overwrite)
  .bucketBy(200, "user_id")
  .option("path", output_path)
  .saveAsTable("bucketed_table")
{code}
I found [this small 
trick|https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53]
 to produce the same number of files as buckets.
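
My understanding of why the trick works, based on the linked article: Spark 
assigns each row to bucket pmod(murmur3_hash(user_id), numBuckets), and the 
SQL hash() function is the same Murmur3 hash, so repartitioning by 
pmod(hash(user_id), 200) routes every row of a given bucket to a single task, 
which then writes exactly one file for that bucket. A quick sketch to 
sanity-check the bucket ids (assumes df from the snippet above):
{code:scala}
import org.apache.spark.sql.functions.expr

// Sketch: compute the same bucket id Spark's writer would assign.
val withBucketId = df.withColumn("bucket_id", expr("pmod(hash(user_id), 200)"))

// Expect 200 distinct ids (0 through 199), hence one output file per id.
withBucketId.groupBy("bucket_id").count().orderBy("bucket_id").show(200)
{code}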

However, the equivalent does not work in spark-sql (using the REPARTITION hint):
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+ repartition (pmod(hash(user_id),200)) */ * from df_table
{code}
{{REPARTITION Hint parameter should include columns, but 'pmod('hash('user_id), 200) found.}}
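
As far as I can tell, the hint parser only accepts a partition count and plain 
column references. A sketch of the forms that are accepted today (I have not 
verified the resulting file counts for any of them):
{code:scala}
// The REPARTITION hint accepts a number, bare columns, or both.
// Expressions such as pmod(hash(user_id), 200) are rejected by the parser.
spark.sql("select /*+ REPARTITION(200) */ * from df_table")
spark.sql("select /*+ REPARTITION(user_id) */ * from df_table")
spark.sql("select /*+ REPARTITION(200, user_id) */ * from df_table")
{code}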

When I instead compute the expression as a named column and pass that column 
to the hint, Spark no longer respects the repartitioning at all:
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+ repartition (bkt) */
*, pmod(hash(user_id),200) as bkt
from df_table
{code}
{code:bash}
$ hdfs dfs -ls -h /user/spark/warehouse/bucket_test.db/bucketed_table | head
Found 101601 items
...{code}
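
A possible workaround (untested here) is to drop back to the DataFrame API for 
the write, since repartition() there accepts arbitrary Column expressions. A 
sketch, reusing output_path from the first snippet and assuming df_table is 
registered:
{code:scala}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

// Same trick as the Scala snippet above, but reading the SQL table first;
// repartition() takes Column expressions, unlike the REPARTITION hint.
spark.table("df_table")
  .repartition(expr("pmod(hash(user_id), 200)"))
  .write
  .mode(SaveMode.Overwrite)
  .bucketBy(200, "user_id")
  .option("path", output_path)
  .saveAsTable("bucketed_table")
{code}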
Can the behavior of the REPARTITION hint be changed to accept expressions, so 
it works like the Scala/Python equivalent?

Thank you



> spark-sql does not recognize expressions in repartition hint
> ------------------------------------------------------------
>
>                 Key: SPARK-47425
>                 URL: https://issues.apache.org/jira/browse/SPARK-47425
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1
>            Reporter: Shreyas
>            Priority: Major


