[ 
https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649824#comment-17649824
 ] 

Yeachan Park commented on SPARK-26209:
--------------------------------------

Is there an update on this? We'd also be interested in this feature. AFAIK the 
file names already contain the bucket number. For things like bucket pruning, 
couldn't we just enable a configuration that lets Spark take advantage of this 
by computing the same function/hash it used to bucket the data in the first 
place, without the need for a metastore?
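
To make the idea above concrete, here is a minimal sketch of file-level bucket pruning driven only by file names. It is illustrative, not Spark's implementation: it assumes the bucket id is the run of digits after the last underscore in the file name (e.g. `part-00000-<uuid>_00003.c000.snappy.parquet`), and the helper names `bucket_id_of` and `prune_files` are hypothetical. Spark's bucketing hash is not reimplemented here; the target bucket id is taken as an input that would have to come from applying the same hash to the filter value.

```python
import re
from typing import Iterable, List, Optional

# Assumption: the bucket id is the digit run after the last underscore,
# optionally followed by extensions such as ".c000.snappy.parquet".
_BUCKETED_FILE = re.compile(r".*_(\d+)(?:\..*)?$")


def bucket_id_of(file_name: str) -> Optional[int]:
    """Return the bucket id encoded in a file name, or None if there is none."""
    match = _BUCKETED_FILE.match(file_name)
    return int(match.group(1)) if match else None


def prune_files(file_names: Iterable[str], wanted_bucket: int) -> List[str]:
    """Keep only the files whose name encodes the wanted bucket id."""
    return [name for name in file_names if bucket_id_of(name) == wanted_bucket]
```

With a helper like this, a reader would only need the bucket count and bucketing columns (from a config or a sidecar file rather than a metastore) to decide which files to skip.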

> Allow for dataframe bucketization without Hive
> ----------------------------------------------
>
>                 Key: SPARK-26209
>                 URL: https://issues.apache.org/jira/browse/SPARK-26209
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, Java API, SQL
>    Affects Versions: 3.1.0
>            Reporter: Walt Elder
>            Priority: Minor
>
> As a DataFrame author, I can elect to bucketize my output without involving 
> Hive or HMS, so that my Hive-less environment can benefit from this 
> query-optimization technique. 
>  
> https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397
>  identifies this as a shortcoming with the umbrella feature provided via 
> SPARK-19256.
>  
> In short, relying on Hive to store metadata *precludes* environments which 
> don't have/use Hive from making use of bucketization features. 


