[ 
https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703824#comment-16703824
 ] 

Sam hendley commented on SPARK-26209:
-------------------------------------

Seems like DataFrame needs a side-channel for the 'invariants' of the dataframe, 
like its bucketing state or how it was partitioned. It could be used for lots 
of other optimizations like bucketing and windowing. Relying on an external 
system to hold that data makes the whole system less cohesive. There are a lot 
of cool things that could happen if operations updated that invariant state as 
they run. I am guessing that the invariants already exist 'implicitly' in the 
DAG, but managing that state explicitly and being able to serialize/deserialize 
it would help make some of these complex optimizations like bucketization 
easier to apply. 

For this specific case the side-channel would just contain the 'bucketed' and 
'sorted' fields that are stored in Hive. As soon as we do some operation that 
shuffles this data to other partitions or otherwise makes it non-bucketable we 
would clear this state. When we call bucketBy/sortBy etc. it would re-add the 
correct metadata.
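
A rough sketch of that lifecycle, in plain Python rather than Spark so it runs 
on its own. Everything here is hypothetical and for illustration only: 
LayoutInvariants, TrackedFrame and their methods are made-up names, not Spark 
APIs. It only shows the bookkeeping: bucket_by/sort_by set the state, 
row-local operations keep it, and a shuffle clears it.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class LayoutInvariants:
    """Hypothetical side-channel: what we currently know about the layout."""
    num_buckets: Optional[int] = None
    bucket_columns: Tuple[str, ...] = ()
    sort_columns: Tuple[str, ...] = ()

    @property
    def is_bucketed(self) -> bool:
        return self.num_buckets is not None and len(self.bucket_columns) > 0


@dataclass(frozen=True)
class TrackedFrame:
    """Toy stand-in for a DataFrame: rows plus the invariants travelling with them."""
    rows: Tuple[dict, ...]
    invariants: LayoutInvariants = field(default_factory=LayoutInvariants)

    def bucket_by(self, num_buckets: int, *columns: str) -> "TrackedFrame":
        # Re-add bucketing state; any earlier sort order no longer applies.
        return replace(
            self, invariants=LayoutInvariants(num_buckets, tuple(columns), ())
        )

    def sort_by(self, *columns: str) -> "TrackedFrame":
        # In this sketch, sorting within buckets only makes sense once bucketed.
        if not self.invariants.is_bucketed:
            raise ValueError("sort_by requires bucket_by first")
        return replace(
            self, invariants=replace(self.invariants, sort_columns=tuple(columns))
        )

    def filter(self, predicate: Callable[[dict], bool]) -> "TrackedFrame":
        # Row-local operation: rows never move, so the invariants survive.
        return replace(self, rows=tuple(r for r in self.rows if predicate(r)))

    def repartition(self, num_partitions: int) -> "TrackedFrame":
        # A shuffle moves rows between partitions, so the state is cleared.
        # (num_partitions is unused; no data is actually moved in this toy.)
        return replace(self, invariants=LayoutInvariants())


frame = TrackedFrame(rows=({"id": 1}, {"id": 2}, {"id": 3}))
bucketed = frame.bucket_by(8, "id").sort_by("id")
print(bucketed.invariants.is_bucketed)                    # state was added
print(bucketed.filter(lambda r: r["id"] > 1).invariants)  # state preserved
print(bucketed.repartition(4).invariants.is_bucketed)     # state cleared
```

An optimizer could then check the invariants of both sides of a join and skip 
the shuffle when the bucket count and bucket columns match.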

It seems like we could use things like the FileMetaData.key_value_metadata 
fields to store this metadata. Could we add this same functionality to Parquet 
DataFrames?
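
As a sketch of what storing this could look like: Parquet's 
key_value_metadata holds string key/value pairs, so the bucketing state would 
have to be encoded as strings. The example below uses JSON under a single key. 
The key name and the helper functions are assumptions made up for this 
illustration, not an existing Spark or Parquet convention, and no Parquet 
library is involved; a plain dict stands in for the footer metadata.

```python
import json
from typing import Dict, Optional, Sequence

# Hypothetical metadata key, chosen only for this example.
INVARIANTS_KEY = "example.layout.invariants"


def encode_invariants(
    num_buckets: int,
    bucket_columns: Sequence[str],
    sort_columns: Sequence[str] = (),
) -> Dict[str, str]:
    """Serialize bucketing state into a string key/value pair."""
    payload = {
        "numBuckets": num_buckets,
        "bucketColumns": list(bucket_columns),
        "sortColumns": list(sort_columns),
    }
    return {INVARIANTS_KEY: json.dumps(payload, sort_keys=True)}


def decode_invariants(metadata: Dict[str, str]) -> Optional[dict]:
    """Return the stored bucketing state, or None if the file carries none."""
    raw = metadata.get(INVARIANTS_KEY)
    if raw is None:
        return None
    return json.loads(raw)


# Stand-in for a file footer: existing entries plus the new bucketing entry.
footer = {"writer": "some-writer"}
footer.update(encode_invariants(8, ["id"], ["id"]))

print(decode_invariants(footer))
print(decode_invariants({"writer": "some-writer"}))  # no bucketing info stored
```

A reader that finds no such key would simply treat the data as unbucketed, so 
files written without the metadata keep working.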

> Allow for dataframe bucketization without Hive
> ----------------------------------------------
>
>                 Key: SPARK-26209
>                 URL: https://issues.apache.org/jira/browse/SPARK-26209
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, Java API, SQL
>    Affects Versions: 2.4.0
>            Reporter: Walt Elder
>            Priority: Minor
>
> As a DataFrame author, I can elect to bucketize my output without involving 
> Hive or HMS, so that my hive-less environment can benefit from this 
> query-optimization technique. 
>  
> https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397
>  identifies this as a shortcoming with the umbrella feature provided via 
> SPARK-19256.
>  
> In short, relying on Hive to store metadata *precludes* environments which 
> don't have/use Hive from making use of bucketization features. 



