[ https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703824#comment-16703824 ]
Sam hendley commented on SPARK-26209:
-------------------------------------

Seems like a DataFrame needs a side-channel for its 'invariants', such as its bucketing state or how it was partitioned. That side-channel could be used for lots of other optimizations, like bucketing and windowing. Relying on an external system to hold that data makes the whole system less cohesive. There are a lot of cool things that could happen if operations updated that invariant state as they run.

I am guessing that the invariants already exist 'implicitly' in the DAG, but managing that state explicitly and being able to serialize/deserialize it would make some of these complex optimizations, like bucketization, easier to apply.

For this specific case the side-channel would just contain the 'bucketed' and 'sorted' fields that are stored in Hive. As soon as we do some operation that shuffles this data to other partitions or otherwise makes it non-bucketable, we would clear this state. When we call bucketBy/sortBy etc. it would re-add the correct metadata.

It seems like we could use things like the FileMetaData.key_value_metadata fields to store this metadata. Could we add this same functionality to Parquet DataFrames?

> Allow for dataframe bucketization without Hive
> ----------------------------------------------
>
>                 Key: SPARK-26209
>                 URL: https://issues.apache.org/jira/browse/SPARK-26209
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, Java API, SQL
>    Affects Versions: 2.4.0
>            Reporter: Walt Elder
>            Priority: Minor
>
> As a DataFrame author, I can elect to bucketize my output without involving
> Hive or HMS, so that my hive-less environment can benefit from this
> query-optimization technique.
>
> https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397
> identifies this as a shortcoming with the umbrella feature provided via
> SPARK-19256.
>
> In short, relying on Hive to store metadata *precludes* environments which
> don't have/use hive from making use of bucketization features.
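
The side-channel idea from the comment (invariant state that is set by bucketBy/sortBy, cleared by anything that shuffles, and serialized next to the data) could be sketched roughly as follows. This is a minimal illustration in plain Python, not Spark code. TrackedFrame, BucketSpec, and the key "example.bucket_spec" are hypothetical names made up for this sketch, and the dict of strings only stands in for a key/value metadata map such as Parquet's FileMetaData.key_value_metadata. Which operations preserve or destroy the layout is also an assumption chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Dict, Optional, Tuple

# Hypothetical metadata key; not a name defined by Spark or Parquet.
META_KEY = "example.bucket_spec"


@dataclass(frozen=True)
class BucketSpec:
    """The layout invariant: how the rows are bucketed and sorted."""
    num_buckets: int
    bucket_columns: Tuple[str, ...]
    sort_columns: Tuple[str, ...] = ()


@dataclass(frozen=True)
class TrackedFrame:
    """Hypothetical stand-in for a DataFrame that carries its invariant."""
    name: str
    bucket_spec: Optional[BucketSpec] = None

    def bucket_by(self, num_buckets: int, *columns: str) -> "TrackedFrame":
        # Bucketing establishes the invariant from scratch.
        return replace(self, bucket_spec=BucketSpec(num_buckets, tuple(columns)))

    def sort_by(self, *columns: str) -> "TrackedFrame":
        # Assumption: sorting within buckets only makes sense after bucket_by.
        if self.bucket_spec is None:
            raise ValueError("sort_by requires bucket_by first")
        spec = replace(self.bucket_spec, sort_columns=tuple(columns))
        return replace(self, bucket_spec=spec)

    def repartition(self, num_partitions: int) -> "TrackedFrame":
        # A shuffle moves rows between partitions, so the invariant is cleared.
        return replace(self, bucket_spec=None)

    def filter(self, predicate: object) -> "TrackedFrame":
        # Dropping rows does not move the remaining ones; the invariant is kept.
        return self

    def to_key_value_metadata(self) -> Dict[str, str]:
        """Serialize the invariant into a string-to-string map."""
        if self.bucket_spec is None:
            return {}
        return {META_KEY: json.dumps(asdict(self.bucket_spec), sort_keys=True)}

    @staticmethod
    def from_key_value_metadata(name: str, metadata: Dict[str, str]) -> "TrackedFrame":
        """Rebuild the invariant from a previously written metadata map."""
        raw = metadata.get(META_KEY)
        if raw is None:
            return TrackedFrame(name)
        fields = json.loads(raw)
        spec = BucketSpec(
            fields["num_buckets"],
            tuple(fields["bucket_columns"]),
            tuple(fields["sort_columns"]),
        )
        return TrackedFrame(name, spec)


if __name__ == "__main__":
    written = TrackedFrame("events").bucket_by(8, "user_id").sort_by("ts")
    metadata = written.to_key_value_metadata()
    reloaded = TrackedFrame.from_key_value_metadata("events", metadata)
    print(reloaded.bucket_spec)                 # invariant survives the round trip
    print(reloaded.repartition(4).bucket_spec)  # invariant cleared after a shuffle
```

Keeping the invariant as an immutable value that each operation either passes through, replaces, or drops makes the "clear on shuffle, re-add on bucketBy/sortBy" rule easy to check per operation. A real implementation would have to classify every DataFrame operation this way.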