https://issues.apache.org/jira/browse/SPARK-19256 is an active umbrella
feature.
But as of Spark 2.2, you can already invoke DataFrame APIs today to bucket
data on write, via tables registered in the Hive metastore.
If you invoke (note that bucketBy/sortBy/saveAsTable are DataFrameWriter
methods, so each chain needs a .write):

val bucketCount = 100
df1
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_1")
df2
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_2")
and then join table_1 to table_2 on "a" and "b", you'll find that the query
plan involves no Sort or Exchange, only a SortMergeJoin.
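To make the join side concrete, here is a minimal sketch of reading the two
bucketed tables back and inspecting the plan. It assumes a running
SparkSession named spark with Hive support enabled (the table names match the
example above; everything else is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Assumed: a SparkSession with Hive support, so saveAsTable-created
// bucketed tables are visible in the default database.
val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .enableHiveSupport()
  .getOrCreate()

val t1 = spark.table("default.table_1")
val t2 = spark.table("default.table_2")

// Join on the same columns the tables were bucketed and sorted by.
val joined = t1.join(t2, Seq("a", "b"))

// The physical plan should show a SortMergeJoin with no Exchange
// (shuffle) and no Sort below it, since bucketing already guarantees
// matching partitioning and sort order on both sides.
joined.explain()
```

If either side were bucketed on different columns, or with a different bucket
count, Spark would have to reintroduce an Exchange to align the partitioning.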
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]