https://issues.apache.org/jira/browse/SPARK-19256 is an active umbrella
feature.
But as of Spark 2.2, you can already invoke DataFrame APIs today to bucket
data on write, via tables registered in the Hive metastore.
If you invoke (note that bucketBy/sortBy/saveAsTable are DataFrameWriter
methods, so each chain needs a .write):

val bucketCount = 100
df1
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_1")
df2
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_2")
and then join table_1 to table_2 on "a" and "b", you'll find that the query
plan involves no Sort or Exchange, only a SortMergeJoin.
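To make the join side concrete, here is a minimal sketch of reading the two
bucketed tables back and inspecting the plan. It assumes a running
SparkSession named spark with Hive support enabled (the table names match the
example above; everything else is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Assumed: a SparkSession with Hive support, so saveAsTable-created
// bucketed tables are visible in the default database.
val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .enableHiveSupport()
  .getOrCreate()

val t1 = spark.table("default.table_1")
val t2 = spark.table("default.table_2")

// Join on the same columns the tables were bucketed and sorted by.
val joined = t1.join(t2, Seq("a", "b"))

// The physical plan should show a SortMergeJoin with no Exchange
// (shuffle) and no Sort below it, since bucketing already guarantees
// matching partitioning and sort order on both sides.
joined.explain()
```

If either side were bucketed on different columns, or with a different bucket
count, Spark would have to reintroduce an Exchange to align the partitioning.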
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]