Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Terry Kim
Is the following what you're trying to do?

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
    val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("x", "y")
    val df2 = (0 until 100).map(i => (i % 7, i % 11)).toDF("x", "y")
    df1.write.format("parquet").bucketBy(8, "x", "y").saveAsTable("t1")
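The preview cuts off above. A minimal sketch of how the example presumably continues, assuming a SparkSession named spark (the table name t2 is illustrative, not from the original message):

    import spark.implicits._  // needed for toDF outside the spark-shell

    // Bucket the second table the same way, then join on both bucket columns.
    df2.write.format("parquet").bucketBy(8, "x", "y").saveAsTable("t2")

    val joined = spark.table("t1").join(spark.table("t2"), Seq("x", "y"))
    joined.explain()  // with both sides bucketed by (x, y) into the same
                      // number of buckets, no Exchange should appear on
                      // either side of the join (sorts may remain)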

Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Patrick Woody
Hey Terry, Thanks for the response! I'm not sure that it ends up working though - the bucketing still seems to require the exchange before the join. Both tables below are saved bucketed by "x":

    *(5) Project [x#29, y#30, z#31, z#37]
    +- *(5) SortMergeJoin [x#29, y#30], [x#35, y#36], Inner
       :-
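The behavior Patrick describes can be reproduced directly. A sketch, assuming a SparkSession named spark (table names a and b are illustrative): in the Spark versions current at the time of this thread, sort-merge join requires hash partitioning on the full join key list, so bucketing on the subset "x" alone does not satisfy a join on ("x", "y") and an Exchange is still planned.

    import org.apache.spark.sql.execution.exchange.Exchange
    import spark.implicits._

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
    val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("x", "y")
    val df2 = (0 until 100).map(i => (i % 7, i % 11)).toDF("x", "y")

    // Bucket by "x" only, but join on both "x" and "y".
    df1.write.format("parquet").bucketBy(8, "x").saveAsTable("a")
    df2.write.format("parquet").bucketBy(8, "x").saveAsTable("b")

    val plan = spark.table("a").join(spark.table("b"), Seq("x", "y"))
      .queryExecution.executedPlan

    // An Exchange node shows up even though both sides share the same
    // bucketing on x.
    println(plan.collect { case e: Exchange => e }.nonEmpty)  // true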

Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Terry Kim
You can use bucketBy to avoid shuffling in your scenario. This test suite has some examples:

https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343

Thanks,
Terry
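For reference, the DataFrameWriter calls those tests exercise look roughly like this (a sketch; the table and column names are placeholders):

    // bucketBy hash-partitions rows into a fixed number of buckets by the
    // given columns; sortBy (optional, only valid together with bucketBy)
    // sorts rows within each bucket.
    df.write
      .format("parquet")
      .bucketBy(8, "x", "y")
      .sortBy("x", "y")
      .saveAsTable("t")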

Using existing distribution for join when subset of keys

2020-05-31 Thread Patrick Woody
Hey all, I have one large table, A, and two medium-sized tables, B & C, that I'm trying to complete a join on efficiently. The result is multiplicative on A join B, so I'd like to avoid shuffling that result. For this example, let's just assume each table has three columns, x, y, z. The below is
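The message is truncated above. A hedged sketch of the setup as described, assuming a SparkSession named spark (the join keys and data shapes are guesses for illustration, not from the original message):

    // Placeholder inputs; in practice A is large while B and C are medium.
    val a = spark.range(1000000L).selectExpr("id % 100 as x", "id % 50 as y", "id as z")
    val b = spark.range(10000L).selectExpr("id % 100 as x", "id % 50 as y", "id as z")
    val c = spark.range(10000L).selectExpr("id % 100 as x", "id % 50 as y", "id as z")

    // A join B is multiplicative, producing a large intermediate result
    // distributed by (x, y) after the sort-merge join.
    val ab = a.join(b, Seq("x", "y"))

    // The question in the thread: can this next join reuse ab's existing
    // (x, y) distribution when it only needs a subset of those keys,
    // instead of shuffling the large result again?
    val abc = ab.join(c, Seq("x"))
    abc.explain()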