[ https://issues.apache.org/jira/browse/SPARK-34966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315591#comment-17315591 ]
Yuming Wang edited comment on SPARK-34966 at 10/24/22 11:37 AM:
----------------------------------------------------------------

https://github.com/apache/spark/blob/69aa727ff495f6698fe9b37e952dfaf36f1dd5eb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L504-L507

These two print the same value:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, Literal, Murmur3Hash, Pmod}
import org.apache.spark.sql.types.{IntegerType, LongType}

println(Pmod(new Murmur3Hash(Seq(Literal(100.toShort))), Literal(10)).eval(EmptyRow))
println(Pmod(new Murmur3Hash(Seq(new Cast(Literal(100.toShort), IntegerType))), Literal(10)).eval(EmptyRow))
{code}
But these two do not:
{code:scala}
println(Pmod(new Murmur3Hash(Seq(Literal(100))), Literal(10)).eval(EmptyRow))
println(Pmod(new Murmur3Hash(Seq(new Cast(Literal(100), LongType))), Literal(10)).eval(EmptyRow))
{code}
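The reason is visible in the dispatch at the hash.scala lines linked above: ShortType and IntegerType both map to hashInt, while LongType maps to hashLong. A minimal sketch against the low-level hasher, assuming Murmur3Hash's default seed of 42:
{code:scala}
import org.apache.spark.unsafe.hash.Murmur3_x86_32

val seed = 42  // Murmur3Hash's default seed

// ShortType and IntegerType are both hashed via hashInt, so widening a
// short to an int cannot change the hash:
println(Murmur3_x86_32.hashInt(100.toShort, seed) == Murmur3_x86_32.hashInt(100, seed))  // true

// LongType is hashed via hashLong, so the same numeric value hashes
// differently once cast to bigint:
println(Murmur3_x86_32.hashInt(100, seed) == Murmur3_x86_32.hashLong(100L, seed))  // false
{code}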
> Avoid shuffle if join key types do not match
> ---------------------------------------------
>
>                 Key: SPARK-34966
>                 URL: https://issues.apache.org/jira/browse/SPARK-34966
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> spark.sql("CREATE TABLE t1 using parquet clustered by (id) into 200 buckets AS SELECT cast(id as bigint) FROM range(1000)")
> spark.sql("CREATE TABLE t2 using parquet clustered by (id) into 200 buckets AS SELECT cast(id as int) FROM range(500)")
> spark.sql("select * from t1 join t2 on (t1.id = t2.id)").explain
> {code}
> Current plan (the cast(id#15 as bigint) on the t2 join key forces an Exchange, even though t2 is bucketed on id):
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
>    :- Sort [id#14L ASC NULLS FIRST], false, 0
>    :  +- Filter isnotnull(id#14L)
>    :     +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 200 out of 200
>    +- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(cast(id#15 as bigint), 200), ENSURE_REQUIREMENTS, [id=#58]
>          +- Filter isnotnull(id#15)
>             +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int>
> {noformat}
> Expected plan (no Exchange; both sides read their buckets directly):
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SortMergeJoin [id#14L], [cast(id#15 as bigint)], Inner
>    :- Sort [id#14L ASC NULLS FIRST], false, 0
>    :  +- Filter isnotnull(id#14L)
>    :     +- FileScan parquet default.t1[id#14L] Batched: true, DataFilters: [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 200 out of 200
>    +- Sort [cast(id#15 as bigint) ASC NULLS FIRST], false, 0
>       +- Filter isnotnull(id#15)
>          +- FileScan parquet default.t2[id#15] Batched: true, DataFilters: [isnotnull(id#15)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int>, SelectedBucketsCount: 200 out of 200
> {noformat}
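A hypothetical verification sketch for the reproduce steps quoted above: count the Exchange nodes in the executed plan (AQE is disabled here so that collect can traverse the plan tree directly).
{code:scala}
import org.apache.spark.sql.execution.exchange.Exchange

spark.sql("set spark.sql.adaptive.enabled=false")
spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")

// Assumes t1 and t2 were created as in the reproduce steps above.
val plan = spark.sql("select * from t1 join t2 on (t1.id = t2.id)")
  .queryExecution.executedPlan

// Prints 1 today (the Exchange under the t2 side); it should print 0
// once the shuffle is avoided.
println(plan.collect { case e: Exchange => e }.size)
{code}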