[ https://issues.apache.org/jira/browse/SPARK-38771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-38771:
------------------------------------

    Assignee:     (was: Apache Spark)

> Introduce adaptive Bloom filter Join to reduce spilling to disk when doing SortMergeJoin
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-38771
>                 URL: https://issues.apache.org/jira/browse/SPARK-38771
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: Special case.png
>
>
> Insert a Bloom filter on one side of the join if that side may spill when sorting and the other side has fewer than 100000000L rows.
> A special case:
> {code:java}
> sql("set spark.sql.autoBroadcastJoinThreshold=10000")
> sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(200000000L)")
> sql("CREATE TABLE t2 using parquet AS SELECT id AS x, id AS y FROM range(8000000)")
> sql("""SELECT a, b, c, x, y FROM t1 JOIN t2 ON t1.a = t2.x where t2.y = 1""").collect()
> {code}
> !Special case.png!
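
The idea in the description can be illustrated outside Spark with a small, self-contained sketch. It is not the implementation proposed in the issue: `BloomJoinSketch`, `LongBloomFilter`, and `prefilter` are hypothetical names, the filter is hand-rolled, and the row counts are scaled down from the special case. It only shows why building a Bloom filter from the small side's join keys (here, the rows of t2 with y = 1) and testing the large side's keys against it leaves far fewer rows to sort.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

// Illustrative only: a hand-rolled Bloom filter, not Spark's implementation.
final class BloomJoinSketch {

    /** Minimal Bloom filter over long keys, using double hashing. */
    static final class LongBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        LongBloomFilter(long expectedItems, double fpp) {
            long n = Math.max(1, expectedItems);
            double ln2 = Math.log(2);
            // Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2.
            long m = (long) Math.ceil(-n * Math.log(fpp) / (ln2 * ln2));
            this.numBits = (int) Math.max(64, Math.min(m, Integer.MAX_VALUE));
            this.numHashes = Math.max(1, (int) Math.round((double) numBits / n * ln2));
            this.bits = new BitSet(numBits);
        }

        // SplitMix64 finalizer, used here as a cheap 64-bit mixing function.
        private static long mix(long z) {
            z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
            z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
            return z ^ (z >>> 31);
        }

        // i-th bit position for a key: (h1 + i * h2) mod numBits.
        private int index(long key, int i) {
            long h1 = mix(key);
            long h2 = mix(h1) | 1L;
            return (int) Math.floorMod(h1 + i * h2, (long) numBits);
        }

        void put(long key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        /** May return true for absent keys, never false for inserted keys. */
        boolean mightContain(long key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Keeps only probe-side keys 0..probeRows-1 that may have a join partner. */
    static List<Long> prefilter(long probeRows, LongBloomFilter filter) {
        List<Long> survivors = new ArrayList<>();
        for (long a = 0; a < probeRows; a++) {
            if (filter.mightContain(a)) {
                survivors.add(a);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // Build side: join keys of t2 after "y = 1" (x = y = id, so only x = 1).
        List<Long> buildKeys = Collections.singletonList(1L);
        LongBloomFilter filter = new LongBloomFilter(buildKeys.size(), 0.01);
        for (long x : buildKeys) {
            filter.put(x);
        }

        // Probe side: keys of t1, scaled down from range(200000000L).
        long probeRows = 2_000_000L;
        List<Long> survivors = prefilter(probeRows, filter);
        System.out.println("probe rows before filter: " + probeRows);
        System.out.println("probe rows left to sort:  " + survivors.size());
    }
}
```

Because a Bloom filter has no false negatives, every t1 row with a real match survives the pre-filter; false positives only mean that a few extra rows reach the sort and are discarded by the join itself.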