[ https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-32290.
---------------------------------
    Fix Version/s:     (was: 3.0.1)
       Resolution: Fixed

Issue resolved by pull request 29104
[https://github.com/apache/spark/pull/29104]

> NotInSubquery SingleColumn Optimize
> -----------------------------------
>
>                 Key: SPARK-32290
>                 URL: https://issues.apache.org/jira/browse/SPARK-32290
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Minor
>             Fix For: 3.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Normally, a NOT IN subquery is planned as a BroadcastNestedLoopJoinExec, which is very time consuming. For example, in a recent TPCH benchmark, Query 16 alone took almost half of the total execution time of the 22 TPCH queries. I therefore propose the following optimization.
> Inside BroadcastNestedLoopJoinExec, we can recognize a single-column NOT IN subquery by matching the join condition against the following pattern:
> {code:java}
> case _@Or(
>     _@EqualTo(leftAttr: AttributeReference, rightAttr: AttributeReference),
>     _@IsNull(
>       _@EqualTo(_: AttributeReference, _: AttributeReference)
>     )
>   )
> {code}
> If the build-side row count is small enough, we can load the build-side data into a hash map, so the M*N comparisons can be reduced to M*log(N).
> I ran a benchmark on 1TB TPCH: before this optimization, Query 16 took around 18 minutes to finish; with the M*log(N) optimization applied, it finishes in about 30 seconds.
> However, this optimization only works for single-column NOT IN subqueries, so I am seeking advice on whether the community wants this change. I will open the pull request first; if community members consider it too hacky, it is fine to simply ignore this request.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
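The `Or(EqualTo(...), IsNull(EqualTo(...)))` condition in the issue encodes SQL's three-valued NOT IN semantics, and the proposed speedup replaces the per-row nested-loop scan with a single hash-structure lookup. The following is a minimal, self-contained sketch of that idea under SQL null semantics; the class and method names (`NotInSketch`, `notIn`) are hypothetical illustrations, not Spark internals:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Arrays;

public class NotInSketch {
    // Sketch: evaluate `probe NOT IN (buildValues)` under SQL three-valued
    // logic. Returns Boolean.TRUE / Boolean.FALSE, or null for SQL NULL.
    // Building the set is O(N) once; each probe is then a hash lookup
    // instead of an O(N) nested-loop scan per probe row.
    static Boolean notIn(Integer probe, List<Integer> buildValues) {
        Set<Integer> buildSet = new HashSet<>();
        boolean buildHasNull = false;
        for (Integer v : buildValues) {
            if (v == null) buildHasNull = true;
            else buildSet.add(v);
        }
        if (probe == null) return null;              // NULL NOT IN (...) -> NULL
        if (buildSet.contains(probe)) return false;  // definite match -> false
        if (buildHasNull) return null;               // no match, NULL present -> NULL
        return true;                                 // no match, no NULLs -> true
    }

    public static void main(String[] args) {
        System.out.println(notIn(1, Arrays.asList(1, 2)));    // false
        System.out.println(notIn(3, Arrays.asList(1, 2)));    // true
        System.out.println(notIn(3, Arrays.asList(1, null))); // null
        System.out.println(notIn(null, Arrays.asList(1, 2))); // null
    }
}
```

The `IsNull(EqualTo(...))` branch of the pattern corresponds to the two `null` outcomes above: a NULL probe, or a non-matching probe when the build side contains a NULL, both make the NOT IN predicate unknown rather than true.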