[ https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-38238: ------------------------------------ Assignee: Apache Spark > Contains Join for Spark SQL > --------------------------- > > Key: SPARK-38238 > URL: https://issues.apache.org/jira/browse/SPARK-38238 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.3.0 > Reporter: Wan Kun > Assignee: Apache Spark > Priority: Major > > Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute > the following string contains query: > {code:sql} > SELECT a.text, b.pattern > FROM fact_table a > JOIN patterns b > ON a.text like concat('%', b.pattern, '%'); > {code} > OR > {code:sql} > SELECT a.text, b.pattern > FROM fact_table a > JOIN patterns b > ON position(b.pattern, a.text) > 0; > {code} > If there are many patterns to match in the left table, the query many execute > for a long time. > Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way > String Matching{*}, and many algorithm trying to improve this matching. One > of the famous algorithm called [*Aho–Corasick > algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm] > The basic idea to optimize this query is to transform all the patterns into a > trie tree and broadcast it. So then each row from the left table only need to > match its content to the trie tree once. > The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))* > M = number of records in the fact table > N = number of records in the patterns table > m = row length of the fact table > n = row length of the patterns table -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org