[ https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wan Kun updated SPARK-38238: ---------------------------- Description: Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute the following string contains query: {code:sql} SELECT a.text, b.pattern FROM fact_table a JOIN patterns b ON a.text like concat('%', b.pattern, '%'); {code} OR {code:sql} SELECT a.text, b.pattern FROM fact_table a JOIN patterns b ON position(b.pattern, a.text) > 0; {code} If there are many patterns to match in the left table, the query many execute for a long time. Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way String Matching{*}, and many algorithm trying to improve this matching. One of the famous algorithm called [*Aho–Corasick algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm] The basic idea to optimize this query is to transform all the patterns into a trie tree and broadcast it. So then each row from the left table only need to match its content to the trie tree once. The query will go from *O(M * N * m * n)* to *O(M * m * max(n))* M = number of records in the fact table N = number of records in the patterns table m = row length of the fact table n = row length of the patterns table sadf was: Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute the following string contains query: {code:sql} SELECT a.text, b.pattern FROM fact_table a JOIN patterns b ON a.text like concat('%', b.pattern, '%'); {code} OR {code:sql} SELECT a.text, b.pattern FROM fact_table a JOIN patterns b ON position(b.pattern, a.text) > 0; {code} If there are many patterns to match in the left table, the query many execute for a long time. Actually this join is called *Multi-Pattern String Matching* or *Multi-Way String Matching*, and many algorithm trying to improve this matching. One of the famous algorithm called [*Aho–Corasick algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm] The basic idea to optimize this query is to transform all the patterns into a trie tree and broadcast it. So then each row from the left table only need to match its content to the trie tree once. > Contains Join for Spark SQL > --------------------------- > > Key: SPARK-38238 > URL: https://issues.apache.org/jira/browse/SPARK-38238 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.3.0 > Reporter: Wan Kun > Priority: Major > > Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute > the following string contains query: > {code:sql} > SELECT a.text, b.pattern > FROM fact_table a > JOIN patterns b > ON a.text like concat('%', b.pattern, '%'); > {code} > OR > {code:sql} > SELECT a.text, b.pattern > FROM fact_table a > JOIN patterns b > ON position(b.pattern, a.text) > 0; > {code} > If there are many patterns to match in the left table, the query many execute > for a long time. > Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way > String Matching{*}, and many algorithm trying to improve this matching. One > of the famous algorithm called [*Aho–Corasick > algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm] > The basic idea to optimize this query is to transform all the patterns into a > trie tree and broadcast it. So then each row from the left table only need to > match its content to the trie tree once. > The query will go from *O(M * N * m * n)* to *O(M * m * max(n))* > M = number of records in the fact table > N = number of records in the patterns table > m = row length of the fact table > n = row length of the patterns table > sadf -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org