[jira] [Assigned] (SPARK-38238) Contains Join for Spark SQL

Apache Spark (Jira) Thu, 17 Feb 2022 01:24:06 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Apache Spark reassigned SPARK-38238:
------------------------------------

    Assignee: Apache Spark

> Contains Join for Spark SQL
> ---------------------------
>
>                 Key: SPARK-38238
>                 URL: https://issues.apache.org/jira/browse/SPARK-38238
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-38238) Contains Join for Spark SQL

Reply via email to