[ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-38238:
----------------------------
    Description: 
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}
If there are many patterns to match in the left table, the query many execute 
for a long time.

Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
String Matching{*}, and many algorithm trying to improve this matching. One of 
the famous algorithm called [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]

The basic idea to optimize this query is to transform all the patterns into a 
trie tree and broadcast it. So then each row from the left table only need to 
match its content to the trie tree once.

The query will go from *O(M * N * m * n)* to *O(M * m * max(n))*
M = number of records in the fact table
N = number of records in the patterns table
m = row length of the fact table
n = row length of the patterns table

sadf

  was:
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}

If there are many patterns to match in the left table, the query many execute 
for a long time.

Actually this join is called *Multi-Pattern String Matching* or *Multi-Way 
String Matching*, and many algorithm trying to improve this matching. One of  
the famous algorithm called [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]

The basic idea to optimize this query is to transform all the patterns into a 
trie tree and broadcast it. So then each row from the left table only need to 
match its content to the trie tree once. 



> Contains Join for Spark SQL
> ---------------------------
>
>                 Key: SPARK-38238
>                 URL: https://issues.apache.org/jira/browse/SPARK-38238
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max(n))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table
> sadf



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to