[ 
https://issues.apache.org/jira/browse/CALCITE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian Hyde updated CALCITE-5914:
---------------------------------
    Description: 
Cache compiled regular expressions (and other amortized work) in SQL function 
runtime. Compiling a regular expression to a pattern is expensive (compared to 
the cost of matching, given an existing pattern) and therefore caching the 
compiled form will yield performance benefits if the regular expression is 
constant or has a small number of values.

Consider the following query:
{code:java}
SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
FROM emp
{code}
The first regular expression, '{{A.\*}}', is constant and can be compiled at 
prepare time or at the start of execution; the second regular expression, {{job 
|| '.\*'}}, might vary from one row to the next. However if the {{job}} column 
has a small number of values it still might be beneficial to cache the compiled 
regular expression.

If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to 
{{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the 
constant and non-constant cases.

The cache needs to:
 * be thread-safe (in case queries are executing using multiple threads),
 * return thread-safe objects (as is {{{}Pattern{}}}),
 * have bounded space (so that a query doesn't blow memory with 1 million 
distinct regular expressions),
 * disposed after the query has terminated,
 * (ideally) share with regexes of the same language in the same query,
 * not conflict with regexes of different languages in the same query.

One possible implementation is to add an {{interface FunctionState}}, with 
subclasses including {{class RegexpCache}}, and if argument 1 of a function is 
a subclass of {{FunctionState}} the compiler would initialize the state in the 
generated code. The function can rely on the state argument being initialized, 
and being the same object from one call to the next. Example:
{code:java}
interface FunctionState {
}

class RegexpCache implements FunctionState {
  final Cache<String, Pattern> cache = ...;
}
{code}
This change should install the cache for all applicable functions, including 
LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), 
REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, 
JSON_REPLACE, PARSE_TIMESTAMP.

It can also be used for functions that have mutable state, e.g. {{RANDOM}} with 
and without a seed.

  was:
Cache compiled regular expressions (and other amortized work) in SQL function 
runtime. Compiling a regular expression to a pattern is expensive (compared to 
the cost of matching, given an existing pattern) and therefore caching the 
compiled form will yield performance benefits if the regular expression is 
constant or has a small number of values.

Consider the following query:
{code:java}
SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
FROM emp
{code}
The first regular expression, '{{A.\*}}', is constant and can be compiled at 
prepare time or at the start of execution; the second regular expression, {{job 
|| '.\*'}}, might vary from one row to the next. However if the {{job}} column 
has a small number of values it still might be beneficial to cache the compiled 
regular expression.

If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to 
{{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the 
constant and non-constant cases.

The cache needs to:
 * be thread-safe (in case queries are executing using multiple threads),
 * return thread-safe objects (as is {{{}Pattern{}}}),
 * have bounded space (so that a query doesn't blow memory with 1 million 
distinct regular expressions),
 * disposed after the query has terminated,
 * (ideally) share with regexes of the same language in the same query,
 * not conflict with regexes of different languages in the same query.

One possible implementation is to add an {{interface FunctionState}}, with 
subclasses including {{class RegexpCache}}, and if argument 1 of a function is 
a subclass of {{FunctionState}} the compiler would initialize the state in the 
generated code. The function can rely on the state argument being initialized, 
and being the same object from one call to the next. Example:
{code:java}
interface FunctionState {
}

class RegexpCache implements FunctionState {
  final Cache<String, Pattern> cache = ...;
}
{code}
This change should install the cache for all applicable functions, including 
LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), 
REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, 
JSON_REPLACE, PARSE_TIMESTAMP.


> Cache compiled regular expressions in SQL function runtime
> ----------------------------------------------------------
>
>                 Key: CALCITE-5914
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5914
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Priority: Major
>
> Cache compiled regular expressions (and other amortized work) in SQL function 
> runtime. Compiling a regular expression to a pattern is expensive (compared 
> to the cost of matching, given an existing pattern) and therefore caching the 
> compiled form will yield performance benefits if the regular expression is 
> constant or has a small number of values.
> Consider the following query:
> {code:java}
> SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
> FROM emp
> {code}
> The first regular expression, '{{A.\*}}', is constant and can be compiled at 
> prepare time or at the start of execution; the second regular expression, 
> {{job || '.\*'}}, might vary from one row to the next. However if the {{job}} 
> column has a small number of values it still might be beneficial to cache the 
> compiled regular expression.
> If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to 
> {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the 
> constant and non-constant cases.
> The cache needs to:
>  * be thread-safe (in case queries are executing using multiple threads),
>  * return thread-safe objects (as is {{{}Pattern{}}}),
>  * have bounded space (so that a query doesn't blow memory with 1 million 
> distinct regular expressions),
>  * disposed after the query has terminated,
>  * (ideally) share with regexes of the same language in the same query,
>  * not conflict with regexes of different languages in the same query.
> One possible implementation is to add an {{interface FunctionState}}, with 
> subclasses including {{class RegexpCache}}, and if argument 1 of a function 
> is a subclass of {{FunctionState}} the compiler would initialize the state in 
> the generated code. The function can rely on the state argument being 
> initialized, and being the same object from one call to the next. Example:
> {code:java}
> interface FunctionState {
> }
> class RegexpCache implements FunctionState {
>   final Cache<String, Pattern> cache = ...;
> }
> {code}
> This change should install the cache for all applicable functions, including 
> LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), 
> REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, 
> JSON_REPLACE, PARSE_TIMESTAMP.
> It can also be used for functions that have mutable state, e.g. {{RANDOM}} 
> with and without a seed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to