[ https://issues.apache.org/jira/browse/CALCITE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julian Hyde updated CALCITE-5914: --------------------------------- Description: Cache compiled regular expressions (and other amortized work) in SQL function runtime. Consider the following query: {code:java} SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*') FROM emp {code} The first regular expression, '{{{}A.*{}}}', is constant and can be compiled at prepare time or at the start of execution; the second regular expression '{{{}job || '.*'{}}}' might vary from one row to the next. However if the {{job}} column has a small number of values it still might be beneficial to cache the compiled regular expression. If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the constant and non-constant cases. The cache needs to: * be thread-safe (in case queries are executing using multiple threads), * return thread-safe objects (as is {{{}Pattern{}}}), * have bounded space (so that a query doesn't blow memory with 1 million distinct regular expressions), * disposed after the query has terminated, * (ideally) share with regexes of the same language in the same query, * not conflict with regexes of different languages in the same query. One possibility is to add an {{{}interface FunctionState{}}}, with subclasses including {{{}class RegexpCache{}}}, and if argument 1 of a function is a subclass of {{FunctionState}} the compiler would initialize the state in the generated code. The function can rely on the state argument being initialized, and being the same object from one call to the next. Example: {code:java} interface FunctionState { } class RegexpCache implements FunctionState { final Cache<String, Pattern> cache = ...; } {code} This change should install the cache for all applicable functions, including LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, JSON_REPLACE, PARSE_TIMESTAMP. was: Cache compiled regular expressions (and other amortized work) in SQL function runtime. Consider the following query: {code} SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*') FROM emp {code} The first regular expression, '{{A.\*}}', is constant and can be compiled at prepare time or at the start of execution; the second regular expression '{{job || '.*'}}' might vary from one row to the next. However if the {{job}} column has a small number of values it still might be beneficial to cache the compiled regular expression. If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to {{java.util.regex.Pattern}}) then it would achieve benefits in both the constant and non-constant cases. The cache needs to: * be thread-safe (in case queries are executing using multiple threads), * return thread-safe objects (as is {{Pattern}}), * have bounded space (so that a query doesn't blow memory with 1 million distinct regular expressions), * disposed after the query has terminated, * (ideally) share with regexes of the same language in the same query, * not conflict with regexes of different languages in the same query. One possibility is to add an {{interface FunctionState}}, with subclasses including {{class RegexpCache}}, and if argument 1 of a function is a subclass of {{FunctionState}} the compiler would initialize the state in the generated code. The function can rely on the state argument being initialized, and being the same object from one call to the next. Example: {code} interface FunctionState { } class RegexpCache implements FunctionState { final Cache<String, Pattern> cache = ...; } {code} This change should install the cache for all applicable functions, including LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_CONTAINS, REGEXP_REPLACE, other REGEXP_ functions, PARSE_URL. > Cache compiled regular expressions in SQL function runtime > ---------------------------------------------------------- > > Key: CALCITE-5914 > URL: https://issues.apache.org/jira/browse/CALCITE-5914 > Project: Calcite > Issue Type: Bug > Reporter: Julian Hyde > Priority: Major > > Cache compiled regular expressions (and other amortized work) in SQL function > runtime. > Consider the following query: > {code:java} > SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*') > FROM emp > {code} > The first regular expression, '{{{}A.*{}}}', is constant and can be compiled > at prepare time or at the start of execution; the second regular expression > '{{{}job || '.*'{}}}' might vary from one row to the next. However if the > {{job}} column has a small number of values it still might be beneficial to > cache the compiled regular expression. > If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to > {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the > constant and non-constant cases. > The cache needs to: > * be thread-safe (in case queries are executing using multiple threads), > * return thread-safe objects (as is {{{}Pattern{}}}), > * have bounded space (so that a query doesn't blow memory with 1 million > distinct regular expressions), > * disposed after the query has terminated, > * (ideally) share with regexes of the same language in the same query, > * not conflict with regexes of different languages in the same query. > One possibility is to add an {{{}interface FunctionState{}}}, with subclasses > including {{{}class RegexpCache{}}}, and if argument 1 of a function is a > subclass of {{FunctionState}} the compiler would initialize the state in the > generated code. The function can rely on the state argument being > initialized, and being the same object from one call to the next. Example: > {code:java} > interface FunctionState { > } > class RegexpCache implements FunctionState { > final Cache<String, Pattern> cache = ...; > } > {code} > This change should install the cache for all applicable functions, including > LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), > REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, > JSON_REPLACE, PARSE_TIMESTAMP. -- This message was sent by Atlassian Jira (v8.20.10#820010)