[ https://issues.apache.org/jira/browse/CALCITE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated CALCITE-5914: ------------------------------------ Labels: pull-request-available (was: ) > Cache compiled regular expressions in SQL function runtime > ---------------------------------------------------------- > > Key: CALCITE-5914 > URL: https://issues.apache.org/jira/browse/CALCITE-5914 > Project: Calcite > Issue Type: Bug > Reporter: Julian Hyde > Priority: Major > Labels: pull-request-available > > Cache compiled regular expressions (and other amortized work) in SQL function > runtime. Compiling a regular expression to a pattern is expensive (compared > to the cost of matching, given an existing pattern) and therefore caching the > compiled form will yield performance benefits if the regular expression is > constant or has a small number of values. > Consider the following query: > {code:java} > SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*') > FROM emp > {code} > The first regular expression, '{{A.\*}}', is constant and can be compiled at > prepare time or at the start of execution; the second regular expression, > {{job || '.\*'}}, might vary from one row to the next. However if the {{job}} > column has a small number of values it still might be beneficial to cache the > compiled regular expression. > If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to > {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the > constant and non-constant cases. > The cache needs to: > * be thread-safe (in case queries are executing using multiple threads), > * return thread-safe objects (as is {{{}Pattern{}}}), > * have bounded space (so that a query doesn't blow memory with 1 million > distinct regular expressions), > * disposed after the query has terminated, > * (ideally) share with regexes of the same language in the same query, > * not conflict with regexes of different languages in the same query. > One possible implementation is to add an {{interface FunctionState}}, with > subclasses including {{class RegexpCache}}, and if argument 1 of a function > is a subclass of {{FunctionState}} the compiler would initialize the state in > the generated code. The function can rely on the state argument being > initialized, and being the same object from one call to the next. Example: > {code:java} > interface FunctionState { > } > class RegexpCache implements FunctionState { > final Cache<String, Pattern> cache = ...; > } > {code} > This change should install the cache for all applicable functions, including > LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), > REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, > JSON_REPLACE, PARSE_TIMESTAMP. > It can also be used for functions that have mutable state, e.g. {{RANDOM}} > with and without a seed. -- This message was sent by Atlassian Jira (v8.20.10#820010)