Hi Henson,

> Hi Ishii-san,
> 
> I've been working on the pattern grammar and AST for row pattern
> recognition.
> The attached patch extends the PATTERN clause to support more regex-like
> syntax.
> 
> The main additions are:
> 
> - Alternation: (A | B)
> - Grouping with quantifiers: (A B)+, (A | B)*
> - Full quantifier support: +, *, ?, {n}, {n,}, {n,m}
> - Reluctant quantifiers: +?, *?, ?? (parsed but not executed yet)

Excellent!

> I've introduced RPRPatternNode to represent the pattern as a Thompson-style
> AST with four node types: VAR, SEQ, ALT, and GROUP. Each node stores min/max
> bounds for quantifiers and a reluctant flag. This should make the eventual
> NFA implementation more straightforward.
> 
> The parser also does some basic pattern optimizations. For example:
>   - (A (B C)) gets flattened to (A B C)
>   - ((A)) unwraps to just A
>   - (A{2}){3} becomes A{6}
>   - (A | B | A) simplifies to (A | B)
>   - A A A merges into A{3,3}
> 
> I also updated ruleutils.c to deparse patterns properly.
> 
> 
> Current state:
> 
> The executor still uses regex matching, which works but has some
> limitations.
> Complex patterns with lots of alternations can be slow.
>
> The 26-variable limit is still there since we're mapping to [a-z] for the
> regex.
> 
> Reluctant quantifiers (+?, *?, ??) are parsed but will raise an error if you
> try to use them. They'll need proper NFA support to work correctly.
> 
> Some optimization code was removed in the process, but it should work better
> once NFA matching is in place.

Yeah. Now the rpr regression takes more than 7 seconds.  (with current
v37 patch it takes 86ms). I expect NFA matching works better.

> Next steps:
> 
> The plan is to replace regex matching with NFA-based execution. The AST
> structure is already set up for that.

Sounds promising.

> The first step is converting the Thompson AST into a flat representation -
> flattening pattern elements and building actual NFA states with transitions,
> then replacing the regex engine with direct NFA matching. This should
> handle all
> the pattern syntax properly, including reluctant quantifiers.
> 
> Once that's working, the next step is handling multiple contexts during
> matching -
> tracking which states are active and absorbing context as we go through
> rows.
> This is essential for proper pattern matching behavior.

Ok.

> After that, I'll implement PATH and CLASSIFIER functions, which depend on
> having
> the match context available.

What is PATH function? It's not in R010 or R020 as far as I know.

> Then there's room for various optimizations - PATH space optimization,
> incremental aggregate computation over matched rows, better state merging,
> etc.

Sounds good.

> Longer-term, the goal is to get to MATCH_RECOGNIZE in the FROM clause for
> full SQL standard compliance.

What about MEASURES clause? Without it, many things in the standard
cannot be implemented.

> Let me know if you have any concerns or suggestions about the approach.

Here are some comments on your patches.

- It is based on my v36 patches. But the latest one is v37 patch.  It
  would be nice if you create patches based on the my latest patches.

- You are doing some optimization (like (A (B C)) gets flattened to (A
  B C) etc.) in the parse analysis phase. I think doing that kind of
  optimization should be done in the optimizer/planner. Because with
  your patch a deparsed query (as shown by pg_get_viewdef()) looks
  different from what user inputs.

- If you add files to src/backend/parser, it should be noted in
  src/backend/parser/README.
  
- It would be noce to update typedefs patch if you add new typedefs.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


Reply via email to