Re: Row pattern recognition

Henson Choi Sat, 28 Feb 2026 09:00:26 -0800

Hi hackers,

I'd like to propose an extension to the context absorption optimization
for Row Pattern Recognition in the window clause: PREFIX pattern
absorption. This refines the "anchored pattern absorption" concept I
posted earlier [1], renaming it to "PREFIX pattern absorption" per
Ishii-san's feedback [2], and formalizing the absorption conditions
with count-dominance analysis.


It builds on the existing absorption framework to handle patterns where
a fixed-length prefix precedes the unbounded greedy repetition, using a
"shadow path" design.

[1]
https://www.postgresql.org/message-id/CAAAe_zAEg7sVM=wdwxmye-odgmqyxsvi5zzwgye6supsjdm...@mail.gmail.com
[2]
https://www.postgresql.org/message-id/[email protected]


1. Problem Definition
=====================

1.1 Current Absorption Scope

Context absorption currently applies to patterns with an unbounded-start
structure -- where an unbounded greedy repetition appears at the very
beginning of the pattern.

Example: A+ B -- the first element A+ is an unbounded repetition, so an
older context always has a repeat count greater than or equal to any
newer context, making immediate absorption possible.

1.2 The PREFIX Pattern Problem

For patterns like START SECOND A+ B+, where a finite-length prefix
(PREFIX) precedes the unbounded repetition (A+), absorption does not
currently work.

Why:
- A new context starts at the START element, so it is not at the same
  pattern position as an older context in the A+ region.
- The allStatesAbsorbable condition is not satisfied, so absorption
  never fires.
- Contexts accumulate while traversing the PREFIX, causing O(n^2)
  regression.


2. Relationship to the Existing Framework
==========================================

2.1 Position in the Three-Phase Execution Model

The Match -> Absorb -> Advance model is preserved as-is. PREFIX
absorption is an extension of the Absorb phase's eligibility criteria,
not a new phase.

2.2 Prerequisites

The same four prerequisites as existing absorption apply:

1. Greedy quantifier -- A+ must be greedy.
2. SKIP TO PAST LAST ROW -- prior matches suppress subsequent start
   points.
3. UNBOUNDED FOLLOWING -- all contexts see the same future rows.
4. Match-start-independent DEFINE predicates -- including predicates of
   PREFIX elements.

2.3 Extension of the Dominance Order

In existing absorption, the dominance condition is:

    C_old and C_new are in the absorbable region of the same element,
    and C_old.counts >= C_new.counts.

In PREFIX patterns this condition cannot be directly satisfied: C_new is
in the PREFIX region while C_old is in the BODY region. We therefore
need a mechanism to track what C_new's state *would be* if it had
already reached the BODY region.


3. Design: Shadow Path
=======================

3.1 Core Idea

Each context logically executes two parallel paths:

- Original path: executes the full pattern in order (PREFIX -> BODY).
  This is the path that produces the actual match result and evaluates
  PREFIX predicates.

- Shadow path: skips the PREFIX and starts directly in the BODY region
  (A+).
  - Used solely for absorption eligibility decisions.
  - Does not produce match results.
  - Actually evaluates the BODY element's DEFINE predicates to track
    counts accurately.
  - The shadow is placed at the BODY start state upon creation and
    immediately advances, so its initial count is 1.
  - Not created if there is no preceding context (nothing to be
    absorbed into).
  - Discarded when the preceding context dies.
  - Discarded when the shadow leaves the absorbable region (BODY->TAIL
    transition or match failure).

3.2 Absorption Conditions

For C_old to absorb C_new, all of the following must hold:

1. C_old is past the PREFIX and in the absorbable region (BODY).
   - hasAbsorbableState = true (existing flag)

2. C_new was created at or after the row where C_old entered the
   absorbable region.
   - "Same row" means: in the Match phase C_old enters BODY and C_new
     is created; the subsequent Absorb phase of that same row can then
     perform the absorption check.

3. C_new's shadow path is alive in the absorbable region.
   - allStatesAbsorbable = true (evaluated on the shadow)

4. C_old.counts[depth] >= C_new.shadow.counts[depth]
   - Same count-dominance condition as existing absorption (equality
     counts as dominance).
   - The shadow's count increases as it processes rows, so the
     comparison uses actual counts at absorption time.

Condition 2 is the key insight. Only contexts created after C_old has
passed through the PREFIX into the absorbable region are eligible for
absorption. Contexts created while C_old is still in the PREFIX cannot
be absorbed.

3.3 Common Fate

Soundness argument for absorption:

    C_old's original path and C_new's shadow path are at the same BODY
    element (A+). By match-start independence, the DEFINE predicates of
    both paths produce identical results on the same row. If C_new's
    shadow can match at some future row r, then C_old can also complete
    a match at row r (it has matched at least as many times, sees the
    same future rows, and DEFINE evaluations are identical). Under SKIP
    TO PAST LAST ROW, C_old's match includes C_new's start point, so
    removing C_new does not lose any reported match.

3.4 Complexity

With PREFIX absorption, the maximum number of simultaneously active
contexts is bounded by PREFIX_length + 1. Every C_new created after
C_old enters BODY is absorbed immediately, so the only contexts that can
coexist are those still traversing the PREFIX plus the single C_old in
BODY. Since PREFIX length is a per-pattern constant, overall execution
complexity is O(n).

3.5 Removal of Non-Absorbable Contexts

Contexts that do not satisfy condition 2 -- i.e., those created before
C_old entered BODY -- cannot be removed by PREFIX absorption.

These contexts are removed by the existing SKIP TO PAST LAST ROW
mechanism, which is separate from PREFIX absorption. When C_old
successfully matches, all subsequent contexts whose start points fall
within C_old's match range are removed as overlaps. Since contexts
created during the PREFIX region have start points within C_old's match
range, they are automatically removed upon C_old's match completion.

The number of such contexts is at most PREFIX_length, which does not
affect the O(n) complexity (see Section 3.4).


4. Compile-Time Analysis
=========================

4.1 Pattern Structure Recognition

At compile time, traverse the pattern elements from the start until the
first unbounded greedy repetition is found. All preceding fixed-length
elements form the PREFIX; the unbounded element is the BODY; everything
after it is the TAIL. If the first element is already unbounded, the
pattern has no PREFIX and existing absorption applies directly.

    Pattern: E1 E2 ... Ek  A+          B+ ...
             \__________/  \__/         \___/
              PREFIX        BODY         TAIL
                           (absorbable)  (non-absorbable)


5. Worked Example
==================

Pattern: START SECOND A+ B+
PREFIX: START, SECOND (length 2)
BODY: A+ (absorbable)
TAIL: B+ (non-absorbable)

Data: r0-r3 satisfy A only, r4-r5 satisfy B only, r6 satisfies neither
A nor B.

Row  C1(orig)   C2(orig)   C2(shadow)  C3(orig)   C3(shadow)  Active
r0   START       -           -           -           -          1
r1   SECOND      START       A(cnt=1)    -           -          2
r2   A(cnt=1)    SECOND      A(cnt=2)    START       A(cnt=1)   3
     ^ C1 reaches BODY
     -> C3: shadow A(cnt=1), created at C1's BODY entry
        -> absorbed (1 >= 1)
     -> C2: created before C1's BODY entry
        -> not absorbable                                       2
r3   A(cnt=2)    A(cnt=1)    A(cnt=3)    -           -          2
r4   B(cnt=1)    B(cnt=1)    (discard)   -           -          2
     ^ TAIL      ^ TAIL      ^ shadow A+->B+ -> TAIL -> discard
r5   B(cnt=2)    B(cnt=2)    -           -           -          2
r6   neither A nor B -> B+ ends -> C1 match (r0-r5)             0
     C2 removed by SKIP TO PAST LAST ROW overlap
     (C2 start r1 falls within C1 match range r0-r5)

Max concurrent contexts: 3 = PREFIX_length(2) + 1

Key observations:
- C3 is immediately removed at r2 by PREFIX absorption (shadow-path-
  based).
- C2's shadow is discarded at r4 when A+->B+ transition moves it into
  TAIL.
- C2 is not absorbable (condition 2 not met), survives until r6, then
  removed by SKIP TO PAST LAST ROW overlap when C1 matches
  (Section 3.5).

This is a design proposal at this stage. Thoughts and feedback welcome.

Best regards,
Henson

Re: Row pattern recognition

Reply via email to