Hi hackers, I'd like to propose an extension to the context absorption optimization for Row Pattern Recognition in the window clause: PREFIX pattern absorption. This refines the "anchored pattern absorption" concept I posted earlier [1], renaming it to "PREFIX pattern absorption" per Ishii-san's feedback [2], and formalizing the absorption conditions with count-dominance analysis.
It builds on the existing absorption framework to handle patterns where a fixed-length prefix precedes the unbounded greedy repetition, using a "shadow path" design. [1] https://www.postgresql.org/message-id/CAAAe_zAEg7sVM=wdwxmye-odgmqyxsvi5zzwgye6supsjdm...@mail.gmail.com [2] https://www.postgresql.org/message-id/[email protected] 1. Problem Definition ===================== 1.1 Current Absorption Scope Context absorption currently applies to patterns with an unbounded-start structure -- where an unbounded greedy repetition appears at the very beginning of the pattern. Example: A+ B -- the first element A+ is an unbounded repetition, so an older context always has a repeat count greater than or equal to any newer context, making immediate absorption possible. 1.2 The PREFIX Pattern Problem For patterns like START SECOND A+ B+, where a finite-length prefix (PREFIX) precedes the unbounded repetition (A+), absorption does not currently work. Why: - A new context starts at the START element, so it is not at the same pattern position as an older context in the A+ region. - The allStatesAbsorbable condition is not satisfied, so absorption never fires. - Contexts accumulate while traversing the PREFIX, causing O(n^2) regression. 2. Relationship to the Existing Framework ========================================== 2.1 Position in the Three-Phase Execution Model The Match -> Absorb -> Advance model is preserved as-is. PREFIX absorption is an extension of the Absorb phase's eligibility criteria, not a new phase. 2.2 Prerequisites The same four prerequisites as existing absorption apply: 1. Greedy quantifier -- A+ must be greedy. 2. SKIP TO PAST LAST ROW -- prior matches suppress subsequent start points. 3. UNBOUNDED FOLLOWING -- all contexts see the same future rows. 4. Match-start-independent DEFINE predicates -- including predicates of PREFIX elements. 2.3 Extension of the Dominance Order In existing absorption, the dominance condition is: C_old and C_new are in the absorbable region of the same element, and C_old.counts >= C_new.counts. In PREFIX patterns this condition cannot be directly satisfied: C_new is in the PREFIX region while C_old is in the BODY region. We therefore need a mechanism to track what C_new's state *would be* if it had already reached the BODY region. 3. Design: Shadow Path ======================= 3.1 Core Idea Each context logically executes two parallel paths: - Original path: executes the full pattern in order (PREFIX -> BODY). This is the path that produces the actual match result and evaluates PREFIX predicates. - Shadow path: skips the PREFIX and starts directly in the BODY region (A+). - Used solely for absorption eligibility decisions. - Does not produce match results. - Actually evaluates the BODY element's DEFINE predicates to track counts accurately. - The shadow is placed at the BODY start state upon creation and immediately advances, so its initial count is 1. - Not created if there is no preceding context (nothing to be absorbed into). - Discarded when the preceding context dies. - Discarded when the shadow leaves the absorbable region (BODY->TAIL transition or match failure). 3.2 Absorption Conditions For C_old to absorb C_new, all of the following must hold: 1. C_old is past the PREFIX and in the absorbable region (BODY). - hasAbsorbableState = true (existing flag) 2. C_new was created at or after the row where C_old entered the absorbable region. - "Same row" means: in the Match phase C_old enters BODY and C_new is created; the subsequent Absorb phase of that same row can then perform the absorption check. 3. C_new's shadow path is alive in the absorbable region. - allStatesAbsorbable = true (evaluated on the shadow) 4. C_old.counts[depth] >= C_new.shadow.counts[depth] - Same count-dominance condition as existing absorption (equality counts as dominance). - The shadow's count increases as it processes rows, so the comparison uses actual counts at absorption time. Condition 2 is the key insight. Only contexts created after C_old has passed through the PREFIX into the absorbable region are eligible for absorption. Contexts created while C_old is still in the PREFIX cannot be absorbed. 3.3 Common Fate Soundness argument for absorption: C_old's original path and C_new's shadow path are at the same BODY element (A+). By match-start independence, the DEFINE predicates of both paths produce identical results on the same row. If C_new's shadow can match at some future row r, then C_old can also complete a match at row r (it has matched at least as many times, sees the same future rows, and DEFINE evaluations are identical). Under SKIP TO PAST LAST ROW, C_old's match includes C_new's start point, so removing C_new does not lose any reported match. 3.4 Complexity With PREFIX absorption, the maximum number of simultaneously active contexts is bounded by PREFIX_length + 1. Every C_new created after C_old enters BODY is absorbed immediately, so the only contexts that can coexist are those still traversing the PREFIX plus the single C_old in BODY. Since PREFIX length is a per-pattern constant, overall execution complexity is O(n). 3.5 Removal of Non-Absorbable Contexts Contexts that do not satisfy condition 2 -- i.e., those created before C_old entered BODY -- cannot be removed by PREFIX absorption. These contexts are removed by the existing SKIP TO PAST LAST ROW mechanism, which is separate from PREFIX absorption. When C_old successfully matches, all subsequent contexts whose start points fall within C_old's match range are removed as overlaps. Since contexts created during the PREFIX region have start points within C_old's match range, they are automatically removed upon C_old's match completion. The number of such contexts is at most PREFIX_length, which does not affect the O(n) complexity (see Section 3.4). 4. Compile-Time Analysis ========================= 4.1 Pattern Structure Recognition At compile time, traverse the pattern elements from the start until the first unbounded greedy repetition is found. All preceding fixed-length elements form the PREFIX; the unbounded element is the BODY; everything after it is the TAIL. If the first element is already unbounded, the pattern has no PREFIX and existing absorption applies directly. Pattern: E1 E2 ... Ek A+ B+ ... \__________/ \__/ \___/ PREFIX BODY TAIL (absorbable) (non-absorbable) 5. Worked Example ================== Pattern: START SECOND A+ B+ PREFIX: START, SECOND (length 2) BODY: A+ (absorbable) TAIL: B+ (non-absorbable) Data: r0-r3 satisfy A only, r4-r5 satisfy B only, r6 satisfies neither A nor B. Row C1(orig) C2(orig) C2(shadow) C3(orig) C3(shadow) Active r0 START - - - - 1 r1 SECOND START A(cnt=1) - - 2 r2 A(cnt=1) SECOND A(cnt=2) START A(cnt=1) 3 ^ C1 reaches BODY -> C3: shadow A(cnt=1), created at C1's BODY entry -> absorbed (1 >= 1) -> C2: created before C1's BODY entry -> not absorbable 2 r3 A(cnt=2) A(cnt=1) A(cnt=3) - - 2 r4 B(cnt=1) B(cnt=1) (discard) - - 2 ^ TAIL ^ TAIL ^ shadow A+->B+ -> TAIL -> discard r5 B(cnt=2) B(cnt=2) - - - 2 r6 neither A nor B -> B+ ends -> C1 match (r0-r5) 0 C2 removed by SKIP TO PAST LAST ROW overlap (C2 start r1 falls within C1 match range r0-r5) Max concurrent contexts: 3 = PREFIX_length(2) + 1 Key observations: - C3 is immediately removed at r2 by PREFIX absorption (shadow-path- based). - C2's shadow is discarded at r4 when A+->B+ transition moves it into TAIL. - C2 is not absorbable (condition 2 not met), survives until r6, then removed by SKIP TO PAST LAST ROW overlap when C1 matches (Section 3.5). This is a design proposal at this stage. Thoughts and feedback welcome. Best regards, Henson
