Re: Row pattern recognition
I gave a talk on RPR in PGConf.dev 2024. https://www.pgevents.ca/events/pgconfdev2024/schedule/session/114-implementing-row-pattern-recognition/ (Slides are available from the link). Vik Faring and Jacob Champion were one of the audiences and we had a small discussion after the talk. We continued the discussion off list on how to move forward the RPR implementation project. One of the ideas is, to summarize what are in the patch and what are not from the SQL standard specification's point of view. This should help us to reach the consensus regarding "minimum viable" feature set if we want to bring the patch in upcoming PostgreSQL v18. Here is the first cut of the document. Comments/feedback are welcome. - This memo describes the current status of implementation of SQL/RPR (Row Pattern Recognition), as of June 13, 2024 (the latest patch is v20). - RPR in FROM clause and WINDOW clause The SQL standard defines two features regarding SQL/RPR - R010 (RPR in FROM clause) and R020 (RPR in WINDOW clause). Only R020 is implemented. From now on, we discuss on R020. - Overview of R020 syntax WINDOW window_name AS ( [ PARTITION BY ... ] [ ORDER BY... ] [ MEASURES ... ] ROWS BETWEEN CURRENT ROW AND ... [ AFTER MATCH SKIP ... ] [ INITIAL|SEEK ] PATTERN (...) [ SUBSET ... ] DEFINE ... ) -- PARTITION BY and ORDER BY are not specific to RPR and has been already there in current PostgreSQL. -- What are (partially) implemented: AFTER MATCH SKIP INITIAL|SEEK PATTERN DEFINE -- What are not implemented at all: MEASURES SUBSET Followings are detailed status of the each clause. - AFTER MATCH SKIP -- Implemented: AFTER MATCH SKIP TO NEXT ROW AFTER MATCH SKIP PAST LAST ROW -- Not implemented: AFTER MATCH SKIP TO FIRST|LAST pattern_variable - INITIAL|SEEK --Implemented: INITIAL -- Not implemented: SEEK - DEFINE -- Partially implemented row pattern navigation operations are PREV and NEXT. FIRST and LAST are not implemented. -- The standard says PREV and NEXT accepts optional argument "offset" but it's not implemented. -- The standard says the row pattern navigation operations can be nested but it's not implemented. -- CLLASSIFIER, use of aggregate functions and subqueries in DEFINE clause are not implemented. - PATTERN -- Followings are implemented: +: 1 or more rows *: 0 or more rows -- Followings are not implemented: ?: 0 or 1 row A | B: OR condition (A B): grouping {n}: n rows {n,}: n or more rows {n,m}: greater or equal to n rows and less than or equal to m rows {,m}: more than 0 and less than or equal to m rows - Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
> On Tue, Apr 23, 2024 at 8:13 PM Tatsuo Ishii wrote: >> SELECT v.a, count(*) OVER w >> FROM (VALUES ('A'),('B'),('B'),('C')) AS v (a) >> WINDOW w AS ( >> ORDER BY v.a >> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> PATTERN (B+) >> DEFINE B AS a = 'B' >> ) >> a | count >> ---+--- >> A | 0 >> B | 2 >> B | >> C | 0 >> (4 rows) >> >> Here row 3 is skipped because the pattern B matches row 2 and 3. In >> this case I think cont(*) should return 0 rathern than NULL for row 3. > > I think returning zero would match Vik's explanation upthread [1], > yes. Unfortunately I don't have a spec handy to double-check for > myself right now. Ok. I believe you and Vik are correct. I am modifying the patch in this direction. Attached is the regression diff after modifying the patch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp diff -U3 /usr/local/src/pgsql/current/postgresql/src/test/regress/expected/rpr.out /usr/local/src/pgsql/current/postgresql/src/test/regress/results/rpr.out --- /usr/local/src/pgsql/current/postgresql/src/test/regress/expected/rpr.out 2024-04-24 11:30:27.710523139 +0900 +++ /usr/local/src/pgsql/current/postgresql/src/test/regress/results/rpr.out 2024-04-26 14:39:03.543759205 +0900 @@ -181,8 +181,8 @@ company1 | 07-01-2023 | 100 | 0 company1 | 07-02-2023 | 200 | 0 company1 | 07-03-2023 | 150 | 3 - company1 | 07-04-2023 | 140 | - company1 | 07-05-2023 | 150 | + company1 | 07-04-2023 | 140 | 0 + company1 | 07-05-2023 | 150 | 0 company1 | 07-06-2023 |90 | 0 company1 | 07-07-2023 | 110 | 0 company1 | 07-08-2023 | 130 | 0 @@ -556,24 +556,24 @@ company | tdate| price | first_value | last_value | count --++---+-++--- company1 | 07-01-2023 | 100 | 07-01-2023 | 07-03-2023 | 3 - company1 | 07-02-2023 | 200 | || - company1 | 07-03-2023 | 150 | || + company1 | 07-02-2023 | 200 | || 0 + company1 | 07-03-2023 | 150 | || 0 company1 | 07-04-2023 | 140 | 07-04-2023 | 07-06-2023 | 3 - company1 | 07-05-2023 | 150 | || - company1 | 07-06-2023 |90 | || + company1 | 07-05-2023 | 150 | || 0 + company1 | 07-06-2023 |90 | || 0 company1 | 07-07-2023 | 110 | 07-07-2023 | 07-09-2023 | 3 - company1 | 07-08-2023 | 130 | || - company1 | 07-09-2023 | 120 | || + company1 | 07-08-2023 | 130 | || 0 + company1 | 07-09-2023 | 120 | || 0 company1 | 07-10-2023 | 130 | || 0 company2 | 07-01-2023 |50 | 07-01-2023 | 07-03-2023 | 3 - company2 | 07-02-2023 | 2000 | || - company2 | 07-03-2023 | 1500 | || + company2 | 07-02-2023 | 2000 | || 0 + company2 | 07-03-2023 | 1500 | || 0 company2 | 07-04-2023 | 1400 | 07-04-2023 | 07-06-2023 | 3 - company2 | 07-05-2023 | 1500 | || - company2 | 07-06-2023 |60 | || + company2 | 07-05-2023 | 1500 | || 0 + company2 | 07-06-2023 |60 | || 0 company2 | 07-07-2023 | 1100 | 07-07-2023 | 07-09-2023 | 3 - company2 | 07-08-2023 | 1300 | || - company2 | 07-09-2023 | 1200 | || + company2 | 07-08-2023 | 1300 | || 0 + company2 | 07-09-2023 | 1200 | || 0 company2 | 07-10-2023 | 1300 | || 0 (20 rows) @@ -604,24 +604,24 @@ company | tdate| price | first_value | last_value | max | min | sum | avg | count --++---+-++--+-+--+---+--- company1 | 07-01-2023 | 100 | 100 |140 | 200 | 100 | 590 | 147.5000 | 4 - company1 | 07-02-2023 | 200 | || | | | | - company1 | 07-03-2023 | 150 | || | | | | - company1 | 07-04-2023 | 140 | || | | | | + company1 | 07-02-2023 | 200 | || | | | | 0 + company1 | 07-03-2023 | 150 | || | | | | 0 + company1 |
Re: Row pattern recognition
On Tue, Apr 23, 2024 at 8:13 PM Tatsuo Ishii wrote: > SELECT v.a, count(*) OVER w > FROM (VALUES ('A'),('B'),('B'),('C')) AS v (a) > WINDOW w AS ( > ORDER BY v.a > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (B+) > DEFINE B AS a = 'B' > ) > a | count > ---+--- > A | 0 > B | 2 > B | > C | 0 > (4 rows) > > Here row 3 is skipped because the pattern B matches row 2 and 3. In > this case I think cont(*) should return 0 rathern than NULL for row 3. I think returning zero would match Vik's explanation upthread [1], yes. Unfortunately I don't have a spec handy to double-check for myself right now. --Jacob [1] https://www.postgresql.org/message-id/c9ebc3d0-c3d1-e8eb-4a57-0ec099cbda17%40postgresfriends.org
Re: Row pattern recognition
Hi Vik and Champion, I think the current RPR patch is not quite correct in handling count(*). (using slightly modified version of Vik's example query) SELECT v.a, count(*) OVER w FROM (VALUES ('A'),('B'),('B'),('C')) AS v (a) WINDOW w AS ( ORDER BY v.a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (B+) DEFINE B AS a = 'B' ) a | count ---+--- A | 0 B | 2 B | C | 0 (4 rows) Here row 3 is skipped because the pattern B matches row 2 and 3. In this case I think cont(*) should return 0 rathern than NULL for row 3. What do you think? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
> Thank you very much for providing the patch for the RPR implementation. > > After applying the v12-patches, I noticed an issue that > the rpr related parts in window clauses were not displayed in the > view definitions (the definition column of pg_views). > > To address this, I have taken the liberty of adding an additional patch > that modifies the relevant rewriter source code. > I have attached the rewriter patch for your review and would greatly > appreciate your feedback. > > Thank you for your time and consideration. Thank you so much for spotting the issue and creating the patch. I confirmed that your patch applies cleanly and solve the issue. I will include the patches into upcoming v13 patches. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On Sat, 09 Dec 2023 07:22:58 +0900 (JST) Tatsuo Ishii wrote: > > On 04.12.23 12:40, Tatsuo Ishii wrote: > >> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y > >> index d631ac89a9..5a77fca17f 100644 > >> --- a/src/backend/parser/gram.y > >> +++ b/src/backend/parser/gram.y > >> @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char > >> *relname, List *aliases, Node *query); > >>DefElem*defelt; > >>SortBy *sortby; > >>WindowDef *windef; > >> + RPCommonSyntax *rpcom; > >> + RPSubsetItem*rpsubset; > >>JoinExpr *jexpr; > >>IndexElem *ielem; > >>StatsElem *selem; > >> @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char > >> *relname, List *aliases, Node *query); > >>MergeWhenClause *mergewhen; > >>struct KeyActions *keyactions; > >>struct KeyAction *keyaction; > >> + RPSkipToskipto; > >> } > >> %typestmt toplevel_stmt schema_stmt routine_body_stmt > > > > It is usually not the style to add an entry for every node type to the > > %union. Otherwise, we'd have hundreds of entries in there. > > Ok, I have removed the node types and used existing node types. Also > I have moved RPR related %types to same place to make it easier to know > what are added by RPR. > > >> @@ -866,6 +878,7 @@ static Node *makeRecursiveViewSelect(char > >> *relname, List *aliases, Node *query); > >> %nonassoc UNBOUNDED /* ideally would have same precedence as IDENT */ > >> %nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE > >> %ROLLUP > >>SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT > >> +%nonassoc MEASURES AFTER INITIAL SEEK PATTERN_P > >> %left Op OPERATOR /* multi-character ops and user-defined operators */ > >> %left'+' '-' > >> %left'*' '/' '%' > > > > It was recently discussed that these %nonassoc should ideally all have > > the same precedence. Did you consider that here? > > No, I didn't realize it. Thanks for pointing it out. I have removed > %nonassoc so that MEASURES etc. have the same precedence as IDENT etc. > > Attached is the new diff of gram.y against master branch. Thank you very much for providing the patch for the RPR implementation. After applying the v12-patches, I noticed an issue that the rpr related parts in window clauses were not displayed in the view definitions (the definition column of pg_views). To address this, I have taken the liberty of adding an additional patch that modifies the relevant rewriter source code. I have attached the rewriter patch for your review and would greatly appreciate your feedback. Thank you for your time and consideration. -- SRA OSS LLC Ningwei Chen TEL: 03-5979-2701 FAX: 03-5979-2702 diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 0b2a164057..baded88201 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -428,6 +428,10 @@ static void get_rule_groupingset(GroupingSet *gset, List *targetlist, bool omit_parens, deparse_context *context); static void get_rule_orderby(List *orderList, List *targetList, bool force_colno, deparse_context *context); +static void get_rule_pattern(List *patternVariable, List *patternRegexp, +bool force_colno, deparse_context *context); +static void get_rule_define(List *defineClause, List *patternVariables, + bool force_colno, deparse_context *context); static void get_rule_windowclause(Query *query, deparse_context *context); static void get_rule_windowspec(WindowClause *wc, List *targetList, deparse_context *context); @@ -6459,6 +6463,64 @@ get_rule_orderby(List *orderList, List *targetList, } } +/* + * Display a PATTERN clause. + */ +static void +get_rule_pattern(List *patternVariable, List *patternRegexp, +bool force_colno, deparse_context *context) +{ + StringInfo buf = context->buf; + const char *sep; + ListCell *lc_var, *lc_reg = list_head(patternRegexp); + + sep = ""; + appendStringInfoChar(buf, '('); + foreach(lc_var, patternVariable) + { + char *variable = strVal((String *) lfirst(lc_var)); + char *regexp = NULL; + + if (lc_reg != NULL) + { + regexp = strVal((String *) lfirst(lc_reg)); + lc_reg = lnext(patternRegexp, lc_reg); + } + + appendStringInfo(buf, "%s%s", sep, variable); + if (regexp != NULL) + appendStringInfoString(buf, regexp); + + sep = " "; + } + appendStringInfoChar(buf, ')'); +} +
Re: Row pattern recognition
> On 04.12.23 12:40, Tatsuo Ishii wrote: >> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y >> index d631ac89a9..5a77fca17f 100644 >> --- a/src/backend/parser/gram.y >> +++ b/src/backend/parser/gram.y >> @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char >> *relname, List *aliases, Node *query); >> DefElem*defelt; >> SortBy *sortby; >> WindowDef *windef; >> +RPCommonSyntax *rpcom; >> +RPSubsetItem*rpsubset; >> JoinExpr *jexpr; >> IndexElem *ielem; >> StatsElem *selem; >> @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char >> *relname, List *aliases, Node *query); >> MergeWhenClause *mergewhen; >> struct KeyActions *keyactions; >> struct KeyAction *keyaction; >> +RPSkipToskipto; >> } >> %type stmt toplevel_stmt schema_stmt routine_body_stmt > > It is usually not the style to add an entry for every node type to the > %union. Otherwise, we'd have hundreds of entries in there. Ok, I have removed the node types and used existing node types. Also I have moved RPR related %types to same place to make it easier to know what are added by RPR. >> @@ -866,6 +878,7 @@ static Node *makeRecursiveViewSelect(char >> *relname, List *aliases, Node *query); >> %nonassoc UNBOUNDED /* ideally would have same precedence as IDENT */ >> %nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE >> %ROLLUP >> SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT >> +%nonassoc MEASURES AFTER INITIAL SEEK PATTERN_P >> %left Op OPERATOR /* multi-character ops and user-defined operators */ >> %left '+' '-' >> %left '*' '/' '%' > > It was recently discussed that these %nonassoc should ideally all have > the same precedence. Did you consider that here? No, I didn't realize it. Thanks for pointing it out. I have removed %nonassoc so that MEASURES etc. have the same precedence as IDENT etc. Attached is the new diff of gram.y against master branch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index d631ac89a9..6c41aa2e9f 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -659,6 +659,21 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt +%type row_pattern_measure_item + row_pattern_definition +%typeopt_row_pattern_common_syntax + opt_row_pattern_skip_to + row_pattern_subset_item + row_pattern_term +%typeopt_row_pattern_measures + row_pattern_measure_list + row_pattern_definition_list + opt_row_pattern_subset_clause + row_pattern_subset_list + row_pattern_subset_rhs + row_pattern +%type opt_row_pattern_initial_or_seek + first_or_last /* * Non-keyword token types. These are hard-wired into the "flex" lexer. @@ -702,7 +717,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC + DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP @@ -718,7 +733,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); HANDLER HAVING HEADER_P HOLD HOUR_P IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE - INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P + INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIAL INITIALLY INLINE_P INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION @@ -731,7 +746,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED - MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD + MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MEASURES MERGE METHOD MINUTE_P MINVALUE MODE MONTH_P MOVE NAME_P NAMES NATIONAL NATURAL NCHAR NEW NEXT NFC NFD NFKC NFKD NO NONE @@
Re: Row pattern recognition
On 04.12.23 12:40, Tatsuo Ishii wrote: diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index d631ac89a9..5a77fca17f 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem*defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem*rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); MergeWhenClause *mergewhen; struct KeyActions *keyactions; struct KeyAction *keyaction; + RPSkipToskipto; } %type stmt toplevel_stmt schema_stmt routine_body_stmt It is usually not the style to add an entry for every node type to the %union. Otherwise, we'd have hundreds of entries in there. @@ -866,6 +878,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %nonassoc UNBOUNDED /* ideally would have same precedence as IDENT */ %nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE ROLLUP SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT +%nonassoc MEASURES AFTER INITIAL SEEK PATTERN_P %left Op OPERATOR /* multi-character ops and user-defined operators */ %left '+' '-' %left '*' '/' '%' It was recently discussed that these %nonassoc should ideally all have the same precedence. Did you consider that here?
Re: Row pattern recognition
On Tue, Oct 24, 2023 at 7:49 PM Tatsuo Ishii wrote: > I am impressed the simple NFA implementation. Thanks! > It would be nicer if it > could be implemented without using recursion. Yeah. If for some reason we end up going with a bespoke implementation, I assume we'd just convert the algorithm to an iterative one and optimize it heavily. But I didn't want to do that too early, since it'd probably make it harder to add new features... and anyway my goal is still to try to reuse src/backend/regex eventually. > > SELECT aid, first_value(aid) OVER w, > > count(*) OVER w > > FROM pgbench_accounts > > WINDOW w AS ( > > PARTITION BY bid > > ORDER BY aid > > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > > AFTER MATCH SKIP PAST LAST ROW > > INITIAL > > PATTERN (START UP+) > > DEFINE > > START AS TRUE, > > UP AS aid > PREV(aid) > > ); > > I ran this against your patch. It failed around > 60k rows. Nice, that's actually more frames than I expected. Looks like I have similar results here with my second test query (segfault at ~58k rows). Thanks, --Jacob
Re: Row pattern recognition
> Great. I will look into this. I am impressed the simple NFA implementation. It would be nicer if it could be implemented without using recursion. > By the way, I tested my patch (v10) to handle more large data set and > tried to following query with pgbench database. On my laptop it works > with 100k rows pgbench_accounts table but with beyond the number I got ~~~ I meant 10k. > OOM killer. I would like to enhance this in the next patch. > > SELECT aid, first_value(aid) OVER w, > count(*) OVER w > FROM pgbench_accounts > WINDOW w AS ( > PARTITION BY bid > ORDER BY aid > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+) > DEFINE > START AS TRUE, > UP AS aid > PREV(aid) > ); I ran this against your patch. It failed around > 60k rows. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
> On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii wrote: >> Attached is the v10 patch. This version enhances the performance of >> pattern matching. > > Nice! I've attached a couple of more stressful tests (window > partitions of 1000 rows each). Beware that the second one runs my > desktop out of memory fairly quickly with the v10 implementation. > > I was able to carve out some time this week to implement a very basic > recursive NFA, which handles both the + and * qualifiers (attached). Great. I will look into this. > It's not production quality -- a frame on the call stack for every row > isn't going to work Yeah. > -- but with only those two features, it's pretty > tiny, and it's able to run the new stress tests with no issue. If I > continue to have time, I hope to keep updating this parallel > implementation as you add features to the StringSet implementation, > and we can see how it evolves. I expect that alternation and grouping > will ratchet up the complexity. Sounds like a plan. By the way, I tested my patch (v10) to handle more large data set and tried to following query with pgbench database. On my laptop it works with 100k rows pgbench_accounts table but with beyond the number I got OOM killer. I would like to enhance this in the next patch. SELECT aid, first_value(aid) OVER w, count(*) OVER w FROM pgbench_accounts WINDOW w AS ( PARTITION BY bid ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+) DEFINE START AS TRUE, UP AS aid > PREV(aid) ); Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii wrote: > Attached is the v10 patch. This version enhances the performance of > pattern matching. Nice! I've attached a couple of more stressful tests (window partitions of 1000 rows each). Beware that the second one runs my desktop out of memory fairly quickly with the v10 implementation. I was able to carve out some time this week to implement a very basic recursive NFA, which handles both the + and * qualifiers (attached). It's not production quality -- a frame on the call stack for every row isn't going to work -- but with only those two features, it's pretty tiny, and it's able to run the new stress tests with no issue. If I continue to have time, I hope to keep updating this parallel implementation as you add features to the StringSet implementation, and we can see how it evolves. I expect that alternation and grouping will ratchet up the complexity. Thanks! --Jacob From fb3cbb6f99f0fe7b05027759454d7e0013225929 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Fri, 20 Oct 2023 16:11:14 -0700 Subject: [PATCH 2/2] squash! Row pattern recognition patch (executor). - Extract pattern matching logic into match_pattern(). - Replace the str_set/initials implementation with a recursive backtracker. - Rework match_pattern in preparation for * quantifier: separate success from number of matched rows. - Handle `*` quantifier - Simplify the evaluate_pattern() signature. - Remove defineInitial and related code. --- src/backend/executor/nodeWindowAgg.c| 871 +++- src/backend/optimizer/plan/createplan.c | 4 - src/backend/parser/parse_clause.c | 31 - src/include/nodes/execnodes.h | 1 - src/include/nodes/parsenodes.h | 2 - src/include/nodes/plannodes.h | 3 - 6 files changed, 91 insertions(+), 821 deletions(-) diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c index 2e1baef7ea..30abe60159 100644 --- a/src/backend/executor/nodeWindowAgg.c +++ b/src/backend/executor/nodeWindowAgg.c @@ -161,40 +161,6 @@ typedef struct WindowStatePerAggData boolrestart;/* need to restart this agg in this cycle? */ } WindowStatePerAggData; -/* - * Set of StringInfo. Used in RPR. - */ -typedef struct StringSet { - StringInfo *str_set; - Sizeset_size; /* current array allocation size in number of items */ - int set_index; /* current used size */ -} StringSet; - -/* - * Allowed subsequent PATTERN variables positions. - * Used in RPR. - * - * pos represents the pattern variable defined order in DEFINE caluase. For - * example. "DEFINE START..., UP..., DOWN ..." and "PATTERN START UP DOWN UP" - * will create: - * VariablePos[0].pos[0] = 0; START - * VariablePos[1].pos[0] = 1; UP - * VariablePos[1].pos[1] = 3; UP - * VariablePos[2].pos[0] = 2; DOWN - * - * Note that UP has two pos because UP appears in PATTERN twice. - * - * By using this strucrture, we can know which pattern variable can be followed - * by which pattern variable(s). For example, START can be followed by UP and - * DOWN since START's pos is 0, and UP's pos is 1 or 3, DOWN's pos is 2. - * DOWN can be followed by UP since UP's pos is either 1 or 3. - * - */ -#define NUM_ALPHABETS 26 /* we allow [a-z] variable initials */ -typedef struct VariablePos { - int pos[NUM_ALPHABETS]; /* postion(s) in PATTERN */ -} VariablePos; - static void initialize_windowaggregate(WindowAggState *winstate, WindowStatePerFunc perfuncstate, WindowStatePerAgg peraggstate); @@ -249,27 +215,11 @@ static void register_reduced_frame_map(WindowAggState *winstate, int64 pos, int static void clear_reduced_frame_map(WindowAggState *winstate); static void update_reduced_frame(WindowObject winobj, int64 pos); -static int64 evaluate_pattern(WindowObject winobj, int64 current_pos, - char *vname, StringInfo encoded_str, bool *result); +static bool evaluate_pattern(WindowObject winobj, int64 current_pos, +char *vname); static bool get_slots(WindowObject winobj, int64 current_pos); -//static int search_str_set(char *pattern, StringSet *str_set, int *initial_orders); -static int search_str_set(char *pattern, StringSet *str_set, VariablePos *variable_pos); -static charpattern_initial(WindowAggState *winstate, char *vname); -static int do_pattern_match(char *pattern, char *encoded_str); - -static StringSet *string_set_init(void); -static void string_set_add(StringSet *string_set, StringInfo str); -static StringInfo string_set_get(StringSet *string_set, int index); -stat
Re: Row pattern recognition
Attached is the v10 patch. This version enhances the performance of pattern matching. Previously it generated all possible pattern string candidates. This resulted in unnecessarily large number of candidates. For example if you have 2 pattern variables and the target frame includes 100 rows, the number of candidates can reach to 2^100 in the worst case. To avoid this, I do a pruning in the v10 patch. Suppose you have: PATTERN (A B+ C+) Candidates like "BAC" "CAB" cannot survive because they never satisfy the search pattern. To judge this, I assign sequence numbers (0, 1, 2) to (A B C). If the pattern generator tries to generate BA, this is not allowed because the sequence number for B is 1 and for A is 0, and 0 < 1: B cannot be followed by A. Note that this technique can be applied when the quantifiers are "+" or "*". Maybe other quantifiers such as '?' or '{n, m}' can be applied too but I don't confirm yet because I have not implemented them yet. Besides this improvement, I fixed a bug in the previous and older patches: when an expression in DEFINE uses text operators, it errors out: ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. This was fixed by adding assign_expr_collations() in transformDefineClause(). Also I have updated documentation "3.5. Window Functions" - It still mentioned about rpr(). It's not applied anymore. - Enhance the description about DEFINE and PATTERN. - Mention that quantifier '*' is supported. Finally I have added more test cases to the regression test. - same pattern variable appears twice - case for quantifier '*' Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From a8b41dd155a2d9ce969083fcfc53bbae3c28b263 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Sun, 22 Oct 2023 11:22:10 +0900 Subject: [PATCH v10 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 222 ++-- src/include/nodes/parsenodes.h | 56 src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 273 insertions(+), 14 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index c224df4ecc..e09eb061f8 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); MergeWhenClause *mergewhen; struct KeyActions *keyactions; struct KeyAction *keyaction; + RPSkipTo skipto; } %type stmt toplevel_stmt schema_stmt routine_body_stmt @@ -453,8 +456,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +558,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition +%type first_or_last %type generic_option_name %type generic_option_arg @@ -633,6 +642,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +671,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +713,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP
Re: Row pattern recognition
> By the way, I was thinking about eliminating recusrive calls in > pattern matching. Attached is the first cut of the implementation. In > the attached v8 patch: > > - No recursive calls anymore. search_str_set_recurse was removed. > > - Instead it generates all possible pattern variable name initial > strings before pattern matching. Suppose we have "ab" (row 0) and > "ac" (row 1). "a" and "b" represents the pattern variable names > which are evaluated to true. In this case it will generate "aa", > "ac", "ba" and "bc" and they are examined by do_pattern_match one by > one, which performs pattern matching. > > - To implement this, an infrastructure string_set* are created. They > take care of set of StringInfo. > > I found that the previous implementation did not search all possible > cases. I believe the bug is fixed in v8. The v8 patch does not apply anymore due to commit d060e921ea "Remove obsolete executor cleanup code". So I rebased and created v9 patch. The differences from the v8 include: - Fix bug with get_slots. It did not correctly detect the end of full frame. - Add test case using "ROWS BETWEEN CURRENT ROW AND offset FOLLOWING". Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From d8fd143ab717fd62814e73fd24bf1a1694143dfc Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Wed, 4 Oct 2023 14:51:48 +0900 Subject: [PATCH v9 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 222 ++-- src/include/nodes/parsenodes.h | 56 src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 273 insertions(+), 14 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index e56cbe77cb..730c51bc87 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); MergeWhenClause *mergewhen; struct KeyActions *keyactions; struct KeyAction *keyaction; + RPSkipTo skipto; } %type stmt toplevel_stmt schema_stmt routine_body_stmt @@ -453,8 +456,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +558,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition +%type first_or_last %type generic_option_name %type generic_option_arg @@ -633,6 +642,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +671,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +713,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC + DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP @@ -718,7 +729,7 @@ static Node *make
Re: Row pattern recognition
> On Fri, Sep 22, 2023, 3:13 AM Tatsuo Ishii wrote: > >> > Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >> >>> Attached is the fix against v6 patch. I will include this in upcoming >> >>> v7 patch. >> >> Attached is the v7 patch. It includes the fix mentioned above. Also >> > (Champion's address bounced; removed) >> >> On my side his adress bounced too:-< >> > > Yep. I'm still here, just lurking for now. It'll take a little time for me > to get back to this thread, as my schedule has changed significantly. :D Hope you get back soon... By the way, I was thinking about eliminating recusrive calls in pattern matching. Attached is the first cut of the implementation. In the attached v8 patch: - No recursive calls anymore. search_str_set_recurse was removed. - Instead it generates all possible pattern variable name initial strings before pattern matching. Suppose we have "ab" (row 0) and "ac" (row 1). "a" and "b" represents the pattern variable names which are evaluated to true. In this case it will generate "aa", "ac", "ba" and "bc" and they are examined by do_pattern_match one by one, which performs pattern matching. - To implement this, an infrastructure string_set* are created. They take care of set of StringInfo. I found that the previous implementation did not search all possible cases. I believe the bug is fixed in v8. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From 281ee5c3eae85f96783422cb2f9987c3af8c8a68 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Mon, 25 Sep 2023 14:01:14 +0900 Subject: [PATCH v8 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 222 ++-- src/include/nodes/parsenodes.h | 56 src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 273 insertions(+), 14 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 7d2032885e..74c2069050 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); MergeWhenClause *mergewhen; struct KeyActions *keyactions; struct KeyAction *keyaction; + RPSkipTo skipto; } %type stmt toplevel_stmt schema_stmt routine_body_stmt @@ -453,8 +456,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +558,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition +%type first_or_last %type generic_option_name %type generic_option_arg @@ -633,6 +642,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +671,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +713,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC +
Re: Row pattern recognition
On Fri, Sep 22, 2023, 3:13 AM Tatsuo Ishii wrote: > > Op 9/22/23 om 07:16 schreef Tatsuo Ishii: > >>> Attached is the fix against v6 patch. I will include this in upcoming > >>> v7 patch. > >> Attached is the v7 patch. It includes the fix mentioned above. Also > > (Champion's address bounced; removed) > > On my side his adress bounced too:-< > Yep. I'm still here, just lurking for now. It'll take a little time for me to get back to this thread, as my schedule has changed significantly. :D Thanks, --Jacob >
Re: Row pattern recognition
Op 9/22/23 om 12:12 schreef Tatsuo Ishii: Op 9/22/23 om 07:16 schreef Tatsuo Ishii: Attached is the fix against v6 patch. I will include this in upcoming v7 patch. Attached is the v7 patch. It includes the fix mentioned above. Also (Champion's address bounced; removed) On my side his adress bounced too:-< Hi, In my hands, make check fails on the rpr test; see attached .diff file. In these two statements: -- using NEXT -- using AFTER MATCH SKIP TO NEXT ROW result of first_value(price) and next_value(price) are empty. Strange. I have checked out fresh master branch and applied the v7 patches, then ran make check. All tests including the rpr test passed. This is Ubuntu 20.04. The curious thing is that the server otherwise builds ok, and if I explicitly run on that server 'CREATE TEMP TABLE stock' + the 20 INSERTS (just to make sure to have known data), those two statements now both return the correct result. So maybe the testing/timing is wonky (not necessarily the server). Erik Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
> Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >>> Attached is the fix against v6 patch. I will include this in upcoming >>> v7 patch. >> Attached is the v7 patch. It includes the fix mentioned above. Also > (Champion's address bounced; removed) On my side his adress bounced too:-< > Hi, > > In my hands, make check fails on the rpr test; see attached .diff > file. > In these two statements: > -- using NEXT > -- using AFTER MATCH SKIP TO NEXT ROW > result of first_value(price) and next_value(price) are empty. Strange. I have checked out fresh master branch and applied the v7 patches, then ran make check. All tests including the rpr test passed. This is Ubuntu 20.04. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
Op 9/22/23 om 07:16 schreef Tatsuo Ishii: Attached is the fix against v6 patch. I will include this in upcoming v7 patch. Attached is the v7 patch. It includes the fix mentioned above. Also Hi, In my hands, make check fails on the rpr test; see attached .diff file. In these two statements: -- using NEXT -- using AFTER MATCH SKIP TO NEXT ROW result of first_value(price) and next_value(price) are empty. Erik Rijkers this time the pattern matching engine is enhanced: previously it recursed to row direction, which means if the number of rows in a frame is large, it could exceed the limit of stack depth. The new version recurses over matched pattern variables in a row, which is at most 26 which should be small enough. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jpdiff -U3 /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/expected/rpr.out /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/results/rpr.out --- /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/expected/rpr.out 2023-09-22 09:04:17.770392635 +0200 +++ /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/results/rpr.out 2023-09-22 09:38:23.826458109 +0200 @@ -253,23 +253,23 @@ ); company | tdate| price | first_value | last_value --++---+-+ - company1 | 07-01-2023 | 100 | 100 |200 + company1 | 07-01-2023 | 100 | | company1 | 07-02-2023 | 200 | | company1 | 07-03-2023 | 150 | | - company1 | 07-04-2023 | 140 | 140 |150 + company1 | 07-04-2023 | 140 | | company1 | 07-05-2023 | 150 | | company1 | 07-06-2023 |90 | | - company1 | 07-07-2023 | 110 | 110 |130 + company1 | 07-07-2023 | 110 | | company1 | 07-08-2023 | 130 | | company1 | 07-09-2023 | 120 | | company1 | 07-10-2023 | 130 | | - company2 | 07-01-2023 |50 | 50 | 2000 + company2 | 07-01-2023 |50 | | company2 | 07-02-2023 | 2000 | | company2 | 07-03-2023 | 1500 | | - company2 | 07-04-2023 | 1400 |1400 | 1500 + company2 | 07-04-2023 | 1400 | | company2 | 07-05-2023 | 1500 | | company2 | 07-06-2023 |60 | | - company2 | 07-07-2023 | 1100 |1100 | 1300 + company2 | 07-07-2023 | 1100 | | company2 | 07-08-2023 | 1300 | | company2 | 07-09-2023 | 1200 | | company2 | 07-10-2023 | 1300 | | @@ -290,23 +290,23 @@ ); company | tdate| price | first_value | last_value --++---+-+ - company1 | 07-01-2023 | 100 | 100 |200 + company1 | 07-01-2023 | 100 | | company1 | 07-02-2023 | 200 | | company1 | 07-03-2023 | 150 | | - company1 | 07-04-2023 | 140 | 140 |150 + company1 | 07-04-2023 | 140 | | company1 | 07-05-2023 | 150 | | company1 | 07-06-2023 |90 | | - company1 | 07-07-2023 | 110 | 110 |130 + company1 | 07-07-2023 | 110 | | company1 | 07-08-2023 | 130 | | company1 | 07-09-2023 | 120 | | company1 | 07-10-2023 | 130 | | - company2 | 07-01-2023 |50 | 50 | 2000 + company2 | 07-01-2023 |50 | | company2 | 07-02-2023 | 2000 | | company2 | 07-03-2023 | 1500 | | - company2 | 07-04-2023 | 1400 |1400 | 1500 + company2 | 07-04-2023 | 1400 | | company2 | 07-05-2023 | 1500 | | company2 | 07-06-2023 |60 | | - company2 | 07-07-2023 | 1100 |1100 | 1300 + company2 | 07-07-2023 | 1100 | | company2 | 07-08-2023 | 1300 | | company2 | 07-09-2023 | 1200 | | company2 | 07-10-2023 | 1300 | |
Re: Row pattern recognition
Op 9/22/23 om 10:23 schreef Erik Rijkers: Op 9/22/23 om 07:16 schreef Tatsuo Ishii: Attached is the fix against v6 patch. I will include this in upcoming v7 patch. Attached is the v7 patch. It includes the fix mentioned above. Also (Champion's address bounced; removed) Sorry, I forgot to re-attach the regression.diffs with resend... Erik Hi, In my hands, make check fails on the rpr test; see attached .diff file. In these two statements: -- using NEXT -- using AFTER MATCH SKIP TO NEXT ROW result of first_value(price) and next_value(price) are empty. Erik Rijkers this time the pattern matching engine is enhanced: previously it recursed to row direction, which means if the number of rows in a frame is large, it could exceed the limit of stack depth. The new version recurses over matched pattern variables in a row, which is at most 26 which should be small enough. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp diff -U3 /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/expected/rpr.out /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/results/rpr.out --- /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/expected/rpr.out 2023-09-22 09:04:17.770392635 +0200 +++ /home/aardvark/pg_stuff/pg_sandbox/pgsql.rpr/src/test/regress/results/rpr.out 2023-09-22 09:38:23.826458109 +0200 @@ -253,23 +253,23 @@ ); company | tdate| price | first_value | last_value --++---+-+ - company1 | 07-01-2023 | 100 | 100 |200 + company1 | 07-01-2023 | 100 | | company1 | 07-02-2023 | 200 | | company1 | 07-03-2023 | 150 | | - company1 | 07-04-2023 | 140 | 140 |150 + company1 | 07-04-2023 | 140 | | company1 | 07-05-2023 | 150 | | company1 | 07-06-2023 |90 | | - company1 | 07-07-2023 | 110 | 110 |130 + company1 | 07-07-2023 | 110 | | company1 | 07-08-2023 | 130 | | company1 | 07-09-2023 | 120 | | company1 | 07-10-2023 | 130 | | - company2 | 07-01-2023 |50 | 50 | 2000 + company2 | 07-01-2023 |50 | | company2 | 07-02-2023 | 2000 | | company2 | 07-03-2023 | 1500 | | - company2 | 07-04-2023 | 1400 |1400 | 1500 + company2 | 07-04-2023 | 1400 | | company2 | 07-05-2023 | 1500 | | company2 | 07-06-2023 |60 | | - company2 | 07-07-2023 | 1100 |1100 | 1300 + company2 | 07-07-2023 | 1100 | | company2 | 07-08-2023 | 1300 | | company2 | 07-09-2023 | 1200 | | company2 | 07-10-2023 | 1300 | | @@ -290,23 +290,23 @@ ); company | tdate| price | first_value | last_value --++---+-+ - company1 | 07-01-2023 | 100 | 100 |200 + company1 | 07-01-2023 | 100 | | company1 | 07-02-2023 | 200 | | company1 | 07-03-2023 | 150 | | - company1 | 07-04-2023 | 140 | 140 |150 + company1 | 07-04-2023 | 140 | | company1 | 07-05-2023 | 150 | | company1 | 07-06-2023 |90 | | - company1 | 07-07-2023 | 110 | 110 |130 + company1 | 07-07-2023 | 110 | | company1 | 07-08-2023 | 130 | | company1 | 07-09-2023 | 120 | | company1 | 07-10-2023 | 130 | | - company2 | 07-01-2023 |50 | 50 | 2000 + company2 | 07-01-2023 |50 | | company2 | 07-02-2023 | 2000 | | company2 | 07-03-2023 | 1500 | | - company2 | 07-04-2023 | 1400 |1400 | 1500 + company2 | 07-04-2023 | 1400 | | company2 | 07-05-2023 | 1500 | | company2 | 07-06-2023 |60 | | - company2 | 07-07-2023 | 1100 |1100 | 1300 + company2 | 07-07-2023 | 1100 | | company2 | 07-08-2023 | 1300 | | company2 | 07-09-2023 | 1200 | | company2 | 07-10-2023 | 1300 | |
Re: Row pattern recognition
Op 9/22/23 om 07:16 schreef Tatsuo Ishii: Attached is the fix against v6 patch. I will include this in upcoming v7 patch. Attached is the v7 patch. It includes the fix mentioned above. Also (Champion's address bounced; removed) Hi, In my hands, make check fails on the rpr test; see attached .diff file. In these two statements: -- using NEXT -- using AFTER MATCH SKIP TO NEXT ROW result of first_value(price) and next_value(price) are empty. Erik Rijkers this time the pattern matching engine is enhanced: previously it recursed to row direction, which means if the number of rows in a frame is large, it could exceed the limit of stack depth. The new version recurses over matched pattern variables in a row, which is at most 26 which should be small enough. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
> Attached is the fix against v6 patch. I will include this in upcoming v7 > patch. Attached is the v7 patch. It includes the fix mentioned above. Also this time the pattern matching engine is enhanced: previously it recursed to row direction, which means if the number of rows in a frame is large, it could exceed the limit of stack depth. The new version recurses over matched pattern variables in a row, which is at most 26 which should be small enough. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From 4d5be4c27ae5e4a50924520574e2c1ca3e398551 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Fri, 22 Sep 2023 13:53:35 +0900 Subject: [PATCH v7 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 222 ++-- src/include/nodes/parsenodes.h | 56 src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 273 insertions(+), 14 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 7d2032885e..74c2069050 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); MergeWhenClause *mergewhen; struct KeyActions *keyactions; struct KeyAction *keyaction; + RPSkipTo skipto; } %type stmt toplevel_stmt schema_stmt routine_body_stmt @@ -453,8 +456,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +558,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition +%type first_or_last %type generic_option_name %type generic_option_arg @@ -633,6 +642,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +671,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +713,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC + DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP @@ -718,7 +729,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); HANDLER HAVING HEADER_P HOLD HOUR_P IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE - INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P + INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIAL INITIALLY INLINE_P INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION @@ -731,7 +742,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED - MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD + MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MEASURES MERGE METHOD MINUTE_P MINVALUE MODE MONTH_P MOVE NA
Re: Row pattern recognition
> On 9/13/23 07:14, Tatsuo Ishii wrote: >>> >> I was looking for this but I only found ISO/IEC 19075-5:2021. >> https://www.iso.org/standard/78936.html >> Maybe 19075-5:2021 is the latest one? > > Yes, probably. Sorry. No problem. Thanks for confirmation. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On 9/13/23 07:14, Tatsuo Ishii wrote: I was looking for this but I only found ISO/IEC 19075-5:2021. https://www.iso.org/standard/78936.html Maybe 19075-5:2021 is the latest one? Yes, probably. Sorry. -- Vik Fearing
Re: Row pattern recognition
> I was looking for this but I only found ISO/IEC 19075-5:2021. https://www.iso.org/standard/78936.html Maybe 19075-5:2021 is the latest one? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On Mon, Sep 11, 2023 at 11:18 PM Tatsuo Ishii wrote: > What I am not sure about is, you and Vik mentioned that the > traditional NFA is superior that POSIX NFA in terms of performance. > But how "lexicographic ordering" is related to performance? I think they're only tangentially related. POSIX NFAs have to fully backtrack even after the first match is found, so that's where the performance difference comes in. (We would be introducing new ways to catastrophically backtrack if we used that approach.) But since you don't visit every possible path through the graph with a traditional NFA, it makes sense to define an order in which you visit the nodes, so that you can reason about which string is actually going to be matched in the end. > BTW, attched is the v6 patch. The differences from v5 include: > > - Now aggregates can be used with RPR. Below is an example from the > regression test cases, which is added by v6 patch. Great, thank you! --Jacob
Re: Row pattern recognition
Regarding v6 patch: > SELECT company, tdate, price, > first_value(price) OVER w, > last_value(price) OVER w, > max(price) OVER w, > min(price) OVER w, > sum(price) OVER w, > avg(price) OVER w, > count(price) OVER w > FROM stock > WINDOW w AS ( > PARTITION BY company > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+ DOWN+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > company | tdate| price | first_value | last_value | max | min | sum > | avg | count > --++---+-++--+-+--+---+--- > company1 | 07-01-2023 | 100 | 100 |140 | 200 | 100 | 590 > | 147.5000 | 4 > company1 | 07-02-2023 | 200 | || | | > | | > company1 | 07-03-2023 | 150 | || | | > | | > company1 | 07-04-2023 | 140 | || | | > | | > company1 | 07-05-2023 | 150 | || | | > | | > company1 | 07-06-2023 |90 | 90 |120 | 130 | 90 | 450 > | 112.5000 | 4 > company1 | 07-07-2023 | 110 | || | | > | | > company1 | 07-08-2023 | 130 | || | | > | | > company1 | 07-09-2023 | 120 | || | | > | | > company1 | 07-10-2023 | 130 | || | | > | | > company2 | 07-01-2023 |50 | 50 | 1400 | 2000 | 50 | 4950 > | 1237.5000 | 4 > company2 | 07-02-2023 | 2000 | || | | > | | > company2 | 07-03-2023 | 1500 | || | | > | | > company2 | 07-04-2023 | 1400 | || | | > | | > company2 | 07-05-2023 | 1500 | || | | > | | > company2 | 07-06-2023 |60 | 60 | 1200 | 1300 | 60 | 3660 > | 915. | 4 > company2 | 07-07-2023 | 1100 | || | | > | | > company2 | 07-08-2023 | 1300 | || | | > | | > company2 | 07-09-2023 | 1200 | || | | > | | > company2 | 07-10-2023 | 1300 | || | | > | | > (20 rows) count column for unmatched rows should have been 0, rather than NULL. i.e. company | tdate| price | first_value | last_value | max | min | sum | avg | count --++---+-++--+-+--+---+--- company1 | 07-01-2023 | 100 | 100 |140 | 200 | 100 | 590 | 147.5000 | 4 company1 | 07-02-2023 | 200 | || | | | | company1 | 07-03-2023 | 150 | || | | | | company1 | 07-04-2023 | 140 | || | | | | company1 | 07-05-2023 | 150 | || | | | | 0 company1 | 07-06-2023 |90 | 90 |120 | 130 | 90 | 450 | 112.5000 | 4 company1 | 07-07-2023 | 110 | || | | | | company1 | 07-08-2023 | 130 | || | | | | company1 | 07-09-2023 | 120 | || | | | | company1 | 07-10-2023 | 130 | || | | | | 0 company2 | 07-01-2023 |50 | 50 | 1400 | 2000 | 50 | 4950 | 1237.5000 | 4 company2 | 07-02-2023 | 2000 | || | | | | company2 | 07-03-2023 | 1500 | || | | | | company2 | 07-04-2023 | 1400 | || | | | | company2 | 07-05-2023 | 1500 | || | | | | 0
Re: Row pattern recognition
| company2 | 07-09-2023 | 1200 | || | | | | company2 | 07-10-2023 | 1300 | || | | | | (20 rows) Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From c5699cdf64df9b102c5d9e3b6f6002e504c93fc6 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Tue, 12 Sep 2023 14:22:22 +0900 Subject: [PATCH v6 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 216 +--- src/include/nodes/parsenodes.h | 56 + src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 267 insertions(+), 14 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 7d2032885e..70409cdc9a 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -453,8 +455,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +557,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition %type generic_option_name %type generic_option_arg @@ -633,6 +640,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +669,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +711,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC + DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP @@ -718,7 +727,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); HANDLER HAVING HEADER_P HOLD HOUR_P IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE - INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P + INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIAL INITIALLY INLINE_P INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION @@ -731,7 +740,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED - MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD + MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MEASURES MERGE METHOD MINUTE_P MINVALUE MODE MONTH_P MOVE NAME_P NAMES NATIONAL NATURAL NCHAR NEW NEXT NFC NFD NFKC NFKD NO NONE @@ -743,8 +752,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); ORDER ORDINALITY OTHERS OUT_P OUTER_P OVER OVERLAPS OVERLAY OVERRIDING OWNED OWNER - PARALLEL PARAMETER PARSER PARTIAL PARTITION PASSING PASSWORD - PLACING PLANS POLICY + PARALLEL PARAMETER PARSER PARTIAL PARTITION PASSING PASSWORD PAST + PATTERN_P PERMUTE PLACING PLANS POLICY POSITION PRECEDING PRECISION PRESERVE PREPARE PREPARED PRIMARY PRIOR PRIVILEGES PROCEDURAL PROCEDURE PROCEDURES PR
Re: Row pattern recognition
On Sat, Sep 9, 2023 at 4:21 AM Tatsuo Ishii wrote: > Then we will get for str_set: > r0: B > r1: AB > > Because r0 only has classifier B, r1 can have A and B. Problem is, > r2. If we choose A at r1, then r2 = B. But if we choose B at t1, then > r2 = AB. I guess this is the issue you pointed out. Right. > Yeah, probably we have delay evaluation of such pattern variables like > A, then reevaluate A after the first scan. > > What about leaving this (reevaluation) for now? Because: > > 1) we don't have CLASSIFIER > 2) we don't allow to give CLASSIFIER to PREV as its arggument > > so I think we don't need to worry about this for now. Sure. I'm all for deferring features to make it easier to iterate; I just want to make sure the architecture doesn't hit a dead end. Or at least, not without being aware of it. Also: is CLASSIFIER the only way to run into this issue? > What if we don't follow the standard, instead we follow POSIX EREs? I > think this is better for users unless RPR's REs has significant merit > for users. Piggybacking off of what Vik wrote upthread, I think we would not be doing ourselves any favors by introducing a non-compliant implementation that performs worse than a traditional NFA. Those would be some awful bug reports. > > - I think we have to implement a parallel parser regardless (RPR PATTERN > > syntax looks incompatible with POSIX) > > I am not sure if we need to worry about this because of the reason I > mentioned above. Even if we adopted POSIX NFA semantics, we'd still have to implement our own parser for the PATTERN part of the query. I don't think there's a good way for us to reuse the parser in src/backend/regex. > > Does that seem like a workable approach? (Worst-case, my code is just > > horrible, and we throw it in the trash.) > > Yes, it seems workable. I think for the first cut of RPR needs at > least the +quantifier with reasonable performance. The current naive > implementation seems to have issue because of exhaustive search. +1 Thanks! --Jacob
Re: Row pattern recognition
On 9/9/23 13:21, Tatsuo Ishii wrote: Thanks for the explanation. Surprising concet of the standard:-) This leaves the choice between traditional NFA and Posix NFA. The difference between these is that a traditional NFA exits (declares a match) as soon as it finds the first possible match, whereas a Posix NFA is obliged to find all possible matches and then choose the “leftmost longest”. There are examples that show that, even for conventional regular expression matching on text strings and without back references, there are patterns for which a Posix NFA is orders of magnitude slower than a traditional NFA. In addition, reluctant quantifiers cannot be defined in a Posix NFA, because of the leftmost longest rule. Therefore it was decided not to use the Posix NFA model, which leaves the traditional NFA as the model for row pattern matching. Among available tools that use traditional NFA engines, Perl is the most influential; therefore Perl was adopted as the design target for pattern matching rules. Is it different from SIMILAR TO REs too? Of course it is. :-) SIMILAR TO uses its own language and rules. What if we don't follow the standard, instead we follow POSIX EREs? I think this is better for users unless RPR's REs has significant merit for users. This would get big pushback from me. -- Vik Fearing
Re: Row pattern recognition
Hi, >> But: >> >> UP AS price > PREV(price) >> >> also depends on previous row, no? > > PREV(CLASSIFIER()) depends not on the value of the previous row but the > state of the match so far. To take an example from the patch: > >> * Example: >> * str_set[0] = "AB"; >> * str_set[1] = "AC"; >> * In this case at row 0 A and B are true, and A and C are true in row 1. > > With these str_sets and my example DEFINE, row[1] is only classifiable > as 'A' if row[0] is *not* classified as 'A' at this point in the match. > "AA" is not a valid candidate string, even if it matches the PATTERN. Ok, Let me clarify my understanding. Suppose we have: PATTER (A B) DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', B AS price > 100 and the target table has price column values: row[0]: 110 row[1]: 110 row[2]: 110 row[3]: 110 Then we will get for str_set: r0: B r1: AB Because r0 only has classifier B, r1 can have A and B. Problem is, r2. If we choose A at r1, then r2 = B. But if we choose B at t1, then r2 = AB. I guess this is the issue you pointed out. > So if we don't reevaluate the pattern variable condition for the row, we > at least have to prune the combinations that search_str_set() visits, so > that we don't generate a logically impossible combination. That seems > like it could be pretty fragile, and it may be difficult for us to prove > compliance. Yeah, probably we have delay evaluation of such pattern variables like A, then reevaluate A after the first scan. What about leaving this (reevaluation) for now? Because: 1) we don't have CLASSIFIER 2) we don't allow to give CLASSIFIER to PREV as its arggument so I think we don't need to worry about this for now. >> Can you explain a little bit more? I think 'aaa' matches a regular >> expression 'a+(b|a)+' and should be no problem before "aab" is >> considered. > > Assuming I've understood the rules correctly, we're not allowed to > classify the last row as 'A' if it also matches 'B'. Lexicographic > ordering takes precedence, so we have to try "aab" first. Otherwise our > query could return different results compared to another implementation. > >>> Similarly, the assumption that we want to match the longest string only >>> works because we don't allow alternation yet. >> >> Can you please clarify more on this? > > Sure: for the pattern > > PATTERN ( (A|B)+ ) > > we have to consider the candidate "a" over the candidate "ba", even > though the latter is longer. Like the prior example, lexicographic > ordering is considered more important than the greedy quantifier. > Quoting ISO/IEC 9075-2:2016: > > More precisely, with both reluctant and greedy quantifiers, the set > of matches is ordered lexicographically, but when one match is an > initial substring of another match, reluctant quantifiers prefer the > shorter match (the substring), whereas greedy quantifiers prefer the > longer match (the “superstring”). > > Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority. > ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples. > > (The "lexicographic order matters more than greediness" concept was the > most mind-bending part for me so far, probably because I haven't figured > out how to translate the concept into POSIX EREs. It wouldn't make sense > to say "the letter 't' can match 'a', 'B', or '3' in this regex", but > that's what RPR is doing.) Thanks for the explanation. Surprising concet of the standard:-) Is it different from SIMILAR TO REs too? What if we don't follow the standard, instead we follow POSIX EREs? I think this is better for users unless RPR's REs has significant merit for users. >> Do you mean you want to provide a better patch for the pattern >> matching part? That will be helpfull. > > No guarantees that I'll find a better patch :D But yes, I will give it a > try. Ok. >> Because I am currently working >> on the aggregation part and have no time to do it. However, the >> aggregation work affects the v5 patch: it needs a refactoring. So can >> you wait until I release v6 patch? I hope it will be released in two >> weeks or so. > > Absolutely! Thanks. > Heh, I think it would be pretty foolish for me to code an NFA, from > scratch, and then try to convince the community to maintain it. > > But: > - I think we have to implement a parallel parser regardless (RPR PATTERN > syntax looks incompatible with POSIX) I am not sure if we need to worry about this because of the reason I mentioned above. > - I suspect we need more control over the backtracking than the current > pg_reg* API is going to give us, or else I'm worried performance is > going to fall off a cliff with usefully-large partitions Agreed. > - there's a lot of stuff in POSIX EREs that we don't need, and of the > features we do need, the + quantifier is probably one of the easiest > - it seems easier to prove the correctness of a slow, naive, > row-at-a-time engine, because we can compare it directly
Re: Row pattern recognition
On 9/8/23 21:27, Jacob Champion wrote: On 9/7/23 20:54, Tatsuo Ishii wrote: But it's easy to come up with a pattern where that's the wrong order, like PATTERN ( A+ (B|A)+ ) Now "aaa" will be considered before "aab", which isn't correct. Can you explain a little bit more? I think 'aaa' matches a regular expression 'a+(b|a)+' and should be no problem before "aab" is considered. Assuming I've understood the rules correctly, we're not allowed to classify the last row as 'A' if it also matches 'B'. Lexicographic ordering takes precedence, so we have to try "aab" first. Otherwise our query could return different results compared to another implementation. Your understanding is correct. -- Vik Fearing
Re: Row pattern recognition
On 9/7/23 20:54, Tatsuo Ishii wrote: >> DEFINE >> A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', >> ... > > But: > > UP AS price > PREV(price) > > also depends on previous row, no? PREV(CLASSIFIER()) depends not on the value of the previous row but the state of the match so far. To take an example from the patch: > * Example: > * str_set[0] = "AB"; > * str_set[1] = "AC"; > * In this case at row 0 A and B are true, and A and C are true in row 1. With these str_sets and my example DEFINE, row[1] is only classifiable as 'A' if row[0] is *not* classified as 'A' at this point in the match. "AA" is not a valid candidate string, even if it matches the PATTERN. So if we don't reevaluate the pattern variable condition for the row, we at least have to prune the combinations that search_str_set() visits, so that we don't generate a logically impossible combination. That seems like it could be pretty fragile, and it may be difficult for us to prove compliance. >> But it's easy >> to come up with a pattern where that's the wrong order, like >> >> PATTERN ( A+ (B|A)+ ) >> >> Now "aaa" will be considered before "aab", which isn't correct. > > Can you explain a little bit more? I think 'aaa' matches a regular > expression 'a+(b|a)+' and should be no problem before "aab" is > considered. Assuming I've understood the rules correctly, we're not allowed to classify the last row as 'A' if it also matches 'B'. Lexicographic ordering takes precedence, so we have to try "aab" first. Otherwise our query could return different results compared to another implementation. >> Similarly, the assumption that we want to match the longest string only >> works because we don't allow alternation yet. > > Can you please clarify more on this? Sure: for the pattern PATTERN ( (A|B)+ ) we have to consider the candidate "a" over the candidate "ba", even though the latter is longer. Like the prior example, lexicographic ordering is considered more important than the greedy quantifier. Quoting ISO/IEC 9075-2:2016: More precisely, with both reluctant and greedy quantifiers, the set of matches is ordered lexicographically, but when one match is an initial substring of another match, reluctant quantifiers prefer the shorter match (the substring), whereas greedy quantifiers prefer the longer match (the “superstring”). Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority. ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples. (The "lexicographic order matters more than greediness" concept was the most mind-bending part for me so far, probably because I haven't figured out how to translate the concept into POSIX EREs. It wouldn't make sense to say "the letter 't' can match 'a', 'B', or '3' in this regex", but that's what RPR is doing.) >> Cool, I will give this piece some more thought. Do you mind if I try to >> add some more complicated pattern quantifiers to stress the >> architecture, or would you prefer to tackle that later? Just alternation >> by itself will open up a world of corner cases. > > Do you mean you want to provide a better patch for the pattern > matching part? That will be helpfull. No guarantees that I'll find a better patch :D But yes, I will give it a try. > Because I am currently working > on the aggregation part and have no time to do it. However, the > aggregation work affects the v5 patch: it needs a refactoring. So can > you wait until I release v6 patch? I hope it will be released in two > weeks or so. Absolutely! >> But I'm wondering if we might want to just implement the NFA directly? >> The current implementation's Cartesian explosion can probably be pruned >> aggressively, but replaying the entire regex match once for every >> backtracked step will still duplicate a lot of work. > > Not sure if you mean implementing new regular expression engine > besides src/backend/regexp. I am afraid it's not a trivial work. The > current regexp code consists of over 10k lines. What do you think? Heh, I think it would be pretty foolish for me to code an NFA, from scratch, and then try to convince the community to maintain it. But: - I think we have to implement a parallel parser regardless (RPR PATTERN syntax looks incompatible with POSIX) - I suspect we need more control over the backtracking than the current pg_reg* API is going to give us, or else I'm worried performance is going to fall off a cliff with usefully-large partitions - there's a lot of stuff in POSIX EREs that we don't need, and of the features we do need, the + quantifier is probably one of the easiest - it seems easier to prove the correctness of a slow, naive, row-at-a-time engine, because we can compare it directly to the spec So what I'm thinking is: if I start by open-coding the + quantifier, and slowly add more pieces in, then it might be easier to see the parts of src/backend/regex that I've duplicated. We can try to expose those parts directly from the
Re: Row pattern recognition
Hi, > Hello! > >> (1) I completely changed the pattern matching engine so that it >> performs backtracking. Now the engine evaluates all pattern elements >> defined in PATTER against each row, saving matched pattern variables >> in a string per row. For example if the pattern element A and B >> evaluated to true, a string "AB" is created for current row. > > If I understand correctly, this strategy assumes that one row's > membership in a pattern variable is independent of the other rows' > membership. But I don't think that's necessarily true: > > DEFINE > A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', > ... But: UP AS price > PREV(price) also depends on previous row, no? Can you please elaborate how your example could break current implementation? I cannot test it because CLASSIFIER is not implemented yet. >> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse >> in nodeWindowAggs.c for more details. For now I use a naive depth >> first search and probably there is a lot of rooms for optimization >> (for example rewriting it without using >> recursion). > > The depth-first match is doing a lot of subtle work here. For example, with > > PATTERN ( A+ B+ ) > DEFINE A AS TRUE, >B AS TRUE > > (i.e. all rows match both variables), and three rows in the partition, > our candidates will be tried in the order > > aaa > aab > aba > abb > ... > bbb > > The only possible matches against our regex `^a+b+` are "aab" and "abb", > and that order matches the preferment order, so it's fine. But it's easy > to come up with a pattern where that's the wrong order, like > > PATTERN ( A+ (B|A)+ ) > > Now "aaa" will be considered before "aab", which isn't correct. Can you explain a little bit more? I think 'aaa' matches a regular expression 'a+(b|a)+' and should be no problem before "aab" is considered. > Similarly, the assumption that we want to match the longest string only > works because we don't allow alternation yet. Can you please clarify more on this? > Cool, I will give this piece some more thought. Do you mind if I try to > add some more complicated pattern quantifiers to stress the > architecture, or would you prefer to tackle that later? Just alternation > by itself will open up a world of corner cases. Do you mean you want to provide a better patch for the pattern matching part? That will be helpfull. Because I am currently working on the aggregation part and have no time to do it. However, the aggregation work affects the v5 patch: it needs a refactoring. So can you wait until I release v6 patch? I hope it will be released in two weeks or so. > On 8/9/23 01:41, Tatsuo Ishii wrote: >> - I found a bug with pattern matching code. It creates a string for >> subsequent regular expression matching. It uses the initial letter >> of each define variable name. For example, if the varname is "foo", >> then "f" is used. Obviously this makes trouble if we have two or >> more variables starting with same "f" (e.g. "food"). To fix this, I >> assign [a-z] to each variable instead of its initial letter. However >> this way limits us not to have more than 26 variables. I hope 26 is >> enough for most use cases. > > There are still plenty of alphanumerics left that could be assigned... > > But I'm wondering if we might want to just implement the NFA directly? > The current implementation's Cartesian explosion can probably be pruned > aggressively, but replaying the entire regex match once for every > backtracked step will still duplicate a lot of work. Not sure if you mean implementing new regular expression engine besides src/backend/regexp. I am afraid it's not a trivial work. The current regexp code consists of over 10k lines. What do you think? > I've attached another test case; it looks like last_value() is depending > on some sort of side effect from either first_value() or nth_value(). I > know the window frame itself is still under construction, so apologies > if this is an expected failure. Thanks. Fortunately current code which I am working passes the new test. I will include it in the next (v6) patch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
Hello! > (1) I completely changed the pattern matching engine so that it > performs backtracking. Now the engine evaluates all pattern elements > defined in PATTER against each row, saving matched pattern variables > in a string per row. For example if the pattern element A and B > evaluated to true, a string "AB" is created for current row. If I understand correctly, this strategy assumes that one row's membership in a pattern variable is independent of the other rows' membership. But I don't think that's necessarily true: DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', ... > See row_is_in_reduced_frame, search_str_set and search_str_set_recurse > in nodeWindowAggs.c for more details. For now I use a naive depth > first search and probably there is a lot of rooms for optimization > (for example rewriting it without using > recursion). The depth-first match is doing a lot of subtle work here. For example, with PATTERN ( A+ B+ ) DEFINE A AS TRUE, B AS TRUE (i.e. all rows match both variables), and three rows in the partition, our candidates will be tried in the order aaa aab aba abb ... bbb The only possible matches against our regex `^a+b+` are "aab" and "abb", and that order matches the preferment order, so it's fine. But it's easy to come up with a pattern where that's the wrong order, like PATTERN ( A+ (B|A)+ ) Now "aaa" will be considered before "aab", which isn't correct. Similarly, the assumption that we want to match the longest string only works because we don't allow alternation yet. > Suggestions/patches are welcome. Cool, I will give this piece some more thought. Do you mind if I try to add some more complicated pattern quantifiers to stress the architecture, or would you prefer to tackle that later? Just alternation by itself will open up a world of corner cases. > With the new engine, cases above do not fail anymore. See new > regression test cases. Thanks for providing valuable test cases! You're very welcome! On 8/9/23 01:41, Tatsuo Ishii wrote: > - I found a bug with pattern matching code. It creates a string for > subsequent regular expression matching. It uses the initial letter > of each define variable name. For example, if the varname is "foo", > then "f" is used. Obviously this makes trouble if we have two or > more variables starting with same "f" (e.g. "food"). To fix this, I > assign [a-z] to each variable instead of its initial letter. However > this way limits us not to have more than 26 variables. I hope 26 is > enough for most use cases. There are still plenty of alphanumerics left that could be assigned... But I'm wondering if we might want to just implement the NFA directly? The current implementation's Cartesian explosion can probably be pruned aggressively, but replaying the entire regex match once for every backtracked step will still duplicate a lot of work. -- I've attached another test case; it looks like last_value() is depending on some sort of side effect from either first_value() or nth_value(). I know the window frame itself is still under construction, so apologies if this is an expected failure. Thanks! --Jacobdiff --git a/src/test/regress/expected/rpr.out b/src/test/regress/expected/rpr.out index 340ccf242c..bd35e8389e 100644 --- a/src/test/regress/expected/rpr.out +++ b/src/test/regress/expected/rpr.out @@ -89,6 +89,44 @@ SELECT company, tdate, price, first_value(price) OVER w, last_value(price) OVER company2 | 07-10-2023 | 1300 | || (20 rows) +-- last_value() should remain consistent +SELECT company, tdate, price, last_value(price) OVER w + FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + INITIAL + PATTERN (START UP+ DOWN+) + DEFINE + START AS TRUE, + UP AS price > PREV(price), + DOWN AS price < PREV(price) +); + company | tdate| price | last_value +--++---+ + company1 | 07-01-2023 | 100 |140 + company1 | 07-02-2023 | 200 | + company1 | 07-03-2023 | 150 | + company1 | 07-04-2023 | 140 | + company1 | 07-05-2023 | 150 | + company1 | 07-06-2023 |90 |120 + company1 | 07-07-2023 | 110 | + company1 | 07-08-2023 | 130 | + company1 | 07-09-2023 | 120 | + company1 | 07-10-2023 | 130 | + company2 | 07-01-2023 |50 | 1400 + company2 | 07-02-2023 | 2000 | + company2 | 07-03-2023 | 1500 | + company2 | 07-04-2023 | 1400 | + company2 | 07-05-2023 | 1500 | + company2 | 07-06-2023 |60 | 1200 + company2 | 07-07-2023 | 1100 | + company2 | 07-08-2023 | 1300 | + company2 | 07-09-2023 | 1200 | + company2 | 07-10-2023 | 1300 | +(20 rows) + -- omit "START" in DEFINE but it is ok because "START
Re: Row pattern recognition
> Hi, > > The patches compile & tests run fine but this statement from the > documentation crashes an assert-enabled server: > > SELECT company, tdate, price, max(price) OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (LOWPRICE UP+ DOWN+) > DEFINE > LOWPRICE AS price <= 100, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > connection to server was lost Thank you for the report. Currently the patch has an issue with aggregate functions including max. I have been working on aggregations in row pattern recognition but will take more time to complete the part. In the mean time if you want to play with RPR, you can try window functions. Examples can be found in src/test/regress/sql/rpr.sql. Here is one of this: -- the first row start with less than or equal to 100 SELECT company, tdate, price, first_value(price) OVER w, last_value(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) ); -- second row raises 120% SELECT company, tdate, price, first_value(price) OVER w, last_value(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price) * 1.2, DOWN AS price < PREV(price) ); Sorry for inconvenience. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
Op 9/2/23 om 08:52 schreef Tatsuo Ishii: Attached is the v5 patch. Differences from previous patch include: Hi, The patches compile & tests run fine but this statement from the documentation crashes an assert-enabled server: SELECT company, tdate, price, max(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) ); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. connection to server was lost Log file: TRAP: failed Assert("aggregatedupto_nonrestarted <= winstate->aggregatedupto"), File: "nodeWindowAgg.c", Line: 1054, PID: 68975 postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(ExceptionalCondition+0x54)[0x9b0824] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x71ae8d] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(standard_ExecutorRun+0x13a)[0x6def9a] /home/aardvark/pg_stuff/pg_installations/pgsql.rpr/lib/pg_stat_statements.so(+0x55e5)[0x7ff3798b95e5] /home/aardvark/pg_stuff/pg_installations/pgsql.rpr/lib/auto_explain.so(+0x2680)[0x7ff3798ab680] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x88a4ff] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(PortalRun+0x240)[0x88bb50] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x887cca] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(PostgresMain+0x14dc)[0x88958c] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x7fb0da] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(PostmasterMain+0xd2d)[0x7fc01d] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(main+0x1e0)[0x5286d0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7ff378e9dd0a] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(_start+0x2a)[0x5289aa] 2023-09-02 19:59:05.329 CEST 46723 LOG: server process (PID 68975) was terminated by signal 6: Aborted 2023-09-02 19:59:05.329 CEST 46723 DETAIL: Failed process was running: SELECT company, tdate, price, max(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) ); 2023-09-02 19:59:05.329 CEST 46723 LOG: terminating any other active server processes Erik Rijkers
Re: Row pattern recognition
Attached is the v5 patch. Differences from previous patch include: * Resolve complaint from "PostgreSQL Patch Tester" https://commitfest.postgresql.org/44/4460/ - Change gram.y to use PATTERN_P instead of PATTERN. Using PATTERN seems to make trouble with Visual Studio build. : : [10:07:57.853] FAILED: src/backend/parser/parser.a.p/meson-generated_.._gram.c.obj [10:07:57.853] "cl" "-Isrc\backend\parser\parser.a.p" "-Isrc\backend\parser" "-I..\src\backend\parser" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "-DBUILDING_DLL" "/Fdsrc\backend\parser\parser.a.p\meson-generated_.._gram.c.pdb" /Fosrc/backend/parser/parser.a.p/meson-generated_.._gram.c.obj "/c" src/backend/parser/gram.c [10:07:57.860] c:\cirrus\build\src\backend\parser\gram.h(379): error C2365: 'PATTERN': redefinition; previous definition was 'typedef' [10:07:57.860] C:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um\wingdi.h(1375): note: see declaration of 'PATTERN' [10:07:57.860] c:\cirrus\build\src\backend\parser\gram.h(379): error C2086: 'yytokentype PATTERN': redefinition [10:07:57.860] c:\cirrus\build\src\backend\parser\gram.h(379): note: see declaration of 'PATTERN' [10:07:57.860] ninja: build stopped: subcommand failed. * Resolve complaint from "make headerscheck" - Change Windowapi.h and nodeWindowAgg.c to remove unecessary extern and public functions. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From 3e02bccbd3dc02304d6dc34f5ab6cc6dd2ee26d1 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Sat, 2 Sep 2023 15:32:49 +0900 Subject: [PATCH v5 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 216 +--- src/include/nodes/parsenodes.h | 56 + src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 267 insertions(+), 14 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 7d2032885e..70409cdc9a 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -453,8 +455,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +557,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition %type generic_option_name %type generic_option_arg @@ -633,6 +640,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +669,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +711,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRE
Re: Row pattern recognition
Attached is the v4 patch. Differences from previous patch include: > - PERMUTE is still misspelled as PREMUTE Fixed. > - PATTERN variables do not have to exist in the DEFINE clause. They are > - considered TRUE if not present. Fixed. Moreover new regression test case is added. - It was possible that tle nodes in DEFINE clause do not appear in the plan's target list. This makes impossible to evaluate expressions in the DEFINE because it does not appear in the outer plan's target list. To fix this, call findTargetlistEntrySQL99 (with resjunk is true) so that the missing TargetEntry is added to the outer plan later on. - I eliminated some hacks in handling the Var node in DEFINE clause. Previously I replaced varattno of Var node in a plan tree by hand so that it refers to correct varattno in the outer plan node. In this patch I modified set_upper_references so that it calls fix_upper_expr for those Var nodes in the DEFINE clause. See v4-0003 patch for more details. - I found a bug with pattern matching code. It creates a string for subsequent regular expression matching. It uses the initial letter of each define variable name. For example, if the varname is "foo", then "f" is used. Obviously this makes trouble if we have two or more variables starting with same "f" (e.g. "food"). To fix this, I assign [a-z] to each variable instead of its initial letter. However this way limits us not to have more than 26 variables. I hope 26 is enough for most use cases. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From 5c6430e171ab9f9a75df92fac25949cb96fe1da2 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Wed, 9 Aug 2023 16:56:17 +0900 Subject: [PATCH v4 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 216 +--- src/include/nodes/parsenodes.h | 56 + src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 267 insertions(+), 14 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 15ece871a0..62c1919538 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -453,8 +455,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +557,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition %type generic_option_name %type generic_option_arg @@ -633,6 +640,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +669,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +711,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC + DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP @@ -718,7 +727,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
Re: Row pattern recognition
>>> - PATTERN variables do not have to exist in the DEFINE clause. They are >>> - considered TRUE if not present. >> Do you think we really need this? I found a criticism regarding this. >> https://link.springer.com/article/10.1007/s13222-022-00404-3 >> "3.2 Explicit Definition of All Row Pattern Variables" >> What do you think? > > I think that a large part of obeying the standard is to allow queries > from other engines to run the same on ours. The standard does not > require the pattern variables to be defined and so there are certainly > queries out there without them, and that hurts migrating to > PostgreSQL. Yeah, migration is good point. I agree we should have the feature. >>> When we get to adding count in the MEASURES clause, there will be a >>> difference between no match and empty match, but that does not apply >>> here. >> Can you elaborate more? I understand that "no match" and "empty match" >> are different things. But I do not understand how the difference >> affects the result of count. > > This query: > > SELECT v.a, wcnt OVER w, count(*) OVER w > FROM (VALUES ('A')) AS v (a) > WINDOW w AS ( > ORDER BY v.a > MEASURES count(*) AS wcnt > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (B) > DEFINE B AS B.a = 'B' > ) > > produces this result: > > a | wcnt | count > ---+--+--- > A | | 0 > (1 row) > > Inside the window specification, *no match* was found and so all of > the MEASURES are null. The count(*) in the target list however, still > exists and operates over zero rows. > > This very similar query: > > SELECT v.a, wcnt OVER w, count(*) OVER w > FROM (VALUES ('A')) AS v (a) > WINDOW w AS ( > ORDER BY v.a > MEASURES count(*) AS wcnt > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (B?) > DEFINE B AS B.a = 'B' > ) > > produces this result: > > a | wcnt | count > ---+--+--- > A |0 | 0 > (1 row) > > In this case, the pattern is B? instead of just B, which produces an > *empty match* for the MEASURES to be applied over. Thank you for the detailed explanation. I think I understand now. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On 7/28/23 13:02, Tatsuo Ishii wrote: Attached is the v3 patch. In this patch following changes are made. - PATTERN variables do not have to exist in the DEFINE clause. They are - considered TRUE if not present. Do you think we really need this? I found a criticism regarding this. https://link.springer.com/article/10.1007/s13222-022-00404-3 "3.2 Explicit Definition of All Row Pattern Variables" What do you think? I think that a large part of obeying the standard is to allow queries from other engines to run the same on ours. The standard does not require the pattern variables to be defined and so there are certainly queries out there without them, and that hurts migrating to PostgreSQL. - I am working on making window aggregates RPR aware now. The implementation is in progress and far from completeness. An example is below. I think row 2, 3, 4 of "count" column should be NULL instead of 3, 2, 0, 0. Same thing can be said to other rows. Probably this is an effect of moving aggregate but I still studying the window aggregation code. This tells me again that RPR is not being run in the right place. All windowed aggregates and frame-level window functions should Just Work with no modification. I am not touching each aggregate function. I am modifying eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate function. Do you think it's not the right place to make window aggregates RPR aware? Oh, okay. SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock WINDOW w AS ( PARTITION BY company ORDER BY tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+ DOWN+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); company | tdate| first_value | count --++-+--- company1 | 2023-07-01 | 100 | 4 company1 | 2023-07-02 | | 3 company1 | 2023-07-03 | | 2 company1 | 2023-07-04 | | 0 company1 | 2023-07-05 | | 0 company1 | 2023-07-06 | 90 | 4 company1 | 2023-07-07 | | 3 company1 | 2023-07-08 | | 2 company1 | 2023-07-09 | | 0 company1 | 2023-07-10 | | 0 company2 | 2023-07-01 | 50 | 4 company2 | 2023-07-02 | | 3 company2 | 2023-07-03 | | 2 company2 | 2023-07-04 | | 0 company2 | 2023-07-05 | | 0 company2 | 2023-07-06 | 60 | 4 company2 | 2023-07-07 | | 3 company2 | 2023-07-08 | | 2 company2 | 2023-07-09 | | 0 company2 | 2023-07-10 | | 0 In this scenario, row 1's frame is the first 5 rows and specified SKIP PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are skipped) and the result of the outer count should be 0 for all of them because there are no rows in the frame. Ok. Just I want to make sure. If it's other aggregates like sum or avg, the result of the outer aggregates should be NULL. They all behave the same way as in a normal query when they receive no rows as input. When we get to adding count in the MEASURES clause, there will be a difference between no match and empty match, but that does not apply here. Can you elaborate more? I understand that "no match" and "empty match" are different things. But I do not understand how the difference affects the result of count. This query: SELECT v.a, wcnt OVER w, count(*) OVER w FROM (VALUES ('A')) AS v (a) WINDOW w AS ( ORDER BY v.a MEASURES count(*) AS wcnt ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (B) DEFINE B AS B.a = 'B' ) produces this result: a | wcnt | count ---+--+--- A | | 0 (1 row) Inside the window specification, *no match* was found and so all of the MEASURES are null. The count(*) in the target list however, still exists and operates over zero rows. This very similar query: SELECT v.a, wcnt OVER w, count(*) OVER w FROM (VALUES ('A')) AS v (a) WINDOW w AS ( ORDER BY v.a MEASURES count(*) AS wcnt ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (B?) DEFINE B AS B.a = 'B' ) produces this result: a | wcnt | count ---+--+--- A |0 | 0 (1 row) In this case, the pattern is B? instead of just B, which produces an *empty match* for the MEASURES to be applied over. -- Vik Fearing
Re: Row pattern recognition
>> Attached is the v3 patch. In this patch following changes are made. > > Excellent. Thanks! You are welcome! > A few quick comments: > > - PERMUTE is still misspelled as PREMUTE Oops. Will fix. > - PATTERN variables do not have to exist in the DEFINE clause. They are > - considered TRUE if not present. Do you think we really need this? I found a criticism regarding this. https://link.springer.com/article/10.1007/s13222-022-00404-3 "3.2 Explicit Definition of All Row Pattern Variables" What do you think? >> - I am working on making window aggregates RPR aware now. The >>implementation is in progress and far from completeness. An example >>is below. I think row 2, 3, 4 of "count" column should be NULL >>instead of 3, 2, 0, 0. Same thing can be said to other >>rows. Probably this is an effect of moving aggregate but I still >>studying the window aggregation code. > > This tells me again that RPR is not being run in the right place. All > windowed aggregates and frame-level window functions should Just Work > with no modification. I am not touching each aggregate function. I am modifying eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate function. Do you think it's not the right place to make window aggregates RPR aware? >> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM >> stock >> WINDOW w AS ( >> PARTITION BY company >> ORDER BY tdate >> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> AFTER MATCH SKIP PAST LAST ROW >> INITIAL >> PATTERN (START UP+ DOWN+) >> DEFINE >>START AS TRUE, >>UP AS price > PREV(price), >>DOWN AS price < PREV(price) >> ); >> company | tdate| first_value | count >> --++-+--- >> company1 | 2023-07-01 | 100 | 4 >> company1 | 2023-07-02 | | 3 >> company1 | 2023-07-03 | | 2 >> company1 | 2023-07-04 | | 0 >> company1 | 2023-07-05 | | 0 >> company1 | 2023-07-06 | 90 | 4 >> company1 | 2023-07-07 | | 3 >> company1 | 2023-07-08 | | 2 >> company1 | 2023-07-09 | | 0 >> company1 | 2023-07-10 | | 0 >> company2 | 2023-07-01 | 50 | 4 >> company2 | 2023-07-02 | | 3 >> company2 | 2023-07-03 | | 2 >> company2 | 2023-07-04 | | 0 >> company2 | 2023-07-05 | | 0 >> company2 | 2023-07-06 | 60 | 4 >> company2 | 2023-07-07 | | 3 >> company2 | 2023-07-08 | | 2 >> company2 | 2023-07-09 | | 0 >> company2 | 2023-07-10 | | 0 > > In this scenario, row 1's frame is the first 5 rows and specified SKIP > PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are > skipped) and the result of the outer count should be 0 for all of them > because there are no rows in the frame. Ok. Just I want to make sure. If it's other aggregates like sum or avg, the result of the outer aggregates should be NULL. > When we get to adding count in the MEASURES clause, there will be a > difference between no match and empty match, but that does not apply > here. Can you elaborate more? I understand that "no match" and "empty match" are different things. But I do not understand how the difference affects the result of count. >> I am going to add this thread to CommitFest and plan to add both of >> you as reviewers. Thanks in advance. > > My pleasure. Thank you for working on this difficult feature. Thank you for accepting being registered as a reviewer. Your comments are really helpful. -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On 7/26/23 14:21, Tatsuo Ishii wrote: Attached is the v3 patch. In this patch following changes are made. Excellent. Thanks! A few quick comments: - PERMUTE is still misspelled as PREMUTE - PATTERN variables do not have to exist in the DEFINE clause. They are considered TRUE if not present. (1) I completely changed the pattern matching engine so that it performs backtracking. Now the engine evaluates all pattern elements defined in PATTER against each row, saving matched pattern variables in a string per row. For example if the pattern element A and B evaluated to true, a string "AB" is created for current row. This continues until all pattern matching fails or encounters the end of full window frame/partition. After that, the pattern matching engine creates all possible "pattern strings" and apply the regular expression matching to each. For example if we have row 0 = "AB" row 1 = "C", possible pattern strings are: "AC" and "BC". If it matches, the length of matching substring is saved. After all possible trials are done, the longest matching substring is chosen and it becomes the width (number of rows) in the reduced window frame. See row_is_in_reduced_frame, search_str_set and search_str_set_recurse in nodeWindowAggs.c for more details. For now I use a naive depth first search and probably there is a lot of rooms for optimization (for example rewriting it without using recursion). Suggestions/patches are welcome. My own reviews will only focus on correctness for now. Once we get a good set of regression tests all passing, I will focus more on optimization. Of course, others might want to review the performance now. Vik Fearing wrote: I strongly disagree with this. Window function do not need to know how the frame is defined, and indeed they should not. WinGetFuncArgInFrame should answer yes or no and the window function just works on that. Otherwise we will get extension (and possibly even core) functions that don't handle the frame properly. So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that those window functions are not needed to be changed. (3) Window function rpr was removed. We can use first_value instead. Excellent. (4) Remaining tasks/issues. - For now I disable WinSetMarkPosition because RPR often needs to access a row before the mark is set. We need to fix this in the future. Noted, and agreed. - I am working on making window aggregates RPR aware now. The implementation is in progress and far from completeness. An example is below. I think row 2, 3, 4 of "count" column should be NULL instead of 3, 2, 0, 0. Same thing can be said to other rows. Probably this is an effect of moving aggregate but I still studying the window aggregation code. This tells me again that RPR is not being run in the right place. All windowed aggregates and frame-level window functions should Just Work with no modification. SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock WINDOW w AS ( PARTITION BY company ORDER BY tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+ DOWN+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); company | tdate| first_value | count --++-+--- company1 | 2023-07-01 | 100 | 4 company1 | 2023-07-02 | | 3 company1 | 2023-07-03 | | 2 company1 | 2023-07-04 | | 0 company1 | 2023-07-05 | | 0 company1 | 2023-07-06 | 90 | 4 company1 | 2023-07-07 | | 3 company1 | 2023-07-08 | | 2 company1 | 2023-07-09 | | 0 company1 | 2023-07-10 | | 0 company2 | 2023-07-01 | 50 | 4 company2 | 2023-07-02 | | 3 company2 | 2023-07-03 | | 2 company2 | 2023-07-04 | | 0 company2 | 2023-07-05 | | 0 company2 | 2023-07-06 | 60 | 4 company2 | 2023-07-07 | | 3 company2 | 2023-07-08 | | 2 company2 | 2023-07-09 | | 0 company2 | 2023-07-10 | | 0 In this scenario, row 1's frame is the first 5 rows and specified SKIP PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are skipped) and the result of the outer count should be 0 for all of them because there are no rows in the frame. When we get to adding count in the MEASURES clause, there will be a difference between no match and empty match, but that does not apply here. I am going to add this thread to CommitFest and plan to add both of you as reviewers. Thanks in advance. My pleasure. Thank you for working on this difficult feature. -- Vik Fearing
Re: Row pattern recognition
On 7/28/23 09:09, Tatsuo Ishii wrote: We already recalculate a frame each time a row is processed even without RPR. See ExecWindowAgg. Yes, after each row. Not for each function. Ok, I understand now. Closer look at the code, I realized that each window function calls update_frameheadpos, which computes the frame head position. But actually it checks winstate->framehead_valid and if it's already true (probably by other window function), then it does nothing. Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW, which means the frame head is changed each time current row position changes. Off topic for now: I wonder why this restriction is in place and whether we should respect or ignore it. That is a discussion for another time, though. My guess is, it is because other than ROWS BETWEEN CURRENT ROW has little or no meaning. Consider following example: Yes, that makes sense. I strongly disagree with this. Window function do not need to know how the frame is defined, and indeed they should not. We already break the rule by defining *support functions. See windowfuncs.c. The support functions don't know anything about the frame, they just know when a window function is monotonically increasing and execution can either stop or be "passed through". I see following code in window_row_number_support: /* * The frame options can always become "ROWS BETWEEN UNBOUNDED * PRECEDING AND CURRENT ROW". row_number() always just increments by * 1 with each row in the partition. Using ROWS instead of RANGE * saves effort checking peer rows during execution. */ req->frameOptions = (FRAMEOPTION_NONDEFAULT | FRAMEOPTION_ROWS | FRAMEOPTION_START_UNBOUNDED_PRECEDING | FRAMEOPTION_END_CURRENT_ROW); I think it not only knows about frame but it even changes the frame options. This seems far from "don't know anything about the frame", no? That's the planner support function. The row_number() function itself is not even allowed to *have* a frame, per spec. We allow it, but as you can see from that support function, we completely replace it. So all of the partition-level window functions are not affected by RPR anyway. I have two comments about this: It isn't just for convenience, it is for correctness. The window functions do not need to know which rows they are *not* operating on. There is no such thing as a "full" or "reduced" frame. The standard uses those terms to explain the difference between before and after RPR is applied, but window functions do not get to choose which frame they apply over. They only ever apply over the reduced window frame. I agree that "full window frame" and "reduced window frame" do not exist at the same time, and in the end (after computation of reduced frame), only "reduced" frame is visible to window functions/aggregates. But I still do think that "full window frame" and "reduced window frame" are important concept to explain/understand how PRP works. If we are just using those terms for documentation, then okay. -- Vik Fearing
Re: Row pattern recognition
> I am going to add this thread to CommitFest and plan to add both of > you as reviewers. Thanks in advance. Done. https://commitfest.postgresql.org/44/4460/ Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
ce) ); ERROR: attribute number 3 exceeds number of columns 2 This is because attributes appearing in DEFINE are not added to the target list. I am looking for way to teach planner to add attributes appearing in DEFINE. I am going to add this thread to CommitFest and plan to add both of you as reviewers. Thanks in advance. -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From 16cab4e23f38b447a814773fea83a74c337a08d5 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Wed, 26 Jul 2023 19:49:09 +0900 Subject: [PATCH v3 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 218 +--- src/include/nodes/parsenodes.h | 54 src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 266 insertions(+), 15 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 856d5dee0e..a4c97c28ff 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -453,8 +455,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +557,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition %type generic_option_name %type generic_option_arg @@ -633,6 +640,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -659,7 +669,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); json_object_constructor_null_clause_opt json_array_constructor_null_clause_opt - /* * Non-keyword token types. These are hard-wired into the "flex" lexer. * They must be listed first so that their numeric codes do not depend on @@ -702,7 +711,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC + DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP @@ -718,7 +727,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); HANDLER HAVING HEADER_P HOLD HOUR_P IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE - INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P + INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIAL INITIALLY INLINE_P INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION @@ -731,7 +740,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED - MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD + MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MEASURES MERGE METHOD MINUTE_P MINVALUE MODE MONTH_P MOVE NAME_P NAMES NATIONAL NATURAL NCHAR NEW NEXT NFC NFD NFKC NFKD NO NONE @@ -743,9 +752,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); ORDER ORDINALITY OTHERS OUT_P OUTER_P OVER OVERLAPS OVERLAY OVERRIDING OWNED OWNER - PARALLEL PARAMETER PARSER PARTIAL PARTITION PASSING PASSWORD - PLACING PLANS POLICY - POSITION PRECEDING PRECISION PRESERVE PREPARE PREPARED PRIMARY + PARALLEL PARAMETER PARSER PARTIAL PARTITI
Re: Row pattern recognition
Hi, > diff --git a/src/test/regress/expected/rpr.out > b/src/test/regress/expected/rpr.out > index 6bf8818911..f3fd22de2a 100644 > --- a/src/test/regress/expected/rpr.out > +++ b/src/test/regress/expected/rpr.out > @@ -230,6 +230,79 @@ SELECT company, tdate, price, rpr(price) OVER w FROM > stock > company2 | 07-10-2023 | 1300 | > (20 rows) > > +-- match everything > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW It seems it's a result with AFTER MATCH SKIP PAST LAST ROW. > + INITIAL > + PATTERN (A+) > + DEFINE > + A AS TRUE > +); > + company | tdate| price | rpr > +--++---+- > + company1 | 07-01-2023 | 100 | 100 > + company1 | 07-02-2023 | 200 | > + company1 | 07-03-2023 | 150 | > + company1 | 07-04-2023 | 140 | > + company1 | 07-05-2023 | 150 | > + company1 | 07-06-2023 |90 | > + company1 | 07-07-2023 | 110 | > + company1 | 07-08-2023 | 130 | > + company1 | 07-09-2023 | 120 | > + company1 | 07-10-2023 | 130 | > + company2 | 07-01-2023 |50 | 50 > + company2 | 07-02-2023 | 2000 | > + company2 | 07-03-2023 | 1500 | > + company2 | 07-04-2023 | 1400 | > + company2 | 07-05-2023 | 1500 | > + company2 | 07-06-2023 |60 | > + company2 | 07-07-2023 | 1100 | > + company2 | 07-08-2023 | 1300 | > + company2 | 07-09-2023 | 1200 | > + company2 | 07-10-2023 | 1300 | > +(20 rows) > + > +-- backtracking with reclassification of rows > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW > + INITIAL > + PATTERN (A+ B+) > + DEFINE > + A AS price > 100, > + B AS price > 100 > +); > + company | tdate| price | rpr > +--++---+-- > + company1 | 07-01-2023 | 100 | > + company1 | 07-02-2023 | 200 | 200 > + company1 | 07-03-2023 | 150 | > + company1 | 07-04-2023 | 140 | > + company1 | 07-05-2023 | 150 | > + company1 | 07-06-2023 |90 | > + company1 | 07-07-2023 | 110 | 110 > + company1 | 07-08-2023 | 130 | > + company1 | 07-09-2023 | 120 | > + company1 | 07-10-2023 | 130 | > + company2 | 07-01-2023 |50 | > + company2 | 07-02-2023 | 2000 | 2000 > + company2 | 07-03-2023 | 1500 | > + company2 | 07-04-2023 | 1400 | > + company2 | 07-05-2023 | 1500 | > + company2 | 07-06-2023 |60 | > + company2 | 07-07-2023 | 1100 | 1100 > + company2 | 07-08-2023 | 1300 | > + company2 | 07-09-2023 | 1200 | > + company2 | 07-10-2023 | 1300 | > +(20 rows) > + > -- > -- Error cases > -- > diff --git a/src/test/regress/sql/rpr.sql b/src/test/regress/sql/rpr.sql > index 951c9abfe9..f1cd0369f4 100644 > --- a/src/test/regress/sql/rpr.sql > +++ b/src/test/regress/sql/rpr.sql > @@ -94,6 +94,33 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock >UPDOWN AS price > PREV(price) AND price > NEXT(price) > ); > > +-- match everything > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW > + INITIAL > + PATTERN (A+) > + DEFINE > + A AS TRUE > +); > + > +-- backtracking with reclassification of rows > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW > + INITIAL > + PATTERN (A+ B+) > + DEFINE > + A AS price > 100, > + B AS price > 100 > +); > + > -- > -- Error cases > --
Re: Row pattern recognition
On 7/24/23 02:22, Tatsuo Ishii wrote: What we are talking about here is *defining* a window frame. Well, we are defining a "reduced" window frame within a (full) window frame. A "reduced" window frame is calculated each time when a window function is called. Why? It should only be recalculated when the current row changes and we need a new frame. The reduced window frame *is* the window frame for all functions over that window. We already recalculate a frame each time a row is processed even without RPR. See ExecWindowAgg. Yes, after each row. Not for each function. Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW, which means the frame head is changed each time current row position changes. Off topic for now: I wonder why this restriction is in place and whether we should respect or ignore it. That is a discussion for another time, though. I strongly disagree with this. Window function do not need to know how the frame is defined, and indeed they should not. We already break the rule by defining *support functions. See windowfuncs.c. The support functions don't know anything about the frame, they just know when a window function is monotonically increasing and execution can either stop or be "passed through". WinGetFuncArgInFrame should answer yes or no and the window function just works on that. Otherwise we will get extension (and possibly even core) functions that don't handle the frame properly. Maybe I can move row_is_in_reduced_frame into WinGetFuncArgInFrame just for convenience. I have two comments about this: It isn't just for convenience, it is for correctness. The window functions do not need to know which rows they are *not* operating on. There is no such thing as a "full" or "reduced" frame. The standard uses those terms to explain the difference between before and after RPR is applied, but window functions do not get to choose which frame they apply over. They only ever apply over the reduced window frame. -- Vik Fearing
Re: Row pattern recognition
>>> What we are talking about here is *defining* a window >>> frame. >> Well, we are defining a "reduced" window frame within a (full) window >> frame. A "reduced" window frame is calculated each time when a window >> function is called. > > > Why? It should only be recalculated when the current row changes and > we need a new frame. The reduced window frame *is* the window frame > for all functions over that window. We already recalculate a frame each time a row is processed even without RPR. See ExecWindowAgg. Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW, which means the frame head is changed each time current row position changes. > I strongly disagree with this. Window function do not need to know > how the frame is defined, and indeed they should not. We already break the rule by defining *support functions. See windowfuncs.c. > WinGetFuncArgInFrame should answer yes or no and the window function > just works on that. Otherwise we will get extension (and possibly even > core) functions that don't handle the frame properly. Maybe I can move row_is_in_reduced_frame into WinGetFuncArgInFrame just for convenience. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On 7/22/23 08:14, Tatsuo Ishii wrote: On 7/22/23 03:11, Tatsuo Ishii wrote: Maybe. Suppose a window function executes row pattern matching using price > PREV(price). The window function already receives WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to PREV, we could let PREV do the real work (getting previous tuple). I have not tried this yet, though. I don't understand this logic. Window functions work over a window frame. Yes. What we are talking about here is *defining* a window frame. Well, we are defining a "reduced" window frame within a (full) window frame. A "reduced" window frame is calculated each time when a window function is called. Why? It should only be recalculated when the current row changes and we need a new frame. The reduced window frame *is* the window frame for all functions over that window. How can a window function execute row pattern matching? A window function is called for each row fed by an outer plan. It fetches current, previous and next row to execute pattern matching. If it matches, the window function moves to next row and repeat the process, until pattern match fails. Below is an example window function to execute pattern matching (I will include this in the v3 patch). row_is_in_reduced_frame() is a function to execute pattern matching. It returns the number of rows in the reduced frame if pattern match succeeds. If succeeds, the function returns the last row in the reduced frame instead of the last row in the full window frame. I strongly disagree with this. Window function do not need to know how the frame is defined, and indeed they should not. WinGetFuncArgInFrame should answer yes or no and the window function just works on that. Otherwise we will get extension (and possibly even core) functions that don't handle the frame properly. -- Vik Fearing
Re: Row pattern recognition
> On 7/22/23 03:11, Tatsuo Ishii wrote: > Maybe > that can help clarify the design? It feels like it'll need to > eventually > be a "real" function that operates on the window state, even if it > doesn't support all of the possible complexities in v1. Unfortunately an window function can not call other window functions. >>> Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? > >> I am not sure at this point. Current PostgreSQL executor creates >> WindowStatePerFuncData for each window function and aggregate >> appearing in OVER clause. This means PREV/NEXT and other row pattern >> navigation operators cannot have their own WindowStatePerFuncData if >> they do not appear in OVER clauses in a query even if PREV/NEXT >> etc. are defined as window function. >> >>> Or >>> does it make sense to split the pattern navigation "functions" into >>> their own new concept, and maybe borrow some of the window function >>> infrastructure for it? > >> Maybe. Suppose a window function executes row pattern matching using >> price > PREV(price). The window function already receives >> WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to >> PREV, we could let PREV do the real work (getting previous tuple). >> I have not tried this yet, though. > > > I don't understand this logic. Window functions work over a window > frame. Yes. > What we are talking about here is *defining* a window > frame. Well, we are defining a "reduced" window frame within a (full) window frame. A "reduced" window frame is calculated each time when a window function is called. > How can a window function execute row pattern matching? A window function is called for each row fed by an outer plan. It fetches current, previous and next row to execute pattern matching. If it matches, the window function moves to next row and repeat the process, until pattern match fails. Below is an example window function to execute pattern matching (I will include this in the v3 patch). row_is_in_reduced_frame() is a function to execute pattern matching. It returns the number of rows in the reduced frame if pattern match succeeds. If succeeds, the function returns the last row in the reduced frame instead of the last row in the full window frame. /* * last_value * return the value of VE evaluated on the last row of the * window frame, per spec. */ Datum window_last_value(PG_FUNCTION_ARGS) { WindowObject winobj = PG_WINDOW_OBJECT(); Datum result; boolisnull; int64 abspos; int num_reduced_frame; abspos = WinGetCurrentPosition(winobj); num_reduced_frame = row_is_in_reduced_frame(winobj, abspos); if (num_reduced_frame == 0) /* no RPR is involved */ result = WinGetFuncArgInFrame(winobj, 0, 0, WINDOW_SEEK_TAIL, true, , NULL); else if (num_reduced_frame > 0) /* get last row value in the reduced frame */ result = WinGetFuncArgInFrame(winobj, 0, num_reduced_frame - 1, WINDOW_SEEK_HEAD, true, , NULL); else /* RPR is involved and current row is unmatched or skipped */ isnull = true; if (isnull) PG_RETURN_NULL(); PG_RETURN_DATUM(result); }
Re: Row pattern recognition
On 7/22/23 03:11, Tatsuo Ishii wrote: Maybe that can help clarify the design? It feels like it'll need to eventually be a "real" function that operates on the window state, even if it doesn't support all of the possible complexities in v1. Unfortunately an window function can not call other window functions. Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? I am not sure at this point. Current PostgreSQL executor creates WindowStatePerFuncData for each window function and aggregate appearing in OVER clause. This means PREV/NEXT and other row pattern navigation operators cannot have their own WindowStatePerFuncData if they do not appear in OVER clauses in a query even if PREV/NEXT etc. are defined as window function. Or does it make sense to split the pattern navigation "functions" into their own new concept, and maybe borrow some of the window function infrastructure for it? Maybe. Suppose a window function executes row pattern matching using price > PREV(price). The window function already receives WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to PREV, we could let PREV do the real work (getting previous tuple). I have not tried this yet, though. I don't understand this logic. Window functions work over a window frame. What we are talking about here is *defining* a window frame. How can a window function execute row pattern matching? -- Vik Fearing
Re: Row pattern recognition
>> One small concern is how to convert pattern variables to regex >> expression which our regex enginer understands. Suppose, >> >> PATTERN UP+ >> >> Currently I convert "UP+" to "U+" so that it can be fed to the regexp >> engine. In order to do that, we need to know which part of the pattern >> (UP+) is the pattern variable ("UP"). For "UP+" it's quite easy. But >> for more complex regular expressions it would be not, unless PATTERN >> grammer can be analyzed by our parser to know which part is the >> pattern variable. > > Is the eventual plan to generate multiple alternatives, and run the > regex against them one at a time? Yes, that's my plan. >>> I'm not familiar enough with this code yet to offer very concrete >>> suggestions, sorry... But at some point in the future, we need to be >>> able to skip forward and backward from arbitrary points, like >>> >>> DEFINE B AS B.price > PREV(FIRST(A.price), 3) >>> >>> so there won't be just one pair of "previous and next" tuples. >> >> Yes, I know. > > I apologize. I got overexplain-y. No problem. Thank you for reminding me it. >>> Maybe >>> that can help clarify the design? It feels like it'll need to eventually >>> be a "real" function that operates on the window state, even if it >>> doesn't support all of the possible complexities in v1. >> >> Unfortunately an window function can not call other window functions. > > Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? I am not sure at this point. Current PostgreSQL executor creates WindowStatePerFuncData for each window function and aggregate appearing in OVER clause. This means PREV/NEXT and other row pattern navigation operators cannot have their own WindowStatePerFuncData if they do not appear in OVER clauses in a query even if PREV/NEXT etc. are defined as window function. > Or > does it make sense to split the pattern navigation "functions" into > their own new concept, and maybe borrow some of the window function > infrastructure for it? Maybe. Suppose a window function executes row pattern matching using price > PREV(price). The window function already receives WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to PREV, we could let PREV do the real work (getting previous tuple). I have not tried this yet, though. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On 7/22/23 01:14, Jacob Champion wrote: On 7/20/23 17:07, Vik Fearing wrote: On 7/21/23 01:36, Jacob Champion wrote: (I've attached two failing tests against v2, to hopefully better illustrate, along with what I_think_ should be the correct results.) Almost. You are matching 07-01-2023 but the condition is "price > 100". D'oh. Correction attached. I think :) This looks correct to my human brain. Thanks! -- Vik Fearing
Re: Row pattern recognition
On 7/20/23 23:16, Tatsuo Ishii wrote: > I don't know at this point. I think context-free is not enough to be > repsented in Bison. The grammer also needs to be LALR(1). Moreover, > adding the grammer to existing parser may generate shift/reduce > errors. Ah. It's been too long since my compilers classes; I will pipe down. > One small concern is how to convert pattern variables to regex > expression which our regex enginer understands. Suppose, > > PATTERN UP+ > > Currently I convert "UP+" to "U+" so that it can be fed to the regexp > engine. In order to do that, we need to know which part of the pattern > (UP+) is the pattern variable ("UP"). For "UP+" it's quite easy. But > for more complex regular expressions it would be not, unless PATTERN > grammer can be analyzed by our parser to know which part is the > pattern variable. Is the eventual plan to generate multiple alternatives, and run the regex against them one at a time? >> I'm not familiar enough with this code yet to offer very concrete >> suggestions, sorry... But at some point in the future, we need to be >> able to skip forward and backward from arbitrary points, like >> >> DEFINE B AS B.price > PREV(FIRST(A.price), 3) >> >> so there won't be just one pair of "previous and next" tuples. > > Yes, I know. I apologize. I got overexplain-y. >> Maybe >> that can help clarify the design? It feels like it'll need to eventually >> be a "real" function that operates on the window state, even if it >> doesn't support all of the possible complexities in v1. > > Unfortunately an window function can not call other window functions. Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? Or does it make sense to split the pattern navigation "functions" into their own new concept, and maybe borrow some of the window function infrastructure for it? Thanks! --Jacob
Re: Row pattern recognition
On 7/20/23 17:07, Vik Fearing wrote: > On 7/21/23 01:36, Jacob Champion wrote: >> (I've attached two failing tests against v2, to hopefully better >> illustrate, along with what I_think_ should be the correct results.) > > Almost. You are matching 07-01-2023 but the condition is "price > 100". D'oh. Correction attached. I think :) Thanks, --Jacobdiff --git a/src/test/regress/expected/rpr.out b/src/test/regress/expected/rpr.out index 6bf8818911..f3fd22de2a 100644 --- a/src/test/regress/expected/rpr.out +++ b/src/test/regress/expected/rpr.out @@ -230,6 +230,79 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock company2 | 07-10-2023 | 1300 | (20 rows) +-- match everything +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+) + DEFINE + A AS TRUE +); + company | tdate| price | rpr +--++---+- + company1 | 07-01-2023 | 100 | 100 + company1 | 07-02-2023 | 200 | + company1 | 07-03-2023 | 150 | + company1 | 07-04-2023 | 140 | + company1 | 07-05-2023 | 150 | + company1 | 07-06-2023 |90 | + company1 | 07-07-2023 | 110 | + company1 | 07-08-2023 | 130 | + company1 | 07-09-2023 | 120 | + company1 | 07-10-2023 | 130 | + company2 | 07-01-2023 |50 | 50 + company2 | 07-02-2023 | 2000 | + company2 | 07-03-2023 | 1500 | + company2 | 07-04-2023 | 1400 | + company2 | 07-05-2023 | 1500 | + company2 | 07-06-2023 |60 | + company2 | 07-07-2023 | 1100 | + company2 | 07-08-2023 | 1300 | + company2 | 07-09-2023 | 1200 | + company2 | 07-10-2023 | 1300 | +(20 rows) + +-- backtracking with reclassification of rows +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+ B+) + DEFINE + A AS price > 100, + B AS price > 100 +); + company | tdate| price | rpr +--++---+-- + company1 | 07-01-2023 | 100 | + company1 | 07-02-2023 | 200 | 200 + company1 | 07-03-2023 | 150 | + company1 | 07-04-2023 | 140 | + company1 | 07-05-2023 | 150 | + company1 | 07-06-2023 |90 | + company1 | 07-07-2023 | 110 | 110 + company1 | 07-08-2023 | 130 | + company1 | 07-09-2023 | 120 | + company1 | 07-10-2023 | 130 | + company2 | 07-01-2023 |50 | + company2 | 07-02-2023 | 2000 | 2000 + company2 | 07-03-2023 | 1500 | + company2 | 07-04-2023 | 1400 | + company2 | 07-05-2023 | 1500 | + company2 | 07-06-2023 |60 | + company2 | 07-07-2023 | 1100 | 1100 + company2 | 07-08-2023 | 1300 | + company2 | 07-09-2023 | 1200 | + company2 | 07-10-2023 | 1300 | +(20 rows) + -- -- Error cases -- diff --git a/src/test/regress/sql/rpr.sql b/src/test/regress/sql/rpr.sql index 951c9abfe9..f1cd0369f4 100644 --- a/src/test/regress/sql/rpr.sql +++ b/src/test/regress/sql/rpr.sql @@ -94,6 +94,33 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock UPDOWN AS price > PREV(price) AND price > NEXT(price) ); +-- match everything +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+) + DEFINE + A AS TRUE +); + +-- backtracking with reclassification of rows +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+ B+) + DEFINE + A AS price > 100, + B AS price > 100 +); + -- -- Error cases --
Re: Row pattern recognition
Hi, > Hi Ishii-san, > > On 7/19/23 22:15, Tatsuo Ishii wrote: >> Currently my patch has a limitation for the sake of simple >> implementation: a pattern like "A+" is parsed and analyzed in the raw >> parser. This makes subsequent process much easier because the pattern >> element variable (in this case "A") and the quantifier (in this case >> "+") is already identified by the raw parser. However there are much >> more cases are allowed in the standard as you already pointed out. For >> those cases probably we should give up to parse PATTERN items in the >> raw parser, and instead the raw parser just accepts the elements as >> Sconst? > > Is there a concern that the PATTERN grammar can't be represented in > Bison? I thought it was all context-free... I don't know at this point. I think context-free is not enough to be repsented in Bison. The grammer also needs to be LALR(1). Moreover, adding the grammer to existing parser may generate shift/reduce errors. > Or is the concern that the > parse tree of the pattern is hard to feed into a regex engine? One small concern is how to convert pattern variables to regex expression which our regex enginer understands. Suppose, PATTERN UP+ Currently I convert "UP+" to "U+" so that it can be fed to the regexp engine. In order to do that, we need to know which part of the pattern (UP+) is the pattern variable ("UP"). For "UP+" it's quite easy. But for more complex regular expressions it would be not, unless PATTERN grammer can be analyzed by our parser to know which part is the pattern variable. >> Any comments, especially on the PREV/NEXT implementation part is >> welcome. Currently the DEFINE expression like "price > PREV(price)" is >> prepared in ExecInitWindowAgg using ExecInitExpr,tweaking var->varno >> in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then >> evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting >> previous row TupleSlot to ExprContext->ecxt_outertuple, and next row >> TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary >> hack and should be gotten ride of before v1 is committed. Better idea? > > I'm not familiar enough with this code yet to offer very concrete > suggestions, sorry... But at some point in the future, we need to be > able to skip forward and backward from arbitrary points, like > > DEFINE B AS B.price > PREV(FIRST(A.price), 3) > > so there won't be just one pair of "previous and next" tuples. Yes, I know. > Maybe > that can help clarify the design? It feels like it'll need to eventually > be a "real" function that operates on the window state, even if it > doesn't support all of the possible complexities in v1. Unfortunately an window function can not call other window functions. > Taking a closer look at the regex engine: > > It looks like the + qualifier has trouble when it meets the end of the > frame. For instance, try removing the last row of the 'stock' table in > rpr.sql; some of the final matches will disappear unexpectedly. Or try a > pattern like > > PATTERN ( a+ ) > DEFINE a AS TRUE > > which doesn't seem to match anything in my testing. > > There's also the issue of backtracking in the face of reclassification, > as I think Vik was alluding to upthread. The pattern > > PATTERN ( a+ b+ ) > DEFINE a AS col = 2, > b AS col = 2 > > doesn't match a sequence of values (2 2 ...) with the current > implementation, even with a dummy row at the end to avoid the > end-of-frame bug. > > (I've attached two failing tests against v2, to hopefully better > illustrate, along with what I _think_ should be the correct results.) Thanks. I will look into this. > I'm not quite understanding the match loop in evaluate_pattern(). It > looks like we're building up a string to pass to the regex engine, but > by the we call regexp_instr, don't we already know whether or not the > pattern will match based on the expression evaluation we've done? For "+" yes. But for more complex regular expression like '{n}', we need to call our regexp engine to check if the pattern matches. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On 7/21/23 01:36, Jacob Champion wrote: There's also the issue of backtracking in the face of reclassification, as I think Vik was alluding to upthread. We definitely need some kind of backtracking or other reclassification method. (I've attached two failing tests against v2, to hopefully better illustrate, along with what I_think_ should be the correct results.) Almost. You are matching 07-01-2023 but the condition is "price > 100". -- Vik Fearing
Re: Row pattern recognition
Hi Ishii-san, On 7/19/23 22:15, Tatsuo Ishii wrote: > Currently my patch has a limitation for the sake of simple > implementation: a pattern like "A+" is parsed and analyzed in the raw > parser. This makes subsequent process much easier because the pattern > element variable (in this case "A") and the quantifier (in this case > "+") is already identified by the raw parser. However there are much > more cases are allowed in the standard as you already pointed out. For > those cases probably we should give up to parse PATTERN items in the > raw parser, and instead the raw parser just accepts the elements as > Sconst? Is there a concern that the PATTERN grammar can't be represented in Bison? I thought it was all context-free... Or is the concern that the parse tree of the pattern is hard to feed into a regex engine? > Any comments, especially on the PREV/NEXT implementation part is > welcome. Currently the DEFINE expression like "price > PREV(price)" is > prepared in ExecInitWindowAgg using ExecInitExpr,tweaking var->varno > in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then > evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting > previous row TupleSlot to ExprContext->ecxt_outertuple, and next row > TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary > hack and should be gotten ride of before v1 is committed. Better idea? I'm not familiar enough with this code yet to offer very concrete suggestions, sorry... But at some point in the future, we need to be able to skip forward and backward from arbitrary points, like DEFINE B AS B.price > PREV(FIRST(A.price), 3) so there won't be just one pair of "previous and next" tuples. Maybe that can help clarify the design? It feels like it'll need to eventually be a "real" function that operates on the window state, even if it doesn't support all of the possible complexities in v1. -- Taking a closer look at the regex engine: It looks like the + qualifier has trouble when it meets the end of the frame. For instance, try removing the last row of the 'stock' table in rpr.sql; some of the final matches will disappear unexpectedly. Or try a pattern like PATTERN ( a+ ) DEFINE a AS TRUE which doesn't seem to match anything in my testing. There's also the issue of backtracking in the face of reclassification, as I think Vik was alluding to upthread. The pattern PATTERN ( a+ b+ ) DEFINE a AS col = 2, b AS col = 2 doesn't match a sequence of values (2 2 ...) with the current implementation, even with a dummy row at the end to avoid the end-of-frame bug. (I've attached two failing tests against v2, to hopefully better illustrate, along with what I _think_ should be the correct results.) I'm not quite understanding the match loop in evaluate_pattern(). It looks like we're building up a string to pass to the regex engine, but by the we call regexp_instr, don't we already know whether or not the pattern will match based on the expression evaluation we've done? Thanks, --Jacobdiff --git a/src/test/regress/expected/rpr.out b/src/test/regress/expected/rpr.out index 6bf8818911..68e9a98684 100644 --- a/src/test/regress/expected/rpr.out +++ b/src/test/regress/expected/rpr.out @@ -230,6 +230,79 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock company2 | 07-10-2023 | 1300 | (20 rows) +-- match everything +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+) + DEFINE + A AS TRUE +); + company | tdate| price | rpr +--++---+- + company1 | 07-01-2023 | 100 | 100 + company1 | 07-02-2023 | 200 | + company1 | 07-03-2023 | 150 | + company1 | 07-04-2023 | 140 | + company1 | 07-05-2023 | 150 | + company1 | 07-06-2023 |90 | + company1 | 07-07-2023 | 110 | + company1 | 07-08-2023 | 130 | + company1 | 07-09-2023 | 120 | + company1 | 07-10-2023 | 130 | + company2 | 07-01-2023 |50 | 50 + company2 | 07-02-2023 | 2000 | + company2 | 07-03-2023 | 1500 | + company2 | 07-04-2023 | 1400 | + company2 | 07-05-2023 | 1500 | + company2 | 07-06-2023 |60 | + company2 | 07-07-2023 | 1100 | + company2 | 07-08-2023 | 1300 | + company2 | 07-09-2023 | 1200 | + company2 | 07-10-2023 | 1300 | +(20 rows) + +-- backtracking with reclassification of rows +SELECT company, tdate, price, rpr(price) OVER w FROM stock + WINDOW w AS ( + PARTITION BY company + ORDER BY tdate + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING + AFTER MATCH SKIP TO NEXT ROW + INITIAL + PATTERN (A+ B+) + DEFINE + A AS price > 100, + B AS price > 100 +); + company | tdate| price | rpr +--++---+-- + company1 | 07-01-2023 | 100 | 100 + company1 | 07-02-2023 | 200 | +
Re: Row pattern recognition
> Hello, > > Thanks for working on this! We're interested in RPR as well, and I've > been trying to get up to speed with the specs, to maybe make myself > useful. Thank you for being interested in this. > 19075-5 discusses that, at least; not sure about other parts of the spec. Thanks for the info. Unfortunately I don't have 19075-5 though. >> Maybe an insane idea but what about rewriting MATCH_RECOGNIZE clause >> into Window clause with RPR? > > Are we guaranteed to always have an equivalent window clause? There seem > to be many differences between the two, especially when it comes to ONE > ROW/ALL ROWS PER MATCH. You are right. I am not 100% sure if the rewriting is possible at this point. > To add onto what Vik said above: > >>> It seems RPR in the standard is quite complex. I think we can start >>> with a small subset of RPR then we could gradually enhance the >>> implementation. >> >> I have no problem with that as long as we don't paint ourselves into a >> corner. > > To me, PATTERN looks like an area where we may want to support a broader > set of operations in the first version. Me too but... > The spec has a bunch of > discussion around cases like empty matches, match order of alternation > and permutation, etc., which are not possible to express or test with > only the + quantifier. Those might be harder to get right in a v2, if we > don't at least keep them in mind for v1? Currently my patch has a limitation for the sake of simple implementation: a pattern like "A+" is parsed and analyzed in the raw parser. This makes subsequent process much easier because the pattern element variable (in this case "A") and the quantifier (in this case "+") is already identified by the raw parser. However there are much more cases are allowed in the standard as you already pointed out. For those cases probably we should give up to parse PATTERN items in the raw parser, and instead the raw parser just accepts the elements as Sconst? >> +static List * >> +transformPatternClause(ParseState *pstate, WindowClause *wc, WindowDef >> *windef) >> +{ >> + List*patterns; > > My compiler complains about the `patterns` variable here, which is > returned without ever being initialized. (The caller doesn't seem to use > it.) Will fix. >> +-- basic test using PREV >> +SELECT company, tdate, price, rpr(price) OVER w FROM stock >> + WINDOW w AS ( >> + PARTITION BY company >> + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> + INITIAL >> + PATTERN (START UP+ DOWN+) >> + DEFINE >> + START AS TRUE, >> + UP AS price > PREV(price), >> + DOWN AS price < PREV(price) >> +); > > nitpick: IMO the tests should be making use of ORDER BY in the window > clauses. Right. Will fix. > This is a very big feature. I agree with you that MEASURES seems like a > very important "next piece" to add. Are there other areas where you > would like reviewers to focus on right now (or avoid)? Any comments, especially on the PREV/NEXT implementation part is welcome. Currently the DEFINE expression like "price > PREV(price)" is prepared in ExecInitWindowAgg using ExecInitExpr,tweaking var->varno in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting previous row TupleSlot to ExprContext->ecxt_outertuple, and next row TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary hack and should be gotten ride of before v1 is committed. Better idea? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
Hello, Thanks for working on this! We're interested in RPR as well, and I've been trying to get up to speed with the specs, to maybe make myself useful. On 6/27/23 17:58, Tatsuo Ishii wrote: > Yes. (I think the standard calls the window frame as "full window > frame" in context of RPR to make a contrast with the subset of the > frame rows restricted by RPR. The paper I refered to as [2] claims > that the latter window frame is called "reduced window frame" in the > standard but I wasn't able to find the term in the standard.) 19075-5 discusses that, at least; not sure about other parts of the spec. > Maybe an insane idea but what about rewriting MATCH_RECOGNIZE clause > into Window clause with RPR? Are we guaranteed to always have an equivalent window clause? There seem to be many differences between the two, especially when it comes to ONE ROW/ALL ROWS PER MATCH. -- To add onto what Vik said above: >> It seems RPR in the standard is quite complex. I think we can start >> with a small subset of RPR then we could gradually enhance the >> implementation. > > I have no problem with that as long as we don't paint ourselves into a > corner. To me, PATTERN looks like an area where we may want to support a broader set of operations in the first version. The spec has a bunch of discussion around cases like empty matches, match order of alternation and permutation, etc., which are not possible to express or test with only the + quantifier. Those might be harder to get right in a v2, if we don't at least keep them in mind for v1? > +static List * > +transformPatternClause(ParseState *pstate, WindowClause *wc, WindowDef > *windef) > +{ > + List*patterns; My compiler complains about the `patterns` variable here, which is returned without ever being initialized. (The caller doesn't seem to use it.) > +-- basic test using PREV > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + INITIAL > + PATTERN (START UP+ DOWN+) > + DEFINE > + START AS TRUE, > + UP AS price > PREV(price), > + DOWN AS price < PREV(price) > +); nitpick: IMO the tests should be making use of ORDER BY in the window clauses. This is a very big feature. I agree with you that MEASURES seems like a very important "next piece" to add. Are there other areas where you would like reviewers to focus on right now (or avoid)? Thanks! --Jacob
Re: Row pattern recognition
On 6/28/23 14:17, Tatsuo Ishii wrote: Small question. This query (with all the defaults made explicit): SELECT s.company, s.tdate, s.price, FIRST_VALUE(s.tdate) OVER w, LAST_VALUE(s.tdate) OVER w, lowest OVER w FROM stock AS s WINDOW w AS ( PARTITION BY s.company ORDER BY s.tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING EXCLUDE NO OTHERS MEASURES LAST(DOWN) AS lowest AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); LAST(DOWN) AS lowest should be "LAST(DOWN.price) AS lowest"? Yes, it should be. And the tdate='07-08-2023' row in the first resultset should have '07-08-2023' and '07-10-2023' as its 4th and 5th columns. Since my brain is doing the processing instead of postgres, I made some human errors. :-) -- Vik Fearing
Re: Row pattern recognition
Small question. > This query (with all the defaults made explicit): > > SELECT s.company, s.tdate, s.price, >FIRST_VALUE(s.tdate) OVER w, >LAST_VALUE(s.tdate) OVER w, >lowest OVER w > FROM stock AS s > WINDOW w AS ( > PARTITION BY s.company > ORDER BY s.tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > EXCLUDE NO OTHERS > MEASURES > LAST(DOWN) AS lowest > AFTER MATCH SKIP PAST LAST ROW > INITIAL PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > LAST(DOWN) AS lowest should be "LAST(DOWN.price) AS lowest"? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
> Okay, I see the problem now, and why you need the rpr() function. > > You are doing this as something that happens over a window frame, but > it is actually something that *reduces* the window frame. The pattern > matching needs to be done when the frame is calculated and not when > any particular function is applied over it. Yes. (I think the standard calls the window frame as "full window frame" in context of RPR to make a contrast with the subset of the frame rows restricted by RPR. The paper I refered to as [2] claims that the latter window frame is called "reduced window frame" in the standard but I wasn't able to find the term in the standard.) I wanted to demonstate that pattern matching logic is basically correct in the PoC patch. Now what I need to do is, move the row pattern matching logic to somewhere inside nodeWindowAgg so that "restricted window frame" can be applied to all window functions and window aggregates. Currently I am looking into update_frameheadpos() and update_frametailpos() which calculate the frame head and tail against current row. What do you think? > This query (with all the defaults made explicit): > > SELECT s.company, s.tdate, s.price, >FIRST_VALUE(s.tdate) OVER w, >LAST_VALUE(s.tdate) OVER w, >lowest OVER w > FROM stock AS s > WINDOW w AS ( > PARTITION BY s.company > ORDER BY s.tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > EXCLUDE NO OTHERS > MEASURES > LAST(DOWN) AS lowest > AFTER MATCH SKIP PAST LAST ROW > INITIAL PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > > Should produce this result: [snip] Thanks for the examples. I agree with the expected query results. o SUBSET is not supported >>> >>> Is this because you haven't done it yet, or because you ran into >>> problems trying to do it? >> Because it seems SUBSET is not useful without MEASURES support. Thus >> my plan is, firstly implement MEASURES, then SUBSET. What do you >> think? > > > SUBSET elements can be used in DEFINE clauses, but I do not think this > is important compared to other features. Ok. >>> I have not looked at the patch yet, but is the reason for doing R020 >>> before R010 because you haven't done the MEASURES clause yet? >> One of the reasons is, implementing MATCH_RECOGNIZE (R010) looked >> harder for me because modifying main SELECT clause could be a hard >> work. Another reason is, I had no idea how to implement PREV/NEXT in >> other than in WINDOW clause. Other people might feel differently >> though. > > > I think we could do this with a single tuplesort if we use > backtracking (which might be really slow for some patterns). I have > not looked into it in any detail. > > We would need to be able to remove tuples from the end (even if only > logically), and be able to update tuples inside the store. Both of > those needs come from backtracking and possibly changing the > classifier. > > Without backtracking, I don't see how we could do it without have a > separate tuplestore for every current possible match. Maybe an insane idea but what about rewriting MATCH_RECOGNIZE clause into Window clause with RPR? > I looked at your v2 patches a little bit and the only comment that I > currently have on the code is you spelled PERMUTE as > PREMUTE. Everything else is hopefully explained above. Thanks. Will fix. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Row pattern recognition
On 6/26/23 03:05, Tatsuo Ishii wrote: I don't understand this. RPR in a window specification limits the window to the matched rows, so this looks like your rpr() function is just the regular first_value() window function that we already have? No, rpr() is different from first_value(). rpr() returns the argument value at the first row in a frame only when matched rows found. On the other hand first_value() returns the argument value at the first row in a frame unconditionally. company | tdate| price | rpr | first_value --++---+--+- company1 | 2023-07-01 | 100 | | 100 company1 | 2023-07-02 | 200 | 200 | 200 company1 | 2023-07-03 | 150 | 150 | 150 company1 | 2023-07-04 | 140 | | 140 company1 | 2023-07-05 | 150 | 150 | 150 company1 | 2023-07-06 |90 | | 90 company1 | 2023-07-07 | 110 | | 110 company1 | 2023-07-08 | 130 | | 130 company1 | 2023-07-09 | 120 | | 120 company1 | 2023-07-10 | 130 | | 130 For example, a frame starting with (tdate = 2023-07-02, price = 200) consists of rows (price = 200, 150, 140, 150) satisfying the pattern, thus rpr returns 200. Since in this example frame option "ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING" is specified, next frame starts with (tdate = 2023-07-03, price = 150). This frame satisfies the pattern too (price = 150, 140, 150), and rpr retus 150... and so on. Okay, I see the problem now, and why you need the rpr() function. You are doing this as something that happens over a window frame, but it is actually something that *reduces* the window frame. The pattern matching needs to be done when the frame is calculated and not when any particular function is applied over it. This query (with all the defaults made explicit): SELECT s.company, s.tdate, s.price, FIRST_VALUE(s.tdate) OVER w, LAST_VALUE(s.tdate) OVER w, lowest OVER w FROM stock AS s WINDOW w AS ( PARTITION BY s.company ORDER BY s.tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING EXCLUDE NO OTHERS MEASURES LAST(DOWN) AS lowest AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); Should produce this result: company | tdate| price | first_value | last_value | lowest --++---+-++ company1 | 07-01-2023 | 100 | || company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 |140 company1 | 07-03-2023 | 150 | || company1 | 07-04-2023 | 140 | || company1 | 07-05-2023 | 150 | || company1 | 07-06-2023 |90 | || company1 | 07-07-2023 | 110 | || company1 | 07-08-2023 | 130 | 07-05-2023 | 07-05-2023 |120 company1 | 07-09-2023 | 120 | || company1 | 07-10-2023 | 130 | || (10 rows) Or if we switch to AFTER MATCH SKIP TO NEXT ROW, then we get: company | tdate| price | first_value | last_value | lowest --++---+-++ company1 | 07-01-2023 | 100 | || company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 |140 company1 | 07-03-2023 | 150 | 07-03-2023 | 07-05-2023 |140 company1 | 07-04-2023 | 140 | || company1 | 07-05-2023 | 150 | 07-05-2023 | 07-08-2023 | 90 company1 | 07-06-2023 |90 | || company1 | 07-07-2023 | 110 | || company1 | 07-08-2023 | 130 | 07-08-2023 | 07-10-2023 |120 company1 | 07-09-2023 | 120 | || company1 | 07-10-2023 | 130 | || (10 rows) And then if we change INITIAL to SEEK: company | tdate| price | first_value | last_value | lowest --++---+-++ company1 | 07-01-2023 | 100 | 07-02-2023 | 07-05-2023 |140 company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 |140 company1 | 07-03-2023 | 150 | 07-03-2023 | 07-05-2023 |140 company1 | 07-04-2023 | 140 | 07-05-2023 | 07-08-2023 | 90 company1 | 07-05-2023 | 150 | 07-05-2023 | 07-08-2023 | 90 company1 | 07-06-2023 |90 | 07-08-2023 | 07-10-2023 |120 company1 | 07-07-2023 | 110 | 07-08-2023 | 07-10-2023 |120 company1 | 07-08-2023 | 130 | 07-08-2023 | 07-10-2023 |120 company1 | 07-09-2023 | 120 | || company1 | 07-10-2023 | 130 | || (10 rows) Since the pattern recognition is part of the frame, the window aggregates should Just Work. o SUBSET is not supported Is this because
Re: Row pattern recognition
>> In this case, we should require the user to specify AFTER MATCH SKIP >> TO NEXT ROW so that behavior doesn't change when we implement the >> standard default. (Your patch might do this already.) > > Agreed. I will implement AFTER MATCH SKIP PAST LAST ROW in the next > patch and I will change the default to AFTER MATCH SKIP PAST LAST ROW. Attached is the v2 patch to add support for AFTER MATCH SKIP PAST LAST ROW and AFTER MATCH SKIP PAST LAST ROW. The default is AFTER MATCH SKIP PAST LAST ROW as the standard default. Here are some examples to demonstrate how those clauses affect the query result. SELECT i, rpr(i) OVER w FROM (VALUES (1), (2), (3), (4)) AS v (i) WINDOW w AS ( ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW PATTERN (A B) DEFINE A AS i <= 2, B AS i <= 3 ); i | rpr ---+- 1 | 1 2 | 3 | 4 | (4 rows) In this example rpr starts from i = 1 and find that row i = 1 satisfies A, and row i = 2 satisfies B. Then rpr moves to row i = 3 and find that it does not satisfy A, thus the result is NULL. Same thing can be said to row i = 4. SELECT i, rpr(i) OVER w FROM (VALUES (1), (2), (3), (4)) AS v (i) WINDOW w AS ( ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP TO NEXT ROW PATTERN (A B) DEFINE A AS i <= 2, B AS i <= 3 ); i | rpr ---+- 1 | 1 2 | 2 3 | 4 | (4 rows) In this example rpr starts from i = 1 and find that row i = 1 satisfies A, and row i = 2 satisfies B (same as above). Then rpr moves to row i = 2, rather than 3 because AFTER MATCH SKIP TO NEXT ROW is specified. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp >From 92aebdb8a8c6d3739ee49b755f281e77c34f5d36 Mon Sep 17 00:00:00 2001 From: Tatsuo Ishii Date: Mon, 26 Jun 2023 17:05:35 +0900 Subject: [PATCH v2 1/7] Row pattern recognition patch for raw parser. --- src/backend/parser/gram.y | 218 +--- src/include/nodes/parsenodes.h | 54 src/include/parser/kwlist.h | 8 ++ src/include/parser/parse_node.h | 1 + 4 files changed, 266 insertions(+), 15 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 39ab7eac0d..5cd3ebaa98 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DefElem *defelt; SortBy *sortby; WindowDef *windef; + RPCommonSyntax *rpcom; + RPSubsetItem *rpsubset; JoinExpr *jexpr; IndexElem *ielem; StatsElem *selem; @@ -453,8 +455,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); TriggerTransitions TriggerReferencing vacuum_relation_list opt_vacuum_relation_list drop_option_list pub_obj_list - -%type opt_routine_body +row_pattern_measure_list row_pattern_definition_list +opt_row_pattern_subset_clause +row_pattern_subset_list row_pattern_subset_rhs +row_pattern +%type row_pattern_subset_item +%type opt_routine_body row_pattern_term %type group_clause %type group_by_list %type group_by_item empty_grouping_set rollup_clause cube_clause @@ -551,6 +557,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type relation_expr_opt_alias %type tablesample_clause opt_repeatable_clause %type target_el set_target insert_column_item +row_pattern_measure_item row_pattern_definition %type generic_option_name %type generic_option_arg @@ -633,6 +640,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type window_clause window_definition_list opt_partition_clause %type window_definition over_clause window_specification opt_frame_clause frame_extent frame_bound +%type opt_row_pattern_common_syntax opt_row_pattern_skip_to +%type opt_row_pattern_initial_or_seek +%type opt_row_pattern_measures %type opt_window_exclusion_clause %type opt_existing_window_name %type opt_if_not_exists @@ -645,7 +655,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type hash_partbound %type hash_partbound_elem - %type json_format_clause_opt json_value_expr json_output_clause_opt @@ -705,7 +714,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS - DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC + DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP @@ -721,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *que
Re: Row pattern recognition
> I have been dreaming of and lobbying for someone to pick up this > feature. I will be sure to review it from a standards perspective and > will try my best to help with the technical aspect, but I am not sure > to have the qualifications for that. > > THANK YOU! Thank you for looking into my proposal. >> (I know SQL:2023 is already out, but I don't have access to it) > > If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead > of just its technical specification. > > https://www.iso.org/standard/78936.html Thanks for the info. > I don't understand this. RPR in a window specification limits the > window to the matched rows, so this looks like your rpr() function is > just the regular first_value() window function that we already have? No, rpr() is different from first_value(). rpr() returns the argument value at the first row in a frame only when matched rows found. On the other hand first_value() returns the argument value at the first row in a frame unconditionally. company | tdate| price | rpr | first_value --++---+--+- company1 | 2023-07-01 | 100 | | 100 company1 | 2023-07-02 | 200 | 200 | 200 company1 | 2023-07-03 | 150 | 150 | 150 company1 | 2023-07-04 | 140 | | 140 company1 | 2023-07-05 | 150 | 150 | 150 company1 | 2023-07-06 |90 | | 90 company1 | 2023-07-07 | 110 | | 110 company1 | 2023-07-08 | 130 | | 130 company1 | 2023-07-09 | 120 | | 120 company1 | 2023-07-10 | 130 | | 130 For example, a frame starting with (tdate = 2023-07-02, price = 200) consists of rows (price = 200, 150, 140, 150) satisfying the pattern, thus rpr returns 200. Since in this example frame option "ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING" is specified, next frame starts with (tdate = 2023-07-03, price = 150). This frame satisfies the pattern too (price = 150, 140, 150), and rpr retus 150... and so on. > As in your example, you cannot have START.price outside of the window > specification; it can only go in the MEASURES clause. Only startprice > is allowed outside and it gets its qualification from the OVER. Using > w.startprice might have been better but that would require window > names to be in the same namespace as range tables. > > This currently works in Postgres: > > SELECT RANK() OVER w > FROM (VALUES (1)) AS w (x) > WINDOW w AS (ORDER BY w.x); Interesting. >> o SUBSET is not supported > > Is this because you haven't done it yet, or because you ran into > problems trying to do it? Because it seems SUBSET is not useful without MEASURES support. Thus my plan is, firstly implement MEASURES, then SUBSET. What do you think? >> o Regular expressions other than "+" are not supported > > This is what I had a hard time imagining how to do while thinking > about it. The grammar is so different here and we allow many more > operators (like "?" which is also the standard parameter symbol). > People more expert than me will have to help here. Yes, that is a problem. > In this case, we should require the user to specify AFTER MATCH SKIP > TO NEXT ROW so that behavior doesn't change when we implement the > standard default. (Your patch might do this already.) Agreed. I will implement AFTER MATCH SKIP PAST LAST ROW in the next patch and I will change the default to AFTER MATCH SKIP PAST LAST ROW. >> o Aggregate functions associated with window clause using RPR do not >> respect RPR > > I do not understand what this means. Ok, let me explain. See example below. In my understanding "count" should retun the number of rows in a frame restriced by the match condition. For example at the first line (2023-07-01 | 100) count returns 10. I think this should be 0 because the "restriced" frame starting at the line contains no matched row. On the other hand the (restricted) frame starting at second line (2023-07-02 | 200) contains 4 rows, thus count should return 4, instead of 9. SELECT company, tdate, price, rpr(price) OVER w, count(*) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); company | tdate| price | rpr | count --++---+--+--- company1 | 2023-07-01 | 100 | |10 company1 | 2023-07-02 | 200 | 200 | 9 company1 | 2023-07-03 | 150 | 150 | 8 company1 | 2023-07-04 | 140 | | 7 company1 | 2023-07-05 | 150 | 150 | 6 company1 | 2023-07-06 |90 | | 5 company1 | 2023-07-07 | 110 | | 4 company1 | 2023-07-08 | 130 | | 3 company1 | 2023-07-09 | 120 | | 2 company1 | 2023-07-10 | 130 | | 1 >> It seems RPR in the standard is quite complex. I think we can start >> with a small subset
Re: Row pattern recognition
On 6/25/23 14:05, Tatsuo Ishii wrote: Attached is a PoC patch to implement "Row pattern recognition" (RPR) in SQL:2016 (I know SQL:2023 is already out, but I don't have access to it). Actually SQL:2016 defines RPR in two places[1]: Feature R010, “Row pattern recognition: FROM clause” Feature R020, “Row pattern recognition: WINDOW clause” The patch includes partial support for R020 part. I have been dreaming of and lobbying for someone to pick up this feature. I will be sure to review it from a standards perspective and will try my best to help with the technical aspect, but I am not sure to have the qualifications for that. THANK YOU! > (I know SQL:2023 is already out, but I don't have access to it) If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead of just its technical specification. https://www.iso.org/standard/78936.html - What is RPR? RPR provides a way to search series of data using regular expression patterns. Suppose you have a stock database. CREATE TABLE stock ( company TEXT, tdate DATE, price BIGINT); You want to find a "V-shaped" pattern: i.e. price goes down for 1 or more days, then goes up for 1 or more days. If we try to implement the query in PostgreSQL, it could be quite complex and inefficient. RPR provides convenient way to implement the query. SELECT company, tdate, price, rpr(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); "PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3 "row pattern variables" namely START, UP and DOWN. They are associated with logical expressions namely "TRUE", "price > PREV(price)", and "price < PREV(price)". Note that "PREV" function returns price column in the previous row. So, UP is true if price is higher than previous day. On the other hand, DOWN is true if price is lower than previous day. PATTERN uses the row pattern variables to create a necessary pattern. In this case, the first row is always match because START is always true, and second or more rows match with "UP" ('+' is a regular expression representing one or more), and subsequent rows match with "DOWN". Here is the sample output. company | tdate| price | rpr --++---+-- company1 | 2023-07-01 | 100 | company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150 company1 | 2023-07-03 | 150 | 150 -- 150->140->150 company1 | 2023-07-04 | 140 | company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130 company1 | 2023-07-06 |90 | company1 | 2023-07-07 | 110 | company1 | 2023-07-08 | 130 | company1 | 2023-07-09 | 120 | company1 | 2023-07-10 | 130 | rpr shows the first row if all the patterns are satisfied. In the example above 200, 150, 150 are the cases. Other rows are shown as NULL. For example, on 2023-07-02 price starts with 200, then goes down to 150 then 140 but goes up 150 next day. I don't understand this. RPR in a window specification limits the window to the matched rows, so this looks like your rpr() function is just the regular first_value() window function that we already have? As far as I know, only Oracle implements RPR (only R010. R020 is not implemented) among OSS/commercial RDBMSs. There are a few DWH software having RPR. According to [2] they are Snowflake and MS Stream Analytics. It seems Trino is another one[3]. I thought DuckDB had it already, but it looks like I was wrong. - Note about the patch The current implementation is not only a subset of the standard, but is different from it in some places. The example above is written as follows according to the standard: SELECT company, tdate, startprice OVER w FROM stock WINDOW w AS ( PARTITION BY company MEASURES START.price AS startprice ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS UP.price > PREV(UP.price), DOWN AS DOWN.price < PREV(DOWN.price) ); Notice that rpr(price) is written as START.price and startprice in the standard. MEASURES defines variable names used in the target list used with "OVER window". As OVER only allows functions in PostgreSQL, I had to make up a window function "rpr" which performs the row pattern recognition task. I was not able to find a way to implement expressions like START.price (START is not a table alias). Any suggestion is greatly appreciated. As in your example, you cannot have START.price outside of the window specification; it can only go in the MEASURES clause. Only startprice is allowed outside and it gets its qualification from the OVER. Using w.startprice might h
Row pattern recognition
Attached is a PoC patch to implement "Row pattern recognition" (RPR) in SQL:2016 (I know SQL:2023 is already out, but I don't have access to it). Actually SQL:2016 defines RPR in two places[1]: Feature R010, “Row pattern recognition: FROM clause” Feature R020, “Row pattern recognition: WINDOW clause” The patch includes partial support for R020 part. - What is RPR? RPR provides a way to search series of data using regular expression patterns. Suppose you have a stock database. CREATE TABLE stock ( company TEXT, tdate DATE, price BIGINT); You want to find a "V-shaped" pattern: i.e. price goes down for 1 or more days, then goes up for 1 or more days. If we try to implement the query in PostgreSQL, it could be quite complex and inefficient. RPR provides convenient way to implement the query. SELECT company, tdate, price, rpr(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); "PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3 "row pattern variables" namely START, UP and DOWN. They are associated with logical expressions namely "TRUE", "price > PREV(price)", and "price < PREV(price)". Note that "PREV" function returns price column in the previous row. So, UP is true if price is higher than previous day. On the other hand, DOWN is true if price is lower than previous day. PATTERN uses the row pattern variables to create a necessary pattern. In this case, the first row is always match because START is always true, and second or more rows match with "UP" ('+' is a regular expression representing one or more), and subsequent rows match with "DOWN". Here is the sample output. company | tdate| price | rpr --++---+-- company1 | 2023-07-01 | 100 | company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150 company1 | 2023-07-03 | 150 | 150 -- 150->140->150 company1 | 2023-07-04 | 140 | company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130 company1 | 2023-07-06 |90 | company1 | 2023-07-07 | 110 | company1 | 2023-07-08 | 130 | company1 | 2023-07-09 | 120 | company1 | 2023-07-10 | 130 | rpr shows the first row if all the patterns are satisfied. In the example above 200, 150, 150 are the cases. Other rows are shown as NULL. For example, on 2023-07-02 price starts with 200, then goes down to 150 then 140 but goes up 150 next day. As far as I know, only Oracle implements RPR (only R010. R020 is not implemented) among OSS/commercial RDBMSs. There are a few DWH software having RPR. According to [2] they are Snowflake and MS Stream Analytics. It seems Trino is another one[3]. - Note about the patch The current implementation is not only a subset of the standard, but is different from it in some places. The example above is written as follows according to the standard: SELECT company, tdate, startprice OVER w FROM stock WINDOW w AS ( PARTITION BY company MEASURES START.price AS startprice ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS UP.price > PREV(UP.price), DOWN AS DOWN.price < PREV(DOWN.price) ); Notice that rpr(price) is written as START.price and startprice in the standard. MEASURES defines variable names used in the target list used with "OVER window". As OVER only allows functions in PostgreSQL, I had to make up a window function "rpr" which performs the row pattern recognition task. I was not able to find a way to implement expressions like START.price (START is not a table alias). Any suggestion is greatly appreciated. The differences from the standard include: o MEASURES is not supported o SUBSET is not supported o FIRST, LAST, CLASSIFIER are not supported o PREV/NEXT in the standard accept more complex arguments o Regular expressions other than "+" are not supported o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I have a plan to implement AFTER MATCH SKIP PAST LAST ROW though. o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified) o Aggregate functions associated with window clause using RPR do not respect RPR It seems RPR in the standard is quite complex. I think we can start with a small subset of RPR then we could gradually enhance the implementation. Comments and suggestions are welcome. [1] https://sqlperformance.com/2019/04/t-sql-queries/row-pattern-recognition-in-sql [2] https://link.springer.com/article/10.1007/s13222-022-00404-3