Re: NOT ENFORCED constraint feature
On Mon, Feb 3, 2025 at 10:49 AM Ashutosh Bapat wrote: > > On Mon, Feb 3, 2025 at 9:57 AM Amul Sul wrote: > > > > On Fri, Jan 31, 2025 at 7:10 PM Alvaro Herrera > > wrote: > > > > > > On 2025-Jan-31, Ashutosh Bapat wrote: > > > > > > > But if the constraint is NOT VALID and later marked as NOT ENFORCED, > > > > what is expected behaviour while changing it to ENFORCED? > > > > > > I think what you want is a different mode that would be ENFORCED NOT > > > VALID, which would be an extension of the standard, because the standard > > > does not support the concept of NOT VALID. So while I think what you > > > want is nice, I'm not sure that this patch necessarily must implement > > > it. > > > > > This way allows VALID/NOT VALID and ENFORCED/NOT ENFORCED states to > work together and also implement behaviour specified by the standard > (ref. Peter's email). If there's some other way to implement the > behaviour, that's fine too. > > > > > Here is my understanding behind this feature implementation -- I am > > not claiming to be 100% correct, I am confident I am not entirely > > wrong either. Let me explain with an example: imagine a user adds a > > VALID constraint to a table that already has data, and the user is > > completely sure that all the data complies with the constraint. Even > > in this case, the system still runs a validation check. This is > > expected behavior because the system can't just take the user's word > > for it -- it needs to explicitly confirm that the data is valid > > through validation. > > > > Now, with a NOT ENFORCED constraint, it's almost like the constraint > > doesn't exist, because no checks are being performed and there is no > > visible effect for the user, even though the constraint is technically > > still there. So when the constraint is switched to ENFORCED, we should > > be careful not to automatically mark it as validated (regardless of > > its previous validate status) unless the data is actually checked > > against the constraint -- treat as adding a new VALID constraint. Even > > if the user is absolutely sure the data complies, we should still run > > the validation to ensure reliability. > > > > In response to Ashutosh’s point about the VALID/NOT ENFORCED scenario: > > if a constraint is initially VALID, then marked as NOT ENFORCED, and > > later switched back to ENFORCED -- IMO, it shouldn't automatically be > > considered VALID. > > I am suggesting that when a constraint is changed from NOT ENFORCED to > ENFORCED, if it's marked VALID - we run validation checks. > Ok. > Here's how I see the state conversions happening. > > NOT VALID, NOT ENFORCED changed to NOT_VALID, ENFORCED - no data > validation required, constraint is enforced on the new tuples/changes > NOT VALID, ENFORCED changed to NOT VALID, NOT ENFORCED - no data > validation, constraint isn't enforced anymore > VALID, NOT ENFORCED changed to VALID, ENFORCED - data validation > required, constraint is enforced > VALID, ENFORCED changed to VALID, NOT ENFORCED - no data validation > required, constrain isn't enforced anymore, we rely on user to enforce > the constraint on their side > Understood, thanks for the detailed explanation. This is what I had implemented in the v4 patch, and I agree with this. If we decide to go with this, I can revert the behavior to the v4 patch set. Regards, Amul
Re: Logging parallel worker draught
> The "story" I have in mind is: I need to audit an instance I know > nothing about. I ask the client to adapt the logging parameters for > pgbadger (including this one), collect the logs and generate a report > for the said period to have a broad overview of what is happenning. Let's see if anyone has a different opinion. As far as the current set of patches, I had some other changes that I missed earlier; indentation to the calls in LogParallelWorkersIfNeeded and comment for the LogParallelWorkersIfNeeded function. I also re-worked the setup of the GUC as it was not setup the same way as other GUCs with an options list. I ended up just making those changes in v8. Besides that, I think this is ready for committer. Regards, Sami v8-0002-Setup-counters-for-parallel-vacuums.patch Description: Binary data v8-0001-Add-a-guc-for-parallel-worker-logging.patch Description: Binary data v8-0003-Implements-logging-for-parallel-worker-usage-in-u.patch Description: Binary data v8-0004-Implements-logging-for-parallel-worker-usage-in-q.patch Description: Binary data
Re: NOT ENFORCED constraint feature
On Mon, Feb 3, 2025 at 9:57 AM Amul Sul wrote:
>
> On Fri, Jan 31, 2025 at 7:10 PM Alvaro Herrera
> wrote:
> >
> > On 2025-Jan-31, Ashutosh Bapat wrote:
> >
> > > But if the constraint is NOT VALID and later marked as NOT ENFORCED,
> > > what is expected behaviour while changing it to ENFORCED?
> >
> > I think what you want is a different mode that would be ENFORCED NOT
> > VALID, which would be an extension of the standard, because the standard
> > does not support the concept of NOT VALID. So while I think what you
> > want is nice, I'm not sure that this patch necessarily must implement
> > it.

This way allows VALID/NOT VALID and ENFORCED/NOT ENFORCED states to work together and also implement behaviour specified by the standard (ref. Peter's email). If there's some other way to implement the behaviour, that's fine too.

> Here is my understanding behind this feature implementation -- I am
> not claiming to be 100% correct, I am confident I am not entirely
> wrong either. Let me explain with an example: imagine a user adds a
> VALID constraint to a table that already has data, and the user is
> completely sure that all the data complies with the constraint. Even
> in this case, the system still runs a validation check. This is
> expected behavior because the system can't just take the user's word
> for it -- it needs to explicitly confirm that the data is valid
> through validation.
>
> Now, with a NOT ENFORCED constraint, it's almost like the constraint
> doesn't exist, because no checks are being performed and there is no
> visible effect for the user, even though the constraint is technically
> still there. So when the constraint is switched to ENFORCED, we should
> be careful not to automatically mark it as validated (regardless of
> its previous validate status) unless the data is actually checked
> against the constraint -- treat as adding a new VALID constraint. Even
> if the user is absolutely sure the data complies, we should still run
> the validation to ensure reliability.
>
> In response to Ashutosh's point about the VALID/NOT ENFORCED scenario:
> if a constraint is initially VALID, then marked as NOT ENFORCED, and
> later switched back to ENFORCED -- IMO, it shouldn't automatically be
> considered VALID.

I am suggesting that when a constraint is changed from NOT ENFORCED to ENFORCED, if it's marked VALID, we run validation checks.

Here's how I see the state conversions happening:

NOT VALID, NOT ENFORCED changed to NOT VALID, ENFORCED - no data validation required, constraint is enforced on the new tuples/changes
NOT VALID, ENFORCED changed to NOT VALID, NOT ENFORCED - no data validation, constraint isn't enforced anymore
VALID, NOT ENFORCED changed to VALID, ENFORCED - data validation required, constraint is enforced
VALID, ENFORCED changed to VALID, NOT ENFORCED - no data validation required, constraint isn't enforced anymore, we rely on the user to enforce the constraint on their side

--
Best Wishes,
Ashutosh Bapat
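To make the four transitions above concrete, here is a rough SQL sketch. The ENFORCED / NOT ENFORCED syntax is the one proposed by the patch under discussion, not something in a released PostgreSQL version, and the table and constraint names are made up for illustration:

    CREATE TABLE t (a int);
    INSERT INTO t VALUES (-1);   -- pre-existing row that violates the rule below

    ALTER TABLE t ADD CONSTRAINT t_a_check CHECK (a > 0) NOT VALID;

    -- NOT VALID, ENFORCED -> NOT VALID, NOT ENFORCED: no data validation,
    -- the constraint simply stops being checked for new tuples
    ALTER TABLE t ALTER CONSTRAINT t_a_check NOT ENFORCED;

    -- NOT VALID, NOT ENFORCED -> NOT VALID, ENFORCED: still no validation of
    -- existing data, but new tuples/changes are checked again
    ALTER TABLE t ALTER CONSTRAINT t_a_check ENFORCED;

    -- NOT VALID -> VALID requires an explicit scan of the existing data;
    -- this would fail here because of the -1 row inserted above
    ALTER TABLE t VALIDATE CONSTRAINT t_a_check;

Per the table above, the remaining pair works the same way: VALID, ENFORCED to VALID, NOT ENFORCED needs no validation, and switching such a constraint back to ENFORCED re-runs validation rather than trusting the old VALID flag.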
Re: pg_rewind with --write-recovery-conf option doesn't write dbname to primary_conninfo value.
On Mon, Feb 3, 2025 at 4:50 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Sawada-san, > > > I think it's a good idea to support it at least on HEAD. I've attached > > a patch for that. > > +1. I've confirmed that pg_rewind and -R can't output dbname for now, > and your patch allows to do it. > > Few comments for your patch. > > 1. > > pg_basebackup.sgml has below description. I feel this can be ported to > pg_rewind.sgml as well. > > ``` > The dbname will be recorded only if the dbname was > specified explicitly in the connection string or > environment variable. > ``` > > 2. > I'm not sure whether recovery_gen.h/c is a correct place for the exported > function > GetDbnameFromConnectionOptions(). The function itself does not handle > postgresql.cuto.conf file. > I seeked other header files and felt that connect_utils.h might be. > > ``` > /*- > * > * Facilities for frontend code to connect to and disconnect from databases. > ``` > > Another idea is to change the third API to accept the whole connection > string, and > it extracts dbname from it. In this approach we can make > GetDbnameFromConnectionOptions() > to static function - which does not feel strange for me. > > Best regards, > Hayato Kuroda > FUJITSU LIMITED > Hi Sawada-san, h Here are a few more minor comments in addition to what Kuroda-San already posted. == typo in patch name /primary_conninfo/primary_connifo/ == Commit message no details. bad link. == src/fe_utils/recovery_gen.c 1. static char * FindDbnameInConnParams(PQconninfoOption *conn_opts) There is a missing forward declaration of this function. Better to add it for consistency because the other static function has one. ~~~ 2. +static char * +FindDbnameInConnParams(PQconninfoOption *conn_opts) +{ + PQconninfoOption *conn_opt; + + for (conn_opt = conn_opts; conn_opt->keyword != NULL; conn_opt++) + { + if (strcmp(conn_opt->keyword, "dbname") == 0 && + conn_opt->val != NULL && conn_opt->val[0] != '\0') + return pg_strdup(conn_opt->val); + } + return NULL; +} 2a. I know you copied this, but it seems a bit strange that this function is named "Params" when everything about it including its parameter and comments and the caller talks about "_opts" or "Options" ~ 2b. I know you copied this, but normally 'conn_opt' might have been written as a for-loop variable. ~~~ 3. +/* + * GetDbnameFromConnectionOptions + * + * This is a special purpose function to retrieve the dbname from either the + * 'connstr' specified by the caller or from the environment variables. + * + * Returns NULL, if dbname is not specified by the user in the above + * mentioned connection options. + */ What does "in the above mentioned connection options" mean? In the original function comment where this was copied from there was an extra sentence ("We follow ... from various connection options.") so this had more context, but now that the other sentence is removed maybe "above mentioned connection options" looks like it also needs rewording. == Kind Regards, Peter Smith. Fujitsu Australia
Re: POC, WIP: OR-clause support for indexes
On 2/3/25 00:57, Alexander Korotkov wrote: On Thu, Jan 30, 2025 at 3:23 PM Pavel Borisov wrote: On Tue, 28 Jan 2025 at 12:42, Andrei Lepikhov wrote: On 1/28/25 11:36, Andrei Lepikhov wrote: On 1/27/25 16:50, Alexander Korotkov wrote: qsort(matches, n, sizeof(OrArgIndexMatch), or_arg_index_match_cmp); To fit an index, the order of elements in the target array of the `ScalarArrayOpExpr` may change compared to the initial list of OR expressions. If there are indexes that cover the same set of columns but in reverse order, this could potentially alter the position of a Subplan. However, I believe this is a rare case; it is supported by the initial OR path and should be acceptable. I beg your pardon - I forgot that we've restricted the feature's scope and can't combine OR clauses into ScalarArrayOpExpr if the args list contains references to different columns. So, my note can't be applied here. -- regards, Andrei Lepikhov I've looked at the patch v46-0001 Looks good to me. There is a test that demonstrates the behavior change. Maybe some more cases like are also worth adding to a test. +SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.a=t2.c OR (t1.a=t2.b OR t1.a=1); + QUERY PLAN + + Nested Loop + -> Seq Scan on bitmap_split_or t2 + -> Index Scan using t_a_b_idx on bitmap_split_or t1 + Index Cond: (a = ANY (ARRAY[t2.c, t2.b, 1])) +(4 rows) + +EXPLAIN (COSTS OFF) +SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.c=t2.b OR t1.a=1; + QUERY PLAN +-- + Nested Loop + Join Filter: ((t1.c = t2.b) OR (t1.a = 1)) + -> Seq Scan on bitmap_split_or t1 + -> Materialize + -> Seq Scan on bitmap_split_or t2 +(5 rows) + +EXPLAIN (COSTS OFF) Added more tests to join.sql I have made final pass through the changes. All looks good. Only one thing looks strange for me - multiple '42's in the output of the test. May be reduce output by an aggregate in the target list of the query? -- regards, Andrei Lepikhov
Re: pgbench with partitioned tables
On 2025-Feb-03, Sergey Tatarintsev wrote: > Thanks for the note. I changed the query in the patch (v2 patch attached) > > Btw, an additional benefit from the patch is that we can use foreign tables > (for example, to test postgres_fdw optimizations) Good thought, and maybe it would be better if the new function was "get_table_relkind" which just returns the char, and this check > + /* Use COPY with FREEZE on v14 and later for all regular tables */ > + if ((PQserverVersion(con) >= 14) && is_regular_table(con, table)) > copy_statement_fmt = "copy %s from stdin with (freeze > on)"; can be "&& get_table_relkind(con, table) == RELKIND_RELATION" which I think is more natural. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
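To make the suggestion concrete, here is a rough sketch of what such a helper could look like in pgbench. This is not the committed implementation; error handling is simplified and identifier quoting/escaping is ignored:

static char
get_table_relkind(PGconn *con, const char *table)
{
	char		relkind;
	char		query[256];
	PGresult   *res;

	/* ask the server for the relkind of the given table */
	snprintf(query, sizeof(query),
			 "SELECT relkind FROM pg_catalog.pg_class "
			 "WHERE oid = '%s'::pg_catalog.regclass", table);
	res = PQexec(con, query);
	if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) != 1)
	{
		pg_log_error("could not determine relkind of table \"%s\"", table);
		exit(1);
	}
	relkind = *PQgetvalue(res, 0, 0);
	PQclear(res);
	return relkind;
}

The caller would then test get_table_relkind(con, table) == RELKIND_RELATION for the COPY FREEZE case, and the same helper could recognize RELKIND_FOREIGN_TABLE for Sergey's foreign-table use case.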
Re: why there is not VACUUM FULL CONCURRENTLY?
> From bf2ec8c5d753de340140839f1b061044ec4c1149 Mon Sep 17 00:00:00 2001 > From: Antonin Houska > Date: Mon, 13 Jan 2025 14:29:54 +0100 > Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER > commands. > @@ -950,8 +1412,46 @@ copy_table_data(Relation NewHeap, Relation OldHeap, > Relation OldIndex, bool verb > + if (concurrent) > + { > + PgBackendProgress progress; > + > + /* > + * Command progress reporting gets terminated at > subtransaction > + * end. Save the status so it can be eventually > restored. > + */ > + memcpy(&progress, &MyBEEntry->st_progress, > +sizeof(PgBackendProgress)); > + > + /* Release the locks by aborting the subtransaction. */ > + RollbackAndReleaseCurrentSubTransaction(); > + > + /* Restore the progress reporting status. */ > + pgstat_progress_restore_state(&progress); > + > + CurrentResourceOwner = oldowner; > + } I was looking at 0002 to see if it'd make sense to commit it ahead of a fuller review of the rest, and I find that the reason for that patch is this hunk you have here in copy_table_data -- you want to avoid a subtransaction abort (which you use to release planner lock) clobbering the status. I think this a bad idea. It might be better to handle this in a different way, for instance 1) maybe have a flag that says "do not reset progress status during subtransaction abort"; REPACK would set that flag, so it'd be able to continue its business without having to memcpy the current status (which seems like quite a hack) or restoring it afterwards. 2) maybe subtransaction abort is not the best way to release the planning locks anyway. I think it might be better to have a ResourceOwner that owns those locks, and we do ResourceOwnerRelease() which would release them. I think this would be a novel usage of ResourceOwner so it needs more research. But if this works, then we don't need the subtransaction at all, and therefore we don't need backend progress restore at all either. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Proposal to CREATE FOREIGN TABLE LIKE
On 2025-Feb-01, Zhang Mingli wrote: > For example, we use kafka_fdw to produce and consume data from a Kafka > server. In our scenario, we sometimes need to write records from a > local table into Kafka. Here’s a brief outline of our process: > > 1. We already have a wide table, local_wide_table in Postgres. > 2. We need to create a foreign table, foreign_table, with the same > definition as local_wide_table. > 3. Insert records into foreign_table by selecting > from local_wide_table with the some quals. > > In step 2, we currently have to manually create the foreign table > using CREATE FOREIGN TABLE and copy the column definitions one by one. Eh yeah, I guess for this use case it makes sense to allow a LIKE clause on CREATE FOREIGN TABLE. Were you going to submit a patch? -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
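For illustration, the kind of usage being proposed would look roughly like this. The LIKE clause on CREATE FOREIGN TABLE does not exist yet, and the server name, option, and filter column below are invented, since the kafka_fdw specifics are not part of core:

CREATE FOREIGN TABLE foreign_table (LIKE local_wide_table)
    SERVER kafka_server OPTIONS (topic 'wide_table_topic');

INSERT INTO foreign_table
    SELECT * FROM local_wide_table WHERE id % 10 = 0;   -- "some quals"

This would replace manually re-typing every column of a wide table in step 2 of the workflow described above.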
Re: pgbench with partitioned tables
On 2025-Jan-31, Melanie Plageman wrote: > Maybe instead of just not using COPY FREEZE on a table if it is > partitioned, we could add new data generation init_steps. Perhaps one > that is client-side data generation (g) but with no freezing? I'm not > really sure what the letter would be (f? making it f, g, and G?). I think it makes sense to do what you suggest, but on the other hand, the original code that Sergey is patching looks like a hack in the sense that it hardcodes which tables to use FREEZE with. There's no point to doing things that way actually, so accepting Sergey's patch to replace that with a relkind check feels sensible to me. I think the query should be SELECT relkind FROM pg_catalog.pg_class WHERE oid='%s'::pg_catalog.regclass if only for consistency with pgbench's other query on catalogs. Your proposal to add different init_steps might be reasonable, at least if we allowed partitionedness of tables to vary in other ways (eg. if we made pgbench_history partitioned), but I don't think it conflicts with Sergey's patch in spirit. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Here's a general engineering tip: if the non-fun part is too complex for you to figure out, that might indicate the fun part is too ambitious." (John Naylor) https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com
Re: Getting rid of CaseTestExpr (well, partially)
I wrote: > Therefore, my proposal here is to leave the parser's usage of > CaseTestExpr alone, and replace CaseTestExprs with Params during > eval_const_expressions, just before any relevant function inlining > could happen. I thought of a nasty defect in this idea: CASE expressions that would have been seen as equal() may not be equal after the transformation. (The upper nodes are fine because we can make their *param fields be equal_ignore, but the lower PARAM_EXPRs won't match.) There's at least one important behavior this probably breaks, which is matching of index expressions containing a CASE or ArrayCoerce to query quals. That might be a rare use-case, but it'd annoy anyone doing it, and it'd be pretty hard to explain why it's not a bug. I spent some time wondering if we could somehow number the Params "bottom up" to fix this, so that (for example) a CASE would always use paramid 1 unless it contains one other CASE, which would cause it to use paramid 2, etc. I think eval_const_expressions could be made to do that, but it's not clear how to preserve the property during function inlining. But the main thing I don't like is that this would make it much less obvious that Params with overlapping lifespans have distinct IDs, which takes away a large part of the attraction of the whole design. Pending some better idea, I'm withdrawing this patch. regards, tom lane
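For anyone following along, the use case that would have been broken is matching an index on a CASE expression against an identical query qual, which relies on the two expression trees still comparing equal() after preprocessing. A plain-SQL illustration of that scenario, with a made-up table:

CREATE TABLE t (x int);
CREATE INDEX t_abs_x_idx ON t ((CASE WHEN x >= 0 THEN x ELSE -x END));
-- with enough rows and an ANALYZE, this is expected to use t_abs_x_idx:
EXPLAIN SELECT * FROM t WHERE (CASE WHEN x >= 0 THEN x ELSE -x END) = 5;

If eval_const_expressions had replaced the CaseTestExprs with differently numbered Params on the two sides, the index expression and the qual would no longer compare equal and the index could silently stop being used.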
Re: POC, WIP: OR-clause support for indexes
On Tue, Jan 28, 2025 at 10:42 AM Andrei Lepikhov wrote: > On 1/28/25 11:36, Andrei Lepikhov wrote: > > On 1/27/25 16:50, Alexander Korotkov wrote: > > qsort(matches, n, sizeof(OrArgIndexMatch), or_arg_index_match_cmp); > > > > To fit an index, the order of elements in the target array of the > > `ScalarArrayOpExpr` may change compared to the initial list of OR > > expressions. If there are indexes that cover the same set of columns but > > in reverse order, this could potentially alter the position of a > > Subplan. However, I believe this is a rare case; it is supported by the > > initial OR path and should be acceptable. > I beg your pardon - I forgot that we've restricted the feature's scope > and can't combine OR clauses into ScalarArrayOpExpr if the args list > contains references to different columns. > So, my note can't be applied here. OK, thank you! -- Regards, Alexander Korotkov Supabase
Re: POC, WIP: OR-clause support for indexes
On Thu, Jan 30, 2025 at 3:23 PM Pavel Borisov wrote: > On Tue, 28 Jan 2025 at 12:42, Andrei Lepikhov wrote: > > > > On 1/28/25 11:36, Andrei Lepikhov wrote: > > > On 1/27/25 16:50, Alexander Korotkov wrote: > > > qsort(matches, n, sizeof(OrArgIndexMatch), or_arg_index_match_cmp); > > > > > > To fit an index, the order of elements in the target array of the > > > `ScalarArrayOpExpr` may change compared to the initial list of OR > > > expressions. If there are indexes that cover the same set of columns but > > > in reverse order, this could potentially alter the position of a > > > Subplan. However, I believe this is a rare case; it is supported by the > > > initial OR path and should be acceptable. > > I beg your pardon - I forgot that we've restricted the feature's scope > > and can't combine OR clauses into ScalarArrayOpExpr if the args list > > contains references to different columns. > > So, my note can't be applied here. > > > > -- > > regards, Andrei Lepikhov > > I've looked at the patch v46-0001 > Looks good to me. > > There is a test that demonstrates the behavior change. Maybe some more > cases like are also worth adding to a test. > > +SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.a=t2.c > OR (t1.a=t2.b OR t1.a=1); > + QUERY PLAN > + > + Nested Loop > + -> Seq Scan on bitmap_split_or t2 > + -> Index Scan using t_a_b_idx on bitmap_split_or t1 > + Index Cond: (a = ANY (ARRAY[t2.c, t2.b, 1])) > +(4 rows) > + > +EXPLAIN (COSTS OFF) > +SELECT * FROM bitmap_split_or t1, bitmap_split_or t2 WHERE t1.c=t2.b OR > t1.a=1; > + QUERY PLAN > +-- > + Nested Loop > + Join Filter: ((t1.c = t2.b) OR (t1.a = 1)) > + -> Seq Scan on bitmap_split_or t1 > + -> Materialize > + -> Seq Scan on bitmap_split_or t2 > +(5 rows) > + > +EXPLAIN (COSTS OFF) Added more tests to join.sql > Comment > > *Also, add any potentially usable join OR clauses to *joinorclauses > may reflect the change in v46-0001 lappend -> list_append_unique_ptr > that differs in the processing of equal clauses in the list. Comments in this function are revised. I also added detailed explanation of this change to the commit message. > Semantics mentioned in the commit message: > > 2. Make match_join_clauses_to_index() pass OR-clauses to > >match_clause_to_index(). > could also be added as comments in the section just before > match_join_clauses_to_index() Right, this is addressed too. > Since d4378c0005e6 comment for match_clause_to_indexcol() I think > needs change. This could be as a separate commit, not regarding > current patch v46-0001. > > * NOTE: returns NULL if clause is an OR or AND clause; it is the > > * responsibility of higher-level routines to co Good catch. This is added as a separate patch. > I think the patch can be pushed with possible additions to regression > test and comments. OK, thank you! -- Regards, Alexander Korotkov Supabase v47-0001-Revise-the-header-comment-for-match_clause_to_in.patch Description: Binary data v47-0002-Allow-usage-of-match_orclause_to_indexcol-for-jo.patch Description: Binary data
Re: POC, WIP: OR-clause support for indexes
On Fri, Jan 31, 2025 at 4:31 PM Alena Rybakina wrote: > I started reviewing at the patch and saw some output "ERROR" in the output of > the test and is it okay here? > > SELECT * FROM tenk1 t1 > WHERE t1.thousand = 42 OR t1.thousand = (SELECT t2.tenthous FROM tenk1 t2 > WHERE t2.thousand = t1.tenthous); > ERROR: more than one row returned by a subquery used as an expression The output is correct for this query. But the query is very unfortunate for the regression test. I've revised query in the v47 revision [1]. Links. 1. https://www.postgresql.org/message-id/CAPpHfdsBZmNt9qUoJBqsQFiVDX1%3DyCKpuVAt1YnR7JCpP%3Dk8%2BA%40mail.gmail.com -- Regards, Alexander Korotkov Supabase
Re: Allow default \watch interval in psql to be configured
> On 19 Nov 2024, at 11:20, Masahiro Ikeda wrote: > I've tested it and reviewed the patch. I'd like to provide some feedback. Thank you very much for your review and I do apologize for the late response. > I tested with v3 patch and found the following compile error. > It seems that math.h needs to be included in variables.c. > > variables.c: In function 'ParseVariableDouble': > variables.c:220:54: error: 'HUGE_VAL' undeclared (first use in this function) > 220 | (dblval == 0.0 || dblval >= HUGE_VAL || > dblval <= -HUGE_VAL)) > | ^~~~ Odd, I don't see that, but I've nonetheless fixed it by including math.h > Although the error handling logic is being discussed now, I think it would be > better, > at least, to align the logic and messages of exec_command_watch() and > ParseVariableDouble(). I see your point, especially since ParseVariableBuffer is only used for this at the moment. It is however intended as generic functionality and would require to diverge from the other ParseVariableXXX to support custom error messages. I'm not sure it's worth the added complexity for something which is likely to be a quite niche feature. > ParseVariableDouble() doesn't have a comment yet, though you may be planning > to add one later. Fixed. > I believe the default value is 2 after the WATCH_INTERVAL is specified > because \unset > WATCH_INTERVAL sets it to '2'. So, why not update the following sentence > accordingly? > > -of rows. Wait the specified number of seconds (default 2) between > executions. > -For backwards compatibility, > +of rows. Wait the specified number of seconds (default 2, which can > be > +changed with the variable > +between executions. For backwards compatibility, > > For example, > >> Wait WATCH_INTERVAL seconds unless the interval option is >> specified. >> If the interval option is specified, wait the specified number of seconds >> instead. > > +This variable sets the default interval which > \watch > +waits between executing the query. Specifying an interval in the > +command overrides this variable. > >> This variable sets the interval in seconds that \watch >> waits >> between executions. The default value is 2.0. I don't think this improves the documentation. The patch only changes the defult interval which is used if the interval is omitted from the command, the above sections discuss how interval is handled when specified. I did some minor tweaks to the docs to try and clarify things. > I think it's better to replace queries with executions because the \watch > documentation says so. > > + HELP0(" WATCH_INTERVAL\n" > + "number of seconds \\watch waits between queries\n"); Fair point, fixed. The attached v4 fixes the above and also adds tests. -- Daniel Gustafsson v4-0001-Make-default-watch-interval-configurable.patch Description: Binary data
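For reference, a minimal illustration of the intended usage, assuming the patch's WATCH_INTERVAL psql variable behaves as described above (an explicit interval given to \watch still overrides it, and unsetting the variable restores the default of 2 seconds):

\set WATCH_INTERVAL 0.5
SELECT now() \watch
SELECT now() \watch 10
\unset WATCH_INTERVAL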
Increased work_mem for "logical replication tablesync worker" only?
Hi. Trying to monitor perf during the initial tablesync phase (COPY) right after CREATE SUBSCRIPTION. I noticed that the size of 17/main/base/pgsql_tmp on the destination node grows (tens of gigabytes) as the COPY command (running internally on the publisher) progresses. Then in the end (when its "EXPLAIN SELECT 1 FROM tbl" on the destination shows the approximate number of rows equals to the number of rows on the source node) it hangs for several minutes, and then 17/main/base/pgsql_tmp empties, and the subscription progresses. It seems like if I increase work_mem to several GB, then the growth of 17/main/base/pgsql_tmp becomes less significant. Questions: 1. Are there some diagnostics commands that would allow me to figure out what is in those tmp files? Why does the subscriber create those tmp files and not just write directly to the data files and WAL? (The table has 2 bytea columns, i.e. it's TOASTed for sure.) 2. Is there a way to set work_mem only for "logical replication tablesync worker"? I don't want to have it that high for all connections, but for logical replication tablesync worker - it's fine to have it set to a huge value (I have max_sync_workers_per_subscription=1, so there is not more than 1 of such processes in the system). 3. Is this work_mem consideration relevant at all? Maybe it's a red herring? Thanks!
[PATCH] Fix incorrect range in pg_regress comment
Hi hackers,

I noticed that a comment in pg_regress incorrectly states that alternative output files can be named filename{_i}.out with 0 < i <= 9. However, the actual valid range is 0 <= i <= 9. This patch corrects the comment.

The fix is trivial but ensures that the documentation in the code accurately reflects the behavior of pg_regress. Attached is a small patch correcting this.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

From d2ef1b2ff2f42bc38eb2d2d87201a2822d53d8b5 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov
Date: Sun, 2 Feb 2025 21:35:10 +0300
Subject: [PATCH v1] Fix incorrect range in pg_regress comment

The comment in pg_regress incorrectly stated that alternative output files could be named test{_i}.out with 0 < i <= 9. However, the valid range is actually 0 <= i <= 9.
---
 src/test/regress/pg_regress.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 61a234ae21..5d85dcc62f 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -1335,7 +1335,7 @@ make_directory(const char *dir)
 }

 /*
- * In: filename.ext, Return: filename_i.ext, where 0 < i <= 9
+ * In: filename.ext, Return: filename_i.ext, where 0 <= i <= 9
  */
 static char *
 get_alternative_expectfile(const char *expectfile, int i)
--
2.34.1
Re: Proposal to Enable/Disable Index using ALTER INDEX (with patch)
hi. the following reviews based on v10-0001-Introduce-the-ability-to-set-index-visibility-us.patch. in src/test/regress/sql/create_index.sql seems there are no sql tests for "create index ... invisible"? VISIBLE Make the specified index visible. The index will be used for queries. here it should be "Make the specified index visible. The index can be used for query planning" ? Do we need to add GUC use_invisible_index to postgresql.conf.sample? CREATE TABLE t(id INT PRIMARY KEY, data TEXT,num INT, vector INT[], range INT4RANGE); ALTER INDEX t_pkey INVISIBLE; alter table t alter column id set data type bigint; \d t after ALTER TABLE SET DATA TYPE, the "visible" status should not change? but here it changed. you may check ATPostAlterTypeParse to make the "visible" status not change. @@ -3449,6 +3451,7 @@ typedef struct IndexStmt boolif_not_exists;/* just do nothing if index already exists? */ boolreset_default_tblspc;/* reset default_tablespace prior to * executing */ + boolisvisible;/* true if VISIBLE (default), false if INVISIBLE */ } IndexStmt; the indentation level is not right? +opt_index_visibility: +VISIBLE_P { $$ = true; } +| INVISIBLE_P { $$ = false; } +| /*EMPTY*/ { $$ = true; } +; + the indentation level seems also not right? +createFlags = INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT; +if (indexForm->indisvisible) +createFlags |= INDEX_CREATE_VISIBLE; the indentation level seems also not right? INVISIBLE, VISIBLE is not special words, in gram.y, you don't need "VISIBLE_P", "INVISIBLE_P", you can just use "INVISIBLE", "VISIBLE" ? \d t3 Table "public.t3" Column | Type| Collation | Nullable | Default +---+---+--+- id | integer | | not null | data | text | | | num| integer | | | vector | integer[] | | | range | int4range | | | a | box | | | Indexes: "t3_pkey" PRIMARY KEY, btree (id) INVISIBLE "grect2ind" gist (a) INVISIBLE "t3_1" gist (a) INVISIBLE "t3_2" gin (vector) WITH (fastupdate='on', gin_pending_list_limit='128') INVISIBLE "t3_4" spgist (data) INVISIBLE "t3_6" hash (id) INVISIBLE pg_dump will dump as -- -- Name: t3 t3_pkey; Type: CONSTRAINT; Schema: public; Owner: jian -- ALTER TABLE ONLY public.t3 ADD CONSTRAINT t3_pkey PRIMARY KEY (id); after dump, restore index (primary key: t3_pkey) INVISIBLE will not be restored. We need extra work for restoring the INVISIBLE flag for the primary key index. I am not sure if we need to change index_concurrently_swap or not. but many other pg_index columns changed.
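For readers who have not looked at the patch, the syntax and GUC being reviewed above are roughly the following. All of this comes from the proposed v10 patch, not released PostgreSQL, so names and semantics may change:

CREATE INDEX t_data_idx ON t (data) INVISIBLE;
ALTER INDEX t_pkey INVISIBLE;    -- index is still maintained but ignored for planning
ALTER INDEX t_pkey VISIBLE;      -- usable for query planning again
SET use_invisible_index = on;    -- GUC referenced in the review above

This is the context for the review comments about the \d output, pg_dump losing the primary key's INVISIBLE flag, and the missing postgresql.conf.sample entry.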
Re: [PATCH] Fix incorrect range in pg_regress comment
Ilia Evdokimov writes: > I noticed that a comment in pg_regress incorrectly states that > alternative output files can be named filename{_i}.out with 0 < i <= 9. > However, the actual valid range is 0 <= i <= 9. This patch corrects the > comment. Hmm, our convention is definitely that the numbers start with 1, so I do not want to make this change. Maybe we should change the code instead. regards, tom lane
Re: Pgoutput not capturing the generated columns
On Wed, Jan 29, 2025 at 2:48 PM Amit Kapila wrote: > > On Wed, Jan 29, 2025 at 6:03 AM Peter Smith wrote: > > > > On Tue, Jan 28, 2025 at 7:59 PM Amit Kapila wrote: > > > > > > On Thu, Jan 23, 2025 at 9:58 AM vignesh C wrote: > > > > > > > > > > I have pushed the remaining part of this patch. Now, we can review the > > > proposed documentation part. > > > > > > I feel we don't need the Examples sub-section for this part of the > > > docs. The behavior is quite clear from the "Behavior Summary" > > > sub-section table. > > > > It is good to hear that the "Behavior Summary" matrix is clear, but it > > is never the purpose of examples to show behaviour that is not already > > clearly documented. The examples are simply to demonstrate some real > > usage. Personally, I find it far easier to understand this (or any > > other) feature by working through a few examples in conjunction with > > the behaviour matrix, instead of just having the matrix and nothing > > else. > > > > I am not against giving examples in the docs to make the topic easy to > understand but in this particular case, I am not sure if additional > examples are useful. You already gave one example in the beginning: > "For example, note below that subscriber table generated column value > comes from the subscriber column's calculation." the remaining text is > clear enough to understand the feature. > > If you still want to make a case for additional examples, divide this > patch into two parts. The first part without examples could be part of > this thread and I can commit that. Then you can start a separate > thread just for the examples and then we can see what others think and > make a decision based on that. > I have created a new thread [1] to propose adding the "Examples" subsection. == [1] https://www.postgresql.org/message-id/CAHut%2BPt_7GV8eHSW4XQsC6rF13TWrz-SrGeeiV71%3DSE14DC4Jg%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
DOCS - Generated Column Replication Examples
Hi, A recent commit [1] added a new section "29.6. Generated Column Replication" to the documentation [2]. But, no "Examples" were included. I've created this new thread to make a case for adding an "Examples" subsection to that page. (the proposed examples patch is attached) == 1. Reasons AGAINST Adding Examples to this section (and why I disagree): 1a. The necessary information to describe the feature can already be found in the text. Yes, the documentation describes the feature, as it should. However, examples serve a different purpose: they reinforce understanding by demonstrating how the feature works in practice. They don’t introduce new details; they clarify and illustrate existing ones. 1b. I can understand the feature without needing examples. PostgreSQL serves a diverse user base with different levels of expertise and learning styles. While experienced users might grasp the feature quickly, others may struggle with dense explanatory text and behaviour matrices. Why knowingly make it harder for them? If examples can help some users, and others can simply skip them, there's no downside to including them. 1c. Too many examples will swamp the page. Agreed—too many examples would be excessive. That’s why the proposed patch includes only a couple of well-chosen cases, rather than covering every possible combination. This keeps the documentation focused and useful without overwhelming the reader. ~~~ 2. Reasons FOR Adding Examples to this section: 2a. To emphasise some special and/or subtle rules. Some critical details—like how publication column lists override the publish_generated_columns setting—are easy to overlook in a dense set of behaviour rules. An example makes this clear by demonstrating the effect directly, ensuring users don't miss it. 2b. To make it easy for users to try it themselves. These examples are self-contained including the necessary publication/subscription and tables. A user can simply cut/paste these to PSQL and begin experimenting with the feature immediately, instead of piecing together a working setup from scattered explanations. This reduces friction and encourages hands-on learning. 2c. It may become inevitable later anyway. With the eventual introduction of VIRTUAL generated columns, replication behaviour will become even more complex, requiring more rules and expanded behaviour matrices. Including examples now lays the groundwork for future enhancements ~~~ Thoughts? == [1] https://github.com/postgres/postgres/commit/6252b1eaf82bb8db361341d1c8651e43b29be816#diff-e593b875ae1b49848e9954e7a3ea26fb34de7454b338a0029aec72b0fb25bb0a [2] https://www.postgresql.org/docs/devel/logical-replication-gencols.html Kind Regards, Peter Smith. Fujitsu Australia. v1-0001-DOCS-Generated-Column-Replication-Examples.patch Description: Binary data
Re: pgbench with partitioned tables
I was looking at the comments [1] for why COPY FREEZE is not allowed on a parent table, and it was mainly due to having to open up partitions to check whether they are able to take the optimization (table created or truncated in the current transaction). Obviously, as the number of partitions grows, these checks take longer to perform. My suspicion is that in the cases where partitions are used, the benefits of COPY FREEZE could outweigh the overhead of these additional checks. So while we could try to solve the COPY FREEZE issue specifically for pgbench, I wonder if we should try to do better and see if the limitation on a partitioned parent can be removed. I can provide a patch and some benchmark numbers, unless there is something bigger I am missing about the reason this limitation exists?

[1] https://github.com/postgres/postgres/blob/master/src/backend/commands/copyfrom.c#L727-L735

Regards,
Sami
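To see the limitation concretely, here is roughly what fails today (a sketch with made-up table and file names; the exact error text may differ slightly):

BEGIN;
CREATE TABLE accounts (aid int, abalance int) PARTITION BY RANGE (aid);
CREATE TABLE accounts_1 PARTITION OF accounts FOR VALUES FROM (1) TO (1000000);
COPY accounts FROM '/tmp/accounts.csv' WITH (FORMAT csv, FREEZE);
-- ERROR:  cannot perform COPY FREEZE on a partitioned table
COMMIT;

Even though both the parent and its only partition were created in the same transaction, the parent is rejected outright rather than having its partitions checked, which is the check described in the copyfrom.c comment linked above.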
Re: Using Expanded Objects other than Arrays from plpgsql
On Sun, Feb 2, 2025 at 1:57 PM Tom Lane wrote: > I wrote: > > Hmm, it seemed to still apply for me. But anyway, I needed to make > > the other changes, so here's v4. > > PFA v5. The new 0001 patch refactors the free_xxx infrastructure > to create plpgsql_statement_tree_walker(), and then in what's now > 0003 we can use that instead of writing a lot of duplicate code. > Thanks Tom! These patches apply for me and all my tests and benchmarks are still good. -Michel
Re: Increased work_mem for "logical replication tablesync worker" only?
On Sun, Feb 2, 2025 at 5:13 PM Dmitry Koterov wrote: > > Trying to monitor perf during the initial tablesync phase (COPY) right after > CREATE SUBSCRIPTION. I noticed that the size of 17/main/base/pgsql_tmp on the > destination node grows (tens of gigabytes) as the COPY command (running > internally on the publisher) progresses. Then in the end (when its "EXPLAIN > SELECT 1 FROM tbl" on the destination shows the approximate number of rows > equals to the number of rows on the source node) it hangs for several > minutes, and then 17/main/base/pgsql_tmp empties, and the subscription > progresses. > > It seems like if I increase work_mem to several GB, then the growth of > 17/main/base/pgsql_tmp becomes less significant. > > Questions: > > 1. Are there some diagnostics commands that would allow me to figure out what > is in those tmp files? Why does the subscriber create those tmp files and not > just write directly to the data files and WAL? (The table has 2 bytea > columns, i.e. it's TOASTed for sure.) > We do write spill files (ending with '.spill') if the changes are large. Can you please share the name of tmp files to avoid any assumptions? -- With Regards, Amit Kapila.
Re: jsonlog missing from logging_collector description
On Fri, Jan 31, 2025 at 09:32:42PM -0500, Tom Lane wrote: > Sure, fine by me. This one has been done as d61b9662b09e. -- Michael signature.asc Description: PGP signature
Re: pg_rewind with --write-recovery-conf option doesn't write dbname to primary_conninfo value.
On Thu, Jan 30, 2025 at 3:17 AM Masahiko Sawada wrote: > > I found that pg_rewind with --write-recovery-conf option doesn't write > the dbname to an auto-generated primary_conninfo value. Therefore, > after a failover, the old primary cannot start if it's rewound by > pg_rewind with --write-recovery-conf option if sync_replication_slots > is on. Is it an oversight in commit a145f424d5? > IIRC, we tried to do minimal in the first version as we didn't have all the use cases. However, it makes sense to support this in HEAD as proposed by you. -- With Regards, Amit Kapila.
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Fri, Jan 31, 2025 at 8:02 PM Amit Kapila wrote: > > On Fri, Jan 31, 2025 at 10:40 AM Peter Smith wrote: > > > > == > > src/backend/replication/slot.c > > > > ReportSlotInvalidation: > > > > 1. > > + > > + case RS_INVAL_IDLE_TIMEOUT: > > + Assert(inactive_since > 0); > > + /* translator: second %s is a GUC variable name */ > > + appendStringInfo(&err_detail, > > + _("The slot has remained idle since %s, which is longer than the > > configured \"%s\" duration."), > > + timestamptz_to_str(inactive_since), > > + "idle_replication_slot_timeout"); > > + break; > > + > > > > errdetail: > > > > I guess it is no fault of this patch because I see you've only copied > > nearby code, but AFAICT this function is still having an each-way bet > > by using a mixture of _() macro which is for strings intended be > > translated, but then only using them in errdetail_internal() which is > > for strings that are NOT intended to be translated. Isn't it > > contradictory? Why don't we use errdetail() here? > > > > Your question is valid and I don't have an answer. I encourage you to > start a new thread to clarify this. > I think this was a false alarm. After studying this more deeply, I've changed my mind and now think the code is OK as-is. AFAICT errdetail_internal is used when not wanting to translate the *fmt* string passed to it (see EVALUATE_MESSAGE in elog.c). Now, here the format string is just "%s" so it's fine to not translate that. Meanwhile, the string value being substituted to the "%s" was already translated because of the _(x) macro aka gettext(x). I found other examples similar to this -- see the error_view_not_updatable() function in rewriteHandler.c which does: ereport(ERROR, ... detail ? errdetail_internal("%s", _(detail)) : 0, ... == Kind Regards, Peter Smith. Fujitsu Australia
Re: [PATCH] Fix incorrect range in pg_regress comment
Michael Paquier writes: > Right, let's adjust the comment to reflect what the docs say, as your > patch does. I presume that Tom will do that.. If I complain about it I gotta fix it, huh? Okay, done. regards, tom lane
Re: Proposal to CREATE FOREIGN TABLE LIKE
On Feb 2, 2025 at 21:24 +0800, Álvaro Herrera wrote:
> Eh yeah, I guess for this use case it makes sense to allow a LIKE clause on CREATE FOREIGN TABLE. Were you going to submit a patch?

Hi,

Yes, I would like to provide a patch. Glad to see we have come to an agreement on this.

Zhang Mingli
www.hashdata.xyz
Re: Using Expanded Objects other than Arrays from plpgsql
I wrote: > Hmm, it seemed to still apply for me. But anyway, I needed to make > the other changes, so here's v4. I decided to see what would happen if we tried to avoid the code duplication in pl_funcs.c by making some "walker" infrastructure akin to expression_tree_walker. While that doesn't seem useful for the dump_xxx functions, it works very nicely for the free_xxx functions and now for the mark_xxx ones as well. pl_funcs.c nets out about 400 lines shorter than in the v4 patch. The code coverage score for the file is still awful :-(, but that's because we're not testing the dump_xxx functions at all. PFA v5. The new 0001 patch refactors the free_xxx infrastructure to create plpgsql_statement_tree_walker(), and then in what's now 0003 we can use that instead of writing a lot of duplicate code. regards, tom lane From bec67b472a0f9b237c5ed1feffd01ee4428b0688 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Sun, 2 Feb 2025 16:05:01 -0500 Subject: [PATCH v5 1/5] Refactor pl_funcs.c to provide a usage-independent tree walker. We haven't done this up to now because there was only one use-case, namely plpgsql_free_function_memory's search for expressions to clean up. However an upcoming patch has another need for walking plpgsql functions' statement trees, so let's create sharable tree-walker infrastructure in the same style as expression_tree_walker(). This patch actually makes the code shorter, although that's mainly down to having used a more compact coding style. (I didn't write a separate subroutine for each statement type, and I made use of some newer notations like foreach_ptr.) Discussion: https://postgr.es/m/CACxu=vjakfnsyxoosnw1wegsao5u_v1xybacfvj14wgjv_p...@mail.gmail.com --- src/pl/plpgsql/src/pl_funcs.c | 582 ++ 1 file changed, 244 insertions(+), 338 deletions(-) diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c index 8c827fe5cc..88e25b54bc 100644 --- a/src/pl/plpgsql/src/pl_funcs.c +++ b/src/pl/plpgsql/src/pl_funcs.c @@ -334,387 +334,291 @@ plpgsql_getdiag_kindname(PLpgSQL_getdiag_kind kind) /** - * Release memory when a PL/pgSQL function is no longer needed + * Support for recursing through a PL/pgSQL statement tree * - * The code for recursing through the function tree is really only - * needed to locate PLpgSQL_expr nodes, which may contain references - * to saved SPI Plans that must be freed. The function tree itself, - * along with subsidiary data, is freed in one swoop by freeing the - * function's permanent memory context. + * The point of this code is to encapsulate knowledge of where the + * sub-statements and expressions are in a statement tree, avoiding + * duplication of code. The caller supplies two callbacks, one to + * be invoked on statements and one to be invoked on expressions. + * (The recursion should be started by invoking the statement callback + * on function->action.) The statement callback should do any + * statement-type-specific action it needs, then recurse by calling + * plpgsql_statement_tree_walker(). The expression callback can be a + * no-op if no per-expression behavior is needed. 
**/ -static void free_stmt(PLpgSQL_stmt *stmt); -static void free_block(PLpgSQL_stmt_block *block); -static void free_assign(PLpgSQL_stmt_assign *stmt); -static void free_if(PLpgSQL_stmt_if *stmt); -static void free_case(PLpgSQL_stmt_case *stmt); -static void free_loop(PLpgSQL_stmt_loop *stmt); -static void free_while(PLpgSQL_stmt_while *stmt); -static void free_fori(PLpgSQL_stmt_fori *stmt); -static void free_fors(PLpgSQL_stmt_fors *stmt); -static void free_forc(PLpgSQL_stmt_forc *stmt); -static void free_foreach_a(PLpgSQL_stmt_foreach_a *stmt); -static void free_exit(PLpgSQL_stmt_exit *stmt); -static void free_return(PLpgSQL_stmt_return *stmt); -static void free_return_next(PLpgSQL_stmt_return_next *stmt); -static void free_return_query(PLpgSQL_stmt_return_query *stmt); -static void free_raise(PLpgSQL_stmt_raise *stmt); -static void free_assert(PLpgSQL_stmt_assert *stmt); -static void free_execsql(PLpgSQL_stmt_execsql *stmt); -static void free_dynexecute(PLpgSQL_stmt_dynexecute *stmt); -static void free_dynfors(PLpgSQL_stmt_dynfors *stmt); -static void free_getdiag(PLpgSQL_stmt_getdiag *stmt); -static void free_open(PLpgSQL_stmt_open *stmt); -static void free_fetch(PLpgSQL_stmt_fetch *stmt); -static void free_close(PLpgSQL_stmt_close *stmt); -static void free_perform(PLpgSQL_stmt_perform *stmt); -static void free_call(PLpgSQL_stmt_call *stmt); -static void free_commit(PLpgSQL_stmt_commit *stmt); -static void free_rollback(PLpgSQL_stmt_rollback *stmt); -static void free_expr(PLpgSQL_expr *expr); +typedef void (*plpgsql_stmt_walker_callback) (PLpgSQL_stmt *stmt, + void *context); +typedef void (*plpgsql_expr_walker_callback) (PLpgSQL
Re: [PATCH] Fix incorrect range in pg_regress comment
On Sun, 2 Feb 2025 at 22:26, Tom Lane wrote: > Hmm, our convention is definitely that the numbers start with 1, > so I do not want to make this change. Maybe we should change > the code instead. That would require any extensions that use the _0.out suffix to update all those files to use _1.out as the suffix. One such extension is Citus[1]. That seems like unnecessary extension churn with little benefit. So my vote would be to update the comment (as is done in patch v1). [1]: https://github.com/citusdata/citus/blob/main/src/test/regress/expected/columnar_lz4_0.out
Re: [PATCH] Fix incorrect range in pg_regress comment
Jelte Fennema-Nio writes: > On Sun, 2 Feb 2025 at 22:26, Tom Lane wrote: >> Hmm, our convention is definitely that the numbers start with 1, >> so I do not want to make this change. Maybe we should change >> the code instead. > That would require any extensions that use the _0.out suffix to update > all those files to use _1.out as the suffix. One such extension is > Citus[1]. Oh. I see we did document it as 0-9 in [1], so I guess we're stuck with that now. Objection withdrawn. regards, tom lane [1] https://www.postgresql.org/docs/devel/regress-variant.html
Re: Proposal to Enable/Disable Index using ALTER INDEX (with patch)
hi. modules/test_ddl_deparse/test_ddl_deparse.so.p/test_ddl_deparse.c.o -c ../../Desktop/pg_src/src7/postgres/src/test/modules/test_ddl_deparse/test_ddl_deparse.c ../../Desktop/pg_src/src7/postgres/src/test/modules/test_ddl_deparse/test_ddl_deparse.c: In function ‘get_altertable_subcmdinfo’: ../../Desktop/pg_src/src7/postgres/src/test/modules/test_ddl_deparse/test_ddl_deparse.c:111:17: error: enumeration value ‘AT_SetIndexVisible’ not handled in switch [-Werror=switch] 111 | switch (subcmd->subtype) | ^~ ../../Desktop/pg_src/src7/postgres/src/test/modules/test_ddl_deparse/test_ddl_deparse.c:111:17: error: enumeration value ‘AT_SetIndexInvisible’ not handled in switch [-Werror=switch] cc1: all warnings being treated as errors so we need to change test_ddl_deparse.c. The attached patch fixes the indentation and test_ddl_deparse.c issue. Maybe we can add some tests on src/test/modules/test_ddl_deparse. v10-0001-minor-indentation-fix.no-cfbot Description: Binary data
Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData
On Fri, Jan 31, 2025 at 04:54:17PM +0300, Vitaly Davydov wrote: > I'm looking at the v13 patch. I see, there is the only file for v13: > v2-0002-Fix-issues-with-2PC-file-handling-at-recovery-13.txt Yes. The fixes in 13~16 are simpler. There are so many conflicts across all the branches that I'm planning to maintain branches for each stable versions, depending on the discussion going on. The refactoring of 17~ and HEAD could be done first, I suppose. > #1. In RecoverPreparedTransactions we iterave over all in-memory two phase > states and check every xid state in CLOG unconditionally. Image, we have > a very old transaction that is close to the oldestXid. Will CLOG state be > available for such transaction? I'm not sure about it. Yes. RecoverPreparedTransactions() where the cluster is officially at the end of recovery, and we've checked that we are consistent. > #2. In restoreTwoPhaseData we load all the twostate files that are in > the valid xid range (from oldestXid to nextFullXid in terms of logical > comparision of xids with epoch). The question - should we load files > which xids greater than ControlFile->checkPointCopy.nextXid > (xids after last checkpoint). In general, all twophase files should belong > to xids before the last checkpoint. I guess, we should probably ignore > files which xid is equal or greater than the xid of the last checkpoint - > twostate data should be in the WAL. If not, I guess, we may see error messages > like show below when doing xact_redo -> PrepareRedoAdd: We are already using the nextFullXid from ShmemVariableCache, which is something that StartupXLOG() fetches from the checkpoint record we define as the starting point of recovery, which provides a protection good enough. Grabbing this information from checkPointCopy.nextXid would be actually worse when recovering from a backup_label, no? The copy of the checkpoint record in the control file would be easily newer than the checkpoint record retrieved from the LSN in the backup_label, leading to less files considered as in the future. I'd rather trust what we have in WAL a maximum for this data. Note that there is a minor release next week on the roadmap, and for clarity I am not planning to rush this stuff in the tree this week: https://www.postgresql.org/developer/roadmap/ Let's give it more time to find a solution we're all OK with. I'd love to hear from Noah here as well, as he got involved in the discussion of the other thread. -- Michael signature.asc Description: PGP signature
Re: Vacuum statistics
On Mon, Jan 13, 2025 at 3:26 PM Alena Rybakina wrote:
> I noticed that the cfbot is bad, the reason seems to be related to the lack of a parameter in src/backend/utils/misc/postgresql.conf.sample. I added it, it should help.

The patch doesn't apply cleanly. Please rebase.

I see you introduced a new GUC variable, pgstat_track_vacuum_statistics, which should address the increased size of statistics. However, I don't see how it could affect the size of the PgStat_StatTabEntry struct. It seems that when pgstat_track_vacuum_statistics == 0, extended vacuum statistics are not collected, but the size of the hash table entries stays the same. Also, should pgstat_track_vacuum_statistics affect per-database statistics as well?

The name of 0001 is "... on heap relations". Should we say "on table relations" instead, because the new machinery should work with alternative table AMs as well?

There are deletions of empty lines in src/include/utils/pgstat_internal.h and src/include/pgstat.h. Please remove them, as that is not the purpose of this patchset.

--
Regards,
Alexander Korotkov
Supabase
Re: Non-text mode for pg_dumpall
Thanks Jian for review, testing and delta patches. On Wed, 29 Jan 2025 at 15:09, jian he wrote: > > hi. > > we need to escape the semicolon within the single quotes or double quotes. > I think my patch in [1] is correct. > > we can have "ERROR: role "z" already exists > but > error message like > pg_restore: error: could not execute query: "ERROR: unterminated > quoted string at or near "'; > should not be accepted in execute_global_sql_commands, ReadOneStatement, PQexec > > attached is the all the corner test case i come up with against > ReadOneStatement. > your v13 will generate errors like "ERROR: unterminated quoted string > at or near ..."', > which is not good, i think. > > [1] https://www.postgresql.org/message-id/CACJufxEQUcjBocKJQ0Amf3AfiS9wFB7zYSHrj1qqD_oWeaJoGQ%40mail.gmail.com Yes, you are right. We can't read line by line. We should read char by char and we need some extra handling for double quote names. I have merged your delta patch into this and now I am doing some more testing for corner cases of this type of names. *Ex*: add some comments in names etc or multiple semicolons or other special characters in name. On Fri, 31 Jan 2025 at 09:23, jian he wrote: > > hi. > > -extern void RestoreArchive(Archive *AHX); > +extern void RestoreArchive(Archive *AHX, bool append_data); > Can we spare some words to explain the purpose of append_data. Fixed. I added some comments on the top of the RestoreArchive function. > > in get_dbname_oid_list_from_mfile > pg_log_info("map.dat file is not present in dump of > pg_dumpall, so nothing to restore."); > maybe we can change it to > pg_log_info("databases restoring is skipped as map.dat file is > not present in \"%s\"", dumpdirpath); Fixed. > we can aslo add Assert(dumpdirpath != NULL) No, we don't need it as we are already checking inputfileSpec!= NULL. > > pg_log_info("found dbname as : \"%s\" and db_oid:%u in map.dat file > while restoring", dbname, db_oid); > also need to change. maybe > pg_log_info("found database \"%s\" (OID: %u) in map.dat file while > restoring.", dbname, db_oid); Fixed. > > I also did some minor refactoring, please check attached. Thanks. I merged it. > > > doc/src/sgml/ref/pg_restore.sgml > > pg_restore > > >restore a PostgreSQL database from an >archive file created by pg_dump > > > need to change, since now we can restore multiple databases. Agreed. I added some comments. > > doc/src/sgml/ref/pg_dumpall.sgml > > pg_dumpall > extract a PostgreSQL database > cluster into a script file > > also need change. On Sat, 1 Feb 2025 at 21:36, Srinath Reddy wrote: > > Hi, > i think we have to change the pg_dumpall "--help" message similar to pg_dump's specifying that now pg_dumpall dumps cluster into to other non-text formats. > Need similar "--help" message change in pg_restore to specify that now pg_restore supports restoring whole cluster from archive created from pg_dumpall. As Jian suggested, we need to change docs so I did the same changes into doc and --help also. On Fri, 31 Jan 2025 at 14:22, jian he wrote: > > hi. > more small issues. > > + count_db++; /* Increment db couter. */ > + dboidprecell = dboid_cell; > + } > + > typo, "couter" should be "counter". Fixed. > > + > +/* > + * get_dbname_oid_list_from_mfile > + * > + * Open map.dat file and read line by line and then prepare a list of database > + * names and correspoding db_oid. > + * > typo, "correspoding" should be "corresponding". Fixed. > > > execute_global_sql_commands comments didn't mention ``IF (outfile) `` > branch related code. 
> We can add some comments saying that > ""IF opts->filename is not specified, then copy the content of > global.dat to opts->filename""". We already have some comments on the top of the execute_global_sql_commands function. > > or split it into two functions. Done. I added a new function for outfile. > > > + while((fgets(line, MAXPGPATH, pfile)) != NULL) > + { > + Oid db_oid; > + char db_oid_str[MAXPGPATH + 1]; > + chardbname[MAXPGPATH + 1]; > + > + /* Extract dboid. */ > + sscanf(line, "%u" , &db_oid); > + sscanf(line, "%s" , db_oid_str); > + > + /* Now copy dbname. */ > + strcpy(dbname, line + strlen(db_oid_str) + 1); > + > + /* Remove \n from dbanme. */ > + dbname[strlen(dbname) - 1] = '\0'; > + > + pg_log_info("found dbname as : \"%s\" and db_oid:%u in map.dat file > while restoring", dbname, db_oid); > + > + /* Report error if file has any corrupted data. */ > + if (!OidIsValid(db_oid) || strlen(dbname) == 0) > + pg_fatal("invalid entry in map.dat file at line : %d", count + 1); > + > + /* > + * XXX : before adding dbname into list, we can verify that this db > + * needs to skipped for restore or not but as of now, we are making > + * a list of all the databases. > + */ > + simple_db_oid_list_append(dbname_oid_list, db_oid, dbname); > + count++; > + } > > > db_oid first should be set to 0, dbname first character first should be set to 0 > (char
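For reference, the kind of corner-case identifiers being tested here can be created with something along these lines (purely illustrative):

CREATE ROLE "role;with;semicolons";
CREATE DATABASE "db name with ""double"" quotes, a 'single' quote and a ; semicolon";

A reader that naively splits global.dat on semicolons or newlines mishandles such names, which is why ReadOneStatement has to track quoting state character by character.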
meson's in-tree libpq header search order vs -Dextra_include_dirs
When I use configure/make and --with-includes=/usr/local/include, I see compile lines like this: ... -I../../src/interfaces/libpq -I../../src/include -I/usr/local/include ... That's important, because if I happen to have libpq headers installed on the system I don't want to be confused by them. Yet meson's -Dextra_include_dirs=/usr/local/include compiles like this: ... -I../src/include -I/usr/local/include -Isrc/interfaces/libpq ... ... which gives me errors due to the v16 headers I happen to have installed: ../src/fe_utils/connect_utils.c:164:3: error: unknown type name 'PGcancelConn'; did you mean 'PGcancel'? 164 | PGcancelConn *cancelConn = PQcancelCreate(conn); I guess this affects Macs, BSDs and a few lesser-used Linux distros that put optional packages outside the base system and default search paths. Perhaps you can see this on everything-goes-into-base-system distros if you redundantly say -Dextra_include_dirs=/usr/include; I didn't check whether that is true (it wouldn't be if meson is opinionated enough to remove it). Reordering obvious mentions of libpq, as in the attached, gets past most of them, but I couldn't immediately figure out how to reorder src/test/isolation/meson.build and gave up and uninstalled the interfering libpq package for now. Dumping these observations here as this is as far as I got with this impediment today. wip-reorder-include.diff Description: Binary data
Re: Proposal to CREATE FOREIGN TABLE LIKE
On Mon, Feb 03, 2025 at 06:22:13AM +0800, Mingli Zhang wrote: > Yes, I would like to provide a patch. > > Glad to see we have come to an agreement on this. Just adding my +1 here. FWIW. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Fix incorrect range in pg_regress comment
On Sun, Feb 02, 2025 at 05:01:33PM -0500, Tom Lane wrote: > Oh. I see we did document it as 0-9 in [1], so I guess we're > stuck with that now. Objection withdrawn. Oh. I didn't know that 0 was accepted. We learn new things every day. Right, let's adjust the comment to reflect what the docs say, as your patch does. I presume that Tom will do that.. -- Michael signature.asc Description: PGP signature
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi Nisha, Some review comments for v66-0001. == src/backend/replication/slot.c ReportSlotInvalidation: 1. StringInfoData err_detail; + StringInfoData err_hint; bool hint = false; initStringInfo(&err_detail); + initStringInfo(&err_hint); I don't think you still need the 'hint' boolean anymore. Instead of: hint ? errhint("%s", err_hint.data) : 0); You could just do something like: err_hint.len ? errhint("%s", err_hint.data) : 0); ~~~ 2. + appendStringInfo(&err_hint, "You might need to increase \"%s\".", + "max_slot_wal_keep_size"); break; 2a. In this case, shouldn't you really be using macro _("You might need to increase \"%s\".") so that the common format string would be got using gettext()? ~ 2b. Should you include a /* translator */ comment here? Other places where GUC name is substituted do this. ~~~ 3. + appendStringInfo(&err_hint, "You might need to increase \"%s\".", + "idle_replication_slot_timeout"); + break; 3a. Ditto above. IMO this common format string should be got using macro. e.g.: _("You might need to increase \"%s\".") ~ 3b. Should you include a /* translator */ comment here? Other places where GUC name is substituted do this. == Kind Regards, Peter Smith. Fujitsu Australia
Re: Show WAL write and fsync stats in pg_stat_io
On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote: > On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot > wrote: >> I think that's the main reason why ff99918c625 added this new GUC (looking at >> the commit message). I'd feel more comfortable if we keep it. > > As Michael suggested, I will run a couple of benchmarks to see the > actual effect of this change. Then let's see if this affects anything. I've looked at bit at all that today, and something like the attached is what seems like the best streamlined version to me for the main feature. I am also planning to run some short benchmarks with track_io_timing=on on HEAD and with the patch, then see the difference, without any relationship to track_wal_io_timing. The comment additions in pgstat_count_io_op_time() were worth a patch of their own. This part has been applied as b998fedab74c, after a few tweaks of my own. -- Michael From f3bff6e4a646cb90fd5ba0e178e282d189eda39e Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Mon, 27 Jan 2025 14:03:54 +0300 Subject: [PATCH v13] Add WAL I/O stats to both pg_stat_io view and per backend I/O statistics This commit adds WAL I/O stats to both pg_stat_io view and per backend I/O statistics (pg_stat_get_backend_io()). This commit introduces a three new I/O concepts: - IOObject IOOBJECT_WAL is used for tracking all WAL I/Os. - IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations done on already created wal segments. - IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done while creating the WAL segments. XXX: Bump catalog version. XXX: Bump pgstats file version. --- src/include/pgstat.h| 4 +- src/backend/access/transam/xlog.c | 39 +++--- src/backend/access/transam/xlogreader.c | 10 +++ src/backend/access/transam/xlogrecovery.c | 12 src/backend/utils/activity/pgstat_backend.c | 9 +-- src/backend/utils/activity/pgstat_io.c | 79 +++-- src/test/regress/expected/stats.out | 53 ++ src/test/regress/sql/stats.sql | 25 +++ doc/src/sgml/monitoring.sgml| 19 - 9 files changed, 212 insertions(+), 38 deletions(-) diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 81ec0161c09..a61b488e8d8 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -274,14 +274,16 @@ typedef enum IOObject { IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION, + IOOBJECT_WAL, } IOObject; -#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1) +#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1) typedef enum IOContext { IOCONTEXT_BULKREAD, IOCONTEXT_BULKWRITE, + IOCONTEXT_INIT, IOCONTEXT_NORMAL, IOCONTEXT_VACUUM, } IOContext; diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 90ade4e7d39..eb19cf5690c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2435,16 +2435,19 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) { errno = 0; -/* Measure I/O timing to write WAL data */ -if (track_wal_io_timing) - INSTR_TIME_SET_CURRENT(start); -else - INSTR_TIME_SET_ZERO(start); +/* + * Measure I/O timing to write WAL data, for pg_stat_wal + * and/or pg_stat_io. + */ +start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing); pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); pgstat_report_wait_end(); +pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, + IOOP_WRITE, start, 1, written); + /* * Increment the I/O timing and the number of times WAL data * were written out to disk. 
@@ -3216,6 +3219,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, int fd; int save_errno; int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY; + instr_time io_start; Assert(logtli != 0); @@ -3259,6 +3263,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); + /* Measure I/O timing when initializing segment */ + io_start = pgstat_prepare_io_time(track_io_timing); + pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); save_errno = 0; if (wal_init_zero) @@ -3294,6 +3301,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, } pgstat_report_wait_end(); + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE, + io_start, 1, wal_segment_size); + if (save_errno) { /* @@ -3310,6 +3320,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, errmsg("could not write to file \"%s\": %m", tmppath))); } + /* Measure I/O timing when flushing segment */ + io_start = pgstat_prepare_io_time(track_io_timing); + pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC); if (pg_fsync(fd) != 0) { @@ -3322,6 +3335,9 @@ XLogF
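Assuming the view exposes the new IOObject as 'wal' (naming following the patch above), the effect should be visible with a query along these lines:

-- after generating some WAL (a few inserts, a checkpoint, ...):
SELECT backend_type, object, context, writes, write_time, fsyncs, fsync_time
  FROM pg_stat_io
 WHERE object = 'wal';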
Re: Fix assert failure when decoding XLOG_PARAMETER_CHANGE on primary
On Fri, Jan 24, 2025 at 4:05 AM Masahiko Sawada wrote: > > When a standby replays a XLOG_PARAMETER_CHANGE record that lowers > wal_level from logical, we invalidate all logical slots only when the > standby is in hot standby mode: > > if (InRecovery && InHotStandby && > xlrec.wal_level < WAL_LEVEL_LOGICAL && > wal_level >= WAL_LEVEL_LOGICAL) > InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL, >0, InvalidOid, >InvalidTransactionId); > > However, it's possible that this record is replayed when not in hot > standby mode and the slot is used after the promotion. In this case, > the following Assert in xlog_decode() fails: > > /* > * This can occur only on a standby, as a primary would > * not allow to restart after changing wal_level < logical > * if there is pre-existing logical slot. > */ Shouldn't we do similar to what this comment indicates on standby? We can disallow to start the server as standby, if the hot_standby is off and there is a pre-existing logical slot. > Assert(RecoveryInProgress()); > ereport(ERROR, > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("logical decoding on standby requires \"wal_level\" >= > \"logical\" on the primary"))); > > Here is brief steps to reproduce this issue: > > 1. setup the primary and the standby servers (with hot_standby=on). > 2. create a logical slot on the standby. > 3. on standby, set hot_standby=off and restart. > 4. on primary, lower wal_level to replica and restart. > 5. promote the standby and execute the logical decoding. > > I've attached a small patch to fix this issue. With the patch, we > invalidate all logical slots even when not in hot standby, that is, > the patch just removes InHotStandby from the condition. Thoughts? > This can fix the scenario you described but I am not sure it is a good idea. We can consider the fix on these lines if you don't like the above suggestion. -- With Regards, Amit Kapila.
Re: NOT ENFORCED constraint feature
On Sat, Feb 1, 2025 at 8:31 PM jian he wrote: > > [...] > So the code should only call AlterConstrTriggerDeferrability, > not call ATExecAlterConstrEnforceability? Right. Thank you for the report. We need to know whether the enforceability and/or deferrability has actually been set before the catalog update. Have you started working on the ALTER ... CONSTRAINT for the check constraint? I am thinking of starting that. To fix this bug, we have two options: we could either throw an error as we don’t currently support altering enforceability and deferrability together (which I’m not a fan of), or we could refactor the ALTER ... CONSTRAINT code to include more information, which would allow us to perform the appropriate update action during the execution stage and would also help with altering check constraints. Regards, Amul
Re: NOT ENFORCED constraint feature
On Fri, Jan 31, 2025 at 7:10 PM Alvaro Herrera wrote: > > On 2025-Jan-31, Ashutosh Bapat wrote: > > > But if the constraint is NOT VALID and later marked as NOT ENFORCED, > > what is expected behaviour while changing it to ENFORCED? > > I think what you want is a different mode that would be ENFORCED NOT > VALID, which would be an extension of the standard, because the standard > does not support the concept of NOT VALID. So while I think what you > want is nice, I'm not sure that this patch necessarily must implement > it. > Here is my understanding behind this feature implementation -- I am not claiming to be 100% correct, I am confident I am not entirely wrong either. Let me explain with an example: imagine a user adds a VALID constraint to a table that already has data, and the user is completely sure that all the data complies with the constraint. Even in this case, the system still runs a validation check. This is expected behavior because the system can't just take the user's word for it -- it needs to explicitly confirm that the data is valid through validation. Now, with a NOT ENFORCED constraint, it's almost like the constraint doesn't exist, because no checks are being performed and there is no visible effect for the user, even though the constraint is technically still there. So when the constraint is switched to ENFORCED, we should be careful not to automatically mark it as validated (regardless of its previous validate status) unless the data is actually checked against the constraint -- treat as adding a new VALID constraint. Even if the user is absolutely sure the data complies, we should still run the validation to ensure reliability. In response to Ashutosh’s point about the VALID/NOT ENFORCED scenario: if a constraint is initially VALID, then marked as NOT ENFORCED, and later switched back to ENFORCED -- IMO, it shouldn't automatically be considered VALID. I had similar thoughts when working on a patch before v5, but after further discussion, I now agree that it makes more sense for the system to keep it as NOT VALID until the data has been validated against the constraint. This is especially important when a user adds the constraint, is confident the data complies, and then needs to enforce it. Validating the data ensures the integrity and consistency of the constraint. In short, even if the user is 100% confident, skipping validation is not an option. We need to make sure the constraint’s VALID status is accurate and reliable before marking it as validated. Regards, Amul
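To make the intended behaviour concrete, here is a rough SQL sketch (names are made up, and the exact syntax accepted by the patch set may differ slightly):

CREATE TABLE p (id int PRIMARY KEY);
CREATE TABLE c (pid int);
INSERT INTO c VALUES (1);   -- no matching row in p yet
ALTER TABLE c ADD CONSTRAINT c_pid_fkey
    FOREIGN KEY (pid) REFERENCES p (id) NOT ENFORCED;
-- nothing is checked here; the orphan row is silently accepted
ALTER TABLE c ALTER CONSTRAINT c_pid_fkey ENFORCED;
-- per the reasoning above, this should behave like adding a new VALID
-- constraint: the existing data is validated, so it fails on the orphan
-- row instead of silently ending up marked both valid and enforced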
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Feb 3, 2025 at 9:04 AM Peter Smith wrote: > > On Fri, Jan 31, 2025 at 8:02 PM Amit Kapila wrote: > > > > On Fri, Jan 31, 2025 at 10:40 AM Peter Smith wrote: > > > > > > == > > > src/backend/replication/slot.c > > > > > > ReportSlotInvalidation: > > > > > > 1. > > > + > > > + case RS_INVAL_IDLE_TIMEOUT: > > > + Assert(inactive_since > 0); > > > + /* translator: second %s is a GUC variable name */ > > > + appendStringInfo(&err_detail, > > > + _("The slot has remained idle since %s, which is longer than the > > > configured \"%s\" duration."), > > > + timestamptz_to_str(inactive_since), > > > + "idle_replication_slot_timeout"); > > > + break; > > > + > > > > > > errdetail: > > > > > > I guess it is no fault of this patch because I see you've only copied > > > nearby code, but AFAICT this function is still having an each-way bet > > > by using a mixture of _() macro which is for strings intended be > > > translated, but then only using them in errdetail_internal() which is > > > for strings that are NOT intended to be translated. Isn't it > > > contradictory? Why don't we use errdetail() here? > > > > > > > Your question is valid and I don't have an answer. I encourage you to > > start a new thread to clarify this. > > > > I think this was a false alarm. > > After studying this more deeply, I've changed my mind and now think > the code is OK as-is. > > AFAICT errdetail_internal is used when not wanting to translate the > *fmt* string passed to it (see EVALUATE_MESSAGE in elog.c). Now, here > the format string is just "%s" so it's fine to not translate that. > Meanwhile, the string value being substituted to the "%s" was already > translated because of the _(x) macro aka gettext(x). > I didn't get your point about " the "%s" was already translated because of ...". If we don't want to translate the message then why add '_(' to it in the first place? -- With Regards, Amit Kapila.
Re: why there is not VACUUM FULL CONCURRENTLY?
Alvaro Herrera wrote: > > > > From bf2ec8c5d753de340140839f1b061044ec4c1149 Mon Sep 17 00:00:00 2001 > > From: Antonin Houska > > Date: Mon, 13 Jan 2025 14:29:54 +0100 > > Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER > > commands. > > > @@ -950,8 +1412,46 @@ copy_table_data(Relation NewHeap, Relation OldHeap, > > Relation OldIndex, bool verb > > > + if (concurrent) > > + { > > + PgBackendProgress progress; > > + > > + /* > > +* Command progress reporting gets terminated at > > subtransaction > > +* end. Save the status so it can be eventually > > restored. > > +*/ > > + memcpy(&progress, &MyBEEntry->st_progress, > > + sizeof(PgBackendProgress)); > > + > > + /* Release the locks by aborting the subtransaction. */ > > + RollbackAndReleaseCurrentSubTransaction(); > > + > > + /* Restore the progress reporting status. */ > > + pgstat_progress_restore_state(&progress); > > + > > + CurrentResourceOwner = oldowner; > > + } > > I was looking at 0002 to see if it'd make sense to commit it ahead of a > fuller review of the rest, and I find that the reason for that patch is > this hunk you have here in copy_table_data -- you want to avoid a > subtransaction abort (which you use to release planner lock) clobbering > the status. I think this a bad idea. It might be better to handle this > in a different way, for instance > > 1) maybe have a flag that says "do not reset progress status during > subtransaction abort"; REPACK would set that flag, so it'd be able to > continue its business without having to memcpy the current status (which > seems like quite a hack) or restoring it afterwards. > > 2) maybe subtransaction abort is not the best way to release the > planning locks anyway. I think it might be better to have a > ResourceOwner that owns those locks, and we do ResourceOwnerRelease() > which would release them. I think this would be a novel usage of > ResourceOwner so it needs more research. But if this works, then we > don't need the subtransaction at all, and therefore we don't need > backend progress restore at all either. If this needs change, I prefer 2) because it's less invasive: 1) still affects the progress monitoring code. I'll look at it. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: UUID v7
Dear colleagues, I would like to present for discussion my attached new draft documentation on UUID functions (Section 9.14. UUID Functions), which replaces the previously proposed draft at https://www.postgresql.org/docs/devel/functions-uuid.html. I have preserved and significantly supplemented the text that was there. I have the following goals: 1. State that from now on, the function uuidv7(), rather than autoincrement, is the default choice for generating primary keys 2. Describe the advantages of uuidv7() over autoincrement and uuidv4() 3. Refute the often-cited imaginary disadvantages of UUIDv7 compared to autoincrement, such as: - Lower performance (see the refutation in the article "UUID Benchmark War" https://ardentperf.com/2024/02/03/uuid-benchmark-war/) - Disclosure of date and time of record creation in the table (in reality, the timestamp offset parameter distorts this information) 4. Confirm the fault tolerance of the uuidv7() function in all possible critical situations, namely: - System clock failure - Receiving an invalid value of the offset argument, which would result in a timestamp overflow or a negative timestamp Regards, Sergey Prokhorenko sergeyprokhore...@yahoo.com.au UUID Functions.docx Description: MS-Word 2007 document UUID Functions.sgml Description: Binary data
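The pattern the draft documentation recommends boils down to something like this (illustrative):

CREATE TABLE orders (
    id         uuid        PRIMARY KEY DEFAULT uuidv7(),
    created_at timestamptz NOT NULL    DEFAULT now(),
    payload    text
);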
Optimize scram_SaltedPassword performance
Hi Hackers, While profiling a program with `perf`, I noticed that `scram_SaltedPassword` consumed more CPU time than expected. After some investigation, I found that the function performs many HMAC iterations (4096 rounds for SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing excessive overhead. OpenSSL has an optimization for this case: when the key remains the same, the HMAC context can be reused with a lightweight state reset by passing NULL as the key. To take advantage of this, I introduced `pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used. With this change, the performance improves by approximately **4x** (reducing execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation does not support context reuse, and modifying it would require substantial changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that case, maintaining the existing logic. Regards, Zixuan >From e13d9e9c674d1cf4f671ca81b950ae010fd958a4 Mon Sep 17 00:00:00 2001 From: Zi-Xuan Fu Date: Mon, 3 Feb 2025 13:33:43 +0800 Subject: [PATCH v1] Optimize "scram_SaltedPassword" with OpenSSL HMAC context reuse `scram_SaltedPassword()` requires a large number of HMAC iterations, with SCRAM-SHA-256 defaulting to 4096 rounds. Previously, each iteration reinitialized the HMAC context, incurring significant overhead. However, OpenSSL optimizes this case by allowing context reuse when the key remains unchanged—requiring only a lightweight state reset by passing NULL as the key. To leverage this, I introduced `pg_hmac_reuse()`, which sets the key to NULL when OpenSSL is used. This optimization improves performance by approximately 4x (reducing execution time from 4ms to 1ms). The native PostgreSQL HMAC implementation does not support context reuse, and modifying it would require substantial changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that case, maintaining the existing logic. --- src/common/hmac.c | 11 +++ src/common/hmac_openssl.c | 12 src/common/scram-common.c | 2 +- src/include/common/hmac.h | 1 + 4 files changed, 25 insertions(+), 1 deletion(-) diff --git a/src/common/hmac.c b/src/common/hmac.c index 9b85910672..6de95c44ee 100644 --- a/src/common/hmac.c +++ b/src/common/hmac.c @@ -214,6 +214,17 @@ pg_hmac_init(pg_hmac_ctx *ctx, const uint8 *key, size_t len) return 0; } +/* + * pg_hmac_reuse + * + * Reuse a HMAC context with the same key. Returns 0 on success, -1 on failure. + */ +int +pg_hmac_reuse(pg_hmac_ctx *ctx, const uint8 *key, size_t len) +{ + return pg_hmac_init(ctx, key, len); +} + /* * pg_hmac_update * diff --git a/src/common/hmac_openssl.c b/src/common/hmac_openssl.c index 31cd188904..9e7a92884d 100644 --- a/src/common/hmac_openssl.c +++ b/src/common/hmac_openssl.c @@ -208,6 +208,18 @@ pg_hmac_init(pg_hmac_ctx *ctx, const uint8 *key, size_t len) return 0; } +/* + * pg_hmac_reuse + * + * Reuse a HMAC context with the same key. Returns 0 on success, -1 on failure. 
+ */ +int +pg_hmac_reuse(pg_hmac_ctx *ctx, const uint8 *key, size_t len) +{ + /* OpenSSL skips unnecessary reinitialization if the key is NULL */ + return pg_hmac_init(ctx, NULL, len); +} + /* * pg_hmac_update * diff --git a/src/common/scram-common.c b/src/common/scram-common.c index 400900c51c..5717daff28 100644 --- a/src/common/scram-common.c +++ b/src/common/scram-common.c @@ -84,7 +84,7 @@ scram_SaltedPassword(const char *password, CHECK_FOR_INTERRUPTS(); #endif - if (pg_hmac_init(hmac_ctx, (uint8 *) password, password_len) < 0 || + if (pg_hmac_reuse(hmac_ctx, (uint8 *) password, password_len) < 0 || pg_hmac_update(hmac_ctx, (uint8 *) Ui_prev, key_length) < 0 || pg_hmac_final(hmac_ctx, Ui, key_length) < 0) { diff --git a/src/include/common/hmac.h b/src/include/common/hmac.h index c4d069e49a..e9a0366d6e 100644 --- a/src/include/common/hmac.h +++ b/src/include/common/hmac.h @@ -22,6 +22,7 @@ typedef struct pg_hmac_ctx pg_hmac_ctx; extern pg_hmac_ctx *pg_hmac_create(pg_cryptohash_type type); extern int pg_hmac_init(pg_hmac_ctx *ctx, const uint8 *key, size_t len); +extern int pg_hmac_reuse(pg_hmac_ctx *ctx, const uint8 *key, size_t len); extern int pg_hmac_update(pg_hmac_ctx *ctx, const uint8 *data, size_t len); extern int pg_hmac_final(pg_hmac_ctx *ctx, uint8 *dest, size_t len); extern void pg_hmac_free(pg_hmac_ctx *ctx); -- 2.34.1
Re: SQL/JSON json_table plan clause
Nikita Malakhov писал(а) 2025-02-03 02:43: Hi hackers! Patch has been rebased onto the current master. -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/ Hi! Tested on Mac with 'make check', all tests passed OK. -- Best regards, Vladlen Popolitov.
Re: pgbench with partitioned tables
02.02.2025 20:45, Álvaro Herrera пишет: On 2025-Jan-31, Melanie Plageman wrote: Maybe instead of just not using COPY FREEZE on a table if it is partitioned, we could add new data generation init_steps. Perhaps one that is client-side data generation (g) but with no freezing? I'm not really sure what the letter would be (f? making it f, g, and G?). I think it makes sense to do what you suggest, but on the other hand, the original code that Sergey is patching looks like a hack in the sense that it hardcodes which tables to use FREEZE with. There's no point to doing things that way actually, so accepting Sergey's patch to replace that with a relkind check feels sensible to me. I think the query should be SELECT relkind FROM pg_catalog.pg_class WHERE oid='%s'::pg_catalog.regclass if only for consistency with pgbench's other query on catalogs. Your proposal to add different init_steps might be reasonable, at least if we allowed partitionedness of tables to vary in other ways (eg. if we made pgbench_history partitioned), but I don't think it conflicts with Sergey's patch in spirit. Thanks for the note. I changed the query in the patch (v2 patch attached) Btw, an additional benefit from the patch is that we can use foreign tables (for example, to test postgres_fdw optimizations) -- With best regards, Sergey Tatarintsev, PostgresPro From 9aaf2b222d6eb0c218c8a417280a8fdc0a42c296 Mon Sep 17 00:00:00 2001 From: Sergey Tatarintsev Date: Thu, 30 Jan 2025 14:52:29 +0700 Subject: [PATCH-v1] Fix pgbench client-side data generation for partitioned tables Client-side data generation cannot perform COPY WITH (FREEZE = ON) on partitioned tables except pgbench_accounts. Patch disables FREEZE for any non-regular tables --- src/bin/pgbench/pgbench.c | 34 +- 1 file changed, 25 insertions(+), 9 deletions(-) diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index b5cf3c6ed01..cac5312c4b1 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -847,6 +847,28 @@ static const PsqlScanCallbacks pgbench_callbacks = { NULL, /* don't need get_variable functionality */ }; +static bool +is_regular_table(PGconn *con, const char *table) +{ + PGresult *res; + char *sql = psprintf("SELECT relkind FROM pg_catalog.pg_class WHERE oid='%s'::pg_catalog.regclass", table); + bool is_regular_table = true; + + res = PQexec(con, sql); + if (PQresultStatus(res) != PGRES_TUPLES_OK) + { + pg_log_error("query failed: %s", PQerrorMessage(con)); + pg_log_error_detail("Query was: %s", sql); + exit(1); + } + if (!PQgetisnull(res, 0, 0)) + is_regular_table = strncmp(PQgetvalue(res, 0, 0), "r", 1) == 0; + + PQclear(res); + pfree(sql); + return is_regular_table; +} + static inline pg_time_usec_t pg_time_now(void) { @@ -4958,16 +4980,10 @@ initPopulateTable(PGconn *con, const char *table, int64 base, initPQExpBuffer(&sql); - /* - * Use COPY with FREEZE on v14 and later for all the tables except - * pgbench_accounts when it is partitioned. - */ - if (PQserverVersion(con) >= 14) - { - if (strcmp(table, "pgbench_accounts") != 0 || - partitions == 0) + /* Use COPY with FREEZE on v14 and later for all regular tables */ + if ((PQserverVersion(con) >= 14) && is_regular_table(con, table)) copy_statement_fmt = "copy %s from stdin with (freeze on)"; - } + n = pg_snprintf(copy_statement, sizeof(copy_statement), copy_statement_fmt, table); if (n >= sizeof(copy_statement)) -- 2.43.0
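For anyone following along, the relkind values the new check relies on are easy to inspect directly (illustrative):

SELECT relname, relkind
  FROM pg_catalog.pg_class
 WHERE relname LIKE 'pgbench%';
-- 'r' = regular table, 'p' = partitioned table, 'f' = foreign table;
-- with the patch, COPY ... WITH (FREEZE ON) is only attempted when relkind is 'r'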
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Feb 3, 2025 at 4:04 PM Amit Kapila wrote: > > On Mon, Feb 3, 2025 at 9:04 AM Peter Smith wrote: > > > > On Fri, Jan 31, 2025 at 8:02 PM Amit Kapila wrote: > > > > > > On Fri, Jan 31, 2025 at 10:40 AM Peter Smith > > > wrote: > > > > > > > > == > > > > src/backend/replication/slot.c > > > > > > > > ReportSlotInvalidation: > > > > > > > > 1. > > > > + > > > > + case RS_INVAL_IDLE_TIMEOUT: > > > > + Assert(inactive_since > 0); > > > > + /* translator: second %s is a GUC variable name */ > > > > + appendStringInfo(&err_detail, > > > > + _("The slot has remained idle since %s, which is longer than the > > > > configured \"%s\" duration."), > > > > + timestamptz_to_str(inactive_since), > > > > + "idle_replication_slot_timeout"); > > > > + break; > > > > + > > > > > > > > errdetail: > > > > > > > > I guess it is no fault of this patch because I see you've only copied > > > > nearby code, but AFAICT this function is still having an each-way bet > > > > by using a mixture of _() macro which is for strings intended be > > > > translated, but then only using them in errdetail_internal() which is > > > > for strings that are NOT intended to be translated. Isn't it > > > > contradictory? Why don't we use errdetail() here? > > > > > > > > > > Your question is valid and I don't have an answer. I encourage you to > > > start a new thread to clarify this. > > > > > > > I think this was a false alarm. > > > > After studying this more deeply, I've changed my mind and now think > > the code is OK as-is. > > > > AFAICT errdetail_internal is used when not wanting to translate the > > *fmt* string passed to it (see EVALUATE_MESSAGE in elog.c). Now, here > > the format string is just "%s" so it's fine to not translate that. > > Meanwhile, the string value being substituted to the "%s" was already > > translated because of the _(x) macro aka gettext(x). > > > > I didn't get your point about " the "%s" was already translated > because of ...". If we don't want to translate the message then why > add '_(' to it in the first place? > I think this is same point where I was fooling myself yesterday. In fact we do want to translate the message seen by the user. errdetail_internal really means don't translate the ***format string***. In our case "%s" is not the message at all -- it is just the a *format string* so translating "%s" is kind of meaningless. e.g. Normally errdetail("translate me") <-- This would translate the fmt string but here the fmt is also the message; i.e. it will do gettext("translate me") internally. errdetail_internal("translate me") <-- This won't translate anything; you will have the raw fmt string "translate me" ~~ But since ReportSlotInvalidation is building the message on the fly there is no single report so it is a bit different errdetail("%s", "translate me") <-- this would just use gettext("%s") which is kind of useless. And the "translate me" is just a raw string and won't be translated. errdetail_internal("%s", "translate me") <-- this won't translate anything; the fmt string and the "translate me" are just raw strings errdetail_internal("%s", _("translate me")) <-- This won't translate the fmt string, but to translate %s is useless anyway. OTOH, the _() macro means it will do gettext("translate me") so the "translate me" string will get translated before it is substituted. This is effectively what the ReportSlotInvalidation code is doing. == Kind Regards, Peter Smith. Fujitsu Australia
RE: pg_rewind with --write-recovery-conf option doesn't write dbname to primary_conninfo value.
Dear Sawada-san, > I think it's a good idea to support it at least on HEAD. I've attached > a patch for that. +1. I've confirmed that pg_rewind and -R can't output dbname for now, and your patch makes that possible. A few comments on your patch. 1. pg_basebackup.sgml has the description below. I feel this can be ported to pg_rewind.sgml as well. ``` The dbname will be recorded only if the dbname was specified explicitly in the connection string or environment variable. ``` 2. I'm not sure whether recovery_gen.h/c is the correct place for the exported function GetDbnameFromConnectionOptions(). The function itself does not handle the postgresql.auto.conf file. I looked through other header files and felt that connect_utils.h might be a better fit. ``` /*- * * Facilities for frontend code to connect to and disconnect from databases. ``` Another idea is to change the third API to accept the whole connection string and have it extract the dbname from it. In this approach we can make GetDbnameFromConnectionOptions() a static function, which does not feel strange to me. Best regards, Hayato Kuroda FUJITSU LIMITED
Re: Make COPY format extendable: Extract COPY TO format implementations
Sutou Kouhei wrote on 2025-02-01 17:12: Hi, Hi I would like to point out a security problem in your design of COPY TO/FROM. You use the FORMAT option to add new formats, filling it with a routine name from a shared library. As a result, any caller can call any routine in the PostgreSQL backend. I think it will start a competition over who can find the most dangerous routine to call from a plain COPY FROM command. The standard PostgreSQL approach for new methods is the USING keyword. Every new method could have its own options (FORMAT is an option of the internal 'copy from/to' methods); that assumes some SetOptions interface that defines an options structure according to the new method's requirements. I agree with the general direction of the extensibility, but it should be secure and consistent. -- Best regards, Vladlen Popolitov.
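To make the contrast concrete (the USING form below is hypothetical syntax sketching the suggestion, not something that exists today):

-- the proposal as described above: the format name is looked up as a handler routine
COPY t FROM '/tmp/data' WITH (FORMAT 'my_custom_format');
-- suggested alternative, by analogy with other extensible commands:
COPY t FROM '/tmp/data' USING my_copy_handler (some_option 'value');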
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Feb 3, 2025 at 6:16 AM Peter Smith wrote: > > > 2. > + appendStringInfo(&err_hint, "You might need to increase \"%s\".", > + "max_slot_wal_keep_size"); > break; > 2a. > In this case, shouldn't you really be using macro _("You might need to increase \"%s\".") so that the common format string would be got using gettext()? > > ~ > > > ~~~ > > 3. > + appendStringInfo(&err_hint, "You might need to increase \"%s\".", > + "idle_replication_slot_timeout"); > + break; > > 3a. > Ditto above. IMO this common format string should be got using macro. > e.g.: _("You might need to increase \"%s\".") > > ~ Instead, we can directly use '_(' in errhint as we are doing in one other similar place "errhint("%s", _(view_updatable_error));". I think we didn't use it for errdetail because, in one of the cases, it needs to use ngettext. -- With Regards, Amit Kapila.
Re: Optimize scram_SaltedPassword performance
03.02.2025 10:11, Zixuan Fu writes: > Hi Hackers, > > While profiling a program with `perf`, I noticed that `scram_SaltedPassword` > consumed more CPU time than expected. After some investigation, I found > that the function performs many HMAC iterations (4096 rounds for > SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing > excessive overhead. > > OpenSSL has an optimization for this case: when the key remains the > same, the HMAC context can be reused with a lightweight state reset by > passing NULL as the key. To take advantage of this, I introduced > `pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used. Good catch. Since pg_hmac_reuse is not `static`, I'd add some checks that the key is exactly the same. At least there should be Assert(key == prev_key && len == prev_len && hash_bytes(key, len) == prev_hash); where `prev_key`, `prev_len` and `prev_hash` are static variables filled in `pg_hmac_init`. I don't know whether it should be an `Assert` or a check that leads to `elog(ERROR)`. `hash_bytes` is fast enough to not cause a measurable slowdown in production. On the other hand, the use cases are trivial enough that occasional misuses would be caught using just `Assert`. > With this change, the performance improves by approximately **4x** (reducing > execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation > does not support context reuse, and modifying it would require substantial > changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that > case, maintaining the existing logic. --- regards, Yura
Re: NOT ENFORCED constraint feature
On 2025-Feb-03, Ashutosh Bapat wrote: > VALID, NOT ENFORCED changed to VALID, ENFORCED - data validation > required, constraint is enforced There's no such thing as a VALID NOT ENFORCED constraint. It just cannot exist. > NOT VALID, NOT ENFORCED changed to NOT_VALID, ENFORCED - no data > validation required, constraint is enforced on the new tuples/changes This may make sense, but it needs special nonstandard syntax. If you start with a NOT VALID NOT ENFORCED constraint (which is the only way to have a NOT ENFORCED constraint) and apply ALTER TABLE ALTER CONSTRAINT ENFORCE, you will end up with a VALID ENFORCED constraint, therefore validation must be run. If you wanted to add a nonstandard command ALTER TABLE ALTER CONSTRAINT ENFORCE NO VALIDATE then maybe the transition you suggest could be made. It should be a separate patch from regular ALTER CONSTRAINT ENFORCE though, just in case some problems with it emerge later and we're forced to revert it, we can still keep the standard command. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "Use it up, wear it out, make it do, or do without"
Re: UUID v7
> On 31 Jan 2025, at 23:49, Masahiko Sawada wrote: > > Thank you for the patch! I agree with the basic direction of this fix. > Here are some review comments: > > --- > -static inline int64 get_real_time_ns_ascending(); > +static inline uint64 get_real_time_ns_ascending(); > > IIUC we don't need to replace int64 with uint64 if we have two > separate parameters for generate_uuidv7(). It seems to be conventional > to use a signed int for timestamps. OK, done. > > --- > Need to update the function comment of generate_uuidv7() as we changed > the function arguments. Done. > > --- > - ns = (ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * > SECS_PER_DAY * USECS_PER_SEC) > - * NS_PER_US + ns % NS_PER_US; > + us = (ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * > SECS_PER_DAY * USECS_PER_SEC); > >/* Generate an UUIDv7 */ > - uuid = generate_uuidv7(ns); > + uuid = generate_uuidv7(us / 1000, (us % 1000) * 1000 + ns % > NS_PER_US); > > I think we can have an inline function or a marco (or use TMODULO()?) > to split nanoseconds into milliseconds and sub-milliseconds so that > uuidv7() and uuidv7_interval() can pass them to generate_uuidv7(). I doubt that such macro will make core more readable. I've replaced 1000 with macros. > > The comments in uuidv7_interval() also need to be updated accordingly. Done. > > --- > I think we need to consider how we can handle the timestamp shifting. > UUIDv7 contains 48 bits Unix timestamp at milliseconds precision, > which can represent timestamps approximately between 2493 BC and 6432 > AC. If users specify an interval to shift the timestamp beyond the > range, 48-bits timestamp would be wrapped around and they would not be > able to get an expected result. Do we need to raise an error in that > case? > > --- > Another problem I found in uuid_extract_timestamp() is that it cannot > correctly extract a timestamp before 1970/1/1 stored in a UUIDv7 > value: > > postgres(1:1795331)=# select year, uuid_extract_timestamp(uuidv7((year > || 'year ago')::interval)) from generate_series(54, 56) year; > year | uuid_extract_timestamp > --+- > 54 | 1971-01-31 10:46:25.111-08 > 55 | 1970-01-31 10:46:25.111-08 > 56 | 10888-09-01 17:18:15.768-07 > (3 rows) > > The problem is that we correctly store a negative timestamp value in a > UUIDv7 value but uuid_extract_timestamp() unconditionally treats it as > a positive timestamp value. I think this is a separate bug we need to > fix. RFC says unix_ts_ms is unsigned. So, luckily, no BC dates. I bet Pharaohs could not measure nanoseconds. I think it's totally fine to wrap UUID values around year 10598 without an error. I was thinking about incorporating test like this. >> With this patch we can generate correct UUIDs in a very distant future. 
>> postgres=# select x, >> >> uuid_extract_timestamp(uuidv7((x::text || ' >> year'::text)::interval)), >> (x::text || ' year'::text)::interval >> from generate_series(1,9000,1000) x; >> x | uuid_extract_timestamp| interval >> --+-+ >>1 | 2026-01-31 12:00:53.084+05 | 1 year >> 1001 | 3026-01-31 12:00:53.084+05 | 1001 years >> 2001 | 4026-01-31 12:00:53.084+05 | 2001 years >> 3001 | 5026-01-31 12:00:53.084+05 | 3001 years >> 4001 | 6026-01-31 12:00:53.084+05 | 4001 years >> 5001 | 7026-01-31 12:00:53.085+05 | 5001 years >> 6001 | 8026-01-31 12:00:53.085+05 | 6001 years >> 7001 | 9026-01-31 12:00:53.085+05 | 7001 years >> 8001 | 10026-01-31 12:00:53.085+05 | 8001 years >> (9 rows) or maybe something simple like with u as (select uuidv7() id) select uuid_extract_timestamp(uuidv7('-09-09 12:34:56.789+05' - uuid_extract_timestamp(u.id))) from u; But it would still be flaky, second call to uuidv7() can overflow a millisecond. Thanks! Best regards, Andrey Borodin. v2-0001-UUDv7-fix-offset-computations-in-dates-after-2262.patch Description: Binary data
Doc fix of aggressive vacuum threshold for multixact members storage
Hi, This patch suggests a correction to the doc page dealing with multixact vacuuming, which, starting with PG 14, says that the multixact members storage threshold for aggressive vacuum is 2 GB. However, I believe the threshold is actually about 10 GB. MultiXactMemberFreezeThreshold() defines the threshold as 2^32 (0xFFFFFFFF) / 2, or 2^31 multixact members. However, as discussed in multixact.c, multixact members are stored in groups of 4, each group taking up 20 bytes, meaning 5 bytes per member. (This is not quite exact as 12 bytes per 8 KB page are wasted, but I believe it is close enough for the docs.) This makes the threshold in bytes 2^31 multixact members * 5 bytes per member = 10 GiB. It was also confirmed by observing a live system (with an admittedly unfortunate workload pattern). Also, the maximum storage size for multixact members is 20 GiB (2^32 * 5), and it would be useful to call this out in the doc as well. For reference, the original commit which introduced the current wording is c552e17, and the discussion was here: https://www.postgresql.org/message-id/flat/162395467510.686.11947486273299446208%40wrigleys.postgresql.org The attached patch is against master, but it should probably be backpatched all the way through 14. Best regards, Alex Friedman v1-0001-Doc-fix-of-aggressive-vacuum-threshold-for-multix.patch Description: Binary data
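The arithmetic behind the 10 GiB and 20 GiB figures can be sanity-checked with a quick query (illustrative):

SELECT pg_size_pretty((2147483648 * 5)::bigint) AS aggressive_threshold,  -- 2^31 members * 5 bytes
       pg_size_pretty((4294967296 * 5)::bigint) AS max_members_storage;   -- 2^32 members * 5 bytes
-- aggressive_threshold | max_members_storage
-- ---------------------+---------------------
-- 10 GB                | 20 GB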
Re: Cross-type index comparison support in contrib/btree_gin
I wrote: > We've had multiple requests for $SUBJECT over the years > ([1][2][3][4][5], and I'm sure my archive search missed some). > I finally decided to look into what it'd take to make that happen. I forgot to mention a couple of questions for review: > ... it turns out that the > comparePartial() method is only ever applied to compare the original > query value with an index entry, which means that internally to > comparePartial() we can apply the proper cross-type comparison > operator. Our GIN index documentation about comparePartial() doesn't > quite say that in so many words, but btree_gin was already relying on > it --- in a very confusing and ill-explained way, if you ask me, but > it was relying on it. Should we adjust the documentation of comparePartial() to promise explicitly that partial_key is the same datum returned by extractQuery? By my reading, it kind of implies that, but it's not quite black and white. > So basically all I had to do was write a bunch of non-error-throwing > conversion routines and set up some boilerplate infrastructure. In the 0005 patch, I relied on date2timestamp_opt_overflow and its siblings where available. But some of the conversions such as timestamptz-to-timestamp don't have one of those, so I was forced to copy-and-paste some fairly low-level code. Would it make sense to refactor the related core routines to expose xxx2yyy_opt_overflow interfaces, extending what 5bc450629 and 52ad1e659 did? I wouldn't think this worth doing just for btree_gin's benefit, but if btree_gin needs it maybe some other extensions could use it too. regards, tom lane
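For context, the sort of query this work is meant to help looks roughly like this (sketch; the exact type pairs supported depend on the patch set):

CREATE EXTENSION btree_gin;
CREATE TABLE events (d date);
CREATE INDEX events_d_gin ON events USING gin (d);
-- with cross-type support, comparing the indexed date column against a
-- timestamptz value can still use the GIN index:
SELECT count(*) FROM events WHERE d >= now() - interval '7 days';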