Re: Code checks for App Devs, using new options for transaction behavior
On 27-10-2022 at 18:35, Simon Riggs wrote:
> On Thu, 27 Oct 2022 at 12:09, Simon Riggs wrote:
> > Comments please
>
> Update from patch tester results.

> [001_psql_parse_only.v1.patch]
> [002_nested_xacts.v7.patch]
> [003_rollback_on_commit.v1.patch]
> [004_add_params_to_sample.v1.patch]

Patch 002 has (2x): 'transction', which should be 'transaction'.

Also in patch 002: 'at any level will be abort' should be 'at any level will abort'.

I also dislike the 'we' in 'Once we reach the top-level transaction,'. That seems a bit too much like the 'we developers working together to make a database server system', which is of course used often and usefully on this mailing list and in the code itself. But I think user-facing docs should be careful with that team-building 'we'. I remember well how it confused me, many years ago.

Better, IMHO: 'Once the top-level transaction is reached,'

Thanks,
Erik Rijkers
psql: Add command to use extended query protocol
This adds a new psql command \gp that works like \g (or semicolon) but uses the extended query protocol. Parameters can also be passed, like

    SELECT $1, $2 \gp 'foo' 'bar'

I have two main purposes for this:

One, for transparent column encryption [0], we need a way to pass protocol-level parameters. The present patch in the [0] thread uses a command \gencr, but based on feedback and further thinking, a general-purpose command seems better.

Two, for testing the extended query protocol from psql. For example, for the dynamic result sets patch [1], I have several ad-hoc libpq test programs lying around, which would be cumbersome to integrate into the patch. With psql support like proposed here, it would be very easy to integrate a few equivalent tests.

Perhaps this would also be useful for general psql scripting.

[0]: https://commitfest.postgresql.org/40/3718/
[1]: https://commitfest.postgresql.org/40/2911/

From 3f4bf4a68c2edd57c7bf4c4935bad50ea0f528b7 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Fri, 28 Oct 2022 08:29:46 +0200
Subject: [PATCH] psql: Add command to use extended query protocol

This adds a new psql command \gp that works like \g (or semicolon) but
uses the extended query protocol. Parameters can also be passed, like

    SELECT $1, $2 \gp 'foo' 'bar'

This may be useful for psql scripting, but one of the main purposes is
also to be able to test various aspects of the extended query protocol
from psql and to write tests more easily.
---
 doc/src/sgml/ref/psql-ref.sgml     | 27 +
 src/bin/psql/command.c             | 39 ++
 src/bin/psql/common.c              | 15 +++-
 src/bin/psql/help.c                |  1 +
 src/bin/psql/settings.h            |  3 +++
 src/bin/psql/tab-complete.c        |  2 +-
 src/test/regress/expected/psql.out | 31
 src/test/regress/sql/psql.sql      | 14 +++
 8 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 9494f28063ad..51b33fd3b80c 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2323,6 +2323,33 @@ Meta-Commands
+
+        \gp [ parameter ] ...
+
+
+
+        Sends the current query buffer to the server for execution, as with
+        \g, with the specified parameters passed for any
+        parameter placeholders ($1 etc.).
+
+
+
+        Example:
+
+INSERT INTO tbl1 VALUES ($1, $2) \gp 'first value' 'second value'
+
+
+
+
+        This command uses the extended query protocol (see ), unlike \g,
+        which uses the simple query protocol.  So this command can be useful
+        to test the extended query protocol from psql.
+
+
+
+
+
  \gset [ prefix ]

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index ab613dd49e0a..0e760eda1f3e 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -101,6 +101,7 @@ static backslashResult exec_command_gdesc(PsqlScanState scan_state, bool active_
 static backslashResult exec_command_getenv(PsqlScanState scan_state, bool active_branch,
                                            const char *cmd);
 static backslashResult exec_command_gexec(PsqlScanState scan_state, bool active_branch);
+static backslashResult exec_command_gp(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_gset(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_help(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_html(PsqlScanState scan_state, bool active_branch);
@@ -354,6 +355,8 @@ exec_command(const char *cmd,
         status = exec_command_getenv(scan_state, active_branch, cmd);
     else if (strcmp(cmd, "gexec") == 0)
         status = exec_command_gexec(scan_state, active_branch);
+    else if (strcmp(cmd, "gp") == 0)
+        status = exec_command_gp(scan_state, active_branch);
     else if (strcmp(cmd, "gset") == 0)
         status = exec_command_gset(scan_state, active_branch);
     else if (strcmp(cmd, "h") == 0 || strcmp(cmd, "help") == 0)
@@ -1546,6 +1549,42 @@ exec_command_gexec(PsqlScanState scan_state, bool active_branch)
     return status;
 }

+/*
+ * \gp -- send the current query with parameters
+ */
+static backslashResult
+exec_command_gp(PsqlScanState scan_state, bool active_branch)
+{
+    backslashResult status = PSQL_CMD_SKIP_LINE;
+
+    if (active_branch)
+    {
+        char       *opt;
+        int         nparams = 0;
+        int         nalloc = 0;
+
+        pset.gp_params = NULL;
+
+        while ((opt = psql_scan_slash_option(scan_state, OT_NORMAL, NULL, false)))
+        {
+
Re: Support logical replication of DDLs
Here are some review comments for patch v32-0001.

This is a WIP - I have not yet looked at the largest file of this patch (src/backend/commands/ddl_deparse.c)

==
Commit Message

1. The list of the supported statements should be in alphabetical order to make it easier to read

~~~

2. The "Notes" are obviously notes, so the text does not need to say "Note that..." etc again

"(Note #1) Note that some..." -> "(Note #1) Some..."
"(Note #2) Note that, for..." -> "(Note #2) For..."
"(Note #4) Note that, for..." -> "(Note #4) For..."

~~~

3. For "Note #3", use uppercase for the SQL keywords in the example.

~~~

4. For "Note #4": "We created" -> "we created"

==
src/backend/catalog/aclchk.c

5. ExecuteGrantStmt

@@ -385,7 +385,11 @@ ExecuteGrantStmt(GrantStmt *stmt)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("grantor must be current user")));
+
+		istmt.grantor_uid = grantor;
 	}
+	else
+		istmt.grantor_uid = InvalidOid;

This code can be simpler by just declaring the 'grantor' variable at function scope, then assigning the istmt.grantor_uid along with the other grantor assignments.

SUGGESTION
Oid grantor = InvalidOid;
...
istmt.grantor_uid = grantor;
istmt.is_grant = stmt->is_grant;
istmt.objtype = stmt->objtype;

==
src/backend/commands/collationcmds.c

6. DefineCollation

+	/* make from existing collationid available to callers */
+	if (from_collid && OidIsValid(collid))
+		ObjectAddressSet(*from_collid,
+						 CollationRelationId,
+						 collid);

6a. Maybe the param can be made 'from_existing_colid', then the above code comment can be made more readable?

~

6b. Seems some unnecessary wrapping here

==
src/backend/commands/ddl_deparse.c

WIP - I will try to post some review comments on this file next week

==
src/backend/commands/ddl_json.c

7. convSpecifier

typedef enum
{
	SpecTypename,
	SpecOperatorname,
	SpecDottedName,
	SpecString,
	SpecNumber,
	SpecStringLiteral,
	SpecIdentifier,
	SpecRole
} convSpecifier;

Inconsistent case. Some of these say "name" and some say "Name"

~~~

8. Forward declarations

char *ddl_deparse_json_to_string(char *jsonb);

Is this needed here? I thought this was already declared extern in ddl_deparse.h.

~~~

9. find_string_in_jsonbcontainer

The function comment says "If it's of a type other than jbvString, an error is raised.", but I do not see this check in the function code.

~~~

10. expand_fmt_recursive

/*
 * Recursive helper for pg_event_trigger_expand_command
 *
 * Find the "fmt" element in the given container, and expand it into the
 * provided StringInfo.
 */

10a. I am not sure if the mention of "pg_event_trigger_expand_command" is stale or is not relevant anymore, because that caller is not in this module.

~

10b. The first sentence is missing a period.

~~~

11.

value = findJsonbValueFromContainer(container, JB_FOBJECT, &key);

Should this be checking if value is NULL?

~~~

12. expand_jsonval_dottedname

 * Expand a json value as a dot-separated-name. The value must be of type
 * object and may contain elements "schemaname" (optional), "objname"
 * (mandatory), "attrname" (optional). Double quotes are added to each element
 * as necessary, and dot separators where needed.

The comment says "The value must be of type object" but I don't see any check/assert for that in the code.

~~~

13. expand_jsonval_typename

In other code (e.g. expand_jsonval_dottedname) there are lots of pfree(str), so why not similar here? e.g. Shouldn't the end of the function have something like shown below:

pfree(schema);
pfree(typename);
pfree(typmodstr);

~~~

14. expand_jsonval_operator

The function comment is missing a period.

~~~

15. expand_jsonval_string

/*
 * Expand a JSON value as a string. The value must be of type string or of
 * type object. In the latter case, it must contain a "fmt" element which will
 * be recursively expanded; also, if the object contains an element "present"
 * and it is set to false, the expansion is the empty string.

15a. Although the comment says "The value must be of type string or of type object", the code is checking for jbvString and jbvBinary (??)

~

15b.

else
	return false;

Is it OK to just return false, or should this in fact be throwing an error if the wrong type?

~~~

16. expand_jsonval_strlit

/* Easy case: if there are no ' and no \, just use a single quote */
if (strchr(str, '\'') == NULL &&
	strchr(str, '\\') == NULL)

That could be simplified as:

if (strpbrk(str, "'\\") == NULL)

~~~

17. expand_jsonval_number

strdatum = DatumGetCString(DirectFunctionCall1(numeric_out,
				NumericGetDatum(jsonval->val.numeric)));
appendStringInfoString(buf, strdatum);

Shouldn't this function do pfree(strdatum) at the end?

~~~

18. expand_jsonval_role

/*
 * Expand a JSON value as a role name. If the is_public element is set to
 * true, PUBLIC is expanded (no quotes); otherwise, expand the given role name,
 * quoting as an identifie
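Review comment 16 above suggests replacing two strchr() scans with a single strpbrk() call. As a quick sanity check of that equivalence, here is a sketch in Python rather than C (the helper names are made up for illustration); the two predicates agree on every input:

```python
# Characters that force the harder quoting path in expand_jsonval_strlit().
QUOTE_OR_BACKSLASH = "'\\"

def has_special_strchr_style(s: str) -> bool:
    # Mirrors: strchr(str, '\'') != NULL || strchr(str, '\\') != NULL
    return "'" in s or "\\" in s

def has_special_strpbrk_style(s: str) -> bool:
    # Mirrors: strpbrk(str, "'\\") != NULL -- one scan for any char of the set
    return any(c in QUOTE_OR_BACKSLASH for c in s)
```

Both stop at the first match, so the strpbrk() form saves a second pass over strings that contain neither character.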
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Oct 28, 2022 at 7:39 AM Michael Paquier wrote:
>
> On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > The block sizes don't need to match, do they? As long as the block is properly
> > aligned, we can change the iov_len of the final iov to match whatever the size
> > is being passed in, no?
>
> Hmm.  Based on what Bharath has written upthread, it does not seem to
> matter if the size of the aligned block changes, either:
> https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com
>
> I am honestly not sure whether it is a good idea to make file_utils.c
> depend on one of the compile-time page sizes in this routine, be it
> the page size or the WAL page size, as pg_write_zeros() would be used
> for some rather low-level operations.  But we could as well just use a
> locally-defined structure with a buffer at 4kB or 8kB and call it a
> day?

+1. Please see the attached v8 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 88d67c6e3e92a62ac4c783a137759b2f3795cb21 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Fri, 28 Oct 2022 04:44:41 +
Subject: [PATCH v8] Use pg_pwritev_with_retry() instead of write() in
 walmethods.c

Use pg_pwritev_with_retry() while prepadding a WAL segment instead
of write() in walmethods.c dir_open_for_write() to avoid partial
writes. As the pg_pwritev_with_retry() function uses pwritev, we can
avoid explicit lseek() on non-Windows platforms to seek to the
beginning of the WAL segment. It looks like on Windows, we need an
explicit lseek() call here despite using the pwrite() implementation
from win32pwrite.c. Otherwise an error occurs.

These changes are in line with how core postgres initializes the WAL
segment in XLogFileInitInternal().
Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Michael Paquier
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c  | 33 ++-
 src/bin/pg_basebackup/walmethods.c | 26 +
 src/common/file_utils.c            | 92 ++
 src/include/common/file_utils.h    |  3 +
 4 files changed, 114 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..15dd5ce834 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				bool *added, char *path)
 {
 	char		tmppath[MAXPGPATH];
-	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
@@ -2965,14 +2964,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));

-	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
 	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
+		ssize_t		rc;

 		/*
 		 * Zero-fill the file. With this setting, we do this the hard way to
@@ -2983,29 +2979,10 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks are down on disk. Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
+		rc = pg_pwrite_zeros(fd, wal_segment_size);

-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
-		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int			iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
-			{
-				save_errno = errno;
-				break;
-			}
-
-			i += iovcnt;
-		}
+		if (rc < 0)
+			save_errno = errno;
 	}
 	else
 	{
@@ -3014,7 +2991,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * enough.
 		 */
 		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+		if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1)
 		{
 			/* if write didn't set errno, assume no disk space */
 			save_errno = errno ? errno : ENOSPC;

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index bc2e83d02b..18bad52c96 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -220,22 +220,24 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	/* Do pre-padding on non-compressed files */
 	if (pad_to_size && wwmethod->compression_algorithm == PG_COMPRESSION_NONE)
 	{
-		PGAlignedXLogBlock zerobuf;
-		int			bytes;
+		ssize_t		rc;
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Fri, Oct 28, 2022 at 10:43 AM Tom Lane wrote:
>
> David Rowley writes:
> > On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
> >> If we
> >> are too sure that the output usually comes in the same order then the
> >> ORDER BY clause that exists in other tests seems useless. I am a bit
> >> confused & what could be a possible bug?
>
> > You can't claim that if this test shouldn't get an ORDER BY that all
> > tests shouldn't have an ORDER BY. That's just crazy. What if the test
> > is doing something like testing sort?!
>
> The general policy is that we'll add ORDER BY when a test is demonstrated
> to have unstable output order for identifiable environmental reasons
> (e.g. locale dependency) or timing reasons (e.g. background autovacuum
> sometimes changing statistics).  But the key word there is "identifiable".
> Without some evidence as to what's causing this, it remains possible
> that it's a code bug not the fault of the test case.
>
> regress.sgml explains the policy further:
>
>     You might wonder why we don't order all the regression test queries
>     explicitly to get rid of this issue once and for all.  The reason is
>     that that would make the regression tests less useful, not more, since
>     they'd tend to exercise query plan types that produce ordered results
>     to the exclusion of those that don't.

Understood. Thanks for the clarification.

Regards,
Amul
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Fri, Oct 28, 2022 at 10:28 AM David Rowley wrote:
>
> On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
> >
> > On Thu, Oct 27, 2022 at 6:54 PM Tom Lane wrote:
> > > Please be specific about the circumstances in which the output is
> > > unstable for you.  With zero information to go on, it seems about as
> > > likely that this change is masking a bug as that it's a good idea.
> >
> > At the first glance, I thought the patch is pretty much obvious, and
> > we usually add an ORDER BY clause to ensure stable output.
>
> Unfortunately, you'll need to do better than that. We're not in the
> business of accepting patches with zero justification for why they're
> required. If you're not willing to do the analysis on why the order
> changes sometimes, why should we accept your patch?

Unfortunately, the test is not failing for me; otherwise, I would have done that analysis. When I saw the patch for the first time, I didn't suspect anything spurious, due to my misconception that we usually add the ORDER BY clause to select queries just to be sure.

> If you can't find the problem then you should modify insert.sql to
> EXPLAIN the problem query to see if the plan has changed between the
> passing and failing run. The only thing that comes to mind about why
> this test might produce rows in a different order would be if a
> parallel Append was sorting the subpaths by cost (See
> create_append_path's call to list_sort) and the costs were for some
> reason coming out differently sometimes. It's hard to imagine why this
> query would be parallelised though. If you show us the EXPLAIN from a
> passing and failing run, it might help us see the problem.

Understood.

> > If we
> > are too sure that the output usually comes in the same order then the
> > ORDER BY clause that exists in other tests seems useless. I am a bit
> > confused & what could be a possible bug?
>
> You can't claim that if this test shouldn't get an ORDER BY that all
> tests shouldn't have an ORDER BY. That's just crazy. What if the test
> is doing something like testing sort?!

I understand that sorted output doesn't need further sorting. I was just referring to the simple SELECT queries that do not have any sorting.

Thanks & Regards,
Amul
Re: Adding doubly linked list type which stores the number of items in the list
Thank you for having a look at this.

On Thu, 27 Oct 2022 at 19:32, Bharath Rupireddy wrote:
> Some comments on the patch:
> 1. I think it's better to just return dlist_is_empty(&head->dlist) &&
> (head->count == 0); from dclist_is_empty() and remove the assert for
> better readability and safety against count being zero.

I don't think that's a good change. For 1) it adds unnecessary overhead due to the redundant checks, and 2) it removes the Assert which is our early warning that the dclist's count is getting out of sync somewhere.

> 2. Missing dlist_is_memberof() in dclist_delete_from()?

I put that in dlist_delete_from(), which is called from dclist_delete_from().

> 3. Just thinking if we need to move dlist_is_memberof() check from
> dclist_* functions to dlist_* functions, because they also need such
> insurance against callers passing spurious nodes.

I think the affected functions there would be: dlist_move_head(), dlist_move_tail(), dlist_has_next(), dlist_has_prev(), dlist_next_node() and dlist_prev_node(). I believe if we did that then it's effectively an API change. The comments only claim that it's undefined if node is not a member of the list. It does not say 'node' *must* be part of the list. Now, perhaps doing this would just make it more likely that we'd find bugs in our code and extension authors would find bugs in their code, but it does move the bar. dlist_move_head and dlist_move_tail look like they'd work perfectly well to remove an item from one list and put it on the head or tail of some completely different list. Should we really be changing that in a patch that is meant to just add the dclist type?

> 4. More opportunities to use dclist_* in below places, no?
> dlist_push_tail(&src->mappings, &pmap->node);
> src->num_mappings++;
>
> dlist_push_head(&MXactCache, &entry->node);
> if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)

Thanks for finding those. I've adjusted them both to use dclists.

> 5. dlist_is_memberof() - do we need this at all? We trust the callers
> of dlist_* today that the passed in node belongs to the list, no?

Hmm, this seems to contradict your #3? If you look at something like dlist_move_head(), if someone calls that and passes a 'node' that does not belong to 'head', then the result is that we delete 'node' from whichever dlist it's on and push it onto 'head'. Nothing bad happens there. If we do the same on a dclist then the count gets out of sync. That's bad, as it could lead to assert failures and bugs.

> 6. If we decide to have dlist_is_memberof() and the asserts around it,
> can we design it on the similar lines as dlist_check() to avoid many
> #ifdef ILIST_DEBUG #endif blocks spreading around the code?

OK, that likely is a better idea. I've done this in the attached by way of dlist_member_check().

> 7. Do we need Assert(head->count > 0); in more places like
> dclist_delete_from()?

I guess it does no harm. I've added some additional ones in the attached.

> 8. Don't we need dlist_container(), dlist_head_element(),
> dlist_tail_element() for dclist_*? Even though we might not use them
> immediately, just for the sake of completeness of the dclist data
> structure.

OK. I think I'd left those out because dclist_container() would just be the same as dlist_container(), but that's not the case for the other two, so I've added all 3.

One additional change is that I also ended up removing the use of dclist that I had in the previous patch for ReorderBufferTXN.subtxns. Looking more closely at the code in ReorderBufferAssignChild():

	/*
	 * We already saw this transaction, but initially added it to the
	 * list of top-level txns. Now that we know it's not top-level,
	 * remove it from there.
	 */
	dlist_delete(&subtxn->node);

The problem is that since ReorderBufferTXN is used for both transactions and sub-transactions, it's not easy to determine if the ReorderBufferTXN.node is part of the ReorderBuffer.toplevel_by_lsn dlist or the ReorderBufferTXN.subtxns.
It seems safer just to leave this one alone.

David

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index b01b39b008..a34e9b352d 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -196,8 +196,7 @@ typedef struct RewriteMappingFile
 	TransactionId xid;			/* xid that might need to see the row */
 	int			vfd;			/* fd of mappings file */
 	off_t		off;			/* how far have we written yet */
-	uint32		num_mappings;	/* number of in-memory mappings */
-	dlist_head	mappings;		/* list of in-memory mappings */
+	dclist_head mappings;		/* list of in-memory mappings */
 	char		path[MAXPGPATH];	/* path, for error messages */
 } RewriteMappingFile;
@@ -864,9 +863,10 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		Oid			dboid;
 		uint32		len;
 		int			written;
+		uint32		num_mappings = dclist_count(&src->mappings);

 		/* this file hasn't got any new mappings */
-		if (src->num_mappings == 0)
+		if (num
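For readers unfamiliar with the proposed type, the invariant the Asserts protect (a plain dlist plus a count that must stay in sync, with a membership check on delete) can be sketched in Python; the names loosely mirror the dclist_* API and are illustrative only:

```python
class DCList:
    """A circular doubly linked list with a sentinel head that also tracks
    its element count, mirroring the dclist idea: dlist semantics plus a
    count kept in sync by every mutating operation."""

    class Node:
        __slots__ = ("prev", "next", "owner", "value")
        def __init__(self, value=None):
            self.prev = self.next = None
            self.owner = None          # debug aid, like dlist_member_check()
            self.value = value

    def __init__(self):
        self.head = DCList.Node()      # sentinel node
        self.head.prev = self.head.next = self.head
        self.count = 0                 # what dclist adds over a plain dlist

    def is_empty(self) -> bool:
        # Mirrors dclist_is_empty(): the assert is the early warning that
        # the count has drifted out of sync with the links.
        empty = self.head.next is self.head
        assert empty == (self.count == 0)
        return empty

    def push_tail(self, node: "DCList.Node") -> None:
        node.prev, node.next = self.head.prev, self.head
        self.head.prev.next = node
        self.head.prev = node
        node.owner = self
        self.count += 1

    def delete(self, node: "DCList.Node") -> None:
        assert node.owner is self      # dlist_member_check() analogue
        assert self.count > 0
        node.prev.next = node.next
        node.next.prev = node.prev
        node.owner = None
        self.count -= 1
```

Deleting a node that belongs to a different list trips the membership assert here, which is exactly the failure mode discussed for dclist: on a plain dlist nothing visibly breaks, but on a counted list the counts of both lists silently go wrong.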
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
David Rowley writes:
> On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
>> If we
>> are too sure that the output usually comes in the same order then the
>> ORDER BY clause that exists in other tests seems useless. I am a bit
>> confused & what could be a possible bug?

> You can't claim that if this test shouldn't get an ORDER BY that all
> tests shouldn't have an ORDER BY. That's just crazy. What if the test
> is doing something like testing sort?!

The general policy is that we'll add ORDER BY when a test is demonstrated
to have unstable output order for identifiable environmental reasons
(e.g. locale dependency) or timing reasons (e.g. background autovacuum
sometimes changing statistics).  But the key word there is "identifiable".
Without some evidence as to what's causing this, it remains possible
that it's a code bug not the fault of the test case.

regress.sgml explains the policy further:

    You might wonder why we don't order all the regression test queries
    explicitly to get rid of this issue once and for all.  The reason is
    that that would make the regression tests less useful, not more, since
    they'd tend to exercise query plan types that produce ordered results
    to the exclusion of those that don't.

			regards, tom lane
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
>
> On Thu, Oct 27, 2022 at 6:54 PM Tom Lane wrote:
> > Please be specific about the circumstances in which the output is
> > unstable for you.  With zero information to go on, it seems about as
> > likely that this change is masking a bug as that it's a good idea.
>
> At the first glance, I thought the patch is pretty much obvious, and
> we usually add an ORDER BY clause to ensure stable output.

Unfortunately, you'll need to do better than that. We're not in the
business of accepting patches with zero justification for why they're
required. If you're not willing to do the analysis on why the order
changes sometimes, why should we accept your patch?

If you can't find the problem then you should modify insert.sql to
EXPLAIN the problem query to see if the plan has changed between the
passing and failing run. The only thing that comes to mind about why
this test might produce rows in a different order would be if a
parallel Append was sorting the subpaths by cost (see
create_append_path's call to list_sort) and the costs were for some
reason coming out differently sometimes. It's hard to imagine why this
query would be parallelised though. If you show us the EXPLAIN from a
passing and failing run, it might help us see the problem.

> If we
> are too sure that the output usually comes in the same order then the
> ORDER BY clause that exists in other tests seems useless. I am a bit
> confused & what could be a possible bug?

You can't claim that if this test shouldn't get an ORDER BY that all
tests shouldn't have an ORDER BY. That's just crazy. What if the test
is doing something like testing sort?!

David
Latches vs lwlock contention
Hi,

We usually want to release lwlocks, and definitely spinlocks, before calling SetLatch(), to avoid putting a system call into the locked region so that we minimise the time held. There are a few places where we don't do that, possibly because it's not just a simple latch to hold a pointer to but rather a set of them that needs to be collected from some data structure, and we don't have infrastructure to help with that. There are also cases where we semi-reliably create lock contention, because the backends that wake up immediately try to acquire the very same lock.

One example is heavyweight lock wakeups. If you run BEGIN; LOCK TABLE t; ... and then N other sessions wait in SELECT * FROM t;, and then you run ... COMMIT;, you'll see the first session wake all the others while it still holds the partition lock itself. They'll all wake up and begin to re-acquire the same partition lock in exclusive mode, immediately go back to sleep on *that* wait list, and then wake each other up one at a time in a chain. We could avoid the first double-bounce by not setting the latches until after we've released the partition lock. We could avoid the rest of them by not re-acquiring the partition lock at all, which ... if I'm reading right ... shouldn't actually be necessary in modern PostgreSQL? Or if there is another reason to re-acquire then maybe the comment should be updated. Presumably no one really does that repeatedly while there is a long queue of non-conflicting waiters, so I'm not claiming it's a major improvement, but it's at least a micro-optimisation.

There are some other simpler mechanical changes including synchronous replication, SERIALIZABLE DEFERRABLE and condition variables (this one inspired by Yura Sokolov's patches[1]). Actually I'm not at all sure about the CV implementation; I feel like a more ambitious change is needed to make our CVs perform. See attached sketch patches.
I guess the main thing that may not be good enough is the use of a fixed sized latch buffer. Memory allocation in don't-throw-here environments like the guts of lock code might be an issue, which is why it just gives up and flushes when full; maybe it should try to allocate and fall back to flushing only if that fails. These sketch patches aren't proposals, just observations in need of more study.

[1] https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

From b64d5782e2c3a2e34274a3bf9df4449afaee94dc Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 26 Oct 2022 15:51:45 +1300
Subject: [PATCH 1/8] Provide SetLatches() for batched deferred latches.

If we have a way to buffer a set of wakeup targets and process them at a
later time, we can:

* move SetLatch() system calls out from under LWLocks, so that locks can
  be released faster; this is especially interesting in cases where the
  target backends will immediately try to acquire the same lock, or
  generally when the lock is heavily contended

* possibly gain some micro-optimization from issuing only two memory
  barriers for the whole batch of latches, not two for each latch to be
  set

* provide the opportunity for potential future latch implementation
  mechanisms to deliver wakeups in a single system call

Individual users of this facility will follow in separate patches.
---
 src/backend/storage/ipc/latch.c  | 187 ++-
 src/include/storage/latch.h      |  13 +++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 123 insertions(+), 78 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index eb3a569aae..71fdc388c8 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -576,105 +576,136 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 }

 /*
- * Sets a latch and wakes up anyone waiting on it.
- *
- * This is cheap if the latch is already set, otherwise not so much.
- *
- * NB: when calling this in a signal handler, be sure to save and restore
- * errno around it.  (That's standard practice in most signal handlers, of
- * course, but we used to omit it in handlers that only set a flag.)
- *
- * NB: this function is called from critical sections and signal handlers so
- * throwing an error is not a good idea.
+ * Set multiple latches at the same time.
+ * Note: modifies input array.
  */
-void
-SetLatch(Latch *latch)
+static void
+SetLatchV(Latch **latches, int nlatches)
 {
-#ifndef WIN32
-	pid_t		owner_pid;
-#else
-	HANDLE		handle;
-#endif
-
-	/*
-	 * The memory barrier has to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
+	/* Flush any other changes out to main memory just once. */
 	pg_memory_barrier();

-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
+	/* Keep only latches that are not already set, and set them. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch
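The batching idea in the sketch patch (record wakeup targets while the lock is held, issue the actual wakeups afterwards, and fall back to flushing early only if the fixed-size buffer fills up) can be illustrated with Python threading events standing in for latches; this sketches the control flow only, not the real latch machinery:

```python
import threading

MAX_BATCHED = 4    # fixed-size buffer; the patch flushes when it fills up

class LatchBatch:
    """Collect wakeup targets now, deliver the wakeups later, so the
    'syscall' (Event.set here) can happen outside the locked region."""
    def __init__(self):
        self.pending = []

    def add(self, event: threading.Event) -> None:
        if event.is_set():
            return                     # quick exit if already set
        self.pending.append(event)
        if len(self.pending) >= MAX_BATCHED:
            self.flush()               # buffer full: give up and flush now

    def flush(self) -> None:
        for ev in self.pending:
            ev.set()                   # the deferred wakeup
        self.pending.clear()

# Usage: only record targets inside the critical section; the final
# flush (the bulk of the wakeups) happens after the lock is released.
lock = threading.Lock()
waiters = [threading.Event() for _ in range(6)]
batch = LatchBatch()
with lock:
    for ev in waiters:
        batch.add(ev)
batch.flush()
```

With six waiters and a buffer of four, one flush happens while the lock is still held (the full-buffer fallback the mail worries about) and the remaining two wakeups are deferred until after release, which is the case the patch is optimizing for.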
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Thu, Oct 27, 2022 at 6:54 PM Tom Lane wrote:
>
> Nishant Sharma writes:
> > We would like to share a proposal of a patch, where we have added order by
> > clause in two select statements in src/test/regress/sql/insert.sql file and
> > respective changes in src/test/regress/expected/insert.out output file.
>
> > This would help in generating output in consistent sequence, as sometimes
> > we have observed change in sequence in output.
>
> Please be specific about the circumstances in which the output is
> unstable for you.  With zero information to go on, it seems about as
> likely that this change is masking a bug as that it's a good idea.

At the first glance, I thought the patch is pretty much obvious, and we usually add an ORDER BY clause to ensure stable output. If we are too sure that the output usually comes in the same order then the ORDER BY clause that exists in other tests seems useless. I am a bit confused & what could be a possible bug?

I have tested on my CentOS and macOS machines; the insert.sql test is giving stable output, and I didn't find failures in subsequent runs either, but I am not sure if that is enough evidence to skip the ORDER BY clause.

Regards,
Amul
Re: Allow file inclusion in pg_hba and pg_ident files
On Fri, Oct 28, 2022 at 10:24:23AM +0900, Michael Paquier wrote:
> On Thu, Oct 27, 2022 at 12:26:25PM +0800, Julien Rouhaud wrote:
>
> I am still not completely sure what's the best way to do things here,
> so let's do the following: let's keep the patch the way you think is
> better for now (I may change my opinion on that but I'll hack on that by
> myself anyway). Using what you have as a base, could you split the
> test and reduce it to its simplest form to ease its review? It would
> let us stress the buildfarm with a first version of the test and see how
> it goes from there, and it is useful by itself IMO as HEAD has zero
> coverage for this area.

To be honest, I'd rather not. It's excessively annoying to work on those tests (I spent multiple days trying to make them as clean and readable as possible), and splitting them to only test the current infrastructure will take substantial effort. But more importantly, the next commit that adds tests for file inclusion would then be totally unmaintainable and unreadable, which IMO is even worse. If changes are requested on the simplified test, the result will probably be either the current file being overwritten or a new one written from scratch, and I'm not volunteering to do that.
Re: pg_recvlogical prints bogus error when interrupted
On Fri, Oct 28, 2022 at 4:41 AM Andres Freund wrote:
>
> On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:
> >
> > +	/* When we get SIGINT/SIGTERM, we exit */
> > +	if (ready_to_exit)
> > +	{
> > +		/*
> > +		 * Try informing the server about our exit, but don't wait around
> > +		 * or retry on failure.
> > +		 */
> > +		(void) PQputCopyEnd(conn, NULL);
> > +		(void) PQflush(conn);
> > +		time_to_abort = ready_to_exit;
>
> This doesn't strike me as great - because ready_to_exit isn't checked in
> the loop around StreamLogicalLog(), we'll reconnect if something else causes
> StreamLogicalLog() to return.

Fixed.

> Why do we need both time_to_abort and ready_to_exit?

The intention of ready_to_exit is to be able to distinguish between SIGINT/SIGTERM and aborting when endpos is reached, so that the necessary code is skipped or executed and the proper logs are printed.

> Perhaps worth noting that time_to_abort is still a sig_atomic_t, but isn't
> modified in a signal handler, which seems a bit unnecessarily confusing.

time_to_abort is just a static variable, no?

+static bool		time_to_abort = false;
+static volatile sig_atomic_t ready_to_exit = false;

Please see the attached v3 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v3-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patch
Description: Binary data
Re: GUC values - recommended way to declare the C variables?
On Fri, Oct 28, 2022 at 11:48:13AM +0900, Michael Paquier wrote: > Actually, pg_iovec.h uses WIN32 without any previous header declared, > but win32.h tells a different story as of ed9b3606, where we would > define WIN32 if it does not exist yet. Seeing all the places where pg_status.h is included, that should be fine, so please just ignore this part. -- Michael signature.asc Description: PGP signature
Re: GUC values - recommended way to declare the C variables?
On Thu, Oct 27, 2022 at 07:00:26PM +1100, Peter Smith wrote:
> The GUC defaults of guc_tables.c, and the modified GUC C var
> declarations now share the same common #define'd value (instead of
> cut/paste preprocessor code).

Thanks. I have not looked at the checkup logic yet, but the central declarations seem rather sane, and I have a few comments about the latter.

+#ifdef WIN32
+#define DEFAULT_UPDATE_PROCESS_TITLE false
+#else
+#define DEFAULT_UPDATE_PROCESS_TITLE true
+#endif

This is the kind of thing I would document with a comment, say "Disabled on Windows as the performance overhead can be significant". Actually, pg_iovec.h uses WIN32 without any previous header declared, but win32.h tells a different story as of ed9b3606, where we would define WIN32 if it does not exist yet. Could that impact the default depending on the environment used? I am wondering whether the top of win32.h could be removed these days.

+#ifdef USE_PREFETCH
+#define DEFAULT_EFFECTIVE_IO_CONCURRENCY 1
+#define DEFAULT_MAINTENANCE_IO_CONCURRENCY 10
+#else
+#define DEFAULT_EFFECTIVE_IO_CONCURRENCY 0
+#define DEFAULT_MAINTENANCE_IO_CONCURRENCY 0
+#endif

These don't make sense without prefetching available. Perhaps that's obvious enough when reading the code, but I would still add a small note.
--
Michael

signature.asc
Description: PGP signature
Merging LatchWaitSet and FeBeWaitSet
Hi,

Currently all backends have LatchWaitSet (latch.c), and most also have FeBeWaitSet (pqcomm.c). It's not the end of the world, but it's a little bit wasteful in terms of kernel resources to have two epoll/kqueue descriptors per backend. I wonder if we should consider merging them into a single BackendWaitSet.

The reason they exist is that callers of WaitLatch() might be woken by the kernel just because data appears on the FeBe socket. One idea is that we could assume socket readiness events are rare enough at WaitLatch() sites that it's enough to disable them lazily if they are reported. The FeBe code already adjusts as required. For example, if you're waiting for a heavyweight lock or condition variable while executing a query, and pipelined-query or COPY data arrives, you'll spuriously wake up, but only once, and not again until you eventually reach an FeBe read, all queued socket data is drained, and more data arrives.

Sketch patch attached. Just an idea, not putting it into the commitfest yet.

(Besides the wasted kernel resources, I also speculate that things get pretty confusing if you try to switch to completion-based APIs for more efficient socket I/O on various OSes, depending on how you implement latches. I have some handwavy theories about various schemes to achieve that on Linux, Windows and FreeBSD, with various different problems relating to the existence of two kernel objects. Which is a bit more fuel for my early suspicion that postgres_fdw, which currently creates and destroys WES, should eventually also use BackendWaitSet, which should be dynamically resizing. But that's for another time.)

From e3c4bb6245d0329f53d5545f487c996da1948ade Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 27 Oct 2022 16:37:28 +1300
Subject: [PATCH] Merge FeBeWaitSet and LatchWaitSet.

Instead of using two epoll/kqueue kernel objects and descriptors in every backend process, create a new common one called BackendWaitSet.
In backends that are connected to a FeBe socket, WaitLatch() might occasionally report a socket readiness event if data arrives, but that's not expected to happen much and can be lazily suppressed, and reenabled at the next FeBe wait. --- src/backend/libpq/be-secure.c | 10 ++-- src/backend/libpq/pqcomm.c | 21 +++- src/backend/replication/walsender.c | 4 +- src/backend/storage/ipc/latch.c | 74 +++-- src/backend/utils/init/miscinit.c | 12 ++--- src/include/libpq/libpq.h | 6 --- src/include/storage/latch.h | 9 +++- 7 files changed, 77 insertions(+), 59 deletions(-) diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c index e3e54713e8..431abd58ce 100644 --- a/src/backend/libpq/be-secure.c +++ b/src/backend/libpq/be-secure.c @@ -179,9 +179,10 @@ retry: Assert(waitfor); - ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, waitfor, NULL); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetLatchPos, WL_LATCH_SET, MyLatch); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetSocketPos, waitfor, NULL); - WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1, + WaitEventSetWait(BackendWaitSet, -1 /* no timeout */ , &event, 1, WAIT_EVENT_CLIENT_READ); /* @@ -291,9 +292,10 @@ retry: Assert(waitfor); - ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, waitfor, NULL); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetLatchPos, WL_LATCH_SET, MyLatch); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetSocketPos, waitfor, NULL); - WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1, + WaitEventSetWait(BackendWaitSet, -1 /* no timeout */ , &event, 1, WAIT_EVENT_CLIENT_WRITE); /* See comments in secure_read. 
*/ diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index ce56ab1d41..1762e3a6ba 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -161,8 +161,6 @@ static const PQcommMethods PqCommSocketMethods = { const PQcommMethods *PqCommMethods = &PqCommSocketMethods; -WaitEventSet *FeBeWaitSet; - /* * pq_init - initialize libpq at backend startup @@ -172,7 +170,6 @@ void pq_init(void) { int socket_pos PG_USED_FOR_ASSERTS_ONLY; - int latch_pos PG_USED_FOR_ASSERTS_ONLY; /* initialize state variables */ PqSendBufferSize = PQ_SEND_BUFFER_SIZE; @@ -200,20 +197,14 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents); - socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, + socket_pos = AddWaitEventToSet(BackendWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); - latch_pos = AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, PGINVALID_SOCKET, - MyLatch, NULL); - AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, - NULL, NULL); /* * The event positions match the order
Re: Make mesage at end-of-recovery less scary.
rebased >From 67ce65038ae6a7d5b023b7472df9f9ca9835d5f5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 7 Jul 2022 11:51:45 +0900 Subject: [PATCH] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 135 +- src/backend/access/transam/xlogrecovery.c | 95 +++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 + 6 files changed, 299 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 93f667b2544..f891a629443 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -48,6 +48,8 @@ static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking) /* reset error state */ 
state->errormsg_buf[0] = '\0'; decoded = NULL; + state->EndOfWAL = false; state->abortedRecPtr = InvalidXLogRecPtr; state->missingContrecPtr = InvalidXLogRecPtr; @@ -640,25 +644,21 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Validate the record header. * - * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. - */ - record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; - - /* - * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Even though we use an XLogRecord pointer here, the whole record header + * might not fit on this page. If the whole record header is on this page, + * validate it immediately. Even otherwise xl_tot_len must be on this page + * (it is the first field of MAXALIGNed records), but we still cannot + * access any further fields until we've verified that we got the whole + * header, so do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. 
*/ + record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); + if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) { if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record, @@ -668,18 +668,14 @@ restart: } else { - /* XXX: more validation should be done here */ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } + gotheader = false; } + total_len = record->xl_tot_len; + /* * Find space to decode this record. Don't allow oversized allocation if * the caller requested nonblocking. Otherwise, we *have* to try to @@ -1106,16 +1102,47 @@ XLogReaderInvalReadState(XLogReaderState *state) } /* - * Validate an XLOG record header. + * Validate record length of an XLOG record header. * - * This is just a convenience subroutine to avoid duplicated code in - * XLogReadRecord. It's not intended for use from anywhere else. + * This is substantially a part of ValidXLogRecordHeader. But XL
Re: Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE
On Fri, Oct 28, 2022 at 12:08 AM vignesh C wrote:
>
> Hi,
>
> Tab completion for the ALTER FUNCTION/PROCEDURE/ROUTINE actions was
> missing; this patch adds the tab completion for the same.
>
> Regards,
> Vignesh

Hi, I applied your patch and did some tests. Is it okay not to cover the SET and RESET commands (e.g., ALTER FUNCTION)?

---
Regards,
DongWook Lee.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> The block sizes don't need to match, do they? As long as the block is properly
> aligned, we can change the iov_len of the final iov to match whatever the size
> is being passed in, no?

Hmm. Based on what Bharath has written upthread, it does not seem to matter if the size of the aligned block changes, either:
https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com

I am honestly not sure whether it is a good idea to make file_utils.c depend on one of the compile-time page sizes in this routine, be it the data page size or the WAL page size, as pg_write_zeros() would be used for some rather low-level operations. But we could just as well use a locally-defined structure with a buffer of 4kB or 8kB and call it a day?
--
Michael

signature.asc
Description: PGP signature
Re: Allow file inclusion in pg_hba and pg_ident files
On Thu, Oct 27, 2022 at 12:26:25PM +0800, Julien Rouhaud wrote:
> On Thu, Oct 27, 2022 at 12:08:31PM +0900, Michael Paquier wrote:
>>
>> Putting things afresh, there are two different things here (sorry I
>> need to see that typed ;p):
>> 1) How do we want to check reliably the loading of the HBA and ident
>> files on errors?
>
> I guess you meant the failure to load HBA / ident files containing invalid
> data?

Yeah.

>> Hmm. And what if we just gave up on the checks for error patterns in
>> pg_hba_file_rules?

One part I was thinking about when typing this yesterday is that an EXEC_BACKEND build should work on non-WIN32 in TAP even if pg_ident.conf cannot be loaded, but I forgot entirely about the part where we need a user mapping for the SSPI authentication on WIN32, as set by pg_regress.

> We discussed this problem in the past (1), and my understanding was that
> detecting a -DEXEC_BACKEND/Win32 build and skipping those tests in that case
> would be an acceptable solution to make sure there's at least some coverage.
> The proposed patch adds such an approach, making sure that the failure is due
> to an invalid HBA file. If you changed your mind I can remove that part, but
> again I'd like to be sure of what you exactly want before starting to rewrite
> stuff.

I am still not completely sure what's the best way to do things here, so let's do the following: let's keep the patch the way you think is better for now (I may change my opinion on that but I'll hack on that by myself anyway). Using what you have as a base, could you split the test and reduce it to its simplest form to ease its review? It would let us stress the buildfarm with a first version of the test and see how it goes from there, and it is useful by itself IMO as HEAD has zero coverage for this area.

> I'm not sure what you mean here. The patch does check for all the errors,
> looking at LOG lines and CONTEXT lines, but to make the regexp easier it
> doesn't try to make sure that each CONTEXT line is immediately following the
> expected LOG line.

Hmm. Perhaps we'd better make sure that the LOG/CONTEXT link is checked? The CONTEXT line includes the line number inside a generic sentence, while the LOG line provides all the details of the error happening.

> That's why the errors are divided in 2 steps: a first step with a single error
> using some inclusion, so we can validate that the CONTEXT line is entirely
> correct (wrt. line number and such), and then every possible error pattern,
> where we assume that the CONTEXT lines are still following their LOG entries if
> they're found. It also has the knowledge of which errors add a CONTEXT line
> and which don't. And that's done twice, for HBA and ident.

Okay, so you do check the relationship between both, after all.

> The part 3 is just concatenating everything back, for HBA and ident. So
> long-term maintenance shouldn't get any harder, as there won't be any need for
> more steps. We can just keep appending stuff in the 2nd step and all the
> tests should run as expected.

Hmm. Okay.
--
Michael

signature.asc
Description: PGP signature
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Oct 27, 2022 at 11:34 AM shiy.f...@fujitsu.com wrote: > > On Wed, Oct 26, 2022 7:19 PM Amit Kapila wrote: > > > > On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada > > wrote: > > > > > > On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com > > > wrote: > > > > > > I've started to review this patch. I tested v40-0001 patch and have > > > one question: > > > > > > IIUC even when most of the changes in the transaction are filtered out > > > in pgoutput (eg., by relation filter or row filter), the walsender > > > sends STREAM_START. This means that the subscriber could end up > > > launching parallel apply workers also for almost empty (and streamed) > > > transactions. For example, I created three subscriptions each of which > > > subscribes to a different table. When I loaded a large amount of data > > > into one table, all three (leader) apply workers received START_STREAM > > > and launched their parallel apply workers. > > > > > > > The apply workers will be launched just the first time then we > > maintain a pool so that we don't need to restart them. > > > > > However, two of them > > > finished without applying any data. I think this behaviour looks > > > problematic since it wastes workers and rather decreases the apply > > > performance if the changes are not large. Is it worth considering a > > > way to delay launching a parallel apply worker until we find out the > > > amount of changes is actually large? > > > > > > > I think even if changes are less there may not be much difference > > because we have observed that the performance improvement comes from > > not writing to file. > > > > > For example, the leader worker > > > writes the streamed changes to files as usual and launches a parallel > > > worker if the amount of changes exceeds a threshold or the leader > > > receives the second segment. After that, the leader worker switches to > > > send the streamed changes to parallel workers via shm_mq instead of > > > files. 
> > >
>
> > I think writing to file won't be a good idea as that can hamper the
> > performance benefit in some cases, and I'm not sure if it is worth it.
>
> I tried to test some cases where only a small part of the transaction, or an
> empty transaction, is sent to the subscriber, to see if using streaming
> parallel brings any performance degradation.
>
> The test was performed ten times, and the average was taken.
> The results are as follows. The details and the script of the test are
> attached.
>
> 10% of rows are sent
> --
> HEAD     24.4595
> patched  18.4545
>
> 5% of rows are sent
> --
> HEAD     21.244
> patched  17.9655
>
> 0% of rows are sent
> --
> HEAD     18.0605
> patched  17.893
>
> It shows that when only 5% or 10% of rows are sent to the subscriber, using
> parallel apply takes less time than HEAD, and even if all rows are filtered
> there's no performance degradation.

Thank you for the testing! I think this performance improvement comes from both applying changes in parallel to receiving them and avoiding writing a file. I'm happy to know there is a benefit also for small streaming transactions. I've also measured the overhead of processing streaming empty transactions and confirmed that it is negligible.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Support logical replication of DDLs
On Fri, Oct 28, 2022 at 11:20 AM Zheng Li wrote: > > > 1. It might be useful to add this thread to the commitfest, if only so > > the cfbot can discover the latest patch set and alert about any rebase > > problems. > > There is already a commitfest entry for the thread that I added back in March: > https://commitfest.postgresql.org/40/3595/ Sorry, I missed that earlier because I searched only by authors, and some were missing. Now I saw it has just been updated - thanks. -- Kind Regards, Peter Smith. Fujitsu Australia.
Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)
On Fri, Dec 13, 2019 at 03:03:47PM +1300, Thomas Munro wrote: > > Actually, I tried using pg_ls_tmpdir(), but it unconditionally masks > > non-regular files and thus shared filesets. Maybe that's worth > > discussion on a new thread ? > > > > src/backend/utils/adt/genfile.c > > /* Ignore anything but regular files */ > > if (!S_ISREG(attrib.st_mode)) > > continue; > > +1, that's worth fixing. @cfbot: rebased on eddc128be. >From f641f63a1ff3efd90a3831927c758636cc82d080 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Mon, 16 Mar 2020 14:12:55 -0500 Subject: [PATCH v37 01/11] Document historic behavior of links to directories.. Backpatch to 9.5: pg_stat_file --- doc/src/sgml/func.sgml | 4 1 file changed, 4 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 6e0425cb3dc..d958c3e74ac 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -27700,6 +27700,10 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); platforms only), file creation time stamp (Windows only), and a flag indicating if it is a directory. + + If filename is a link, this function returns information about the file + or directory the link refers to. + This function is restricted to superusers by default, but other users can be granted EXECUTE to run the function. 
-- 2.25.1 >From 3f28fa01fa8e01b633e5c2a24ba4deed4abe6053 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 17 Mar 2020 13:16:24 -0500 Subject: [PATCH v37 02/11] Add tests before changing pg_ls_* --- src/test/regress/expected/misc_functions.out | 59 src/test/regress/sql/misc_functions.sql | 15 + 2 files changed, 74 insertions(+) diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out index 88bb696ded8..77d285ecc85 100644 --- a/src/test/regress/expected/misc_functions.out +++ b/src/test/regress/expected/misc_functions.out @@ -480,6 +480,65 @@ select count(*) > 0 from t (1 row) +select * from (select pg_ls_dir('.', false, true) as name) as ls where ls.name='.'; -- include_dot_dirs=true + name +-- + . +(1 row) + +select * from (select pg_ls_dir('.', false, false) as name) as ls where ls.name='.'; -- include_dot_dirs=false + name +-- +(0 rows) + +select pg_ls_dir('does not exist', true, false); -- ok with missingok=true + pg_ls_dir +--- +(0 rows) + +select pg_ls_dir('does not exist'); -- fails with missingok=false +ERROR: could not open directory "does not exist": No such file or directory +-- Check that expected columns are present +select * from pg_ls_archive_statusdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_logdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_logicalmapdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_logicalsnapdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_replslotdir('') limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_tmpdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_waldir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_stat_file('.') limit 0; + size | access | modification | change | creation | isdir +--++--++--+--- +(0 rows) + -- -- 
Test replication slot directory functions -- diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql index b07e9e8dbb3..d299f3d8949 100644 --- a/src/test/regress/sql/misc_functions.sql +++ b/src/test/regress/sql/misc_functions.sql @@ -166,6 +166,21 @@ select count(*) > 0 from where spcname = 'pg_default') pts join pg_database db on pts.pts = db.oid; +select * from (select pg_ls_dir('.', false, true) as name) as ls where ls.name='.'; -- include_dot_dirs=true +select * from (select pg_ls_dir('.', false, false) as name) as ls where ls.name='.'; -- include_dot_dirs=false +select pg_ls_dir('does not exist', true, false); -- ok with missingok=true +select pg_ls_dir('does not exist'); -- fails with missingok=false + +-- Check that expected columns are present +select * from pg_ls_archive_statusdir() limit 0; +select * from pg_ls_logdir() limit 0; +select * from pg_ls_logicalmapdir() limit 0; +select * from pg_ls_logicalsnapdir() limit 0; +select * from pg_ls_replslotdir('') limit 0; +select * from pg_ls_tmpdir() limit 0; +select * from pg_ls_waldir() limit 0; +select * from pg_stat_file('.') limit 0; + -- -- Test replication slot directory functions -- -- 2.25.1 >From 2fcff37d19869c83eab7c8116946854f52a20515 Mon Sep
Re: Support logical replication of DDLs
> Hi, authors on this thread. > > The patch v32-0001 is very large, so it will take some time to review > the code in detail. Thanks for reviewing! > Meanwhile, here are some general comments about the patch: > > == > > 1. It might be useful to add this thread to the commitfest, if only so > the cfbot can discover the latest patch set and alert about any rebase > problems. There is already a commitfest entry for the thread that I added back in March: https://commitfest.postgresql.org/40/3595/ > 2. User interface design/documentation? > > Please consider adding user interface documentation, so it is > available for review sooner than later. This comment was previously > posted 2 weeks ago [1] but no replies. > > I can only guess (from the test of patch 0004) that the idea is to use > another 'ddl' option for the 'publish' parameter: > CREATE PUBLICATION mypub FOR ALL TABLES with (publish = 'insert, > update, delete, ddl'); > > My first impression is that a blanket option like that could be > painful for some users who DO (for example) want the convenience of > the DDL replication automagically creating new tables on the fly, but > maybe they do NOT want the side-effect of replicating every other kind > of DDL as well. > > Maybe such scenarios can be handled by another publication parameter > which can allow more fine-grained DDL replication like: > CREATE PUBLICATION mypub FOR ALL TABLES WITH (ddl = 'tables') > > I also have lots of usability questions but probably the documentation > would give the answers to those. IMO the docs for the user interface > and runtime behaviours should not be deferred - they should form part > of this patch 0001. We've been deferring the discussion on user interface syntax (and documentation) until we get the DDL deparser in a good shape. I agree it's time to pick up the discussion again now that we're getting close to fully integrating the DDL deparser with DDL replication. 
I think it makes sense to introduce different DDL replication granularity levels; for example, I think the most important levels would be ddl = 'tables' and ddl = 'database' (or ddl = 'all').

> 5. File size
>
> As mentioned above in #4, the src/backend/commands/ddl_deparse.c is
> huge (9200+ lines as at v32-0001). It is already unwieldy. Is there
> some way to reduce this? For example, perhaps many of those
> "utility/helper" functions (even though they are static) would be
> better moved out to another file simply to get things down to a more
> manageable size.

Yes, I think we can split patch 0001 into a bare-bones patch for a few essential commands and a patch for the rest of the commands, for ease of review.

Another topic we haven't discussed is the ownership of the replicated objects. Currently all the replicated objects are owned by the subscription owner, regardless of their owners in the publisher database. I think we can consider making this user-configurable so that the ownership of the replicated objects matches that of their original owners, in certain use cases such as a full-database logical replica scenario. Otherwise the DBA will have to fix the ownership structure manually, which could be painful. Thoughts?

Regards,
Zheng
Re: Requiring 32 bit atomics
Hi, On 2022-10-27 19:44:13 -0400, Tom Lane wrote: > Turns out they have a pretty cute workaround for it, on HPPA and a couple of > other atomics-less arches they still support. They've written short > sequences that have the effect of CAS and are designed to store to memory > only at the end. To make them atomic, libc asks the kernel "pretty please, > if you happen to notice that I've been interrupted in the PC range from here > to here, would you reset the PC to the start of that before returning?". That sounds roughly like restartable sequences in the linux world - a pretty cool feature. It's too bad that it's not yet available everywhere, it does make some things a lot easier [to make performant]. > Anyway, I think the big picture here is that nowadays we could > assume that the platform offers this feature. Agreed. Greetings, Andres Freund
Re: Requiring 32 bit atomics
I wrote: > But wait, you say, what about mamba-nee-gaur, my HPPA dinosaur? sigh ... s/mamba/chickadee/. Got too many NetBSD machines, perhaps. regards, tom lane
Re: Requiring 32 bit atomics
Thomas Munro writes: > We have fallback code for computers that don't have 32 bit atomic ops. > Of course all modern ISAs have 32 bit atomics, but various comments > imagine that a new architecture might be born that we don't have > support for yet, so the fallback provides a way to bring a new system > up by implementing only the spinlock operations and emulating the > rest. This seems pretty strange to me: by the time someone brings an > SMP kernel up on a hypothetical new architecture and gets around to > porting relational databases, it's hard to imagine that the compiler > builtins and C11 atomic support wouldn't be working. Fair point. Another point you could make is that we no longer have any test coverage for machines without 32-bit atomic ops. But wait, you say, what about mamba-nee-gaur, my HPPA dinosaur? The only actual hardware support there is equivalent to TAS(); nonetheless, if you read mamba's configure report you'll see it claims to have atomic ops. I wondered if NetBSD was implementing that by using kernel calls to disable interrupts, or something equally badly-performing. Turns out they have a pretty cute workaround for it, on HPPA and a couple of other atomics-less arches they still support. They've written short sequences that have the effect of CAS and are designed to store to memory only at the end. To make them atomic, libc asks the kernel "pretty please, if you happen to notice that I've been interrupted in the PC range from here to here, would you reset the PC to the start of that before returning?". At least on HPPA, this is implemented for 8-bit, 16-bit, and 32-bit CAS and then all the other standard atomics are implemented on top of that, so that the kernel doesn't spend too much time checking for these address ranges when it takes an interrupt. Of course this only works on single-CPU machines. On multi-CPU there's a completely different implementation that I've not spent time looking at ... but I assume the performance is a lot worse. 
Anyway, I think the big picture here is that nowadays we could assume that the platform offers this feature. regards, tom lane
Re: Proposal to use JSON for Postgres Parser format
Hi, On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
> There are certainly reasons to think about changing the node tree
> storage format; but if we change it, I'd like to see it go to something
> more compact not more verbose.

Very much seconded - the various pg_node_trees are a quite significant fraction of the overall size of an empty database. And they're not particularly useful for a human either. IIRC it's not just catalog storage that's affected; the format is also relevant for parallel query.

My pet peeve is the way datums are output: as individual bytes, each printed as an integer - which for narrow fixed-width datums includes a lot of 0's for bytes that aren't even used in the datum.

> Maybe a compromise could be found whereby we provide a conversion function
> that converts whatever the catalog storage format is to some JSON
> equivalent. That would address the needs of external code that doesn't want
> to write a custom parser, while not tying us directly to JSON.

+1

Greetings, Andres Freund
Re: Support logical replication of DDLs
Hi, authors on this thread. The patch v32-0001 is very large, so it will take some time to review the code in detail. Meanwhile, here are some general comments about the patch:

==

1. It might be useful to add this thread to the commitfest, if only so the cfbot can discover the latest patch set and alert about any rebase problems.

~~~

2. User interface design/documentation?

Please consider adding user interface documentation, so it is available for review sooner rather than later. This comment was previously posted 2 weeks ago [1] but got no replies. I can only guess (from the test of patch 0004) that the idea is to use another 'ddl' option for the 'publish' parameter:

CREATE PUBLICATION mypub FOR ALL TABLES with (publish = 'insert, update, delete, ddl');

My first impression is that a blanket option like that could be painful for some users who DO (for example) want the convenience of the DDL replication automagically creating new tables on the fly, but maybe they do NOT want the side-effect of replicating every other kind of DDL as well. Maybe such scenarios can be handled by another publication parameter which can allow more fine-grained DDL replication like:

CREATE PUBLICATION mypub FOR ALL TABLES WITH (ddl = 'tables')

I also have lots of usability questions, but probably the documentation would give the answers to those. IMO the docs for the user interface and runtime behaviours should not be deferred - they should form part of this patch 0001.

~~~

3. Why all-or-nothing?

The current strategy for this thread appears to be to implement *everything* in the underlying code, and then figure out what to do with it. I'm not sure if the all-or-nothing approach is best - it just feels risky to me, so I hope it will not result in a ton of work that ends up unused. Why not instead implement just some core set of the DDL replications that are the most wanted ones (e.g. create/alter/drop tables/whatever?) and try to get that subset committed first?
Then the remainder can be added incrementally. But this harks back to comment #2: the user interface would need to allow flexibility to do it like this.

~~~

4. Function name inconsistency.

Even if it was not obvious from the posts, it is clear there are multiple authors. As an example, the file src/backend/commands/ddl_deparse.c is 9200+ lines (and growing rapidly), and the functions in this module are a mixture of many different naming conventions, scattered around the source file in different ways. I suggest this all needs to be brought under some control ASAP, by introducing some strict naming convention and sticking to it. For example, you might do something like:

* xxx_util_foo()
* xxx_util_bah()
* xxx_deparse_alter()
* xxx_deparse_create()
* xxx_whatever()

where xxx is the main object (table, sequence, schema, etc). Then order everything alphabetically, so that related stuff ends up together. IMO grouping functions like this will also make reviewing different objects far easier.

~~~

5. File size

As mentioned above in #4, the src/backend/commands/ddl_deparse.c is huge (9200+ lines as at v32-0001). It is already unwieldy. Is there some way to reduce this? For example, perhaps many of those "utility/helper" functions (even though they are static) would be better moved out to another file, simply to get things down to a more manageable size.

--

[1] https://www.postgresql.org/message-id/CAHut%2BPtKiTmcQ7zXs6YvR-qtuMQ9wgffnfamqCAVpM_ETa2LCg%40mail.gmail.com

Kind Regards, Peter Smith. Fujitsu Australia
Re: pg_recvlogical prints bogus error when interrupted
Hi, On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:
> I came up with the attached v2 patch, please have a look.

Thanks for working on this!

> + /* When we get SIGINT/SIGTERM, we exit */
> + if (ready_to_exit)
> + {
> + /*
> + * Try informing the server about our exit, but don't wait around
> + * or retry on failure.
> + */
> + (void) PQputCopyEnd(conn, NULL);
> + (void) PQflush(conn);
> + time_to_abort = ready_to_exit;

This doesn't strike me as great - because ready_to_exit isn't checked in the loop around StreamLogicalLog(), we'll reconnect if something else causes StreamLogicalLog() to return. Why do we need both time_to_abort and ready_to_exit?

Perhaps worth noting that time_to_abort is still a sig_atomic_t, but isn't modified in a signal handler, which seems a bit unnecessarily confusing.

Greetings, Andres Freund
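For contrast, the case where sig_atomic_t *is* warranted — a flag genuinely set from a signal handler — looks like this minimal sketch (names are ours for illustration, not pg_recvlogical's actual code):

```c
#include <assert.h>
#include <signal.h>

/* a flag set from a signal handler must be volatile sig_atomic_t */
static volatile sig_atomic_t got_exit_signal = 0;

static void
exit_handler(int signo)
{
	got_exit_signal = 1;		/* the only async-signal-safe thing we do */
}

/* install the handler, simulate Ctrl-C, and report what the main loop sees */
static int
signal_flag_demo(void)
{
	signal(SIGINT, exit_handler);
	raise(SIGINT);				/* delivered synchronously to this thread */
	return got_exit_signal;
}
```

A variable only ever written from normal code, like time_to_abort here, can simply be a plain bool — using sig_atomic_t for it suggests a signal-handler interaction that doesn't exist.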
Re: Documentation for building with meson
Hi, On 2022-10-27 14:15:32 -0700, Jacob Champion wrote: > On Thu, Oct 27, 2022 at 1:04 AM John Naylor > wrote: > > This does not work for me in a fresh install until running > > > > meson test --suite setup > > > > In fact, we see in > > > > https://wiki.postgresql.org/wiki/Meson > > > > meson test --suite setup --suite main > > (Is there a way to declare a dependency on the setup suite in Meson, > so that we don't have to specify it manually? I was bitten by this > recently; if you make a code change and forget to run setup, it'll > recompile locally but then skip reinstallation, giving false test > results.) Tests can have dependencies, and they're correctly built. The problem however is that, for historical reasons if I understand correctly, dependencies of tests are automatically included in the default 'all' target. Which means if you just type in 'ninja', it'd automatically create the test installation - which is probably not what we want, given that that's not a fast step on some platforms. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, Interestingly, I also needed something like pg_pwrite_zeros() today - exposed via smgr, for more efficient relation extensions.

On 2022-10-27 14:54:00 +0900, Michael Paquier wrote:
> Regarding 0002, using pg_pwrite_zeros() as a routine name, as
> suggested by Thomas, sounds good to me. However, I am not really a
> fan of its dependency with PGAlignedXLogBlock, because it should be
> able to work with any buffers of any sizes, as long as the input
> buffer is aligned, shouldn't it? For example, what about
> PGAlignedBlock? So, should we make this more extensible? My guess
> would be the addition of the block size and the block pointer to the
> arguments of pg_pwrite_zeros(), in combination with a check to make
> sure that the input buffer is MAXALIGN()'d (with an Assert() rather
> than just an elog/pg_log_error?).

I don't like passing in the buffer. That leads to code like in Bharath's latest version, where we now zero that buffer on every invocation of pg_pwrite_zeros() - not at all cheap. And every caller has to have provisions to provide that buffer.

The block sizes don't need to match, do they? As long as the block is properly aligned, we can change the iov_len of the final iov to match whatever size is being passed in, no?

Why don't we define a static PGAlignedBlock source_of_zeroes; in file_utils.c, and use that in pg_pwrite_zeros(), being careful to set the iov_len arguments correctly?

Greetings, Andres Freund
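The scheme suggested here — one static zeroed block reused for every call, with the final iov_len trimmed so arbitrary sizes work — might look roughly like this POSIX sketch (names, constants, and error handling are ours; this is not the real pg_pwrite_zeros()):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

#define ZBLCKSZ 8192			/* stand-in for a PGAlignedBlock-sized buffer */

/* static storage is zero-initialized once, for free - no per-call memset */
static const char zeroes[ZBLCKSZ];

static ssize_t
pwrite_zeros_sketch(int fd, size_t size, off_t offset)
{
	size_t		done = 0;

	while (done < size)
	{
		struct iovec iov[4];
		size_t		remaining = size - done;
		int			n = 0;

		while (n < 4 && remaining > 0)
		{
			size_t		chunk = remaining < ZBLCKSZ ? remaining : ZBLCKSZ;

			iov[n].iov_base = (void *) zeroes;
			iov[n].iov_len = chunk;	/* the last iov is trimmed to the tail */
			remaining -= chunk;
			n++;
		}

		ssize_t		w = pwritev(fd, iov, n, (off_t) (offset + done));

		if (w < 0)
			return -1;
		done += (size_t) w;		/* short writes simply loop again */
	}
	return (ssize_t) done;
}

/* write an odd, non-block-multiple number of zero bytes and check the size */
static long
pwrite_zeros_demo(void)
{
	char		path[] = "/tmp/pwrite_zeros_demo_XXXXXX";
	int			fd = mkstemp(path);
	struct stat st;

	if (fd < 0)
		return -1;
	unlink(path);
	if (pwrite_zeros_sketch(fd, 20001, 0) != 20001)
	{
		close(fd);
		return -1;
	}
	fstat(fd, &st);
	close(fd);
	return (long) st.st_size;
}
```

The key property is that no caller needs to supply (or zero) a buffer, and the requested size need not be a multiple of the block size.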
Requiring 32 bit atomics
Hi, We have fallback code for computers that don't have 32 bit atomic ops. Of course all modern ISAs have 32 bit atomics, but various comments imagine that a new architecture might be born that we don't have support for yet, so the fallback provides a way to bring a new system up by implementing only the spinlock operations and emulating the rest. This seems pretty strange to me: by the time someone brings an SMP kernel up on a hypothetical new architecture and gets around to porting relational databases, it's hard to imagine that the compiler builtins and C11 atomic support wouldn't be working. I suppose this could be considered in the spirit of recent cleanup of obsolete code in v16.

The specific reason I'm interested is that I have a couple of different experimental patches in development that would like to use atomic ops from a signal handler, which is against the law if they're emulated with spinlocks due to self-deadlock. Not sure if it's really a blocker, I can surely find some way to code around the limitation (I want to collapse a lot of flags into a single word and set them with fetch_or), but it seemed a little weird to have to do so for such an unlikely hypothetical consideration. (64 bit atomics are another matter, real hardware exists that doesn't have them.)

No patch yet, just running a flame-proof flag up the pole before investing effort...
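The flags-in-one-word pattern mentioned above might look like this C11 sketch (the flag names are invented for illustration); the point is that atomic_fetch_or from a signal handler is only safe when the atomics are real, lock-free instructions rather than spinlock emulation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* hypothetical flag bits collapsed into a single 32 bit word */
#define FLAG_LATCH_SET		(1U << 0)
#define FLAG_PROC_SIGNAL	(1U << 1)
#define FLAG_TIMEOUT		(1U << 2)

static _Atomic uint32_t pending_flags;

/* could be called from a signal handler: one lock-free RMW, no deadlock risk */
static void
set_flag(uint32_t flag)
{
	atomic_fetch_or(&pending_flags, flag);
}

/* the main loop grabs and clears all pending flags in one atomic swap */
static uint32_t
consume_flags(void)
{
	return atomic_exchange(&pending_flags, 0);
}

static int
flags_demo(void)
{
	set_flag(FLAG_LATCH_SET);
	set_flag(FLAG_TIMEOUT);
	if (consume_flags() != (FLAG_LATCH_SET | FLAG_TIMEOUT))
		return 0;
	return consume_flags() == 0;	/* everything consumed in one go */
}
```

With spinlock-emulated atomics, set_flag() called from a handler that interrupted another atomic op on the same word would self-deadlock — which is exactly the restriction being complained about.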
Re: Documentation for building with meson
On Thu, Oct 27, 2022 at 1:04 AM John Naylor wrote: > This does not work for me in a fresh install until running > > meson test --suite setup > > In fact, we see in > > https://wiki.postgresql.org/wiki/Meson > > meson test --suite setup --suite main (Is there a way to declare a dependency on the setup suite in Meson, so that we don't have to specify it manually? I was bitten by this recently; if you make a code change and forget to run setup, it'll recompile locally but then skip reinstallation, giving false test results.) --Jacob
Re: Have nodeSort.c use datum sorts single-value byref types
On Wed, 26 Oct 2022 at 23:35, David Rowley wrote: > I think this is a fairly trivial patch, so if nobody objects, I plan > to push it in the next few days. Pushed. David
Re: Allow single table VACUUM in transaction block
On Thu, Oct 27, 2022 at 10:31:31AM +0100, Simon Riggs wrote:
> Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL.

Maybe I misunderstood what you meant: you said "not VACUUM FULL", but with your patch, that works:

postgres=# begin; VACUUM FULL pg_class; commit;
BEGIN
VACUUM
COMMIT

Actually, I've thought before that it was a bit weird that CLUSTER can be run within a transaction, but VACUUM FULL cannot (even though it does a CLUSTER behind the scenes). VACUUM FULL can process multiple relations, whereas CLUSTER can't, but it seems nice to allow vacuum full for the case of a single relation.

I haven't checked the rest of the patch, but +1 for allowing VACUUM FULL within a user txn. Maybe the error message needs to be qualified "...when multiple relations are specified".

ERROR: VACUUM cannot run inside a transaction block

-- Justin
Re: Reducing planning time on tables with many indexes
On 2022-Aug-19, David Geier wrote: > Beyond that I did some off-CPU profiling to precisely track down which lock > serializes execution. It turned out to be the MyProc::fpInfoLock lightweight > lock. This lock is used in the fast path of the heavyweight lock. In the > contenting case, fpInfoLock is acquired in LW_EXCLUSIVE mode to (1) check if > there is no other process holding a stronger lock, and if not, to reserve a > process local fast path lock slot and (2) to return the fast path lock slots > all in one go. To do so, the current implementation always linearly iterates > over all lock slots. Ah, so this is the aspect that you mentioned to me today. I definitely think that this analysis deserves its own thread, and the fix is its own separate patch. > I have attached the patch to improve the heavyweight lock fast path. It also > for now contains moving out _bt_getrootheight(). For workloads where the > same set of locks is used over and over again, it only needs on average a > single loop iteration to find the relation (instead of a linear scan > before). This allows to increase the number of fast path locks by a lot. In > this patch I increased them from 16 to 64. The code can be further improved > for cases where to be locked relations change frequently and therefore the > chance of not finding a relation and because of that having to linearly > search the whole array is higher. I suggest to put each change in a separate patch: 1. improve fast-path lock algorithm to find the element, perhaps together with increasing the number of elements in the array 2. change _bt_getrootheight However, since patch (1) may have nontrivial performance implications, you would also need to justify the change: not only that improves the case where many locks are acquired, but also that it does not make the case with few locks worse. I strongly suggest to not include C++ comments or any other dirtiness in the patch, as that might deter some potential reviewers. 
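As a toy illustration of the fast-path idea discussed above (not the actual patch — PostgreSQL currently has 16 slots and different data structures; the names and constants here are ours), starting the probe at a position derived from the relation id lets a stable workload find its slot in roughly one step instead of a linear scan over all slots:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FP_SLOTS 64				/* the patch raises the slot count to 64 */

typedef struct
{
	uint32_t	relid;
	int			used;
} FpSlot;

/*
 * Probe starting at hash(relid): a workload that locks the same relations
 * over and over hits its slot on the first iteration on average, instead
 * of linearly scanning the whole array as the current code does.
 */
static int
fp_find_or_insert(FpSlot *slots, uint32_t relid)
{
	uint32_t	start = relid % FP_SLOTS;
	int			free_idx = -1;
	int			i;

	for (i = 0; i < FP_SLOTS; i++)
	{
		int			idx = (int) ((start + i) % FP_SLOTS);

		if (slots[idx].used && slots[idx].relid == relid)
			return idx;			/* found on the first probe in the stable case */
		if (!slots[idx].used && free_idx < 0)
			free_idx = idx;		/* remember the first free slot */
	}
	if (free_idx >= 0)
	{
		slots[free_idx].relid = relid;
		slots[free_idx].used = 1;
	}
	return free_idx;			/* -1 if the array is full */
}

static int
fp_demo(void)
{
	FpSlot		slots[FP_SLOTS];
	int			a,
				b;

	memset(slots, 0, sizeof(slots));
	a = fp_find_or_insert(slots, 16385);
	b = fp_find_or_insert(slots, 16385);	/* re-acquire: same slot, one probe */
	return a >= 0 && a == b && a == 16385 % FP_SLOTS;
}
```

The worst case (relation not present, array full) is still a full scan, which matches the caveat in the quoted mail about workloads whose locked relations change frequently.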
-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
heavily contended lwlocks with long wait queues scale badly
Hi, I am working on posting a patch series making relation extension more scalable. As part of that I was running some benchmarks for workloads that I thought should be unaffected or just positively impacted - but I was wrong, there was some very significant degradation at very high client counts. After pulling my hair out for quite a while to try to understand that behaviour, I figured out that it's just a side-effect of *removing* some other contention. This morning, turns out sleeping helps, I managed to reproduce it in an unmodified postgres.

$ cat ~/tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

1      60174
2     116169
4     208119
8     373685
16    515247
32    554726
64    497508
128   415097
256   334923
512   243679
768   192959
1024  157734
2048   82904
4096   32007

(I didn't properly round TPS, but that doesn't matter here)

Performance completely falls off a cliff starting at ~256 clients. There's actually plenty CPU available here, so this isn't a case of running out of CPU time. Rather, the problem is very bad contention on the "spinlock" for the lwlock wait list. I realized that something in that direction was off when trying to investigate why I was seeing spin delays of substantial duration (>100ms).

The problem isn't a fundamental issue with lwlocks, it's that LWLockDequeueSelf() does this:

LWLockWaitListLock(lock);

/*
 * Can't just remove ourselves from the list, but we need to iterate over
 * all entries as somebody else could have dequeued us.
 */
proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
{
    if (iter.cur == MyProc->pgprocno)
    {
        found = true;
        proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
        break;
    }
}

I.e. it iterates over the whole waitlist to "find itself". The longer the waitlist gets, the longer this takes.
And the longer it takes for LWLockWakeup() to actually wake up all waiters, the more likely it becomes that LWLockDequeueSelf() needs to be called.

We can't make the trivial optimization and use proclist_contains(), because PGPROC->lwWaitLink is also used for the list of processes to wake up in LWLockWakeup(). But I think we can solve that fairly reasonably nonetheless. We can change PGPROC->lwWaiting to not just be a boolean, but have three states:

0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the list if we're on it. As removal from that list is protected by the wait list lock, there's no race to worry about.

client  patched    HEAD
1         60109   60174
2        112694  116169
4        214287  208119
8        377459  373685
16       524132  515247
32       565772  554726
64       587716  497508
128      581297  415097
256      550296  334923
512      486207  243679
768      449673  192959
1024     410836  157734
2048     326224   82904
4096     250252   32007

Not perfect with the patch, but not awful either.

I suspect this issue might actually explain quite a few odd performance behaviours we've seen at the larger end in the past. I think it has gotten a bit worse with the conversion of lwlock.c to proclists (I see lots of expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely exists at least as far back as ab5194e6f61, in 9.5.

I guess there's an argument for considering this a bug that we should backpatch a fix for? But given the vintage, probably not? The only thing that gives me pause is that this is quite hard to pinpoint as happening.

I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc, but I wanted to get this out to discuss before spending further time.
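The proposed tri-state scheme can be modeled with a toy single-threaded sketch (the real code protects the list with the wait-list spinlock; names only loosely follow the patch):

```c
#include <assert.h>
#include <stddef.h>

/* the three proposed states for PGPROC->lwWaiting */
enum
{
	LW_WS_NOT_WAITING = 0,		/* not waiting for a lock */
	LW_WS_WAITING = 1,			/* on the lock's wait list */
	LW_WS_PENDING_WAKEUP = 2	/* removed from the list, wakeup pending */
};

typedef struct Proc
{
	int			lwWaiting;
	struct Proc *next;
} Proc;

typedef struct
{
	Proc	   *head;
} WaitList;

/* waker: detach everyone, marking them off-list before waking them */
static void
wakeup_all(WaitList *list)
{
	Proc	   *p = list->head;

	while (p)
	{
		Proc	   *next = p->next;

		p->next = NULL;
		p->lwWaiting = LW_WS_PENDING_WAKEUP;
		p = next;
	}
	list->head = NULL;
}

/* the fix: scan the list only if we are actually still on it */
static int
dequeue_self(WaitList *list, Proc *me)
{
	Proc	  **pp;

	if (me->lwWaiting != LW_WS_WAITING)
		return 0;				/* waker already removed us: O(1), no scan */
	for (pp = &list->head; *pp && *pp != me; pp = &(*pp)->next)
		;
	if (*pp)
		*pp = me->next;
	me->lwWaiting = LW_WS_NOT_WAITING;
	return 1;
}

static int
lw_demo(void)
{
	Proc		a = {LW_WS_WAITING, NULL};
	Proc		b = {LW_WS_WAITING, NULL};
	WaitList	list;

	a.next = &b;
	list.head = &a;
	wakeup_all(&list);
	/* both were woken first, so neither needs the O(n) list scan */
	return dequeue_self(&list, &a) == 0 &&
		dequeue_self(&list, &b) == 0 &&
		a.lwWaiting == LW_WS_PENDING_WAKEUP;
}
```

The common contended case — the waker got there first — now skips the list walk entirely, which is what flattens the cliff in the benchmark above a few hundred clients.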
Greetings, Andres Freund

diff --git i/src/include/storage/proc.h w/src/include/storage/proc.h
index 8d096fdeeb1..9a2615666a1 100644
--- i/src/include/storage/proc.h
+++ w/src/include/storage/proc.h
@@ -217,7 +217,8 @@ struct PGPROC
 	bool		recoveryConflictPending;
 
 	/* Info about LWLock the process is currently waiting for, if any. */
-	bool		lwWaiting;		/* true if waiting for an LW lock */
+	int			lwWaiting;		/* 0 if not waiting, 1 if on waitlist, 2 if
+								 * waiting to be woken */
 	uint8		lwWaitMode;		/* lwlock mode being waited for */
 	proclist_node lwWaitLink;	/* position in LW lock wait list */
diff --git i/src/backend/storage/lmgr/lwlock.c w/src/backend/storage/lmgr/lwlock.c
index d274c9b1dc9..ffb852c91d7 100644
--- i/src/backend/storage/lmgr/lwlock.c
+++ w/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,9 @@ LWLockWakeup(LWLock *lock)
 			wokeup_somebody = true;
 		}
 
+		/* signal that the process isn't on the wait list anymore */
+		waiter->lwWaiting = 2;
+
Re: Code checks for App Devs, using new options for transaction behavior
On Thu, 27 Oct 2022 at 12:09, Simon Riggs wrote:
> Comments please

Update from patch tester results.

-- Simon Riggs http://www.EnterpriseDB.com/

002_nested_xacts.v7.patch Description: Binary data
001_psql_parse_only.v1.patch Description: Binary data
003_rollback_on_commit.v1.patch Description: Binary data
004_add_params_to_sample.v1.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
On Thu, Oct 27, 2022 at 9:49 PM Simon Riggs wrote: > > On Thu, 27 Oct 2022 at 10:31, Simon Riggs > wrote: > > > Tests, docs. > > The patch tester says that a pg_upgrade test is failing on Windows, > but works for me. > > t/002_pg_upgrade.pl .. ok > > Anybody shed any light on that, much appreciated. Please see a recent thread on pg_upgrade failure - https://www.postgresql.org/message-id/Y04mN0ZLNzJywrad%40paquier.xyz. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Allow single table VACUUM in transaction block
On Thu, 27 Oct 2022 at 10:31, Simon Riggs wrote:
> Tests, docs.

The patch tester says that a pg_upgrade test is failing on Windows, but works for me.

t/002_pg_upgrade.pl .. ok

Anybody shed any light on that, much appreciated.

-- Simon Riggs http://www.EnterpriseDB.com/
Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE
Hi, Tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE action was missing, this patch adds the tab completion for the same. Regards, Vignesh

From 4b0473799a0d6af126ccd3d7802e5d0cbb83b944 Mon Sep 17 00:00:00 2001
From: "vignesh.c"
Date: Thu, 27 Oct 2022 14:00:46 +0530
Subject: [PATCH v1] Tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE.

Tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE action was missing,
added the tab completion for the same.
---
 src/bin/psql/tab-complete.c | 48 +
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a64571215b..bf3c5a7aae 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1805,12 +1805,52 @@ psql_completion(const char *text, int start, int end)
 		else
 			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
 	}
-	/* ALTER FUNCTION,PROCEDURE,ROUTINE (...) */
-	else if (Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny))
+	/* ALTER FUNCTION (...) */
+	else if (Matches("ALTER", "FUNCTION", MatchAny, MatchAny))
 	{
 		if (ends_with(prev_wd, ')'))
-			COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA",
-						  "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION");
+			COMPLETE_WITH("CALLED ON NULL INPUT", "COST",
+						  "DEPENDS ON EXTENSION", "EXTERNAL SECURITY",
+						  "IMMUTABLE", "LEAKPROOF", "NO DEPENDS ON EXTENSION",
+						  "NOT LEAKPROOF", "OWNER TO", "PARALLEL", "RENAME TO",
+						  "RETURNS NULL ON NULL INPUT ", "ROWS", "SECURITY",
+						  "SET SCHEMA", "STABLE", "STRICT", "SUPPORT",
+						  "VOLATILE");
+		else
+			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
+	}
+	/* ALTER FUNCTION|ROUTINE (...) PARALLEL */
+	else if (Matches("ALTER", "FUNCTION|ROUTINE", MatchAny, MatchAny,
+					 "PARALLEL"))
+		COMPLETE_WITH("RESTRICTED", "SAFE", "UNSAFE");
+	/*
+	 * ALTER FUNCTION|PROCEDURE|ROUTINE (...)
+	 * [EXTERNAL] SECURITY
+	 */
+	else if (Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny,
+					 "SECURITY") ||
+			 Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny,
+					 "EXTERNAL", "SECURITY"))
+		COMPLETE_WITH("DEFINER" ,"INVOKER");
+	/* ALTER PROCEDURE (...) */
+	else if (Matches("ALTER", "PROCEDURE", MatchAny, MatchAny))
+	{
+		if (ends_with(prev_wd, ')'))
+			COMPLETE_WITH("DEPENDS ON EXTENSION", "EXTERNAL SECURITY",
+						  "NO DEPENDS ON EXTENSION", "OWNER TO", "RENAME TO",
+						  "SECURITY", "SET SCHEMA");
+		else
+			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
+	}
+	/* ALTER ROUTINE (...) */
+	else if (Matches("ALTER", "ROUTINE", MatchAny, MatchAny))
+	{
+		if (ends_with(prev_wd, ')'))
+			COMPLETE_WITH("COST", "DEPENDS ON EXTENSION", "EXTERNAL SECURITY",
+						  "IMMUTABLE", "LEAKPROOF", "NO DEPENDS ON EXTENSION",
+						  "NOT LEAKPROOF", "OWNER TO", "PARALLEL", "RENAME TO",
+						  "ROWS", "SECURITY", "SET SCHEMA", "STABLE",
+						  "VOLATILE");
 		else
 			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
 	}
-- 
2.32.0
Re: Proposal to use JSON for Postgres Parser format
On Wed, Sep 21, 2022 at 11:04 AM Matthias van de Meent < boekewurm+postg...@gmail.com> wrote:
> On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera wrote:
> >
> > On 2022-Sep-20, Matthias van de Meent wrote:
> >
> > > Allow me to add: compressability
> > >
> > > In the thread surrounding [0] there were complaints about the size of
> > > catalogs, and specifically the template database. Significant parts of
> > > that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
> > > consists mostly of serialized Nodes. If we're going to replace our
> > > current NodeToText infrastructure, we'd better know we can effectively
> > > compress this data.
> >
> > True. Currently, the largest ev_action values compress pretty well. I
> > think if we wanted this to be more succint, we would have to invent some
> > binary format -- perhaps something like Protocol Buffers: it'd be stored
> > in the binary format in catalogs, but for output it would be converted
> > into something easy to read (we already do this for
> > pg_statistic_ext_data for example). We'd probably lose compressibility,
> > but that'd be okay because the binary format would already remove most
> > of the redundancy by nature.
> >
> > Do we want to go there?
>
> I don't think that a binary format would be much better for
> debugging/fixing than an optimization of the current textual format
> when combined with compression.

I agree, JSON is not perfect, but it compresses and it's usable everywhere. My personal need for this is purely developer experience and, as Tom pointed out, a "niche" need for sure, but we are starting to do some serious work with Dan Lynch's plpgsql deparser tool to generate RLS policies from meta schema models, and having the same format come out of the parser would make a complete end-to-end solution for us, especially if we can get this data from a function in a ddl_command_start event trigger.
Dan also writes a popular deparser for Javascript, and unifying the formats across these tools would be a big win for us.

> As I mentioned in that thread, there is a lot of improvement possible
> with the existing format, and I think any debugging of serialized nodes
> would greatly benefit from using a textual format.

Agreed.

> Then again, I also agree that this argument doesn't hold its weight
> when storage and output formats are going to be different. I trust
> that any new tooling introduced as a result of this thread will be
> better than what we have right now.

Separating formats seems like a lot of work to me, to get what might not be a huge improvement over compressing JSON, for what seems unlikely to be more than a few megabytes of parsed SQL.

> As for best format: I don't know. The current format is usable, and a
> better format would not store any data for default values. JSON can do
> that, but I could think of many formats that could do the same (Smile,
> BSON, xml, etc.).
>
> I do not think that protobuf is the best choice for storage, though,
> because it has its own rules on what it considers a default value and
> what it does or does not serialize: zero is considered the only
> default for numbers, as is the empty string for text, etc.
> I think it is allright for general use, but with e.g. `location: -1`
> in just about every parse node we'd probably want to select our own
> values to ignore during (de)serialization of fields.

Agreed. Thank you everyone who has contributed to this thread, I'm pleased that it got a very spirited debate and I apologize for the delay in getting back to everyone. I'd like to spike on a proposed patch that:

- Converts the existing text format to JSON (or possibly jsonb, considering feedback from this thread)
- Can be stored compressed
- Can be passed to a ddl_command_start event trigger with a function.

Thoughts?

-Michel
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
Hi, On Oct 27, 2022, 21:29 +0800, Tom Lane , wrote: > Zhang Mingli writes: > > How about combine ec->ec_sources and ec->derives as one list for less codes? > > Keeping them separate is required for the broken-EC code paths. > Even if it weren't, I wouldn't merge them just to save a couple > of lines of code --- I think it's useful to be able to tell which > clauses the EC started from. > > regards, tom lane Got it, thanks. Regards, Zhang Mingli
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
Zhang Mingli writes: > How about combine ec->ec_sources and ec->derives as one list for less codes? Keeping them separate is required for the broken-EC code paths. Even if it weren't, I wouldn't merge them just to save a couple of lines of code --- I think it's useful to be able to tell which clauses the EC started from. regards, tom lane
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
Nishant Sharma writes: > We would like to share a proposal of a patch, where we have added order by > clause in two select statements in src/test/regress/sql/insert.sql file and > respective changes in src/test/regress/expected/insert.out output file. > This would help in generating output in consistent sequence, as sometimes > we have observed change in sequence in output. Please be specific about the circumstances in which the output is unstable for you. With zero information to go on, it seems about as likely that this change is masking a bug as that it's a good idea. regards, tom lane
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
HI, On Oct 26, 2022, 06:09 +0800, Tom Lane , wrote: > While fooling with my longstanding outer-join variables changes > (I am making progress on that, honest), I happened to notice that > equivclass.c is leaving some money on the table by generating > redundant RestrictInfo clauses. It already attempts to not generate > the same clause twice, which can save a nontrivial amount of work > because we cache selectivity estimates and so on per-RestrictInfo. > I realized though that it will build and return a clause like > "a.x = b.y" even if it already has "b.y = a.x". This is just > wasteful. It's always been the case that equivclass.c will > produce clauses that are ordered according to its own whims. > Consumers that need the operands in a specific order, such as > index scans or hash joins, are required to commute the clause > to be the way they want it while building the finished plan. > Therefore, it shouldn't matter which order of the operands we > return, and giving back the commutator clause if available could > potentially save as much as half of the selectivity-estimation > work we do with these clauses subsequently. > > Hence, PFA a patch that adjusts create_join_clause() to notice > commuted as well as exact matches among the EquivalenceClass's > existing clauses. This results in a number of changes visible in > regression test cases, but they're all clearly inconsequential. > > The only thing that I think might be controversial here is that > I dropped the check for matching operator OID. To preserve that, > we'd have needed to use get_commutator() in the reverse-match cases, > which it seemed to me would be a completely unjustified expenditure > of cycles. The operators we select for freshly-generated clauses > will certainly always match those of previously-generated clauses. 
> Maybe there's a chance that they'd not match those of ec_sources
> clauses (that is, the user-written clauses we started from), but
> if they don't and that actually makes any difference then surely
> we are talking about a buggy opclass definition.
>
> I've not bothered to make any performance tests to see if there's
> actually an easily measurable gain here. Saving some duplicative
> selectivity estimates could be down in the noise ... but it's
> surely worth the tiny number of extra tests added here.
>
> Comments?
>
> regards, tom lane
>

Makes sense. How about combining ec->ec_sources and ec->ec_derives into one list, for less code?

```
foreach(lc, list_union(ec->ec_sources, ec->ec_derives))
{
    rinfo = (RestrictInfo *) lfirst(lc);
    if (rinfo->left_em == leftem &&
        rinfo->right_em == rightem &&
        rinfo->parent_ec == parent_ec)
        return rinfo;
    if (rinfo->left_em == rightem &&
        rinfo->right_em == leftem &&
        rinfo->parent_ec == parent_ec)
        return rinfo;
}
```

I have a try, it will change some in join.out and avoid changes in tidscan.out.

Regards, Zhang Mingli
Re: [PATCHES] Post-special page storage TDE support
Hi

On Mon, 24 Oct 2022, 19:56 David Christensen, < david.christen...@crunchydata.com> wrote:
> > Discussion is welcome and encouraged!

Did you read the related thread with related discussion from last June, "Re: better page-level checksums" [0]? In that I argued that space at the end of a page is already allocated for the AM, and that reserving variable space at the end of the page for non-AM usage is wasting the AM's performance potential.

Apart from that: Is this variable-sized 'metadata' associated with smgr infrastructure only, or is it also available for AM features? If not, then this is a strong -1. The amount of tasks smgr needs to do on a page is generally much less than the amount of tasks an AM needs to do; so in my view the AM has priority in prime page real estate, not smgr or related infrastructure.

re: PageFeatures
I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part of the storage manager (smgr) implementation / can't this be part of the smgr of the relation?

re: use of pd_checksum
I mentioned this in the above-mentioned thread too, in [1], that we could use pd_checksum as an extra area marker for this storage-specific data, which would be located between pd_upper and pd_special.

Re: patch contents

0001:

> + specialSize = MAXALIGN(specialSize) + reserved_page_size;

This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or an assertion that reserved_page_size is MAXALIGNED, would be better.

> PageValidateSpecialPointer(Page page)
> {
> Assert(page);
> - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> + Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);

This check is incorrect. With your code it would allow pd_special past the end of the block.
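The boundary arithmetic behind this review point is easy to check numerically. A hypothetical sketch with BLCKSZ = 8192 (function names are ours, purely for illustration) shows that the patch's check admits a pd_special that overlaps the reserved tail of the page, while the corrected form rejects it:

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192

/* the patch's check: pd_special - reserved_page_size <= BLCKSZ */
static int
patch_check(uint32_t pd_special, uint32_t reserved_page_size)
{
	return pd_special - reserved_page_size <= BLCKSZ;
}

/* the corrected check: pd_special + reserved_page_size <= BLCKSZ */
static int
fixed_check(uint32_t pd_special, uint32_t reserved_page_size)
{
	return pd_special + reserved_page_size <= BLCKSZ;
}
```

With 64 reserved bytes, a special area starting at offset 8192 would clobber the reserved region, yet the subtraction form accepts it (8192 - 64 <= 8192); the addition form correctly caps pd_special at BLCKSZ - reserved_page_size.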
If you want to put the reserved_space_size effectively inside the special area, this check should instead be:

+Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));

Or, equally valid:

+Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

> + * +-+-++-+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +---++-+

Could you fix the table display if / when you revise the patchset? It seems to me that the corners don't line up with the column borders.

0002:

> Make the output of "select_views" test stable
> Changing the reserved_page_size has resulted in non-stable results for this test.

This makes sense, but what kind of instability are we talking about? Are there different results for runs with the same binary, or is this across compilations?

0003 and up were not yet reviewed in depth.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/CA%2BTgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA%40mail.gmail.com
[1] https://postgr.es/m/CAEze2Wi5wYinU7nYxyKe_C0DRc6uWYa8ivn5%3Dzg63nKtHBnn8A%40mail.gmail.com
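To make the bound check concrete, here is a minimal stand-alone sketch of the two equivalent assertion forms suggested above. MockPageHeader and special_pointer_ok are illustrative stand-ins, not PostgreSQL source; only the pd_special bound is modeled. (Note why the reviewed form was wrong: `pd_special - reserved_page_size <= BLCKSZ` lets pd_special grow as large as BLCKSZ + reserved_page_size, i.e. past the end of the block.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Simplified stand-in for PostgreSQL's PageHeaderData. */
typedef struct MockPageHeader
{
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;    /* offset to the start of the special space */
} MockPageHeader;

/*
 * With reserved_page_size carved out of the end of the page, pd_special
 * may not extend past BLCKSZ - reserved_page_size.  The two forms below
 * match the two equivalent checks proposed in the review.
 */
static bool
special_pointer_ok(const MockPageHeader *ph, uint16_t reserved_page_size)
{
    bool form_a = ph->pd_special <= (BLCKSZ - reserved_page_size);
    /* cast avoids any overflow concern before comparing */
    bool form_b = (uint32_t) ph->pd_special + reserved_page_size <= BLCKSZ;

    assert(form_a == form_b);   /* the two forms always agree */
    return form_a;
}
```
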
[PROPOSAL] : Use of ORDER BY clause in insert.sql
Hi,

We would like to share a proposal for a patch that adds an ORDER BY clause to two SELECT statements in src/test/regress/sql/insert.sql, with the corresponding changes in the src/test/regress/expected/insert.out output file. This helps generate output in a consistent order, as we have sometimes observed the row order changing. Please find the patch attached.

Regards,
Nishant Sharma
EDB: http://www.enterprisedb.com

Proposal_OrderBy_insert.sql.out.patch
Description: Binary data
Re: pg_recvlogical prints bogus error when interrupted
On Mon, Oct 24, 2022 at 8:15 AM Bharath Rupireddy wrote: > > On Fri, Oct 21, 2022 at 7:52 AM Kyotaro Horiguchi > wrote: > > > > > +1. How about emitting a message like its friend pg_receivewal, like > > > the attached patch? > > > > I'm not a fan of treating SIGINT as an error in this case. It calls > > prepareToTerminate() when time_to_abort and everything goes fine after > > then. So I think we should do the same thing after receiving an > > interrupt. This also does file-sync naturally as a part of normal > > shutdown. I'm also not a fan of doing fsync at error. > > I think the pg_recvlogical can gracefully exit on both SIGINT and > SIGTERM to keep things simple. > > > > > I also then noticed that we don't fsync the output file in cases of > > > > errors - > > > > that seems wrong to me? Looks to me like that block should be moved > > > > till after > > > > the error:? > > > > > > How about something like the attached patch? > > The attached patch (pg_recvlogical_graceful_interrupt.text) has a > couple of problems, I believe. We're losing prepareToTerminate() with > keepalive true and we're not skipping pg_log_error("unexpected > termination of replication stream: %s" upon interrupt, after all we're > here discussing how to avoid it. > > I came up with the attached v2 patch, please have a look. FWIW, I added it to CF - https://commitfest.postgresql.org/40/3966/. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
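The graceful-shutdown behavior both versions of the patch circle around is the standard flag-based signal-handler pattern: the handler only sets a flag, and the main receive loop notices it and performs a normal shutdown (including syncing the output file) instead of reporting an "unexpected termination" error. A minimal stand-alone sketch follows; the name time_to_abort mirrors pg_recvlogical's variable, but the code is illustrative, not the actual patch:

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Set by the handler; polled by the main loop. */
static volatile sig_atomic_t time_to_abort = false;

static void
sigexit_handler(int signum)
{
    (void) signum;          /* async-signal-safe: only set the flag */
    time_to_abort = true;
}

/* Treat SIGINT and SIGTERM identically, as proposed in the thread. */
static void
install_graceful_handlers(void)
{
    signal(SIGINT, sigexit_handler);
    signal(SIGTERM, sigexit_handler);
}
```

The main loop would then check `time_to_abort` on each iteration and fall through to its normal end-of-stream cleanup path.
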
Code checks for App Devs, using new options for transaction behavior
In the past, developers have wondered how we can provide "--dry-run" functionality: https://www.postgresql.org/message-id/15791.1450383201%40sss.pgh.pa.us

This is important for application developers, especially when migrating programs to Postgres. Presented here are 3 features aimed at developers, each of which is being actively used by me in a large and complex migration project.

* psql --parse-only

Checks the syntax of all SQL in a script, but without actually executing it. This is very important in the early stages of complex migrations because we need to see whether the code would generate syntax errors before we attempt to execute it. When there are many dependencies between objects, actual execution fails very quickly if we run in a single transaction, yet running outside of a transaction can leave a difficult cleanup task. Fixing errors iteratively is difficult when there are long chains of dependencies between objects, since there is no easy way to predict how long it will take to make everything work unless you know how many syntax errors exist in the script.

001_psql_parse_only.v1.patch

* nested transactions = off (default) | all | outer

Handles nested BEGIN/COMMIT, which can cause chaos on failure. This is an important part of guaranteeing that everything that gets executed is part of a single atomic transaction, which can then be rolled back - this is a prerequisite for the last feature.

002_nested_xacts.v7.patch

The default behavior is unchanged (off). Setting "all" treats nested BEGIN/COMMIT as subtransactions, allowing some parts to fail without rolling back the outer transaction. Setting "outer" flattens nested BEGIN/COMMIT into one single outer transaction, so that any failure rolls back the entire transaction.

* rollback_on_commit = off (default) | on

Forces transactions to fail their final commit, ensuring that no lasting change is made when a script is tested, i.e. COMMIT is accepted, but a rollback is performed instead.
003_rollback_on_commit.v1.patch

We will probably want to review these on separate threads, but the common purpose of these features is hopefully clear from these notes. 001 and 003 are fairly small patches; 002 is longer.

Comments please

--
Simon Riggs
http://www.EnterpriseDB.com/

001_psql_parse_only.v1.patch
Description: Binary data

002_nested_xacts.v7.patch
Description: Binary data

003_rollback_on_commit.v1.patch
Description: Binary data
Avoid using list_delete_first in simplify_or/and_arguments
Hi hackers,

While trying to measure whether there is any gain from the change discussed in [1], I happened to notice another place that is slowed down by list_delete_first. I'm using the query below:

(n=100; printf "explain (summary on) select * from t where "
for ((i=1;i<$n;i++)); do printf "a = $i or "; done;
printf "a = $n;" ) | psql

And I noticed that a large part of the planning time is spent on the list_delete_first calls inside simplify_or_arguments(). I think the issue here is clear and straightforward: list_delete_first has an O(N) cost due to data movement, and I believe a similar issue has been discussed several times before. I wondered if we could improve it by using list_delete_last instead, so I tried the following change:

--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -3612,9 +3612,9 @@ simplify_or_arguments(List *args,
 	unprocessed_args = list_copy(args);
 	while (unprocessed_args)
 	{
-		Node *arg = (Node *) linitial(unprocessed_args);
+		Node *arg = (Node *) llast(unprocessed_args);

-		unprocessed_args = list_delete_first(unprocessed_args);
+		unprocessed_args = list_delete_last(unprocessed_args);

With this change, on my box the planning time for the query above is reduced from 64257.784 ms to 1411.666 ms, a big improvement. The side effect is that it causes a lot of plan diffs in the regression tests, but they are all just a different order of OR arguments.

I believe simplify_and_arguments() can also benefit from similar changes. But I'm not sure whether such long AND/OR argument lists occur in the real world. So is this worth doing?

[1] https://www.postgresql.org/message-id/CAMbWs4-RXhgz0i4O1z62gt%2BbTLTM5vXYyYhgnius0j_txLH7hg%40mail.gmail.com

Thanks
Richard
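The cost asymmetry described above can be shown with a toy sketch. Since PostgreSQL's List became array-backed, list_delete_first() must shift the remaining N-1 cells down (O(N) per call, so O(N^2) to drain a list from the front), while list_delete_last() merely shrinks the length (O(1)). The IntList type below is an illustrative stand-in, not the real pg_list.h API:

```c
#include <assert.h>
#include <string.h>

/* Tiny fixed-capacity, array-backed list, mimicking the relevant shape. */
typedef struct IntList
{
    int items[64];
    int length;
} IntList;

/* O(N): every remaining element must be moved down one slot. */
static void
int_list_delete_first(IntList *list)
{
    memmove(&list->items[0], &list->items[1],
            (size_t) (list->length - 1) * sizeof(int));
    list->length--;
}

/* O(1): no data movement at all, just shrink the length. */
static void
int_list_delete_last(IntList *list)
{
    list->length--;
}
```

Processing the arguments from the tail, as the proposed diff does, visits the same set of elements in a different order, which is why the only regression fallout is reordered OR arms in the plans.
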
Allow single table VACUUM in transaction block
It is a common user annoyance to have a script fail because someone added a VACUUM, especially when using the --single-transaction option. Fix, so that this works without issue:

BEGIN;
VACUUM (ANALYZE) vactst;
COMMIT;

Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL. When in an xact block, we do not set PROC_IN_VACUUM, nor update datfrozenxid. Tests, docs.

--
Simon Riggs
http://www.EnterpriseDB.com/

single_table_vacuum.v1.patch
Description: Binary data
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Oct 4, 2022 at 2:58 PM Bharath Rupireddy wrote: > > On Thu, Sep 29, 2022 at 7:43 PM Bharath Rupireddy > wrote: > > > > Please see the attached v1 patch. > > FWIW, I'm attaching Nathan's patch that introduced the new function > pg_walfile_offset_lsn as 0002 in the v1 patch set. Here's the CF entry > - https://commitfest.postgresql.org/40/3909/. > > Please consider this for further review. I'm attaching the v2 patch set after rebasing on to the latest HEAD. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 87da2d0ae8fb0f92d2d88d8638aee1ca58332522 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Thu, 27 Oct 2022 09:06:27 + Subject: [PATCH v2] Add LSN along with offset to error messages reported for WAL file read/write/validate header failures --- src/backend/access/transam/xlog.c | 8 ++-- src/backend/access/transam/xlogreader.c | 16 +++- src/backend/access/transam/xlogrecovery.c | 13 - src/backend/access/transam/xlogutils.c| 10 ++ src/include/access/xlogreader.h | 1 + 5 files changed, 32 insertions(+), 16 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f10effe3a..1467001b4f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2214,6 +2214,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) { char xlogfname[MAXFNAMELEN]; int save_errno; + XLogRecPtr lsn; if (errno == EINTR) continue; @@ -,11 +2223,14 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) XLogFileName(xlogfname, tli, openLogSegNo, wal_segment_size); errno = save_errno; + XLogSegNoOffsetToRecPtr(openLogSegNo, startoffset, + wal_segment_size, lsn); ereport(PANIC, (errcode_for_file_access(), errmsg("could not write to log file %s " - "at offset %u, length %zu: %m", - xlogfname, startoffset, nleft))); + "at offset %u, LSN %X/%X, length %zu: %m", + xlogfname, startoffset, + LSN_FORMAT_ARGS(lsn), 
nleft))); } nleft -= written; from += written; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 93f667b254..a321c55d8f 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -1226,9 +1226,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "invalid magic number %04X in WAL segment %s, offset %u", + "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_magic, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1240,9 +1241,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "invalid info bits %04X in WAL segment %s, offset %u", + "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_info, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1281,9 +1283,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, /* hmm, first page of file doesn't have a long header? 
*/ report_invalid_record(state, - "invalid info bits %04X in WAL segment %s, offset %u", + "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_info, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1300,9 +1303,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "unexpected pageaddr %X/%X in WAL segment %s, offset %u", + "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u", LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1325,10 +1329,11 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u", + "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_tli, state->latestPageTLI, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1553,6 +1558,7 @@ WALRead(XLogReaderState *state, errinfo->wre_req = segbyt
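The two pieces of machinery this patch leans on are small enough to sketch in full. The real versions are macros in PostgreSQL's xlog_internal.h and xlogdefs.h: XLogSegNoOffsetToRecPtr() turns a segment number plus an in-segment byte offset into a 64-bit LSN, and LSN_FORMAT_ARGS() splits an LSN into the high/low 32-bit halves printed in the familiar "%X/%X" form. The functions below are simplified re-implementations for illustration, not the actual macros:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

/* Mirrors XLogSegNoOffsetToRecPtr: an LSN is a byte position in the WAL. */
static XLogRecPtr
segno_offset_to_lsn(uint64_t segno, uint32_t offset, uint64_t segment_size)
{
    return segno * segment_size + offset;
}

/* Mirrors LSN_FORMAT_ARGS: high 32 bits, then low 32 bits, as "%X/%X". */
static void
format_lsn(XLogRecPtr lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%X/%X",
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}
```

For example, offset 0x2000 within segment 1 of a default 16 MB WAL segment corresponds to LSN 0/1002000, which is exactly the extra context the improved error messages report alongside the file name and offset.
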
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Oct 27, 2022 at 11:24 AM Michael Paquier wrote: > > On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote: > > Looks reasonable to me. > > 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so > applied. Thanks. > Regarding 0002, using pg_pwrite_zeros() as a routine name, as > suggested by Thomas, sounds good to me. Changed. > However, I am not really a > fan of its dependency with PGAlignedXLogBlock, because it should be > able to work with any buffers of any sizes, as long as the input > buffer is aligned, shouldn't it? For example, what about > PGAlignedBlock? So, should we make this more extensible? My guess > would be the addition of the block size and the block pointer to the > arguments of pg_pwrite_zeros(), in combination with a check to make > sure that the input buffer is MAXALIGN()'d (with an Assert() rather > than just an elog/pg_log_error?). +1 to pass in the aligned buffer, its size and an assertion on the buffer size. Please see the attached v7 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 1d81929de5f5b56d39fc7f1273cdb4c1e36337bb Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Thu, 27 Oct 2022 08:51:07 + Subject: [PATCH v7] Use pg_pwritev_with_retry() instead of write() in walmethods.c Use pg_pwritev_with_retry() while prepadding a WAL segment instead of write() in walmethods.c dir_open_for_write() to avoid partial writes. As the pg_pwritev_with_retry() function uses pwritev, we can avoid explicit lseek() on non-Windows platforms to seek to the beginning of the WAL segment. It looks like on Windows, we need an explicit lseek() call here despite using pwrite() implementation from win32pwrite.c. Otherwise an error occurs. These changes are inline with how core postgres initializes the WAL segment in XLogFileInitInternal(). 
Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Michael Paquier Reviewed-by: Thomas Munro Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com --- src/backend/access/transam/xlog.c | 35 +++--- src/bin/pg_basebackup/walmethods.c | 28 ++- src/common/file_utils.c| 77 ++ src/include/common/file_utils.h| 5 ++ 4 files changed, 105 insertions(+), 40 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f10effe3a..638d6d448b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, bool *added, char *path) { char tmppath[MAXPGPATH]; - PGAlignedXLogBlock zbuffer; XLogSegNo installed_segno; XLogSegNo max_segno; int fd; @@ -2965,14 +2964,12 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); - memset(zbuffer.data, 0, XLOG_BLCKSZ); - pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); save_errno = 0; if (wal_init_zero) { - struct iovec iov[PG_IOV_MAX]; - int blocks; + PGAlignedXLogBlock zbuffer; + ssize_t rc; /* * Zero-fill the file. With this setting, we do this the hard way to @@ -2983,29 +2980,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ + rc = pg_pwrite_zeros(fd, wal_segment_size, zbuffer.data, + sizeof(zbuffer.data)); - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (int i = 0; i < lengthof(iov); ++i) - { - iov[i].iov_base = zbuffer.data; - iov[i].iov_len = XLOG_BLCKSZ; - } - - /* Loop, writing as many blocks as we can for each system call. 
*/ - blocks = wal_segment_size / XLOG_BLCKSZ; - for (int i = 0; i < blocks;) - { - int iovcnt = Min(blocks - i, lengthof(iov)); - off_t offset = i * XLOG_BLCKSZ; - - if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0) - { -save_errno = errno; -break; - } - - i += iovcnt; - } + if (rc < 0) + save_errno = errno; } else { @@ -3014,7 +2993,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * enough. */ errno = 0; - if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1) + if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1) { /* if write didn't set errno, assume no disk space */ save_errno = errno ? errno : ENOSPC; diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index bc2e83d02b..f7af73c01b 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -220,22 +220,26 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, /* Do pre-padding on non-compressed files */ if (pad_to_size
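To make the retry semantics under discussion concrete, here is a sketch of what a pg_pwrite_zeros()-style helper has to do: repeatedly pwrite() a caller-supplied zeroed buffer until the requested number of bytes is written, retrying on EINTR and continuing after partial writes. This is illustrative only; the real helper proposed for src/common/file_utils.c is built on pg_pwritev_with_retry() and writes several iovecs per system call:

```c
#define _GNU_SOURCE             /* expose pwrite/mkstemp under strict modes */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Zero-fill [0, size) of fd using the caller's zeroed, reusable buffer. */
static int
pwrite_zeros(int fd, size_t size, const char *zbuf, size_t zbuf_len)
{
    size_t offset = 0;

    while (offset < size)
    {
        size_t  chunk = (size - offset < zbuf_len) ? size - offset : zbuf_len;
        ssize_t rc = pwrite(fd, zbuf, chunk, (off_t) offset);

        if (rc < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;          /* caller inspects errno */
        }
        offset += (size_t) rc;  /* partial write: resume where it stopped */
    }
    return 0;
}
```

Because pwrite() takes an explicit offset, no lseek() back to the start of the segment is needed, which matches the rationale given in the commit message for non-Windows platforms.
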
Re: Documentation for building with meson
+# Run the main pg_regress and isolation tests
+meson test --suite main

This does not work for me in a fresh install until first running:

meson test --suite setup

In fact, https://wiki.postgresql.org/wiki/Meson shows:

meson test --suite setup --suite main

That was just an eyeball check from a naive user -- it would be good to try running everything documented here.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: GUC values - recommended way to declare the C variables?
On Thu, Oct 27, 2022 at 1:33 PM Michael Paquier wrote:
>
> ...
>
> Anyway, per my previous comments in my last message of this thread as
> of https://www.postgresql.org/message-id/y1nnwftrnl3it...@paquier.xyz,
> I don't see a need for DYNAMIC_DEFAULT from the other thread, nor do I
> see a need for a style like that:
> +/* GUC variable */
> +bool update_process_title =
> +#ifdef WIN32
> + false;
> +#else
> + true;
> +#endif
>
> I think that it would be cleaner to use the same approach as
> checking_after_flush and similar GUCs with a centralized definition,
> rather than spreading such style in two places for each GUC that this
> patch touches (aka its declaration and its default value in
> guc_tables.c). In any case, the patch of this thread still needs some
> adjustments IMO.

PSA patch v6. The GUC defaults in guc_tables.c and the modified GUC C variable declarations now share the same common #define'd value (instead of cut-and-paste preprocessor code), per Michael's suggestion [1] to use centralized definitions.

--
[1] https://www.postgresql.org/message-id/Y1nuDNZDncx7%2BA1j%40paquier.xyz

Kind Regards,
Peter Smith.
Fujitsu Australia

v6-0001-GUC-C-variable-sanity-check.patch
Description: Binary data
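The shape of the centralized-definition approach is simple to sketch: define the platform-dependent default exactly once, then use the same macro both as the C variable's initializer and as the boot value in its guc_tables.c entry, instead of duplicating the #ifdef in two files. The snippet below illustrates the pattern only; the macro name and layout are assumptions, not copied from the v6 patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Single source of truth for the platform-dependent default. */
#ifdef WIN32
#define DEFAULT_UPDATE_PROCESS_TITLE false
#else
#define DEFAULT_UPDATE_PROCESS_TITLE true
#endif

/* In the variable's home file (e.g. ps_status.c): */
bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;

/* In guc_tables.c, the same macro would fill the boot_val slot;
 * modeled here as a plain constant for illustration: */
static const bool update_process_title_boot_val = DEFAULT_UPDATE_PROCESS_TITLE;
```

With both sites expanding the same macro, the declared value and the GUC table's boot value can never drift apart, which is exactly the sanity property the patch's check enforces.
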
Re: Simplifying our Trap/Assert infrastructure
On Wed, Oct 12, 2022 at 09:19:17PM +0200, Peter Eisentraut wrote: > I'm in favor of this. These variants are a distraction. Agreed, even if extensions could use these, it looks like any out-of-core code using what's removed here would also gain in clarity. This is logically fine (except for an indentation blip in miscadmin.h?), so I have marked this entry as ready for committer. Side note, rather unrelated to what's proposed here: would it be worth extending AssertPointerAlignment() for the frontend code? -- Michael signature.asc Description: PGP signature
Re: [patch] Have psql's \d+ indicate foreign partitions
On 2022-Oct-24, Justin Pryzby wrote:

> On Mon, Oct 24, 2022 at 09:44:18PM +0900, Ian Lawrence Barwick wrote:
> > + else if (child_relkind == RELKIND_FOREIGN_TABLE && is_partitioned)
> > + appendPQExpBuffer(&buf, ", server: \"%s\"", PQgetvalue(result, i, 4));

> To avoid the clutter that you mentioned, I suggest that this should show
> that the table *is* foreign, but without the server - if you want to
> know the server (or its options), you can run another \d command for
> that (or run a SQL query).

But 'server "%s"' is not much longer than "foreign", and it's not like you're saving any vertical space at all (you're just using space that would otherwise be empty), so I'm not sure it is better. I would vote for showing the server.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"You don't solve a bad join with SELECT DISTINCT" #CupsOfFail
https://twitter.com/connor_mc_d/status/1431240081726115845