Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Mar 12, 2021 at 8:38 PM vignesh C wrote:
> ...
> 1) I felt twophase_given can be a local variable; it need not be added
> as a function parameter, as it is not used outside the function.
>
> --- a/src/backend/commands/subscriptioncmds.c
> +++ b/src/backend/commands/subscriptioncmds.c
> @@ -67,7 +67,8 @@ parse_subscription_options(List *options,
>                             char **synchronous_commit,
>                             bool *refresh,
>                             bool *binary_given, bool *binary,
> -                           bool *streaming_given, bool *streaming)
> +                           bool *streaming_given, bool *streaming,
> +                           bool *twophase_given, bool *twophase)
>
> The corresponding changes should be done here too:
>
> @@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
>      bool        copy_data;
>      bool        streaming;
>      bool        streaming_given;
> +    bool        twophase;
> +    bool        twophase_given;
>      char       *synchronous_commit;
>      char       *conninfo;
>      char       *slotname;
>
> @@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
>                                  &synchronous_commit,
>                                  NULL, /* no "refresh" */
>                                  &binary_given, &binary,
> -                                &streaming_given, &streaming);
> +                                &streaming_given, &streaming,
> +                                &twophase_given, &twophase);

It was deliberately coded this way for consistency with the other new
PG14 options - e.g. it mimics exactly binary_given and streaming_given.
I know the param is not currently used by the caller and so could be a
local (as you say), but I felt the code-consistency and future-proofing
benefits outweighed the idea of reducing the code to the bare minimum
required to work just "because we can". So I don't plan to change this,
but if you still feel strongly that the parameter must be removed,
please give a convincing reason.

Kind Regards,
Peter Smith.
Fujitsu Australia.
Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL
At Wed, 17 Mar 2021 15:31:37 +0900 (JST), Kyotaro Horiguchi wrote in
> WAIT_EVENT_WAL_RECEIVER_MAIN (WalReceiverMain) is waiting for new data
> to arrive. This looks like an Activity to me.
>
> WAIT_EVENT_WAL_RECEIVER_WAIT_START is waiting for the startup process
> to kick me. So it may be either IPC or Activity. Since walreceiver
> hasn't sent anything to startup, it's Activity rather than IPC.
> However, the behavior can be said to convey a piece of information
> from startup to walreceiver, so it can also be said to be an IPC.
> (That is the reason why I don't object to IPC.)
>
> 1 (WAIT_EVENT_WAL_SENDER_MAIN, currently an Activity) is waiting for
> something to happen on the connection to the peer receiver/worker.
> This might be either an Activity or a wait_client, but I prefer it to
> be wait_client, as the same behavior of a client backend is
> categorized as wait_client.
>
> 2 (WAIT_EVENT_WAL_SENDER_WRITE_DATA, currently a wait_client) is the
> same as 1.
>
> 3 (WAIT_EVENT_WAL_SENDER_WAIT_WAL, currently a wait_client) is the
> same as 1.

- As a result I'd prefer to categorize all of them to Activity.

Yeah, I don't understand what I meant :(

+ As a result I'd prefer to categorize the first two to Activity, and
+ the last three to wait_client.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL
At Tue, 16 Mar 2021 15:42:27 +0900, Fujii Masao wrote in
> On 2021/03/16 11:59, Kyotaro Horiguchi wrote:
> > At Tue, 16 Mar 2021 03:12:54 +0900, Fujii Masao wrote in
> >> The wait event WalReceiverWaitStart has been categorized in the type
> >> Client. But why? Walreceiver is waiting for the startup process to
> >> set the lsn and timeline while it is reporting WalReceiverWaitStart.
> >> So its type should be IPC, instead?
> >>
> >> The wait event WalSenderWaitForWAL has also been categorized in the
> >> type Client. While this wait event is being reported, the logical
> >> replication walsender is waiting for not only new WAL to be flushed
> >> but also the socket to be readable and writeable (if there is pending
> >> data). I guess that this is why its type is Client. But ISTM
> >> walsender is *mainly* waiting for new WAL to be flushed by other
> >> processes during that period, so I think that it's better to use IPC
> >> as the type of the wait event WalSenderWaitForWAL. Thoughts?
> >
> > I agree that it's definitely not a client wait. It would be either
> > Activity or IPC. My reasoning for the latter is that it's similar to
> > WAIT_EVENT_WAL_RECEIVER_MAIN, since both are waits for
> > WalReceiverMain to continue, with the difference that in the latter
> > state walreceiver hears where to start.
> > I don't object if it were categorized to IPC, though.
>
> Ok. And on my further thought: there are three calls to WalSndWait()
> in walsender.c, as follows.
>
> 1. WalSndLoop() calls WalSndWait() with the wait event
>    "Activity: WalSenderMain". Both physical and logical replication
>    walsenders use this function.
> 2. WalSndWriteData() calls WalSndWait() with the wait event
>    "Client: WalSenderWriteData". Only logical replication walsender
>    uses this function.
> 3. WalSndWaitForWal() calls WalSndWait() with the wait event
>    "Client: WalSenderWaitForWAL". Only logical replication walsender
>    uses this function.
>
> These three WalSndWait() calls basically do the same thing, i.e., wait
> for the latch to be set, timeout, postmaster death, or the socket to
> become readable or writeable. So you may think that it's strange to
> categorize them differently. Maybe it's better to categorize all of
> them in Activity?

I think it'd be better if they are categorized by what they are waiting
for. Activity is waiting for something gating me to be released. IPC is
waiting for the response to a request previously sent to another
process. Wait-client is waiting for the peer over a network connection
to allow me to proceed. So whether the three fall into the same
category or not doesn't matter to me.

WAIT_EVENT_WAL_RECEIVER_MAIN (WalReceiverMain) is waiting for new data
to arrive. This looks like an Activity to me.

WAIT_EVENT_WAL_RECEIVER_WAIT_START is waiting for the startup process
to kick me. So it may be either IPC or Activity. Since walreceiver
hasn't sent anything to startup, it's Activity rather than IPC.
However, the behavior can be said to convey a piece of information from
startup to walreceiver, so it can also be said to be an IPC. (That is
the reason why I don't object to IPC.)

1 (WAIT_EVENT_WAL_SENDER_MAIN, currently an Activity) is waiting for
something to happen on the connection to the peer receiver/worker. This
might be either an Activity or a wait_client, but I prefer it to be
wait_client, as the same behavior of a client backend is categorized as
wait_client.

2 (WAIT_EVENT_WAL_SENDER_WRITE_DATA, currently a wait_client) is the
same as 1.

3 (WAIT_EVENT_WAL_SENDER_WAIT_WAL, currently a wait_client) is the same
as 1.

As a result I'd prefer to categorize all of them to Activity.

> Or it's better to categorize only WalSenderMain in Activity, and the
> others in IPC, because only WalSenderMain is reported in walsender's
> main loop.

I don't think 1, 2 and 3 are Activities. And Activity doesn't
necessarily mean the main loop, as I described.

And as I said, WAIT_EVENT_WAL_RECEIVER_WAIT_START seems in between IPC
and Activity, so I don't object to categorizing it to IPC.

> At least for me the latter is better because the former, i.e.,
> reporting three different events for walsender's activity in the main
> loop, seems a bit strange. Thoughts?

Maybe it's the difference in what one considers as the same event.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Wed, Mar 17, 2021 at 05:30:30AM +0100, Pavel Stehule wrote:
> On 17. 3. 2021 at 4:52, Michael Paquier wrote:
> > I am wondering whether it would be better to allow multiple aliases
> > though, and if it would bring more readability to the routines
> > written if these are treated equal to the top-most namespace which is
> > the routine name now, meaning that we would maintain not one, but N
> > top namespace labels that could be used as aliases of the root one.
>
> I do not have a strong opinion, but I am inclined to disallow this. I
> am afraid the code could become less readable.
>
> There is a precedent - SQL doesn't allow you to use table names as
> qualifiers when you have an alias.

+1

> But it is a very important question. The selected behavior strongly
> impacts the implementation.
>
> What is the more common opinion about it? 1. Should we allow the
> original top label or not? 2. Should we allow defining more top labels?

I also think that there should be a single usable top label; otherwise
it will lead to confusing code and can be a source of bugs.
Re: Assertion failure with barriers in parallel hash join
On Wed, Mar 17, 2021 at 6:17 PM Thomas Munro wrote:
> On Sat, Mar 6, 2021 at 9:56 AM Thomas Munro wrote:
> > While working on Melanie's Parallel Hash Full Join patch I remembered
> > that this (apparently extremely rare) race still needs fixing. Here
> > is a slightly tidied version, which I'm adding to the next CF for CI
> > coverage.
>
> Pushed and back-patched.

According to BF animal elver there is something wrong with this commit.
Looking into it.
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Mar 16, 2021 at 5:03 PM Amit Kapila wrote:
> On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote:
> >
> > Here's a new patch-set that implements this new solution proposed by
> > Amit. Patchset-v60 implements:
> >
>
> I have reviewed the latest patch and below are my comments; some of
> these might overlap with Vignesh's, as I haven't looked at his
> comments in detail.
> Review comments

Few more comments:
=
1.
+  subtwophase char
+
+  The two_phase commit current state:
+
+  'n' = two_phase mode was not requested, so is disabled.
+  'p' = two_phase mode was requested, but is pending enablement.
+  'y' = two_phase mode was requested, and is enabled.
+

Can we name the column subtwophasestate? And then describe it as we are
doing for srsubstate in pg_subscription_rel. Also, it might be better
to keep the names as: 'd' disabled, 'p' pending twophase enablement,
and 'e' twophase enabled.

  srsubstate char
  State code: i = initialize, d = data is being copied, f = finished
  table copy, s = synchronized, r = ready (normal replication)

2.
@@ -427,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		PQserverVersion(conn->streamConn) >= 14)
 		appendStringInfoString(&cmd, ", streaming 'on'");
 
+	if (options->proto.logical.twophase &&
+		PQserverVersion(conn->streamConn) >= 14)
+		appendStringInfoString(&cmd, ", two_phase 'on'");
+
 	pubnames = options->proto.logical.publication_names;
 	pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 	if (!pubnames_str)
@@ -453,6 +458,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+	if (options->logical && two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+

Why are we sending two_phase 'on' and " TWO_PHASE" separately? I think
we don't need to introduce a TWO_PHASE token in the grammar; let's
handle it via plugin_options, similar to what we do for 'streaming'.
Also, a similar change would be required for Create_Replication_Slot.

3.
+	/*
+	 * Do not allow toggling of two_phase option, this could
+	 * cause missing of transactions and lead to an inconsistent
+	 * replica.
+	 */
+	if (!twophase)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("cannot alter two_phase option")));
+

I think here you can either give a reference to worker.c to explain how
this could lead to an inconsistent replica, or expand the comments here
if the information is not present elsewhere.

4.
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2

I think it is better to name the new define
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM. Also mention in the comments, in
some way, that we are keeping the same version number for the stream
and two-phase defines because they got introduced in the same
release (14).

5. I have modified the comments atop worker.c to explain the design and
some of the problems clearly. See attached. If you are fine with this,
please include it in the next version of the patch.

-- 
With Regards,
Amit Kapila.

change_two_phase_desc_1.patch
Description: Binary data
Re: Getting better results from valgrind leak tracking
Hi,

On 2021-03-16 20:50:17 -0700, Andres Freund wrote:
> What I meant was that I didn't understand how there's not a leak
> danger when compilation fails halfway through, given that the context
> in question is below TopMemoryContext and that I didn't see a relevant
> TRY block. But probably there is something cleaning it up that I
> didn't see.

Looks like it's an actual leak:

postgres[2058957][1]=# DO $do$BEGIN CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $$BEGIN frakbar;END;$$;^C

postgres[2058957][1]=# SELECT count(*), SUM(total_bytes) FROM pg_backend_memory_contexts WHERE name = 'PL/pgSQL function';
┌───────┬────────┐
│ count │  sum   │
├───────┼────────┤
│     0 │ (null) │
└───────┴────────┘
(1 row)

Time: 1.666 ms

postgres[2058957][1]=# CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $$BEGIN frakbar;END;$$;
ERROR:  42601: syntax error at or near "frakbar"
LINE 1: ...ON foo() RETURNS VOID LANGUAGE plpgsql AS $$BEGIN frakbar;EN...
LOCATION:  scanner_yyerror, scan.l:1176
Time: 5.463 ms

(the same CREATE OR REPLACE FUNCTION was run twice more, failing with
the same syntax error each time)

postgres[2058957][1]=# SELECT count(*), SUM(total_bytes) FROM pg_backend_memory_contexts WHERE name = 'PL/pgSQL function';
┌───────┬───────┐
│ count │  sum  │
├───────┼───────┤
│     3 │ 24576 │
└───────┴───────┘
(1 row)

Something like

DO $do$
BEGIN
  FOR i IN 1 .. 10000 LOOP
    BEGIN
      EXECUTE $cf$
        CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql
        AS $f$BEGIN frakbar; END;$f$;
      $cf$;
    EXCEPTION WHEN others THEN
    END;
  END LOOP;
END;$do$;

will show the leak visible in top too (albeit slowly - a more
complicated statement will leak more quickly I think).

postgres[2059268][1]=# SELECT count(*), SUM(total_bytes) FROM pg_backend_memory_contexts WHERE name = 'PL/pgSQL function';
┌───────┬──────┐
│ count │ sum  │
├───────┼──────┤
│     1 │ 8192 │
└───────┴──────┘
(1 row)

The leak appears to be not new; I see it in 9.6 as well. This seems
like a surprisingly easy leak to trigger...

Looks like there's something else awry. The above DO statement takes
2.2s on a 13 assert build, but 32s on a master assert build. It's
spending a lot of time doing dependency lookups:

-   94.62%  0.01%  postgres  postgres  [.] CreateFunction
    - 94.61% CreateFunction
       - 94.56% ProcedureCreate
          - 89.68% deleteDependencyRecordsFor
             - 89.38% systable_getnext
                - 89.33% index_getnext_slot
                   - 56.00% index_fetch_heap
                      + 54.64% table_index_fetch_tuple
                        0.09% heapam_index_fetch_tuple
                   + 28.53% index_getnext_tid
                     2.77% ItemPointerEquals
                     0.10% table_index_fetch_tuple
                     0.09% btgettuple
                     0.03% index_getnext_tid

1000 iterations: 521ms
1000 iterations: 533ms
2000 iterations: 1670ms
3000 iterations: 3457ms
3000 iterations: 3457ms
10000 iterations: 31794ms

The quadratic-seeming nature made me wonder if someone broke killtuples
in this situation. And it seems that someone was me, in 623a9ba. We
need to bump xactCompletionCount in the subxid abort case as well...

Greetings,

Andres Freund
Re: PATCH: Attempt to make dbsize a bit more consistent
On Mon, Mar 15, 2021 at 03:10:59PM +0900, Michael Paquier wrote:
> Anyway, as mentioned by other people upthread, I am not really
> convinced either that we should have more flavors of size functions,
> particularly depending on the relkind, as this would be confusing for
> the end user. pg_relation_size() can already do this job for all
> relkinds that use segment files, so it should still be able to hold
> its ground and maintain a consistent behavior with what it does
> currently.

On top of the rest of my notes, there are two more things that would
face a behavior change if the existing functions were made to go
through table AMs, which would scan the data in the smgr:
- After a VACUUM, the relation would be reported with a size of 0,
  while that may not yet be the case for the on-disk files.
- Temporary tables of other sessions would become accessible.

So we would likely want a separate function. Another possibility, which
I find tempting, would be to push the calculation logic relying on
physical files down to the table AMs themselves with a new dedicated
callback (relation_size_physical?), as it seems to me that the most
important thing users want to know with this set of functions is how
much physical space is being consumed at one given point in time.
Attached is a small prototype I was just playing with.
--
Michael

diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 64cdaa4134..8bd5d33265 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -15,6 +15,7 @@
 
 #include "access/htup_details.h"
 #include "access/relation.h"
+#include "access/tableam.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
@@ -23,6 +24,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/numeric.h"
@@ -270,14 +272,28 @@
  * is no check here or at the call sites for that.
  */
 static int64
-calculate_relation_size(RelFileNode *rfn, BackendId backend, ForkNumber forknum)
+calculate_relation_size(Relation rel, ForkNumber forknum)
 {
 	int64		totalsize = 0;
 	char	   *relationpath;
 	char		pathname[MAXPGPATH];
 	unsigned int segcount = 0;
 
-	relationpath = relpathbackend(*rfn, backend, forknum);
+	/*
+	 * If the relation is related to a table AM, do its sizing directly
+	 * using its interface.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_RELATION ||
+		rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
+		rel->rd_rel->relkind == RELKIND_MATVIEW)
+	{
+		if (rel->rd_smgr && smgrexists(rel->rd_smgr, forknum))
+			return table_relation_size(rel, forknum);
+		else
+			return 0;
+	}
+
+	relationpath = relpathbackend(rel->rd_node, rel->rd_backend, forknum);
 
 	for (segcount = 0;; segcount++)
 	{
@@ -327,7 +343,7 @@ pg_relation_size(PG_FUNCTION_ARGS)
 	if (rel == NULL)
 		PG_RETURN_NULL();
 
-	size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
+	size = calculate_relation_size(rel,
 								   forkname_to_number(text_to_cstring(forkName)));
 
 	relation_close(rel, AccessShareLock);
@@ -352,8 +368,7 @@ calculate_toast_table_size(Oid toastrelid)
 	/* toast heap size, including FSM and VM size */
 	for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-		size += calculate_relation_size(&(toastRel->rd_node),
-										toastRel->rd_backend, forkNum);
+		size += calculate_relation_size(toastRel, forkNum);
 
 	/* toast index size, including FSM and VM size */
 	indexlist = RelationGetIndexList(toastRel);
@@ -366,8 +381,7 @@
 		toastIdxRel = relation_open(lfirst_oid(lc), AccessShareLock);
 		for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-			size += calculate_relation_size(&(toastIdxRel->rd_node),
-											toastIdxRel->rd_backend, forkNum);
+			size += calculate_relation_size(toastIdxRel, forkNum);
 		relation_close(toastIdxRel, AccessShareLock);
 	}
@@ -395,8 +409,7 @@ calculate_table_size(Relation rel)
 	 * heap size, including FSM and VM
 	 */
 	for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-		size += calculate_relation_size(&(rel->rd_node), rel->rd_backend,
-										forkNum);
+		size += calculate_relation_size(rel, forkNum);
 
 	/*
 	 * Size of toast relation
@@ -434,9 +447,7 @@ calculate_indexes_size(Relation rel)
 			idxRel = relation_open(idxOid, AccessShareLock);
 			for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-				size += calculate_relation_size(&(idxRel->rd_node),
-												idxRel->rd_backend,
-												forkNum);
+				size += calculate_relation_size(idxRel, forkNum);
 			relation_close(idxRel, AccessShareLock);
 		}
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409f..4b3ec17989 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -103,7 +103,7 @@
 INSE
Re: libpq debug log
At Wed, 17 Mar 2021 02:09:32 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> Alvaro-san,
>
> Thank you for taking your time to take a look at an incomplete patch.
> I thought I would ask you for a final check for commit after Iwata-san
> has reflected my review comments.
>
> I discussed your comments below with Iwata-san. Let me convey her
> opinions. (She is now focusing on fixing the patch.) We'd appreciate
> it if you could tweak her completed patch.
>
> > From: Alvaro Herrera
> > * There's no cross-check that the protocol message length word
> > corresponds to what we actually print. I think one easy way to
> > cross-check that would be to return the "count" from the
> > type-specific routines, and verify that it matches "length" in
> > pqTraceOutputMessage. (If the separate routines are removed and the
> > code put back in one giant switch, then the "count" is already
> > available when the switch ends.)
>
> We don't think the length check is necessary, because 1) for FE->BE,
> correct messages are always assembled, and 2) for BE->FE, the parsing
> and decomposition of messages have already checked the messages.

Maybe it is not for the sanity of the message bytes, but to check that
the logging stuff is implemented in sync with the message structure. If
the given message length differs from the length the logging facility
read after scanning the message bytes, it's a sign of something wrong
in *the logging feature*.

> > * If we have separate functions for each message type, then we can
> > have that function supply the message name by itself, rather than
> > have the separate protocol_message_type_f / _b arrays that print it.
>
> I felt those two arrays are clumsy and thought of suggesting to remove
> them. But I didn't, because the functions or case labels for each
> message type would have to duplicate the printing of the header
> fields: timestamp, message type, and length. Maybe we can change the
> order of output to "timestamp length message_type content", but I
> kind of prefer the current order. What do you think? (I can understand
> that removing the clumsy static arrays could be better than preserving
> the order of the output fields.)

+1 for removing the arrays.

> > * If we make each type-specific function print the message name,
> > then we need to make the same function print the message length,
> > because it comes after. So the function would have to receive the
> > length (so that it can be printed). But I think we should still have
> > the cross-check I mentioned in my first point occur in the main
> > pqTraceOutputMessage, not the type-specific function, for fear that
> > we will forget the cross-check in one of the functions and never
> > realize that we did.
>
> As mentioned above, we think the current structure is better for
> smaller and more readable code.
>
> > * I would make the pqTraceOutputInt16() function and siblings
> > advance the cursor themselves, actually. I think something like this:
> >
> > static int
> > pqTraceOutputInt16(const char *data, int *cursor, FILE *pfdebug)
> > {
> >     uint16  tmp;
> >     int     result;
> >
> >     memcpy(&tmp, data + *cursor, 2);
> >     *cursor += 2;
> >     result = (int) pg_ntoh16(tmp);
> >     fprintf(pfdebug, " #%d", result);
> >
> >     return result;
> > }
> >
> > (So the caller has to pass the original "data" pointer, not
> > "data+cursor"). This means the caller no longer has to do
> > "cursor+=N" at each place. Also, you get rid of the moveStrCursor()
> > which does not look good.
>
> That makes sense, because in fact the patch mistakenly added 4 when it
> should add 2. Also, the code would become smaller.

FWIW, that's what I suggested upthread :p  So +3.

me> *I*'d want to make the output functions move the reading pointer by
me> themselves.

> > * I'm not fond of the idea of prefixing "#" for 16-bit ints and no
> > prefix for 32-bit ints. Seems obscure and the output looks weird.
> > I would use a one-letter prefix for each type, "w" for 32-bit ints
> > (stands for "word") and "h" for 16-bit ints (stands for
> > "half-word"). Message length goes unadorned. Then you'd have lines
> > like this:
> >
> > 2021-03-15 08:10:44.324296 < RowDescription 35 h1 "spcoptions" w1213 h5 w1009 h65535 w-1 h0
> > 2021-03-15 08:10:44.324335 < DataRow 32 h1 w22 '{random_page_cost=3.0}'
>
> Yes, actually I felt something similar. On second thought, we think we
> don't have to have any prefix, because it doesn't help users. So we're
> removing the prefix. We don't have a strong opinion on adding some
> prefix.

It would help when the value is "255", which is confusing between -1
(or 255) in a byte and 255 in a 2-byte word. Or when the value looks
broken. I'd suggest "b"yte for 1 byte, "s"hort for 2 bytes, "l"ong for
4 bytes ('l' is confusing with '1', but anyway it is not needed).
Re: Assertion failure with barriers in parallel hash join
On Sat, Mar 6, 2021 at 9:56 AM Thomas Munro wrote:
> While working on Melanie's Parallel Hash Full Join patch I remembered
> that this (apparently extremely rare) race still needs fixing. Here
> is a slightly tidied version, which I'm adding to the next CF for CI
> coverage.

Pushed and back-patched.
Re: Regression tests vs SERIALIZABLE
Thomas Munro writes:
> On Tue, Mar 16, 2021 at 3:28 AM Tom Lane wrote:
>> Usually, if we issue a SET in the regression tests, we explicitly
>> RESET as soon thereafter as practical, so as to have a well-defined
>> scope where the script is running under unusual conditions.

> Oh, of course. Thanks.

> I was wrong to blame that commit, and there are many other tests that
> fail in the back branches. But since we were down to just one, I went
> ahead and fixed this in the master branch only.

Makes sense to me. Committed patch looks good.

			regards, tom lane
Re: pl/pgsql feature request: shorthand for argument and local variable references
On 17. 3. 2021 at 4:52, Michael Paquier wrote:
> On Mon, Mar 15, 2021 at 05:31:18AM +0100, Pavel Stehule wrote:
> > Thank you very much
>
> I am not much a fan of the implementation done here. Based on my
> understanding of the PL code, the namespace lookup is organized as a
> list of items, with the namespace associated to the routine name being
> the first one pushed to the list after plpgsql_ns_init() initializes
> the top of it. Then, the proposed patch hijacks the namespace lookup
> list by doing a replacement of the root label while parsing the
> routine, and then tweaks the lookup logic when a variable is qualified
> (aka name2 != NULL) to bump the namespace list one level higher, so
> that it does not look after the namespace with the routine name but
> instead fetches the one defined by ROUTINE_NAME. That seems rather
> bug-prone to me, to say the least.

I'll check what can be done. I am not sure if it is possible to push a
new label to the same level as the top label.

> No changes to plpgsql_compile_inline()?
>
> +   ereport(ERROR,
> +           (errcode(ERRCODE_SYNTAX_ERROR),
> +            errmsg("redundant option"),
> +            errhint("The command \"routine_label\" can be used only once in rutine."),
> +            plpgsql_scanner_errposition(location)));
>
> s/rutine/routine/
>
> +   /* Don't allow repeated redefination of routine label */
>
> redefination?

I'll check it. But an inline block has no top label.

> I am wondering whether it would be better to allow multiple aliases
> though, and if it would bring more readability to the routines written
> if these are treated equal to the top-most namespace which is the
> routine name now, meaning that we would maintain not one, but N top
> namespace labels that could be used as aliases of the root one.

I do not have a strong opinion, but I am inclined to disallow this. I
am afraid the code could become less readable.

There is a precedent - SQL doesn't allow you to use table names as
qualifiers when you have an alias.

But it is a very important question. The selected behavior strongly
impacts the implementation.

What is the more common opinion about it? 1. Should we allow the
original top label or not? 2. Should we allow defining more top labels?

Regards

Pavel

> --
> Michael
Re: Regression tests vs SERIALIZABLE
On Tue, Mar 16, 2021 at 3:28 AM Tom Lane wrote:
> Thomas Munro writes:
> > On Mon, Mar 15, 2021 at 5:24 PM Thomas Munro wrote:
> >> However, since commit 862ef372d6b, there *is* one test that fails
> >> if you run make installcheck against a cluster running with -c
> >> default_transaction_isolation=serializable: transaction.sql. Is
> >> that a mistake? Is it a goal to be able to run this test suite
> >> against all 3 isolation levels?
>
> > Here's a fix.
>
> Usually, if we issue a SET in the regression tests, we explicitly
> RESET as soon thereafter as practical, so as to have a well-defined
> scope where the script is running under unusual conditions.

Oh, of course. Thanks.

I was wrong to blame that commit, and there are many other tests that
fail in the back branches. But since we were down to just one, I went
ahead and fixed this in the master branch only.
Re: Getting better results from valgrind leak tracking
"Andres Freund" writes:
> On Tue, Mar 16, 2021, at 20:01, Tom Lane wrote:
>> That seems both unnecessary and impractical. We have to consider that
>> everything-still-reachable is an OK final state.

> I don't consider "still reachable" a leak. Just definitely unreachable.

OK, we're in violent agreement then -- I must've misread what you wrote.

>>> I think the run might have shown a genuine leak:

>> I didn't see anything like that after applying the fixes I showed before.

> I think it's actually unreachable memory (unless you count resetting
> the cache context), based on manually tracing the code... I'll try to
> repro.

I agree that having to reset CacheContext is not something we should
count as "still reachable", and I'm pretty sure that the existing
valgrind infrastructure doesn't count it that way.

As for the particular point about ParallelBlockTableScanWorkerData, I
agree with your question to David about why that's in TableScanDesc not
HeapScanDesc, but I can't get excited about it not being freed in
heap_endscan. That's mainly because I do not believe that anything as
complex as a heap or index scan should be counted on to be
zero-leakage. The right answer is to not do such operations in
long-lived contexts. So if we're running such a thing while switched
into CacheContext, *that* is the bug, not that heap_endscan didn't free
this particular allocation.

			regards, tom lane
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Mon, Mar 15, 2021 at 05:31:18AM +0100, Pavel Stehule wrote:
> Thank you very much

I am not much a fan of the implementation done here. Based on my
understanding of the PL code, the namespace lookup is organized as a
list of items, with the namespace associated to the routine name being
the first one pushed to the list after plpgsql_ns_init() initializes
the top of it. Then, the proposed patch hijacks the namespace lookup
list by doing a replacement of the root label while parsing the
routine, and then tweaks the lookup logic when a variable is qualified
(aka name2 != NULL) to bump the namespace list one level higher, so
that it does not look after the namespace with the routine name but
instead fetches the one defined by ROUTINE_NAME. That seems rather
bug-prone to me, to say the least.

No changes to plpgsql_compile_inline()?

+   ereport(ERROR,
+           (errcode(ERRCODE_SYNTAX_ERROR),
+            errmsg("redundant option"),
+            errhint("The command \"routine_label\" can be used only once in rutine."),
+            plpgsql_scanner_errposition(location)));

s/rutine/routine/

+   /* Don't allow repeated redefination of routine label */

redefination?

I am wondering whether it would be better to allow multiple aliases
though, and if it would bring more readability to the routines written
if these are treated equal to the top-most namespace which is the
routine name now, meaning that we would maintain not one, but N top
namespace labels that could be used as aliases of the root one.

--
Michael

signature.asc
Description: PGP signature
Re: Getting better results from valgrind leak tracking
Hi,

On Tue, Mar 16, 2021, at 20:01, Tom Lane wrote:
> Andres Freund writes:
> > On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
> >> It doesn't appear to me that we leak much at all, at least not if you
> >> are willing to take "still reachable" blocks as not-leaked.
>
> > Well, I think for any sort of automated testing - which I think would be
> > useful - we'd really need *no* leaks.
>
> That seems both unnecessary and impractical. We have to consider that
> everything-still-reachable is an OK final state.

I don't consider "still reachable" a leak. Just definitely unreachable. And with a few tweaks that seems like we could achieve that?

> > I think the run might have shown a genuine leak:
>
> > ==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
> > ==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
> > ==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
> > ==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
> > ==2048803==    by 0x265994: systable_beginscan (genam.c:453)
> > ==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
> > ==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)
>
> I didn't see anything like that after applying the fixes I showed before.
> There are a LOT of false positives from the fact that with our HEAD
> code, valgrind believes that everything in the catalog caches and
> most things in dynahash tables (including the relcache) are unreachable.

I think it's actually unreachable memory (unless you count resetting the cache context), based on manually tracing the code... I'll try to repro.

> > I do see a bunch of leaks bleats below fun:plpgsql_compile that I don't
> > yet understand. E.g.
>
> Those are probably a variant of what you were suggesting above, ie
> plpgsql isn't terribly careful not to leak random stuff while building
> a long-lived function parse tree. It's supposed to use a temp context
> for anything that might leak, but I suspect it's not thorough about it.
What I meant was that I didn't understand how there's not a leak danger when compilation fails halfway through, given that the context in question is below TopMemoryContext and that I didn't see a relevant TRY block. But probably there is something cleaning it up that I didn't see.

Andres
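The cleanup-on-error pattern being discussed here is PostgreSQL's PG_TRY/PG_CATCH. A stand-alone model of it, using plain setjmp/longjmp in place of the real macros (compile_function and func_cxt are illustrative names, not the actual plpgsql code):

```c
#include <setjmp.h>
#include <stdlib.h>
#include <stdbool.h>

static jmp_buf catch_buf;          /* stands in for PG_exception_stack */
static bool cleanup_ran = false;

/*
 * Toy "compile": builds a long-lived allocation (analogous to a function
 * parse tree kept below TopMemoryContext).  If compilation fails halfway
 * through, the catch block must delete the partially built state, or the
 * allocation leaks for the life of the process.
 */
static void *compile_function(bool fail_halfway)
{
    void *func_cxt = malloc(128);   /* long-lived on success */

    if (setjmp(catch_buf) == 0)     /* PG_TRY() */
    {
        if (fail_halfway)
            longjmp(catch_buf, 1);  /* models elog(ERROR) mid-compile */
        return func_cxt;            /* success: caller keeps the tree */
    }

    /* PG_CATCH(): free the half-built state (re-throw omitted here) */
    free(func_cxt);
    cleanup_ran = true;
    return NULL;
}
```

Without the catch-side free, the failure path would leave the allocation behind, which is exactly the leak danger raised above.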
Re: fdatasync performance problem with large number of DB files
On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao wrote: > On 2021/03/16 8:15, Thomas Munro wrote: > > I don't want to add a hypothetical sync_after_crash=none, because it > > seems like generally a bad idea. We already have a > > running-with-scissors mode you could use for that: fsync=off. > > I heard that some backup tools sync the database directory when restoring it. > I guess that those who use such tools might want the option to disable such > startup sync (i.e., sync_after_crash=none) because it's not necessary. Hopefully syncfs() will return quickly in that case, without doing any work? > They can skip that sync by fsync=off. But if they just want to skip only that > startup sync and make subsequent recovery (or standby server) work with > fsync=on, they would need to shutdown the server after that startup sync > finishes, enable fsync, and restart the server. In this case, since the server > is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync > would not be performed. This procedure is tricky. So IMO supporting > sync_after_crash=none would be helpful for this case and simple. I still do not like this footgun :-) However, perhaps I am being overly dogmatic. Consider the change in d8179b00, which decided that I/O errors in this phase should be reported at LOG level rather than ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to rebase over this) will PANIC in this case, because any syncing will be handled through the usual checkpoint codepaths. Do you think it would be OK to commit this feature with just "fsync" and "syncfs", and then to continue to consider adding "none" as a possible separate commit?
Re: [HACKERS] logical decoding of two-phase transactions
On Wed, Mar 17, 2021 at 8:07 AM vignesh C wrote: > > On Tue, Mar 16, 2021 at 7:22 PM Amit Kapila wrote: > > > > On Tue, Mar 16, 2021 at 6:22 PM vignesh C wrote: > > > > > > On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote: > > > > > > > > On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila > > > > wrote: > > >> > > > > > > 2) table_states_not_ready global variable is used immediately after > > > call to FetchTableStates, we can make FetchTableStates return the > > > value or get it as an argument to the function and the global > > > variables can be removed. > > > +static List *table_states_not_ready = NIL; > > > > > > > But we do update the states in the list table_states_not_ready in > > function process_syncing_tables_for_apply. So, the current arrangement > > looks good to me. > > But I felt we can do this without using global variables. > table_states_not_ready is used immediately after calling > FetchTableStates in AnyTablesyncsNotREADY and > process_syncing_tables_for_apply functions. It is not used anywhere > else. My point was we do not need to store this in global variables as > it is not needed elsewhere. > It might be possible, but I am not sure that is better than what we are currently doing; moreover, that is existing code and this patch has just encapsulated it in a function. Even if you think there is a better way, which I doubt, we can probably look at it as a separate patch. -- With Regards, Amit Kapila.
Re: subscriptionCheck failures
On Tue, Mar 16, 2021 at 6:22 PM osumi.takami...@fujitsu.com wrote: > > > To me, this correctly works because > the timing I put the while loop and stops the walsender > makes the DROP SUBSCRIPTION affects two slots. Any comments ? > No, your testing looks fine. I have also done a similar test. Pushed! -- With Regards, Amit Kapila.
Re: pg_subscription - substream column?
On Wed, Mar 17, 2021 at 4:56 AM Peter Smith wrote: > > On Wed, Mar 17, 2021 at 12:45 AM Amit Kapila wrote: > > > > > > Attached, please find the patch to update the description of substream > > in pg_subscription. > > > > I applied your patch and regenerated the PG docs to check the result. > > LGTM. > Pushed! -- With Regards, Amit Kapila.
Re: fdatasync performance problem with large number of DB files
On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao wrote: > Thanks for the patch! > > +When set to fsync, which is the default, > +PostgreSQL will recursively open and fsync > +all files in the data directory before crash recovery begins. > > Isn't this a bit misleading? This may cause users to misunderstand that > such fsync can happen only in the case of crash recovery. If I insert the following extra sentence after that one, is it better? "This applies whenever starting a database cluster that did not shut down cleanly, including copies created with pg_basebackup."
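For reference, the per-file flush the documentation text describes boils down to an open/fsync/close sequence repeated over the data directory. A minimal, portable sketch (sync_one_file is an illustrative name, not PostgreSQL code):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush one file's data and metadata to stable storage.  The startup
 * code sketched in the doc text does essentially this for every file
 * under the data directory when sync_after_crash=fsync (the default).
 * The syncfs setting instead issues one syncfs(2) call per involved
 * filesystem, which is why it can return quickly when little data is
 * dirty -- e.g. right after a backup tool has already synced everything.
 */
int sync_one_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);     /* report, rather than ignore, any error */
    close(fd);
    return rc;
}
```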
Re: Getting better results from valgrind leak tracking
Andres Freund writes:
> On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
>> It doesn't appear to me that we leak much at all, at least not if you
>> are willing to take "still reachable" blocks as not-leaked.

> Well, I think for any sort of automated testing - which I think would be
> useful - we'd really need *no* leaks.

That seems both unnecessary and impractical. We have to consider that everything-still-reachable is an OK final state.

> I think the run might have shown a genuine leak:
> ==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
> ==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
> ==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
> ==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
> ==2048803==    by 0x265994: systable_beginscan (genam.c:453)
> ==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
> ==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)

I didn't see anything like that after applying the fixes I showed before. There are a LOT of false positives from the fact that with our HEAD code, valgrind believes that everything in the catalog caches and most things in dynahash tables (including the relcache) are unreachable. I'm not trying to claim there are no leaks anywhere, just that the amount of noise from those issues swamps all the real problems. I particularly recommend not believing anything related to catcache or relcache if you haven't applied that admittedly-hacky patch.

> Hm. For me the number of leaks seem to stay the same with or without
> your changes related to this. Is this a USE_VALGRIND build?

Yeah ...

> I do see a bunch of leaks bleats below fun:plpgsql_compile that I don't
> yet understand. E.g.

Those are probably a variant of what you were suggesting above, ie plpgsql isn't terribly careful not to leak random stuff while building a long-lived function parse tree. It's supposed to use a temp context for anything that might leak, but I suspect it's not thorough about it.
We could chase that sort of thing after we clean up the other problems. regards, tom lane
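The interior-pointer situation described in point (1) of this thread can be reproduced in a few lines. ToyNode and CacheEntry below are illustrative stand-ins, not PostgreSQL's real dlist types:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyNode { struct ToyNode *prev, *next; } ToyNode;

typedef struct CacheEntry
{
    int     key;        /* offset 0: where the malloc'd block starts */
    double  payload;
    ToyNode node;       /* the list link lives mid-struct */
} CacheEntry;

/*
 * After pushing, the only live pointer to the block is &entry->node: an
 * interior pointer.  valgrind's leak checker does not know to subtract
 * offsetof(CacheEntry, node) to find the block start, so it reports the
 * block as possibly lost/unreachable even though the program can still
 * reach it -- hence the experiment of moving the link field to offset 0.
 */
ToyNode *push_entry(ToyNode *head, CacheEntry *entry)
{
    entry->node.prev = NULL;
    entry->node.next = head;
    return &entry->node;
}

/* Recovering the containing struct, as dlist_container() does. */
CacheEntry *entry_from_node(ToyNode *node)
{
    return (CacheEntry *) ((char *) node - offsetof(CacheEntry, node));
}
```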
Re: A new function to wait for the backend exit after termination
At Wed, 17 Mar 2021 07:01:39 +0530, Bharath Rupireddy wrote in
> Attaching v10 patch for further review.

The time-out mechanism doesn't count remainingtime as expected; concretely, it does the following:

    do
    {
        kill();
        WaitLatch(WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, waittime);
        ResetLatch(MyLatch);
        remainingtime -= waittime;
    } while (remainingtime > 0);

So WaitLatch doesn't consume as much time as the set waittime when the latch is set, and remainingtime decreases faster than real time at each iteration.

It wouldn't often happen in practice, but I am concerned about PID recycling. We can get rid of that fear by checking for our beentry instead of the PID. However, it seems to me that some additional function would be needed in pgstat.c so that we can check the real-time value of PgBackendStatus, which might be too much.

+   /* If asked to wait, check whether the timeout value is valid or not. */
+   if (wait && pid != MyProcPid)
+   {
+       timeout = PG_GETARG_INT64(2);
+
+       if (timeout <= 0)
+           ereport(ERROR,
+                   (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+                    errmsg("\"timeout\" must not be negative or zero")));

This means that pg_terminate_backend accepts negative timeouts when terminating myself, which looks odd. Is there any reason to reject 0 as a timeout?

+    * Wait only if requested and the termination is successful. Self
+    * termination is allowed but waiting is not.
+    */
+   if (wait && pid != MyProcPid && result)
+       result = pg_wait_until_termination(pid, timeout);

Why don't we wait for myself to be terminated? There's no guarantee that I will be terminated without failure. (I agree that it is not so useful, but I think there's no reason not to do so.)

The first suggested signature for pg_terminate_backend() with a timeout was pg_terminate_backend(pid, timeout). The current signature (pid, wait?, timeout) looks redundant. Maybe the reason for rejecting 0 as a timeout is that pg_terminate_backend(pid, true, 0) looks odd, but we could wait forever in that case (as other features do).
On the other hand, pg_terminate_backend(pid, false, 100) is apparently odd, but this patch doesn't seem to reject it. If there's no strong reason for the current signature, I would suggest pg_terminate_backend(pid, timeout), where it waits forever if timeout is zero and waits for the timeout if positive. Negative values are not accepted.

That being said, I didn't find the discussion about allowing a default timeout value by separating out the boolean; if that is the consensus on this thread, sorry for the noise.

+           ereport(WARNING,
+                   (errmsg("could not check the existence of the backend with PID %d: %m",
+                           pid)));
+           return false;

I think this is worth ERROR. We can avoid this handling if we look into the PgBackendStatus entry instead.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
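The accounting problem described above (remainingtime shrinking by the full waittime even when the wait returns early) can be avoided by recomputing the remaining time from a fixed deadline on every iteration. A stand-alone sketch, with now_ms, backend_alive, and wait_step standing in for GetCurrentTimestamp(), the process-existence check, and WaitLatch() respectively:

```c
#include <stdint.h>

typedef int64_t ms_t;

/*
 * Wait for a backend to terminate, up to timeout_ms milliseconds.
 * Returns 1 if it terminated in time, 0 on timeout.  The key point:
 * "remaining" is recomputed from a fixed deadline, so early wakeups
 * (e.g. latch set for an unrelated reason) do not make the loop give
 * up before the full timeout has actually elapsed.
 */
int wait_for_termination(ms_t timeout_ms,
                         ms_t (*now_ms)(void),
                         int (*backend_alive)(void),
                         void (*wait_step)(ms_t))
{
    ms_t deadline = now_ms() + timeout_ms;

    for (;;)
    {
        ms_t remaining;

        if (!backend_alive())
            return 1;                /* terminated within the timeout */
        remaining = deadline - now_ms();
        if (remaining <= 0)
            return 0;                /* genuine timeout */
        /* wait at most 100 ms per iteration, like the patch's waittime */
        wait_step(remaining < 100 ? remaining : 100);
    }
}
```

With this shape, an early wakeup simply causes another iteration with a smaller `remaining`, instead of double-counting time that was never spent waiting.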
Re: Getting better results from valgrind leak tracking
Hi,

For the second-last trace involving SearchCatCacheList (catcache.c:1691), the ctlist's members are stored in the cl->members array, where cl is returned at the end of SearchCatCacheList. Maybe this was not accounted for by valgrind?

Cheers

On Tue, Mar 16, 2021 at 7:31 PM Andres Freund wrote:
> Hi,
>
> David, there's a question about a commit of yours below, hence adding
> you.
>
> On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
> > Out of curiosity I poked at this for a little while.
>
> Cool.
>
> > It doesn't appear to me that we leak much at all, at least not if you
> > are willing to take "still reachable" blocks as not-leaked.
>
> Well, I think for any sort of automated testing - which I think would be
> useful - we'd really need *no* leaks. I know that I get a few bleats
> whenever I forget to set --leak-check=no. It's also not just postgres
> itself, but some of the helper tools...
>
> And it's not just valgrind, also gcc/clang sanitizers...
>
> > What I found out is that we have a lot of issues that seem to devolve
> > to valgrind not being sure that a block is referenced. I identified
> > two main causes of that:
> >
> > (1) We have a pointer, but it doesn't actually point right at the start
> > of the block. A primary culprit here is lists of thingies that use the
> > slist and dlist infrastructure. As an experiment, I moved the dlist_node
> > fields of some popular structs to the beginning, and verified that that
> > silences associated complaints. I'm not sure that we want to insist on
> > put-the-link-first as policy (although if we did, it could provide some
> > notational savings perhaps). However, unless someone knows of a way to
> > teach valgrind about this situation, there may be no other way to silence
> > those leakage complaints. A secondary culprit is people randomly applying
> > CACHELINEALIGN or the like to a palloc'd address, so that the address we
> > have isn't pointing right at the block start.
>
> Hm, do you still have a backtrace / suppression for one of those? I
> didn't see any in a quick (*) serial installcheck I just ran. Or I
> wasn't able to pinpoint them to this issue.
>
> I think the run might have shown a genuine leak:
>
> ==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
> ==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
> ==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
> ==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
> ==2048803==    by 0x265994: systable_beginscan (genam.c:453)
> ==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
> ==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)
> ==2048803==    by 0x83BE9A: SearchCatCache1 (catcache.c:1167)
> ==2048803==    by 0x85876A: SearchSysCache1 (syscache.c:1134)
> ==2048803==    by 0x84CDB3: RelationInitTableAccessMethod (relcache.c:1795)
> ==2048803==    by 0x84F807: RelationBuildLocalRelation (relcache.c:3554)
> ==2048803==    by 0x303C9D: heap_create (heap.c:395)
> ==2048803==    by 0x305790: heap_create_with_catalog (heap.c:1291)
> ==2048803==    by 0x41A327: DefineRelation (tablecmds.c:885)
> ==2048803==    by 0x6C96B6: ProcessUtilitySlow (utility.c:1131)
> ==2048803==    by 0x6C948A: standard_ProcessUtility (utility.c:1034)
> ==2048803==    by 0x6C865F: ProcessUtility (utility.c:525)
> ==2048803==    by 0x6C7409: PortalRunUtility (pquery.c:1159)
> ==2048803==    by 0x6C7636: PortalRunMulti (pquery.c:1305)
> ==2048803==    by 0x6C6B11: PortalRun (pquery.c:779)
> ==2048803==    by 0x6C05AB: exec_simple_query (postgres.c:1173)
> ==2048803==
> {
>    Memcheck:Leak
>    match-leak-kinds: definite
>    fun:palloc
>    fun:heap_beginscan
>    fun:table_beginscan_strat
>    fun:systable_beginscan
>    fun:SearchCatCacheMiss
>    fun:SearchCatCacheInternal
>    fun:SearchCatCache1
>    fun:SearchSysCache1
>    fun:RelationInitTableAccessMethod
>    fun:RelationBuildLocalRelation
>    fun:heap_create
>    fun:heap_create_with_catalog
>    fun:DefineRelation
>    fun:ProcessUtilitySlow
>    fun:standard_ProcessUtility
>    fun:ProcessUtility
>    fun:PortalRunUtility
>    fun:PortalRunMulti
>    fun:PortalRun
>    fun:exec_simple_query
> }
>
> Since 56788d2156fc heap_beginscan() allocates
>
>     scan->rs_base.rs_private =
>         palloc(sizeof(ParallelBlockTableScanWorkerData));
>
> in heap_beginscan(). But doesn't free it in heap_endscan().
>
> In most of the places heap scans are begun inside transient contexts,
> but not always. In the above trace for example
> RelationBuildLocalRelation switched to CacheMemoryContext, and nothing
> switched to something else.
>
> I'm a bit confused about the precise design of rs_private /
> ParallelBlockTableScanWorkerData, specifically why it's been added to
> TableScanDesc, instead of just adding it to HeapScanDesc? And why is it
> allocated unconditionally, instead of just for parallel scans?
>
> I don't think this is a false positive, even though it theoretically
> could be freed by resetting CacheMemoryContext (see below)?
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Mar 16, 2021 at 7:22 PM Amit Kapila wrote: > > On Tue, Mar 16, 2021 at 6:22 PM vignesh C wrote: > > > > On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote: > > > > > > On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila > > > wrote: > > >> > > > > 2) table_states_not_ready global variable is used immediately after > > call to FetchTableStates, we can make FetchTableStates return the > > value or get it as an argument to the function and the global > > variables can be removed. > > +static List *table_states_not_ready = NIL; > > > > But we do update the states in the list table_states_not_ready in > function process_syncing_tables_for_apply. So, the current arrangement > looks good to me. But I felt we can do this without using global variables. table_states_not_ready is used immediately after calling FetchTableStates in AnyTablesyncsNotREADY and process_syncing_tables_for_apply functions. It is not used anywhere else. My point was we do not need to store this in global variables as it is not needed elsewhere. We could change the return type or return it through a function argument in this case. Thoughts? Regards, Vignesh
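The refactoring being debated, reduced to its essentials: a file-level global filled as a side effect versus an out parameter owned by the caller. This is a toy contrast with illustrative names, not the real tablesync.c code:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct List { int length; } List;

static List not_ready;                       /* backing storage for the demo */
static List *table_states_not_ready = NULL;  /* the global under discussion */

/* Current shape: the result is delivered through a global, as a side effect. */
static void fetch_table_states_global(bool *started_tx)
{
    *started_tx = true;
    table_states_not_ready = &not_ready;
}

/* Suggested shape: the caller receives the list directly; no global needed. */
static void fetch_table_states_param(bool *started_tx, List **not_ready_out)
{
    *started_tx = true;
    *not_ready_out = &not_ready;
}
```

The out-parameter form makes the data flow explicit, at the cost of threading the list through every caller; the global form keeps call sites short, which is the trade-off the thread is weighing.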
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On 2021-03-16 20:51, Bharath Rupireddy wrote: On Mon, Mar 15, 2021 at 11:23 AM torikoshia wrote: On 2021-03-07 19:16, Bharath Rupireddy wrote: > On Fri, Feb 5, 2021 at 5:15 PM Bharath Rupireddy > wrote: >> >> pg_terminate_backend and pg_cancel_backend with postmaster PID produce >> "PID is not a PostgreSQL server process" warning [1], which >> basically implies that the postmaster is not a PostgreSQL process at >> all. This is a bit misleading because the postmaster is the parent of >> all PostgreSQL processes. Should we improve the warning message if the >> given PID is postmasters' PID? +1. I felt it was a bit confusing when reviewing a thread[1]. Hmmm. > I'm attaching a small patch that emits a warning "signalling > postmaster with PID %d is not allowed" for postmaster and "signalling > PostgreSQL server process with PID %d is not allowed" for auxiliary > processes such as checkpointer, background writer, walwriter. > > However, for stats collector and sys logger processes, we still get > "PID X is not a PostgreSQL server process" warning because they > don't have PGPROC entries(??). So BackendPidGetProc and > AuxiliaryPidGetProc will not help and even pg_stat_activity is not > having these processes' pid. I also ran into the same problem while creating a patch in [2]. I have not gone through that thread though. Is there any way we can detect those child processes (stats collector, sys logger) that are forked by the postmaster from a backend process? Thoughts? I couldn't find good ways to do that, and thus I'm now wondering about just changing the message. That is, changing it to something like "PID is not a PostgreSQL backend process". "backend process" is now defined as "Process of an instance which acts on behalf of a client session and handles its requests." in the Appendix. Yeah, that looks good to me. 
IIUC, we can just change the message from "PID is not a PostgreSQL server process" to "PID is not a PostgreSQL backend process" and we don't need to look for AuxiliaryProcs or PostmasterPid. Changing log messages can affect operations, especially when people monitor the log message strings, but improving "PID is not a PostgreSQL server process" does not seem to cause such problems. Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Re: Getting better results from valgrind leak tracking
Hi,

David, there's a question about a commit of yours below, hence adding you.

On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
> Out of curiosity I poked at this for a little while.

Cool.

> It doesn't appear to me that we leak much at all, at least not if you
> are willing to take "still reachable" blocks as not-leaked.

Well, I think for any sort of automated testing - which I think would be useful - we'd really need *no* leaks. I know that I get a few bleats whenever I forget to set --leak-check=no. It's also not just postgres itself, but some of the helper tools...

And it's not just valgrind, also gcc/clang sanitizers...

> What I found out is that we have a lot of issues that seem to devolve
> to valgrind not being sure that a block is referenced. I identified
> two main causes of that:
>
> (1) We have a pointer, but it doesn't actually point right at the start
> of the block. A primary culprit here is lists of thingies that use the
> slist and dlist infrastructure. As an experiment, I moved the dlist_node
> fields of some popular structs to the beginning, and verified that that
> silences associated complaints. I'm not sure that we want to insist on
> put-the-link-first as policy (although if we did, it could provide some
> notational savings perhaps). However, unless someone knows of a way to
> teach valgrind about this situation, there may be no other way to silence
> those leakage complaints. A secondary culprit is people randomly applying
> CACHELINEALIGN or the like to a palloc'd address, so that the address we
> have isn't pointing right at the block start.

Hm, do you still have a backtrace / suppression for one of those? I didn't see any in a quick (*) serial installcheck I just ran. Or I wasn't able to pinpoint them to this issue.
I think the run might have shown a genuine leak:

==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
==2048803==    by 0x265994: systable_beginscan (genam.c:453)
==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)
==2048803==    by 0x83BE9A: SearchCatCache1 (catcache.c:1167)
==2048803==    by 0x85876A: SearchSysCache1 (syscache.c:1134)
==2048803==    by 0x84CDB3: RelationInitTableAccessMethod (relcache.c:1795)
==2048803==    by 0x84F807: RelationBuildLocalRelation (relcache.c:3554)
==2048803==    by 0x303C9D: heap_create (heap.c:395)
==2048803==    by 0x305790: heap_create_with_catalog (heap.c:1291)
==2048803==    by 0x41A327: DefineRelation (tablecmds.c:885)
==2048803==    by 0x6C96B6: ProcessUtilitySlow (utility.c:1131)
==2048803==    by 0x6C948A: standard_ProcessUtility (utility.c:1034)
==2048803==    by 0x6C865F: ProcessUtility (utility.c:525)
==2048803==    by 0x6C7409: PortalRunUtility (pquery.c:1159)
==2048803==    by 0x6C7636: PortalRunMulti (pquery.c:1305)
==2048803==    by 0x6C6B11: PortalRun (pquery.c:779)
==2048803==    by 0x6C05AB: exec_simple_query (postgres.c:1173)
==2048803==
{
   Memcheck:Leak
   match-leak-kinds: definite
   fun:palloc
   fun:heap_beginscan
   fun:table_beginscan_strat
   fun:systable_beginscan
   fun:SearchCatCacheMiss
   fun:SearchCatCacheInternal
   fun:SearchCatCache1
   fun:SearchSysCache1
   fun:RelationInitTableAccessMethod
   fun:RelationBuildLocalRelation
   fun:heap_create
   fun:heap_create_with_catalog
   fun:DefineRelation
   fun:ProcessUtilitySlow
   fun:standard_ProcessUtility
   fun:ProcessUtility
   fun:PortalRunUtility
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
}

Since 56788d2156fc heap_beginscan() allocates

    scan->rs_base.rs_private =
        palloc(sizeof(ParallelBlockTableScanWorkerData));

but doesn't free it in heap_endscan().
In most of the places heap scans are begun inside transient contexts, but not always. In the above trace, for example, RelationBuildLocalRelation switched to CacheMemoryContext, and nothing switched to something else.

I'm a bit confused about the precise design of rs_private / ParallelBlockTableScanWorkerData, specifically why it's been added to TableScanDesc, instead of just adding it to HeapScanDesc? And why is it allocated unconditionally, instead of just for parallel scans?

I don't think this is a false positive, even though it theoretically could be freed by resetting CacheMemoryContext (see below)?

I saw a lot of false positives from autovacuum workers, because AutovacMemCxt is never deleted, and because table_toast_map is created in TopMemoryContext. Adding an explicit MemoryContextDelete(AutovacMemCxt) and parenting table_toast_map there shut that up.

Which brings me to the question why allocations in CacheMemoryContext and AutovacMemCxt are considered to be "lost", even though they're still "reachable" via a context reset?
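The leak pattern traced above, reduced to a toy begin/end pair (illustrative names, not the real heapam.c code). The two remedies implicit in the questions raised are: allocate the worker state only when it can actually be used, and free it in the matching end function:

```c
#include <stdlib.h>

typedef struct ToyScanDesc
{
    int   is_parallel;
    void *rs_private;       /* analogous to rs_base.rs_private */
} ToyScanDesc;

ToyScanDesc *toy_beginscan(int is_parallel)
{
    ToyScanDesc *scan = calloc(1, sizeof(ToyScanDesc));

    /* remedy 1: allocate per-worker state only for parallel scans,
     * instead of unconditionally */
    if (is_parallel)
        scan->rs_private = malloc(64);
    return scan;
}

void toy_endscan(ToyScanDesc *scan)
{
    /* remedy 2: release it in the matching end function;
     * free(NULL) is a harmless no-op for the serial case */
    free(scan->rs_private);
    free(scan);
}
```

Either remedy alone would have silenced the valgrind report when the scan is begun in a long-lived context, though (as noted elsewhere in the thread) running such scans in long-lived contexts is itself questionable.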
RE: libpq debug log
Alvaro-san,

Thank you for taking the time to look at an incomplete patch. I thought I would ask you for a final check before commit after Iwata-san has reflected my review comments. I discussed your comments below with Iwata-san; let me convey her opinions. (She is now focusing on fixing the patch.) We'd appreciate it if you could tweak her completed patch.

From: Alvaro Herrera
> * There's no cross-check that the protocol message length word
> corresponds to what we actually print. I think one easy way to
> cross-check that would be to return the "count" from the type-specific
> routines, and verify that it matches "length" in pqTraceOutputMessage.
> (If the separate routines are removed and the code put back in one
> giant switch, then the "count" is already available when the switch
> ends.)

We don't think the length check is necessary, because 1) for FE->BE, correct messages are always assembled, and 2) for BE->FE, the parsing and decomposition of messages have already checked the messages.

> * If we have separate functions for each message type, then we can have
> that function supply the message name by itself, rather than have the
> separate protocol_message_type_f / _b arrays that print it.

I felt those two arrays were clumsy and thought of suggesting their removal. But I didn't, because the functions or case labels for each message type would then have to duplicate the printing of the header fields: timestamp, message type, and length. Maybe we can change the order of output to "timestamp length message_type content", but I kind of prefer the current order. What do you think? (I can understand that removing the clumsy static arrays may be better than preserving the order of output fields.)

> * If we make each type-specific function print the message name, then we
> need to make the same function print the message length, because it
> comes after. So the function would have to receive the length (so
> that it can be printed).
> But I think we should still have the
> cross-check I mentioned in my first point occur in the main
> pqTraceOutputMessage, not the type-specific function, for fear that we
> will forget the cross-check in one of the functions and never realize
> that we did.

As mentioned above, we think the current structure is better for smaller and more readable code.

> * I would make the pqTraceOutputInt16() function and siblings advance
> the cursor themselves, actually. I think something like this:
>
> static int
> pqTraceOutputInt16(const char *data, int *cursor, FILE *pfdebug)
> {
>     uint16    tmp;
>     int       result;
>
>     memcpy(&tmp, data + *cursor, 2);
>     *cursor += 2;
>     result = (int) pg_ntoh16(tmp);
>     fprintf(pfdebug, " #%d", result);
>
>     return result;
> }
>
> (So the caller has to pass the original "data" pointer, not
> "data+cursor"). This means the caller no longer has to do "cursor+=N"
> at each place. Also, you get rid of the moveStrCursor() which does
> not look good.

That makes sense, because in fact the patch mistakenly added 4 when it should have added 2. Also, the code would become smaller.

> * I'm not fond of the idea of prefixing "#" for 16bit ints and no
> prefix for 32bit ints. Seems obscure and the output looks weird.
> I would use a one-letter prefix for each type, "w" for 32-bit ints
> (stands for "word") and "h" for 16-bit ints (stands for "half-word").
> Message length goes unadorned. Then you'd have lines like this
>
> 2021-03-15 08:10:44.324296 < RowDescription 35 h1 "spcoptions" w1213 h5 w1009 h65535 w-1 h0
> 2021-03-15 08:10:44.324335 < DataRow 32 h1 w22 '{random_page_cost=3.0}'

Yes, actually I felt something similar. On second thought, I think we don't have to have any prefix, because it doesn't help users. So we're removing the prefix. We don't have a strong opinion about adding some prefix.

> * I don't like that pqTraceOutputS/H print nothing when !toServer. I
> think they should complain.

Yes, the caller should not call the function if there's no message content.
That way, the function doesn't need the toServer argument. Regards Takayuki Tsunakawa
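Alvaro's cursor-advancing helper in stand-alone, compilable form. For illustration, pg_ntoh16 is replaced by the standard ntohs, the function is renamed trace_output_int16, and the "#" prefix is dropped per the decision above:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/*
 * Read a 16-bit big-endian integer from the message buffer at *cursor,
 * print it, and advance the caller's cursor by exactly the number of
 * bytes consumed.  Centralizing the cursor bump here prevents mistakes
 * like the one mentioned above, where a caller added 4 when it should
 * have added 2.
 */
int trace_output_int16(const char *data, int *cursor, FILE *pfdebug)
{
    uint16_t tmp;
    int      result;

    memcpy(&tmp, data + *cursor, 2);
    *cursor += 2;
    result = (int) ntohs(tmp);
    fprintf(pfdebug, " %d", result);
    return result;
}
```

The caller always passes the original buffer pointer plus the shared cursor, so successive calls naturally walk the message without any per-call-site arithmetic.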
Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW
On Tue, Mar 16, 2021 at 2:41 PM Thomas Munro wrote: > On Mon, Mar 15, 2021 at 8:25 PM Bharath Rupireddy > wrote: > > > > The problem with a case like REFRESH MATERIALIZED VIEW is that there's > > > > nothing to prevent something that gets run in the course of the query > > > > from trying to access the view (and the heavyweight lock won't prevent > > > > that, due to group locking). That's probably a stupid thing to do, > > > > but it can't be allowed to break the world. The other cases are safe > > > > from that particular problem because the table doesn't exist yet. > > > > Please correct me if my understanding of the above comment (from the > > commit e9baa5e9) is wrong - even if the leader opens the matview > > relation in exclusive mode, because of group locking(in case we allow > > parallel workers to feed in the data to the new heap that gets created > > for RMV, see ExecRefreshMatView->make_new_heap), can other sessions > > still access the matview relation with older data? > > > > I performed below testing to prove myself wrong for the above understanding: > > session 1: > > 1) added few rows to the table t1 on which the mv1 is defined; > > 2) refresh materialized view mv1; > > > > session 2: > > 1) select count(*) from mv1; ---> this query is blocked until > > session 1's step (2) is completed and gives the latest result even if > > the underlying data-generating query runs select part in parallel. > > > > Is there anything I'm missing here? > > I think he was talking about things like functions that try to access > the mv from inside the same query, in a worker. I haven't figured out > exactly which hazards he meant. I thought about wrong-relfilenode > hazards and combocid hazards, but considering the way this thing > always inserts into a fresh table before performing merge or swap > steps later, I don't yet see why this is different from any other > insert-from-select-with-gather. 
I asked Robert if he had some hazard in mind that we haven't already discussed here when he wrote that, and he didn't recall any. I think we're OK here.

I added the "concurrently" variant to the regression test, just to get it exercised too. The documentation needed a small tweak where we have a list of data-writing commands that are allowed to use parallelism. That run of sentences was getting a bit tortured, so I converted it into a bullet list; I hope I didn't upset the documentation style police. Pushed. Thanks for working on this! This is really going to fly with INSERT pushdown. The 3 merge queries used by CONCURRENTLY will take some more work.
Re: crash during cascaded foreign key update
On Tue, Mar 16, 2021 at 11:17 PM Tom Lane wrote: > Amit Langote writes: > > With HEAD (I think v12 and greater), I see $subject when trying out > > the following scenario: > > I wonder if this is related to > > https://www.postgresql.org/message-id/flat/89429.1584443208%40antos > > which we've still not done anything about. Ah, indeed the same issue. Will read through that discussion first, thanks. -- Amit Langote EDB: http://www.enterprisedb.com
Re: cleanup temporary files after crash
On Sun, Mar 14, 2021, at 11:01 PM, Thomas Munro wrote: > On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier wrote: > > On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote: > > > Let's move this patch forward. Based on the responses, I agree the > > > default behavior should be to remove the temp files, and I think we > > > should have the GUC (on the off chance that someone wants to preserve > > > the temporary files for debugging or whatever other reason). > > > > Thanks for taking care of this. I am having some second-thoughts > > about changing this behavior by default, still that's much more useful > > this way. > > +1 for having it on by default. > > I was also just looking at this patch and came here to say LGTM except > for two cosmetic things, below. Thanks for taking a look at this patch. I'm not sure whether Tomas is preparing a new patch that includes the suggested modifications, so I decided to do it. This new version has the new GUC name (using "remove"). I also replaced "cleanup" with "remove" in all the remaining places. As pointed out by Thomas, I reworded the paragraph that describes the GUC, moving the default information to the beginning of the sentence. I also added the "literal" markup as suggested by Michael. The postgresql.conf.sample was fixed. The tests were slightly modified. I reworded some comments and added a hack to avoid breaking the temporary file test on slow machines. A usleep() after sending the query provides some time for the query to create the temporary file. I used an arbitrary sleep (10 ms) that seems to be sufficient. > > +#remove_temp_files_after_crash = on# remove temporary files after > > +# # backend crash? > > The indentation of the second line is incorrect here (Incorrect number > > of spaces in tabs perhaps?), and there is no need for the '#' at the > > beginning of the line. > > Yeah, that's wrong.
For some reason that one file uses a tab size of > 8, unlike the rest of the tree (I guess because people will read that > file in software with the more common setting of 8). If you do :set > tabstop=8 in vim, suddenly it all makes sense, but it is revealed that > this patch has it wrong, as you said. (Perhaps this file should have > some of those special Vim/Emacs control messages so we don't keep > getting this wrong?) I hadn't noticed that this file uses ts=8. (This explains the misalignment that I observed in some parameter comments.) I'm not sure if people care about the indentation in this file. From the development perspective, we can use the same tab width as the source code, so there is no need to change your dev environment. However, a regular user doesn't follow the dev guidelines and could well observe the misalignment while using Vim (which defaults to ts=8), for example. For now, I will add autocmd BufRead,BufNewFile postgresql.conf* setlocal ts=8 to my .vimrc. We should probably fix some settings that are misaligned such as parallel_setup_cost and shared_preload_libraries. The parameters timezone_abbreviations and max_pred_locks_per_page are using spaces instead of tabs and should probably be fixed too. -- Euler Taveira EDB https://www.enterprisedb.com/ From 8dfee5694210c14054170563ff01fc15a66a76b0 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Mon, 25 May 2020 00:08:20 -0300 Subject: [PATCH v2] Control temporary files removal after crash A new GUC remove_temp_files_after_crash controls whether temporary files are removed after a crash. Successive crashes could result in useless storage usage until the service is restarted. That can be a problem on hosts with inadequate resources, and such manual intervention is undesirable in some environments. This GUC is marked as SIGHUP, so you don't have to interrupt the service to change it.
The new behavior introduces a minor backward incompatibility: Postgres will now reclaim temporary file space after a crash. The main reason is that we favor service continuity over debugging. --- doc/src/sgml/config.sgml | 18 ++ src/backend/postmaster/postmaster.c | 5 + src/backend/storage/file/fd.c | 12 +- src/backend/utils/misc/guc.c | 9 + src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/postmaster/postmaster.h | 1 + src/test/recovery/t/022_crash_temp_files.pl | 192 ++ 7 files changed, 234 insertions(+), 5 deletions(-) create mode 100644 src/test/recovery/t/022_crash_temp_files.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5e551b9f6d..c0b1b48136 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -9648,6 +9648,24 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + remove_temp_files_after_crash (boolean) + + remove_temp_files_after_crash configuration parameter + + + + +
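For reference, a sketch of how the corrected postgresql.conf.sample entry would look per Michael's and Thomas's comments (alignment is assumed here: the continuation line is indented with tabs at ts=8, and its '#' marks the comment column rather than starting the line):

```
#remove_temp_files_after_crash = on	# remove temporary files after
					# backend crash?
```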
Re: A new function to wait for the backend exit after termination
On Tue, Mar 16, 2021 at 9:48 PM Magnus Hagander wrote: > Does it really make sense that pg_wait_for_backend_termination() > defaults to waiting *100 milliseconds*, and then logs a warning? That > seems extremely short if I'm explicitly asking it to wait. I increased the default wait timeout to 5 seconds. > Wait events should be in alphabetical order in pgstat_get_wait_ipc() > as well, not just in the header (which was adjusted per Fujii's > comment) Done. > > + (errmsg("could not wait for the termination of > the backend with PID %d within %lld milliseconds", > > That's not true though? The wait succeeded, it just timed out? Isn't > it more like "backend with PID %d did not terminate within %lld > milliseconds"? Looks better. Done. Attaching v10 patch for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v10-0001-pg_terminate_backend-with-wait-and-timeout.patch Description: Binary data
Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021-03-17 08:25, Zhihong Yu wrote: Hi, Thanks for your comments! + * Simple signal handler for processes HAVE NOT yet touched or DO NOT I think there should be a 'which' between processes and HAVE. It seems the words in Capital letters should be in lower case. + * Simple signal handler for processes have touched shared memory to + * exit quickly. Add 'which' between processes and have. OK, I fixed them. unlink(fname); + + elog(DEBUG2, "removing stats file \"%s\"", fname); It seems the return value from unlink should be checked and reflected in the debug message. There is related code that logs and calls unlink() in slru.c and pgstat.c. ``` pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) { // some code elog(DEBUG2, "removing temporary stats file \"%s\"", statfile); unlink(statfile); } ``` I fixed it in the same way, without checking the return value, because IIUC it's unimportant whether an error occurred; the log just shows that we tried to remove the file. Thoughts? Regards, -- Masahiro Ikeda NTT DATA CORPORATION

diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c index dd9136a942..df8e0eb081 100644 --- a/src/backend/postmaster/interrupt.c +++ b/src/backend/postmaster/interrupt.c @@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS) } /* - * Simple signal handler for exiting quickly as if due to a crash. + * Simple signal handler for processes which have not yet touched or do not + * touch shared memory to exit quickly. * - * Normally, this would be used for handling SIGQUIT. + * If processes already touched shared memory, call + * SignalHandlerForCrashExit() because shared memory may be corrupted. + */ +void +SignalHandlerForUnsafeExit(SIGNAL_ARGS) +{ + /* + * Since we don't touch shared memory, we can just pull the plug and exit + * without running any atexit handlers. + */ + _exit(1); +} + +/* + * Simple signal handler for processes which have touched shared memory to + * exit quickly.
+ * + * Normally, this would be used for handling SIGQUIT as if due to a crash. */ void SignalHandlerForCrashExit(SIGNAL_ARGS) @@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS) * shut down and exit. * * Typically, this handler would be used for SIGTERM, but some processes use - * other signals. In particular, the checkpointer exits on SIGUSR2, the - * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT - * or SIGTERM. + * other signals. In particular, the checkpointer exits on SIGUSR2 and the + * WAL writer exits on either SIGINT or SIGTERM. * * ShutdownRequestPending should be checked at a convenient place within the * main loop, or else the main loop should call HandleMainLoopInterrupts. diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index b1e2d94951..31b461eb18 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -724,6 +724,7 @@ pgstat_reset_remove_files(const char *directory) snprintf(fname, sizeof(fname), "%s/%s", directory, entry->d_name); + elog(DEBUG2, "removing stats file \"%s\"", fname); unlink(fname); } FreeDir(dir); @@ -4821,13 +4822,19 @@ PgstatCollectorMain(int argc, char *argv[]) /* * Ignore all signals usually bound to some action in the postmaster, - * except SIGHUP and SIGQUIT. Note we don't need a SIGUSR1 handler to - * support latch operations, because we only use a local latch. + * except SIGHUP, SIGTERM and SIGQUIT. Note we don't need a SIGUSR1 + * handler to support latch operations, because we only use a local latch. */ pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); - pqsignal(SIGTERM, SIG_IGN); - pqsignal(SIGQUIT, SignalHandlerForShutdownRequest); + pqsignal(SIGTERM, SignalHandlerForShutdownRequest); + + /* + * If SIGQUIT is received, exit quickly without doing any additional work, + * for example writing stats files. We arrange to do _exit(1) because the + * stats collector doesn't touch shared memory. 
+ */ + pqsignal(SIGQUIT, SignalHandlerForUnsafeExit); pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, SIG_IGN); @@ -4852,8 +4859,8 @@ PgstatCollectorMain(int argc, char *argv[]) AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL); /* - * Loop to process messages until we get SIGQUIT or detect ungraceful - * death of our parent postmaster. + * Loop to process messages until we get SIGTERM or SIGQUIT of our parent + * postmaster. * * For performance reasons, we don't want to do ResetLatch/WaitLatch after * every message; instead, do that only after a recv() fails to obtain a @@ -4871,7 +4878,7 @@ PgstatCollectorMain(int argc, char *argv[]) ResetLatch(MyLatch); /* - * Quit if we get SIGQUIT from the postmaster. + * Quit if we get SIGTERM from the postmaster. */ if (ShutdownRequestPending)
Getting better results from valgrind leak tracking
[ starting a new thread for this ] Andres Freund writes: > I wonder if it'd be worth starting to explicitly annotate all the places > that do allocations and are fine with leaking them. E.g. by introducing > malloc_permanently() or such. Right now it's hard to use valgrind et al > to detect leaks because of all the false positives due to such "ok to > leak" allocations. Out of curiosity I poked at this for a little while. It doesn't appear to me that we leak much at all, at least not if you are willing to take "still reachable" blocks as not-leaked. Most of the problem is more subtle than that. I found just a couple of things that really seem like leaks of permanent data structures to valgrind: * Where ps_status.c copies the original "environ" array (on PS_USE_CLOBBER_ARGV platforms), valgrind thinks that's all leaked, implying that it doesn't count the "environ" global as a valid reference to leakable data. I was able to shut that up by also saving the pointer into an otherwise-unused static variable. (This is sort of a poor man's implementation of your "malloc_permanently" idea; but I doubt it's worth working harder, given the small number of use-cases.) * The postmaster's sock_paths and lock_files lists appear to be leaked, but we're doing that to ourselves by throwing away the pointers to them without physically freeing the lists. We can just not do that. What I found out is that we have a lot of issues that seem to devolve to valgrind not being sure that a block is referenced. I identified two main causes of that: (1) We have a pointer, but it doesn't actually point right at the start of the block. A primary culprit here is lists of thingies that use the slist and dlist infrastructure. As an experiment, I moved the dlist_node fields of some popular structs to the beginning, and verified that that silences associated complaints. 
I'm not sure that we want to insist on put-the-link-first as policy (although if we did, it could provide some notational savings perhaps). However, unless someone knows of a way to teach valgrind about this situation, there may be no other way to silence those leakage complaints. A secondary culprit is people randomly applying CACHELINEALIGN or the like to a palloc'd address, so that the address we have isn't pointing right at the block start. (2) The only pointer to the start of a block is actually somewhere within the block. This is common in dynahash tables, where we allocate a slab of entries in a single palloc and then thread them together. Each entry should have exactly one referencing pointer, but that pointer is more likely to be elsewhere within the same palloc block than in the external hash bucket array. AFAICT, all cases except where the slab's first entry is pointed to by a hash bucket pointer confuse valgrind to some extent. I was able to hack around this by preventing dynahash from allocating more than one hash entry per palloc, but I wonder if there's a better way. Attached is a very crude hack, not meant for commit, that hacks things up enough to greatly reduce the number of complaints with "--leak-check=full". One thing I've failed to silence so far is a bunch of entries like ==00:00:03:56.088 3467702== 1,861 bytes in 67 blocks are definitely lost in loss record 1,290 of 1,418 ==00:00:03:56.088 3467702==at 0x950650: MemoryContextAlloc (mcxt.c:827) ==00:00:03:56.088 3467702==by 0x951710: MemoryContextStrdup (mcxt.c:1179) ==00:00:03:56.088 3467702==by 0x91C86E: RelationInitIndexAccessInfo (relcache.c:1444) ==00:00:03:56.088 3467702==by 0x91DA9C: RelationBuildDesc (relcache.c:1200) which is complaining about the memory context identifiers for system indexes' rd_indexcxt contexts. Those are surely not being leaked in any real sense. 
I suspect that this has something to do with valgrind not counting the context->ident fields as live pointers, but I don't have enough valgrind-fu to fix that. Anyway, the bottom line is that I do not think that we have all that many uses of the pattern you postulated originally. It's more that we've designed some valgrind-unfriendly data structures. We need to improve that situation to make much progress here. regards, tom lane diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 4c7b1e7bfd..cd984929a6 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -888,7 +888,6 @@ RemoveSocketFiles(void) (void) unlink(sock_path); } /* Since we're about to exit, no need to reclaim storage */ - sock_paths = NIL; } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 55c9445898..2abdd44190 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -814,7 +814,7 @@ InitCatCache(int id, * Note: we rely on zeroing to initialize all the dlist headers correctly */ sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE; - cp = (CatCache *) CACHELINEALIGN(
Re: pg_subscription - substream column?
On Wed, Mar 17, 2021 at 12:45 AM Amit Kapila wrote: > > > Attached, please find the patch to update the description of substream > in pg_subscription. > I applied your patch and regenerated the PG docs to check the result. LGTM. Kind Regards, Peter Smith. Fujitsu Australia
Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
Hi, + * Simple signal handler for processes HAVE NOT yet touched or DO NOT I think there should be a 'which' between processes and HAVE. It seems the words in Capital letters should be in lower case. + * Simple signal handler for processes have touched shared memory to + * exit quickly. Add 'which' between processes and have. unlink(fname); + + elog(DEBUG2, "removing stats file \"%s\"", fname); It seems the return value from unlink should be checked and reflected in the debug message. Thanks On Tue, Mar 16, 2021 at 4:09 PM Masahiro Ikeda wrote: > On 2021-03-16 13:44, kuroda.hay...@fujitsu.com wrote: > > Dear Ikeda-san > > > > I think the idea is good. > > > > I read the patch and other sources, and I found > > process_startup_packet_die also executes _exit(1). > > I think they can be combined into one function and moved to > > interrupt.c, but > > some important comments might be removed. What do you think? > > Hi, Kuroda-san. > Thanks for your comments. > > I agree with your idea. > I combined them into one function and moved the comments to > the calling function side. > (v2-0001-pgstat_avoid_writing_on_sigquit.patch) > > Regards, > -- > Masahiro Ikeda > NTT DATA CORPORATION
RE: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021-03-16 13:44, kuroda.hay...@fujitsu.com wrote: Dear Ikeda-san I think the idea is good. I read the patch and other sources, and I found process_startup_packet_die also executes _exit(1). I think they can be combined into one function and moved to interrupt.c, but some important comments might be removed. What do you think? Hi, Kuroda-san. Thanks for your comments. I agree with your idea. I combined them into one function and moved the comments to the calling function side. (v2-0001-pgstat_avoid_writing_on_sigquit.patch) Regards, -- Masahiro Ikeda NTT DATA CORPORATION

diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c index dd9136a942..9b25294a14 100644 --- a/src/backend/postmaster/interrupt.c +++ b/src/backend/postmaster/interrupt.c @@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS) } /* - * Simple signal handler for exiting quickly as if due to a crash. + * Simple signal handler for processes HAVE NOT yet touched or DO NOT + * touch shared memory to exit quickly. * - * Normally, this would be used for handling SIGQUIT. + * If processes already touched shared memory, call + * SignalHandlerForCrashExit() because shared memory may be corrupted. + */ +void +SignalHandlerForUnsafeExit(SIGNAL_ARGS) +{ + /* + * Since we don't touch shared memory, we can just pull the plug and exit + * without running any atexit handlers. + */ + _exit(1); +} + +/* + * Simple signal handler for processes have touched shared memory to + * exit quickly. + * + * Normally, this would be used for handling SIGQUIT as if due to a crash. */ void SignalHandlerForCrashExit(SIGNAL_ARGS) @@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS) * shut down and exit. * * Typically, this handler would be used for SIGTERM, but some processes use - * other signals.
In particular, the checkpointer exits on SIGUSR2 and the + * WAL writer exits on either SIGINT or SIGTERM. * * ShutdownRequestPending should be checked at a convenient place within the * main loop, or else the main loop should call HandleMainLoopInterrupts. diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index b1e2d94951..b2156cec9d 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -725,6 +725,8 @@ pgstat_reset_remove_files(const char *directory) snprintf(fname, sizeof(fname), "%s/%s", directory, entry->d_name); unlink(fname); + + elog(DEBUG2, "removing stats file \"%s\"", fname); } FreeDir(dir); } @@ -4821,13 +4823,19 @@ PgstatCollectorMain(int argc, char *argv[]) /* * Ignore all signals usually bound to some action in the postmaster, - * except SIGHUP and SIGQUIT. Note we don't need a SIGUSR1 handler to - * support latch operations, because we only use a local latch. + * except SIGHUP, SIGTERM and SIGQUIT. Note we don't need a SIGUSR1 + * handler to support latch operations, because we only use a local latch. */ pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); - pqsignal(SIGTERM, SIG_IGN); - pqsignal(SIGQUIT, SignalHandlerForShutdownRequest); + pqsignal(SIGTERM, SignalHandlerForShutdownRequest); + + /* + * If SIGQUIT is received, exit quickly without doing any additional work, + * for example writing stats files. We arrange to do _exit(1) because the + * stats collector doesn't touch shared memory. + */ + pqsignal(SIGQUIT, SignalHandlerForUnsafeExit); pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, SIG_IGN); @@ -4852,8 +4860,8 @@ PgstatCollectorMain(int argc, char *argv[]) AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL); /* - * Loop to process messages until we get SIGQUIT or detect ungraceful - * death of our parent postmaster. + * Loop to process messages until we get SIGTERM or SIGQUIT of our parent + * postmaster. 
* * For performance reasons, we don't want to do ResetLatch/WaitLatch after * every message; instead, do that only after a recv() fails to obtain a @@ -4871,7 +4879,7 @@ PgstatCollectorMain(int argc, char *argv[]) ResetLatch(MyLatch); /* - * Quit if we get SIGQUIT from the postmaster. + * Quit if we get SIGTERM from the postmaster. */ if (ShutdownRequestPending) break; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 6436ae0f48..4e1d47e0a1 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -400,7 +400,6 @@ static void SIGHUP_handler(SIGNAL_ARGS); static void pmdie(SIGNAL_ARGS); static void reaper(SIGNAL_ARGS); static void sigusr1_handler(SIGNAL_ARGS); -static void process_startup_packet_die(SIGNAL_ARGS); static void dummy_handler(SIGNAL_ARGS); static void StartupPacketTimeoutHandler(void); static void CleanupBack
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada wrote: > > Note that I've merged multiple existing functions in vacuumlazy.c into > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > > into a single function named vacuum_indexes_mark_unused() > I agree to create a function like vacuum_indexes_mark_unused() that > makes a decision and does index and heap vacuuming accordingly. But > what is the point of removing both lazy_vacuum_all_indexes() and > lazy_vacuum_heap()? I think we can simply have > vacuum_indexes_mark_unused() call those functions. I'm concerned that > if we add additional criteria to vacuum_indexes_mark_unused() in the > future the function will become very large. I agree now. I became overly excited about advertising the fact that these two functions are logically one thing. This is important, but it isn't necessary to go as far as actually making everything into one function. Adding some comments would also make that point clear, but without making vacuumlazy.c even more spaghetti-like. I'll fix it. > > I wonder if we can add some kind of emergency anti-wraparound vacuum > > logic to what I have here, for Postgres 14. > +1 > > I think we can set VACOPT_TERNARY_DISABLED to > tab->at_params.index_cleanup in table_recheck_autovac() or increase > the thresholds used to decide whether to skip index vacuuming. I was worried about the "tupgone = true" special case causing problems when we decide to do some index vacuuming and some heap vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM. But I now think that getting rid of "tupgone = true" gives us total freedom to choose what to do, including the freedom to start with index vacuuming and then give up on it later -- even after we do some amount of LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps due to a low maintenance_work_mem setting). That isn't actually special at all, because everything will be 100% decoupled.
Whether or not it's a good idea to either not start index vacuuming or to end it early (e.g. due to XID wraparound pressure) is a complicated question, and the right approach will be debatable in each case/when thinking about each issue. However, deciding on the best behavior to address these problems should have nothing to do with implementation details and everything to do with the costs and benefits to users. Which now seems possible. A sophisticated version of the "XID wraparound pressure" implementation could increase reliability without ever being conservative, just by evaluating the situation regularly and being prepared to change course when necessary -- until it is truly a matter of shutting down new XID allocations/the server. It should be possible to decide to end VACUUM early and advance relfrozenxid (provided we have reached the point of finishing all required pruning and freezing, of course). Highly agile behavior seems quite possible, even if it takes a while to agree on a good design. > Aside from whether it's good or bad, strictly speaking, it could > change the index AM API contract. The documentation of > amvacuumcleanup() says: > > --- > stats is whatever the last ambulkdelete call returned, or NULL if > ambulkdelete was not called because no tuples needed to be deleted. > --- > > With this change, we could pass NULL to amvacuumcleanup even though > the index might have tuples needed to be deleted. I think that we should add a "Note" box to the documentation that notes the difference here. Though FWIW my interpretation of the words "no tuples needed to be deleted" was always "no tuples needed to be deleted because vacuumlazy.c didn't call ambulkdelete()". After all, VACUUM can add to tups_vacuumed through pruning inside heap_prune_chain(). It is already possible (though only barely) to not call ambulkdelete() for indexes (to only call amvacuumcleanup() during cleanup) despite the fact that heap vacuuming did "delete tuples". 
It's not that important, but my point is that the design has always been top-down -- an index AM "needs to delete" whatever it is told it needs to delete. It has no direct understanding of any higher-level issues. > As I mentioned in a recent reply, I'm concerned about a case where we > ran out of maintenance_work_mem and decided not to skip index vacuuming > but collected only a few dead tuples in the second index vacuuming > (i.e., the total amount of dead tuples is slightly larger than > maintenance_work_mem). In this case, I think we can skip the second > (i.e., final) index vacuuming at least in terms of > maintenance_work_mem. Maybe the same is true in terms of LP_DEAD > accumulation. I remember that. That now seems very doable, but time grows short... I have already prototyped Andres' idea, which was to eliminate the HEAPTUPLE_DEAD case inside lazy_scan_heap() by restarting pruning for the same page. I've also moved the pruning into its own function called lazy_scan_heap_page(), because handling the restart requires that we b
Re: Boundary value check in lazy_tid_reaped()
We could also go parallel in another direction - I have been mulling over writing a "vectorized" bsearch which would use AVX2, where you look up 64 (or more likely 32, so tids also fit in a 256-bit vector) tids at a time. The trickiest part is that the search can complete at a different iteration for each value. Of course it is possible that this has a very bad RAM access behaviour and is no win at all even if it otherwise works. -- Hannu Krosing On Tue, Mar 16, 2021 at 10:08 PM Peter Geoghegan wrote: > On Sun, Mar 14, 2021 at 4:22 PM Thomas Munro > wrote: > > BTW I got around to trying this idea out for a specialised > > bsearch_itemptr() using a wide comparator, over here: > > Cool! > > I have another thing that should be considered when we revisit this > area in the future: maybe we should structure the binary search to > lookup multiple TIDs at once -- probably all of the TIDs from a given > leaf page, taken together. > > There is now an nbtree function used for tuple deletion (all tuple > deletion, not just bottom-up deletion) that works like this: > _bt_delitems_delete_check(). I suspect that it would make sense to > generalize it to do the same thing for regular VACUUM. Perhaps this > idea would have to be combined with other techniques to show a real > benefit. It would probably be necessary to sort the TIDs first (just > like index_delete_sort() does for the _bt_delitems_delete_check() code > today), but that's probably no big deal. > > It is typical to have 200 - 400 TIDs on an nbtree leaf page without > using deduplication. And with deduplication enabled you can have as > many as 1300 TIDs on a single 8KiB nbtree leaf page. It's easy to > imagine something like GCC's __builtin_prefetch() (or maybe just more > predictable access patterns in our "batch binary search") making > everything much faster through batching.
This will also naturally make > btvacuumpage() much less "branchy", since of course it will no longer > need to process the page one TID at a time -- that helps too. > > -- > Peter Geoghegan > > >
Re: Boundary value check in lazy_tid_reaped()
On Sun, Mar 14, 2021 at 4:22 PM Thomas Munro wrote: > BTW I got around to trying this idea out for a specialised > bsearch_itemptr() using a wide comparator, over here: Cool! I have another thing that should be considered when we revisit this area in the future: maybe we should structure the binary search to lookup multiple TIDs at once -- probably all of the TIDs from a given leaf page, taken together. There is now an nbtree function used for tuple deletion (all tuple deletion, not just bottom-up deletion) that works like this: _bt_delitems_delete_check(). I suspect that it would make sense to generalize it to do the same thing for regular VACUUM. Perhaps this idea would have to be combined with other techniques to show a real benefit. It would probably be necessary to sort the TIDs first (just like index_delete_sort() does for the _bt_delitems_delete_check() code today), but that's probably no big deal. It is typical to have 200 - 400 TIDs on an nbtree leaf page without using deduplication. And with deduplication enabled you can have as many as 1300 TIDs on a single 8KiB nbtree leaf page. It's easy to imagine something like GCC's __builtin_prefetch() (or maybe just more predictable access patterns in our "batch binary search") making everything much faster through batching. This will also naturally make btvacuumpage() much less "branchy", since of course it will no longer need to process the page one TID at a time -- that helps too. -- Peter Geoghegan
Re: [HACKERS] PATCH: Batch/pipelining support for libpq
Hi, On 2021-03-05 21:35:59 -0300, Alvaro Herrera wrote: > I'll take the weekend to think about the issue with conn->last_query and > conn->queryclass that I mentioned yesterday; other than that detail my > feeling is that this is committable, so I'll be looking at getting this > pushed early next weeks, barring opinions from others. It is *very* exciting to see this being merged. Thanks for all the work to all that contributed! Greetings, Andres Freund
Re: shared-memory based stats collector
Hi, On 2021-03-16 15:08:39 -0400, Robert Haas wrote: > On Mon, Mar 15, 2021 at 10:56 PM Andres Freund wrote: > > I did roughly the first steps of the split as I had outlined. I moved: > > > > 1) wait event related functions into utils/activity/wait_event.c / > >wait_event.h > > > > 2) "backend status" functionality (PgBackendStatus stuff) into > >utils/activity/backend_status.c > > > > 3) "progress" related functionality into > >utils/activity/backend_progress.c > > In general, I like this. I'm not too sure about the names. I realize > you don't want to have functions called status.c and progress.c, > because that's awfully generic, but now you have backend_progress.c, > backend_status.c, and wait_event.c, which makes the last one look a > little strange. Maybe command_progress.c instead of > backend_progress.c? I'm not thrilled about the names I ended up with either, so I'm happy to get some ideas. I did consider command_progress.c too - but that seems confusing because there's src/include/commands/progress.h, which is imo a different layer than what pgstat/backend_progress provide. So I thought it would make sense to split things up so that backend_progress.[ch] provides the place to store the progress values, while commands/progress.h defines the meaning of the values as used by in-core postgres commands. I could see us using the general progress infrastructure for things that'd not fit super well into commands/* at some point... But I'd also be ok with folding in commands/progress.h. > > I think 1 and 2 are good (albeit in need of further polish). I'm a bit > > less sure about 3: > > - There's a dependency from backend_status.h to backend_progress.h, > > because it uses PGSTAT_NUM_PROGRESS_PARAM etc. > > That doesn't seem like a catastrophe. > > > - it's a fairly small amount of code > > But it's not bad to have it separate. Agreed.
> > - I'm inclined to leave pgstat_report_{activity, tmpfile, appname, > > timestamp, ..} alone naming-wise, but to rename pgstat_bestart() to > > something like pgbestat_start()? > > I'd probably rename them e.g. command_progress_start(), > comand_progress_update_param(), etc. Hm. There's ~250 calls to pgstat_report_*. Which are you proposing to rename? In my mind there's at least the following groups with "inaccurately" overlapping names: 1) "stats reporting" functions, like pgstat_report_{bgwriter, archiver,...}(), pgstat_count_*(), pgstat_{init, end}_function_usage(), 2) "backend activity" functions, like pgstat_report_activity() 2.1) "wait event" functions, like pgstat_report_wait_{start,end}() 3) "stats control" functions, like pgstat_report_stat() 4) "stats reporting" fetch functions like pgstat_fetch_stat_dbentry() 5) "backend activity" fetch functions like pgstat_fetch_stat_numbackends(), pgstat_fetch_stat_beentry() I'd not quite group the progress functions as part of that, because they do already have a distinct namespace, even though perhaps pgstat_* isn't a great prefix. > > - backend_status.h needs miscadmin.h, due to BackendType. Imo that's a > > header we should try to avoid exposing in headers if possible. But > > right now there's no good place to move BackendType to. So I'd let > > this slide for now. > > Why the concern? miscadmin.h is extremely widely-included already. In .c files, but not from a lot of headers... There's only two places, pgstat.h and regex/regcustom.h. For me it a weird mix of things that should be included from very few places, and super widely used stuff. I already feel bad including it from .c files, but indirect includes in .h files imo should be as narrow as possible. > Maybe it should be broken up into pieces so that we're not including > so MUCH stuff in a zillion places, but the header that contains the > definition of CHECK_FOR_INTERRUPTS() is always going to be needed in a > ton of spots. 
Honestly, I wonder why we don't just put that part in > postgres.h. I'd not be against that. I'd personally put CFI() in a separate header, but include it from postgres.h. I don't think there's much else in there that should be as widely used. The closest is INTERRUPTS and CRIT_SECTION stuff, but that should be less frequent. Greetings, Andres Freund
Re: pg_amcheck contrib application
On Mon, Mar 15, 2021 at 10:10 PM Mark Dilger wrote: > It is unfortunate that the failing test only runs pg_amcheck after creating > numerous corruptions, as we can't know if pg_amcheck would have complained > about pg_statistic before the corruptions were created in other tables, or if > it only does so after. The attached patch v7-0003 adds a call to pg_amcheck > after all tables are created and populated, but before any corruptions are > caused. This should help narrow down what is happening, and doesn't hurt to > leave in place long-term. > > I don't immediately see anything wrong with how pg_statistic uses a > pseudo-type, but it leads me to want to poke a bit more at pg_statistic on > hornet and tern, though I don't have any regression tests specifically for > doing so. > > Tests v7-0001 and v7-0002 are just repeats of the tests posted previously. Since we now know that shutting autovacuum off makes the problem go away, I don't see a reason to commit 0001. We should fix pg_amcheck instead, if, as presently seems to be the case, that's where the problem is. I just committed 0002. I think 0003 is probably a good idea, but I haven't committed it yet. As for 0004, it seems to me that we might want to do a little more rewording of these messages and perhaps we should try to do it all at once. Like, for example, your other change to print out the toast value ID seems like a good idea, and could apply to any new messages as well as some existing ones. Maybe there are also more fields in the TOAST pointer for which we could add checks. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 11:40 AM, Robert Haas wrote: > > On Tue, Mar 16, 2021 at 2:22 PM Tom Lane wrote: >> I'm circling back around to the idea that amcheck is trying to >> validate TOAST references that are already dead, and it's getting >> burnt because something-or-other has already removed the toast >> rows, though not the referencing datums. That's legal behavior >> once the rows are marked dead. Upthread it was claimed that >> amcheck isn't doing that, but this looks like a smoking gun to me. > > I think this theory has some legs. From check_tuple_header_and_visibilty(): > >else if (!(infomask & HEAP_XMAX_COMMITTED)) >return true;/* > HEAPTUPLE_DELETE_IN_PROGRESS or > * > HEAPTUPLE_LIVE */ >else >return false; /* > HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */ >} >return true;/* not dead */ > } > > That first case looks wrong to me. Don't we need to call > get_xid_status() here, Mark? As coded, it seems that if the xmin is ok > and the xmax is marked committed, we consider the tuple checkable. The > comment says it must be HEAPTUPLE_DELETE_IN_PROGRESS or > HEAPTUPLE_LIVE, but it seems to me that if the actual answer is either > HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD depending on whether the > xmax is all-visible. And in the second case the comment says it's > either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, but I think in that > case it's either HEAPTUPLE_DELETE_IN_PROGRESS or > HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on the XID > status. > > Another thought here is that it's probably not wicked smart to be > relying on the hint bits to match the actual status of the tuple in > cases of corruption. Maybe we should be warning about tuples that are > have xmin or xmax flagged as committed or invalid when in fact clog > disagrees. That's not a particularly uncommon case, and it's hard to > check. 
This code was not committed as part of the recent pg_amcheck work, but longer ago, and I'm having trouble reconstructing exactly why it was written that way. Changing check_tuple_header_and_visibilty() fixes the regression test and also manual tests against the "regression" database that I've been using. I'd like to ponder the changes a while longer before I post, but the fact that these changes fix the tests seems to add credibility to this theory. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 2:54 PM Andres Freund wrote: > Oh, I guess it would make sense to do it that way. However, I was just > thinking of doing the iteration over the tuples that ExecEvalRow() etc > do inside heap_form_flattened_tuple() (or whatever). That'd not be any > worse than what the patch is doing now, just less duplication, and an > easier path towards optimizing it if we notice that we need to? It's a question of whether you copy the datum array. I don't think a generic function can assume that it's OK to scribble on the input array, or if it does, that'd better be very prominently mentioned in the comments. And copying into a new array has its own costs. 0002 is based on the theory that scribbling on the executor's array won't cause any problem, which I *think* is true, but isn't correct in all cases (e.g. if the input data is coming from a slot). If we pass a flag down to fill_val() and friends then we don't end up having to copy the arrays over so the problem goes away in that design. > The harder part would probably be to find a way to deal with the layers > above, without undue code duplication. I think it's not just fill_val() > that'd need to know, but also heap_compute_data_size(), > heap_fill_tuple() - both of which are externally visible (and iirc thus > not going to get inlined with many compiler options, due to symbol > interposition dangers). But we could have a > heap_compute_data_size_internal(bool flatten) that's called by > heap_compute_data_size(). And something similar for heap_form_tuple(). Hmm, yeah, that's not great. I guess there's nothing expensive we need to repeat - I think anyway - because we should be able to get the uncompressed size from the TOAST pointer itself. But the code would have to know to do that, as you say. -- Robert Haas EDB: http://www.enterprisedb.com
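The copy-vs-scribble question above — whether a generic flattening helper may modify its caller's datum array — admits a copy-on-write middle ground. The sketch below is a standalone toy model, not PostgreSQL's actual API: the Datum stand-in, datum_is_compressed(), and flatten_values() are all invented for illustration. The caller's array is left untouched, and a private copy is allocated only when at least one entry actually needs flattening, so the common all-flat case pays nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a Datum: negative values mark "compressed" entries. */
typedef long Datum;

static bool datum_is_compressed(Datum d) { return d < 0; }
static Datum datum_flatten(Datum d) { return -d; }  /* "decompress" */

/*
 * Copy-on-write flattening pass: never scribbles on 'values'.  Returns
 * the array to pack the tuple from; the caller frees the result iff it
 * differs from 'values'.
 */
static const Datum *
flatten_values(const Datum *values, int nvalues)
{
    Datum *copy = NULL;

    for (int i = 0; i < nvalues; i++)
    {
        if (!datum_is_compressed(values[i]))
            continue;
        if (copy == NULL)       /* first hit: pay the copy cost now */
        {
            copy = malloc(nvalues * sizeof(Datum));
            memcpy(copy, values, nvalues * sizeof(Datum));
        }
        copy[i] = datum_flatten(values[i]);
    }
    return copy ? copy : values;
}
```

A wrapper in this shape would keep the cost of the extra array confined to tuples that really contain a differently-compressed datum, which is one way to square Robert's concern about copying with Andres's wish to centralize the iteration.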
Re: pg_amcheck contrib application
On 3/13/21 1:30 AM, Andres Freund wrote: > Hi, > > On 2021-03-13 01:22:54 -0500, Tom Lane wrote: >> Mark Dilger writes: >>> On Mar 12, 2021, at 10:16 PM, Noah Misch wrote: hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no "prove64".) >> Oh, that's annoying. > I suspect we could solve that by changing our /usr/bin/prove > invocation to instead be PERL /usr/bin/prove? That might be a good thing > independent of this issue, because it's not unreasonable for a user to > expect that we'd actually use the perl installation they configured... > > Although I do not know how prove determines the perl installation it's > going to use for the test scripts... > There's a very good chance this would break msys builds, which are configured to build against a pure native (i.e. non-msys) perl such as AS or Strawberry, but need to run msys perl for TAP tests, so it gets the paths right. (Don't get me started on the madness that can result from managing this. I've lost weeks of my life to it ... If you add cygwin into the mix and you're trying to coordinate builds among three buildfarm animals it's a major pain.) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Calendar support in localization
On Wed, Mar 17, 2021 at 6:31 AM Surafel Temesgen wrote: > Ethiopice calendar have 13 months so it can not be stored as date and > timestamp type and you approach seems more complicated and i suggest to have > this feature on the purpose of PostgreSQL popularities too not only for my > need I know, but the DATE and TIMESTAMPTZ datatypes don't intrinsically know anything about months or other calendar concepts. Internally, they are just a single number that counts the number of days or seconds since an arbitrary epoch time. We are all in agreement about how many times the Earth has rotated since then*. The calendar concepts such as "day", "month", "year", whether Gregorian, Ethiopic, Islamic, ... are all derivable from those numbers, if you know the rules. So I think you should seriously consider using the same types. > Each calendar-aware date arithmetic is different so solving one calendar > problem didn't help on other calendar They have a *lot* in common though. They have similar "fields" (day, month, year etc), based on the Earth, moon, sun etc, so it is possible to use a common abstraction to interact with them. I haven't studied it too closely, but it looks like ICU can give you a "Calendar" object for a given Locale (which you create from a string like "am_ET@calendar=traditional") and timezone ("Africa/Addis_Ababa"). Then you can set the object's time to X seconds since an epoch, based on UTC seconds without leap seconds -- which is exactly like our TIMESTAMPTZ's internal value -- and then you can query it to get fields like month etc. Or do the opposite, or use formatting and parsing routines etc. Internally, ICU has a C++ class for each calendar with a name like EthiopicCalendar, IslamicCalendar etc which encapsulates all the logic, but it's not necessary to use them directly: we could just look them up with names via the C API and then treat them all the same. 
> I think you suggesting this by expecting the implementation is difficult but > it's not that much difficult once you fully understand Gregorian calendar and > the calendar you work on Yeah, I am sure it's all just a bunch of simple integer maths. But I'm talking about things like software architecture, maintainability, cohesion, and getting maximum impact for the work we do. I may be missing some key detail though: why do you think it should be a different type? The two reasons I can think of are: (1) the slightly tricky detail that the date apparently changes at 1:00am (which I don't think is a show stopper for this approach, I could elaborate), (2) you may want dates to be formatted on the screen with the Ethiopic calendar in common software like psql and GUI clients, which may be easier to arrange with different types, but that seems to be a cosmetic thing that could eventually be handled with tighter locale integration with ICU. In the early stages you'd access calendar logic though special functions with names like icu_format_date(), or whatever. Maybe I'm totally wrong about all of this, but this is the first way I'd probably try to tackle this problem, and I suspect it has the highest chance of eventually being included in core PostgreSQL. *I mean, we can discuss the different "timelines" like UT, UTC, TAI etc, but that's getting into the weeds, the usual timeline for computer software outside specialist scientific purposes is UTC without leap seconds.
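For what it's worth, the "bunch of simple integer maths" really is compact. The sketch below converts between a linear day count — analogous to how DATE stores days since an epoch, here in Rata Die numbering — and an Ethiopic year/month/day, using the widely documented rules: 12 months of 30 days plus a 13th month (Pagume) of 5 or 6 days, with a leap year whenever year % 4 == 3. The epoch constant and formulas follow the standard calendrical-calculation treatment; this is an illustration of the arithmetic, not ICU's implementation.

```c
#include <assert.h>

/* Rata Die (fixed day) number of Ethiopic 1/1/1 (Amete Mihret era). */
#define ETHIOPIC_EPOCH 2796

typedef struct
{
    int year, month, day;
} EthiopicDate;

/* Days from the fixed-day origin to the given Ethiopic date. */
static int
fixed_from_ethiopic(EthiopicDate e)
{
    return ETHIOPIC_EPOCH - 1
        + 365 * (e.year - 1) + e.year / 4   /* whole years, incl. leaps */
        + 30 * (e.month - 1) + e.day;       /* months are uniformly 30 days */
}

/* Inverse: fixed day number back to Ethiopic year/month/day. */
static EthiopicDate
ethiopic_from_fixed(int date)
{
    EthiopicDate e;

    e.year = (4 * (date - ETHIOPIC_EPOCH) + 1463) / 1461;
    e.month = (date - fixed_from_ethiopic((EthiopicDate){e.year, 1, 1})) / 30 + 1;
    e.day = date + 1 - fixed_from_ethiopic((EthiopicDate){e.year, e.month, 1});
    return e;
}
```

Point being: the conversion layer is small enough to live behind functions (or a library like ICU) operating on the existing DATE/TIMESTAMPTZ internal values, rather than requiring a new on-disk type.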
Re: automatic analyze: readahead - add "IO read time" log message
Greetings, * Stephen Frost (sfr...@snowman.net) wrote: > > >> I think e.g. prefetch_targblock could be moved to the next #ifdef, which > > >> will eliminate the one-line ifdef. > > > > > > Sure, done in the attached. > > > > > > Thanks for the review! Unless there's other comments, I'll plan to push > > > this over the weekend or early next week. > > > > +1 > > Thanks again for the review! and pushed. Thanks everyone for the suggestions and reviews, and to Jakub in particular for starting us down this path towards improving things in ANALYZE. Stephen
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 11:21 AM Dilip Kumar wrote: > If that is only the argument then it's possible today as well. I mean > you can INSERT INTO .. SELECT FROM where source attribute as > compressed data but the target attribute as external storage then also > we will move the compressed data as it is to the target table. Uggh. I don't like that behavior either, but I guess if it's the long-established way things work then perhaps this is no worse. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 2:45 PM Tom Lane wrote: > > In what context is it OK to just add extra alignment padding? > > It's *not* extra, according to pg_statistic's tuple descriptor. > Both forming and deforming of pg_statistic tuples should honor > that and locate stavaluesX values on d-aligned boundaries. > > It could be that a particular entry is of an array type that > only requires i-alignment. But that doesn't break anything, > it just means we inserted more padding than an omniscient > implementation would do. OK, yeah, I just misunderstood what you were saying. -- Robert Haas EDB: http://www.enterprisedb.com
Re: shared-memory based stats collector
On Mon, Mar 15, 2021 at 10:56 PM Andres Freund wrote: > I did roughly the first steps of the split as I had outlined. I moved: > > 1) wait event related functions into utils/activity/wait_event.c / >wait_event.h > > 2) "backend status" functionality (PgBackendStatus stuff) into >utils/activity/backend_status.c > > 3) "progress" related functionality into >utils/activity/backend_progress.c In general, I like this. I'm not too sure about the names. I realize you don't want to have functions called status.c and progress.c, because that's awful generic, but now you have backend_progress.c, backend_status.c, and wait_event.c, which makes the last one look a little strange. Maybe command_progress.c instead of backend_progress.c? > I think 1 and 2 are good (albeit in need of further polish). I'm a bit > less sure about 3: > - There's dependency from backend_status.h to backend_progress.h, > because it uses PGSTAT_NUM_PROGRESS_PARAM etc. That doesn't seem like a catastrophe. > - it's a fairly small amount of code But it's not bad to have it separate. > - there's potential for confusion, because there's also > include/commands/progress.h That could be merged, perhaps. I think I only created that because I didn't want to jam too much stuff into pgstat.h. But if it has its own header then jamming some more stuff in there seems more OK. > - I'm inclined to leave pgstat_report_{activity, tmpfile, appname, > timestamp, ..} alone naming-wise, but to rename pgstat_bestart() to > something like pgbestat_start()? I'd probably rename them e.g. command_progress_start(), command_progress_update_param(), etc. > - I've not gone through all the files that could now remove pgstat.h, > replacing it with wait_event.h - I'm thinking it might be worth > waiting till just after code freeze with that (there'll new additions, > and it's likely to cause conflicts)? Don't care. > - backend_status.h needs miscadmin.h, due to BackendType. 
Imo that's a > header we should try to avoid exposing in headers if possible. But > right now there's no good place to move BackendType to. So I'd let > this slide for now. Why the concern? miscadmin.h is extremely widely-included already. Maybe it should be broken up into pieces so that we're not including so MUCH stuff in a zillion places, but the header that contains the definition of CHECK_FOR_INTERRUPTS() is always going to be needed in a ton of spots. Honestly, I wonder why we don't just put that part in postgres.h. If you're writing any significant amount of code and you don't have at least one CHECK_FOR_INTERRUPTS() in there, you're probably doing it wrong. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
Hi, On 2021-03-16 10:27:12 -0400, Robert Haas wrote: > > I'm don't like that after 0002 ExecEvalRow(), ExecEvalFieldStoreForm() > > contain a nearly identical copy of the same code. And > > make_tuple_from_row() also is similar. It seem that there should be a > > heap_form_tuple() version doing this for us? > > I was worried about having either a performance impact or code > duplication. The actual plan where you could insert this organically > is in fill_val(), which is called from heap_fill_tuple(), which is > called from heap_form_tuple(). Oh, I guess it would make sense to do it that way. However, I was just thinking of doing the iteration over the tuples that ExecEvalRow() etc do inside heap_form_flattened_tuple() (or whatever). That'd not be any worse than what the patch is doing now, just less duplication, and an easier path towards optimizing it if we notice that we need to? > If you don't mind passing down 'int flags' or similar to all those, > and having additional branches to make the behavior dependent on the > flags, I'm cool with it. Or if you think we should template-ize all > those functions, that'd be another way to go. But I was afraid I would > get complaints about adding overhead to hot code paths. An option for fill_val() itself would probably be fine. It's already an inline, and if it doesn't get inlined, we could force the compilers hand with pg_attribute_always_inline. The harder part would probably be to find a way to deal with the layers above, without undue code duplication. I think it's not just fill_val() that'd need to know, but also heap_compute_data_size(), heap_fill_tuple() - both of which are externally visible (and iirc thus not going to get inlined with many compiler options, due to symbol interposition dangers). But we could have a heap_compute_data_size_internal(bool flatten) that's called by heap_compute_data_size(). And something similar for heap_form_tuple(). 
But that's complicated, so I'd just go with the iteration in a heap_form_tuple() wrapper for now. > > > I'm open to being convinced that we don't need to do either of these > > > things, and that the cost of iterating over all varlenas in the tuple > > > is not so bad as to preclude doing things as you have them here. But, > > > I'm afraid it's going to be too expensive. > > > > I mean, I would just define several of those places away by not caring > > about tuples in a different compression format ending up in a > > table... > > That behavior feels unacceptable to me from a user expectations point > of view. I think there's an argument that if I update a tuple that > contains a compressed datum, and I don't update that particular > column, I think it would be OK to not recompress the column. But, if I > insert data into a table, I as a user would expect that the > compression settings for that column are going to be respected. IDK. The user might also expect that INSERT .. SELECT is fast, instead of doing expensive decompression + compression (with pglz the former can be really slow). I think there's a good argument for having an explicit "recompress" operation, but I'm not convinced that doing things implicitly is good, especially if it causes complications in quite a few places. Greetings, Andres Freund
Re: pg_amcheck contrib application
Mark Dilger writes: > CopyStatistics seems to just copy Form_pg_statistic without regard for > who owns the toast. Is this safe? No less so than a ton of other places that insert values that might've come from on-disk storage. heap_toast_insert_or_update() is responsible for dealing with the problem. These days it looks like it's actually toast_tuple_init() that takes care of dereferencing previously-toasted values during an INSERT. regards, tom lane
Re: "has_column_privilege()" issue with attnums and non-existent columns
On 3/16/21 1:42 AM, Chengxi Sun wrote:

    The following review has been posted through the commitfest application:
    make installcheck-world:  tested, passed
    Implements feature:       tested, passed
    Spec compliant:           not tested
    Documentation:            not tested

    I tested the patch and it works well.

Was that my patch (i.e. the latest on this thread) or Ian's? It is not clear since you did not CC me...

    And according to the comment, checking existence of relation and
    pg_class_aclcheck() won't be influenced by concurrent DROP. So I think
    it's safe to just reorder the checking existence of column and
    pg_attribute_aclcheck().

"Code never lies, comments sometimes do" That said, I did at least a basic test of that assumption and it seems to be true. Ian, or anyone else, any comments/complaints on my changes? If not I will commit and push that version sooner rather than later. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: pg_amcheck contrib application
Robert Haas writes: > On Tue, Mar 16, 2021 at 1:48 PM Tom Lane wrote: >> No. What should be happening there is that some arrays in the column >> get larger alignment than they actually need, but that shouldn't cause >> a problem (and has not done so, AFAIK, in the decades that it's been >> like this). As you say, deforming the tuple is going to rely on the >> table's tupdesc for alignment; it can't know what is in the data. > OK, I don't understand this. attalign is 'd', which is already the > maximum possible, and even if it weren't, individual rows can't decide > to use a larger OR smaller alignment than expected without breaking > stuff. > In what context is it OK to just add extra alignment padding? It's *not* extra, according to pg_statistic's tuple descriptor. Both forming and deforming of pg_statistic tuples should honor that and locate stavaluesX values on d-aligned boundaries. It could be that a particular entry is of an array type that only requires i-alignment. But that doesn't break anything, it just means we inserted more padding than an omniscient implementation would do. (I suppose in some parallel universe there could be a machine where i-alignment is stricter than d-alignment, and then we'd have trouble.) regards, tom lane
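The point is mechanical: rounding an offset up to a stricter power-of-two boundary than a value needs is always safe, because any d-aligned (8-byte) offset is automatically i-aligned (4-byte), and both forming and deforming consult the same tuple descriptor. A minimal model of the TYPEALIGN-style arithmetic (the helper name here is mine, not PostgreSQL's macro):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round 'off' up to the next multiple of 'alignval', which must be a
 * power of two -- the same shape as PostgreSQL's TYPEALIGN macro.
 */
static uintptr_t
type_align(uintptr_t alignval, uintptr_t off)
{
    return (off + (alignval - 1)) & ~(alignval - 1);
}
```

Because every multiple of 8 is a multiple of 4, padding an i-aligned array out to a d boundary wastes at most four bytes per value and can never misplace it for a reader that honors the descriptor's attalign.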
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 2:22 PM Tom Lane wrote: > I'm circling back around to the idea that amcheck is trying to > validate TOAST references that are already dead, and it's getting > burnt because something-or-other has already removed the toast > rows, though not the referencing datums. That's legal behavior > once the rows are marked dead. Upthread it was claimed that > amcheck isn't doing that, but this looks like a smoking gun to me. I think this theory has some legs. From check_tuple_header_and_visibilty():

        else if (!(infomask & HEAP_XMAX_COMMITTED))
            return true;        /* HEAPTUPLE_DELETE_IN_PROGRESS or
                                 * HEAPTUPLE_LIVE */
        else
            return false;       /* HEAPTUPLE_RECENTLY_DEAD or
                                 * HEAPTUPLE_DEAD */
    }
    return true;                /* not dead */
}

That first case looks wrong to me. Don't we need to call get_xid_status() here, Mark? As coded, it seems that if the xmin is ok and the xmax is marked committed, we consider the tuple checkable. The comment says it must be HEAPTUPLE_DELETE_IN_PROGRESS or HEAPTUPLE_LIVE, but it seems to me that the actual answer is either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on whether the xmax is all-visible. And in the second case the comment says it's either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, but I think in that case it's either HEAPTUPLE_DELETE_IN_PROGRESS or HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on the XID status. Another thought here is that it's probably not wicked smart to be relying on the hint bits to match the actual status of the tuple in cases of corruption. Maybe we should be warning about tuples that have xmin or xmax flagged as committed or invalid when in fact clog disagrees. That's not a particularly uncommon case, and it's hard to check. -- Robert Haas EDB: http://www.enterprisedb.com
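The decision Robert is describing can be modeled in a few lines: consult the authoritative transaction status (what get_xid_status() would read from clog) rather than the HEAP_XMAX_COMMITTED hint bit alone, and skip TOAST checking for tuples whose deleter committed, since their toast rows may legitimately be gone. This is a hypothetical sketch with invented types, not verify_heapam.c's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for clog's verdict on a transaction ID. */
typedef enum
{
    XID_IN_PROGRESS,
    XID_COMMITTED,
    XID_ABORTED
} XidStatus;

/*
 * Should amcheck follow this tuple's TOAST pointers?  Decide from the
 * authoritative xmax status, never from the hint bits alone.
 */
static bool
tuple_is_checkable(bool xmax_valid, XidStatus xmax_status)
{
    if (!xmax_valid)
        return true;            /* no deleter: tuple is live */
    switch (xmax_status)
    {
        case XID_IN_PROGRESS:
            return true;        /* HEAPTUPLE_DELETE_IN_PROGRESS */
        case XID_ABORTED:
            return true;        /* deletion never happened */
        case XID_COMMITTED:
            return false;       /* RECENTLY_DEAD or DEAD: toast rows
                                 * may already be vacuumed away */
    }
    return false;
}
```

Under this rule the pg_statistic rows in the failing test — deleted by a committed (auto-)analyze but not yet pruned — would simply not have their TOAST references validated, which matches Tom's "already dead" theory.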
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 11:20 AM Avinash Kumar wrote: > I can share any detail that would help here. I would like to know what you see when you run a slightly modified version of the same amcheck query. The same query as before, but with the call to bt_index_parent_check() replaced with a call to bt_index_check(). Can you do that, please? This is what I mean:

SELECT bt_index_check(index => c.oid, heapallindexed => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

The error that you reported was a cross-level invariant violation, from one of the tests that bt_index_parent_check() performs but bt_index_check() does not perform (the former performs checks that are a superset of the latter). It's possible that we'll get a more interesting error message from bt_index_check() here, because it might go on for a bit longer -- it might conceivably reach a corrupt posting list tuple on the leaf level, and report it as such. Of course we don't see any corruption in the index that you had the crash with at all, but it can't hurt to do this as well -- just in case the issue is transient or something. Thanks -- Peter Geoghegan
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 10:48 AM, Tom Lane wrote: > > I'm not entirely sure what's going on, but I think coming at this > with the mindset that "amcheck has detected some corruption" is > just going to lead you astray. Almost certainly, it's "amcheck > is incorrectly claiming corruption". That might be from mis-decoding > a TOAST-referencing datum. (Too bad the message doesn't report the > TOAST OID it probed for, so we can see if that's sane or not.) CopyStatistics seems to just copy Form_pg_statistic without regard for who owns the toast. Is this safe? Looking at RemoveStatistics, I'm not sure that it is. Anybody more familiar with this code care to give an opinion? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 1:48 PM Tom Lane wrote: > No. What should be happening there is that some arrays in the column > get larger alignment than they actually need, but that shouldn't cause > a problem (and has not done so, AFAIK, in the decades that it's been > like this). As you say, deforming the tuple is going to rely on the > table's tupdesc for alignment; it can't know what is in the data. OK, I don't understand this. attalign is 'd', which is already the maximum possible, and even if it weren't, individual rows can't decide to use a larger OR smaller alignment than expected without breaking stuff. In what context is it OK to just add extra alignment padding? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 11:08 AM Tom Lane wrote: > Peter Geoghegan writes: > > ... It's hard to believe that the problem is > > squarely with _bt_swap_posting(). > > IIUC, the problem is seen on a replica server but not the primary? > In that case, my thoughts would run towards a bug in WAL log creation > or replay, causing the index contents to be different/wrong on the > replica. My remarks were intended to include problems during recovery (_bt_swap_posting() is run inside REDO routines). Though I did consider recovery specifically when thinking through the problem. My assessment is that the index is highly unlikely to be corrupt (whether it happened during recovery or at some other time), because it passes validation by bt_index_parent_check(), with the optional heapallindexed index-matches-table verification option enabled. This includes exhaustive verification of posting list tuple invariants. Anything is possible, but I find it easier to believe that the issue is somewhere else -- we see the problem in _bt_swap_posting() because it happens to go further than other code in trusting that the tuple isn't corrupt (which it shouldn't). Another unrelated index *was* reported corrupt by amcheck, though the error in question does not suggest an issue with deduplication. -- Peter Geoghegan
Re: pg_amcheck contrib application
Mark Dilger writes: > On Mar 16, 2021, at 10:48 AM, Tom Lane wrote: >> (Too bad the message doesn't report the >> TOAST OID it probed for, so we can see if that's sane or not.) > I've added that and now get the toast pointer's va_valueid in the message: > heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 2, > attribute 29: > toasted value id 13227 for attribute 29 missing from toast table > heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 5, > attribute 27: > toasted value id 13228 for attribute 27 missing from toast table That's awfully interesting, because OIDs less than 16384 should only be generated during initdb. So what we seem to be looking at here is pg_statistic entries that were made during the ANALYZE done by initdb (cf. vacuum_db()), and then were deleted during a later auto-analyze (in the buildfarm) or deliberate analyze (in your reproducer). But if they're deleted, why is amcheck looking for them? I'm circling back around to the idea that amcheck is trying to validate TOAST references that are already dead, and it's getting burnt because something-or-other has already removed the toast rows, though not the referencing datums. That's legal behavior once the rows are marked dead. Upthread it was claimed that amcheck isn't doing that, but this looks like a smoking gun to me. regards, tom lane
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 3:08 PM Tom Lane wrote: > Peter Geoghegan writes: > > ... It's hard to believe that the problem is > > squarely with _bt_swap_posting(). > > IIUC, the problem is seen on a replica server but not the primary? > In that case, my thoughts would run towards a bug in WAL log creation > or replay, causing the index contents to be different/wrong on the > replica. > Right, observed on the replica server after it got promoted. The replica is on the same Postgres minor version (13.1), but the OS is Ubuntu 16 on the primary and Ubuntu 20 on the replica (that got promoted). The replica was set up using a backup taken with pg_basebackup. I can share any detail that would help here. > regards, tom lane > -- Regards, Avi
Re: Postgres crashes at memcopy() after upgrade to PG 13.
Peter Geoghegan writes: > ... It's hard to believe that the problem is > squarely with _bt_swap_posting(). IIUC, the problem is seen on a replica server but not the primary? In that case, my thoughts would run towards a bug in WAL log creation or replay, causing the index contents to be different/wrong on the replica. regards, tom lane
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 10:48 AM, Tom Lane wrote: > > Robert Haas writes: >> On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger >> wrote: >>> It shows them all has having attalign = 'd', but for some array types the >>> alignment will be 'i', not 'd'. So it's lying a bit about the contents. >>> But I'm now confused why this caused problems on the two hosts where >>> integer and double have the same alignment? It seems like that would be >>> the one place where the bug would not happen, not the one place where it >>> does. > >> Wait, so the value in the attalign column isn't the alignment we're >> actually using? I can understand how we might generate tuples like >> that, if we pass the actual type to construct_array(), but shouldn't >> we then get garbage when we deform the tuple? > > No. What should be happening there is that some arrays in the column > get larger alignment than they actually need, but that shouldn't cause > a problem (and has not done so, AFAIK, in the decades that it's been > like this). As you say, deforming the tuple is going to rely on the > table's tupdesc for alignment; it can't know what is in the data. > > I'm not entirely sure what's going on, but I think coming at this > with the mindset that "amcheck has detected some corruption" is > just going to lead you astray. Almost certainly, it's "amcheck > is incorrectly claiming corruption". That might be from mis-decoding > a TOAST-referencing datum. (Too bad the message doesn't report the > TOAST OID it probed for, so we can see if that's sane or not.) 
I've added that and now get the toast pointer's va_valueid in the message: mark.dilger@laptop280-ma-us amcheck % pg_amcheck -t "pg_catalog.pg_statistic" postgres heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 2, attribute 29: toasted value id 13227 for attribute 29 missing from toast table heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 5, attribute 27: toasted value id 13228 for attribute 27 missing from toast table diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index 5ccae46375..a0be71bb7f 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -,8 +,8 @@ check_tuple_attribute(HeapCheckContext *ctx) } if (!found_toasttup) report_corruption(ctx, - psprintf("toasted value for attribute %u missing from toast table", - ctx->attnum)); + psprintf("toasted value id %u for attribute %u missing from toast table", + toast_pointer.va_valueid, ctx->attnum)); else if (ctx->chunkno != (ctx->endchunk + 1)) report_corruption(ctx, psprintf("final toast chunk number %u differs from expected value %u", — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
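For anyone reproducing this, a reported value id can be cross-checked against pg_statistic's TOAST table directly — a sketch, assuming a stock catalog where pg_statistic has OID 2619 (so its TOAST table is pg_toast.pg_toast_2619):

```sql
-- A value id that amcheck reports as missing should return zero rows here;
-- if rows come back, the TOAST data exists and amcheck's probe is suspect.
SELECT chunk_id, count(*) AS chunks
FROM pg_toast.pg_toast_2619
WHERE chunk_id IN (13227, 13228)
GROUP BY chunk_id;
```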
Re: [Proposal] Global temporary tables
Hi this patch is broken now. Please, can you check it? Regards Pavel On Wed, Nov 25, 2020 at 14:08, 曾文旌 wrote: > > > On Nov 25, 2020, at 14:19, Pavel Stehule wrote: > > > > On Mon, Nov 23, 2020 at 10:27, 曾文旌 > wrote: > >> >> >> On Nov 21, 2020, at 02:28, Pavel Stehule wrote: >> >> Hi >> >> On Fri, Sep 11, 2020 at 17:00, 曾文旌 >> wrote: >> >>> I have written the README for the GTT, which contains the GTT >>> requirements and design. >>> I found that compared to my first email a year ago, many GTT Limitations >>> are now gone. >>> Now, I'm adding comments to some of the necessary functions. >>> >> >> There are problems with patching. Please, can you rebase your patch? >> >> Sure. >> I'm still working on sort code and comments. >> If you have any suggestions, please let me know. >> > > It is broken again > > There is bad white space > > + /* > +* For global temp table only > +* use ShareUpdateExclusiveLock for ensure safety > +*/ > + { > + { > + "on_commit_delete_rows", > + "global temp table on commit options", > + RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED, > + ShareUpdateExclusiveLock > + }, > + true > + }, <= > /* list terminator */ > {{NULL}} > > +7 OTHERS > +Parallel query > +Planner does not produce parallel query plans for SQL related to GTT. > Because <= > +GTT private data cannot be accessed across processes. > diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile > > > +/* > + * Update global temp table relstats(relpage/reltuple/relallvisible) > < > + * to local hashtable > + */ > +void > > +/* > + * Search global temp table relstats(relpage/reltuple/relallvisible) > <== > + * from lo > > and there are a lot more places ... 
> > I found other issue > > postgres=# create global temp table foo(a int); > CREATE TABLE > postgres=# create index on foo(a); > CREATE INDEX > > close session and in new session > > postgres=# reindex index foo_a_idx ; > WARNING: relcache reference leak: relation "foo" not closed > REINDEX > > > I fixed all the above issues and rebased the code. > Please review the new version of the code again. > > > Wenjing > > > > > Regards > > Pavel > > > >> >> Wenjing >> >> >> >> Regards >> >> Pavel >> >> >>> >>> Wenjing >>> >>> >>> >>> >>> >>> > On Jul 31, 2020, at 4:57 AM, Robert Haas wrote: >>> > >>> > On Thu, Jul 30, 2020 at 8:09 AM wenjing zeng >>> wrote: >>> >> Please continue to review the code. >>> > >>> > This patch is pretty light on comments. Many of the new functions have >>> > no header comments, for example. There are comments here and there in >>> > the body of the new functions that are added, and in places where >>> > existing code is changed there are comments here and there, but >>> > overall it's not a whole lot. There's no documentation and no README, >>> > either. Since this adds a new feature and a bunch of new SQL-callable >>> > functions that interact with that feature, the feature itself should >>> > be documented, along with its limitations and the new SQL-callable >>> > functions that interact with it. I think there should be either a >>> > lengthy comment in some suitable file, or maybe various comments in >>> > various files, or else a README file, that clearly sets out the major >>> > design principles behind the patch, and explaining also what that >>> > means in terms of features and limitations. Without that, it's really >>> > hard for anyone to jump into reviewing this code, and it will be hard >>> > for people who have to maintain it in the future to understand it, >>> > either. Or for users, for that matter. >>> > >>> > -- >>> > Robert Haas >>> > EnterpriseDB: http://www.enterprisedb.com >>> > The Enterprise PostgreSQL Company >>> >> >
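For reference, the usage pattern the patch aims to support looks like this (syntax taken from the patch under review — GLOBAL TEMP tables are not in core PostgreSQL):

```sql
-- The table definition is shared across sessions, but each session sees
-- only its own rows; ON COMMIT PRESERVE ROWS is the standard-SQL default.
CREATE GLOBAL TEMP TABLE foo (a int) ON COMMIT PRESERVE ROWS;
CREATE INDEX ON foo (a);
-- In a second session, "foo" and foo_a_idx already exist, but the table
-- appears empty until that session inserts its own data.
```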
Re: pg_amcheck contrib application
... btw, I now see that tern and hornet are passing this test at least as much as they're failing, proving that there's some timing or random chance involved. That doesn't completely eliminate the idea that there may be an architecture component to the issue, but it sure reduces its credibility. I now believe the theory that the triggering condition is an auto-analyze happening at the right time, and populating pg_statistic with some data that other runs don't see. regards, tom lane
Re: pg_amcheck contrib application
Robert Haas writes: > On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger > wrote: >> It shows them all has having attalign = 'd', but for some array types the >> alignment will be 'i', not 'd'. So it's lying a bit about the contents. >> But I'm now confused why this caused problems on the two hosts where integer >> and double have the same alignment? It seems like that would be the one >> place where the bug would not happen, not the one place where it does. > Wait, so the value in the attalign column isn't the alignment we're > actually using? I can understand how we might generate tuples like > that, if we pass the actual type to construct_array(), but shouldn't > we then get garbage when we deform the tuple? No. What should be happening there is that some arrays in the column get larger alignment than they actually need, but that shouldn't cause a problem (and has not done so, AFAIK, in the decades that it's been like this). As you say, deforming the tuple is going to rely on the table's tupdesc for alignment; it can't know what is in the data. I'm not entirely sure what's going on, but I think coming at this with the mindset that "amcheck has detected some corruption" is just going to lead you astray. Almost certainly, it's "amcheck is incorrectly claiming corruption". That might be from mis-decoding a TOAST-referencing datum. (Too bad the message doesn't report the TOAST OID it probed for, so we can see if that's sane or not.) regards, tom lane
Re: Calendar support in localization
Hi Thomas On Mon, Mar 15, 2021 at 2:58 PM Thomas Munro wrote: > > One key question here is whether you need a different date type or > just different operations (functions, operators etc) on the existing > types. > > I am thinking of having a converter to a specific calendar after each operation and function, for display or storage. It works for the Ethiopic calendar and I expect it will work for other calendars too > > I cc Thomas Munro and Vik because they have an interest in this area > > Last time it came up[1], I got as far as wondering if the best way > would be to write a set of ICU-based calendar functions. Would it be > enough for your needs to have Ethiopic calendar-aware date arithmetic > (add, subtract a month etc), date part extraction (get the current > Ethiopic day/month/year of a date), display and parsing, and have all > of these as functions that you have to call explicitly, but have them > take the standard built-in date and timestamp types, so that your > tables would store regular date and timestamp values? If not, what > else do you need? > > The Ethiopic calendar has 13 months, so it cannot be stored as the date and timestamp types, and your approach seems more complicated. I suggest having this feature for the sake of PostgreSQL's popularity too, not only for my own need > ICU is very well maintained and widely used software, and PostgreSQL > already depends on it optionally, and that's enabled in all common > distributions. In other words, maybe all the logic you want exists > already in your process's memory, we just have to figure out how to > reach it from SQL. Another reason to use ICU is that we can solve > this problem once and then it'll work for many other calendars. 
> > Date arithmetic is different for each calendar, so solving one calendar's problem doesn't help with another > > Please don't suggest forking PostgreSQL just for this feature > > I would start with an extension, and I'd try to do a small set of > simple functions, to let me write things like: > > icu_format(now(), 'fr_FR@calendar=buddhist') to get a Buddhist > calendar with French words > > icu_date_part('year', current_date, 'am_ET@calendar=traditional') to > get the current year in the Ethiopic calendar (2013 apparently) > > Well, the first one probably also needs a format string too, actual > details to be worked out by reading the ICU manual... > I think you are suggesting this because you expect the implementation to be difficult, but it is not that difficult once you fully understand the Gregorian calendar and the calendar you are working on > > Maybe instead of making a new extension, I might try to start from > https://github.com/dverite/icu_ext and see if it makes sense to extend > it to cover calendars. > > Maybe one day ICU will become a hard dependency of PostgreSQL and > someone will propose all that stuff into core, and then maybe we could > start to think about the possibility of tighter integration with the > built-in date/time functions (and LC_TIME setting? seems complicated, > see also problems with setting LC_COLLATE/datcollate to an ICU > collation name, but I digress and that's a far off problem). I would > also study the SQL standard and maybe DB2 (highly subjective comment: > at a wild guess, the most likely commercial RDBMS to have done a good > job of this if anyone has) to see if they contemplate non-Gregorian > calendars, to get some feel for whether that would eventually be a > possibility to conform with whatever the standard says. 
> > In summary, getting something of very high quality by using a widely > used open source library that we already use seems like a better plan > than trying to write and maintain our own specialist knowledge about > individual calendars. If there's something you need that can't be > done with its APIs working on top of our regular date and timestamp > types, could you elaborate? > > [1] > https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BybW0LJuLtj3yAUsbOw3DrzK00pGk8JyfpCREzi_LSsg%40mail.gmail.com#393d827f1be589d0ad6ca6b016905e80 I don't know how you see this, but to me the feature deserves a specialist, and it is not that difficult to have one, because I guess every major calendar has English documentation regards Surafel
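To make the extension-based proposal upthread concrete, the API Thomas sketches would be used roughly like this (icu_format and icu_date_part are hypothetical names from his sketch, not existing functions):

```sql
-- Hypothetical ICU-backed functions operating on the built-in types:
SELECT icu_date_part('year', current_date, 'am_ET@calendar=traditional');
-- would return the current Ethiopic year (2013 at the time of the thread)

SELECT icu_format(now(), 'fr_FR@calendar=buddhist');
-- would render "now" in the Buddhist calendar with French month names
```

The key design point is that tables keep storing regular date/timestamp values; only presentation and arithmetic go through the calendar-aware functions.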
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 9:50 AM Avinash Kumar wrote: > Yes, it was on the failover-over server where the issue is currently seen. > Took a snapshot of the data directory so that the issue can be analyzed. I would be very cautious when using LVM snapshots with a Postgres data directory, or VM-based snapshotting tools. There are many things that can go wrong with these tools, which are usually not sensitive to the very specific requirements of a database system like Postgres (e.g. inconsistencies between WAL and data files can emerge in many scenarios). My general recommendation is to avoid these tools completely -- consistently use a backup solution like pgBackrest instead. BTW, running pg_repack is something that creates additional risk of database corruption, at least to some degree. That seems less likely to have been the problem here (I think that it's probably something with snapshots). Something to consider. > I can do this. But, to add here, when we do a pg_repack or rebuild of > Indexes, automatically this is resolved. Your bug report was useful to me, because it made me realize that the posting list split code in _bt_swap_posting() is unnecessarily trusting of the on-disk data -- especially compared to _bt_split(), the page split code. While I consider it unlikely that the problem that you see is truly a bug in Postgres, it is still true that the crash that you saw should probably have just been an error. We don't promise that the database cannot crash even with corrupt data, but we do try to avoid it whenever possible. I may be able to harden _bt_swap_posting(), to make failures like this a little more friendly. It's an infrequently hit code path, so we can easily afford to make the code more careful/less trusting. -- Peter Geoghegan
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger wrote: > Yeah, that looks related: > > regression=# select attname, attlen, attnum, attalign from pg_attribute where > attrelid = 2619 and attname like 'stavalue%';
>   attname   | attlen | attnum | attalign
> ------------+--------+--------+----------
>  stavalues1 |     -1 |     27 | d
>  stavalues2 |     -1 |     28 | d
>  stavalues3 |     -1 |     29 | d
>  stavalues4 |     -1 |     30 | d
>  stavalues5 |     -1 |     31 | d
> (5 rows)
> > It shows them all as having attalign = 'd', but for some array types the > alignment will be 'i', not 'd'. So it's lying a bit about the contents. But > I'm now confused why this caused problems on the two hosts where integer and > double have the same alignment? It seems like that would be the one place > where the bug would not happen, not the one place where it does. Wait, so the value in the attalign column isn't the alignment we're actually using? I can understand how we might generate tuples like that, if we pass the actual type to construct_array(), but shouldn't we then get garbage when we deform the tuple? I mean, heap_deform_tuple() is going to get the alignment from the tuple descriptor, which is a table property. It doesn't (and can't) know what type of value is stored inside any of these fields for real, right? In other words, isn't this actually corruption, caused by a bug in our code, and how have we not noticed it before now? -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 9:30 AM, Mark Dilger wrote: > > > >> On Mar 16, 2021, at 9:07 AM, Tom Lane wrote: >> >> Mark Dilger writes: >>> I think autovacuum simply triggers the bug, and is not the cause of the >>> bug. If I turn autovacuum off and instead do an ANALYZE in each test >>> database rather than performing the corruptions, I get reports about >>> problems in pg_statistic. This is on my Mac laptop. This rules out the >>> theory that autovacuum is propagating corruptions into pg_statistic, and >>> also the theory that it is architecture dependent. >> >> I wonder whether amcheck is confused by the declaration of those columns >> as "anyarray". > > It uses attlen and attalign for the attribute, so that idea does make sense. > It gets that via TupleDescAttr(RelationGetDescr(rel), attnum). Yeah, that looks related:
regression=# select attname, attlen, attnum, attalign from pg_attribute where attrelid = 2619 and attname like 'stavalue%';
  attname   | attlen | attnum | attalign
------------+--------+--------+----------
 stavalues1 |     -1 |     27 | d
 stavalues2 |     -1 |     28 | d
 stavalues3 |     -1 |     29 | d
 stavalues4 |     -1 |     30 | d
 stavalues5 |     -1 |     31 | d
(5 rows)
It shows them all as having attalign = 'd', but for some array types the alignment will be 'i', not 'd'. So it's lying a bit about the contents. But I'm now confused why this caused problems on the two hosts where integer and double have the same alignment? It seems like that would be the one place where the bug would not happen, not the one place where it does. I'm attaching a test that reliably reproduces this for me: v8-0005-Adding-pg_amcheck-special-test-for-pg_statistic.patch Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
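The declared-versus-actual alignment mismatch under discussion can be seen by comparing pg_attribute's attalign for the anyarray columns with the typalign of the concrete array types that can actually land in them:

```sql
-- pg_statistic declares the stavaluesN columns as anyarray with attalign 'd',
-- but the concrete array type stored in a given row may only need 'i':
SELECT typname, typalign
FROM pg_type
WHERE typname IN ('_int4', '_float8', '_text');
-- _float8 has typalign 'd'; _int4 and _text have typalign 'i'
```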
Re: Postgres crashes at memcopy() after upgrade to PG 13.
Hi, On Tue, Mar 16, 2021 at 1:44 PM Peter Geoghegan wrote: > On Tue, Mar 16, 2021 at 5:01 AM Avinash Kumar > wrote: > > I am afraid that it looks to me like a deduplication bug but not sure > how this can be pin-pointed. If there is something I could do to determine > that, I would be more than happy. > > That cannot be ruled out, but I don't consider it to be the most > likely explanation. The index in question passes amcheck verification, > which includes verification of the posting list tuple structure, and > even includes making sure the index has an entry for each row from the > table. It's highly unlikely that it is corrupt, and it's hard to see > how you get from a non-corrupt index to the segfault. At the same time > we see that some other index is corrupt -- it fails amcheck due to a > cross-level inconsistency, which is very unlikely to be related to > deduplication in any way. It's hard to believe that the problem is > squarely with _bt_swap_posting(). > > Did you actually run amcheck on the failed-over server, not the original > server? > Yes, it was on the failed-over server where the issue is currently seen. I took a snapshot of the data directory so that the issue can be analyzed. > > Note that you can disable deduplication selectively -- perhaps doing > so will make it possible to isolate the issue. Something like this > should do it (you need to reindex here to actually change the on-disk > representation to not have any posting list tuples from > deduplication): > > alter index idx_id_mtime set (deduplicate_items = off); > reindex index idx_id_mtime; > I can do this. But, to add here, when we do a pg_repack or rebuild of indexes, this is automatically resolved. But I am not sure if we will get the same issue again. > > -- > Peter Geoghegan > -- Regards, Avi.
Re: [PATCH] pgbench: improve \sleep meta command
On 2021/03/09 0:54, Fujii Masao wrote: On 2021/03/08 23:10, Alvaro Herrera wrote: On 2021-Mar-08, kuroda.hay...@fujitsu.com wrote: Dear Fujii-san, Miyake-san Isn't it better to accept even negative sleep time like currently pgbench does? Otherwise we always need to check the variable is a positive integer (for example, using \if command) when using it as the sleep time in \sleep command. That seems inconvenient. Both of them are acceptable for me. But we should write down how it works when the negative value is input if we adopt. Not sleeping at all seems a good reaction (same as for zero, I guess.) +1. BTW, IIUC currently \sleep works in that way. Attached is the updated version of the patch. Currently a variable whose value is a negative number is allowed to be specified as a sleep time as follows. In this case \sleep command doesn't sleep. The patch doesn't change this behavior at all. \set hoge -1 \sleep :hoge s Currently a variable whose value is a double is allowed to be specified as a sleep time as follows. In this case the integer value that atoi() converts the double number into is used as a sleep time. The patch also doesn't change this behavior at all because I've not found any strong reason to ban that usage. 
\set hoge 10 / 4.0 \sleep :hoge s Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index e69d43b26b..a6d91d1089 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -2953,7 +2953,24 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs) pg_log_error("%s: undefined variable \"%s\"", argv[0], argv[1] + 1); return false; } + usec = atoi(var); + if (usec == 0) + { + char *c = var; + + /* Skip sign */ + if (*c == '+' || *c == '-') + c++; + + /* Raise an error if the value of a variable is not a number */ + if (!isdigit((unsigned char) *c)) + { + pg_log_error("%s: invalid sleep time \"%s\" for variable \"%s\"", +argv[0], var, argv[1] + 1); + return false; + } + } } else usec = atoi(argv[1]); @@ -4788,17 +4805,41 @@ process_backslash_command(PsqlScanState sstate, const char *source) * will be parsed with atoi, which ignores trailing non-digit * characters. */ - if (my_command->argc == 2 && my_command->argv[1][0] != ':') + if (my_command->argv[1][0] != ':') { char *c = my_command->argv[1]; + boolhave_digit = false; - while (isdigit((unsigned char) *c)) + /* Skip sign */ + if (*c == '+' || *c == '-') c++; + + /* Require at least one digit */ + if (*c && isdigit((unsigned char) *c)) + have_digit = true; + + /* Eat all digits */ + while (*c && isdigit((unsigned char) *c)) + c++; + if (*c) { - my_command->argv[2] = c; - offsets[2] = offsets[1] + (c - my_command->argv[1]); - my_command->argc = 3; + if (my_command->argc == 2 && have_digit) + { + my_command->argv[2] = c; + offsets[2] = offsets[1] + (c - my_command->argv[1]); + my_command->argc = 3; + } + else + { + /* +* Raise an error if argument starts with non-digit +* character (after sign). 
+*/ + syntax_error(source, lineno, my_command->first_line, my_command->argv[0], +"invalid sleep time, must be an integer", + my_command->argv[1], offsets[1] - start_offset); + } } }
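The validation the patch adds boils down to: skip an optional sign, require at least one digit, then treat anything after the digits as a unit suffix (or reject it, depending on context). A standalone sketch of that logic — illustrative only, not the actual pgbench code, with a hypothetical function name:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Sketch of the \sleep argument check: accept an optional leading sign,
 * then at least one digit; the remainder (if any) is left to the caller
 * as a unit suffix such as "s", "ms", or "us".  Returns false when the
 * argument has no digits at all, which is the case the patch rejects.
 */
bool
parse_sleep_arg(const char *arg, int *value, const char **unit)
{
    const char *c = arg;
    bool        have_digit = false;

    if (*c == '+' || *c == '-')     /* skip sign */
        c++;

    while (isdigit((unsigned char) *c))     /* eat all digits */
    {
        have_digit = true;
        c++;
    }

    if (!have_digit)                /* e.g. "abc" or a bare "-": reject */
        return false;

    *value = atoi(arg);             /* same conversion pgbench itself uses */
    *unit = c;                      /* trailing suffix, possibly empty */
    return true;
}
```

Note that with this shape a negative value such as "-1" still parses successfully, matching the thread's agreement that negative sleep times are accepted and simply don't sleep.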
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 5:01 AM Avinash Kumar wrote: > I am afraid that it looks to me like a deduplication bug but not sure how > this can be pin-pointed. If there is something I could do to determine that, > I would be more than happy. That cannot be ruled out, but I don't consider it to be the most likely explanation. The index in question passes amcheck verification, which includes verification of the posting list tuple structure, and even includes making sure the index has an entry for each row from the table. It's highly unlikely that it is corrupt, and it's hard to see how you get from a non-corrupt index to the segfault. At the same time we see that some other index is corrupt -- it fails amcheck due to a cross-level inconsistency, which is very unlikely to be related to deduplication in any way. It's hard to believe that the problem is squarely with _bt_swap_posting(). Did you actually run amcheck on the failed-over server, not the original server? Note that you can disable deduplication selectively -- perhaps doing so will make it possible to isolate the issue. Something like this should do it (you need to reindex here to actually change the on-disk representation to not have any posting list tuples from deduplication): alter index idx_id_mtime set (deduplicate_items = off); reindex index idx_id_mtime; -- Peter Geoghegan
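For completeness, the amcheck verification referred to above (index name taken from the thread) can be run like this; bt_index_parent_check checks cross-level consistency, and with heapallindexed = true it also verifies that every heap row has an index entry:

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;
-- Run on the promoted replica, against the index suspected of corruption:
SELECT bt_index_parent_check('idx_id_mtime', true);
```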
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 9:07 AM, Tom Lane wrote: > > Mark Dilger writes: >> I think autovacuum simply triggers the bug, and is not the cause of the bug. >> If I turn autovacuum off and instead do an ANALYZE in each test database >> rather than performing the corruptions, I get reports about problems in >> pg_statistic. This is on my mac laptop. This rules out the theory that >> autovacuum is propogating corruptions into pg_statistic, and also the theory >> that it is architecture dependent. > > I wonder whether amcheck is confused by the declaration of those columns > as "anyarray". It uses attlen and attalign for the attribute, so that idea does make sense. It gets that via TupleDescAttr(RelationGetDescr(rel), attnum). — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: A new function to wait for the backend exit after termination
On Tue, Mar 16, 2021 at 10:38 AM Bharath Rupireddy wrote: > > On Mon, Mar 15, 2021 at 10:38 AM Fujii Masao > wrote: > > On 2021/03/15 12:27, Bharath Rupireddy wrote: > > > On Sun, Mar 7, 2021 at 2:39 PM Bharath Rupireddy > > > wrote: > > >> Attaching v7 patch for further review. > > > > > > Attaching v8 patch after rebasing on to the latest master. > > > > Thanks for rebasing the patch! > > Thanks for reviewing. > > > - WAIT_EVENT_XACT_GROUP_UPDATE > > + WAIT_EVENT_XACT_GROUP_UPDATE, > > + WAIT_EVENT_BACKEND_TERMINATION > > > > These should be listed in alphabetical order. > > Done. > > > In pg_wait_until_termination's do-while loop, ResetLatch() should be > > called. Otherwise, it would enter busy-loop after any signal arrives. > > Because the latch is kept set and WaitLatch() always exits immediately in > > that case. > > Done. > > > + /* > > +* Wait in steps of waittime milliseconds until this function exits > > or > > +* timeout. > > +*/ > > + int64 waittime = 10; > > > > 10 ms per cycle seems too frequent? > > Increased it to 100msec. > > > + ereport(WARNING, > > + (errmsg("timeout cannot be negative > > or zero: %lld", > > + (long long int) > > timeout))); > > + > > + result = false; > > > > IMO the parameter should be verified before doing the actual thing. > > Done. > > > Why is WARNING thrown in this case? Isn't it better to throw ERROR like > > pg_promote() does? > > Done. > > Attaching v9 patch for further review. Almost there :) Does it really make sense that pg_wait_for_backend_termination() defaults to waiting *100 milliseconds*, and then logs a warning? That seems extremely short if I'm explicitly asking it to wait. I'd argue that 100ms is too short for pg_terminate_backend() as well, but I think it's a bit more reasonable there. 
Wait events should be in alphabetical order in pgstat_get_wait_ipc() as well, not just in the header (which was adjusted per Fujii's comment) + (errmsg("could not wait for the termination of the backend with PID %d within %lld milliseconds", That's not true though? The wait succeeded, it just timed out? Isn't it more like "backend with PID %d did not terminate within %lld milliseconds"? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
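For context, the calls the patch introduces would look like this (signatures as in the v9 patch under review, which may change before commit; the PID is illustrative):

```sql
-- Terminate a backend and wait up to 10 seconds for it to actually exit:
SELECT pg_terminate_backend(12345, 10000);

-- Or wait for a backend that is already exiting, with the same timeout:
SELECT pg_wait_for_backend_termination(12345, 10000);
```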
Re: pg_amcheck contrib application
Mark Dilger writes: > I think autovacuum simply triggers the bug, and is not the cause of the bug. > If I turn autovacuum off and instead do an ANALYZE in each test database > rather than performing the corruptions, I get reports about problems in > pg_statistic. This is on my mac laptop. This rules out the theory that > autovacuum is propogating corruptions into pg_statistic, and also the theory > that it is architecture dependent. I wonder whether amcheck is confused by the declaration of those columns as "anyarray". regards, tom lane
Re: pg_amcheck contrib application
> On Mar 15, 2021, at 11:09 PM, Noah Misch wrote: > >> Not sure that I believe the theory that this is from bad luck of >> concurrent autovacuum timing, though. > > With autovacuum_naptime=1s, on hornet, the failure reproduced twice in twelve > runs. With v6-0001-Turning-off-autovacuum-during-corruption-tests.patch > applied, 196 runs all succeeded. > >> The fact that we're seeing >> this on just those two animals suggests strongly to me that it's >> architecture-correlated, instead. > > That is possible. I think autovacuum simply triggers the bug, and is not the cause of the bug. If I turn autovacuum off and instead do an ANALYZE in each test database rather than performing the corruptions, I get reports about problems in pg_statistic. This is on my Mac laptop. This rules out the theory that autovacuum is propagating corruptions into pg_statistic, and also the theory that it is architecture dependent. I'll investigate further. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 16, 2021 at 10:39 PM Masahiko Sawada wrote: > > On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan wrote: > > > > One consequence of my approach is that we now call > > lazy_cleanup_all_indexes(), even when we've skipped index vacuuming > > itself. We should at least "check-in" with the indexes IMV. To an > > index AM, this will be indistinguishable from a VACUUM that never had > > tuples for it to delete, and so never called ambulkdelete() before > > calling amvacuumcleanup(). This seems logical to me: why should there > > be any significant behavioral divergence between the case where there > > are 0 tuples to delete and the case where there is 1 tuple to delete? > > The extra work that we perform in amvacuumcleanup() (if any) should > > almost always be a no-op in nbtree following my recent refactoring > > work. More generally, if an index AM is doing too much during cleanup, > > and this becomes a bottleneck, then IMV that's a problem that needs to > > be fixed in the index AM. > > Aside from whether it's good or bad, strictly speaking, it could > change the index AM API contract. The documentation of > amvacuumcleanup() says: > > --- > stats is whatever the last ambulkdelete call returned, or NULL if > ambulkdelete was not called because no tuples needed to be deleted. > --- > > With this change, we could pass NULL to amvacuumcleanup even though > the index might have tuples needed to be deleted. It seems there is no problem with that change at least in built-in index AMs. So +1 for this change. We would need to slightly update the doc accordingly. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
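A toy model of the contract change being discussed (illustrative C only, not PostgreSQL source — the struct here is deliberately simplified): with the proposed behavior, amvacuumcleanup can receive a NULL stats pointer even when the table did have tuples to delete, so an index AM must allocate its own result in that case rather than assuming ambulkdelete already did.

```c
#include <stdlib.h>

/* Simplified stand-in for PostgreSQL's IndexBulkDeleteResult. */
typedef struct ToyBulkDeleteResult
{
    double num_pages;
    double tuples_removed;
} ToyBulkDeleteResult;

/*
 * Toy amvacuumcleanup: if ambulkdelete was skipped (stats == NULL), allocate
 * a zeroed result instead of failing; otherwise pass the caller's stats
 * through.  This is the tolerance the doc change would need to describe.
 */
ToyBulkDeleteResult *
toy_amvacuumcleanup(ToyBulkDeleteResult *stats)
{
    if (stats == NULL)          /* ambulkdelete never ran this cycle */
        stats = calloc(1, sizeof(ToyBulkDeleteResult));

    /* ... per-AM cleanup work would happen here ... */
    return stats;
}
```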
Re: pg_stat_statements oddity with track = all
On Tue, Mar 16, 2021 at 12:55:45PM +0100, Magnus Hagander wrote: > On Tue, Mar 9, 2021 at 3:39 AM Julien Rouhaud wrote: > > > > I think that we might be able to handle that without a flag. The only thing > > that would need to be done is when creating an entry, look for an existing > > entry with the opposite flag, and if there's simply use the same > > (query_offset, query_len) info. This doesn't sound that expensive. > > That's basically what I was trying to say :) Oh ok sorry :) > > The real pain point will be that the garbage collection phase > > will become way more expensive as it will now have to somehow maintain that > > knowledge, which will require additional lookups for each entry. I'm a bit > > concerned about that, especially with the current heuristic to schedule > > garbage > > collection. For now, need_qc_qtext says that we have to do it if the > > extent is > > more than 512 (B) * pgss_max. This probably doesn't work well for people > > using > > ORM as they tend to generate gigantic SQL queries. > > Right, the cost would be mostly on the GC side. I've never done any > profiling to see how big of a thing that is in systems today -- have > you? I didn't, but I don't see how it could be anything but hugely impactful. It basically prevents any query from being planned or executed on the whole instance for the time needed to read the previous qtext file and write back all entries still needed. > > I don't that think that anyone really had a strong argument, mostly gut > > feeling. Note that pg_stat_kcache already implemented that toplevel flags, > > so > > if people are using that extension in a recent version they might have some > > figures to show. I'll ping some people that I know are using it. > > Great -- data always wins over gut feelings :) So I asked some friends who have the latest pg_stat_kcache installed on a preproduction environment configured to track nested queries. 
There isn't a high throughput but the activity should still be representative of the production queries. There are a lot of applications plugged there, around 20 databases and quite a lot of PL code. After a few days, here are the statistics:

- total of ~ 9500 entries
- ~ 900 entries for nested statements
- ~ 35 entries existing for both top level and nested statements

So the duplicates account for less than 4% of the nested statements, and less than 0.5% of the whole entries. I wish I had more reports, but if this one is representative enough then it seems that trying to avoid storing duplicated queries wouldn't be worth it.

> > One good argument would be that gigantic queries generated by ORMs should
> > always be executed as top level statements.
>
> Yes, that's true. And it probably holds as a more generic case as
> well, that is the queries that are likely to show up both top-level
> and lower-level are more likely to be relatively simple ones. (Except
> for example during the development of functions/procs where they're
> often executed top level as well to test etc, but that's not the most
> important case to optimize for)

Agreed.

> > I previously tried with the postgres regression tests, which clearly isn't
> > a representative workload, and as far as I can see the vast majority of
> > queries executed both as top level and nested level are DDL implying
> > recursion (e.g. a CREATE TABLE with underlying index creation).
>
> I think the PostgreSQL regression tests are so far from a real world
> workload that the input in this case has a value of exactly zero.

I totally agree, but that's the only one I had at that time :) Still, it wasn't entirely useless, as I didn't realize before that some DDL would lead to duplicated entries.
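Julien's idea of reusing the (query_offset, query_len) of an existing opposite-flag entry can be pictured in miniature. The sketch below is purely illustrative: the struct, the linear lookup, and the function names are invented here and are not pg_stat_statements code (which uses a shared hash table keyed on more fields):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative entry: keyed by (queryid, toplevel), pointing into a
 * shared query-text file via (query_offset, query_len).
 */
typedef struct Entry
{
    uint64_t queryid;
    bool     toplevel;
    size_t   query_offset;
    size_t   query_len;
} Entry;

#define MAX_ENTRIES 16
static Entry entries[MAX_ENTRIES];
static int   n_entries = 0;

static Entry *
find_entry(uint64_t queryid, bool toplevel)
{
    for (int i = 0; i < n_entries; i++)
        if (entries[i].queryid == queryid && entries[i].toplevel == toplevel)
            return &entries[i];
    return NULL;
}

/*
 * On insert, look for the same queryid with the opposite flag and, if
 * found, share its text location instead of appending the text again.
 */
static Entry *
insert_entry(uint64_t queryid, bool toplevel, size_t new_offset, size_t new_len)
{
    Entry *twin = find_entry(queryid, !toplevel);
    Entry *e = &entries[n_entries++];

    e->queryid = queryid;
    e->toplevel = toplevel;
    if (twin)
    {
        e->query_offset = twin->query_offset;
        e->query_len = twin->query_len;
    }
    else
    {
        e->query_offset = new_offset;
        e->query_len = new_len;
    }
    return e;
}
```

As discussed above, the insert side is cheap; the hard part (not shown) is the garbage-collection pass, which would have to preserve the sharing while rewriting the text file.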
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 7:57 PM Robert Haas wrote:
> That behavior feels unacceptable to me from a user expectations point
> of view. I think there's an argument that if I update a tuple that
> contains a compressed datum, and I don't update that particular
> column, I think it would be OK to not recompress the column. But, if I
> insert data into a table, I as a user would expect that the
> compression settings for that column are going to be respected.
> Deciding that's optional because we don't have a good way of making it
> fast seems like a major cop-out, at least to me. I think from a user
> perspective you don't expect INSERT INTO .. SELECT FROM to create a
> different final state than a dump and reload, and that if we deviate
> from that people are gonna be unhappy. I could be wrong; maybe it's
> only me who would be unhappy.

If that is the only argument then the same is possible today as well. I mean, you can run an INSERT INTO .. SELECT FROM where the source attribute holds compressed data but the target attribute uses external storage; in that case too we will move the compressed data as it is to the target table.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: libpq debug log
On 2021-Mar-15, iwata@fujitsu.com wrote:
> I create a protocol message reading function for each protocol message type.
> (Ex. pqTraceOutputB() reads the Bind message)
> This makes the nesting shallower and makes the code easier to read.

I'm not sure I agree with this structural change. Yes, it is less indentation, but these functions are all quite short and simple anyway. But I don't disagree strongly with it either.

Four issues:

* There's no cross-check that the protocol message length word corresponds to what we actually print. I think one easy way to cross-check that would be to return the "count" from the type-specific routines, and verify that it matches "length" in pqTraceOutputMessage. (If the separate routines are removed and the code put back in one giant switch, then the "count" is already available when the switch ends.)

* If we have separate functions for each message type, then we can have that function supply the message name by itself, rather than have the separate protocol_message_type_f / _b arrays that print it.

* If we make each type-specific function print the message name, then we need to make the same function print the message length, because it comes after. So the function would have to receive the length (so that it can be printed). But I think we should still have the cross-check I mentioned in my first point occur in the main pqTraceOutputMessage, not the type-specific function, for fear that we will forget the cross-check in one of the functions and never realize that we did.

* I would make the pqTraceOutputInt16() function and siblings advance the cursor themselves, actually. I think something like this:

    static int
    pqTraceOutputInt16(const char *data, int *cursor, FILE *pfdebug)
    {
        uint16  tmp;
        int     result;

        memcpy(&tmp, data + *cursor, 2);
        *cursor += 2;
        result = (int) pg_ntoh16(tmp);
        fprintf(pfdebug, " #%d", result);
        return result;
    }

(So the caller has to pass the original "data" pointer, not "data+cursor").
This means the caller no longer has to do "cursor+=N" at each place. Also, you get rid of the moveStrCursor() which does not look good.

Bikeshedding item:

* I'm not fond of the idea of prefixing "#" for 16bit ints and no prefix for 32bit ints. Seems obscure and the output looks weird. I would use a one-letter prefix for each type, "w" for 32-bit ints (stands for "word") and "h" for 16-bit ints (stands for "half-word"). Message length goes unadorned. Then you'd have lines like this:

    2021-03-15 08:10:44.324296 < RowDescription 35 h1 "spcoptions" w1213 h5 w1009 h65535 w-1 h0
    2021-03-15 08:10:44.324335 < DataRow 32 h1 w22 '{random_page_cost=3.0}'

* I don't like that pqTraceOutputS/H print nothing when !toServer. I think they should complain.

--
Álvaro Herrera       39°49'30"S 73°17'W
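The cursor-advancing pattern sketched above can be tried standalone. The snippet below is a hedged illustration only: be16_to_host/be32_to_host stand in for libpq's pg_ntoh16/pg_ntoh32, the function names are invented for the example, and the "h"/"w" prefixes follow the bikeshedding proposal rather than any committed behavior:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for pg_ntoh16/pg_ntoh32: decode big-endian wire bytes
 * without assuming anything about host endianness. */
static uint16_t
be16_to_host(const unsigned char *p)
{
    return (uint16_t) ((p[0] << 8) | p[1]);
}

static uint32_t
be32_to_host(const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Read a 16-bit int at *cursor, print it, and advance the cursor,
 * mirroring the proposed pqTraceOutputInt16 signature. */
static int
trace_output_int16(const char *data, int *cursor, FILE *pfdebug)
{
    int result = (int) be16_to_host((const unsigned char *) data + *cursor);

    *cursor += 2;
    fprintf(pfdebug, " h%d", result);
    return result;
}

/* Same idea for a 32-bit int, advancing the cursor by 4. */
static int
trace_output_int32(const char *data, int *cursor, FILE *pfdebug)
{
    int result = (int) be32_to_host((const unsigned char *) data + *cursor);

    *cursor += 4;
    fprintf(pfdebug, " w%d", result);
    return result;
}
```

Because each reader both returns the decoded value and moves the cursor, the caller can sum the bytes consumed and compare the total against the message length word, which is exactly the cross-check discussed in the first bullet.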
Re: [PATCH]: Allow errors in parameter values to be reported during the BIND phase itself..
Justin Pryzby writes: > I think it's somewhat confusing that there's two callbacks. > The first one applies only during typinput/typreceive. > I guess the 2nd one should say that they're printed "in *future errors". I adjusted the comments to make this a bit clearer, and pushed it. regards, tom lane
Re: GROUP BY DISTINCT
On 3/16/21 9:21 AM, Vik Fearing wrote:
> On 3/13/21 12:33 AM, Tomas Vondra wrote:
>> Hi Vik,
>>
>> The patch seems quite ready, I have just two comments.
>
> Thanks for taking a look.
>
>> 1) Shouldn't this add another index entry for DISTINCT, somewhere in the
>> documentation? Now the index points just to the SELECT DISTINCT part.
>
> Good idea; I never think about the index.
>
>> 2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
>> in order to stash it in the group lists is rather ugly, IMHO. It forces
>> all the places handling the list to be aware of this (there are not
>> many, but still ...). And there are no other places doing (bool) intVal
>> so it's not like there's a precedent for this.
>
> There is kind of a precedent for it, I was copying off of TriggerEvents
> and func_alias_clause.

I see. I was looking for "(bool) intVal" but you're right, the TriggerEvents code does something similar.

>> I think the clean solution is to make group_clause produce a struct with
>> two fields, and just use that. Not sure how invasive that will be
>> outside gram.y, though.
>
> I didn't want to create a whole new parse node for it, but Andrew Gierth
> pointed me towards SelectLimit so I did it like that and I agree it is
> much cleaner.

I agree, that's much cleaner.

>> Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
>> wonder if we can come up with some clearer names, describing the context
>> of those types.
>
> I turned this into an enum for ALL/DISTINCT/default and the caller can
> choose what it wants to do with default. I think that's a lot cleaner,
> too. Maybe DISTINCT ON should be changed to fit in that? I left it
> alone for now.

I think DISTINCT ON is a different kind of animal, because that is a list of expressions, not just a simple enum state.

> I also snuck in something that all of us overlooked, which is outputting
> the DISTINCT in ruleutils.c. I didn't add a test for it but that would
> have been an unfortunate bug.
> Oh! > New patch attached, rebased on 15639d5e8f. > Thanks. At this point it seems fine to me, no further comments. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
re: crash during cascaded foreign key update
> #0  0x7f747e6e2387 in raise () from /lib64/libc.so.6
> #1  0x7f747e6e3a78 in abort () from /lib64/libc.so.6
> #2  0x00ae056a in ExceptionalCondition (
>     conditionName=0xb67c10 "!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid)",
>     errorType=0xb66d89 "FailedAssertion", fileName=0xb66e68 "heapam.c",
>     lineNumber=3560) at assert.c:69
> #3  0x004eed16 in heap_update (relation=0x7f747f569590, otid=0x7ffe6f236ec0,
>     newtup=0x1c214b8, cid=2, crosscheck=0x1c317f8, wait=true,
>     tmfd=0x7ffe6f236df0, lockmode=0x7ffe6f236dec) at heapam.c:3560

I have this report from one static analysis tool:

heapam.c (9379): Dereferencing of a potential null pointer 'oldtup.t_data'

regards,
Ranier Vilela
Re: [HACKERS] Custom compression methods
On Mon, Mar 15, 2021 at 6:58 PM Andres Freund wrote:
> I don't particularly like PG_RETURN_HEAPTUPLEHEADER_RAW(). What is "raw"
> about it? It also seems to me like there needs to at least be a
> sentence or two explaining when to use which of the functions.

It seemed like the natural name to me; we use "raw" elsewhere to mean that fewer things are magically addressed on behalf of the caller, e.g. HeapTupleHeaderGetRawXmin. I'm open to suggestions, however.

> I think heap_copy_tuple_as_raw_datum() should grow an assert checking
> there are no external columns?

Yeah, could be done.

> I don't like that after 0002 ExecEvalRow(), ExecEvalFieldStoreForm()
> contain a nearly identical copy of the same code. And
> make_tuple_from_row() also is similar. It seems that there should be a
> heap_form_tuple() version doing this for us?

I was worried about having either a performance impact or code duplication. The actual place where you could insert this organically is in fill_val(), which is called from heap_fill_tuple(), which is called from heap_form_tuple(). If you don't mind passing down 'int flags' or similar to all those, and having additional branches to make the behavior dependent on the flags, I'm cool with it. Or if you think we should template-ize all those functions, that'd be another way to go. But I was afraid I would get complaints about adding overhead to hot code paths.

> > I'm open to being convinced that we don't need to do either of these
> > things, and that the cost of iterating over all varlenas in the tuple
> > is not so bad as to preclude doing things as you have them here. But,
> > I'm afraid it's going to be too expensive.
>
> I mean, I would just define several of those places away by not caring
> about tuples in a different compression format ending up in a
> table...

That behavior feels unacceptable to me from a user expectations point of view.
I think there's an argument that if I update a tuple that contains a compressed datum, and I don't update that particular column, I think it would be OK to not recompress the column. But, if I insert data into a table, I as a user would expect that the compression settings for that column are going to be respected. Deciding that's optional because we don't have a good way of making it fast seems like a major cop-out, at least to me. I think from a user perspective you don't expect INSERT INTO .. SELECT FROM to create a different final state than a dump and reload, and that if we deviate from that people are gonna be unhappy. I could be wrong; maybe it's only me who would be unhappy. -- Robert Haas EDB: http://www.enterprisedb.com
Re: crash during cascaded foreign key update
Amit Langote writes: > With HEAD (I think v12 and greater), I see $subject when trying out > the following scenario: I wonder if this is related to https://www.postgresql.org/message-id/flat/89429.1584443208%40antos which we've still not done anything about. regards, tom lane
Re: Nondeterministic collations and the value returned by GROUP BY x
Jim Finnerty writes: > right. It doesn't matter which of the values is returned; however, a > plausible-sounding implementation would case-fold the value, like GROUP BY > LOWER(x), but the case-folded value isn't necessarily one of the original > values and so that could be subtly wrong in the case-insensitive case, and > could in principle be completely wrong in the most general nondeterministic > collation case where the case-folded value isn't even equal to the other > members of the set. > does the implementation in PG12 ensure that some member of the set of equal > values is chosen as the representative value? Without having actually looked, I'm pretty certain it does. Considerations of data type independence would seem to rule out a hack like applying case folding. There might be case folding happening internally to comparison functions, like citext_cmp, but that wouldn't affect the grouping logic that is going to save aside one of the group of peer values. regards, tom lane
crash during cascaded foreign key update
With HEAD (I think v12 and greater), I see $subject when trying out the following scenario:

-- in backend 1
create table p (a int primary key);
create table f (a int references p on update cascade deferrable initially deferred);
insert into p values (1);
begin isolation level serializable;
insert into p values (3);

-- in another backend
insert into f values (1)

-- back in backend 1
update p set a = a + 1;

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I see the following backtrace:

#0  0x7f747e6e2387 in raise () from /lib64/libc.so.6
#1  0x7f747e6e3a78 in abort () from /lib64/libc.so.6
#2  0x00ae056a in ExceptionalCondition (
    conditionName=0xb67c10 "!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid)",
    errorType=0xb66d89 "FailedAssertion", fileName=0xb66e68 "heapam.c",
    lineNumber=3560) at assert.c:69
#3  0x004eed16 in heap_update (relation=0x7f747f569590, otid=0x7ffe6f236ec0,
    newtup=0x1c214b8, cid=2, crosscheck=0x1c317f8, wait=true,
    tmfd=0x7ffe6f236df0, lockmode=0x7ffe6f236dec) at heapam.c:3560
#4  0x004fdb52 in heapam_tuple_update (relation=0x7f747f569590,
    otid=0x7ffe6f236ec0, slot=0x1c43fc8, cid=2, snapshot=0x1c31a30,
    crosscheck=0x1c317f8, wait=true, tmfd=0x7ffe6f236df0,
    lockmode=0x7ffe6f236dec, update_indexes=0x7ffe6f236deb) at heapam_handler.c:327
#5  0x0075a7fc in table_tuple_update (rel=0x7f747f569590, otid=0x7ffe6f236ec0,
    slot=0x1c43fc8, cid=2, snapshot=0x1c31a30, crosscheck=0x1c317f8, wait=true,
    tmfd=0x7ffe6f236df0, lockmode=0x7ffe6f236dec, update_indexes=0x7ffe6f236deb)
    at ../../../src/include/access/tableam.h:1509
#6  0x0075cd20 in ExecUpdate (mtstate=0x1c42540, resultRelInfo=0x1c42778,
    tupleid=0x7ffe6f236ec0, oldtuple=0x0, slot=0x1c43fc8, planSlot=0x1c43e78,
    epqstate=0x1c42638, estate=0x1c422d0, canSetTag=true) at nodeModifyTable.c:1498
#7  0x0075e0a3 in ExecModifyTable (pstate=0x1c42540) at nodeModifyTable.c:2254
#8  0x0072674e in ExecProcNodeFirst (node=0x1c42540) at execProcnode.c:456
#9  0x0071b13b in ExecProcNode (node=0x1c42540)
    at ../../../src/include/executor/executor.h:247
#10 0x0071d8f3 in ExecutePlan (estate=0x1c422d0, planstate=0x1c42540,
    use_parallel_mode=false, operation=CMD_UPDATE, sendTuples=false,
    numberTuples=0, direction=ForwardScanDirection, dest=0xcb1440 ,
    execute_once=true) at execMain.c:1531
#11 0x0071b75f in standard_ExecutorRun (queryDesc=0x1c4cd18,
    direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:350
#12 0x0071b587 in ExecutorRun (queryDesc=0x1c4cd18,
    direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:294
#13 0x00777a88 in _SPI_pquery (queryDesc=0x1c4cd18, fire_triggers=false,
    tcount=0) at spi.c:2729
#14 0x007774fa in _SPI_execute_plan (plan=0x1bf93d0, paramLI=0x1c402c0,
    snapshot=0x1036840 , crosscheck_snapshot=0x1c317f8, read_only=false,
    no_snapshots=false, fire_triggers=false, tcount=0, caller_dest=0x0,
    plan_owner=0x1bc1c10) at spi.c:2500
#15 0x007740a9 in SPI_execute_snapshot (plan=0x1bf93d0, Values=0x7ffe6f237340,
    Nulls=0x7ffe6f237300 " ", snapshot=0x1036840 ,
    crosscheck_snapshot=0x1c317f8, read_only=false, fire_triggers=false,
    tcount=0) at spi.c:693
#16 0x00a52724 in ri_PerformCheck (riinfo=0x1c3f2f8, qkey=0x7ffe6f2378a0,
    qplan=0x1bf93d0, fk_rel=0x7f747f569590, pk_rel=0x7f747f564a30,
    oldslot=0x1c042b8, newslot=0x1c04420, detectNewRows=true, expect_OK=9)
    at ri_triggers.c:2517
#17 0x00a4fee5 in RI_FKey_cascade_upd (fcinfo=0x7ffe6f237a60) at ri_triggers.c:1163
#18 0x006ea114 in ExecCallTriggerFunc (trigdata=0x7ffe6f237b00, tgindx=1,
    finfo=0x1bc5be0, instr=0x0, per_tuple_context=0x1c06760) at trigger.c:2141
#19 0x006ed216 in AfterTriggerExecute (estate=0x1bc52f0, event=0x1c196c0,
    relInfo=0x1bc5798, trigdesc=0x1bc59d0, finfo=0x1bc5bb0, instr=0x0,
    per_tuple_context=0x1c06760, trig_tuple_slot1=0x0, trig_tuple_slot2=0x0)
    at trigger.c:4030
#20 0x006ed6e5 in afterTriggerInvokeEvents (events=0x1c31ac8, firing_id=1,
    estate=0x1bc52f0, delete_ok=false) at trigger.c:4244
#21 0x006ede4c in AfterTriggerEndQuery (estate=0x1bc52f0) at trigger.c:4581
#22 0x0071b90c in standard_ExecutorFinish (queryDesc=0x1c13040) at execMain.c:425
#23 0x0071b803 in ExecutorFinish (queryDesc=0x1c13040) at execMain.c:393
#24 0x00955ec6 in ProcessQuery (plan=0x1c51b30,
    sourceText=0x1b424a0 "update p set a = a + 1;", params=0x0, queryEnv=0x0,
    dest=0x1c51ca0, qc=0x7ffe6f237f20) at pquery.c:190
#25 0x00957701 in PortalRunMulti (portal=0x1ba4980, isTopLevel=true,
    setHoldSnapshot=false, dest=0x1c51ca0, altdest=0x1c51ca0,
    qc=0x7ffe6f237f20) at pquery.c:1267
#26 0x00956ca5 in PortalRun (portal=0
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Mar 16, 2021 at 6:22 PM vignesh C wrote: > > On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote: > > > > On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila wrote: > >> > > 2) table_states_not_ready global variable is used immediately after > call to FetchTableStates, we can make FetchTableStates return the > value or get it as an argument to the function and the global > variables can be removed. > +static List *table_states_not_ready = NIL; > But we do update the states in the list table_states_not_ready in function process_syncing_tables_for_apply. So, the current arrangement looks good to me. -- With Regards, Amit Kapila.
Re: pg_subscription - substream column?
On Tue, Mar 16, 2021 at 5:27 PM Peter Smith wrote: > > On Tue, Mar 16, 2021 at 7:20 PM Amit Kapila wrote: > > > > On Tue, Mar 16, 2021 at 3:35 AM Peter Smith wrote: > > > > > > I noticed that the PG docs [1] for the catalog pg_subscription doesn't > > > have any mention of the substream column. > > > > > > Accidental omission by commit 4648243 [2] from last year? > > > > > > > Right. I'll fix this unless you or someone else is interested in > > providing a patch for this. > > Sure, I will send a patch for it tomorrow. > Attached, please find the patch to update the description of substream in pg_subscription. -- With Regards, Amit Kapila. v1-0001-Doc-Add-a-description-of-substream-in-pg_subscrip.patch Description: Binary data
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan wrote: > > On Thu, Mar 11, 2021 at 8:31 AM Robert Haas wrote: > > But even if not, I'm not sure this > > helps much with the situation you're concerned about, which involves > > non-HOT tuples. > > Attached is a POC-quality revision of Masahiko's > skip_index_vacuum.patch [1]. There is an improved design for skipping > index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres > 12). I'm particularly interested in your perspective on this > refactoring stuff, Robert, because you ran into the same issues after > initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran > into issues with the "tupgone = true" special case. This is the case > where VACUUM considers a tuple dead that was not marked LP_DEAD by > pruning, and so needs to be killed in the second heap scan in > lazy_vacuum_heap() instead. You'll recall that these issues were fixed > by your commit dd695979888 from May 2019. I think that we need to go > further than you did in dd695979888 for this -- we ought to get rid of > the special case entirely. Thank you for the patch! > > This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer > get invoked as if it was like the "no indexes on table so do it all in > one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP > OFF isn't able to call lazy_vacuum_page() at all (for the obvious > reason), so any similarity between the two cases was always > superficial -- skipping index vacuuming should not be confused with > doing a one-pass VACUUM/having no indexes at all. The original > INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always > seemed confusing to me for this reason, FWIW. Agreed. 
> > Note that I've merged multiple existing functions in vacuumlazy.c into > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > into a single function named vacuum_indexes_mark_unused() (note also > that lazy_vacuum_page() has been renamed to mark_unused_page() to > reflect the fact that it is now strictly concerned with making LP_DEAD > line pointers LP_UNUSED). The big idea is that there is one choke > point that decides whether index vacuuming is needed at all at one > point in time, dynamically. vacuum_indexes_mark_unused() decides this > for us at the last moment. This can only happen during a VACUUM that > has enough memory to fit all TIDs -- otherwise we won't skip anything > dynamically. > > We may in the future add additional criteria for skipping index > vacuuming. That can now just be added to the beginning of this new > vacuum_indexes_mark_unused() function. We may even teach > vacuum_indexes_mark_unused() to skip some indexes but not others in a > future release, a possibility that was already discussed at length > earlier in this thread. This new structure has all the context it > needs to do all of these things. I agree to create a function like vacuum_indexes_mark_unused() that makes a decision and does index and heap vacumming accordingly. But what is the point of removing both lazy_vacuum_all_indexes() and lazy_vacuum_heap()? I think we can simply have vacuum_indexes_mark_unused() call those functions. I'm concerned that if we add additional criteria to vacuum_indexes_mark_unused() in the future the function will become very large. > > I wonder if we can add some kind of emergency anti-wraparound vacuum > logic to what I have here, for Postgres 14. Can we come up with logic > that has us skip index vacuuming because XID wraparound is on the > verge of causing an outage? That seems like a strategically important > thing for Postgres, so perhaps we should try to get something like > that in. 
> Practically every post mortem blog post involving Postgres
> also involves anti-wraparound vacuum.

+1

I think we can set VACOPT_TERNARY_DISABLED to tab->at_params.index_cleanup in table_recheck_autovac(), or increase the thresholds used, to not skip index vacuuming.

> One consequence of my approach is that we now call
> lazy_cleanup_all_indexes(), even when we've skipped index vacuuming
> itself. We should at least "check-in" with the indexes IMV. To an
> index AM, this will be indistinguishable from a VACUUM that never had
> tuples for it to delete, and so never called ambulkdelete() before
> calling amvacuumcleanup(). This seems logical to me: why should there
> be any significant behavioral divergence between the case where there
> are 0 tuples to delete and the case where there is 1 tuple to delete?
> The extra work that we perform in amvacuumcleanup() (if any) should
> almost always be a no-op in nbtree following my recent refactoring
> work. More generally, if an index AM is doing too much during cleanup,
> and this becomes a bottleneck, then IMV that's a problem that needs to
> be fixed in the index AM.

Aside from whether it's good or bad, strictly speaking, it could change the index AM API contract. The documentation of amvacuumcleanup() says:

---
stats is wh
Re: simplifying foreign key/RI checks
On Mon, Mar 8, 2021 at 11:41 PM Amit Langote wrote: > On Thu, Mar 4, 2021 at 5:15 AM Tom Lane wrote: > > Lastly, ri_PerformCheck is pretty careful about not only which > > snapshot it uses, but which *pair* of snapshots it uses, because > > sometimes it needs to worry about data changes since the start > > of the transaction. You've ignored all of that complexity AFAICS. > > That's okay (I think) for RI_FKey_check which was passing > > detectNewRows = false, but for sure it's not okay for > > ri_Check_Pk_Match. (I kind of thought we had isolation tests > > that would catch that, but apparently not.) > > Okay, let me closely check the ri_Check_Pk_Match() case and see if > there's any live bug. I checked, and AFAICS, the query invoked by ri_Check_Pk_Match() (that is, without the patch) does not use the "crosscheck" snapshot at any point during its execution. That snapshot is only used in the table_update() and table_delete() routines, which are not involved in the execution of ri_Check_Pk_Match()'s query. I dug through git history and -hackers archives to understand the origins of RI code's use of a crosscheck snapshot and came across this discussion: https://www.postgresql.org/message-id/20031001150510.U45145%40megazone.bigpanda.com If I am reading the discussion and the details in subsequent commit 55d85f42a891a correctly, the crosscheck snapshot is only to be used to ensure, under serializable isolation, that any attempts by the RI query of updating/deleting rows that are not visible to the transaction snapshot cause a serialization error. Use of the same facilities in ri_Check_Pk_Match() was merely done as future-proofing, with no particular use case to address, then and perhaps even now. If that is indeed the case, it does not seem particularly incorrect for ri_ReferencedKeyExists() added by the patch to not bother with setting up a crosscheck snapshot, even when called from ri_Check_Pk_Match(). Am I missing something? 
-- Amit Langote EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 4:07 PM Dilip Kumar wrote:
> INSERT TIME
> Head: 17418.299 ms  Patch: 20956.231 ms
>
> CTAS TIME:
> Head: 12837.872 ms  Patch: 16775.739 ms

On quick analysis with perf, it appeared that the performance is degrading because of deforming:

-   16.19%     3.54%  postgres  postgres  [.] CompareCompressionMethodAndDecompress
   - 12.65% CompareCompressionMethodAndDecompress
      - 12.57% slot_getallattrs
         - 12.56% slot_getsomeattrs
            - 12.53% slot_getsomeattrs_int
               - 12.50% tts_buffer_heap_getsomeattrs
                    slot_deform_heap_tuple

So I think in the case of direct insert it needs to deform because I am calling CompareCompressionMethodAndDecompress after ExecCopySlot, and that is why we have to deform every time; maybe that can be avoided by calling CompareCompressionMethodAndDecompress before ExecCopySlot. But in the case of CTAS or INSERT INTO ... SELECT we can not avoid deforming, because we might get the formed tuple from the source table. I put a temporary hack to keep a map of the varlena attributes and use it across tuples, but it did not improve the performance in this case because the main bottleneck is slot_getallattrs. I think this should help where we don't have any varlena, but first I need to test whether we see any performance regression in those cases at all.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
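The "map of the varlena attributes" hack Dilip mentions can be pictured as a per-tuple-descriptor cache built once and reused for every tuple, so that only potentially-compressed columns need to be visited. The sketch below is purely illustrative (simplified descriptor, invented names, not the actual patch); and, as noted above, it cannot help when slot_getallattrs itself is the bottleneck:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified attribute descriptor: attlen == -1 means varlena. */
typedef struct Attr
{
    int attlen;
} Attr;

typedef struct TupleDesc
{
    int  natts;
    Attr attrs[8];
    /* Cached map: indexes of varlena attributes, built once. */
    int  n_varlena;
    int  varlena_atts[8];
    bool map_built;
} TupleDesc;

/*
 * Build the varlena map once per descriptor, so per-tuple work can
 * iterate over only the attributes that might need decompression
 * instead of rescanning every attribute for every tuple.
 */
static void
build_varlena_map(TupleDesc *desc)
{
    if (desc->map_built)
        return;

    desc->n_varlena = 0;
    for (int i = 0; i < desc->natts; i++)
        if (desc->attrs[i].attlen == -1)
            desc->varlena_atts[desc->n_varlena++] = i;
    desc->map_built = true;
}
```

With no varlena columns at all, the map is empty and the per-tuple compression check degenerates to a no-op, which is the case Dilip expects the hack to help.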
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote:
> On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila wrote:
>> I think something on these lines should be much
>> easier than the spool-file implementation unless we see any problem
>> with this idea.
>
> Here's a new patch-set that implements this new solution proposed by Amit.

Another couple of comments:

1) In the code below, should the Assert be changed to an error check like the following?

if (!HeapTupleIsValid(tup))
    elog(ERROR, "cache lookup failed for subscription %u",
         MySubscription->oid);

+   rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+   tup = SearchSysCacheCopy1(SUBSCRIPTIONOID,
+                             ObjectIdGetDatum(MySubscription->oid));
+   Assert(HeapTupleIsValid(tup));

2) The table_states_not_ready global variable is used immediately after the call to FetchTableStates; we could make FetchTableStates return the value, or pass it as an argument to the function, and the global variables could be removed.

+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;

Regards,
Vignesh
RE: subscriptionCheck failures
Hi On Tuesday, March 16, 2021 4:15 PM vignesh C wrote: > On Tue, Mar 16, 2021 at 12:29 PM Amit Kapila > wrote: > > > > On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila > wrote: > > > > > > On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro > wrote: > > > > > > > > Hi, > > > > > > > > This seems to be a new low frequency failure, I didn't see it mentioned > already: > > > > > > > > > > Thanks for reporting, I'll look into it. > > > > > > > By looking at the logs [1] in the buildfarm, I think I know what is > > going on here. After Create Subscription, the tablesync worker is > > launched and tries to create the slot for doing the initial copy but > > before it could finish creating the slot, we issued the Drop > > Subscription which first stops the tablesync worker and then tried to > > drop its slot. Now, it is quite possible that by the time Drop > > Subscription tries to drop the tablesync slot, it is not yet created. > > We treat this condition okay and just Logs the message. I don't think > > this is an issue because anyway generally such a slot created on the > > server will be dropped before we persist it but the test was checking > > the existence of slots on server before it gets dropped. I think we > > can avoid such a situation by preventing cancel/die interrupts while > > creating tablesync slot. > > > > This is a timing issue, so I have reproduced it via debugger and > > tested that the attached patch fixes it. > > > > Thanks for the patch. > I was able to reproduce the issue using debugger by making it wait at > CreateReplicationSlot. After applying the patch the issue gets solved. I really appreciate everyone's help. For the double check, I utilized the patch and debugger too. I also put one while loop at the top of CreateReplicationSlot to control walsender. Without the patch, DROP SUBSCRIPTION goes forward, even when the table sync worker cannot move by the CreateReplicationSlot loop, and the table sync worker is killed by DROP SUBSCRIPTION command. 
On the other hand, with the patch contents, I observed that DROP SUBSCRIPTION hangs and waits until I release the walsender process from CreateReplicationSlot. After this, the command drops two slots like below.

NOTICE:  dropped replication slot "pg_16391_sync_16385_6940222843739406079" on publisher
NOTICE:  dropped replication slot "mysub1" on publisher
DROP SUBSCRIPTION

To me, this works correctly, because the point at which the while loop stops the walsender means the DROP SUBSCRIPTION has to deal with both slots.

Any comments?

Best Regards,
Takamichi Osumi
Re: dynamic result sets support in extended query protocol
On 15.03.21 14:56, David Steele wrote: Hi Peter, On 12/30/20 9:33 AM, Peter Eisentraut wrote: On 2020-10-09 20:46, Andres Freund wrote: Is there really a good reason for forcing the client to issue NextResult, Describe, Execute for each of the dynamic result sets? It's not like there's really a case for allowing the clients to skip them, right? Why aren't we sending something more like S: CommandPartiallyComplete S: RowDescription S: DataRow... S: CommandPartiallyComplete S: RowDescription S: DataRow... ... S: CommandComplete C: Sync I want to post my current patch, to keep this discussion moving. CFBot reports that tests are failing, although the patch applies. Yes, as explained in the message, you need another patch that makes psql show the additional result sets. The cfbot cannot handle that kind of thing. In the meantime, I have made a few small fixes, so I'm attaching another patch. From 163d2ba39a0b46deb83e7509d85a5b2012fd84ec Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 16 Mar 2021 11:28:53 +0100 Subject: [PATCH v2] Dynamic result sets from procedures Declaring a cursor WITH RETURN in a procedure makes the cursor's data be returned as a result of the CALL invocation. The procedure needs to be declared with the DYNAMIC RESULT SETS attribute. 
Discussion: https://www.postgresql.org/message-id/flat/6e747f98-835f-2e05-cde5-86ee444a7...@2ndquadrant.com
---
 doc/src/sgml/catalogs.sgml                 | 10 +++
 doc/src/sgml/information_schema.sgml       |  3 +-
 doc/src/sgml/protocol.sgml                 | 19 +
 doc/src/sgml/ref/alter_procedure.sgml      | 12 +++
 doc/src/sgml/ref/create_procedure.sgml     | 14 +++
 doc/src/sgml/ref/declare.sgml              | 34 +++-
 src/backend/catalog/information_schema.sql |  2 +-
 src/backend/catalog/pg_aggregate.c         |  3 +-
 src/backend/catalog/pg_proc.c              |  4 +-
 src/backend/catalog/sql_features.txt       |  2 +-
 src/backend/commands/functioncmds.c        | 75 ++--
 src/backend/commands/portalcmds.c          | 23 +
 src/backend/commands/typecmds.c            | 12 ++-
 src/backend/parser/gram.y                  | 20 -
 src/backend/tcop/postgres.c                | 62 +-
 src/backend/tcop/pquery.c                  |  6 ++
 src/backend/utils/errcodes.txt             |  1 +
 src/backend/utils/mmgr/portalmem.c         | 48 +++
 src/bin/pg_dump/pg_dump.c                  | 16 +++-
 src/include/catalog/pg_proc.h              |  6 +-
 src/include/commands/defrem.h              |  2 +
 src/include/nodes/parsenodes.h             |  9 +-
 src/include/parser/kwlist.h                |  3 +
 src/include/utils/portal.h                 | 14 +++
 src/interfaces/libpq/fe-protocol3.c        |  6 +-
 .../regress/expected/create_procedure.out  | 85 ++-
 src/test/regress/sql/create_procedure.sql  | 61 -
 27 files changed, 518 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b1de6d0674..c30d6328ee 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5844,6 +5844,16 @@ pg_proc Columns
+
+
+ prodynres int4
+
+
+ For procedures, this records the maximum number of dynamic result sets
+ the procedure may create. Otherwise zero.
+
+
+
 pronargs int2
diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index 4100198252..7f7498eeff 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -5884,7 +5884,8 @@ routines Columns
 max_dynamic_result_sets cardinal_number
- Applies to a feature not available in PostgreSQL
+ For a procedure, the maximum number of dynamic result sets. Otherwise
+ zero.
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe62a..4fe0b271e7 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -959,6 +959,25 @@ Extended Query
 an empty query string), ErrorResponse, or PortalSuspended.
+
+Executing a portal may give rise to a dynamic result set
+sequence. That means the command contained in the portal
+created additional result sets beyond what it normally returns. (The
+typical example is calling a stored procedure that creates dynamic result
+sets.) Dynamic result sets are issued after whatever response the main
+command issued. Each dynamic result set begins with a RowDescription
+message followed by zero or more DataRow messages. (Since, as explained
+above, an Execute message normally does not respond with a RowDescription,
+the appearance of the first RowDescription marks the end of the primary
+result set of the portal and the beginning of the first d
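For readers following along, here is a hypothetical usage sketch pieced together from the patch's commit message ("Declaring a cursor WITH RETURN in a procedure" plus the "DYNAMIC RESULT SETS attribute"). The exact grammar is defined by the patch itself, so treat the procedure and table names — and the syntax details — as assumptions, not released PostgreSQL behavior:

```sql
-- Assumed syntax, per the commit message above: the procedure declares how
-- many dynamic result sets it may produce, and cursors declared WITH RETURN
-- are handed back to the caller of CALL as extra result sets.
CREATE TABLE t1 (a int);
CREATE TABLE t2 (b text);

CREATE PROCEDURE two_result_sets()
LANGUAGE SQL
DYNAMIC RESULT SETS 2
AS $$
DECLARE c1 CURSOR WITH RETURN FOR SELECT * FROM t1;
DECLARE c2 CURSOR WITH RETURN FOR SELECT * FROM t2;
$$;

-- With a client that understands the protocol additions, CALL would return
-- the contents of c1 and c2 as dynamic result sets after its own response.
CALL two_result_sets();
```

This would only run against a server built with the patch applied, together with a client patched to display the additional result sets.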
Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW
On Tue, Mar 16, 2021 at 1:15 AM Tom Lane wrote:
>
> [ Sorry for not looking at this thread sooner ]
>
> Bharath Rupireddy writes:
> > Currently, $subject is not allowed. We do plan the mat view query
> > before every refresh. I propose to show the explain/explain analyze of
> > the select part of the mat view in case of Refresh Mat View (RMV).
>
> TBH, I think we should reject this. The problem with it is that it
> binds us to the assumption that REFRESH MATERIALIZED VIEW has an
> explainable plan. There are various people poking at ideas like
> incremental matview updates, which might rely on some implementation
> that doesn't exactly equate to a SQL query. Incremental updates are
> hard enough already; they'll be even harder if they also have to
> maintain compatibility with a pre-existing EXPLAIN behavior.
>
> I don't really see that this feature buys us anything you can't
> get by explaining the view's query, so I think we're better advised
> to keep our options open about how REFRESH might be implemented
> in future.

That makes sense to me. Thanks for the comments. I'm fine with withdrawing the patch.

I would like to see whether the 0001 patch (attached here) would be useful at all. It just splits the existing ExecRefreshMatView into a few functions to make the code more readable. I'm okay with withdrawing it too if no one agrees.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

From 26970ffd33e67324609a03f0f61eeb6406216a7f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Fri, 5 Mar 2021 15:47:12 +0530
Subject: [PATCH v7 1/2] Rearrange Refresh Mat View Code

Currently, the function ExecRefreshMatView in matview.c contains many lines of code, which is not good from a readability and maintainability perspective. This patch adds a few functions and moves code from ExecRefreshMatView into them, making the code easier to follow.
---
 src/backend/commands/matview.c | 449 -
 1 file changed, 269 insertions(+), 180 deletions(-)

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index c5c25ce11d..18e18fa627 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -64,7 +64,7 @@ static void transientrel_startup(DestReceiver *self, int operation, TupleDesc ty
 static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
-static uint64 refresh_matview_datafill(DestReceiver *dest, Query *query,
+static uint64 refresh_matview_datafill(Oid OIDNewHeap, Query *query,
 				const char *queryString);
 static char *make_temptable_name_n(char *tempname, int n);
 static void refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
@@ -73,6 +73,16 @@ static void refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersist
 static bool is_usable_unique_index(Relation indexRel);
 static void OpenMatViewIncrementalMaintenance(void);
 static void CloseMatViewIncrementalMaintenance(void);
+static Query *get_matview_query(RefreshMatViewStmt *stmt, Relation *rel,
+				Oid *objectId);
+static Query *rewrite_refresh_matview_query(Query *dataQuery);
+static Oid get_new_heap_oid(RefreshMatViewStmt *stmt, Relation matviewRel,
+			Oid matviewOid, char *relpersistence);
+static void match_matview_with_new_data(RefreshMatViewStmt *stmt,
+					Relation matviewRel, Oid matviewOid,
+					Oid OIDNewHeap, Oid relowner,
+					int save_sec_context,
+					char relpersistence, uint64 processed);

 /*
  * SetMatViewPopulatedState
@@ -140,127 +150,18 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
 {
 	Oid			matviewOid;
 	Relation	matviewRel;
-	RewriteRule *rule;
-	List	   *actions;
 	Query	   *dataQuery;
-	Oid			tableSpace;
-	Oid			relowner;
 	Oid			OIDNewHeap;
-	DestReceiver *dest;
 	uint64		processed = 0;
-	bool		concurrent;
-	LOCKMODE	lockmode;
+	Oid			relowner;
 	char		relpersistence;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
 	ObjectAddress address;

-	/* Determine strength of lock needed. */
-	concurrent = stmt->concurrent;
-	lockmode = concurrent ? ExclusiveLock : AccessExclusiveLock;
-
-	/*
-	 * Get a lock until end of transaction.
-	 */
-	matviewOid = RangeVarGetRelidExtended(stmt->relation,
-										  lockmode, 0,
-										  RangeVarCallbackOwnsTable, NULL);
-	matviewRel = table_open(matviewOid, NoLock);
-
-	/* Make sure it is a materialized view. */
-	if (matviewRel->rd_rel->relkind != RELKIND_MATVIEW)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("\"%s\" is not a materialized view",
-						RelationGetRelationName(matviewRel))));
-
-	/* Check that CONCURRENTLY is not specified if not populated. */
-	if (concurrent && !RelationIsPopulated(matviewRel))
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("CONCURRENTLY cannot be used when the material
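Pulling the new helper declarations out of the diff, the refactored ExecRefreshMatView presumably reduces to a short driver. The following is only a pseudocode sketch of that call structure inferred from the declared function names and signatures; the real ordering and arguments are in the patch itself:

```
ExecRefreshMatView(stmt, queryString, ...):
    dataQuery  = get_matview_query(stmt, &matviewRel, &matviewOid)
    dataQuery  = rewrite_refresh_matview_query(dataQuery)
    OIDNewHeap = get_new_heap_oid(stmt, matviewRel, matviewOid, &relpersistence)
    processed  = refresh_matview_datafill(OIDNewHeap, dataQuery, queryString)
    match_matview_with_new_data(stmt, matviewRel, matviewOid, OIDNewHeap,
                                relowner, save_sec_context,
                                relpersistence, processed)
```

Each helper encapsulates one of the stages previously inlined in the 300-plus-line function, which is the readability gain the commit message describes.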
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Mon, Mar 15, 2021 at 3:21 PM Avinash Kumar wrote:
> Hi,
>
> On Mon, Mar 15, 2021 at 1:18 PM Peter Geoghegan wrote:
>
>> On Mon, Mar 15, 2021 at 6:56 AM Avinash Kumar wrote:
>> > psql:amchecksql.sql:17: DEBUG: leaf block 1043751 of index
>> > "idx_id_mtime" has no first data item
>>
>> That one is harmless.
>>
>> > And one error as follows.
>> >
>> > psql:amchecksql.sql:17: ERROR: down-link lower bound invariant
>> > violated for index "some_other_index"
>>
>> That indicates corruption. Can you tie this back to the crash? Is it
>> the same index?
>
> No, that's not the same index. The index discussed in the previous
> messages shows the following output:
>
> DEBUG: verifying consistency of tree structure for index "idx_id_mtime" with cross-level checks
> DEBUG: verifying level 2 (true root level)
> DEBUG: verifying level 1
> DEBUG: verifying level 0 (leaf level)
> DEBUG: verifying that tuples from index "idx_id_mtime" are present in "player"
> DEBUG: finished verifying presence of 1966412 tuples from table "player" with bitset 29.89% set
> LOG: duration: 3341.755 ms statement: SELECT bt_index_parent_check(index => c.oid, heapallindexed => true), c.relname, c.relpages FROM pg_index i JOIN pg_opclass op ON i.indclass[0] = op.oid JOIN pg_am am ON op.opcmethod = am.oid JOIN pg_class c ON i.indexrelid = c.oid JOIN pg_namespace n ON c.relnamespace = n.oid WHERE am.amname = 'btree' AND c.relpersistence != 't' AND c.relkind = 'i' AND i.indisready AND i.indisvalid AND indexrelid = 80774 AND n.nspname = 'public' ORDER BY c.relpages DESC;
>
> bt_index_parent_check | relname      | relpages
> ----------------------+--------------+----------
>                       | idx_id_mtime |     8439
> (1 row)

I'm afraid this looks to me like a deduplication bug, but I'm not sure how it can be pinpointed. If there is something I could do to determine that, I would be more than happy to.

--
Regards,
Avi
Re: pg_subscription - substream column?
On Tue, Mar 16, 2021 at 7:20 PM Amit Kapila wrote:
>
> On Tue, Mar 16, 2021 at 3:35 AM Peter Smith wrote:
> >
> > I noticed that the PG docs [1] for the catalog pg_subscription don't
> > have any mention of the substream column.
> >
> > An accidental omission by commit 4648243 [2] from last year?
> >
>
> Right. I'll fix this unless you or someone else is interested in
> providing a patch for it.

Sure, I will send a patch for it tomorrow.

Kind Regards,
Peter Smith.
Fujitsu Australia.