Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Mar 12, 2021 at 8:38 PM vignesh C wrote:
> ...
> 1) I felt twophase_given can be a local variable; it need not be added
> as a function parameter, as it is not used outside the function.
>
> --- a/src/backend/commands/subscriptioncmds.c
> +++ b/src/backend/commands/subscriptioncmds.c
> @@ -67,7 +67,8 @@ parse_subscription_options(List *options,
>                             char **synchronous_commit,
>                             bool *refresh,
>                             bool *binary_given, bool *binary,
> -                           bool *streaming_given, bool *streaming)
> +                           bool *streaming_given, bool *streaming,
> +                           bool *twophase_given, bool *twophase)
>
> The corresponding changes should be done here too:
>
> @@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
>      bool        copy_data;
>      bool        streaming;
>      bool        streaming_given;
> +    bool        twophase;
> +    bool        twophase_given;
>      char       *synchronous_commit;
>      char       *conninfo;
>      char       *slotname;
>
> @@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
>                                  &synchronous_commit,
>                                  NULL, /* no "refresh" */
>                                  &binary_given, &binary,
> -                                &streaming_given, &streaming);
> +                                &streaming_given, &streaming,
> +                                &twophase_given, &twophase);

It was deliberately coded this way for consistency with the other new
PG14 options - e.g. it mimics exactly binary_given and streaming_given.
I know the param is not currently used by the caller and so could be a
local (as you say), but I felt the code-consistency and future-proofing
benefits outweighed the idea of reducing the code to the bare minimum
required to work just "because we can". So I don't plan to change this,
but if you still feel strongly that the parameter must be removed,
please give a convincing reason.

Kind Regards,
Peter Smith.
Fujitsu Australia.
Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL
At Wed, 17 Mar 2021 15:31:37 +0900 (JST), Kyotaro Horiguchi wrote in
> WAIT_EVENT_WAL_RECEIVER_MAIN (WalReceiverMain) is waiting for new data
> to arrive. This looks like an Activity to me.
>
> WAIT_EVENT_WAL_RECEIVER_WAIT_START is waiting for the startup process
> to kick me. So it may be either IPC or Activity. Since walreceiver
> hasn't sent anything to startup, it's Activity rather than IPC.
> However, the behavior can be said to convey a piece of information
> from startup to walreceiver, so it can also be said to be an IPC.
> (That is the reason why I don't object to IPC.)
>
> 1 (WAIT_EVENT_WAL_SENDER_MAIN, currently an Activity) is waiting for
> something to happen on the connection to the peer receiver/worker.
> This might be either an Activity or a wait_client, but I prefer it to
> be wait_client, as the same behavior of a client backend is
> categorized as wait_client.
>
> 2 (WAIT_EVENT_WAL_SENDER_WRITE_DATA, currently a wait_client) is the
> same as 1.
>
> 3 (WAIT_EVENT_WAL_SENDER_WAIT_WAL, currently a wait_client) is the
> same as 1.

- As a result I'd prefer to categorize all of them to Activity.

Yeah, I don't understand what I meant :(

+ As a result I'd prefer to categorize the first two to Activity, and
+ the last three to wait_client.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL
At Tue, 16 Mar 2021 15:42:27 +0900, Fujii Masao wrote in
> On 2021/03/16 11:59, Kyotaro Horiguchi wrote:
> > At Tue, 16 Mar 2021 03:12:54 +0900, Fujii Masao wrote in
> >> The wait event WalReceiverWaitStart has been categorized in the type
> >> Client. But why? Walreceiver is waiting for the startup process to
> >> set the lsn and timeline while it is reporting WalReceiverWaitStart.
> >> So its type should be IPC, instead?
> >>
> >> The wait event WalSenderWaitForWAL has also been categorized in the
> >> type Client. While this wait event is being reported, the logical
> >> replication walsender is waiting for not only new WAL to be flushed
> >> but also the socket to be readable and writeable (if there is pending
> >> data). I guess that this is why its type is Client. But ISTM
> >> walsender is *mainly* waiting for new WAL to be flushed by other
> >> processes during that period, so I think that it's better to use IPC
> >> as the type of the wait event WalSenderWaitForWAL. Thoughts?
> >
> > I agree that it's definitely not a client wait. It would be either
> > Activity or IPC. My reasoning for the latter is that it's similar to
> > WAIT_EVENT_WAL_RECEIVER_MAIN, since both are waits for
> > WalReceiverMain to continue, with the difference that in the latter
> > state walreceiver hears where to start.
> > I don't object if it were categorized to IPC, though.
>
> Ok. And on my further thought: there are three calls to WalSndWait()
> in walsender.c, as follows.
>
> 1. WalSndLoop() calls WalSndWait() with the wait event
>    "Activity: WalSenderMain". Both physical and logical replication
>    walsenders use this function.
> 2. WalSndWriteData() calls WalSndWait() with the wait event
>    "Client: WalSenderWriteData". Only logical replication walsender
>    uses this function.
> 3. WalSndWaitForWal() calls WalSndWait() with the wait event
>    "Client: WalSenderWaitForWAL". Only logical replication walsender
>    uses this function.
>
> These three WalSndWait() calls basically do the same thing, i.e., wait
> for the latch to be set, timeout, postmaster death, or the socket to
> become readable or writeable. So you may think that it's strange to
> categorize them differently. Maybe it's better to categorize all of
> them in Activity?

I think it'd be better if they are categorized by what they are waiting
for. Activity is waiting for something gating me to be released. IPC is
waiting for the response to a request previously sent to another
process. Wait-client is waiting for the peer over a network connection
to allow me to proceed. So whether the three fall into the same
category or not doesn't matter to me.

WAIT_EVENT_WAL_RECEIVER_MAIN (WalReceiverMain) is waiting for new data
to arrive. This looks like an Activity to me.

WAIT_EVENT_WAL_RECEIVER_WAIT_START is waiting for the startup process
to kick me. So it may be either IPC or Activity. Since walreceiver
hasn't sent anything to startup, it's Activity rather than IPC.
However, the behavior can be said to convey a piece of information from
startup to walreceiver, so it can also be said to be an IPC. (That is
the reason why I don't object to IPC.)

1 (WAIT_EVENT_WAL_SENDER_MAIN, currently an Activity) is waiting for
something to happen on the connection to the peer receiver/worker. This
might be either an Activity or a wait_client, but I prefer it to be
wait_client, as the same behavior of a client backend is categorized as
wait_client.

2 (WAIT_EVENT_WAL_SENDER_WRITE_DATA, currently a wait_client) is the
same as 1.

3 (WAIT_EVENT_WAL_SENDER_WAIT_WAL, currently a wait_client) is the same
as 1.

As a result I'd prefer to categorize all of them to Activity.

> Or it's better to categorize only WalSenderMain in Activity, and the
> others in IPC, because only WalSenderMain is reported in walsender's
> main loop.

I don't think 1, 2 and 3 are Activities. And Activity doesn't
necessarily mean the main loop, as I described.

And as I said, WAIT_EVENT_WAL_RECEIVER_WAIT_START seems in between IPC
and Activity, so I don't object to categorizing it to IPC.

> At least for me the latter is better because the former, i.e.,
> reporting three different events for walsender's activity in the main
> loop, seems a bit strange. Thoughts?

Maybe it's the difference in what one considers as the same event.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Wed, Mar 17, 2021 at 05:30:30AM +0100, Pavel Stehule wrote:
> On 17. 3. 2021 at 4:52, Michael Paquier wrote:
> > I am wondering whether it would be better to allow multiple aliases
> > though, and if it would bring more readability to the routines
> > written if these are treated equal to the top-most namespace which is
> > the routine name now, meaning that we would maintain not one, but N
> > top namespace labels that could be used as aliases of the root one.
>
> I do not have a strong opinion, but I am inclined to disallow this. I
> am afraid the code could become less readable.
>
> There is a precedent - SQL doesn't allow you to use table names as
> qualifiers when you have an alias.

+1

> But it is a very important question. The selected behavior strongly
> impacts the implementation.
>
> What is the more common opinion about it? 1. Should we allow the
> original top label or not? 2. Should we allow defining more top labels?

I also think that there should be a single usable top label; otherwise
it will lead to confusing code and can be a source of bugs.
Re: Assertion failure with barriers in parallel hash join
On Wed, Mar 17, 2021 at 6:17 PM Thomas Munro wrote:
> On Sat, Mar 6, 2021 at 9:56 AM Thomas Munro wrote:
> > While working on Melanie's Parallel Hash Full Join patch I remembered
> > that this (apparently extremely rare) race still needs fixing. Here
> > is a slightly tidied version, which I'm adding to the next CF for CI
> > coverage.
>
> Pushed and back-patched.

According to BF animal elver there is something wrong with this commit.
Looking into it.
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Mar 16, 2021 at 5:03 PM Amit Kapila wrote:
> On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote:
> >
> > Here's a new patch-set that implements this new solution proposed by
> > Amit. Patchset-v60 implements:
> >
>
> I have reviewed the latest patch and below are my comments; some of
> these might overlap with Vignesh's, as I haven't looked at his
> comments in detail.
> Review comments

Few more comments:
=
1.
+  subtwophase char
+
+  The two_phase commit current state:
+
+  'n' = two_phase mode was not requested, so is disabled.
+  'p' = two_phase mode was requested, but is pending enablement.
+  'y' = two_phase mode was requested, and is enabled.
+

Can we name the column subtwophasestate? And then describe it as we are
doing for srsubstate in pg_subscription_rel. Also, it might be better
to keep the names as: 'd' disabled, 'p' pending twophase enablement,
and 'e' twophase enabled.

  srsubstate char
  State code: i = initialize, d = data is being copied, f = finished
  table copy, s = synchronized, r = ready (normal replication)

2.
@@ -427,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		PQserverVersion(conn->streamConn) >= 14)
 		appendStringInfoString(&cmd, ", streaming 'on'");
 
+	if (options->proto.logical.twophase &&
+		PQserverVersion(conn->streamConn) >= 14)
+		appendStringInfoString(&cmd, ", two_phase 'on'");
+
 	pubnames = options->proto.logical.publication_names;
 	pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 	if (!pubnames_str)
@@ -453,6 +458,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+	if (options->logical && two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+

Why are we sending two_phase 'on' and " TWO_PHASE" separately? I think
we don't need to introduce a TWO_PHASE token in the grammar; let's
handle it via plugin_options, similar to what we do for 'streaming'.
Also, a similar change would be required for Create_Replication_Slot.

3.
+	/*
+	 * Do not allow toggling of two_phase option, this could
+	 * cause missing of transactions and lead to an inconsistent
+	 * replica.
+	 */
+	if (!twophase)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("cannot alter two_phase option")));
+

I think here you can either give a reference to worker.c to explain how
this could lead to an inconsistent replica, or expand the comments here
if the information is not present elsewhere.

4.
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2

I think it is better to name the new define
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM. Also mention in the comments, in
some way, that we are keeping the same version number for the stream
and two-phase defines because they got introduced in the same
release (14).

5. I have modified the comments atop worker.c to explain the design and
some of the problems clearly. See attached. If you are fine with this,
please include it in the next version of the patch.

-- 
With Regards,
Amit Kapila.

change_two_phase_desc_1.patch
Description: Binary data
Re: Getting better results from valgrind leak tracking
Hi,

On 2021-03-16 20:50:17 -0700, Andres Freund wrote:
> What I meant was that I didn't understand how there's not a leak
> danger when compilation fails halfway through, given that the context
> in question is below TopMemoryContext and that I didn't see a relevant
> TRY block. But probably there is something cleaning it up that I
> didn't see.

Looks like it's an actual leak:

postgres[2058957][1]=# DO $do$BEGIN CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $$BEGIN frakbar;END;$$;^C

postgres[2058957][1]=# SELECT count(*), SUM(total_bytes) FROM pg_backend_memory_contexts WHERE name = 'PL/pgSQL function';
┌───────┬────────┐
│ count │  sum   │
├───────┼────────┤
│     0 │ (null) │
└───────┴────────┘
(1 row)

Time: 1.666 ms

postgres[2058957][1]=# CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $$BEGIN frakbar;END;$$;
ERROR:  42601: syntax error at or near "frakbar"
LINE 1: ...ON foo() RETURNS VOID LANGUAGE plpgsql AS $$BEGIN frakbar;EN...
LOCATION:  scanner_yyerror, scan.l:1176
Time: 5.463 ms

(the same CREATE OR REPLACE FUNCTION was run twice more, failing with
the same syntax error each time)

postgres[2058957][1]=# SELECT count(*), SUM(total_bytes) FROM pg_backend_memory_contexts WHERE name = 'PL/pgSQL function';
┌───────┬───────┐
│ count │  sum  │
├───────┼───────┤
│     3 │ 24576 │
└───────┴───────┘
(1 row)

Something like

DO $do$
BEGIN
  FOR i IN 1 .. 10000 LOOP
    BEGIN
      EXECUTE $cf$
        CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql
        AS $f$BEGIN frakbar; END;$f$;
      $cf$;
    EXCEPTION WHEN others THEN
    END;
  END LOOP;
END;$do$;

will show the leak visible in top too (albeit slowly - a more
complicated statement will leak more quickly I think).

postgres[2059268][1]=# SELECT count(*), SUM(total_bytes) FROM pg_backend_memory_contexts WHERE name = 'PL/pgSQL function';
┌───────┬──────┐
│ count │ sum  │
├───────┼──────┤
│     1 │ 8192 │
└───────┴──────┘
(1 row)

The leak appears to be not new; I see it in 9.6 as well. This seems
like a surprisingly easy leak to trigger...

Looks like there's something else awry. The above DO statement takes
2.2s on a 13 assert build, but 32s on a master assert build. It's
spending a lot of time doing dependency lookups:

-   94.62%  0.01%  postgres  postgres  [.] CreateFunction
    - 94.61% CreateFunction
       - 94.56% ProcedureCreate
          - 89.68% deleteDependencyRecordsFor
             - 89.38% systable_getnext
                - 89.33% index_getnext_slot
                   - 56.00% index_fetch_heap
                      + 54.64% table_index_fetch_tuple
                        0.09% heapam_index_fetch_tuple
                   + 28.53% index_getnext_tid
                     2.77% ItemPointerEquals
                     0.10% table_index_fetch_tuple
                     0.09% btgettuple
                     0.03% index_getnext_tid

1000 iterations: 521ms
1000 iterations: 533ms
2000 iterations: 1670ms
3000 iterations: 3457ms
3000 iterations: 3457ms
10000 iterations: 31794ms

The quadratic-seeming nature made me wonder if someone broke killtuples
in this situation. And it seems that someone was me, in 623a9ba. We
need to bump xactCompletionCount in the subxid abort case as well...

Greetings,

Andres Freund
Re: PATCH: Attempt to make dbsize a bit more consistent
On Mon, Mar 15, 2021 at 03:10:59PM +0900, Michael Paquier wrote:
> Anyway, as mentioned by other people upthread, I am not really
> convinced either that we should have more flavors of size functions,
> particularly depending on the relkind, as this would be confusing for
> the end user. pg_relation_size() can already do this job for all
> relkinds that use segment files, so it should still be able to hold
> its ground and maintain a consistent behavior with what it does
> currently.

On top of the rest of my notes, there are two more things that would
face a behavior change if the existing functions were made to go
through table AMs, which would scan the data in the smgr:
- After a VACUUM, the relation would be reported with a size of 0,
  while that may not yet be the case for the on-disk files.
- Temporary tables of other sessions would become accessible.

So we would likely want a separate function. Another possibility, which
I find tempting, would be to push the calculation logic relying on
physical files down to the table AMs themselves with a new dedicated
callback (relation_size_physical?), as it seems to me that the most
important thing users want to know with this set of functions is how
much physical space is being consumed at one given point in time.
Attached is a small prototype I was just playing with.
--
Michael

diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 64cdaa4134..8bd5d33265 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -15,6 +15,7 @@
 
 #include "access/htup_details.h"
 #include "access/relation.h"
+#include "access/tableam.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
@@ -23,6 +24,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/numeric.h"
@@ -270,14 +272,28 @@
  * is no check here or at the call sites for that.
  */
 static int64
-calculate_relation_size(RelFileNode *rfn, BackendId backend, ForkNumber forknum)
+calculate_relation_size(Relation rel, ForkNumber forknum)
 {
 	int64		totalsize = 0;
 	char	   *relationpath;
 	char		pathname[MAXPGPATH];
 	unsigned int segcount = 0;
 
-	relationpath = relpathbackend(*rfn, backend, forknum);
+	/*
+	 * If the relation is related to a table AM, do its sizing directly
+	 * using its interface.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_RELATION ||
+		rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
+		rel->rd_rel->relkind == RELKIND_MATVIEW)
+	{
+		if (rel->rd_smgr && smgrexists(rel->rd_smgr, forknum))
+			return table_relation_size(rel, forknum);
+		else
+			return 0;
+	}
+
+	relationpath = relpathbackend(rel->rd_node, rel->rd_backend, forknum);
 
 	for (segcount = 0;; segcount++)
 	{
@@ -327,7 +343,7 @@ pg_relation_size(PG_FUNCTION_ARGS)
 	if (rel == NULL)
 		PG_RETURN_NULL();
 
-	size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
+	size = calculate_relation_size(rel,
 								   forkname_to_number(text_to_cstring(forkName)));
 
 	relation_close(rel, AccessShareLock);
@@ -352,8 +368,7 @@ calculate_toast_table_size(Oid toastrelid)
 	/* toast heap size, including FSM and VM size */
 	for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-		size += calculate_relation_size(&(toastRel->rd_node),
-										toastRel->rd_backend, forkNum);
+		size += calculate_relation_size(toastRel, forkNum);
 
 	/* toast index size, including FSM and VM size */
 	indexlist = RelationGetIndexList(toastRel);
@@ -366,8 +381,7 @@
 		toastIdxRel = relation_open(lfirst_oid(lc), AccessShareLock);
 		for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-			size += calculate_relation_size(&(toastIdxRel->rd_node),
-											toastIdxRel->rd_backend, forkNum);
+			size += calculate_relation_size(toastIdxRel, forkNum);
 		relation_close(toastIdxRel, AccessShareLock);
 	}
@@ -395,8 +409,7 @@ calculate_table_size(Relation rel)
 	 * heap size, including FSM and VM
 	 */
 	for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-		size += calculate_relation_size(&(rel->rd_node), rel->rd_backend,
-										forkNum);
+		size += calculate_relation_size(rel, forkNum);
 
 	/*
 	 * Size of toast relation
@@ -434,9 +447,7 @@ calculate_indexes_size(Relation rel)
 			idxRel = relation_open(idxOid, AccessShareLock);
 			for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-				size += calculate_relation_size(&(idxRel->rd_node),
-												idxRel->rd_backend,
-												forkNum);
+				size += calculate_relation_size(idxRel, forkNum);
 			relation_close(idxRel, AccessShareLock);
 		}
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409f..4b3ec17989 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -103,7 +103,7 @@
 INSE
Re: libpq debug log
At Wed, 17 Mar 2021 02:09:32 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> Alvaro-san,
>
> Thank you for taking your time to take a look at an incomplete patch.
> I thought I would ask you for a final check for commit after Iwata-san
> has reflected my review comments.
>
> I discussed your comments below with Iwata-san. Let me convey her
> opinions. (She is now focusing on fixing the patch.) We'd appreciate
> it if you could tweak her completed patch.
>
> > From: Alvaro Herrera
> > * There's no cross-check that the protocol message length word
> > corresponds to what we actually print. I think one easy way to
> > cross-check that would be to return the "count" from the
> > type-specific routines, and verify that it matches "length" in
> > pqTraceOutputMessage. (If the separate routines are removed and the
> > code put back in one giant switch, then the "count" is already
> > available when the switch ends.)
>
> We don't think the length check is necessary, because 1) for FE->BE,
> correct messages are always assembled, and 2) for BE->FE, the parsing
> and decomposition of messages have already checked the messages.

Maybe it is not for the sanity of the message bytes, but to check that
the logging stuff is implemented in sync with the message structure. If
the given message length differs from the length the logging facility
read after scanning the message bytes, it's a sign of something wrong
in *the logging feature*.

> > * If we have separate functions for each message type, then we can
> > have that function supply the message name by itself, rather than
> > have the separate protocol_message_type_f / _b arrays that print it.
>
> I felt those two arrays are clumsy and thought of suggesting to remove
> them. But I didn't, because the functions or case labels for each
> message type would have to duplicate the printing of the header
> fields: timestamp, message type, and length. Maybe we can change the
> order of output to "timestamp length message_type content", but I
> kind of prefer the current order. What do you think? (I can understand
> that removing the clumsy static arrays could be better than preserving
> the order of the output fields.)

+1 for removing the arrays.

> > * If we make each type-specific function print the message name,
> > then we need to make the same function print the message length,
> > because it comes after. So the function would have to receive the
> > length (so that it can be printed). But I think we should still have
> > the cross-check I mentioned in my first point occur in the main
> > pqTraceOutputMessage, not the type-specific function, for fear that
> > we will forget the cross-check in one of the functions and never
> > realize that we did.
>
> As mentioned above, we think the current structure is better for
> smaller and more readable code.
>
> > * I would make the pqTraceOutputInt16() function and siblings
> > advance the cursor themselves, actually. I think something like this:
> >
> > static int
> > pqTraceOutputInt16(const char *data, int *cursor, FILE *pfdebug)
> > {
> >     uint16  tmp;
> >     int     result;
> >
> >     memcpy(&tmp, data + *cursor, 2);
> >     *cursor += 2;
> >     result = (int) pg_ntoh16(tmp);
> >     fprintf(pfdebug, " #%d", result);
> >
> >     return result;
> > }
> >
> > (So the caller has to pass the original "data" pointer, not
> > "data+cursor"). This means the caller no longer has to do
> > "cursor+=N" at each place. Also, you get rid of the moveStrCursor()
> > which does not look good.
>
> That makes sense, because in fact the patch mistakenly added 4 when it
> should add 2. Also, the code would become smaller.

FWIW, that's what I suggested upthread :p  So +3.

me> *I*'d want to make the output functions move the reading pointer by
me> themselves.

> > * I'm not fond of the idea of prefixing "#" for 16-bit ints and no
> > prefix for 32-bit ints. Seems obscure and the output looks weird.
> > I would use a one-letter prefix for each type, "w" for 32-bit ints
> > (stands for "word") and "h" for 16-bit ints (stands for
> > "half-word"). Message length goes unadorned. Then you'd have lines
> > like this:
> >
> > 2021-03-15 08:10:44.324296 < RowDescription 35 h1 "spcoptions" w1213 h5 w1009 h65535 w-1 h0
> > 2021-03-15 08:10:44.324335 < DataRow 32 h1 w22 '{random_page_cost=3.0}'
>
> Yes, actually I felt something similar. On second thought, we think we
> don't have to have any prefix, because it doesn't help users. So we're
> removing the prefix. We don't have a strong opinion on adding some
> prefix.

It would help when the value is "255", which is confusing between -1
(or 255) in a byte and 255 in a 2-byte word. Or when the value looks
broken. I'd suggest "b"yte for 1 byte, "s"hort for 2 bytes, "l"ong for
4 bytes ('l' is confusing with '1', but anyway it is not needed).
Re: Assertion failure with barriers in parallel hash join
On Sat, Mar 6, 2021 at 9:56 AM Thomas Munro wrote:
> While working on Melanie's Parallel Hash Full Join patch I remembered
> that this (apparently extremely rare) race still needs fixing. Here
> is a slightly tidied version, which I'm adding to the next CF for CI
> coverage.

Pushed and back-patched.
Re: Regression tests vs SERIALIZABLE
Thomas Munro writes:
> On Tue, Mar 16, 2021 at 3:28 AM Tom Lane wrote:
>> Usually, if we issue a SET in the regression tests, we explicitly
>> RESET as soon thereafter as practical, so as to have a well-defined
>> scope where the script is running under unusual conditions.

> Oh, of course. Thanks.

> I was wrong to blame that commit, and there are many other tests that
> fail in the back branches. But since we were down to just one, I went
> ahead and fixed this in the master branch only.

Makes sense to me. Committed patch looks good.

			regards, tom lane
Re: pl/pgsql feature request: shorthand for argument and local variable references
On 17. 3. 2021 at 4:52, Michael Paquier wrote:
> On Mon, Mar 15, 2021 at 05:31:18AM +0100, Pavel Stehule wrote:
> > Thank you very much
>
> I am not much a fan of the implementation done here. Based on my
> understanding of the PL code, the namespace lookup is organized as a
> list of items, with the namespace associated to the routine name being
> the first one pushed to the list after plpgsql_ns_init() initializes
> the top of it. Then, the proposed patch hijacks the namespace lookup
> list by doing a replacement of the root label while parsing the
> routine, and then tweaks the lookup logic when a variable is qualified
> (aka name2 != NULL) to bump the namespace list one level higher, so
> that it does not look after the namespace with the routine name but
> instead fetches the one defined by ROUTINE_NAME. That seems rather
> bug-prone to me, to say the least.

I'll check what can be done. I am not sure if it is possible to push a
new label to the same level as the top label.

> No changes to plpgsql_compile_inline()?
>
> +   ereport(ERROR,
> +           (errcode(ERRCODE_SYNTAX_ERROR),
> +            errmsg("redundant option"),
> +            errhint("The command \"routine_label\" can be used only once in rutine."),
> +            plpgsql_scanner_errposition(location)));
>
> s/rutine/routine/
>
> +   /* Don't allow repeated redefination of routine label */
>
> redefination?

I'll check it. But an inline block has no top label.

> I am wondering whether it would be better to allow multiple aliases
> though, and if it would bring more readability to the routines written
> if these are treated equal to the top-most namespace which is the
> routine name now, meaning that we would maintain not one, but N top
> namespace labels that could be used as aliases of the root one.

I do not have a strong opinion, but I am inclined to disallow this. I
am afraid the code could become less readable.

There is a precedent - SQL doesn't allow you to use table names as
qualifiers when you have an alias.

But it is a very important question. The selected behavior strongly
impacts the implementation.

What is the more common opinion about it? 1. Should we allow the
original top label or not? 2. Should we allow defining more top labels?

Regards

Pavel

> --
> Michael
Re: Regression tests vs SERIALIZABLE
On Tue, Mar 16, 2021 at 3:28 AM Tom Lane wrote:
> Thomas Munro writes:
> > On Mon, Mar 15, 2021 at 5:24 PM Thomas Munro wrote:
> >> However, since commit 862ef372d6b, there *is* one test that fails
> >> if you run make installcheck against a cluster running with -c
> >> default_transaction_isolation=serializable: transaction.sql. Is
> >> that a mistake? Is it a goal to be able to run this test suite
> >> against all 3 isolation levels?
>
> > Here's a fix.
>
> Usually, if we issue a SET in the regression tests, we explicitly
> RESET as soon thereafter as practical, so as to have a well-defined
> scope where the script is running under unusual conditions.

Oh, of course. Thanks.

I was wrong to blame that commit, and there are many other tests that
fail in the back branches. But since we were down to just one, I went
ahead and fixed this in the master branch only.
Re: Getting better results from valgrind leak tracking
"Andres Freund" writes:
> On Tue, Mar 16, 2021, at 20:01, Tom Lane wrote:
>> That seems both unnecessary and impractical. We have to consider that
>> everything-still-reachable is an OK final state.

> I don't consider "still reachable" a leak. Just definitely unreachable.

OK, we're in violent agreement then -- I must've misread what you wrote.

>>> I think the run might have shown a genuine leak:

>> I didn't see anything like that after applying the fixes I showed before.

> I think it's actually unreachable memory (unless you count resetting
> the cache context), based on manually tracing the code... I'll try to
> repro.

I agree that having to reset CacheContext is not something we should
count as "still reachable", and I'm pretty sure that the existing
valgrind infrastructure doesn't count it that way.

As for the particular point about ParallelBlockTableScanWorkerData, I
agree with your question to David about why that's in TableScanDesc not
HeapScanDesc, but I can't get excited about it not being freed in
heap_endscan. That's mainly because I do not believe that anything as
complex as a heap or index scan should be counted on to be
zero-leakage. The right answer is to not do such operations in
long-lived contexts. So if we're running such a thing while switched
into CacheContext, *that* is the bug, not that heap_endscan didn't free
this particular allocation.

			regards, tom lane
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Mon, Mar 15, 2021 at 05:31:18AM +0100, Pavel Stehule wrote:
> Thank you very much

I am not much a fan of the implementation done here. Based on my
understanding of the PL code, the namespace lookup is organized as a
list of items, with the namespace associated to the routine name being
the first one pushed to the list after plpgsql_ns_init() initializes
the top of it. Then, the proposed patch hijacks the namespace lookup
list by doing a replacement of the root label while parsing the
routine, and then tweaks the lookup logic when a variable is qualified
(aka name2 != NULL) to bump the namespace list one level higher, so
that it does not look after the namespace with the routine name but
instead fetches the one defined by ROUTINE_NAME. That seems rather
bug-prone to me, to say the least.

No changes to plpgsql_compile_inline()?

+   ereport(ERROR,
+           (errcode(ERRCODE_SYNTAX_ERROR),
+            errmsg("redundant option"),
+            errhint("The command \"routine_label\" can be used only once in rutine."),
+            plpgsql_scanner_errposition(location)));

s/rutine/routine/

+   /* Don't allow repeated redefination of routine label */

redefination?

I am wondering whether it would be better to allow multiple aliases
though, and if it would bring more readability to the routines written
if these are treated equal to the top-most namespace which is the
routine name now, meaning that we would maintain not one, but N top
namespace labels that could be used as aliases of the root one.

--
Michael

signature.asc
Description: PGP signature
Re: Getting better results from valgrind leak tracking
Hi,

On Tue, Mar 16, 2021, at 20:01, Tom Lane wrote:
> Andres Freund writes:
> > On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
> >> It doesn't appear to me that we leak much at all, at least not if you
> >> are willing to take "still reachable" blocks as not-leaked.
>
> > Well, I think for any sort of automated testing - which I think would be
> > useful - we'd really need *no* leaks.
>
> That seems both unnecessary and impractical. We have to consider that
> everything-still-reachable is an OK final state.

I don't consider "still reachable" a leak. Just definitely unreachable. And with a few tweaks that seems like we could achieve that?

> > I think the run might have shown a genuine leak:
>
> > ==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
> > ==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
> > ==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
> > ==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
> > ==2048803==    by 0x265994: systable_beginscan (genam.c:453)
> > ==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
> > ==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)
>
> I didn't see anything like that after applying the fixes I showed before.
> There are a LOT of false positives from the fact that with our HEAD
> code, valgrind believes that everything in the catalog caches and
> most things in dynahash tables (including the relcache) are unreachable.

I think it's actually unreachable memory (unless you count resetting the cache context), based on manually tracing the code... I'll try to repro.

> > I do see a bunch of leaks bleats below fun:plpgsql_compile that I don't
> > yet understand. E.g.
>
> Those are probably a variant of what you were suggesting above, ie
> plpgsql isn't terribly careful not to leak random stuff while building
> a long-lived function parse tree. It's supposed to use a temp context
> for anything that might leak, but I suspect it's not thorough about it.
What I meant was that I didn't understand how there's not a leak danger when compilation fails halfway through, given that the context in question is below TopMemoryContext and that I didn't see a relevant TRY block. But probably there is something cleaning it up that I didn't see.

Andres
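The cleanup-on-error pattern being discussed here is PostgreSQL's PG_TRY/PG_CATCH. A stand-alone model of it, using plain setjmp/longjmp in place of the real macros (compile_function and func_cxt are illustrative names, not the actual plpgsql code):

```c
#include <setjmp.h>
#include <stdlib.h>
#include <stdbool.h>

static jmp_buf catch_buf;          /* stands in for PG_exception_stack */
static bool cleanup_ran = false;

/*
 * Toy "compile": builds a long-lived allocation (analogous to a function
 * parse tree kept below TopMemoryContext).  If compilation fails halfway
 * through, the catch block must delete the partially built state, or the
 * allocation leaks for the life of the process.
 */
static void *compile_function(bool fail_halfway)
{
    void *func_cxt = malloc(128);   /* long-lived on success */

    if (setjmp(catch_buf) == 0)     /* PG_TRY() */
    {
        if (fail_halfway)
            longjmp(catch_buf, 1);  /* models elog(ERROR) mid-compile */
        return func_cxt;            /* success: caller keeps the tree */
    }

    /* PG_CATCH(): free the half-built state (re-throw omitted here) */
    free(func_cxt);
    cleanup_ran = true;
    return NULL;
}
```

Without the catch-side free, the failure path would leave the allocation behind, which is exactly the leak danger raised above.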
Re: fdatasync performance problem with large number of DB files
On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao wrote: > On 2021/03/16 8:15, Thomas Munro wrote: > > I don't want to add a hypothetical sync_after_crash=none, because it > > seems like generally a bad idea. We already have a > > running-with-scissors mode you could use for that: fsync=off. > > I heard that some backup tools sync the database directory when restoring it. > I guess that those who use such tools might want the option to disable such > startup sync (i.e., sync_after_crash=none) because it's not necessary. Hopefully syncfs() will return quickly in that case, without doing any work? > They can skip that sync by fsync=off. But if they just want to skip only that > startup sync and make subsequent recovery (or standby server) work with > fsync=on, they would need to shutdown the server after that startup sync > finishes, enable fsync, and restart the server. In this case, since the server > is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync > would not be performed. This procedure is tricky. So IMO supporting > sync_after_crash=none would be helpful for this case and simple. I still do not like this footgun :-) However, perhaps I am being overly dogmatic. Consider the change in d8179b00, which decided that I/O errors in this phase should be reported at LOG level rather than ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to rebase over this) will PANIC in this case, because any syncing will be handled through the usual checkpoint codepaths. Do you think it would be OK to commit this feature with just "fsync" and "syncfs", and then to continue to consider adding "none" as a possible separate commit?
Re: [HACKERS] logical decoding of two-phase transactions
On Wed, Mar 17, 2021 at 8:07 AM vignesh C wrote: > > On Tue, Mar 16, 2021 at 7:22 PM Amit Kapila wrote: > > > > On Tue, Mar 16, 2021 at 6:22 PM vignesh C wrote: > > > > > > On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote: > > > > > > > > On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila > > > > wrote: > > >> > > > > > > 2) table_states_not_ready global variable is used immediately after > > > call to FetchTableStates, we can make FetchTableStates return the > > > value or get it as an argument to the function and the global > > > variables can be removed. > > > +static List *table_states_not_ready = NIL; > > > > > > > But we do update the states in the list table_states_not_ready in > > function process_syncing_tables_for_apply. So, the current arrangement > > looks good to me. > > But I felt we can do this without using global variables. > table_states_not_ready is used immediately after calling > FetchTableStates in AnyTablesyncsNotREADY and > process_syncing_tables_for_apply functions. It is not used anywhere > else. My point was we do not need to store this in global variables as > it is not needed elsewhere. > It might be possible, but I am not sure that is better than what we are currently doing; moreover, that is existing code and this patch has just encapsulated it in a function. Even if you think there is a better way, which I doubt, we can probably look at it as a separate patch. -- With Regards, Amit Kapila.
Re: subscriptionCheck failures
On Tue, Mar 16, 2021 at 6:22 PM osumi.takami...@fujitsu.com wrote: > > > To me, this correctly works because > the timing I put the while loop and stops the walsender > makes the DROP SUBSCRIPTION affects two slots. Any comments ? > No, your testing looks fine. I have also done a similar test. Pushed! -- With Regards, Amit Kapila.
Re: pg_subscription - substream column?
On Wed, Mar 17, 2021 at 4:56 AM Peter Smith wrote: > > On Wed, Mar 17, 2021 at 12:45 AM Amit Kapila wrote: > > > > > > Attached, please find the patch to update the description of substream > > in pg_subscription. > > > > I applied your patch and regenerated the PG docs to check the result. > > LGTM. > Pushed! -- With Regards, Amit Kapila.
Re: fdatasync performance problem with large number of DB files
On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao wrote: > Thanks for the patch! > > +When set to fsync, which is the default, > +PostgreSQL will recursively open and fsync > +all files in the data directory before crash recovery begins. > > Isn't this a bit misleading? This may cause users to misunderstand that > such fsync can happen only in the case of crash recovery. If I insert the following extra sentence after that one, is it better? "This applies whenever starting a database cluster that did not shut down cleanly, including copies created with pg_basebackup."
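For reference, the per-file flush the documentation text describes boils down to an open/fsync/close sequence repeated over the data directory. A minimal, portable sketch (sync_one_file is an illustrative name, not PostgreSQL code):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush one file's data and metadata to stable storage.  The startup
 * code sketched in the doc text does essentially this for every file
 * under the data directory when sync_after_crash=fsync (the default).
 * The syncfs setting instead issues one syncfs(2) call per involved
 * filesystem, which is why it can return quickly when little data is
 * dirty -- e.g. right after a backup tool has already synced everything.
 */
int sync_one_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);     /* report, rather than ignore, any error */
    close(fd);
    return rc;
}
```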
Re: Getting better results from valgrind leak tracking
Andres Freund writes:
> On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
>> It doesn't appear to me that we leak much at all, at least not if you
>> are willing to take "still reachable" blocks as not-leaked.

> Well, I think for any sort of automated testing - which I think would be
> useful - we'd really need *no* leaks.

That seems both unnecessary and impractical. We have to consider that everything-still-reachable is an OK final state.

> I think the run might have shown a genuine leak:
> ==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
> ==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
> ==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
> ==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
> ==2048803==    by 0x265994: systable_beginscan (genam.c:453)
> ==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
> ==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)

I didn't see anything like that after applying the fixes I showed before. There are a LOT of false positives from the fact that with our HEAD code, valgrind believes that everything in the catalog caches and most things in dynahash tables (including the relcache) are unreachable. I'm not trying to claim there are no leaks anywhere, just that the amount of noise from those issues swamps all the real problems. I particularly recommend not believing anything related to catcache or relcache if you haven't applied that admittedly-hacky patch.

> Hm. For me the number of leaks seem to stay the same with or without
> your changes related to this. Is this a USE_VALGRIND build?

Yeah ...

> I do see a bunch of leaks bleats below fun:plpgsql_compile that I don't
> yet understand. E.g.

Those are probably a variant of what you were suggesting above, ie plpgsql isn't terribly careful not to leak random stuff while building a long-lived function parse tree. It's supposed to use a temp context for anything that might leak, but I suspect it's not thorough about it.
We could chase that sort of thing after we clean up the other problems. regards, tom lane
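The interior-pointer situation described in point (1) of this thread can be reproduced in a few lines. ToyNode and CacheEntry below are illustrative stand-ins, not PostgreSQL's real dlist types:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyNode { struct ToyNode *prev, *next; } ToyNode;

typedef struct CacheEntry
{
    int     key;        /* offset 0: where the malloc'd block starts */
    double  payload;
    ToyNode node;       /* the list link lives mid-struct */
} CacheEntry;

/*
 * After pushing, the only live pointer to the block is &entry->node: an
 * interior pointer.  valgrind's leak checker does not know to subtract
 * offsetof(CacheEntry, node) to find the block start, so it reports the
 * block as possibly lost/unreachable even though the program can still
 * reach it -- hence the experiment of moving the link field to offset 0.
 */
ToyNode *push_entry(ToyNode *head, CacheEntry *entry)
{
    entry->node.prev = NULL;
    entry->node.next = head;
    return &entry->node;
}

/* Recovering the containing struct, as dlist_container() does. */
CacheEntry *entry_from_node(ToyNode *node)
{
    return (CacheEntry *) ((char *) node - offsetof(CacheEntry, node));
}
```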
Re: A new function to wait for the backend exit after termination
At Wed, 17 Mar 2021 07:01:39 +0530, Bharath Rupireddy wrote in
> Attaching v10 patch for further review.

The time-out mechanism doesn't count remainingtime as expected; concretely, it does the following:

    do
    {
        kill();
        WaitLatch(WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, waittime);
        ResetLatch(MyLatch);
        remainingtime -= waittime;
    } while (remainingtime > 0);

So WaitLatch doesn't consume as much time as the set waittime when the latch is set, and remainingtime decreases faster than real time at each iteration.

It wouldn't often happen in practice, but I am concerned about PID recycling. We can get rid of that fear by checking for our beentry instead of the PID. However, it seems to me that some additional function would be needed in pgstat.c so that we can check the real-time value of PgBackendStatus, which might be too much.

+   /* If asked to wait, check whether the timeout value is valid or not. */
+   if (wait && pid != MyProcPid)
+   {
+       timeout = PG_GETARG_INT64(2);
+
+       if (timeout <= 0)
+           ereport(ERROR,
+                   (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+                    errmsg("\"timeout\" must not be negative or zero")));

This means that pg_terminate_backend accepts negative timeouts when terminating myself, which looks odd. Is there any reason to reject 0 as a timeout?

+    * Wait only if requested and the termination is successful. Self
+    * termination is allowed but waiting is not.
+    */
+   if (wait && pid != MyProcPid && result)
+       result = pg_wait_until_termination(pid, timeout);

Why don't we wait for myself to be terminated? There's no guarantee that I will be terminated without failure. (I agree that it is not so useful, but I think there's no reason not to do so.)

The first suggested signature for pg_terminate_backend() with a timeout was pg_terminate_backend(pid, timeout). The current signature (pid, wait?, timeout) looks redundant. Maybe the reason for rejecting 0 as a timeout is that pg_terminate_backend(pid, true, 0) looks odd, but we could wait forever in that case (as other features do).
On the other hand, pg_terminate_backend(pid, false, 100) is apparently odd, but this patch doesn't seem to reject it. If there's no strong reason for the current signature, I would suggest pg_terminate_backend(pid, timeout), where it waits forever if timeout is zero and waits for the timeout if positive. Negative values are not accepted.

That being said, I didn't find the discussion about allowing a default timeout value by separating out the boolean; if that is the consensus on this thread, sorry for the noise.

+           ereport(WARNING,
+                   (errmsg("could not check the existence of the backend with PID %d: %m",
+                           pid)));
+           return false;

I think this is worth ERROR. We can avoid this handling if we look into the PgBackendStatus entry instead.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
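The accounting problem described above (remainingtime shrinking by the full waittime even when the wait returns early) can be avoided by recomputing the remaining time from a fixed deadline on every iteration. A stand-alone sketch, with now_ms, backend_alive, and wait_step standing in for GetCurrentTimestamp(), the process-existence check, and WaitLatch() respectively:

```c
#include <stdint.h>

typedef int64_t ms_t;

/*
 * Wait for a backend to terminate, up to timeout_ms milliseconds.
 * Returns 1 if it terminated in time, 0 on timeout.  The key point:
 * "remaining" is recomputed from a fixed deadline, so early wakeups
 * (e.g. latch set for an unrelated reason) do not make the loop give
 * up before the full timeout has actually elapsed.
 */
int wait_for_termination(ms_t timeout_ms,
                         ms_t (*now_ms)(void),
                         int (*backend_alive)(void),
                         void (*wait_step)(ms_t))
{
    ms_t deadline = now_ms() + timeout_ms;

    for (;;)
    {
        ms_t remaining;

        if (!backend_alive())
            return 1;                /* terminated within the timeout */
        remaining = deadline - now_ms();
        if (remaining <= 0)
            return 0;                /* genuine timeout */
        /* wait at most 100 ms per iteration, like the patch's waittime */
        wait_step(remaining < 100 ? remaining : 100);
    }
}
```

With this shape, an early wakeup simply causes another iteration with a smaller `remaining`, instead of double-counting time that was never spent waiting.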
Re: Getting better results from valgrind leak tracking
Hi,

For the second-last trace involving SearchCatCacheList (catcache.c:1691), the ctlist's members are stored in the cl->members array, where cl is returned at the end of SearchCatCacheList. Maybe this was not accounted for by valgrind?

Cheers

On Tue, Mar 16, 2021 at 7:31 PM Andres Freund wrote:
> Hi,
>
> David, there's a question about a commit of yours below, hence adding
> you.
>
> On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
> > Out of curiosity I poked at this for a little while.
>
> Cool.
>
> > It doesn't appear to me that we leak much at all, at least not if you
> > are willing to take "still reachable" blocks as not-leaked.
>
> Well, I think for any sort of automated testing - which I think would be
> useful - we'd really need *no* leaks. I know that I get a few bleats
> whenever I forget to set --leak-check=no. It's also not just postgres
> itself, but some of the helper tools...
>
> And it's not just valgrind, also gcc/clang sanitizers...
>
> > What I found out is that we have a lot of issues that seem to devolve
> > to valgrind not being sure that a block is referenced. I identified
> > two main causes of that:
> >
> > (1) We have a pointer, but it doesn't actually point right at the start
> > of the block. A primary culprit here is lists of thingies that use the
> > slist and dlist infrastructure. As an experiment, I moved the dlist_node
> > fields of some popular structs to the beginning, and verified that that
> > silences associated complaints. I'm not sure that we want to insist on
> > put-the-link-first as policy (although if we did, it could provide some
> > notational savings perhaps). However, unless someone knows of a way to
> > teach valgrind about this situation, there may be no other way to silence
> > those leakage complaints. A secondary culprit is people randomly applying
> > CACHELINEALIGN or the like to a palloc'd address, so that the address we
> > have isn't pointing right at the block start.
>
> Hm, do you still have a backtrace / suppression for one of those? I
> didn't see any in a quick (*) serial installcheck I just ran. Or I
> wasn't able to pinpoint them to this issue.
>
> I think the run might have shown a genuine leak:
>
> ==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
> ==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
> ==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
> ==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
> ==2048803==    by 0x265994: systable_beginscan (genam.c:453)
> ==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
> ==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)
> ==2048803==    by 0x83BE9A: SearchCatCache1 (catcache.c:1167)
> ==2048803==    by 0x85876A: SearchSysCache1 (syscache.c:1134)
> ==2048803==    by 0x84CDB3: RelationInitTableAccessMethod (relcache.c:1795)
> ==2048803==    by 0x84F807: RelationBuildLocalRelation (relcache.c:3554)
> ==2048803==    by 0x303C9D: heap_create (heap.c:395)
> ==2048803==    by 0x305790: heap_create_with_catalog (heap.c:1291)
> ==2048803==    by 0x41A327: DefineRelation (tablecmds.c:885)
> ==2048803==    by 0x6C96B6: ProcessUtilitySlow (utility.c:1131)
> ==2048803==    by 0x6C948A: standard_ProcessUtility (utility.c:1034)
> ==2048803==    by 0x6C865F: ProcessUtility (utility.c:525)
> ==2048803==    by 0x6C7409: PortalRunUtility (pquery.c:1159)
> ==2048803==    by 0x6C7636: PortalRunMulti (pquery.c:1305)
> ==2048803==    by 0x6C6B11: PortalRun (pquery.c:779)
> ==2048803==    by 0x6C05AB: exec_simple_query (postgres.c:1173)
> ==2048803==
> {
>    Memcheck:Leak
>    match-leak-kinds: definite
>    fun:palloc
>    fun:heap_beginscan
>    fun:table_beginscan_strat
>    fun:systable_beginscan
>    fun:SearchCatCacheMiss
>    fun:SearchCatCacheInternal
>    fun:SearchCatCache1
>    fun:SearchSysCache1
>    fun:RelationInitTableAccessMethod
>    fun:RelationBuildLocalRelation
>    fun:heap_create
>    fun:heap_create_with_catalog
>    fun:DefineRelation
>    fun:ProcessUtilitySlow
>    fun:standard_ProcessUtility
>    fun:ProcessUtility
>    fun:PortalRunUtility
>    fun:PortalRunMulti
>    fun:PortalRun
>    fun:exec_simple_query
> }
>
> Since 56788d2156fc heap_beginscan() allocates
>
>     scan->rs_base.rs_private =
>         palloc(sizeof(ParallelBlockTableScanWorkerData));
>
> in heap_beginscan(). But doesn't free it in heap_endscan().
>
> In most of the places heap scans are begun inside transient contexts,
> but not always. In the above trace for example
> RelationBuildLocalRelation switched to CacheMemoryContext, and nothing
> switched to something else.
>
> I'm a bit confused about the precise design of rs_private /
> ParallelBlockTableScanWorkerData, specifically why it's been added to
> TableScanDesc, instead of just adding it to HeapScanDesc? And why is it
> allocated unconditionally, instead of just for parallel scans?
>
> I don't think this is a false positive, even though it theoretically
> could be freed by resetting CacheMemoryContext (see below)?
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Mar 16, 2021 at 7:22 PM Amit Kapila wrote: > > On Tue, Mar 16, 2021 at 6:22 PM vignesh C wrote: > > > > On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote: > > > > > > On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila > > > wrote: > > >> > > > > 2) table_states_not_ready global variable is used immediately after > > call to FetchTableStates, we can make FetchTableStates return the > > value or get it as an argument to the function and the global > > variables can be removed. > > +static List *table_states_not_ready = NIL; > > > > But we do update the states in the list table_states_not_ready in > function process_syncing_tables_for_apply. So, the current arrangement > looks good to me. But I felt we can do this without using global variables. table_states_not_ready is used immediately after calling FetchTableStates in AnyTablesyncsNotREADY and process_syncing_tables_for_apply functions. It is not used anywhere else. My point was we do not need to store this in global variables as it is not needed elsewhere. We could change the return type or return it through a function argument in this case. Thoughts? Regards, Vignesh
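The refactoring being debated, reduced to its essentials: a file-level global filled as a side effect versus an out parameter owned by the caller. This is a toy contrast with illustrative names, not the real tablesync.c code:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct List { int length; } List;

static List not_ready;                       /* backing storage for the demo */
static List *table_states_not_ready = NULL;  /* the global under discussion */

/* Current shape: the result is delivered through a global, as a side effect. */
static void fetch_table_states_global(bool *started_tx)
{
    *started_tx = true;
    table_states_not_ready = &not_ready;
}

/* Suggested shape: the caller receives the list directly; no global needed. */
static void fetch_table_states_param(bool *started_tx, List **not_ready_out)
{
    *started_tx = true;
    *not_ready_out = &not_ready;
}
```

The out-parameter form makes the data flow explicit, at the cost of threading the list through every caller; the global form keeps call sites short, which is the trade-off the thread is weighing.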
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On 2021-03-16 20:51, Bharath Rupireddy wrote: On Mon, Mar 15, 2021 at 11:23 AM torikoshia wrote: On 2021-03-07 19:16, Bharath Rupireddy wrote: > On Fri, Feb 5, 2021 at 5:15 PM Bharath Rupireddy > wrote: >> >> pg_terminate_backend and pg_cancel_backend with postmaster PID produce >> "PID is not a PostgreSQL server process" warning [1], which >> basically implies that the postmaster is not a PostgreSQL process at >> all. This is a bit misleading because the postmaster is the parent of >> all PostgreSQL processes. Should we improve the warning message if the >> given PID is postmasters' PID? +1. I felt it was a bit confusing when reviewing a thread[1]. Hmmm. > I'm attaching a small patch that emits a warning "signalling > postmaster with PID %d is not allowed" for postmaster and "signalling > PostgreSQL server process with PID %d is not allowed" for auxiliary > processes such as checkpointer, background writer, walwriter. > > However, for stats collector and sys logger processes, we still get > "PID X is not a PostgreSQL server process" warning because they > don't have PGPROC entries(??). So BackendPidGetProc and > AuxiliaryPidGetProc will not help and even pg_stat_activity is not > having these processes' pid. I also ran into the same problem while creating a patch in [2]. I have not gone through that thread though. Is there any way we can detect those child processes (stats collector, sys logger) that are forked by the postmaster from a backend process? Thoughts? I couldn't find good ways to do that, and thus I'm now wondering about just changing the message. That is, changing it to something like "PID is not a PostgreSQL backend process". "backend process" is now defined as "Process of an instance which acts on behalf of a client session and handles its requests." in the Appendix. Yeah, that looks good to me. 
IIUC, we can just change the message from "PID is not a PostgreSQL server process" to "PID is not a PostgreSQL backend process" and we don't need to look for AuxiliaryProcs or PostmasterPid. Changing log messages can affect operations, especially when people monitor the log message strings, but improving "PID is not a PostgreSQL server process" does not seem to cause such problems. Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Re: Getting better results from valgrind leak tracking
Hi,

David, there's a question about a commit of yours below, hence adding you.

On 2021-03-16 19:36:10 -0400, Tom Lane wrote:
> Out of curiosity I poked at this for a little while.

Cool.

> It doesn't appear to me that we leak much at all, at least not if you
> are willing to take "still reachable" blocks as not-leaked.

Well, I think for any sort of automated testing - which I think would be useful - we'd really need *no* leaks. I know that I get a few bleats whenever I forget to set --leak-check=no. It's also not just postgres itself, but some of the helper tools...

And it's not just valgrind, also gcc/clang sanitizers...

> What I found out is that we have a lot of issues that seem to devolve
> to valgrind not being sure that a block is referenced. I identified
> two main causes of that:
>
> (1) We have a pointer, but it doesn't actually point right at the start
> of the block. A primary culprit here is lists of thingies that use the
> slist and dlist infrastructure. As an experiment, I moved the dlist_node
> fields of some popular structs to the beginning, and verified that that
> silences associated complaints. I'm not sure that we want to insist on
> put-the-link-first as policy (although if we did, it could provide some
> notational savings perhaps). However, unless someone knows of a way to
> teach valgrind about this situation, there may be no other way to silence
> those leakage complaints. A secondary culprit is people randomly applying
> CACHELINEALIGN or the like to a palloc'd address, so that the address we
> have isn't pointing right at the block start.

Hm, do you still have a backtrace / suppression for one of those? I didn't see any in a quick (*) serial installcheck I just ran. Or I wasn't able to pinpoint them to this issue.
I think the run might have shown a genuine leak:

==2048803== 16 bytes in 1 blocks are definitely lost in loss record 139 of 906
==2048803==    at 0x89D2EA: palloc (mcxt.c:975)
==2048803==    by 0x2392D3: heap_beginscan (heapam.c:1198)
==2048803==    by 0x264E8F: table_beginscan_strat (tableam.h:918)
==2048803==    by 0x265994: systable_beginscan (genam.c:453)
==2048803==    by 0x83C2D1: SearchCatCacheMiss (catcache.c:1359)
==2048803==    by 0x83C197: SearchCatCacheInternal (catcache.c:1299)
==2048803==    by 0x83BE9A: SearchCatCache1 (catcache.c:1167)
==2048803==    by 0x85876A: SearchSysCache1 (syscache.c:1134)
==2048803==    by 0x84CDB3: RelationInitTableAccessMethod (relcache.c:1795)
==2048803==    by 0x84F807: RelationBuildLocalRelation (relcache.c:3554)
==2048803==    by 0x303C9D: heap_create (heap.c:395)
==2048803==    by 0x305790: heap_create_with_catalog (heap.c:1291)
==2048803==    by 0x41A327: DefineRelation (tablecmds.c:885)
==2048803==    by 0x6C96B6: ProcessUtilitySlow (utility.c:1131)
==2048803==    by 0x6C948A: standard_ProcessUtility (utility.c:1034)
==2048803==    by 0x6C865F: ProcessUtility (utility.c:525)
==2048803==    by 0x6C7409: PortalRunUtility (pquery.c:1159)
==2048803==    by 0x6C7636: PortalRunMulti (pquery.c:1305)
==2048803==    by 0x6C6B11: PortalRun (pquery.c:779)
==2048803==    by 0x6C05AB: exec_simple_query (postgres.c:1173)
==2048803==
{
   Memcheck:Leak
   match-leak-kinds: definite
   fun:palloc
   fun:heap_beginscan
   fun:table_beginscan_strat
   fun:systable_beginscan
   fun:SearchCatCacheMiss
   fun:SearchCatCacheInternal
   fun:SearchCatCache1
   fun:SearchSysCache1
   fun:RelationInitTableAccessMethod
   fun:RelationBuildLocalRelation
   fun:heap_create
   fun:heap_create_with_catalog
   fun:DefineRelation
   fun:ProcessUtilitySlow
   fun:standard_ProcessUtility
   fun:ProcessUtility
   fun:PortalRunUtility
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
}

Since 56788d2156fc heap_beginscan() allocates

    scan->rs_base.rs_private =
        palloc(sizeof(ParallelBlockTableScanWorkerData));

but doesn't free it in heap_endscan().
In most of the places heap scans are begun inside transient contexts, but not always. In the above trace, for example, RelationBuildLocalRelation switched to CacheMemoryContext, and nothing switched to something else.

I'm a bit confused about the precise design of rs_private / ParallelBlockTableScanWorkerData, specifically why it's been added to TableScanDesc, instead of just adding it to HeapScanDesc? And why is it allocated unconditionally, instead of just for parallel scans?

I don't think this is a false positive, even though it theoretically could be freed by resetting CacheMemoryContext (see below)?

I saw a lot of false positives from autovacuum workers, because AutovacMemCxt is never deleted, and because table_toast_map is created in TopMemoryContext. Adding an explicit MemoryContextDelete(AutovacMemCxt) and parenting table_toast_map there shut that up.

Which brings me to the question why allocations in CacheMemoryContext and AutovacMemCxt are considered to be "lost", even though they're still "reachable" via a context reset?
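The leak pattern traced above, reduced to a toy begin/end pair (illustrative names, not the real heapam.c code). The two remedies implicit in the questions raised are: allocate the worker state only when it can actually be used, and free it in the matching end function:

```c
#include <stdlib.h>

typedef struct ToyScanDesc
{
    int   is_parallel;
    void *rs_private;       /* analogous to rs_base.rs_private */
} ToyScanDesc;

ToyScanDesc *toy_beginscan(int is_parallel)
{
    ToyScanDesc *scan = calloc(1, sizeof(ToyScanDesc));

    /* remedy 1: allocate per-worker state only for parallel scans,
     * instead of unconditionally */
    if (is_parallel)
        scan->rs_private = malloc(64);
    return scan;
}

void toy_endscan(ToyScanDesc *scan)
{
    /* remedy 2: release it in the matching end function;
     * free(NULL) is a harmless no-op for the serial case */
    free(scan->rs_private);
    free(scan);
}
```

Either remedy alone would have silenced the valgrind report when the scan is begun in a long-lived context, though (as noted elsewhere in the thread) running such scans in long-lived contexts is itself questionable.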
RE: libpq debug log
Alvaro-san,

Thank you for taking the time to look at an incomplete patch. I thought I would ask you for a final check before commit after Iwata-san has reflected my review comments. I discussed your comments below with Iwata-san; let me convey her opinions. (She is now focusing on fixing the patch.) We'd appreciate it if you could tweak her completed patch.

From: Alvaro Herrera
> * There's no cross-check that the protocol message length word
> corresponds to what we actually print. I think one easy way to
> cross-check that would be to return the "count" from the type-specific
> routines, and verify that it matches "length" in pqTraceOutputMessage.
> (If the separate routines are removed and the code put back in one
> giant switch, then the "count" is already available when the switch
> ends.)

We don't think the length check is necessary, because 1) for FE->BE, correct messages are always assembled, and 2) for BE->FE, the parsing and decomposition of messages have already checked the messages.

> * If we have separate functions for each message type, then we can have
> that function supply the message name by itself, rather than have the
> separate protocol_message_type_f / _b arrays that print it.

I felt those two arrays were clumsy and thought of suggesting their removal. But I didn't, because the functions or case labels for each message type would then have to duplicate the printing of the header fields: timestamp, message type, and length. Maybe we can change the order of output to "timestamp length message_type content", but I kind of prefer the current order. What do you think? (I can understand that removing the clumsy static arrays may be better than preserving the order of output fields.)

> * If we make each type-specific function print the message name, then we
> need to make the same function print the message length, because it
> comes after. So the function would have to receive the length (so
> that it can be printed).
> But I think we should still have the
> cross-check I mentioned in my first point occur in the main
> pqTraceOutputMessage, not the type-specific function, for fear that we
> will forget the cross-check in one of the functions and never realize
> that we did.

As mentioned above, we think the current structure is better for smaller and more readable code.

> * I would make the pqTraceOutputInt16() function and siblings advance
> the cursor themselves, actually. I think something like this:
>
> static int
> pqTraceOutputInt16(const char *data, int *cursor, FILE *pfdebug)
> {
>     uint16    tmp;
>     int       result;
>
>     memcpy(&tmp, data + *cursor, 2);
>     *cursor += 2;
>     result = (int) pg_ntoh16(tmp);
>     fprintf(pfdebug, " #%d", result);
>
>     return result;
> }
>
> (So the caller has to pass the original "data" pointer, not
> "data+cursor"). This means the caller no longer has to do "cursor+=N"
> at each place. Also, you get rid of the moveStrCursor() which does
> not look good.

That makes sense, because in fact the patch mistakenly added 4 when it should have added 2. Also, the code would become smaller.

> * I'm not fond of the idea of prefixing "#" for 16bit ints and no
> prefix for 32bit ints. Seems obscure and the output looks weird.
> I would use a one-letter prefix for each type, "w" for 32-bit ints
> (stands for "word") and "h" for 16-bit ints (stands for "half-word").
> Message length goes unadorned. Then you'd have lines like this
>
> 2021-03-15 08:10:44.324296 < RowDescription 35 h1 "spcoptions" w1213 h5 w1009 h65535 w-1 h0
> 2021-03-15 08:10:44.324335 < DataRow 32 h1 w22 '{random_page_cost=3.0}'

Yes, actually I felt something similar. On second thought, I think we don't have to have any prefix, because it doesn't help users. So we're removing the prefix. We don't have a strong opinion about adding some prefix.

> * I don't like that pqTraceOutputS/H print nothing when !toServer. I
> think they should complain.

Yes, the caller should not call the function if there's no message content.
That way, the function doesn't need the toServer argument. Regards Takayuki Tsunakawa
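Alvaro's cursor-advancing helper in stand-alone, compilable form. For illustration, pg_ntoh16 is replaced by the standard ntohs, the function is renamed trace_output_int16, and the "#" prefix is dropped per the decision above:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/*
 * Read a 16-bit big-endian integer from the message buffer at *cursor,
 * print it, and advance the caller's cursor by exactly the number of
 * bytes consumed.  Centralizing the cursor bump here prevents mistakes
 * like the one mentioned above, where a caller added 4 when it should
 * have added 2.
 */
int trace_output_int16(const char *data, int *cursor, FILE *pfdebug)
{
    uint16_t tmp;
    int      result;

    memcpy(&tmp, data + *cursor, 2);
    *cursor += 2;
    result = (int) ntohs(tmp);
    fprintf(pfdebug, " %d", result);
    return result;
}
```

The caller always passes the original buffer pointer plus the shared cursor, so successive calls naturally walk the message without any per-call-site arithmetic.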
Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW
On Tue, Mar 16, 2021 at 2:41 PM Thomas Munro wrote: > On Mon, Mar 15, 2021 at 8:25 PM Bharath Rupireddy > wrote: > > > > The problem with a case like REFRESH MATERIALIZED VIEW is that there's > > > > nothing to prevent something that gets run in the course of the query > > > > from trying to access the view (and the heavyweight lock won't prevent > > > > that, due to group locking). That's probably a stupid thing to do, > > > > but it can't be allowed to break the world. The other cases are safe > > > > from that particular problem because the table doesn't exist yet. > > > > Please correct me if my understanding of the above comment (from the > > commit e9baa5e9) is wrong - even if the leader opens the matview > > relation in exclusive mode, because of group locking(in case we allow > > parallel workers to feed in the data to the new heap that gets created > > for RMV, see ExecRefreshMatView->make_new_heap), can other sessions > > still access the matview relation with older data? > > > > I performed below testing to prove myself wrong for the above understanding: > > session 1: > > 1) added few rows to the table t1 on which the mv1 is defined; > > 2) refresh materialized view mv1; > > > > session 2: > > 1) select count(*) from mv1; ---> this query is blocked until > > session 1's step (2) is completed and gives the latest result even if > > the underlying data-generating query runs select part in parallel. > > > > Is there anything I'm missing here? > > I think he was talking about things like functions that try to access > the mv from inside the same query, in a worker. I haven't figured out > exactly which hazards he meant. I thought about wrong-relfilenode > hazards and combocid hazards, but considering the way this thing > always inserts into a fresh table before performing merge or swap > steps later, I don't yet see why this is different from any other > insert-from-select-with-gather. 
I asked Robert if he had some hazard in mind that we haven't already discussed here when he wrote that, and he didn't recall any. I think we're OK here.

I added the "concurrently" variant to the regression test, just to get it exercised too. The documentation needed a small tweak where we have a list of data-writing commands that are allowed to use parallelism. That run of sentences was getting a bit tortured, so I converted it into a bullet list; I hope I didn't upset the documentation style police. Pushed. Thanks for working on this! This is really going to fly with INSERT pushdown. The 3 merge queries used by CONCURRENTLY will take some more work.
Re: crash during cascaded foreign key update
On Tue, Mar 16, 2021 at 11:17 PM Tom Lane wrote: > Amit Langote writes: > > With HEAD (I think v12 and greater), I see $subject when trying out > > the following scenario: > > I wonder if this is related to > > https://www.postgresql.org/message-id/flat/89429.1584443208%40antos > > which we've still not done anything about. Ah, indeed the same issue. Will read through that discussion first, thanks. -- Amit Langote EDB: http://www.enterprisedb.com
Re: cleanup temporary files after crash
On Sun, Mar 14, 2021, at 11:01 PM, Thomas Munro wrote: > On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier wrote: > > On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote: > > > Let's move this patch forward. Based on the responses, I agree the > > > default behavior should be to remove the temp files, and I think we > > > should have the GUC (on the off chance that someone wants to preserve > > > the temporary files for debugging or whatever other reason). > > > > Thanks for taking care of this. I am having some second-thoughts > > about changing this behavior by default, still that's much more useful > > this way. > > +1 for having it on by default. > > I was also just looking at this patch and came here to say LGTM except > for two cosmetic things, below. Thanks for taking a look at this patch. I'm not sure whether Tomas is preparing a new patch that includes the suggested modifications, so I decided to do it. This new version has the new GUC name (using "remove"). I also replaced "cleanup" with "remove" in all the remaining places. As pointed out by Thomas, I reworded the paragraph that describes the GUC, moving the default information to the beginning of the sentence. I also added the "literal" markup as suggested by Michael. The postgresql.conf.sample was fixed. The tests were slightly modified. I reworded some comments and added a hack to avoid breaking the temporary file test on slow machines. A usleep() after sending the query provides some time for the query to create the temporary file. I used an arbitrary sleep (10 ms) that seems to be sufficient. > > +#remove_temp_files_after_crash = on# remove temporary files after > > +# # backend crash? > > The indentation of the second line is incorrect here (Incorrect number > > of spaces in tabs perhaps?), and there is no need for the '#' at the > > beginning of the line. > > Yeah, that's wrong.
For some reason that one file uses a tab size of > 8, unlike the rest of the tree (I guess because people will read that > file in software with the more common setting of 8). If you do :set > tabstop=8 in vim, suddenly it all makes sense, but it is revealed that > this patch has it wrong, as you said. (Perhaps this file should have > some of those special Vim/Emacs control messages so we don't keep > getting this wrong?) I hadn't noticed that this file uses ts=8. (This explains the misalignment that I observed in some parameter comments.) I'm not sure if people care about the indentation in this file. From the development perspective, we can use the same tab width as the source code, so there is no need to change your dev environment. However, a regular user doesn't follow the dev guidelines and could well observe the misalignment while using Vim (which defaults to ts=8), for example. For now, I will add autocmd BufRead,BufNewFile postgresql.conf* setlocal ts=8 to my .vimrc. We should probably fix some settings that are misaligned such as parallel_setup_cost and shared_preload_libraries. The parameters timezone_abbreviations and max_pred_locks_per_page are using spaces instead of tabs and should probably be fixed too. -- Euler Taveira EDB https://www.enterprisedb.com/ From 8dfee5694210c14054170563ff01fc15a66a76b0 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Mon, 25 May 2020 00:08:20 -0300 Subject: [PATCH v2] Control temporary files removal after crash A new GUC remove_temp_files_after_crash controls whether temporary files are removed after a crash. Successive crashes could result in useless storage usage until the service is restarted. That can be a problem on hosts with inadequate resources, and such manual intervention is undesirable in some environments. This GUC is marked as SIGHUP, so you don't have to interrupt the service to change it.
The new behavior introduces a minor backward incompatibility: Postgres will now reclaim temporary file space after a crash. The main reason is that we favor service continuity over debugging. --- doc/src/sgml/config.sgml | 18 ++ src/backend/postmaster/postmaster.c | 5 + src/backend/storage/file/fd.c | 12 +- src/backend/utils/misc/guc.c | 9 + src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/postmaster/postmaster.h | 1 + src/test/recovery/t/022_crash_temp_files.pl | 192 ++ 7 files changed, 234 insertions(+), 5 deletions(-) create mode 100644 src/test/recovery/t/022_crash_temp_files.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5e551b9f6d..c0b1b48136 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -9648,6 +9648,24 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + remove_temp_files_after_crash (boolean) + + remove_temp_files_after_crash configuration parameter + + + + +
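For reference, a sketch of how the corrected postgresql.conf.sample entry would look per Michael's and Thomas's comments (alignment is assumed here: the continuation line is indented with tabs at ts=8, and its '#' marks the comment column rather than starting the line):

```
#remove_temp_files_after_crash = on	# remove temporary files after
					# backend crash?
```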
Re: A new function to wait for the backend exit after termination
On Tue, Mar 16, 2021 at 9:48 PM Magnus Hagander wrote: > Does it really make sense that pg_wait_for_backend_termination() > defaults to waiting *100 milliseconds*, and then logs a warning? That > seems extremely short if I'm explicitly asking it to wait. I increased the default wait timeout to 5 seconds. > Wait events should be in alphabetical order in pgstat_get_wait_ipc() > as well, not just in the header (which was adjusted per Fujii's > comment) Done. > > + (errmsg("could not wait for the termination of > the backend with PID %d within %lld milliseconds", > > That's not true though? The wait succeeded, it just timed out? Isn't > it more like "backend with PID %d did not terminate within %lld > milliseconds"? Looks better. Done. Attaching v10 patch for further review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v10-0001-pg_terminate_backend-with-wait-and-timeout.patch Description: Binary data
Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021-03-17 08:25, Zhihong Yu wrote: Hi, Thanks for your comments! + * Simple signal handler for processes HAVE NOT yet touched or DO NOT I think there should be a 'which' between processes and HAVE. It seems the words in Capital letters should be in lower case. + * Simple signal handler for processes have touched shared memory to + * exit quickly. Add 'which' between processes and have. OK, I fixed them. unlink(fname); + + elog(DEBUG2, "removing stats file \"%s\"", fname); It seems the return value from unlink should be checked and reflected in the debug message. There is related code that logs and calls unlink() in slru.c and pgstat.c. ``` pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent) { // some code elog(DEBUG2, "removing temporary stats file \"%s\"", statfile); unlink(statfile); } ``` I fixed it in the same way, without checking the return value, because IIUC it's unimportant whether an error occurred; the log just shows that we tried to remove the file. Thoughts? Regards, -- Masahiro Ikeda NTT DATA CORPORATION

diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c index dd9136a942..df8e0eb081 100644 --- a/src/backend/postmaster/interrupt.c +++ b/src/backend/postmaster/interrupt.c @@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS) } /* - * Simple signal handler for exiting quickly as if due to a crash. + * Simple signal handler for processes which have not yet touched or do not + * touch shared memory to exit quickly. * - * Normally, this would be used for handling SIGQUIT. + * If processes already touched shared memory, call + * SignalHandlerForCrashExit() because shared memory may be corrupted. + */ +void +SignalHandlerForUnsafeExit(SIGNAL_ARGS) +{ + /* + * Since we don't touch shared memory, we can just pull the plug and exit + * without running any atexit handlers. + */ + _exit(1); +} + +/* + * Simple signal handler for processes which have touched shared memory to + * exit quickly.
+ * + * Normally, this would be used for handling SIGQUIT as if due to a crash. */ void SignalHandlerForCrashExit(SIGNAL_ARGS) @@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS) * shut down and exit. * * Typically, this handler would be used for SIGTERM, but some processes use - * other signals. In particular, the checkpointer exits on SIGUSR2, the - * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT - * or SIGTERM. + * other signals. In particular, the checkpointer exits on SIGUSR2 and the + * WAL writer exits on either SIGINT or SIGTERM. * * ShutdownRequestPending should be checked at a convenient place within the * main loop, or else the main loop should call HandleMainLoopInterrupts. diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index b1e2d94951..31b461eb18 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -724,6 +724,7 @@ pgstat_reset_remove_files(const char *directory) snprintf(fname, sizeof(fname), "%s/%s", directory, entry->d_name); + elog(DEBUG2, "removing stats file \"%s\"", fname); unlink(fname); } FreeDir(dir); @@ -4821,13 +4822,19 @@ PgstatCollectorMain(int argc, char *argv[]) /* * Ignore all signals usually bound to some action in the postmaster, - * except SIGHUP and SIGQUIT. Note we don't need a SIGUSR1 handler to - * support latch operations, because we only use a local latch. + * except SIGHUP, SIGTERM and SIGQUIT. Note we don't need a SIGUSR1 + * handler to support latch operations, because we only use a local latch. */ pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); - pqsignal(SIGTERM, SIG_IGN); - pqsignal(SIGQUIT, SignalHandlerForShutdownRequest); + pqsignal(SIGTERM, SignalHandlerForShutdownRequest); + + /* + * If SIGQUIT is received, exit quickly without doing any additional work, + * for example writing stats files. We arrange to do _exit(1) because the + * stats collector doesn't touch shared memory. 
+ */ + pqsignal(SIGQUIT, SignalHandlerForUnsafeExit); pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, SIG_IGN); @@ -4852,8 +4859,8 @@ PgstatCollectorMain(int argc, char *argv[]) AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL); /* - * Loop to process messages until we get SIGQUIT or detect ungraceful - * death of our parent postmaster. + * Loop to process messages until we get SIGTERM or SIGQUIT of our parent + * postmaster. * * For performance reasons, we don't want to do ResetLatch/WaitLatch after * every message; instead, do that only after a recv() fails to obtain a @@ -4871,7 +4878,7 @@ PgstatCollectorMain(int argc, char *argv[]) ResetLatch(MyLatch); /* - * Quit if we get SIGQUIT from the postmaster. + * Quit if we get SIGTERM from the postmaster. */ if (ShutdownRequestPending)
Getting better results from valgrind leak tracking
[ starting a new thread for this ] Andres Freund writes: > I wonder if it'd be worth starting to explicitly annotate all the places > that do allocations and are fine with leaking them. E.g. by introducing > malloc_permanently() or such. Right now it's hard to use valgrind et al > to detect leaks because of all the false positives due to such "ok to > leak" allocations. Out of curiosity I poked at this for a little while. It doesn't appear to me that we leak much at all, at least not if you are willing to take "still reachable" blocks as not-leaked. Most of the problem is more subtle than that. I found just a couple of things that really seem like leaks of permanent data structures to valgrind: * Where ps_status.c copies the original "environ" array (on PS_USE_CLOBBER_ARGV platforms), valgrind thinks that's all leaked, implying that it doesn't count the "environ" global as a valid reference to leakable data. I was able to shut that up by also saving the pointer into an otherwise-unused static variable. (This is sort of a poor man's implementation of your "malloc_permanently" idea; but I doubt it's worth working harder, given the small number of use-cases.) * The postmaster's sock_paths and lock_files lists appear to be leaked, but we're doing that to ourselves by throwing away the pointers to them without physically freeing the lists. We can just not do that. What I found out is that we have a lot of issues that seem to devolve to valgrind not being sure that a block is referenced. I identified two main causes of that: (1) We have a pointer, but it doesn't actually point right at the start of the block. A primary culprit here is lists of thingies that use the slist and dlist infrastructure. As an experiment, I moved the dlist_node fields of some popular structs to the beginning, and verified that that silences associated complaints. 
I'm not sure that we want to insist on put-the-link-first as policy (although if we did, it could provide some notational savings perhaps). However, unless someone knows of a way to teach valgrind about this situation, there may be no other way to silence those leakage complaints. A secondary culprit is people randomly applying CACHELINEALIGN or the like to a palloc'd address, so that the address we have isn't pointing right at the block start. (2) The only pointer to the start of a block is actually somewhere within the block. This is common in dynahash tables, where we allocate a slab of entries in a single palloc and then thread them together. Each entry should have exactly one referencing pointer, but that pointer is more likely to be elsewhere within the same palloc block than in the external hash bucket array. AFAICT, all cases except where the slab's first entry is pointed to by a hash bucket pointer confuse valgrind to some extent. I was able to hack around this by preventing dynahash from allocating more than one hash entry per palloc, but I wonder if there's a better way. Attached is a very crude hack, not meant for commit, that hacks things up enough to greatly reduce the number of complaints with "--leak-check=full". One thing I've failed to silence so far is a bunch of entries like ==00:00:03:56.088 3467702== 1,861 bytes in 67 blocks are definitely lost in loss record 1,290 of 1,418 ==00:00:03:56.088 3467702==at 0x950650: MemoryContextAlloc (mcxt.c:827) ==00:00:03:56.088 3467702==by 0x951710: MemoryContextStrdup (mcxt.c:1179) ==00:00:03:56.088 3467702==by 0x91C86E: RelationInitIndexAccessInfo (relcache.c:1444) ==00:00:03:56.088 3467702==by 0x91DA9C: RelationBuildDesc (relcache.c:1200) which is complaining about the memory context identifiers for system indexes' rd_indexcxt contexts. Those are surely not being leaked in any real sense. 
I suspect that this has something to do with valgrind not counting the context->ident fields as live pointers, but I don't have enough valgrind-fu to fix that. Anyway, the bottom line is that I do not think that we have all that many uses of the pattern you postulated originally. It's more that we've designed some valgrind-unfriendly data structures. We need to improve that situation to make much progress here. regards, tom lane diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 4c7b1e7bfd..cd984929a6 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -888,7 +888,6 @@ RemoveSocketFiles(void) (void) unlink(sock_path); } /* Since we're about to exit, no need to reclaim storage */ - sock_paths = NIL; } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 55c9445898..2abdd44190 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -814,7 +814,7 @@ InitCatCache(int id, * Note: we rely on zeroing to initialize all the dlist headers correctly */ sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE; - cp = (CatCache *) CACHELINEALIGN(
Re: pg_subscription - substream column?
On Wed, Mar 17, 2021 at 12:45 AM Amit Kapila wrote: > > > Attached, please find the patch to update the description of substream > in pg_subscription. > I applied your patch and regenerated the PG docs to check the result. LGTM. Kind Regards, Peter Smith. Fujitsu Australia
Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
Hi, + * Simple signal handler for processes HAVE NOT yet touched or DO NOT I think there should be a 'which' between processes and HAVE. It seems the words in Capital letters should be in lower case. + * Simple signal handler for processes have touched shared memory to + * exit quickly. Add 'which' between processes and have. unlink(fname); + + elog(DEBUG2, "removing stats file \"%s\"", fname); It seems the return value from unlink should be checked and reflected in the debug message. Thanks On Tue, Mar 16, 2021 at 4:09 PM Masahiro Ikeda wrote: > On 2021-03-16 13:44, kuroda.hay...@fujitsu.com wrote: > > Dear Ikeda-san > > > > I think the idea is good. > > > > I read the patch and other sources, and I found > > process_startup_packet_die also executes _exit(1). > > I think they can be combined into one function and moved to > > interrupt.c, but > > some important comments might be removed. What do you think? > > Hi, Kuroda-san. > Thanks for your comments. > > I agree with your idea. > I combined them into one function and moved the comments to > the calling function side. > (v2-0001-pgstat_avoid_writing_on_sigquit.patch) > > Regards, > -- > Masahiro Ikeda > NTT DATA CORPORATION
RE: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021-03-16 13:44, kuroda.hay...@fujitsu.com wrote: Dear Ikeda-san I think the idea is good. I read the patch and other sources, and I found process_startup_packet_die also executes _exit(1). I think they can be combined into one function and moved to interrupt.c, but some important comments might be removed. What do you think? Hi, Kuroda-san. Thanks for your comments. I agree with your idea. I combined them into one function and moved the comments to the calling function side. (v2-0001-pgstat_avoid_writing_on_sigquit.patch) Regards, -- Masahiro Ikeda NTT DATA CORPORATION

diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c index dd9136a942..9b25294a14 100644 --- a/src/backend/postmaster/interrupt.c +++ b/src/backend/postmaster/interrupt.c @@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS) } /* - * Simple signal handler for exiting quickly as if due to a crash. + * Simple signal handler for processes HAVE NOT yet touched or DO NOT + * touch shared memory to exit quickly. * - * Normally, this would be used for handling SIGQUIT. + * If processes already touched shared memory, call + * SignalHandlerForCrashExit() because shared memory may be corrupted. + */ +void +SignalHandlerForUnsafeExit(SIGNAL_ARGS) +{ + /* + * Since we don't touch shared memory, we can just pull the plug and exit + * without running any atexit handlers. + */ + _exit(1); +} + +/* + * Simple signal handler for processes have touched shared memory to + * exit quickly. + * + * Normally, this would be used for handling SIGQUIT as if due to a crash. */ void SignalHandlerForCrashExit(SIGNAL_ARGS) @@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS) * shut down and exit. * * Typically, this handler would be used for SIGTERM, but some processes use - * other signals.
In particular, the checkpointer exits on SIGUSR2 and the + * WAL writer exits on either SIGINT or SIGTERM. * * ShutdownRequestPending should be checked at a convenient place within the * main loop, or else the main loop should call HandleMainLoopInterrupts. diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index b1e2d94951..b2156cec9d 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -725,6 +725,8 @@ pgstat_reset_remove_files(const char *directory) snprintf(fname, sizeof(fname), "%s/%s", directory, entry->d_name); unlink(fname); + + elog(DEBUG2, "removing stats file \"%s\"", fname); } FreeDir(dir); } @@ -4821,13 +4823,19 @@ PgstatCollectorMain(int argc, char *argv[]) /* * Ignore all signals usually bound to some action in the postmaster, - * except SIGHUP and SIGQUIT. Note we don't need a SIGUSR1 handler to - * support latch operations, because we only use a local latch. + * except SIGHUP, SIGTERM and SIGQUIT. Note we don't need a SIGUSR1 + * handler to support latch operations, because we only use a local latch. */ pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); - pqsignal(SIGTERM, SIG_IGN); - pqsignal(SIGQUIT, SignalHandlerForShutdownRequest); + pqsignal(SIGTERM, SignalHandlerForShutdownRequest); + + /* + * If SIGQUIT is received, exit quickly without doing any additional work, + * for example writing stats files. We arrange to do _exit(1) because the + * stats collector doesn't touch shared memory. + */ + pqsignal(SIGQUIT, SignalHandlerForUnsafeExit); pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, SIG_IGN); @@ -4852,8 +4860,8 @@ PgstatCollectorMain(int argc, char *argv[]) AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL); /* - * Loop to process messages until we get SIGQUIT or detect ungraceful - * death of our parent postmaster. + * Loop to process messages until we get SIGTERM or SIGQUIT of our parent + * postmaster. 
* * For performance reasons, we don't want to do ResetLatch/WaitLatch after * every message; instead, do that only after a recv() fails to obtain a @@ -4871,7 +4879,7 @@ PgstatCollectorMain(int argc, char *argv[]) ResetLatch(MyLatch); /* - * Quit if we get SIGQUIT from the postmaster. + * Quit if we get SIGTERM from the postmaster. */ if (ShutdownRequestPending) break; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 6436ae0f48..4e1d47e0a1 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -400,7 +400,6 @@ static void SIGHUP_handler(SIGNAL_ARGS); static void pmdie(SIGNAL_ARGS); static void reaper(SIGNAL_ARGS); static void sigusr1_handler(SIGNAL_ARGS); -static void process_startup_packet_die(SIGNAL_ARGS); static void dummy_handler(SIGNAL_ARGS); static void StartupPacketTimeoutHandler(void); static void CleanupBack
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada wrote: > > Note that I've merged multiple existing functions in vacuumlazy.c into > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > > into a single function named vacuum_indexes_mark_unused() > I agree to create a function like vacuum_indexes_mark_unused() that > makes a decision and does index and heap vacuuming accordingly. But > what is the point of removing both lazy_vacuum_all_indexes() and > lazy_vacuum_heap()? I think we can simply have > vacuum_indexes_mark_unused() call those functions. I'm concerned that > if we add additional criteria to vacuum_indexes_mark_unused() in the > future the function will become very large. I agree now. I became overly excited about advertising the fact that these two functions are logically one thing. This is important, but it isn't necessary to go as far as actually making everything into one function. Adding some comments would also make that point clear, but without making vacuumlazy.c even more spaghetti-like. I'll fix it. > > I wonder if we can add some kind of emergency anti-wraparound vacuum > > logic to what I have here, for Postgres 14. > +1 > > I think we can set VACOPT_TERNARY_DISABLED to > tab->at_params.index_cleanup in table_recheck_autovac() or increase > the thresholds used to decide whether to skip index vacuuming. I was worried about the "tupgone = true" special case causing problems when we decide to do some index vacuuming and some heap vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM. But I now think that getting rid of "tupgone = true" gives us total freedom to choose what to do, including the freedom to start with index vacuuming and then give up on it later -- even after we do some amount of LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps due to a low maintenance_work_mem setting). That isn't actually special at all, because everything will be 100% decoupled.
Whether or not it's a good idea to either not start index vacuuming or to end it early (e.g. due to XID wraparound pressure) is a complicated question, and the right approach will be debatable in each case/when thinking about each issue. However, deciding on the best behavior to address these problems should have nothing to do with implementation details and everything to do with the costs and benefits to users. Which now seems possible. A sophisticated version of the "XID wraparound pressure" implementation could increase reliability without ever being conservative, just by evaluating the situation regularly and being prepared to change course when necessary -- until it is truly a matter of shutting down new XID allocations/the server. It should be possible to decide to end VACUUM early and advance relfrozenxid (provided we have reached the point of finishing all required pruning and freezing, of course). Highly agile behavior seems quite possible, even if it takes a while to agree on a good design. > Aside from whether it's good or bad, strictly speaking, it could > change the index AM API contract. The documentation of > amvacuumcleanup() says: > > --- > stats is whatever the last ambulkdelete call returned, or NULL if > ambulkdelete was not called because no tuples needed to be deleted. > --- > > With this change, we could pass NULL to amvacuumcleanup even though > the index might have tuples needed to be deleted. I think that we should add a "Note" box to the documentation that notes the difference here. Though FWIW my interpretation of the words "no tuples needed to be deleted" was always "no tuples needed to be deleted because vacuumlazy.c didn't call ambulkdelete()". After all, VACUUM can add to tups_vacuumed through pruning inside heap_prune_chain(). It is already possible (though only barely) to not call ambulkdelete() for indexes (to only call amvacuumcleanup() during cleanup) despite the fact that heap vacuuming did "delete tuples". 
It's not that important, but my point is that the design has always been top-down -- an index AM "needs to delete" whatever it is told it needs to delete. It has no direct understanding of any higher-level issues. > As I mentioned in a recent reply, I'm concerned about a case where we > ran out of maintenance_work_mem and decided not to skip index vacuuming > but collected only a few dead tuples in the second index vacuuming > (i.e., the total amount of dead tuples is slightly larger than > maintenance_work_mem). In this case, I think we can skip the second > (i.e., final) index vacuuming at least in terms of > maintenance_work_mem. Maybe the same is true in terms of LP_DEAD > accumulation. I remember that. That now seems very doable, but time grows short... I have already prototyped Andres' idea, which was to eliminate the HEAPTUPLE_DEAD case inside lazy_scan_heap() by restarting pruning for the same page. I've also moved the pruning into its own function called lazy_scan_heap_page(), because handling the restart requires that we b
Re: Boundary value check in lazy_tid_reaped()
We could also go parallel in another direction - I have been mulling over writing a "vectorized" bsearch which would use AVX2, where you look up 64 (or more likely 32, so tids also fit in a 256-bit vector) tids at a time. The trickiest part is that the search can complete at a different iteration for each value. Of course it is possible that this has a very bad RAM access behaviour and is no win at all even if it otherwise works. -- Hannu Krosing On Tue, Mar 16, 2021 at 10:08 PM Peter Geoghegan wrote: > On Sun, Mar 14, 2021 at 4:22 PM Thomas Munro > wrote: > > BTW I got around to trying this idea out for a specialised > > bsearch_itemptr() using a wide comparator, over here: > > Cool! > > I have another thing that should be considered when we revisit this > area in the future: maybe we should structure the binary search to > lookup multiple TIDs at once -- probably all of the TIDs from a given > leaf page, taken together. > > There is now an nbtree function used for tuple deletion (all tuple > deletion, not just bottom-up deletion) that works like this: > _bt_delitems_delete_check(). I suspect that it would make sense to > generalize it to do the same thing for regular VACUUM. Perhaps this > idea would have to be combined with other techniques to show a real > benefit. It would probably be necessary to sort the TIDs first (just > like index_delete_sort() does for the _bt_delitems_delete_check() code > today), but that's probably no big deal. > > It is typical to have 200 - 400 TIDs on an nbtree leaf page without > using deduplication. And with deduplication enabled you can have as > many as 1300 TIDs on a single 8KiB nbtree leaf page. It's easy to > imagine something like GCC's __builtin_prefetch() (or maybe just more > predictable access patterns in our "batch binary search") making > everything much faster through batching.
This will also naturally make > btvacuumpage() much less "branchy", since of course it will no longer > need to process the page one TID at a time -- that helps too. > > -- > Peter Geoghegan > > >
Re: Boundary value check in lazy_tid_reaped()
On Sun, Mar 14, 2021 at 4:22 PM Thomas Munro wrote: > BTW I got around to trying this idea out for a specialised > bsearch_itemptr() using a wide comparator, over here: Cool! I have another thing that should be considered when we revisit this area in the future: maybe we should structure the binary search to lookup multiple TIDs at once -- probably all of the TIDs from a given leaf page, taken together. There is now an nbtree function used for tuple deletion (all tuple deletion, not just bottom-up deletion) that works like this: _bt_delitems_delete_check(). I suspect that it would make sense to generalize it to do the same thing for regular VACUUM. Perhaps this idea would have to be combined with other techniques to show a real benefit. It would probably be necessary to sort the TIDs first (just like index_delete_sort() does for the _bt_delitems_delete_check() code today), but that's probably no big deal. It is typical to have 200 - 400 TIDs on an nbtree leaf page without using deduplication. And with deduplication enabled you can have as many as 1300 TIDs on a single 8KiB nbtree leaf page. It's easy to imagine something like GCC's __builtin_prefetch() (or maybe just more predictable access patterns in our "batch binary search") making everything much faster through batching. This will also naturally make btvacuumpage() much less "branchy", since of course it will no longer need to process the page one TID at a time -- that helps too. -- Peter Geoghegan
Re: [HACKERS] PATCH: Batch/pipelining support for libpq
Hi, On 2021-03-05 21:35:59 -0300, Alvaro Herrera wrote: > I'll take the weekend to think about the issue with conn->last_query and > conn->queryclass that I mentioned yesterday; other than that detail my > feeling is that this is committable, so I'll be looking at getting this > pushed early next weeks, barring opinions from others. It is *very* exciting to see this being merged. Thanks for all the work to all that contributed! Greetings, Andres Freund
Re: shared-memory based stats collector
Hi, On 2021-03-16 15:08:39 -0400, Robert Haas wrote: > On Mon, Mar 15, 2021 at 10:56 PM Andres Freund wrote: > > I did roughly the first steps of the split as I had outlined. I moved: > > > > 1) wait event related functions into utils/activity/wait_event.c / > >wait_event.h > > > > 2) "backend status" functionality (PgBackendStatus stuff) into > >utils/activity/backend_status.c > > > > 3) "progress" related functionality into > >utils/activity/backend_progress.c > > In general, I like this. I'm not too sure about the names. I realize > you don't want to have functions called status.c and progress.c, > because that's awfully generic, but now you have backend_progress.c, > backend_status.c, and wait_event.c, which makes the last one look a > little strange. Maybe command_progress.c instead of > backend_progress.c? I'm not thrilled about the names I ended up with either, so I'm happy to get some ideas. I did consider command_progress.c too - but that seems confusing because there's src/include/commands/progress.h, which is imo a different layer than what pgstat/backend_progress provide. So I thought it would make sense to split things up so that backend_progress.[ch] provides the place to store the progress values, while commands/progress.h defines the meaning of the values as used by in-core postgres commands. I could see us using the general progress infrastructure for things that'd not fit super well into commands/* at some point... But I'd also be ok with folding in commands/progress.h. > > I think 1 and 2 are good (albeit in need of further polish). I'm a bit > > less sure about 3: > > - There's a dependency from backend_status.h to backend_progress.h, > > because it uses PGSTAT_NUM_PROGRESS_PARAM etc. > > That doesn't seem like a catastrophe. > > > - it's a fairly small amount of code > > But it's not bad to have it separate. Agreed.
> > - I'm inclined to leave pgstat_report_{activity, tmpfile, appname, > > timestamp, ..} alone naming-wise, but to rename pgstat_bestart() to > > something like pgbestat_start()? > > I'd probably rename them e.g. command_progress_start(), > comand_progress_update_param(), etc. Hm. There's ~250 calls to pgstat_report_*. Which are you proposing to rename? In my mind there's at least the following groups with "inaccurately" overlapping names: 1) "stats reporting" functions, like pgstat_report_{bgwriter, archiver,...}(), pgstat_count_*(), pgstat_{init, end}_function_usage(), 2) "backend activity" functions, like pgstat_report_activity() 2.1) "wait event" functions, like pgstat_report_wait_{start,end}() 3) "stats control" functions, like pgstat_report_stat() 4) "stats reporting" fetch functions like pgstat_fetch_stat_dbentry() 5) "backend activity" fetch functions like pgstat_fetch_stat_numbackends(), pgstat_fetch_stat_beentry() I'd not quite group the progress functions as part of that, because they do already have a distinct namespace, even though perhaps pgstat_* isn't a great prefix. > > - backend_status.h needs miscadmin.h, due to BackendType. Imo that's a > > header we should try to avoid exposing in headers if possible. But > > right now there's no good place to move BackendType to. So I'd let > > this slide for now. > > Why the concern? miscadmin.h is extremely widely-included already. In .c files, but not from a lot of headers... There's only two places, pgstat.h and regex/regcustom.h. For me it a weird mix of things that should be included from very few places, and super widely used stuff. I already feel bad including it from .c files, but indirect includes in .h files imo should be as narrow as possible. > Maybe it should be broken up into pieces so that we're not including > so MUCH stuff in a zillion places, but the header that contains the > definition of CHECK_FOR_INTERRUPTS() is always going to be needed in a > ton of spots. 
Honestly, I wonder why we don't just put that part in > postgres.h. I'd not be against that. I'd personally put CFI() in a separate header, but include it from postgres.h. I don't think there's much else in there that should be as widely used. The closest is INTERRUPTS and CRIT_SECTION stuff, but that should be less frequent. Greetings, Andres Freund
Re: pg_amcheck contrib application
On Mon, Mar 15, 2021 at 10:10 PM Mark Dilger wrote: > It is unfortunate that the failing test only runs pg_amcheck after creating > numerous corruptions, as we can't know if pg_amcheck would have complained > about pg_statistic before the corruptions were created in other tables, or if > it only does so after. The attached patch v7-0003 adds a call to pg_amcheck > after all tables are created and populated, but before any corruptions are > caused. This should help narrow down what is happening, and doesn't hurt to > leave in place long-term. > > I don't immediately see anything wrong with how pg_statistic uses a > pseudo-type, but it leads me to want to poke a bit more at pg_statistic on > hornet and tern, though I don't have any regression tests specifically for > doing so. > > Tests v7-0001 and v7-0002 are just repeats of the tests posted previously. Since we now know that shutting autovacuum off makes the problem go away, I don't see a reason to commit 0001. We should fix pg_amcheck instead, if, as presently seems to be the case, that's where the problem is. I just committed 0002. I think 0003 is probably a good idea, but I haven't committed it yet. As for 0004, it seems to me that we might want to do a little more rewording of these messages and perhaps we should try to do it all at once. Like, for example, your other change to print out the toast value ID seems like a good idea, and could apply to any new messages as well as some existing ones. Maybe there are also more fields in the TOAST pointer for which we could add checks. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 11:40 AM, Robert Haas wrote: > > On Tue, Mar 16, 2021 at 2:22 PM Tom Lane wrote: >> I'm circling back around to the idea that amcheck is trying to >> validate TOAST references that are already dead, and it's getting >> burnt because something-or-other has already removed the toast >> rows, though not the referencing datums. That's legal behavior >> once the rows are marked dead. Upthread it was claimed that >> amcheck isn't doing that, but this looks like a smoking gun to me. > > I think this theory has some legs. From check_tuple_header_and_visibilty(): > >else if (!(infomask & HEAP_XMAX_COMMITTED)) >return true;/* > HEAPTUPLE_DELETE_IN_PROGRESS or > * > HEAPTUPLE_LIVE */ >else >return false; /* > HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */ >} >return true;/* not dead */ > } > > That first case looks wrong to me. Don't we need to call > get_xid_status() here, Mark? As coded, it seems that if the xmin is ok > and the xmax is marked committed, we consider the tuple checkable. The > comment says it must be HEAPTUPLE_DELETE_IN_PROGRESS or > HEAPTUPLE_LIVE, but it seems to me that if the actual answer is either > HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD depending on whether the > xmax is all-visible. And in the second case the comment says it's > either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, but I think in that > case it's either HEAPTUPLE_DELETE_IN_PROGRESS or > HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on the XID > status. > > Another thought here is that it's probably not wicked smart to be > relying on the hint bits to match the actual status of the tuple in > cases of corruption. Maybe we should be warning about tuples that are > have xmin or xmax flagged as committed or invalid when in fact clog > disagrees. That's not a particularly uncommon case, and it's hard to > check. 
This code was not committed as part of the recent pg_amcheck work, but longer ago, and I'm having trouble reconstructing exactly why it was written that way. Changing check_tuple_header_and_visibilty() fixes the regression test and also manual tests against the "regression" database that I've been using. I'd like to ponder the changes a while longer before I post, but the fact that these changes fix the tests seems to add credibility to this theory. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 2:54 PM Andres Freund wrote: > Oh, I guess it would make sense to do it that way. However, I was just > thinking of doing the iteration over the tuples that ExecEvalRow() etc > do inside heap_form_flattened_tuple() (or whatever). That'd not be any > worse than what the patch is doing now, just less duplication, and an > easier path towards optimizing it if we notice that we need to? It's a question of whether you copy the datum array. I don't think a generic function can assume that it's OK to scribble on the input array, or if it does, that'd better be very prominently mentioned in the comments. And copying into a new array has its own costs. 0002 is based on the theory that scribbling on the executor's array won't cause any problem, which I *think* is true, but isn't correct in all cases (e.g. if the input data is coming from a slot). If we pass a flag down to fill_val() and friends then we don't end up having to copy the arrays over so the problem goes away in that design. > The harder part would probably be to find a way to deal with the layers > above, without undue code duplication. I think it's not just fill_val() > that'd need to know, but also heap_compute_data_size(), > heap_fill_tuple() - both of which are externally visible (and iirc thus > not going to get inlined with many compiler options, due to symbol > interposition dangers). But we could have a > heap_compute_data_size_internal(bool flatten) that's called by > heap_compute_data_size(). And something similar for heap_form_tuple(). Hmm, yeah, that's not great. I guess there's nothing expensive we need to repeat - I think anyway - because we should be able to get the uncompressed size from the TOAST pointer itself. But the code would have to know to do that, as you say. -- Robert Haas EDB: http://www.enterprisedb.com
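The copy-vs-scribble question above — whether a generic flattening helper may modify its caller's datum array — admits a copy-on-write middle ground. The sketch below is a standalone toy model, not PostgreSQL's actual API: the Datum stand-in, datum_is_compressed(), and flatten_values() are all invented for illustration. The caller's array is left untouched, and a private copy is allocated only when at least one entry actually needs flattening, so the common all-flat case pays nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a Datum: negative values mark "compressed" entries. */
typedef long Datum;

static bool datum_is_compressed(Datum d) { return d < 0; }
static Datum datum_flatten(Datum d) { return -d; }  /* "decompress" */

/*
 * Copy-on-write flattening pass: never scribbles on 'values'.  Returns
 * the array to pack the tuple from; the caller frees the result iff it
 * differs from 'values'.
 */
static const Datum *
flatten_values(const Datum *values, int nvalues)
{
    Datum *copy = NULL;

    for (int i = 0; i < nvalues; i++)
    {
        if (!datum_is_compressed(values[i]))
            continue;
        if (copy == NULL)       /* first hit: pay the copy cost now */
        {
            copy = malloc(nvalues * sizeof(Datum));
            memcpy(copy, values, nvalues * sizeof(Datum));
        }
        copy[i] = datum_flatten(values[i]);
    }
    return copy ? copy : values;
}
```

A wrapper in this shape would keep the cost of the extra array confined to tuples that really contain a differently-compressed datum, which is one way to square Robert's concern about copying with Andres's wish to centralize the iteration.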
Re: pg_amcheck contrib application
On 3/13/21 1:30 AM, Andres Freund wrote: > Hi, > > On 2021-03-13 01:22:54 -0500, Tom Lane wrote: >> Mark Dilger writes: >>> On Mar 12, 2021, at 10:16 PM, Noah Misch wrote: hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no "prove64".) >> Oh, that's annoying. > I suspect we could solve that by changing our /usr/bin/prove > invocation to instead be PERL /usr/bin/prove? That might be a good thing > independent of this issue, because it's not unreasonable for a user to > expect that we'd actually use the perl installation they configured... > > Although I do not know how prove determines the perl installation it's > going to use for the test scripts... > There's a very good chance this would break msys builds, which are configured to build against a pure native (i.e. non-msys) perl such as AS or Strawberry, but need to run msys perl for TAP tests, so it gets the paths right. (Don't get me started on the madness that can result from managing this. I've lost weeks of my life to it ... If you add cygwin into the mix and you're trying to coordinate builds among three buildfarm animals it's a major pain.) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Calendar support in localization
On Wed, Mar 17, 2021 at 6:31 AM Surafel Temesgen wrote: > Ethiopice calendar have 13 months so it can not be stored as date and > timestamp type and you approach seems more complicated and i suggest to have > this feature on the purpose of PostgreSQL popularities too not only for my > need I know, but the DATE and TIMESTAMPTZ datatypes don't intrinsically know anything about months or other calendar concepts. Internally, they are just a single number that counts the number of days or seconds since an arbitrary epoch time. We are all in agreement about how many times the Earth has rotated since then*. The calendar concepts such as "day", "month", "year", whether Gregorian, Ethiopic, Islamic, ... are all derivable from those numbers, if you know the rules. So I think you should seriously consider using the same types. > Each calendar-aware date arithmetic is different so solving one calendar > problem didn't help on other calendar They have a *lot* in common though. They have similar "fields" (day, month, year etc), based on the Earth, moon, sun etc, so it is possible to use a common abstraction to interact with them. I haven't studied it too closely, but it looks like ICU can give you a "Calendar" object for a given Locale (which you create from a string like "am_ET@calendar=traditional") and timezone ("Africa/Addis_Ababa"). Then you can set the object's time to X seconds since an epoch, based on UTC seconds without leap seconds -- which is exactly like our TIMESTAMPTZ's internal value -- and then you can query it to get fields like month etc. Or do the opposite, or use formatting and parsing routines etc. Internally, ICU has a C++ class for each calendar with a name like EthiopicCalendar, IslamicCalendar etc which encapsulates all the logic, but it's not necessary to use them directly: we could just look them up with names via the C API and then treat them all the same. 
> I think you suggesting this by expecting the implementation is difficult but > it's not that much difficult once you fully understand Gregorian calendar and > the calendar you work on Yeah, I am sure it's all just a bunch of simple integer maths. But I'm talking about things like software architecture, maintainability, cohesion, and getting maximum impact for the work we do. I may be missing some key detail though: why do you think it should be a different type? The two reasons I can think of are: (1) the slightly tricky detail that the date apparently changes at 1:00am (which I don't think is a show stopper for this approach, I could elaborate), (2) you may want dates to be formatted on the screen with the Ethiopic calendar in common software like psql and GUI clients, which may be easier to arrange with different types, but that seems to be a cosmetic thing that could eventually be handled with tighter locale integration with ICU. In the early stages you'd access calendar logic though special functions with names like icu_format_date(), or whatever. Maybe I'm totally wrong about all of this, but this is the first way I'd probably try to tackle this problem, and I suspect it has the highest chance of eventually being included in core PostgreSQL. *I mean, we can discuss the different "timelines" like UT, UTC, TAI etc, but that's getting into the weeds, the usual timeline for computer software outside specialist scientific purposes is UTC without leap seconds.
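For what it's worth, the "bunch of simple integer maths" really is compact. The sketch below converts between a linear day count — analogous to how DATE stores days since an epoch, here in Rata Die numbering — and an Ethiopic year/month/day, using the widely documented rules: 12 months of 30 days plus a 13th month (Pagume) of 5 or 6 days, with a leap year whenever year % 4 == 3. The epoch constant and formulas follow the standard calendrical-calculation treatment; this is an illustration of the arithmetic, not ICU's implementation.

```c
#include <assert.h>

/* Rata Die (fixed day) number of Ethiopic 1/1/1 (Amete Mihret era). */
#define ETHIOPIC_EPOCH 2796

typedef struct
{
    int year, month, day;
} EthiopicDate;

/* Days from the fixed-day origin to the given Ethiopic date. */
static int
fixed_from_ethiopic(EthiopicDate e)
{
    return ETHIOPIC_EPOCH - 1
        + 365 * (e.year - 1) + e.year / 4   /* whole years, incl. leaps */
        + 30 * (e.month - 1) + e.day;       /* months are uniformly 30 days */
}

/* Inverse: fixed day number back to Ethiopic year/month/day. */
static EthiopicDate
ethiopic_from_fixed(int date)
{
    EthiopicDate e;

    e.year = (4 * (date - ETHIOPIC_EPOCH) + 1463) / 1461;
    e.month = (date - fixed_from_ethiopic((EthiopicDate){e.year, 1, 1})) / 30 + 1;
    e.day = date + 1 - fixed_from_ethiopic((EthiopicDate){e.year, e.month, 1});
    return e;
}
```

Point being: the conversion layer is small enough to live behind functions (or a library like ICU) operating on the existing DATE/TIMESTAMPTZ internal values, rather than requiring a new on-disk type.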
Re: automatic analyze: readahead - add "IO read time" log message
Greetings, * Stephen Frost (sfr...@snowman.net) wrote: > > >> I think e.g. prefetch_targblock could be moved to the next #ifdef, which > > >> will eliminate the one-line ifdef. > > > > > > Sure, done in the attached. > > > > > > Thanks for the review! Unless there's other comments, I'll plan to push > > > this over the weekend or early next week. > > > > +1 > > Thanks again for the review! and pushed. Thanks everyone for the suggestions and reviews, and to Jakub in particular for starting us down this path towards improving things in ANALYZE. Stephen
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 11:21 AM Dilip Kumar wrote: > If that is only the argument then it's possible today as well. I mean > you can INSERT INTO .. SELECT FROM where source attribute as > compressed data but the target attribute as external storage then also > we will move the compressed data as it is to the target table. Uggh. I don't like that behavior either, but I guess if it's the long-established way things work then perhaps this is no worse. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 2:45 PM Tom Lane wrote: > > In what context is it OK to just add extra alignment padding? > > It's *not* extra, according to pg_statistic's tuple descriptor. > Both forming and deforming of pg_statistic tuples should honor > that and locate stavaluesX values on d-aligned boundaries. > > It could be that a particular entry is of an array type that > only requires i-alignment. But that doesn't break anything, > it just means we inserted more padding than an omniscient > implementation would do. OK, yeah, I just misunderstood what you were saying. -- Robert Haas EDB: http://www.enterprisedb.com
Re: shared-memory based stats collector
On Mon, Mar 15, 2021 at 10:56 PM Andres Freund wrote: > I did roughly the first steps of the split as I had outlined. I moved: > > 1) wait event related functions into utils/activity/wait_event.c / >wait_event.h > > 2) "backend status" functionality (PgBackendStatus stuff) into >utils/activity/backend_status.c > > 3) "progress" related functionality into >utils/activity/backend_progress.c In general, I like this. I'm not too sure about the names. I realize you don't want to have functions called status.c and progress.c, because that's awful generic, but now you have backend_progress.c, backend_status.c, and wait_event.c, which makes the last one look a little strange. Maybe command_progress.c instead of backend_progress.c? > I think 1 and 2 are good (albeit in need of further polish). I'm a bit > less sure about 3: > - There's dependency from backend_status.h to backend_progress.h, > because it uses PGSTAT_NUM_PROGRESS_PARAM etc. That doesn't seem like a catastrophe. > - it's a fairly small amount of code But it's not bad to have it separate. > - there's potential for confusion, because there's also > include/commands/progress.h That could be merged, perhaps. I think I only created that because I didn't want to jam too much stuff into pgstat.h. But if it has its own header then jamming some more stuff in there seems more OK. > - I'm inclined to leave pgstat_report_{activity, tmpfile, appname, > timestamp, ..} alone naming-wise, but to rename pgstat_bestart() to > something like pgbestat_start()? I'd probably rename them e.g. command_progress_start(), command_progress_update_param(), etc. > - I've not gone through all the files that could now remove pgstat.h, > replacing it with wait_event.h - I'm thinking it might be worth > waiting till just after code freeze with that (there'll new additions, > and it's likely to cause conflicts)? Don't care. > - backend_status.h needs miscadmin.h, due to BackendType. 
Imo that's a > header we should try to avoid exposing in headers if possible. But > right now there's no good place to move BackendType to. So I'd let > this slide for now. Why the concern? miscadmin.h is extremely widely-included already. Maybe it should be broken up into pieces so that we're not including so MUCH stuff in a zillion places, but the header that contains the definition of CHECK_FOR_INTERRUPTS() is always going to be needed in a ton of spots. Honestly, I wonder why we don't just put that part in postgres.h. If you're writing any significant amount of code and you don't have at least one CHECK_FOR_INTERRUPTS() in there, you're probably doing it wrong. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
Hi, On 2021-03-16 10:27:12 -0400, Robert Haas wrote: > > I'm don't like that after 0002 ExecEvalRow(), ExecEvalFieldStoreForm() > > contain a nearly identical copy of the same code. And > > make_tuple_from_row() also is similar. It seem that there should be a > > heap_form_tuple() version doing this for us? > > I was worried about having either a performance impact or code > duplication. The actual plan where you could insert this organically > is in fill_val(), which is called from heap_fill_tuple(), which is > called from heap_form_tuple(). Oh, I guess it would make sense to do it that way. However, I was just thinking of doing the iteration over the tuples that ExecEvalRow() etc do inside heap_form_flattened_tuple() (or whatever). That'd not be any worse than what the patch is doing now, just less duplication, and an easier path towards optimizing it if we notice that we need to? > If you don't mind passing down 'int flags' or similar to all those, > and having additional branches to make the behavior dependent on the > flags, I'm cool with it. Or if you think we should template-ize all > those functions, that'd be another way to go. But I was afraid I would > get complaints about adding overhead to hot code paths. An option for fill_val() itself would probably be fine. It's already an inline, and if it doesn't get inlined, we could force the compilers hand with pg_attribute_always_inline. The harder part would probably be to find a way to deal with the layers above, without undue code duplication. I think it's not just fill_val() that'd need to know, but also heap_compute_data_size(), heap_fill_tuple() - both of which are externally visible (and iirc thus not going to get inlined with many compiler options, due to symbol interposition dangers). But we could have a heap_compute_data_size_internal(bool flatten) that's called by heap_compute_data_size(). And something similar for heap_form_tuple(). 
But that's complicated, so I'd just go with the iteration in a heap_form_tuple() wrapper for now. > > > I'm open to being convinced that we don't need to do either of these > > > things, and that the cost of iterating over all varlenas in the tuple > > > is not so bad as to preclude doing things as you have them here. But, > > > I'm afraid it's going to be too expensive. > > > > I mean, I would just define several of those places away by not caring > > about tuples in a different compression format ending up in a > > table... > > That behavior feels unacceptable to me from a user expectations point > of view. I think there's an argument that if I update a tuple that > contains a compressed datum, and I don't update that particular > column, I think it would be OK to not recompress the column. But, if I > insert data into a table, I as a user would expect that the > compression settings for that column are going to be respected. IDK. The user might also expect that INSERT .. SELECT is fast, instead of doing expensive decompression + compression (with pglz the former can be really slow). I think there's a good argument for having an explicit "recompress" operation, but I'm not convinced that doing things implicitly is good, especially if it causes complications in quite a few places. Greetings, Andres Freund
Re: pg_amcheck contrib application
Mark Dilger writes: > CopyStatistics seems to just copy Form_pg_statistic without regard for > who owns the toast. Is this safe? No less so than a ton of other places that insert values that might've come from on-disk storage. heap_toast_insert_or_update() is responsible for dealing with the problem. These days it looks like it's actually toast_tuple_init() that takes care of dereferencing previously-toasted values during an INSERT. regards, tom lane
Re: "has_column_privilege()" issue with attnums and non-existent columns
On 3/16/21 1:42 AM, Chengxi Sun wrote:

    The following review has been posted through the commitfest application:
    make installcheck-world:  tested, passed
    Implements feature:       tested, passed
    Spec compliant:           not tested
    Documentation:            not tested

    I tested the patch and it works well.

Was that my patch (i.e. the latest on this thread) or Ian's? It is not clear since you did not CC me...

    And according to the comment, checking existence of relation and
    pg_class_aclcheck() won't be influenced by concurrent DROP. So I think
    it's safe to just reorder the checking existence of column and
    pg_attribute_aclcheck().

"Code never lies, comments sometimes do" That said, I did at least a basic test of that assumption and it seems to be true. Ian, or anyone else, any comments/complaints on my changes? If not I will commit and push that version sooner rather than later. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Re: pg_amcheck contrib application
Robert Haas writes: > On Tue, Mar 16, 2021 at 1:48 PM Tom Lane wrote: >> No. What should be happening there is that some arrays in the column >> get larger alignment than they actually need, but that shouldn't cause >> a problem (and has not done so, AFAIK, in the decades that it's been >> like this). As you say, deforming the tuple is going to rely on the >> table's tupdesc for alignment; it can't know what is in the data. > OK, I don't understand this. attalign is 'd', which is already the > maximum possible, and even if it weren't, individual rows can't decide > to use a larger OR smaller alignment than expected without breaking > stuff. > In what context is it OK to just add extra alignment padding? It's *not* extra, according to pg_statistic's tuple descriptor. Both forming and deforming of pg_statistic tuples should honor that and locate stavaluesX values on d-aligned boundaries. It could be that a particular entry is of an array type that only requires i-alignment. But that doesn't break anything, it just means we inserted more padding than an omniscient implementation would do. (I suppose in some parallel universe there could be a machine where i-alignment is stricter than d-alignment, and then we'd have trouble.) regards, tom lane
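The point is mechanical: rounding an offset up to a stricter power-of-two boundary than a value needs is always safe, because any d-aligned (8-byte) offset is automatically i-aligned (4-byte), and both forming and deforming consult the same tuple descriptor. A minimal model of the TYPEALIGN-style arithmetic (the helper name here is mine, not PostgreSQL's macro):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round 'off' up to the next multiple of 'alignval', which must be a
 * power of two -- the same shape as PostgreSQL's TYPEALIGN macro.
 */
static uintptr_t
type_align(uintptr_t alignval, uintptr_t off)
{
    return (off + (alignval - 1)) & ~(alignval - 1);
}
```

Because every multiple of 8 is a multiple of 4, padding an i-aligned array out to a d boundary wastes at most four bytes per value and can never misplace it for a reader that honors the descriptor's attalign.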
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 2:22 PM Tom Lane wrote: > I'm circling back around to the idea that amcheck is trying to > validate TOAST references that are already dead, and it's getting > burnt because something-or-other has already removed the toast > rows, though not the referencing datums. That's legal behavior > once the rows are marked dead. Upthread it was claimed that > amcheck isn't doing that, but this looks like a smoking gun to me. I think this theory has some legs. From check_tuple_header_and_visibilty():

        else if (!(infomask & HEAP_XMAX_COMMITTED))
            return true;        /* HEAPTUPLE_DELETE_IN_PROGRESS or
                                 * HEAPTUPLE_LIVE */
        else
            return false;       /* HEAPTUPLE_RECENTLY_DEAD or
                                 * HEAPTUPLE_DEAD */
    }
    return true;                /* not dead */
}

That first case looks wrong to me. Don't we need to call get_xid_status() here, Mark? As coded, it seems that if the xmin is ok and the xmax is marked committed, we consider the tuple checkable. The comment says it must be HEAPTUPLE_DELETE_IN_PROGRESS or HEAPTUPLE_LIVE, but it seems to me that the actual answer is either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on whether the xmax is all-visible. And in the second case the comment says it's either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, but I think in that case it's either HEAPTUPLE_DELETE_IN_PROGRESS or HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on the XID status. Another thought here is that it's probably not wicked smart to be relying on the hint bits to match the actual status of the tuple in cases of corruption. Maybe we should be warning about tuples that have xmin or xmax flagged as committed or invalid when in fact clog disagrees. That's not a particularly uncommon case, and it's hard to check. -- Robert Haas EDB: http://www.enterprisedb.com
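The decision Robert is describing can be modeled in a few lines: consult the authoritative transaction status (what get_xid_status() would read from clog) rather than the HEAP_XMAX_COMMITTED hint bit alone, and skip TOAST checking for tuples whose deleter committed, since their toast rows may legitimately be gone. This is a hypothetical sketch with invented types, not verify_heapam.c's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for clog's verdict on a transaction ID. */
typedef enum
{
    XID_IN_PROGRESS,
    XID_COMMITTED,
    XID_ABORTED
} XidStatus;

/*
 * Should amcheck follow this tuple's TOAST pointers?  Decide from the
 * authoritative xmax status, never from the hint bits alone.
 */
static bool
tuple_is_checkable(bool xmax_valid, XidStatus xmax_status)
{
    if (!xmax_valid)
        return true;            /* no deleter: tuple is live */
    switch (xmax_status)
    {
        case XID_IN_PROGRESS:
            return true;        /* HEAPTUPLE_DELETE_IN_PROGRESS */
        case XID_ABORTED:
            return true;        /* deletion never happened */
        case XID_COMMITTED:
            return false;       /* RECENTLY_DEAD or DEAD: toast rows
                                 * may already be vacuumed away */
    }
    return false;
}
```

Under this rule the pg_statistic rows in the failing test — deleted by a committed (auto-)analyze but not yet pruned — would simply not have their TOAST references validated, which matches Tom's "already dead" theory.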
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 11:20 AM Avinash Kumar wrote: > I can share any detail that would help here. I would like to know what you see when you run a slightly modified version of the same amcheck query. The same query as before, but with the call to bt_index_parent_check() replaced with a call to bt_index_check(). Can you do that, please? This is what I mean:

SELECT bt_index_check(index => c.oid, heapallindexed => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

The error that you reported was a cross-level invariant violation, from one of the tests that bt_index_parent_check() performs but bt_index_check() does not perform (the former performs checks that are a superset of the latter). It's possible that we'll get a more interesting error message from bt_index_check() here, because it might go on for a bit longer -- it might conceivably reach a corrupt posting list tuple on the leaf level, and report it as such. Of course we don't see any corruption in the index that you had the crash with at all, but it can't hurt to do this as well -- just in case the issue is transient or something. Thanks -- Peter Geoghegan
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 10:48 AM, Tom Lane wrote: > > I'm not entirely sure what's going on, but I think coming at this > with the mindset that "amcheck has detected some corruption" is > just going to lead you astray. Almost certainly, it's "amcheck > is incorrectly claiming corruption". That might be from mis-decoding > a TOAST-referencing datum. (Too bad the message doesn't report the > TOAST OID it probed for, so we can see if that's sane or not.) CopyStatistics seems to just copy Form_pg_statistic without regard for who owns the toast. Is this safe? Looking at RemoveStatistics, I'm not sure that it is. Anybody more familiar with this code care to give an opinion? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 1:48 PM Tom Lane wrote: > No. What should be happening there is that some arrays in the column > get larger alignment than they actually need, but that shouldn't cause > a problem (and has not done so, AFAIK, in the decades that it's been > like this). As you say, deforming the tuple is going to rely on the > table's tupdesc for alignment; it can't know what is in the data. OK, I don't understand this. attalign is 'd', which is already the maximum possible, and even if it weren't, individual rows can't decide to use a larger OR smaller alignment than expected without breaking stuff. In what context is it OK to just add extra alignment padding? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 11:08 AM Tom Lane wrote: > Peter Geoghegan writes: > > ... It's hard to believe that the problem is > > squarely with _bt_swap_posting(). > > IIUC, the problem is seen on a replica server but not the primary? > In that case, my thoughts would run towards a bug in WAL log creation > or replay, causing the index contents to be different/wrong on the > replica. My remarks were intended to include problems during recovery (_bt_swap_posting() is run inside REDO routines). Though I did consider recovery specifically when thinking through the problem. My assessment is that the index is highly unlikely to be corrupt (whether it happened during recovery or at some other time), because it passes validation by bt_index_parent_check(), with the optional heapallindexed index-matches-table verification option enabled. This includes exhaustive verification of posting list tuple invariants. Anything is possible, but I find it easier to believe that the issue is somewhere else -- we see the problem in _bt_swap_posting() because it happens to go further than other code in trusting that the tuple isn't corrupt (which it shouldn't). Another unrelated index *was* reported corrupt by amcheck, though the error in question does not suggest an issue with deduplication. -- Peter Geoghegan
Re: pg_amcheck contrib application
Mark Dilger writes: > On Mar 16, 2021, at 10:48 AM, Tom Lane wrote: >> (Too bad the message doesn't report the >> TOAST OID it probed for, so we can see if that's sane or not.) > I've added that and now get the toast pointer's va_valueid in the message: > heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 2, > attribute 29: > toasted value id 13227 for attribute 29 missing from toast table > heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 5, > attribute 27: > toasted value id 13228 for attribute 27 missing from toast table That's awfully interesting, because OIDs less than 16384 should only be generated during initdb. So what we seem to be looking at here is pg_statistic entries that were made during the ANALYZE done by initdb (cf. vacuum_db()), and then were deleted during a later auto-analyze (in the buildfarm) or deliberate analyze (in your reproducer). But if they're deleted, why is amcheck looking for them? I'm circling back around to the idea that amcheck is trying to validate TOAST references that are already dead, and it's getting burnt because something-or-other has already removed the toast rows, though not the referencing datums. That's legal behavior once the rows are marked dead. Upthread it was claimed that amcheck isn't doing that, but this looks like a smoking gun to me. regards, tom lane
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 3:08 PM Tom Lane wrote: > Peter Geoghegan writes: > > ... It's hard to believe that the problem is > > squarely with _bt_swap_posting(). > > IIUC, the problem is seen on a replica server but not the primary? > In that case, my thoughts would run towards a bug in WAL log creation > or replay, causing the index contents to be different/wrong on the > replica. > Right, observed on the replica server after it got promoted. The replica is on the same Postgres minor version (13.1), but the OS is Ubuntu 16 on the primary and Ubuntu 20 on the replica (that got promoted). The replica was set up using a backup taken with pg_basebackup. I can share any detail that would help here. > regards, tom lane > -- Regards, Avi
Re: Postgres crashes at memcopy() after upgrade to PG 13.
Peter Geoghegan writes: > ... It's hard to believe that the problem is > squarely with _bt_swap_posting(). IIUC, the problem is seen on a replica server but not the primary? In that case, my thoughts would run towards a bug in WAL log creation or replay, causing the index contents to be different/wrong on the replica. regards, tom lane
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 10:48 AM, Tom Lane wrote: > > Robert Haas writes: >> On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger >> wrote: >>> It shows them all has having attalign = 'd', but for some array types the >>> alignment will be 'i', not 'd'. So it's lying a bit about the contents. >>> But I'm now confused why this caused problems on the two hosts where >>> integer and double have the same alignment? It seems like that would be >>> the one place where the bug would not happen, not the one place where it >>> does. > >> Wait, so the value in the attalign column isn't the alignment we're >> actually using? I can understand how we might generate tuples like >> that, if we pass the actual type to construct_array(), but shouldn't >> we then get garbage when we deform the tuple? > > No. What should be happening there is that some arrays in the column > get larger alignment than they actually need, but that shouldn't cause > a problem (and has not done so, AFAIK, in the decades that it's been > like this). As you say, deforming the tuple is going to rely on the > table's tupdesc for alignment; it can't know what is in the data. > > I'm not entirely sure what's going on, but I think coming at this > with the mindset that "amcheck has detected some corruption" is > just going to lead you astray. Almost certainly, it's "amcheck > is incorrectly claiming corruption". That might be from mis-decoding > a TOAST-referencing datum. (Too bad the message doesn't report the > TOAST OID it probed for, so we can see if that's sane or not.) 
I've added that and now get the toast pointer's va_valueid in the message: mark.dilger@laptop280-ma-us amcheck % pg_amcheck -t "pg_catalog.pg_statistic" postgres heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 2, attribute 29: toasted value id 13227 for attribute 29 missing from toast table heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 5, attribute 27: toasted value id 13228 for attribute 27 missing from toast table diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index 5ccae46375..a0be71bb7f 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -,8 +,8 @@ check_tuple_attribute(HeapCheckContext *ctx) } if (!found_toasttup) report_corruption(ctx, - psprintf("toasted value for attribute %u missing from toast table", - ctx->attnum)); + psprintf("toasted value id %u for attribute %u missing from toast table", + toast_pointer.va_valueid, ctx->attnum)); else if (ctx->chunkno != (ctx->endchunk + 1)) report_corruption(ctx, psprintf("final toast chunk number %u differs from expected value %u", — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
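For anyone reproducing this, a reported value id can be cross-checked against pg_statistic's TOAST table directly — a sketch, assuming a stock catalog where pg_statistic has OID 2619 (so its TOAST table is pg_toast.pg_toast_2619):

```sql
-- A value id that amcheck reports as missing should return zero rows here;
-- if rows come back, the TOAST data exists and amcheck's probe is suspect.
SELECT chunk_id, count(*) AS chunks
FROM pg_toast.pg_toast_2619
WHERE chunk_id IN (13227, 13228)
GROUP BY chunk_id;
```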
Re: [Proposal] Global temporary tables
Hi this patch is broken now. Please, can you check it? Regards Pavel On Wed, Nov 25, 2020 at 14:08, 曾文旌 wrote: > > > On Nov 25, 2020, at 14:19, Pavel Stehule wrote: > > > > On Mon, Nov 23, 2020 at 10:27, 曾文旌 > wrote: > >> >> >> On Nov 21, 2020, at 02:28, Pavel Stehule wrote: >> >> Hi >> >> On Fri, Sep 11, 2020 at 17:00, 曾文旌 >> wrote: >> >>> I have written the README for the GTT, which contains the GTT >>> requirements and design. >>> I found that compared to my first email a year ago, many GTT Limitations >>> are now gone. >>> Now, I'm adding comments to some of the necessary functions. >>> >> >> There are problems with patching. Please, can you rebase your patch? >> >> Sure. >> I'm still working on sort code and comments. >> If you have any suggestions, please let me know. >> > > It is broken again > > There is bad white space > > + /* > +* For global temp table only > +* use ShareUpdateExclusiveLock for ensure safety > +*/ > + { > + { > + "on_commit_delete_rows", > + "global temp table on commit options", > + RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED, > + ShareUpdateExclusiveLock > + }, > + true > + }, <= > /* list terminator */ > {{NULL}} > > +7 OTHERS > +Parallel query > +Planner does not produce parallel query plans for SQL related to GTT. > Because <= > +GTT private data cannot be accessed across processes. > diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile > > > +/* > + * Update global temp table relstats(relpage/reltuple/relallvisible) > < > + * to local hashtable > + */ > +void > > +/* > + * Search global temp table relstats(relpage/reltuple/relallvisible) > <== > + * from lo > > and there are a lot more places ... 
> > I found other issue > > postgres=# create global temp table foo(a int); > CREATE TABLE > postgres=# create index on foo(a); > CREATE INDEX > > close session and in new session > > postgres=# reindex index foo_a_idx ; > WARNING: relcache reference leak: relation "foo" not closed > REINDEX > > > I fixed all the above issues and rebased the code. > Please review the new version of the code again. > > > Wenjing > > > > > Regards > > Pavel > > > >> >> Wenjing >> >> >> >> Regards >> >> Pavel >> >> >>> >>> Wenjing >>> >>> >>> >>> >>> >>> > On Jul 31, 2020, at 4:57 AM, Robert Haas wrote: >>> > >>> > On Thu, Jul 30, 2020 at 8:09 AM wenjing zeng >>> wrote: >>> >> Please continue to review the code. >>> > >>> > This patch is pretty light on comments. Many of the new functions have >>> > no header comments, for example. There are comments here and there in >>> > the body of the new functions that are added, and in places where >>> > existing code is changed there are comments here and there, but >>> > overall it's not a whole lot. There's no documentation and no README, >>> > either. Since this adds a new feature and a bunch of new SQL-callable >>> > functions that interact with that feature, the feature itself should >>> > be documented, along with its limitations and the new SQL-callable >>> > functions that interact with it. I think there should be either a >>> > lengthy comment in some suitable file, or maybe various comments in >>> > various files, or else a README file, that clearly sets out the major >>> > design principles behind the patch, and explaining also what that >>> > means in terms of features and limitations. Without that, it's really >>> > hard for anyone to jump into reviewing this code, and it will be hard >>> > for people who have to maintain it in the future to understand it, >>> > either. Or for users, for that matter. >>> > >>> > -- >>> > Robert Haas >>> > EnterpriseDB: http://www.enterprisedb.com >>> > The Enterprise PostgreSQL Company >>> >> >
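For reference, the usage pattern the patch aims to support looks like this (syntax taken from the patch under review — GLOBAL TEMP tables are not in core PostgreSQL):

```sql
-- The table definition is shared across sessions, but each session sees
-- only its own rows; ON COMMIT PRESERVE ROWS is the standard-SQL default.
CREATE GLOBAL TEMP TABLE foo (a int) ON COMMIT PRESERVE ROWS;
CREATE INDEX ON foo (a);
-- In a second session, "foo" and foo_a_idx already exist, but the table
-- appears empty until that session inserts its own data.
```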
Re: pg_amcheck contrib application
... btw, I now see that tern and hornet are passing this test at least as much as they're failing, proving that there's some timing or random chance involved. That doesn't completely eliminate the idea that there may be an architecture component to the issue, but it sure reduces its credibility. I now believe the theory that the triggering condition is an auto-analyze happening at the right time, and populating pg_statistic with some data that other runs don't see. regards, tom lane
Re: pg_amcheck contrib application
Robert Haas writes: > On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger > wrote: >> It shows them all has having attalign = 'd', but for some array types the >> alignment will be 'i', not 'd'. So it's lying a bit about the contents. >> But I'm now confused why this caused problems on the two hosts where integer >> and double have the same alignment? It seems like that would be the one >> place where the bug would not happen, not the one place where it does. > Wait, so the value in the attalign column isn't the alignment we're > actually using? I can understand how we might generate tuples like > that, if we pass the actual type to construct_array(), but shouldn't > we then get garbage when we deform the tuple? No. What should be happening there is that some arrays in the column get larger alignment than they actually need, but that shouldn't cause a problem (and has not done so, AFAIK, in the decades that it's been like this). As you say, deforming the tuple is going to rely on the table's tupdesc for alignment; it can't know what is in the data. I'm not entirely sure what's going on, but I think coming at this with the mindset that "amcheck has detected some corruption" is just going to lead you astray. Almost certainly, it's "amcheck is incorrectly claiming corruption". That might be from mis-decoding a TOAST-referencing datum. (Too bad the message doesn't report the TOAST OID it probed for, so we can see if that's sane or not.) regards, tom lane
Re: Calendar support in localization
Hi Thomas On Mon, Mar 15, 2021 at 2:58 PM Thomas Munro wrote: > > One key question here is whether you need a different date type or > just different operations (functions, operators etc) on the existing > types. > > I am thinking of having a converter to a specific calendar after each operation and function, for display or storage. It works for the Ethiopic calendar and I expect it will work for other calendars too > > I cc Thomas Munro and Vik because they have an interest in this area > > Last time it came up[1], I got as far as wondering if the best way > would be to write a set of ICU-based calendar functions. Would it be > enough for your needs to have Ethiopic calendar-aware date arithmetic > (add, subtract a month etc), date part extraction (get the current > Ethiopic day/month/year of a date), display and parsing, and have all > of these as functions that you have to call explicitly, but have them > take the standard built-in date and timestamp types, so that your > tables would store regular date and timestamp values? If not, what > else do you need? > > The Ethiopic calendar has 13 months, so it cannot be stored as the date and timestamp types, and your approach seems more complicated. I suggest having this feature for the sake of PostgreSQL's popularity too, not only for my own need > ICU is very well maintained and widely used software, and PostgreSQL > already depends on it optionally, and that's enabled in all common > distributions. In other words, maybe all the logic you want exists > already in your process's memory, we just have to figure out how to > reach it from SQL. Another reason to use ICU is that we can solve > this problem once and then it'll work for many other calendars. 
> > Date arithmetic is different for each calendar, so solving one calendar's problem doesn't help with another > > Please don't suggest forking PostgreSQL just for this feature > > I would start with an extension, and I'd try to do a small set of > simple functions, to let me write things like: > > icu_format(now(), 'fr_FR@calendar=buddhist') to get a Buddhist > calendar with French words > > icu_date_part('year', current_date, 'am_ET@calendar=traditional') to > get the current year in the Ethiopic calendar (2013 apparently) > > Well, the first one probably also needs a format string too, actual > details to be worked out by reading the ICU manual... > I think you are suggesting this because you expect the implementation to be difficult, but it is not that difficult once you fully understand the Gregorian calendar and the calendar you are working on > > Maybe instead of making a new extension, I might try to start from > https://github.com/dverite/icu_ext and see if it makes sense to extend > it to cover calendars. > > Maybe one day ICU will become a hard dependency of PostgreSQL and > someone will propose all that stuff into core, and then maybe we could > start to think about the possibility of tighter integration with the > built-in date/time functions (and LC_TIME setting? seems complicated, > see also problems with setting LC_COLLATE/datcollate to an ICU > collation name, but I digress and that's a far off problem). I would > also study the SQL standard and maybe DB2 (highly subjective comment: > at a wild guess, the most likely commercial RDBMS to have done a good > job of this if anyone has) to see if they contemplate non-Gregorian > calendars, to get some feel for whether that would eventually be a > possibility to conform with whatever the standard says. 
> > In summary, getting something of very high quality by using a widely > used open source library that we already use seems like a better plan > than trying to write and maintain our own specialist knowledge about > individual calendars. If there's something you need that can't be > done with its APIs working on top of our regular date and timestamp > types, could you elaborate? > > [1] > https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BybW0LJuLtj3yAUsbOw3DrzK00pGk8JyfpCREzi_LSsg%40mail.gmail.com#393d827f1be589d0ad6ca6b016905e80 I don't know how you see this, but to me the feature deserves a specialist, and it is not that difficult to have one, because I guess every major calendar has English documentation regards Surafel
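To make the extension-based proposal upthread concrete, the API Thomas sketches would be used roughly like this (icu_format and icu_date_part are hypothetical names from his sketch, not existing functions):

```sql
-- Hypothetical ICU-backed functions operating on the built-in types:
SELECT icu_date_part('year', current_date, 'am_ET@calendar=traditional');
-- would return the current Ethiopic year (2013 at the time of the thread)

SELECT icu_format(now(), 'fr_FR@calendar=buddhist');
-- would render "now" in the Buddhist calendar with French month names
```

The key design point is that tables keep storing regular date/timestamp values; only presentation and arithmetic go through the calendar-aware functions.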
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 9:50 AM Avinash Kumar wrote: > Yes, it was on the failover-over server where the issue is currently seen. > Took a snapshot of the data directory so that the issue can be analyzed. I would be very cautious when using LVM snapshots with a Postgres data directory, or VM-based snapshotting tools. There are many things that can go wrong with these tools, which are usually not sensitive to the very specific requirements of a database system like Postgres (e.g. inconsistencies between WAL and data files can emerge in many scenarios). My general recommendation is to avoid these tools completely -- consistently use a backup solution like pgBackrest instead. BTW, running pg_repack is something that creates additional risk of database corruption, at least to some degree. That seems less likely to have been the problem here (I think that it's probably something with snapshots). Something to consider. > I can do this. But, to add here, when we do a pg_repack or rebuild of > Indexes, automatically this is resolved. Your bug report was useful to me, because it made me realize that the posting list split code in _bt_swap_posting() is unnecessarily trusting of the on-disk data -- especially compared to _bt_split(), the page split code. While I consider it unlikely that the problem that you see is truly a bug in Postgres, it is still true that the crash that you saw should probably have just been an error. We don't promise that the database cannot crash even with corrupt data, but we do try to avoid it whenever possible. I may be able to harden _bt_swap_posting(), to make failures like this a little more friendly. It's an infrequently hit code path, so we can easily afford to make the code more careful/less trusting. -- Peter Geoghegan
Re: pg_amcheck contrib application
On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger wrote: > Yeah, that looks related: > > regression=# select attname, attlen, attnum, attalign from pg_attribute where > attrelid = 2619 and attname like 'stavalue%';
>   attname   | attlen | attnum | attalign
> ------------+--------+--------+----------
>  stavalues1 |     -1 |     27 | d
>  stavalues2 |     -1 |     28 | d
>  stavalues3 |     -1 |     29 | d
>  stavalues4 |     -1 |     30 | d
>  stavalues5 |     -1 |     31 | d
> (5 rows)
> > It shows them all as having attalign = 'd', but for some array types the > alignment will be 'i', not 'd'. So it's lying a bit about the contents. But > I'm now confused why this caused problems on the two hosts where integer and > double have the same alignment? It seems like that would be the one place > where the bug would not happen, not the one place where it does. Wait, so the value in the attalign column isn't the alignment we're actually using? I can understand how we might generate tuples like that, if we pass the actual type to construct_array(), but shouldn't we then get garbage when we deform the tuple? I mean, heap_deform_tuple() is going to get the alignment from the tuple descriptor, which is a table property. It doesn't (and can't) know what type of value is stored inside any of these fields for real, right? In other words, isn't this actually corruption, caused by a bug in our code, and how have we not noticed it before now? -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 9:30 AM, Mark Dilger wrote: > > > >> On Mar 16, 2021, at 9:07 AM, Tom Lane wrote: >> >> Mark Dilger writes: >>> I think autovacuum simply triggers the bug, and is not the cause of the >>> bug. If I turn autovacuum off and instead do an ANALYZE in each test >>> database rather than performing the corruptions, I get reports about >>> problems in pg_statistic. This is on my Mac laptop. This rules out the >>> theory that autovacuum is propagating corruptions into pg_statistic, and >>> also the theory that it is architecture dependent. >> >> I wonder whether amcheck is confused by the declaration of those columns >> as "anyarray". > > It uses attlen and attalign for the attribute, so that idea does make sense. > It gets that via TupleDescAttr(RelationGetDescr(rel), attnum). Yeah, that looks related:
regression=# select attname, attlen, attnum, attalign from pg_attribute where attrelid = 2619 and attname like 'stavalue%';
  attname   | attlen | attnum | attalign
------------+--------+--------+----------
 stavalues1 |     -1 |     27 | d
 stavalues2 |     -1 |     28 | d
 stavalues3 |     -1 |     29 | d
 stavalues4 |     -1 |     30 | d
 stavalues5 |     -1 |     31 | d
(5 rows)
It shows them all as having attalign = 'd', but for some array types the alignment will be 'i', not 'd'. So it's lying a bit about the contents. But I'm now confused why this caused problems on the two hosts where integer and double have the same alignment? It seems like that would be the one place where the bug would not happen, not the one place where it does. I'm attaching a test that reliably reproduces this for me: v8-0005-Adding-pg_amcheck-special-test-for-pg_statistic.patch Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
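The declared-versus-actual alignment mismatch under discussion can be seen by comparing pg_attribute's attalign for the anyarray columns with the typalign of the concrete array types that can actually land in them:

```sql
-- pg_statistic declares the stavaluesN columns as anyarray with attalign 'd',
-- but the concrete array type stored in a given row may only need 'i':
SELECT typname, typalign
FROM pg_type
WHERE typname IN ('_int4', '_float8', '_text');
-- _float8 has typalign 'd'; _int4 and _text have typalign 'i'
```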
Re: Postgres crashes at memcopy() after upgrade to PG 13.
Hi, On Tue, Mar 16, 2021 at 1:44 PM Peter Geoghegan wrote: > On Tue, Mar 16, 2021 at 5:01 AM Avinash Kumar > wrote: > > I am afraid that it looks to me like a deduplication bug but not sure > how this can be pin-pointed. If there is something I could do to determine > that, I would be more than happy. > > That cannot be ruled out, but I don't consider it to be the most > likely explanation. The index in question passes amcheck verification, > which includes verification of the posting list tuple structure, and > even includes making sure the index has an entry for each row from the > table. It's highly unlikely that it is corrupt, and it's hard to see > how you get from a non-corrupt index to the segfault. At the same time > we see that some other index is corrupt -- it fails amcheck due to a > cross-level inconsistency, which is very unlikely to be related to > deduplication in any way. It's hard to believe that the problem is > squarely with _bt_swap_posting(). > > Did you actually run amcheck on the failed-over server, not the original > server? > Yes, it was on the failed-over server where the issue is currently seen. I took a snapshot of the data directory so that the issue can be analyzed. > > Note that you can disable deduplication selectively -- perhaps doing > so will make it possible to isolate the issue. Something like this > should do it (you need to reindex here to actually change the on-disk > representation to not have any posting list tuples from > deduplication): > > alter index idx_id_mtime set (deduplicate_items = off); > reindex index idx_id_mtime; > I can do this. But, to add here, when we do a pg_repack or rebuild of indexes, this is automatically resolved. But I am not sure if we will get the same issue again. > > -- > Peter Geoghegan > -- Regards, Avi.
Re: [PATCH] pgbench: improve \sleep meta command
On 2021/03/09 0:54, Fujii Masao wrote: On 2021/03/08 23:10, Alvaro Herrera wrote: On 2021-Mar-08, kuroda.hay...@fujitsu.com wrote: Dear Fujii-san, Miyake-san Isn't it better to accept even negative sleep time like currently pgbench does? Otherwise we always need to check the variable is a positive integer (for example, using \if command) when using it as the sleep time in \sleep command. That seems inconvenient. Both of them are acceptable for me. But we should write down how it works when the negative value is input if we adopt. Not sleeping at all seems a good reaction (same as for zero, I guess.) +1. BTW, IIUC currently \sleep works in that way. Attached is the updated version of the patch. Currently a variable whose value is a negative number is allowed to be specified as a sleep time as follows. In this case \sleep command doesn't sleep. The patch doesn't change this behavior at all. \set hoge -1 \sleep :hoge s Currently a variable whose value is a double is allowed to be specified as a sleep time as follows. In this case the integer value that atoi() converts the double number into is used as a sleep time. The patch also doesn't change this behavior at all because I've not found any strong reason to ban that usage. 
\set hoge 10 / 4.0 \sleep :hoge s Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index e69d43b26b..a6d91d1089 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -2953,7 +2953,24 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs) pg_log_error("%s: undefined variable \"%s\"", argv[0], argv[1] + 1); return false; } + usec = atoi(var); + if (usec == 0) + { + char *c = var; + + /* Skip sign */ + if (*c == '+' || *c == '-') + c++; + + /* Raise an error if the value of a variable is not a number */ + if (!isdigit((unsigned char) *c)) + { + pg_log_error("%s: invalid sleep time \"%s\" for variable \"%s\"", +argv[0], var, argv[1] + 1); + return false; + } + } } else usec = atoi(argv[1]); @@ -4788,17 +4805,41 @@ process_backslash_command(PsqlScanState sstate, const char *source) * will be parsed with atoi, which ignores trailing non-digit * characters. */ - if (my_command->argc == 2 && my_command->argv[1][0] != ':') + if (my_command->argv[1][0] != ':') { char *c = my_command->argv[1]; + boolhave_digit = false; - while (isdigit((unsigned char) *c)) + /* Skip sign */ + if (*c == '+' || *c == '-') c++; + + /* Require at least one digit */ + if (*c && isdigit((unsigned char) *c)) + have_digit = true; + + /* Eat all digits */ + while (*c && isdigit((unsigned char) *c)) + c++; + if (*c) { - my_command->argv[2] = c; - offsets[2] = offsets[1] + (c - my_command->argv[1]); - my_command->argc = 3; + if (my_command->argc == 2 && have_digit) + { + my_command->argv[2] = c; + offsets[2] = offsets[1] + (c - my_command->argv[1]); + my_command->argc = 3; + } + else + { + /* +* Raise an error if argument starts with non-digit +* character (after sign). 
+*/ + syntax_error(source, lineno, my_command->first_line, my_command->argv[0], +"invalid sleep time, must be an integer", + my_command->argv[1], offsets[1] - start_offset); + } } }
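The validation the patch adds boils down to: skip an optional sign, require at least one digit, then treat anything after the digits as a unit suffix (or reject it, depending on context). A standalone sketch of that logic — illustrative only, not the actual pgbench code, with a hypothetical function name:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Sketch of the \sleep argument check: accept an optional leading sign,
 * then at least one digit; the remainder (if any) is left to the caller
 * as a unit suffix such as "s", "ms", or "us".  Returns false when the
 * argument has no digits at all, which is the case the patch rejects.
 */
bool
parse_sleep_arg(const char *arg, int *value, const char **unit)
{
    const char *c = arg;
    bool        have_digit = false;

    if (*c == '+' || *c == '-')     /* skip sign */
        c++;

    while (isdigit((unsigned char) *c))     /* eat all digits */
    {
        have_digit = true;
        c++;
    }

    if (!have_digit)                /* e.g. "abc" or a bare "-": reject */
        return false;

    *value = atoi(arg);             /* same conversion pgbench itself uses */
    *unit = c;                      /* trailing suffix, possibly empty */
    return true;
}
```

Note that with this shape a negative value such as "-1" still parses successfully, matching the thread's agreement that negative sleep times are accepted and simply don't sleep.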
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Tue, Mar 16, 2021 at 5:01 AM Avinash Kumar wrote: > I am afraid that it looks to me like a deduplication bug but not sure how > this can be pin-pointed. If there is something I could do to determine that, > I would be more than happy. That cannot be ruled out, but I don't consider it to be the most likely explanation. The index in question passes amcheck verification, which includes verification of the posting list tuple structure, and even includes making sure the index has an entry for each row from the table. It's highly unlikely that it is corrupt, and it's hard to see how you get from a non-corrupt index to the segfault. At the same time we see that some other index is corrupt -- it fails amcheck due to a cross-level inconsistency, which is very unlikely to be related to deduplication in any way. It's hard to believe that the problem is squarely with _bt_swap_posting(). Did you actually run amcheck on the failed-over server, not the original server? Note that you can disable deduplication selectively -- perhaps doing so will make it possible to isolate the issue. Something like this should do it (you need to reindex here to actually change the on-disk representation to not have any posting list tuples from deduplication): alter index idx_id_mtime set (deduplicate_items = off); reindex index idx_id_mtime; -- Peter Geoghegan
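For completeness, the amcheck verification referred to above (index name taken from the thread) can be run like this; bt_index_parent_check checks cross-level consistency, and with heapallindexed = true it also verifies that every heap row has an index entry:

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;
-- Run on the promoted replica, against the index suspected of corruption:
SELECT bt_index_parent_check('idx_id_mtime', true);
```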
Re: pg_amcheck contrib application
> On Mar 16, 2021, at 9:07 AM, Tom Lane wrote: > > Mark Dilger writes: >> I think autovacuum simply triggers the bug, and is not the cause of the bug. >> If I turn autovacuum off and instead do an ANALYZE in each test database >> rather than performing the corruptions, I get reports about problems in >> pg_statistic. This is on my mac laptop. This rules out the theory that >> autovacuum is propogating corruptions into pg_statistic, and also the theory >> that it is architecture dependent. > > I wonder whether amcheck is confused by the declaration of those columns > as "anyarray". It uses attlen and attalign for the attribute, so that idea does make sense. It gets that via TupleDescAttr(RelationGetDescr(rel), attnum). — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: A new function to wait for the backend exit after termination
On Tue, Mar 16, 2021 at 10:38 AM Bharath Rupireddy wrote: > > On Mon, Mar 15, 2021 at 10:38 AM Fujii Masao > wrote: > > On 2021/03/15 12:27, Bharath Rupireddy wrote: > > > On Sun, Mar 7, 2021 at 2:39 PM Bharath Rupireddy > > > wrote: > > >> Attaching v7 patch for further review. > > > > > > Attaching v8 patch after rebasing on to the latest master. > > > > Thanks for rebasing the patch! > > Thanks for reviewing. > > > - WAIT_EVENT_XACT_GROUP_UPDATE > > + WAIT_EVENT_XACT_GROUP_UPDATE, > > + WAIT_EVENT_BACKEND_TERMINATION > > > > These should be listed in alphabetical order. > > Done. > > > In pg_wait_until_termination's do-while loop, ResetLatch() should be > > called. Otherwise, it would enter busy-loop after any signal arrives. > > Because the latch is kept set and WaitLatch() always exits immediately in > > that case. > > Done. > > > + /* > > +* Wait in steps of waittime milliseconds until this function exits > > or > > +* timeout. > > +*/ > > + int64 waittime = 10; > > > > 10 ms per cycle seems too frequent? > > Increased it to 100msec. > > > + ereport(WARNING, > > + (errmsg("timeout cannot be negative > > or zero: %lld", > > + (long long int) > > timeout))); > > + > > + result = false; > > > > IMO the parameter should be verified before doing the actual thing. > > Done. > > > Why is WARNING thrown in this case? Isn't it better to throw ERROR like > > pg_promote() does? > > Done. > > Attaching v9 patch for further review. Almost there :) Does it really make sense that pg_wait_for_backend_termination() defaults to waiting *100 milliseconds*, and then logs a warning? That seems extremely short if I'm explicitly asking it to wait. I'd argue that 100ms is too short for pg_terminate_backend() as well, but I think it's a bit more reasonable there. 
Wait events should be in alphabetical order in pgstat_get_wait_ipc() as well, not just in the header (which was adjusted per Fujii's comment) + (errmsg("could not wait for the termination of the backend with PID %d within %lld milliseconds", That's not true though? The wait succeeded, it just timed out? Isn't it more like "backend with PID %d did not terminate within %lld milliseconds"? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
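For context, the calls the patch introduces would look like this (signatures as in the v9 patch under review, which may change before commit; the PID is illustrative):

```sql
-- Terminate a backend and wait up to 10 seconds for it to actually exit:
SELECT pg_terminate_backend(12345, 10000);

-- Or wait for a backend that is already exiting, with the same timeout:
SELECT pg_wait_for_backend_termination(12345, 10000);
```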
Re: pg_amcheck contrib application
Mark Dilger writes: > I think autovacuum simply triggers the bug, and is not the cause of the bug. > If I turn autovacuum off and instead do an ANALYZE in each test database > rather than performing the corruptions, I get reports about problems in > pg_statistic. This is on my mac laptop. This rules out the theory that > autovacuum is propogating corruptions into pg_statistic, and also the theory > that it is architecture dependent. I wonder whether amcheck is confused by the declaration of those columns as "anyarray". regards, tom lane
Re: pg_amcheck contrib application
> On Mar 15, 2021, at 11:09 PM, Noah Misch wrote: > >> Not sure that I believe the theory that this is from bad luck of >> concurrent autovacuum timing, though. > > With autovacuum_naptime=1s, on hornet, the failure reproduced twice in twelve > runs. With v6-0001-Turning-off-autovacuum-during-corruption-tests.patch > applied, 196 runs all succeeded. > >> The fact that we're seeing >> this on just those two animals suggests strongly to me that it's >> architecture-correlated, instead. > > That is possible. I think autovacuum simply triggers the bug, and is not the cause of the bug. If I turn autovacuum off and instead do an ANALYZE in each test database rather than performing the corruptions, I get reports about problems in pg_statistic. This is on my Mac laptop. This rules out the theory that autovacuum is propagating corruptions into pg_statistic, and also the theory that it is architecture dependent. I'll investigate further. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 16, 2021 at 10:39 PM Masahiko Sawada wrote: > > On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan wrote: > > > > One consequence of my approach is that we now call > > lazy_cleanup_all_indexes(), even when we've skipped index vacuuming > > itself. We should at least "check-in" with the indexes IMV. To an > > index AM, this will be indistinguishable from a VACUUM that never had > > tuples for it to delete, and so never called ambulkdelete() before > > calling amvacuumcleanup(). This seems logical to me: why should there > > be any significant behavioral divergence between the case where there > > are 0 tuples to delete and the case where there is 1 tuple to delete? > > The extra work that we perform in amvacuumcleanup() (if any) should > > almost always be a no-op in nbtree following my recent refactoring > > work. More generally, if an index AM is doing too much during cleanup, > > and this becomes a bottleneck, then IMV that's a problem that needs to > > be fixed in the index AM. > > Aside from whether it's good or bad, strictly speaking, it could > change the index AM API contract. The documentation of > amvacuumcleanup() says: > > --- > stats is whatever the last ambulkdelete call returned, or NULL if > ambulkdelete was not called because no tuples needed to be deleted. > --- > > With this change, we could pass NULL to amvacuumcleanup even though > the index might have tuples needed to be deleted. It seems there is no problem with that change at least in built-in index AMs. So +1 for this change. We would need to slightly update the doc accordingly. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
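A toy model of the contract change being discussed (illustrative C only, not PostgreSQL source — the struct here is deliberately simplified): with the proposed behavior, amvacuumcleanup can receive a NULL stats pointer even when the table did have tuples to delete, so an index AM must allocate its own result in that case rather than assuming ambulkdelete already did.

```c
#include <stdlib.h>

/* Simplified stand-in for PostgreSQL's IndexBulkDeleteResult. */
typedef struct ToyBulkDeleteResult
{
    double num_pages;
    double tuples_removed;
} ToyBulkDeleteResult;

/*
 * Toy amvacuumcleanup: if ambulkdelete was skipped (stats == NULL), allocate
 * a zeroed result instead of failing; otherwise pass the caller's stats
 * through.  This is the tolerance the doc change would need to describe.
 */
ToyBulkDeleteResult *
toy_amvacuumcleanup(ToyBulkDeleteResult *stats)
{
    if (stats == NULL)          /* ambulkdelete never ran this cycle */
        stats = calloc(1, sizeof(ToyBulkDeleteResult));

    /* ... per-AM cleanup work would happen here ... */
    return stats;
}
```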
Re: pg_stat_statements oddity with track = all
On Tue, Mar 16, 2021 at 12:55:45PM +0100, Magnus Hagander wrote: > On Tue, Mar 9, 2021 at 3:39 AM Julien Rouhaud wrote: > > > > I think that we might be able to handle that without a flag. The only thing > > that would need to be done is when creating an entry, look for an existing > > entry with the opposite flag, and if there's simply use the same > > (query_offset, query_len) info. This doesn't sound that expensive. > > That's basically what I was trying to say :) Oh ok sorry :) > > The real pain point will be that the garbage collection phase > > will become way more expensive as it will now have to somehow maintain that > > knowledge, which will require additional lookups for each entry. I'm a bit > > concerned about that, especially with the current heuristic to schedule > > garbage > > collection. For now, need_qc_qtext says that we have to do it if the > > extent is > > more than 512 (B) * pgss_max. This probably doesn't work well for people > > using > > ORM as they tend to generate gigantic SQL queries. > > Right, the cost would be mostly on the GC side. I've never done any > profiling to see how big of a thing that is in systems today -- have > you? I didn't, but I don't see how it could be anything but hugely impactful. It basically prevents any query from being planned or executed on the whole instance for the time needed to read the previous qtext file and write back all entries still needed. > > I don't that think that anyone really had a strong argument, mostly gut > > feeling. Note that pg_stat_kcache already implemented that toplevel flags, > > so > > if people are using that extension in a recent version they might have some > > figures to show. I'll ping some people that I know are using it. > > Great -- data always wins over gut feelings :) So I asked some friends who have the latest pg_stat_kcache installed on a preproduction environment configured to track nested queries. 
There isn't a high throughput but the activity should still be representative of the production queries. There are a lot of applications plugged there, around 20 databases and quite a lot of PL code. After a few days, here are the statistics:

- total of ~ 9500 entries
- ~ 900 entries for nested statements
- ~ 35 entries existing for both top level and nested statements

So the duplicates account for less than 4% of the nested statements, and less than 0.5% of the whole entries. I wish I had more reports, but if this one is representative enough then it seems that trying to avoid storing duplicated queries wouldn't be worth it.

> > One good argument would be that gigantic queries generated by ORMs should
> > always be executed as top level statements.
>
> Yes, that's true. And it probably holds as a more generic case as
> well, that is the queries that are likely to show up both top-level
> and lower-level are more likely to be relatively simple ones. (Except
> for example during the development of functions/procs where they're
> often executed top level as well to test etc, but that's not the most
> important case to optimize for)

Agreed.

> > I previously tried with the postgres regression tests, which clearly isn't
> > a representative workload, and as far as I can see the vast majority of
> > queries executed both as top level and nested level are DDL implying
> > recursion (e.g. a CREATE TABLE with underlying index creation).
>
> I think the PostgreSQL regression tests are so far from a real world
> workload that the input in this case has a value of exactly zero.

I totally agree, but that's the only one I had at that time :) Still, it wasn't entirely useless, as I didn't realize before that some DDL would lead to duplicated entries.
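Julien's idea of reusing the (query_offset, query_len) of an existing opposite-flag entry can be pictured in miniature. The sketch below is purely illustrative: the struct, the linear lookup, and the function names are invented here and are not pg_stat_statements code (which uses a shared hash table keyed on more fields):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative entry: keyed by (queryid, toplevel), pointing into a
 * shared query-text file via (query_offset, query_len).
 */
typedef struct Entry
{
    uint64_t queryid;
    bool     toplevel;
    size_t   query_offset;
    size_t   query_len;
} Entry;

#define MAX_ENTRIES 16
static Entry entries[MAX_ENTRIES];
static int   n_entries = 0;

static Entry *
find_entry(uint64_t queryid, bool toplevel)
{
    for (int i = 0; i < n_entries; i++)
        if (entries[i].queryid == queryid && entries[i].toplevel == toplevel)
            return &entries[i];
    return NULL;
}

/*
 * On insert, look for the same queryid with the opposite flag and, if
 * found, share its text location instead of appending the text again.
 */
static Entry *
insert_entry(uint64_t queryid, bool toplevel, size_t new_offset, size_t new_len)
{
    Entry *twin = find_entry(queryid, !toplevel);
    Entry *e = &entries[n_entries++];

    e->queryid = queryid;
    e->toplevel = toplevel;
    if (twin)
    {
        e->query_offset = twin->query_offset;
        e->query_len = twin->query_len;
    }
    else
    {
        e->query_offset = new_offset;
        e->query_len = new_len;
    }
    return e;
}
```

As discussed above, the insert side is cheap; the hard part (not shown) is the garbage-collection pass, which would have to preserve the sharing while rewriting the text file.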
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 7:57 PM Robert Haas wrote:
> That behavior feels unacceptable to me from a user expectations point
> of view. I think there's an argument that if I update a tuple that
> contains a compressed datum, and I don't update that particular
> column, I think it would be OK to not recompress the column. But, if I
> insert data into a table, I as a user would expect that the
> compression settings for that column are going to be respected.
> Deciding that's optional because we don't have a good way of making it
> fast seems like a major cop-out, at least to me. I think from a user
> perspective you don't expect INSERT INTO .. SELECT FROM to create a
> different final state than a dump and reload, and that if we deviate
> from that people are gonna be unhappy. I could be wrong; maybe it's
> only me who would be unhappy.

If that is the only argument then the same is possible today as well. I mean, you can run an INSERT INTO .. SELECT FROM where the source attribute holds compressed data but the target attribute uses external storage; in that case too we will move the compressed data as it is to the target table.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: libpq debug log
On 2021-Mar-15, iwata@fujitsu.com wrote:
> I create a protocol message reading function for each protocol message type.
> (Ex. pqTraceOutputB() reads the Bind message)
> This makes the nesting shallower and makes the code easier to read.

I'm not sure I agree with this structural change. Yes, it is less indentation, but these functions are all quite short and simple anyway. But I don't disagree strongly with it either.

Four issues:

* There's no cross-check that the protocol message length word corresponds to what we actually print. I think one easy way to cross-check that would be to return the "count" from the type-specific routines, and verify that it matches "length" in pqTraceOutputMessage. (If the separate routines are removed and the code put back in one giant switch, then the "count" is already available when the switch ends.)

* If we have separate functions for each message type, then we can have that function supply the message name by itself, rather than have the separate protocol_message_type_f / _b arrays that print it.

* If we make each type-specific function print the message name, then we need to make the same function print the message length, because it comes after. So the function would have to receive the length (so that it can be printed). But I think we should still have the cross-check I mentioned in my first point occur in the main pqTraceOutputMessage, not the type-specific function, for fear that we will forget the cross-check in one of the functions and never realize that we did.

* I would make the pqTraceOutputInt16() function and siblings advance the cursor themselves, actually. I think something like this:

    static int
    pqTraceOutputInt16(const char *data, int *cursor, FILE *pfdebug)
    {
        uint16  tmp;
        int     result;

        memcpy(&tmp, data + *cursor, 2);
        *cursor += 2;
        result = (int) pg_ntoh16(tmp);
        fprintf(pfdebug, " #%d", result);
        return result;
    }

(So the caller has to pass the original "data" pointer, not "data+cursor").
This means the caller no longer has to do "cursor+=N" at each place. Also, you get rid of the moveStrCursor() which does not look good.

Bikeshedding item:

* I'm not fond of the idea of prefixing "#" for 16bit ints and no prefix for 32bit ints. Seems obscure and the output looks weird. I would use a one-letter prefix for each type, "w" for 32-bit ints (stands for "word") and "h" for 16-bit ints (stands for "half-word"). Message length goes unadorned. Then you'd have lines like this:

    2021-03-15 08:10:44.324296 < RowDescription 35 h1 "spcoptions" w1213 h5 w1009 h65535 w-1 h0
    2021-03-15 08:10:44.324335 < DataRow 32 h1 w22 '{random_page_cost=3.0}'

* I don't like that pqTraceOutputS/H print nothing when !toServer. I think they should complain.

--
Álvaro Herrera       39°49'30"S 73°17'W
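The cursor-advancing pattern sketched above can be tried standalone. The snippet below is a hedged illustration only: be16_to_host/be32_to_host stand in for libpq's pg_ntoh16/pg_ntoh32, the function names are invented for the example, and the "h"/"w" prefixes follow the bikeshedding proposal rather than any committed behavior:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for pg_ntoh16/pg_ntoh32: decode big-endian wire bytes
 * without assuming anything about host endianness. */
static uint16_t
be16_to_host(const unsigned char *p)
{
    return (uint16_t) ((p[0] << 8) | p[1]);
}

static uint32_t
be32_to_host(const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Read a 16-bit int at *cursor, print it, and advance the cursor,
 * mirroring the proposed pqTraceOutputInt16 signature. */
static int
trace_output_int16(const char *data, int *cursor, FILE *pfdebug)
{
    int result = (int) be16_to_host((const unsigned char *) data + *cursor);

    *cursor += 2;
    fprintf(pfdebug, " h%d", result);
    return result;
}

/* Same idea for a 32-bit int, advancing the cursor by 4. */
static int
trace_output_int32(const char *data, int *cursor, FILE *pfdebug)
{
    int result = (int) be32_to_host((const unsigned char *) data + *cursor);

    *cursor += 4;
    fprintf(pfdebug, " w%d", result);
    return result;
}
```

Because each reader both returns the decoded value and moves the cursor, the caller can sum the bytes consumed and compare the total against the message length word, which is exactly the cross-check discussed in the first bullet.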
Re: [PATCH]: Allow errors in parameter values to be reported during the BIND phase itself..
Justin Pryzby writes: > I think it's somewhat confusing that there's two callbacks. > The first one applies only during typinput/typreceive. > I guess the 2nd one should say that they're printed "in *future errors". I adjusted the comments to make this a bit clearer, and pushed it. regards, tom lane
Re: GROUP BY DISTINCT
On 3/16/21 9:21 AM, Vik Fearing wrote:
> On 3/13/21 12:33 AM, Tomas Vondra wrote:
>> Hi Vik,
>>
>> The patch seems quite ready, I have just two comments.
>
> Thanks for taking a look.
>
>> 1) Shouldn't this add another index entry for DISTINCT, somewhere in the
>> documentation? Now the index points just to the SELECT DISTINCT part.
>
> Good idea; I never think about the index.
>
>> 2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
>> in order to stash it in the group lists is rather ugly, IMHO. It forces
>> all the places handling the list to be aware of this (there are not
>> many, but still ...). And there are no other places doing (bool) intVal
>> so it's not like there's a precedent for this.
>
> There is kind of a precedent for it, I was copying off of TriggerEvents
> and func_alias_clause.

I see. I was looking for "(bool) intVal" but you're right, the TriggerEvents code does something similar.

>> I think the clean solution is to make group_clause produce a struct with
>> two fields, and just use that. Not sure how invasive that will be
>> outside gram.y, though.
>
> I didn't want to create a whole new parse node for it, but Andrew Gierth
> pointed me towards SelectLimit so I did it like that and I agree it is
> much cleaner.

I agree, that's much cleaner.

>> Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
>> wonder if we can come up with some clearer names, describing the context
>> of those types.
>
> I turned this into an enum for ALL/DISTINCT/default and the caller can
> choose what it wants to do with default. I think that's a lot cleaner,
> too. Maybe DISTINCT ON should be changed to fit in that? I left it
> alone for now.

I think DISTINCT ON is a different kind of animal, because that is a list of expressions, not just a simple enum state.

> I also snuck in something that all of us overlooked, which is outputting
> the DISTINCT in ruleutils.c. I didn't add a test for it but that would
> have been an unfortunate bug.
> Oh! > New patch attached, rebased on 15639d5e8f. > Thanks. At this point it seems fine to me, no further comments. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
re: crash during cascaded foreign key update
> #0  0x7f747e6e2387 in raise () from /lib64/libc.so.6
> #1  0x7f747e6e3a78 in abort () from /lib64/libc.so.6
> #2  0x00ae056a in ExceptionalCondition (
>     conditionName=0xb67c10 "!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid)",
>     errorType=0xb66d89 "FailedAssertion", fileName=0xb66e68 "heapam.c",
>     lineNumber=3560) at assert.c:69
> #3  0x004eed16 in heap_update (relation=0x7f747f569590, otid=0x7ffe6f236ec0,
>     newtup=0x1c214b8, cid=2, crosscheck=0x1c317f8, wait=true,
>     tmfd=0x7ffe6f236df0, lockmode=0x7ffe6f236dec) at heapam.c:3560

I have this report from one static analysis tool:

heapam.c (9379): Dereferencing of a potential null pointer 'oldtup.t_data'

regards,
Ranier Vilela
Re: [HACKERS] Custom compression methods
On Mon, Mar 15, 2021 at 6:58 PM Andres Freund wrote:
> I don't particularly like PG_RETURN_HEAPTUPLEHEADER_RAW(). What is "raw"
> about it? It also seems to me like there needs to at least be a
> sentence or two explaining when to use which of the functions.

It seemed like the natural name to me; we use "raw" elsewhere to mean that fewer things are magically addressed on behalf of the caller, e.g. HeapTupleHeaderGetRawXmin. I'm open to suggestions, however.

> I think heap_copy_tuple_as_raw_datum() should grow an assert checking
> there are no external columns?

Yeah, could be done.

> I don't like that after 0002 ExecEvalRow(), ExecEvalFieldStoreForm()
> contain a nearly identical copy of the same code. And
> make_tuple_from_row() also is similar. It seems that there should be a
> heap_form_tuple() version doing this for us?

I was worried about having either a performance impact or code duplication. The actual place where you could insert this organically is in fill_val(), which is called from heap_fill_tuple(), which is called from heap_form_tuple(). If you don't mind passing down 'int flags' or similar to all those, and having additional branches to make the behavior dependent on the flags, I'm cool with it. Or if you think we should template-ize all those functions, that'd be another way to go. But I was afraid I would get complaints about adding overhead to hot code paths.

> > I'm open to being convinced that we don't need to do either of these
> > things, and that the cost of iterating over all varlenas in the tuple
> > is not so bad as to preclude doing things as you have them here. But,
> > I'm afraid it's going to be too expensive.
>
> I mean, I would just define several of those places away by not caring
> about tuples in a different compression format ending up in a
> table...

That behavior feels unacceptable to me from a user expectations point of view.
I think there's an argument that if I update a tuple that contains a compressed datum, and I don't update that particular column, I think it would be OK to not recompress the column. But, if I insert data into a table, I as a user would expect that the compression settings for that column are going to be respected. Deciding that's optional because we don't have a good way of making it fast seems like a major cop-out, at least to me. I think from a user perspective you don't expect INSERT INTO .. SELECT FROM to create a different final state than a dump and reload, and that if we deviate from that people are gonna be unhappy. I could be wrong; maybe it's only me who would be unhappy. -- Robert Haas EDB: http://www.enterprisedb.com
Re: crash during cascaded foreign key update
Amit Langote writes: > With HEAD (I think v12 and greater), I see $subject when trying out > the following scenario: I wonder if this is related to https://www.postgresql.org/message-id/flat/89429.1584443208%40antos which we've still not done anything about. regards, tom lane
Re: Nondeterministic collations and the value returned by GROUP BY x
Jim Finnerty writes: > right. It doesn't matter which of the values is returned; however, a > plausible-sounding implementation would case-fold the value, like GROUP BY > LOWER(x), but the case-folded value isn't necessarily one of the original > values and so that could be subtly wrong in the case-insensitive case, and > could in principle be completely wrong in the most general nondeterministic > collation case where the case-folded value isn't even equal to the other > members of the set. > does the implementation in PG12 ensure that some member of the set of equal > values is chosen as the representative value? Without having actually looked, I'm pretty certain it does. Considerations of data type independence would seem to rule out a hack like applying case folding. There might be case folding happening internally to comparison functions, like citext_cmp, but that wouldn't affect the grouping logic that is going to save aside one of the group of peer values. regards, tom lane
crash during cascaded foreign key update
With HEAD (I think v12 and greater), I see $subject when trying out the following scenario:

-- in backend 1
create table p (a int primary key);
create table f (a int references p on update cascade deferrable initially deferred);
insert into p values (1);
begin isolation level serializable;
insert into p values (3);

-- in another backend
insert into f values (1)

-- back in backend 1
update p set a = a + 1;

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I see the following backtrace:

#0  0x7f747e6e2387 in raise () from /lib64/libc.so.6
#1  0x7f747e6e3a78 in abort () from /lib64/libc.so.6
#2  0x00ae056a in ExceptionalCondition (
    conditionName=0xb67c10 "!ItemPointerEquals(&oldtup.t_self, &oldtup.t_data->t_ctid)",
    errorType=0xb66d89 "FailedAssertion", fileName=0xb66e68 "heapam.c",
    lineNumber=3560) at assert.c:69
#3  0x004eed16 in heap_update (relation=0x7f747f569590, otid=0x7ffe6f236ec0,
    newtup=0x1c214b8, cid=2, crosscheck=0x1c317f8, wait=true,
    tmfd=0x7ffe6f236df0, lockmode=0x7ffe6f236dec) at heapam.c:3560
#4  0x004fdb52 in heapam_tuple_update (relation=0x7f747f569590,
    otid=0x7ffe6f236ec0, slot=0x1c43fc8, cid=2, snapshot=0x1c31a30,
    crosscheck=0x1c317f8, wait=true, tmfd=0x7ffe6f236df0,
    lockmode=0x7ffe6f236dec, update_indexes=0x7ffe6f236deb) at heapam_handler.c:327
#5  0x0075a7fc in table_tuple_update (rel=0x7f747f569590, otid=0x7ffe6f236ec0,
    slot=0x1c43fc8, cid=2, snapshot=0x1c31a30, crosscheck=0x1c317f8, wait=true,
    tmfd=0x7ffe6f236df0, lockmode=0x7ffe6f236dec, update_indexes=0x7ffe6f236deb)
    at ../../../src/include/access/tableam.h:1509
#6  0x0075cd20 in ExecUpdate (mtstate=0x1c42540, resultRelInfo=0x1c42778,
    tupleid=0x7ffe6f236ec0, oldtuple=0x0, slot=0x1c43fc8, planSlot=0x1c43e78,
    epqstate=0x1c42638, estate=0x1c422d0, canSetTag=true) at nodeModifyTable.c:1498
#7  0x0075e0a3 in ExecModifyTable (pstate=0x1c42540) at nodeModifyTable.c:2254
#8  0x0072674e in ExecProcNodeFirst (node=0x1c42540) at execProcnode.c:456
#9  0x0071b13b in ExecProcNode (node=0x1c42540)
    at ../../../src/include/executor/executor.h:247
#10 0x0071d8f3 in ExecutePlan (estate=0x1c422d0, planstate=0x1c42540,
    use_parallel_mode=false, operation=CMD_UPDATE, sendTuples=false,
    numberTuples=0, direction=ForwardScanDirection, dest=0xcb1440 ,
    execute_once=true) at execMain.c:1531
#11 0x0071b75f in standard_ExecutorRun (queryDesc=0x1c4cd18,
    direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:350
#12 0x0071b587 in ExecutorRun (queryDesc=0x1c4cd18,
    direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:294
#13 0x00777a88 in _SPI_pquery (queryDesc=0x1c4cd18, fire_triggers=false,
    tcount=0) at spi.c:2729
#14 0x007774fa in _SPI_execute_plan (plan=0x1bf93d0, paramLI=0x1c402c0,
    snapshot=0x1036840 , crosscheck_snapshot=0x1c317f8, read_only=false,
    no_snapshots=false, fire_triggers=false, tcount=0, caller_dest=0x0,
    plan_owner=0x1bc1c10) at spi.c:2500
#15 0x007740a9 in SPI_execute_snapshot (plan=0x1bf93d0, Values=0x7ffe6f237340,
    Nulls=0x7ffe6f237300 " ", snapshot=0x1036840 ,
    crosscheck_snapshot=0x1c317f8, read_only=false, fire_triggers=false,
    tcount=0) at spi.c:693
#16 0x00a52724 in ri_PerformCheck (riinfo=0x1c3f2f8, qkey=0x7ffe6f2378a0,
    qplan=0x1bf93d0, fk_rel=0x7f747f569590, pk_rel=0x7f747f564a30,
    oldslot=0x1c042b8, newslot=0x1c04420, detectNewRows=true, expect_OK=9)
    at ri_triggers.c:2517
#17 0x00a4fee5 in RI_FKey_cascade_upd (fcinfo=0x7ffe6f237a60) at ri_triggers.c:1163
#18 0x006ea114 in ExecCallTriggerFunc (trigdata=0x7ffe6f237b00, tgindx=1,
    finfo=0x1bc5be0, instr=0x0, per_tuple_context=0x1c06760) at trigger.c:2141
#19 0x006ed216 in AfterTriggerExecute (estate=0x1bc52f0, event=0x1c196c0,
    relInfo=0x1bc5798, trigdesc=0x1bc59d0, finfo=0x1bc5bb0, instr=0x0,
    per_tuple_context=0x1c06760, trig_tuple_slot1=0x0, trig_tuple_slot2=0x0)
    at trigger.c:4030
#20 0x006ed6e5 in afterTriggerInvokeEvents (events=0x1c31ac8, firing_id=1,
    estate=0x1bc52f0, delete_ok=false) at trigger.c:4244
#21 0x006ede4c in AfterTriggerEndQuery (estate=0x1bc52f0) at trigger.c:4581
#22 0x0071b90c in standard_ExecutorFinish (queryDesc=0x1c13040) at execMain.c:425
#23 0x0071b803 in ExecutorFinish (queryDesc=0x1c13040) at execMain.c:393
#24 0x00955ec6 in ProcessQuery (plan=0x1c51b30,
    sourceText=0x1b424a0 "update p set a = a + 1;", params=0x0, queryEnv=0x0,
    dest=0x1c51ca0, qc=0x7ffe6f237f20) at pquery.c:190
#25 0x00957701 in PortalRunMulti (portal=0x1ba4980, isTopLevel=true,
    setHoldSnapshot=false, dest=0x1c51ca0, altdest=0x1c51ca0,
    qc=0x7ffe6f237f20) at pquery.c:1267
#26 0x00956ca5 in PortalRun (portal=0
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Mar 16, 2021 at 6:22 PM vignesh C wrote: > > On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote: > > > > On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila wrote: > >> > > 2) table_states_not_ready global variable is used immediately after > call to FetchTableStates, we can make FetchTableStates return the > value or get it as an argument to the function and the global > variables can be removed. > +static List *table_states_not_ready = NIL; > But we do update the states in the list table_states_not_ready in function process_syncing_tables_for_apply. So, the current arrangement looks good to me. -- With Regards, Amit Kapila.
Re: pg_subscription - substream column?
On Tue, Mar 16, 2021 at 5:27 PM Peter Smith wrote: > > On Tue, Mar 16, 2021 at 7:20 PM Amit Kapila wrote: > > > > On Tue, Mar 16, 2021 at 3:35 AM Peter Smith wrote: > > > > > > I noticed that the PG docs [1] for the catalog pg_subscription doesn't > > > have any mention of the substream column. > > > > > > Accidental omission by commit 4648243 [2] from last year? > > > > > > > Right. I'll fix this unless you or someone else is interested in > > providing a patch for this. > > Sure, I will send a patch for it tomorrow. > Attached, please find the patch to update the description of substream in pg_subscription. -- With Regards, Amit Kapila. v1-0001-Doc-Add-a-description-of-substream-in-pg_subscrip.patch Description: Binary data
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan wrote: > > On Thu, Mar 11, 2021 at 8:31 AM Robert Haas wrote: > > But even if not, I'm not sure this > > helps much with the situation you're concerned about, which involves > > non-HOT tuples. > > Attached is a POC-quality revision of Masahiko's > skip_index_vacuum.patch [1]. There is an improved design for skipping > index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres > 12). I'm particularly interested in your perspective on this > refactoring stuff, Robert, because you ran into the same issues after > initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran > into issues with the "tupgone = true" special case. This is the case > where VACUUM considers a tuple dead that was not marked LP_DEAD by > pruning, and so needs to be killed in the second heap scan in > lazy_vacuum_heap() instead. You'll recall that these issues were fixed > by your commit dd695979888 from May 2019. I think that we need to go > further than you did in dd695979888 for this -- we ought to get rid of > the special case entirely. Thank you for the patch! > > This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer > get invoked as if it was like the "no indexes on table so do it all in > one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP > OFF isn't able to call lazy_vacuum_page() at all (for the obvious > reason), so any similarity between the two cases was always > superficial -- skipping index vacuuming should not be confused with > doing a one-pass VACUUM/having no indexes at all. The original > INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always > seemed confusing to me for this reason, FWIW. Agreed. 
> > Note that I've merged multiple existing functions in vacuumlazy.c into > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > into a single function named vacuum_indexes_mark_unused() (note also > that lazy_vacuum_page() has been renamed to mark_unused_page() to > reflect the fact that it is now strictly concerned with making LP_DEAD > line pointers LP_UNUSED). The big idea is that there is one choke > point that decides whether index vacuuming is needed at all at one > point in time, dynamically. vacuum_indexes_mark_unused() decides this > for us at the last moment. This can only happen during a VACUUM that > has enough memory to fit all TIDs -- otherwise we won't skip anything > dynamically. > > We may in the future add additional criteria for skipping index > vacuuming. That can now just be added to the beginning of this new > vacuum_indexes_mark_unused() function. We may even teach > vacuum_indexes_mark_unused() to skip some indexes but not others in a > future release, a possibility that was already discussed at length > earlier in this thread. This new structure has all the context it > needs to do all of these things. I agree to create a function like vacuum_indexes_mark_unused() that makes a decision and does index and heap vacumming accordingly. But what is the point of removing both lazy_vacuum_all_indexes() and lazy_vacuum_heap()? I think we can simply have vacuum_indexes_mark_unused() call those functions. I'm concerned that if we add additional criteria to vacuum_indexes_mark_unused() in the future the function will become very large. > > I wonder if we can add some kind of emergency anti-wraparound vacuum > logic to what I have here, for Postgres 14. Can we come up with logic > that has us skip index vacuuming because XID wraparound is on the > verge of causing an outage? That seems like a strategically important > thing for Postgres, so perhaps we should try to get something like > that in. 
> Practically every post mortem blog post involving Postgres
> also involves anti-wraparound vacuum.

+1

I think we can set VACOPT_TERNARY_DISABLED to tab->at_params.index_cleanup in table_recheck_autovac(), or increase the thresholds used, to not skip index vacuuming.

> One consequence of my approach is that we now call
> lazy_cleanup_all_indexes(), even when we've skipped index vacuuming
> itself. We should at least "check-in" with the indexes IMV. To an
> index AM, this will be indistinguishable from a VACUUM that never had
> tuples for it to delete, and so never called ambulkdelete() before
> calling amvacuumcleanup(). This seems logical to me: why should there
> be any significant behavioral divergence between the case where there
> are 0 tuples to delete and the case where there is 1 tuple to delete?
> The extra work that we perform in amvacuumcleanup() (if any) should
> almost always be a no-op in nbtree following my recent refactoring
> work. More generally, if an index AM is doing too much during cleanup,
> and this becomes a bottleneck, then IMV that's a problem that needs to
> be fixed in the index AM.

Aside from whether it's good or bad, strictly speaking, it could change the index AM API contract. The documentation of amvacuumcleanup() says:

---
stats is wh
Re: simplifying foreign key/RI checks
On Mon, Mar 8, 2021 at 11:41 PM Amit Langote wrote: > On Thu, Mar 4, 2021 at 5:15 AM Tom Lane wrote: > > Lastly, ri_PerformCheck is pretty careful about not only which > > snapshot it uses, but which *pair* of snapshots it uses, because > > sometimes it needs to worry about data changes since the start > > of the transaction. You've ignored all of that complexity AFAICS. > > That's okay (I think) for RI_FKey_check which was passing > > detectNewRows = false, but for sure it's not okay for > > ri_Check_Pk_Match. (I kind of thought we had isolation tests > > that would catch that, but apparently not.) > > Okay, let me closely check the ri_Check_Pk_Match() case and see if > there's any live bug. I checked, and AFAICS, the query invoked by ri_Check_Pk_Match() (that is, without the patch) does not use the "crosscheck" snapshot at any point during its execution. That snapshot is only used in the table_update() and table_delete() routines, which are not involved in the execution of ri_Check_Pk_Match()'s query. I dug through git history and -hackers archives to understand the origins of RI code's use of a crosscheck snapshot and came across this discussion: https://www.postgresql.org/message-id/20031001150510.U45145%40megazone.bigpanda.com If I am reading the discussion and the details in subsequent commit 55d85f42a891a correctly, the crosscheck snapshot is only to be used to ensure, under serializable isolation, that any attempts by the RI query of updating/deleting rows that are not visible to the transaction snapshot cause a serialization error. Use of the same facilities in ri_Check_Pk_Match() was merely done as future-proofing, with no particular use case to address, then and perhaps even now. If that is indeed the case, it does not seem particularly incorrect for ri_ReferencedKeyExists() added by the patch to not bother with setting up a crosscheck snapshot, even when called from ri_Check_Pk_Match(). Am I missing something? 
-- Amit Langote EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Tue, Mar 16, 2021 at 4:07 PM Dilip Kumar wrote:
> INSERT TIME
> Head: 17418.299 ms  Patch: 20956.231 ms
>
> CTAS TIME:
> Head: 12837.872 ms  Patch: 16775.739 ms

On quick analysis with perf, it appeared that the performance is degrading because of deforming:

-   16.19%     3.54%  postgres  postgres  [.] CompareCompressionMethodAndDecompress
   - 12.65% CompareCompressionMethodAndDecompress
      - 12.57% slot_getallattrs
         - 12.56% slot_getsomeattrs
            - 12.53% slot_getsomeattrs_int
               - 12.50% tts_buffer_heap_getsomeattrs
                    slot_deform_heap_tuple

So I think in the case of direct insert it needs to deform because I am calling CompareCompressionMethodAndDecompress after ExecCopySlot, and that is why we have to deform every time; maybe that can be avoided by calling CompareCompressionMethodAndDecompress before ExecCopySlot. But in the case of CTAS or INSERT INTO ... SELECT we can not avoid deforming, because we might get the formed tuple from the source table. I put a temporary hack to keep a map of the varlena attributes and use it across tuples, but it did not improve the performance in this case because the main bottleneck is slot_getallattrs. I think this should help where we don't have any varlena, but first I need to test whether we see any performance regression in those cases at all.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
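The "map of the varlena attributes" hack Dilip mentions can be pictured as a per-tuple-descriptor cache built once and reused for every tuple, so that only potentially-compressed columns need to be visited. The sketch below is purely illustrative (simplified descriptor, invented names, not the actual patch); and, as noted above, it cannot help when slot_getallattrs itself is the bottleneck:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified attribute descriptor: attlen == -1 means varlena. */
typedef struct Attr
{
    int attlen;
} Attr;

typedef struct TupleDesc
{
    int  natts;
    Attr attrs[8];
    /* Cached map: indexes of varlena attributes, built once. */
    int  n_varlena;
    int  varlena_atts[8];
    bool map_built;
} TupleDesc;

/*
 * Build the varlena map once per descriptor, so per-tuple work can
 * iterate over only the attributes that might need decompression
 * instead of rescanning every attribute for every tuple.
 */
static void
build_varlena_map(TupleDesc *desc)
{
    if (desc->map_built)
        return;

    desc->n_varlena = 0;
    for (int i = 0; i < desc->natts; i++)
        if (desc->attrs[i].attlen == -1)
            desc->varlena_atts[desc->n_varlena++] = i;
    desc->map_built = true;
}
```

With no varlena columns at all, the map is empty and the per-tuple compression check degenerates to a no-op, which is the case Dilip expects the hack to help.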
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian wrote:
> On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila wrote:
>> I think something on these lines should be much
>> easier than the spool-file implementation unless we see any problem
>> with this idea.
>
> Here's a new patch-set that implements this new solution proposed by Amit.

Another couple of comments:

1) In the code below, should the Assert be changed to an error check like the following?

if (!HeapTupleIsValid(tup))
    elog(ERROR, "cache lookup failed for subscription %u",
         MySubscription->oid);

+   rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+   tup = SearchSysCacheCopy1(SUBSCRIPTIONOID,
+                             ObjectIdGetDatum(MySubscription->oid));
+   Assert(HeapTupleIsValid(tup));

2) The table_states_not_ready global variable is used immediately after the call to FetchTableStates; we could make FetchTableStates return the value, or pass it as an argument to the function, and the global variables could be removed.

+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;

Regards,
Vignesh
RE: subscriptionCheck failures
Hi On Tuesday, March 16, 2021 4:15 PM vignesh C wrote: > On Tue, Mar 16, 2021 at 12:29 PM Amit Kapila > wrote: > > > > On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila > wrote: > > > > > > On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro > wrote: > > > > > > > > Hi, > > > > > > > > This seems to be a new low frequency failure, I didn't see it mentioned > already: > > > > > > > > > > Thanks for reporting, I'll look into it. > > > > > > > By looking at the logs [1] in the buildfarm, I think I know what is > > going on here. After Create Subscription, the tablesync worker is > > launched and tries to create the slot for doing the initial copy but > > before it could finish creating the slot, we issued the Drop > > Subscription which first stops the tablesync worker and then tried to > > drop its slot. Now, it is quite possible that by the time Drop > > Subscription tries to drop the tablesync slot, it is not yet created. > > We treat this condition okay and just Logs the message. I don't think > > this is an issue because anyway generally such a slot created on the > > server will be dropped before we persist it but the test was checking > > the existence of slots on server before it gets dropped. I think we > > can avoid such a situation by preventing cancel/die interrupts while > > creating tablesync slot. > > > > This is a timing issue, so I have reproduced it via debugger and > > tested that the attached patch fixes it. > > > > Thanks for the patch. > I was able to reproduce the issue using debugger by making it wait at > CreateReplicationSlot. After applying the patch the issue gets solved. I really appreciate everyone's help. For the double check, I utilized the patch and debugger too. I also put one while loop at the top of CreateReplicationSlot to control walsender. Without the patch, DROP SUBSCRIPTION goes forward, even when the table sync worker cannot move by the CreateReplicationSlot loop, and the table sync worker is killed by DROP SUBSCRIPTION command. 
On the other hand, with the patch contents, I observed that DROP SUBSCRIPTION hangs and waits until I release the walsender process from CreateReplicationSlot. After this, the command drops two slots like below.

NOTICE:  dropped replication slot "pg_16391_sync_16385_6940222843739406079" on publisher
NOTICE:  dropped replication slot "mysub1" on publisher
DROP SUBSCRIPTION

To me, this works correctly, because the point at which the while loop stops the walsender means the DROP SUBSCRIPTION has to deal with both slots.

Any comments?

Best Regards,
Takamichi Osumi
Re: dynamic result sets support in extended query protocol
On 15.03.21 14:56, David Steele wrote: Hi Peter, On 12/30/20 9:33 AM, Peter Eisentraut wrote: On 2020-10-09 20:46, Andres Freund wrote: Is there really a good reason for forcing the client to issue NextResult, Describe, Execute for each of the dynamic result sets? It's not like there's really a case for allowing the clients to skip them, right? Why aren't we sending something more like S: CommandPartiallyComplete S: RowDescription S: DataRow... S: CommandPartiallyComplete S: RowDescription S: DataRow... ... S: CommandComplete C: Sync I want to post my current patch, to keep this discussion moving. CFBot reports that tests are failing, although the patch applies. Yes, as explained in the message, you need another patch that makes psql show the additional result sets. The cfbot cannot handle that kind of thing. In the meantime, I have made a few small fixes, so I'm attaching another patch. From 163d2ba39a0b46deb83e7509d85a5b2012fd84ec Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 16 Mar 2021 11:28:53 +0100 Subject: [PATCH v2] Dynamic result sets from procedures Declaring a cursor WITH RETURN in a procedure makes the cursor's data be returned as a result of the CALL invocation. The procedure needs to be declared with the DYNAMIC RESULT SETS attribute. 
Discussion: https://www.postgresql.org/message-id/flat/6e747f98-835f-2e05-cde5-86ee444a7...@2ndquadrant.com
---
 doc/src/sgml/catalogs.sgml                 | 10 +++
 doc/src/sgml/information_schema.sgml       |  3 +-
 doc/src/sgml/protocol.sgml                 | 19 +
 doc/src/sgml/ref/alter_procedure.sgml      | 12 +++
 doc/src/sgml/ref/create_procedure.sgml     | 14 +++
 doc/src/sgml/ref/declare.sgml              | 34 +++-
 src/backend/catalog/information_schema.sql |  2 +-
 src/backend/catalog/pg_aggregate.c         |  3 +-
 src/backend/catalog/pg_proc.c              |  4 +-
 src/backend/catalog/sql_features.txt       |  2 +-
 src/backend/commands/functioncmds.c        | 75 ++--
 src/backend/commands/portalcmds.c          | 23 +
 src/backend/commands/typecmds.c            | 12 ++-
 src/backend/parser/gram.y                  | 20 -
 src/backend/tcop/postgres.c                | 62 +-
 src/backend/tcop/pquery.c                  |  6 ++
 src/backend/utils/errcodes.txt             |  1 +
 src/backend/utils/mmgr/portalmem.c         | 48 +++
 src/bin/pg_dump/pg_dump.c                  | 16 +++-
 src/include/catalog/pg_proc.h              |  6 +-
 src/include/commands/defrem.h              |  2 +
 src/include/nodes/parsenodes.h             |  9 +-
 src/include/parser/kwlist.h                |  3 +
 src/include/utils/portal.h                 | 14 +++
 src/interfaces/libpq/fe-protocol3.c        |  6 +-
 .../regress/expected/create_procedure.out  | 85 ++-
 src/test/regress/sql/create_procedure.sql  | 61 -
 27 files changed, 518 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b1de6d0674..c30d6328ee 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5844,6 +5844,16 @@ pg_proc Columns
+
+
+ prodynres int4
+
+
+ For procedures, this records the maximum number of dynamic result sets
+ the procedure may create. Otherwise zero.
+
+
+
 pronargs int2
diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index 4100198252..7f7498eeff 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -5884,7 +5884,8 @@ routines Columns
 max_dynamic_result_sets cardinal_number
- Applies to a feature not available in PostgreSQL
+ For a procedure, the maximum number of dynamic result sets. Otherwise
+ zero.
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe62a..4fe0b271e7 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -959,6 +959,25 @@ Extended Query
 an empty query string), ErrorResponse, or PortalSuspended.
+
+Executing a portal may give rise to a dynamic result set
+sequence. That means the command contained in the portal
+created additional result sets beyond what it normally returns. (The
+typical example is calling a stored procedure that creates dynamic result
+sets.) Dynamic result sets are issued after whatever response the main
+command issued. Each dynamic result set begins with a RowDescription
+message followed by zero or more DataRow messages. (Since, as explained
+above, an Execute message normally does not respond with a RowDescription,
+the appearance of the first RowDescription marks the end of the primary
+result set of the portal and the beginning of the first d
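For readers following along, here is a hypothetical usage sketch pieced together from the patch's commit message ("Declaring a cursor WITH RETURN in a procedure" plus the "DYNAMIC RESULT SETS attribute"). The exact grammar is defined by the patch itself, so treat the procedure and table names — and the syntax details — as assumptions, not released PostgreSQL behavior:

```sql
-- Assumed syntax, per the commit message above: the procedure declares how
-- many dynamic result sets it may produce, and cursors declared WITH RETURN
-- are handed back to the caller of CALL as extra result sets.
CREATE TABLE t1 (a int);
CREATE TABLE t2 (b text);

CREATE PROCEDURE two_result_sets()
LANGUAGE SQL
DYNAMIC RESULT SETS 2
AS $$
DECLARE c1 CURSOR WITH RETURN FOR SELECT * FROM t1;
DECLARE c2 CURSOR WITH RETURN FOR SELECT * FROM t2;
$$;

-- With a client that understands the protocol additions, CALL would return
-- the contents of c1 and c2 as dynamic result sets after its own response.
CALL two_result_sets();
```

This would only run against a server built with the patch applied, together with a client patched to display the additional result sets.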
Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW
On Tue, Mar 16, 2021 at 1:15 AM Tom Lane wrote:
>
> [ Sorry for not looking at this thread sooner ]
>
> Bharath Rupireddy writes:
> > Currently, $subject is not allowed. We do plan the mat view query
> > before every refresh. I propose to show the explain/explain analyze of
> > the select part of the mat view in case of Refresh Mat View (RMV).
>
> TBH, I think we should reject this. The problem with it is that it
> binds us to the assumption that REFRESH MATERIALIZED VIEW has an
> explainable plan. There are various people poking at ideas like
> incremental matview updates, which might rely on some implementation
> that doesn't exactly equate to a SQL query. Incremental updates are
> hard enough already; they'll be even harder if they also have to
> maintain compatibility with a pre-existing EXPLAIN behavior.
>
> I don't really see that this feature buys us anything you can't
> get by explaining the view's query, so I think we're better advised
> to keep our options open about how REFRESH might be implemented
> in future.

That makes sense to me. Thanks for the comments. I'm fine with withdrawing the patch.

I would like to see whether the 0001 patch (attached here) would be useful at all. It just splits the existing ExecRefreshMatView into a few functions to make the code more readable. I'm okay with withdrawing it too if no one agrees.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

From 26970ffd33e67324609a03f0f61eeb6406216a7f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Fri, 5 Mar 2021 15:47:12 +0530
Subject: [PATCH v7 1/2] Rearrange Refresh Mat View Code

Currently, the function ExecRefreshMatView in matview.c contains many lines of code, which is not good from a readability and maintainability perspective. This patch adds a few functions and moves code from ExecRefreshMatView into them, making the code easier to follow.
---
 src/backend/commands/matview.c | 449 -
 1 file changed, 269 insertions(+), 180 deletions(-)

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index c5c25ce11d..18e18fa627 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -64,7 +64,7 @@ static void transientrel_startup(DestReceiver *self, int operation, TupleDesc ty
 static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
-static uint64 refresh_matview_datafill(DestReceiver *dest, Query *query,
+static uint64 refresh_matview_datafill(Oid OIDNewHeap, Query *query,
 				const char *queryString);
 static char *make_temptable_name_n(char *tempname, int n);
 static void refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
@@ -73,6 +73,16 @@ static void refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersist
 static bool is_usable_unique_index(Relation indexRel);
 static void OpenMatViewIncrementalMaintenance(void);
 static void CloseMatViewIncrementalMaintenance(void);
+static Query *get_matview_query(RefreshMatViewStmt *stmt, Relation *rel,
+				Oid *objectId);
+static Query *rewrite_refresh_matview_query(Query *dataQuery);
+static Oid get_new_heap_oid(RefreshMatViewStmt *stmt, Relation matviewRel,
+			Oid matviewOid, char *relpersistence);
+static void match_matview_with_new_data(RefreshMatViewStmt *stmt,
+					Relation matviewRel, Oid matviewOid,
+					Oid OIDNewHeap, Oid relowner,
+					int save_sec_context,
+					char relpersistence, uint64 processed);

 /*
  * SetMatViewPopulatedState
@@ -140,127 +150,18 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
 {
 	Oid			matviewOid;
 	Relation	matviewRel;
-	RewriteRule *rule;
-	List	   *actions;
 	Query	   *dataQuery;
-	Oid			tableSpace;
-	Oid			relowner;
 	Oid			OIDNewHeap;
-	DestReceiver *dest;
 	uint64		processed = 0;
-	bool		concurrent;
-	LOCKMODE	lockmode;
+	Oid			relowner;
 	char		relpersistence;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
 	ObjectAddress address;

-	/* Determine strength of lock needed. */
-	concurrent = stmt->concurrent;
-	lockmode = concurrent ? ExclusiveLock : AccessExclusiveLock;
-
-	/*
-	 * Get a lock until end of transaction.
-	 */
-	matviewOid = RangeVarGetRelidExtended(stmt->relation,
-										  lockmode, 0,
-										  RangeVarCallbackOwnsTable, NULL);
-	matviewRel = table_open(matviewOid, NoLock);
-
-	/* Make sure it is a materialized view. */
-	if (matviewRel->rd_rel->relkind != RELKIND_MATVIEW)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("\"%s\" is not a materialized view",
-						RelationGetRelationName(matviewRel))));
-
-	/* Check that CONCURRENTLY is not specified if not populated. */
-	if (concurrent && !RelationIsPopulated(matviewRel))
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("CONCURRENTLY cannot be used when the material
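Pulling the new helper declarations out of the diff, the refactored ExecRefreshMatView presumably reduces to a short driver. The following is only a pseudocode sketch of that call structure inferred from the declared function names and signatures; the real ordering and arguments are in the patch itself:

```
ExecRefreshMatView(stmt, queryString, ...):
    dataQuery  = get_matview_query(stmt, &matviewRel, &matviewOid)
    dataQuery  = rewrite_refresh_matview_query(dataQuery)
    OIDNewHeap = get_new_heap_oid(stmt, matviewRel, matviewOid, &relpersistence)
    processed  = refresh_matview_datafill(OIDNewHeap, dataQuery, queryString)
    match_matview_with_new_data(stmt, matviewRel, matviewOid, OIDNewHeap,
                                relowner, save_sec_context,
                                relpersistence, processed)
```

Each helper encapsulates one of the stages previously inlined in the 300-plus-line function, which is the readability gain the commit message describes.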
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Mon, Mar 15, 2021 at 3:21 PM Avinash Kumar wrote:
> Hi,
>
> On Mon, Mar 15, 2021 at 1:18 PM Peter Geoghegan wrote:
>
>> On Mon, Mar 15, 2021 at 6:56 AM Avinash Kumar wrote:
>> > psql:amchecksql.sql:17: DEBUG: leaf block 1043751 of index
>> > "idx_id_mtime" has no first data item
>>
>> That one is harmless.
>>
>> > And one error as follows.
>> >
>> > psql:amchecksql.sql:17: ERROR: down-link lower bound invariant
>> > violated for index "some_other_index"
>>
>> That indicates corruption. Can you tie this back to the crash? Is it
>> the same index?
>
> No, that's not the same index. The index discussed in the previous
> messages shows the following output:
>
> DEBUG: verifying consistency of tree structure for index "idx_id_mtime" with cross-level checks
> DEBUG: verifying level 2 (true root level)
> DEBUG: verifying level 1
> DEBUG: verifying level 0 (leaf level)
> DEBUG: verifying that tuples from index "idx_id_mtime" are present in "player"
> DEBUG: finished verifying presence of 1966412 tuples from table "player" with bitset 29.89% set
> LOG: duration: 3341.755 ms statement: SELECT bt_index_parent_check(index => c.oid, heapallindexed => true), c.relname, c.relpages FROM pg_index i JOIN pg_opclass op ON i.indclass[0] = op.oid JOIN pg_am am ON op.opcmethod = am.oid JOIN pg_class c ON i.indexrelid = c.oid JOIN pg_namespace n ON c.relnamespace = n.oid WHERE am.amname = 'btree' AND c.relpersistence != 't' AND c.relkind = 'i' AND i.indisready AND i.indisvalid AND indexrelid = 80774 AND n.nspname = 'public' ORDER BY c.relpages DESC;
>
> bt_index_parent_check | relname      | relpages
> ----------------------+--------------+----------
>                       | idx_id_mtime |     8439
> (1 row)

I'm afraid this looks to me like a deduplication bug, but I'm not sure how it can be pinpointed. If there is something I could do to determine that, I would be more than happy to.

--
Regards,
Avi
Re: pg_subscription - substream column?
On Tue, Mar 16, 2021 at 7:20 PM Amit Kapila wrote:
>
> On Tue, Mar 16, 2021 at 3:35 AM Peter Smith wrote:
> >
> > I noticed that the PG docs [1] for the catalog pg_subscription don't
> > have any mention of the substream column.
> >
> > An accidental omission by commit 4648243 [2] from last year?
> >
>
> Right. I'll fix this unless you or someone else is interested in
> providing a patch for it.

Sure, I will send a patch for it tomorrow.

Kind Regards,
Peter Smith.
Fujitsu Australia.