Re: bug in update tuple routing with foreign partitions

2019-04-18 Thread Amit Langote
On 2019/04/19 14:39, Etsuro Fujita wrote:
> (2019/04/19 13:00), Amit Langote wrote:
>> On 2019/04/18 22:10, Etsuro Fujita wrote:
>>> * I kept all the changes in the previous patch, because otherwise
>>> postgres_fdw would fail to release resources for a foreign-insert
>>> operation created by postgresBeginForeignInsert() for a tuple-routable
>>> foreign table (ie, a foreign-table subplan resultrel that has been updated
>>> already) during postgresEndForeignInsert().
>>
>> Hmm, are you saying that for the cases where we'll still allow tuple
>> routing (the foreign table receiving moved-in rows has already been
>> updated), there will be two fmstates to be released -- the original
>> fmstate (the UPDATE's) and aux_fmstate (the INSERT's)?
> 
> Yeah, but I noticed that that explanation was not correct.  (I think I was
> really in a hurry.)  See the correction in [1].

Ah, I hadn't noticed your corrected description in [1] even though your
message was in my inbox before I sent my email.

Thanks,
Amit





Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread Andres Freund



On April 18, 2019 7:53:58 PM PDT, Tom Lane  wrote:
>Amit Kapila  writes:
>> I am thinking that we should at least give it a try to move the map to
>> rel cache level to see how easy or difficult it is and also let's wait
>> for a day or two to see if Andres/Tom has to say anything about this
>> or on the response by me above to improve the current patch.
>
>FWIW, it's hard for me to see how moving the map to the relcache isn't
>the right thing to do.  You will lose state during a relcache flush,
>but that's still miles better than how often the state gets lost now.

+1
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: bug in update tuple routing with foreign partitions

2019-04-18 Thread Etsuro Fujita

(2019/04/19 13:17), Amit Langote wrote:

OTOH, we don't mention at all in postgres-fdw.sgml
that postgres_fdw supports tuple routing.  Maybe we should, and list this
limitation there -- or would it be too much of a burden to maintain?


I think it's better to add this limitation to postgres-fdw.sgml.  Will 
do.  Thanks for pointing that out!


Best regards,
Etsuro Fujita





Re: bug in update tuple routing with foreign partitions

2019-04-18 Thread Etsuro Fujita

(2019/04/19 13:00), Amit Langote wrote:

On 2019/04/18 22:10, Etsuro Fujita wrote:

On 2019/04/18 14:06, Amit Langote wrote:

On 2019/04/17 21:49, Etsuro Fujita wrote:

So what I'm thinking is to throw an error for cases like this.  (Though, I
think we should keep allowing rows to be moved from local partitions to a
foreign-table subplan targetrel that has been updated already.)


Hmm, how would you distinguish (totally inside postgres_fdw I presume) the
two cases?


One thing I came up with to do that is this:

@@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate,

     initStringInfo(&sql);

+   /*
+    * If the foreign table is an UPDATE subplan resultrel that hasn't yet
+    * been updated, routing tuples to the table might yield incorrect
+    * results, because if routing tuples, routed tuples will be mistakenly
+    * read from the table and updated twice when updating the table --- it
+    * would be nice if we could handle this case; but for now, throw an
+    * error for safety.
+    */
+   if (plan && plan->operation == CMD_UPDATE &&
+       (resultRelInfo->ri_usesFdwDirectModify ||
+        resultRelInfo->ri_FdwState) &&
+       resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
+       ereport(ERROR,
+               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                errmsg("cannot route tuples into foreign table to be updated \"%s\"",
+                       RelationGetRelationName(rel))));


Oh, I see.

So IIUC, you're making postgresBeginForeignInsert() check two things:

1. Whether the foreign table it's about to begin inserting (moved) rows
into is a subplan result rel, checked by
(resultRelInfo->ri_usesFdwDirectModify || resultRelInfo->ri_FdwState)

2. Whether the foreign table it's about to begin inserting (moved) rows
into will be updated later, checked by (resultRelInfo >
mtstate->resultRelInfo + mtstate->mt_whichplan)

This still allows a foreign table to receive rows moved from the local
partitions if it has already finished the UPDATE operation.

Seems reasonable to me.


Great!


Forgot to say that since this is a separate issue from the original bug
report, maybe we can first finish fixing the latter.  What do you think?


Yeah, I think we can do that, but my preference would be to fix the two
issues in a single shot, because they seem to me rather close to each
other.  So I've made a new version as a single patch, which I'm attaching.


I agree that we can move to fix the other issue right away as the fix you
outlined above seems reasonable, but I wonder if it wouldn't be better to
commit the two fixes separately?  The two fixes, although small, are
somewhat complicated and combining them in a single commit might be
confusing.  Also, a combined commit might make it harder for the release
note author to list the exact set of problems being fixed.


OK, I'll separate the patch into two.


Notes:

* I kept all the changes in the previous patch, because otherwise
postgres_fdw would fail to release resources for a foreign-insert
operation created by postgresBeginForeignInsert() for a tuple-routable
foreign table (ie, a foreign-table subplan resultrel that has been updated
already) during postgresEndForeignInsert().


Hmm, are you saying that for the cases where we'll still allow tuple
routing (the foreign table receiving moved-in rows has already been
updated), there will be two fmstates to be released -- the original
fmstate (the UPDATE's) and aux_fmstate (the INSERT's)?


Yeah, but I noticed that that explanation was not correct.  (I think I 
was really in a hurry.)  See the correction in [1].



* I revised a comment according to your previous comment, though I changed
a state name: s/sub_fmstate/aux_fmstate/


That seems fine to me.


Cool!


Some mostly cosmetic comments on the code changes:

* In the following comment:

+/*
+ * If the foreign table is an UPDATE subplan resultrel that hasn't yet
+ * been updated, routing tuples to the table might yield incorrect
+ * results, because if routing tuples, routed tuples will be mistakenly
+ * read from the table and updated twice when updating the table --- it
+ * would be nice if we could handle this case; but for now, throw an
error
+ * for safety.
+ */

the part that starts with "because if routing tuples..." reads a bit
unclear to me. How about writing this as:

 /*
  * If the foreign table we are about to insert routed rows into is
  * also an UPDATE result rel and the UPDATE hasn't been performed yet,
  * proceeding with the INSERT will result in the later UPDATE
  * incorrectly modifying those routed rows, so prevent the INSERT ---
  * it would be nice if we could handle this case, but for now, throw
  * an error for safety.
  */


I think that's better than mine; will use that wording.


I see that in all the hunks involving some manipulation of 

Re: Do CustomScan as not projection capable node

2019-04-18 Thread Tom Lane
Andrey Lepikhov  writes:
> Can we include the CustomScan node in the list of nodes that do not 
> support projection?

That seems like a pretty bad idea.  Maybe it's a good thing for whatever
unspecified extension you have in mind right now, but it's likely to be
a net negative for many more.  As an example, if some custom extension has
a better way to calculate some output expression than the core code does,
a restriction like this would prevent that from being implemented.

> Reason is that custom node can contain quite arbitrary logic that does 
> not guarantee projection support.

I don't buy this for a minute.  Where do you think projection is
going to happen?  There isn't any existing node type that *couldn't*
support projection if we insisted that that be done across-the-board.
I think it's mostly just a legacy thing that some don't.

regards, tom lane




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-18 Thread Asim R P
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo  wrote:
>
> create db with tablespace
> drop database
> drop tablespace.

Essentially, that sequence of operations causes crash recovery to fail
if the "drop tablespace" transaction was committed before crashing.
This is a bug in crash recovery in general and should be reproducible
without configuring a standby.  Is that right?

Your patch creates missing directories in the destination.  Don't we
need to create the tablespace symlink under pg_tblspc/?  I would
prefer extending the invalid page mechanism to deal with this, as
suggested by Ashwin off-list.  It will allow us to avoid creating
directories and files only to remove them shortly afterwards when the
drop database and drop tablespace records are replayed.

Asim




Do CustomScan as not projection capable node

2019-04-18 Thread Andrey Lepikhov
Can we include the CustomScan node in the list of nodes that do not 
support projection?
The reason is that a custom node can contain quite arbitrary logic that does 
not guarantee projection support.
Secondly, if the planner does not need a separate Result node, it just 
assigns the tlist to the subplan (i.e. changes the targetlist of the custom 
node) and does not change custom_scan_tlist.
Perhaps I do not fully understand the logic of using the 
custom_scan_tlist field, but if our custom node does not build its own 
custom_scan_tlist in the PlanCustomPath() routine (maybe it would use 
tlist as the base for custom_scan_tlist), we get errors in the 
set_customscan_references() call.


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From 938904d179e0a4e31cbb20fb70243d2b980d8dc2 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Fri, 19 Apr 2019 09:34:09 +0500
Subject: [PATCH] Add CustomScan node in the list of nodes that do not support
 projection

---
 src/backend/optimizer/plan/createplan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index efe073a3ee..682d4d4429 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -6691,6 +6691,7 @@ is_projection_capable_path(Path *path)
 		case T_ModifyTable:
 		case T_MergeAppend:
 		case T_RecursiveUnion:
+		case T_CustomScan:
 			return false;
 		case T_Append:
 
-- 
2.17.1



Re: bug in update tuple routing with foreign partitions

2019-04-18 Thread Amit Langote
On 2019/04/19 13:00, Amit Langote wrote:
> BTW, do you think we should update the docs regarding the newly
> established limitation of row movement involving foreign tables?

Ah, maybe not, because it's a postgres_fdw limitation, not of the core
tuple routing feature.  OTOH, we don't mention at all in postgres-fdw.sgml
that postgres_fdw supports tuple routing.  Maybe we should, and list this
limitation there -- or would it be too much of a burden to maintain?

Thanks,
Amit





Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
>>> Maybe what we should be looking for is "why doesn't the walreceiver
>>> shut down"?  But the dragonet log you quote above shows the walreceiver
>>> exiting, or at least starting to exit.  Tis a puzzlement.

huh ... take a look at this little stanza in PostmasterStateMachine:

if (pmState == PM_SHUTDOWN_2)
{
/*
 * PM_SHUTDOWN_2 state ends when there's no other children than
 * dead_end children left. There shouldn't be any regular backends
 * left by now anyway; what we're really waiting for is walsenders and
 * archiver.
 *
 * Walreceiver should normally be dead by now, but not when a fast
 * shutdown is performed during recovery.
 */
if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
WalReceiverPID == 0)
{
pmState = PM_WAIT_DEAD_END;
}
}

I'm too tired to think through exactly what that last comment might be
suggesting, but it sure seems like it might be relevant to our problem.
If the walreceiver *isn't* dead yet, what's going to ensure that we
can move forward later?

regards, tom lane




Re: bug in update tuple routing with foreign partitions

2019-04-18 Thread Amit Langote
Fujita-san,

On 2019/04/18 22:10, Etsuro Fujita wrote:
>> On 2019/04/18 14:06, Amit Langote wrote:
>>> On 2019/04/17 21:49, Etsuro Fujita wrote:
>>>> So what I'm thinking is to throw an error for cases like this.  (Though, I
>>>> think we should keep allowing rows to be moved from local partitions to a
>>>> foreign-table subplan targetrel that has been updated already.)
>>>
>>> Hmm, how would you distinguish (totally inside postgres_fdw I presume) the
>>> two cases?
> 
> One thing I came up with to do that is this:
> 
> @@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate,
> 
>     initStringInfo(&sql);
> 
> +   /*
> +    * If the foreign table is an UPDATE subplan resultrel that hasn't yet
> +    * been updated, routing tuples to the table might yield incorrect
> +    * results, because if routing tuples, routed tuples will be mistakenly
> +    * read from the table and updated twice when updating the table --- it
> +    * would be nice if we could handle this case; but for now, throw an
> +    * error for safety.
> +    */
> +   if (plan && plan->operation == CMD_UPDATE &&
> +       (resultRelInfo->ri_usesFdwDirectModify ||
> +        resultRelInfo->ri_FdwState) &&
> +       resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                errmsg("cannot route tuples into foreign table to be updated \"%s\"",
> +                       RelationGetRelationName(rel))));

Oh, I see.

So IIUC, you're making postgresBeginForeignInsert() check two things:

1. Whether the foreign table it's about to begin inserting (moved) rows
into is a subplan result rel, checked by
(resultRelInfo->ri_usesFdwDirectModify || resultRelInfo->ri_FdwState)

2. Whether the foreign table it's about to begin inserting (moved) rows
into will be updated later, checked by (resultRelInfo >
mtstate->resultRelInfo + mtstate->mt_whichplan)

This still allows a foreign table to receive rows moved from the local
partitions if it has already finished the UPDATE operation.

Seems reasonable to me.

>> Forgot to say that since this is a separate issue from the original bug
>> report, maybe we can first finish fixing the latter.  What do you think?
> 
> Yeah, I think we can do that, but my preference would be to fix the two
> issues in a single shot, because they seem to me rather close to each
> other.  So I've made a new version as a single patch, which I'm attaching.

I agree that we can move to fix the other issue right away as the fix you
outlined above seems reasonable, but I wonder if it wouldn't be better to
commit the two fixes separately?  The two fixes, although small, are
somewhat complicated and combining them in a single commit might be
confusing.  Also, a combined commit might make it harder for the release
> note author to list the exact set of problems being fixed.  But I
guess your commit message will make it clear that two distinct problems
are being solved, so maybe that shouldn't be a problem.

> Notes:
> 
> * I kept all the changes in the previous patch, because otherwise
> postgres_fdw would fail to release resources for a foreign-insert
> operation created by postgresBeginForeignInsert() for a tuple-routable
> foreign table (ie, a foreign-table subplan resultrel that has been updated
> already) during postgresEndForeignInsert().

Hmm, are you saying that for the cases where we'll still allow tuple
routing (the foreign table receiving moved-in rows has already been
updated), there will be two fmstates to be released -- the original
fmstate (the UPDATE's) and aux_fmstate (the INSERT's)?

> * I revised a comment according to your previous comment, though I changed
> a state name: s/sub_fmstate/aux_fmstate/

That seems fine to me.

> * DirectModify also has the latter issue.  Here is an example:
> 
> postgres=# create table p (a int, b text) partition by list (a);
> postgres=# create table p1 partition of p for values in (1);
> postgres=# create table p2base (a int check (a = 2 or a = 3), b text);
> postgres=# create foreign table p2 partition of p for values in (2, 3)
> server loopback options (table_name 'p2base');
> 
> postgres=# insert into p values (1, 'foo');
> INSERT 0 1
> postgres=# explain verbose update p set a = a + 1;
>  QUERY PLAN
> ----------------------------------------------------------------------------
>  Update on public.p  (cost=0.00..176.21 rows=2511 width=42)
>    Update on public.p1
>    Foreign Update on public.p2
>    ->  Seq Scan on public.p1  (cost=0.00..25.88 rows=1270 width=42)
>  Output: (p1.a + 1), p1.b, p1.ctid
>    ->  Foreign Update on public.p2  (cost=100.00..150.33 rows=1241 width=42)
>  Remote SQL: UPDATE public.p2base SET a = (a + 1)
> (7 rows)
> 
> postgres=# update p set a = a + 1;
> UPDATE 2
> postgres=# select * from p;
>  a |  b
> 

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Apr 19, 2019 at 10:22 AM Tom Lane  wrote:
>> Maybe what we should be looking for is "why doesn't the walreceiver
>> shut down"?  But the dragonet log you quote above shows the walreceiver
>> exiting, or at least starting to exit.  Tis a puzzlement.

> ... Is there some way that the exit code could hang
> *after* that due to corruption of libc resources (FILE streams,
> malloc, ...)?  It doesn't seem likely to me (we'd hopefully see some
> more clues) but I thought I'd mention the idea.

I agree it's not likely ... but that's part of the reason I was thinking
about adding some postmaster logging.  Whatever we're chasing here is
"not likely", per the observed buildfarm failure rate.

regards, tom lane




Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Thomas Munro
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > 2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
> > process due to administrator command

> Maybe what we should be looking for is "why doesn't the walreceiver
> shut down"?  But the dragonet log you quote above shows the walreceiver
> exiting, or at least starting to exit.  Tis a puzzlement.

One thing I noticed about this message: if you receive SIGTERM at a
rare time when WalRcvImmediateInterruptOK is true, then that ereport()
runs directly in the signal handler context.  That's not strictly
allowed, and could cause nasal demons.  On the other hand, it probably
wouldn't have managed to get the FATAL message out if that was the
problem here (previously we've seen reports of signal handlers
deadlocking while trying to ereport() but they couldn't get their
message out at all, because malloc or some such was already locked in
the user context).  Is there some way that the exit code could hang
*after* that due to corruption of libc resources (FILE streams,
malloc, ...)?  It doesn't seem likely to me (we'd hopefully see some
more clues) but I thought I'd mention the idea.

-- 
Thomas Munro
https://enterprisedb.com




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-04-18 Thread Paul Guo
Hi Michael,

I updated the patches as attached following previous discussions.

The two patches:
v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
v2-0002-Add-option-to-write-recovery-configuration-inform.patch

1. 0001 moves those common functions & variables to two new files (one
.c and one .h) for both pg_rewind and pg_basebackup to use.  Note that the
functions are slightly modified (e.g. because conn is probably NULL in
pg_rewind).  I do not know where it is more proper to put the new files;
currently, they are under pg_basebackup and are used by pg_rewind (the
Makefile is modified to support that).

2. 0002 adds the option to write recovery conf.

The patch below runs a single-user-mode Postgres if needed to make sure the
target is cleanly shut down.  A new option is added (off by default).
v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch

I've manually tested them and installcheck passes.

Thanks.

On Wed, Mar 20, 2019 at 1:23 PM Paul Guo  wrote:

>
>
> On Wed, Mar 20, 2019 at 1:20 PM Michael Paquier 
> wrote:
>
>> On Wed, Mar 20, 2019 at 12:48:52PM +0800, Paul Guo wrote:
>> > This is a good suggestion also. Will do it.
>>
>> Please note also that we don't care about recovery.conf since v12 as
>> recovery parameters are now GUCs.  I would suggest appending those
>> extra parameters to postgresql.auto.conf, which is what pg_basebackup
>> does.
>>
> Yes, the recovery conf patch in the first email already did it like this, i.e.
> writing postgresql.auto.conf & standby.signal
>
>


v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch
Description: Binary data


v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
Description: Binary data


v2-0002-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data


Re: Pathological performance when inserting many NULLs into a unique index

2019-04-18 Thread Peter Geoghegan
On Thu, Apr 18, 2019 at 5:46 PM Michael Paquier  wrote:
> Adding an open item is appropriate; nice that you found this issue.  Testing on
> my laptop with v11, the non-NULL case takes 5s and the NULL case 6s.
> On HEAD, the non-NULL case takes 6s, and the NULL case takes...  I
> just gave up and cancelled it :)

This was one of those things that occurred to me out of the blue.

It just occurred to me that the final patch will need to be more
careful about non-key attributes in INCLUDE indexes. It's not okay for
it to avoid calling _bt_check_unique() just because a non-key
attribute was NULL. It should only do that when a key attribute is
NULL.

-- 
Peter Geoghegan




Re: bug in update tuple routing with foreign partitions

2019-04-18 Thread Etsuro Fujita

(2019/04/18 22:10), Etsuro Fujita wrote:

Notes:

* I kept all the changes in the previous patch, because otherwise
postgres_fdw would fail to release resources for a foreign-insert
operation created by postgresBeginForeignInsert() for a tuple-routable
foreign table (ie, a foreign-table subplan resultrel that has been
updated already) during postgresEndForeignInsert().


I noticed that this explanation was not right.  Let me correct myself. 
The reason why I kept those changes is: without them, we would fail to 
release the resources for a foreign-update operation (ie, its fmstate), 
which is created by postgresBeginForeignModify() and then replaced by the 
fmstate for the foreign-insert operation.  When doing ExecEndPlan(), we 
first call postgresEndForeignModify() and then call 
postgresEndForeignInsert(); so, if we didn't keep those changes, we would 
*mistakenly* release the fmstate for the foreign-insert operation in 
postgresEndForeignModify(), and we wouldn't do anything about the fmstate 
for the foreign-update operation either in that function or in the 
subsequent postgresEndForeignInsert().
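
To illustrate (a rough sketch only, using the aux_fmstate name from the
patch under discussion and postgres_fdw's PgFdwModifyState type; this is
not the actual code), the bookkeeping has to keep ri_FdwState pointing at
the foreign-update state so that each End callback releases its own state:

    /* postgresBeginForeignInsert(), when the rel is also an UPDATE resultrel */
    if (resultRelInfo->ri_FdwState)
    {
        PgFdwModifyState *existing = resultRelInfo->ri_FdwState;

        /* hang the foreign-insert state off the foreign-update state */
        existing->aux_fmstate = fmstate;
    }
    else
        resultRelInfo->ri_FdwState = fmstate;

    /*
     * postgresEndForeignModify() then releases the foreign-update state as
     * usual, while postgresEndForeignInsert() releases aux_fmstate if set.
     */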


Best regards,
Etsuro Fujita





Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro  writes:
> Interesting, but I'm not sure how that could be though.  Perhaps, a
> bit like the other thing that cropped up in the build farm after that
> commit, removing ~200ms of needless sleeping around an earlier online
> CHECKPOINT made some other pre-existing race condition more likely to
> go wrong.

The data that we've got is entirely consistent with the idea that
there's a timing-sensitive bug that gets made more or less likely
to trigger by "unrelated" changes in test cases or server code.

regards, tom lane




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread Tom Lane
Amit Kapila  writes:
> I am thinking that we should at least give it a try to move the map to
> rel cache level to see how easy or difficult it is and also let's wait
> for a day or two to see if Andres/Tom has to say anything about this
> or on the response by me above to improve the current patch.

FWIW, it's hard for me to see how moving the map to the relcache isn't
the right thing to do.  You will lose state during a relcache flush,
but that's still miles better than how often the state gets lost now.

regards, tom lane




Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
>> Anyway, this is *not* new in v12.

> Indeed.  It seems to me that v12 makes the problem more likely to appear,
> though, and I've got to wonder if c6c9474 is contributing to that, as more
> cases have been popping up since mid-March.

Yeah.  Whether that's due to a server code change, or new or modified
test cases, is unknown.  But looking at my summary of buildfarm runs
that failed like this, there's a really clear breakpoint at

 gull  | 2018-08-24 03:27:16 | recoveryCheck   | pg_ctl: server does not shut down

Since that, most of the failures with this message have been in the
recoveryCheck step.  Before that, the failures were all over the
place, and now that I look closely a big fraction of them were
in bursts on particular animals, suggesting it was more about
some local problem on that animal than any real code issue.

So it might be worth groveling around in the commit logs from
last August...

regards, tom lane




Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Thomas Munro
On Fri, Apr 19, 2019 at 2:30 PM Michael Paquier  wrote:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> > Anyway, this is *not* new in v12.
>
> Indeed.  It seems to me that v12 makes the problem more likely to appear,
> though, and I've got to wonder if c6c9474 is contributing to that, as more
> cases have been popping up since mid-March.

Interesting, but I'm not sure how that could be though.  Perhaps, a
bit like the other thing that cropped up in the build farm after that
commit, removing ~200ms of needless sleeping around an earlier online
CHECKPOINT made some other pre-existing race condition more likely to
go wrong.

-- 
Thomas Munro
https://enterprisedb.com




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread Amit Kapila
On Thu, Apr 18, 2019 at 2:10 PM John Naylor  wrote:
>
> On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila  wrote:
> > I respect and will follow whatever will be the consensus after
> > discussion.  However, I request you to wait for some time to let the
> > discussion conclude.  If we can't get to an
> > agreement or one of John or me can't implement what is decided, then
> > we can anyway revert it.
>
> Agreed. I suspect the most realistic way to address most of the
> objections in a short amount of time would be to:
>
> 1. rip out the local map
> 2. restore hio.c to only checking the last block in the relation if
> there is no FSM (and lower the threshold to reduce wasted space)
> 3. reduce calls to smgr_exists()
>

Won't you need an extra call to RelationGetNumberOfBlocks to find the
last block?  Also, won't it be less efficient in terms of dealing with
bloat as compared to the current patch?  I think if we go this route, then
we might need to revisit it in the future to optimize it, but maybe
that is the best alternative as of now.
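
For concreteness, a rough sketch of what that extra call would look like in
the no-FSM path (illustrative only, not the actual hio.c code):

    /* With no FSM, fall back to trying just the relation's last block. */
    BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

    if (nblocks > 0)
        targetBlock = nblocks - 1;
    else
        targetBlock = InvalidBlockNumber;   /* empty relation: extend it */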

I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Michael Paquier
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> It's the latter.  I searched the buildfarm database for failure logs
> including the string "server does not shut down" within the last three
> years, and got all of the hits attached.  Not all of these look like
> the failure pattern Michael pointed to, but enough of them do to say
> that the problem has existed since at least mid-2017.  To be concrete,
> we have quite a sample of cases where a standby server has received a
> "fast shutdown" signal and acknowledged that in its log, but it never
> gets to the expected "shutting down" message, meaning it never starts
> the shutdown checkpoint let alone finishes it.  The oldest case that
> clearly looks like that is
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

Interesting.  I was sort of thinking about c6c3334 first but this
failed based on 9fcf670, which does not include the former.

> This leads me to suspect that the problem is (a) some very low-level issue
> in spinlocks or or latches or the like, or (b) a timing problem that just
> doesn't show up on generic Intel-oid platforms.  The timing theory is
> maybe a bit stronger given that one test case shows this more often than
> others.  I've not got any clear ideas beyond that.
> 
> Anyway, this is *not* new in v12.

Indeed.  It seems to me that v12 makes the problem more likely to appear,
though, and I've got to wonder if c6c9474 is contributing to that, as more
cases have been popping up since mid-March.
--
Michael


signature.asc
Description: PGP signature


Re: ExecForceStoreMinimalTuple leaks memory like there's no tomorrow

2019-04-18 Thread Andres Freund
Hi,

On 2019-04-15 22:46:56 -0400, Tom Lane wrote:
> Using HEAD,
> 
> create table t1 as select generate_series(1,4000) id;
> vacuum analyze t1;
> explain select * from t1, t1 t1b where t1.id = t1b.id;
> -- should indicate a hash join
> explain analyze select * from t1, t1 t1b where t1.id = t1b.id;
> 
> ... watch the process's memory consumption bloat.  (It runs for
> awhile before that starts to happen, but eventually it goes to
> a couple of GB.)
> 
> It looks to me like the problem is that ExecHashJoinGetSavedTuple
> calls ExecForceStoreMinimalTuple with shouldFree = true, and
> ExecForceStoreMinimalTuple's second code branch simply ignores
> the requirement to free the supplied tuple.

Thanks for finding. The fix is obviously easy - but looking through the
code I think I found another similar issue. I'll fix both in one go
tomorrow.
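
For anyone following along, the shape of the fix is roughly this (a sketch
only, with the surrounding code elided):

    else
    {
        /* ... existing code copies the minimal tuple into the slot ... */

        if (shouldFree)
            pfree(mtup);    /* previously this copy was never freed */
    }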

Greetings,

Andres Freund




Re: "make installcheck" fails in src/test/recovery

2019-04-18 Thread Tom Lane
Michael Paquier  writes:
> I am wondering if it would be better to just install automatically all
> the paths listed in EXTRA_INSTALL when invoking installcheck.

Absolutely not!  In the first place, "make installcheck" is supposed to
test the installed tree, not editorialize upon what's in it; and in the
second place, you wouldn't necessarily have permissions to change that
tree.

If we think that depending on pageinspect is worthwhile here, the right
thing to do is just to fix the README to say that you need it.  I'm
merely asking whether it's really worth the extra dependency.

regards, tom lane




Re: Pathological performance when inserting many NULLs into a unique index

2019-04-18 Thread Michael Paquier
On Wed, Apr 17, 2019 at 07:37:17PM -0700, Peter Geoghegan wrote:
> I'll create an open item for this, and begin work on a patch tomorrow.

Adding an open item is appropriate; nice that you found this issue.  Testing on
my laptop with v11, the non-NULL case takes 5s and the NULL case 6s.
On HEAD, the non-NULL case takes 6s, and the NULL case takes...  I
just gave up and cancelled it :)
--
Michael


signature.asc
Description: PGP signature


Re: finding changed blocks using WAL scanning

2019-04-18 Thread Michael Paquier
On Thu, Apr 18, 2019 at 03:51:14PM -0400, Robert Haas wrote:
> I was thinking that a dedicated background worker would be a good
> option, but Stephen Frost seems concerned (over on the other thread)
> about how much load that would generate.  That never really occurred
> to me as a serious issue and I suspect for many people it wouldn't be,
> but there might be some.

WAL segment size can go up to 1GB, and this is no longer fixed at compile
time.  So scanning a very large segment is not going to be free.  I think
that Stephen's performance concerns are legitimate, as the WAL partitions
currently see sequential read and write patterns.

> It's cool that you have a command-line tool that does this as well.
> Over there, it was also discussed that we might want to have both a
> command-line tool and a background worker.  I think, though, that we
> would want to get the output in some kind of compressed binary format,
> rather than text.  e.g.

If you want to tweak it the way you want, please feel free to reuse
it for any patch submitted to upstream.  Reshaping or redirecting the
data is not a big issue once the basics with the WAL reader are in
place.
--
Michael


signature.asc
Description: PGP signature


Re: Comments for lossy ORDER BY are lacking

2019-04-18 Thread Andres Freund
Hi,

On 2019-04-18 17:30:20 -0700, Andres Freund wrote:
> For not the first time I was trying to remember why and when the whole
> nodeIndexscan.c:IndexNextWithReorder() business is needed. The comment
> about reordering
> 
>  *IndexNextWithReorder
>  *
>  *Like IndexNext, but this version can also re-check ORDER BY
>  *expressions, and reorder the tuples as necessary.
> 
> or
> +   /* Initialize sort support, if we need to re-check ORDER BY exprs */
> 
> or
> 
> +   /*
> +* If there are ORDER BY expressions, look up the sort operators for
> +* their datatypes.
> +*/

Secondary point: has anybody actually checked whether the extra
reordering infrastructure is a measurable overhead? It's obviously fine
for index scans that need reordering (i.e. lossy ones), but currently
it's at least initialized for distance based order bys.  I guess that's
largely because currently opclasses don't signal the fact that they
might return lossy amcanorderby results, but that seems like it could
have been fixed back then?

Greetings,

Andres Freund




Re: Pathological performance when inserting many NULLs into a unique index

2019-04-18 Thread Peter Geoghegan
On Wed, Apr 17, 2019 at 7:37 PM Peter Geoghegan  wrote:
> Tentatively, I think that the fix here is to not "itup_key->scantid =
> NULL" when a NULL value is involved (i.e. don't "itup_key->scantid =
> NULL" when we see IndexTupleHasNulls(itup) within _bt_doinsert()). We
> may also want to avoid calling _bt_check_unique() entirely in this
> situation.

> I'll create an open item for this, and begin work on a patch tomorrow.

I came up with the attached patch, which effectively treats a unique
index insertion as if the index was not unique at all in the case
where the new tuple is IndexTupleHasNulls() within _bt_doinsert().  This is
correct because inserting a new tuple with a NULL value can neither
contribute to somebody else's duplicate violation, nor have a
duplicate violation error of its own. I need to test this some more,
though I am fairly confident that I have the right idea.

We didn't have this problem before my v12 work due to the "getting
tired" strategy, but that's not the full story. We also didn't (and
don't) move right within _bt_check_unique() when
IndexTupleHasNulls(itup), because _bt_isequal() treats NULL values as
not equal to any value, including even NULL -- this meant that there
was no useless search to the right within _bt_check_unique().
Actually, the overall effect of how _bt_isequal() is used is that
_bt_check_unique() does *no* useful work at all with a NULL scankey.
It's far simpler to remove responsibility for new tuples/scankeys with
NULL values from _bt_check_unique() entirely, which is what I propose
to do with this patch.
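
In code terms the proposal boils down to something like this in
_bt_doinsert() (a sketch only, assuming the existing checkingunique flag;
not the final patch):

    /*
     * A new tuple with a NULL key attribute can neither cause nor suffer a
     * unique violation, so skip the uniqueness machinery entirely rather
     * than unsetting scantid and searching the range of "equal" tuples.
     */
    checkingunique = (checkUnique != UNIQUE_CHECK_NO &&
                      !IndexTupleHasNulls(itup));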

The patch actually ends up removing quite a lot more code than it
adds, because it removes _bt_isequal() outright. I always thought that
nbtinsert.c dealt with NULL values in unique indexes at the wrong
level, and I'm glad to see _bt_isequal() go. Vadim's accompanying 1997
comment about "not using _bt_compare in real comparison" seems
confused to me. The new _bt_check_unique() may still need to compare
the scankey to some existing, adjacent tuple with a NULL value, but
_bt_compare() is perfectly capable of recognizing that that adjacent
tuple shouldn't be considered equal. IOW, _bt_compare() is merely
incapable of behaving as if "NULL != NULL", which is a bad reason for
keeping _bt_isequal() around.

--
Peter Geoghegan


0001-Prevent-O-N-2-unique-index-insertion-edge-case.patch
Description: Binary data


Comments for lossy ORDER BY are lacking

2019-04-18 Thread Andres Freund
Hi,

For not the first time I was trying to remember why and when the whole
nodeIndexscan.c:IndexNextWithReorder() business is needed. The comment
about reordering

 *  IndexNextWithReorder
 *
 *  Like IndexNext, but this version can also re-check ORDER BY
 *  expressions, and reorder the tuples as necessary.

or
+   /* Initialize sort support, if we need to re-check ORDER BY exprs */

or

+   /*
+* If there are ORDER BY expressions, look up the sort operators for
+* their datatypes.
+*/


nor any other easy-to-spot ones really explain that.  It's not even
obvious that this isn't talking about an ordering by a column
("expression" could maybe be taken as a hint, but that's fairly thin).

By reading enough code one can stitch together that this is really only
needed for KNN-like ORDER BYs with lossy distance functions.  It'd be
good if one had to dig less to learn that.
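
For example, a header comment along these lines (just a sketch of possible
wording) would have saved the digging:

    /*
     *  IndexNextWithReorder
     *
     *      Like IndexNext, but used when the index's ORDER BY "distance"
     *      values may be merely lower bounds (e.g. lossy GiST KNN
     *      searches): returned tuples are re-checked against the original
     *      ORDER BY expressions and re-sorted as necessary.  This has
     *      nothing to do with ordering by plain columns.
     */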


that logic was (originally) added in:

commit 35fcb1b3d038a501f3f4c87c05630095abaaadab
Author: Heikki Linnakangas 
Date:   2015-05-15 14:26:51 +0300

Allow GiST distance function to return merely a lower-bound.


but I think some of the documentation & naming for related
datastructures was a bit hard to grasp before then too - it's e.g. IMO
certainly not obvious that IndexPath.indexorderbys isn't about plain
ORDER BYs.

Greetings,

Andres Freund




Re: "make installcheck" fails in src/test/recovery

2019-04-18 Thread Michael Paquier
On Thu, Apr 18, 2019 at 01:45:45AM -0400, Tom Lane wrote:
> Is this extra dependency actually essential?  I'm not really
> happy about increasing the number of moving parts in this test.

Hmmm.  I don't actually object to removing the part depending on
pageinspect in the tests.  Relying on the on-disk page format has
proved to be more reliable for the buildfarm than I initially
thought, and we are actually able to keep the same coverage without
the dependency on pageinspect.

Now, I think that this is a problem not only for
src/test/recovery/ but for any path using EXTRA_INSTALL.  For example,
if you take contrib/ltree_plpython/, issue "make install" from
this path and then run installcheck, the tests complain about
ltree missing from the installation.  For the recovery tests, we
already require test_decoding, so I would expect the problem to get
worse with time, as we should not restrict the dependencies on
other modules if they make sense for some TAP tests.

I am wondering if it would be better to just install automatically all
the paths listed in EXTRA_INSTALL when invoking installcheck.  We
enforce the target in src/test/recovery/Makefile; still, we could use
this opportunity to mark it with TAP_TESTS=1.
--
Michael


signature.asc
Description: PGP signature


Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2019-04-18 Thread Michael Paquier
On Thu, Apr 18, 2019 at 03:16:17PM -0400, Robert Haas wrote:
> On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier  wrote:
>> I think that we have race conditions with checkpointing and shutdown
>> on HEAD.
> 
> Any idea whether it's something newly-introduced or of long standing?

I would suggest to keep the discussion on the more general thread I
have created a couple of days ago:
https://www.postgresql.org/message-id/20190416070119.gk2...@paquier.xyz

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: block-level incremental backup

2019-04-18 Thread Stephen Frost
Greetings,

Ok, responding to the rest of this email.

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost  wrote:
> > Sadly, I haven't got any great ideas today.  I do know that the WAL-G
> > folks have specifically mentioned issues with the visibility map being
> > large enough across enough of their systems that it kinda sucks to deal
> > with.  Perhaps we could do something like the rsync binary-diff protocol
> > for non-relation files?  This is clearly just hand-waving but maybe
> > there's something reasonable in that idea.
> 
> I guess it all comes down to how complicated you're willing to make
> the client-server protocol.  With the very simple protocol that I
> proposed -- client provides a threshold LSN and server sends blocks
> modified since then -- the client need not have access to the old
> incremental backup to take a new one.

Where is the client going to get the threshold LSN from?

> Of course, if it happens to
> have access to the old backup then it can delta-compress however it
> likes after-the-fact, but that doesn't help with the amount of network
> transfer.

If it doesn't have access to the old backup, then I'm a bit confused as
to how an incremental backup would be possible?  Isn't that a requirement
here?

> That problem could be solved by doing something like what
> you're talking about (with some probably-negligible false match rate)
> but I have no intention of trying to implement anything that
> complicated, and I don't really think it's necessary, at least not for
> a first version.  What I proposed would already allow, for most users,
> a large reduction in transfer and storage costs; what you are talking
> about here would help more, but also be a lot more work and impose
> some additional requirements on the system.  I don't object to you
> implementing the more complex system, but I'll pass.

I was talking about the rsync binary-diff specifically for the files
that aren't easy to deal with in the WAL stream.  I wouldn't think we'd
use it for other files, and there is definitely a question of whether
there's a way to do better than a binary-diff approach for those files.

> > There's something like 6 different backup tools, at least, for
> > PostgreSQL that provide backup management, so I have a really hard time
> > agreeing with this idea that users don't want a PG backup management
> > system.  Maybe that's not what you're suggesting here, but that's what
> > came across to me.
> 
> Let me be a little more clear.  Different users want different things.
> Some people want a canned PostgreSQL backup solution, while other
> people just want access to a reasonable set of facilities from which
> they can construct their own solution.  I believe that the proposal I
> am making here could be used either by backup tool authors to enhance
> their offerings, or by individuals who want to build up their own
> solution using facilities provided by core.

The last thing that I think users really want is to build up their own
solution.  There may be some organizations who would like to provide
their own tool, but that's a bit different.  Personally, I'd *really*
like PG to have a good tool in this area and I've been working, as I've
said before, to try to get to a point where we at least have the option
to add in such a tool that meets our various requirements.

Further, I'm concerned that the approach being presented here won't be
interesting to most of the external tools because it's limited and can't
be used in a parallel fashion.

> > Unless maybe I'm misunderstanding and what you're suggesting here is
> > that the "existing solution" is something like the external PG-specific
> > backup tools?  But then the rest doesn't seem to make sense, as only
> > maybe one or two of those tools use pg_basebackup internally.
> 
> Well, what I'm really talking about is in two pieces: providing some
> new facilities via the replication protocol, and making pg_basebackup
> able to use those facilities.  Nothing would stop other tools from
> using those facilities directly if they wish.

If those facilities are developed and implemented in the same way as the
protocol used by pg_basebackup works, then I strongly suspect that the
existing backup tools will treat it similairly- which is to say, they'll
largely end up ignoring it.

> > ... but this is exactly the situation we're in already with all of the
> > *other* features around backup (parallel backup, backup management, WAL
> > management, etc).  Users want those features, pg_basebackup/PG core
> > doesn't provide it, and therefore there's a bunch of other tools which
> > have been written that do.  In addition, saying that PG has incremental
> > backup but no built-in management of those full-vs-incremental backups
> > and telling users that they basically have to build that themselves
> > really feels a lot like we're trying to address a check-box requirement
> > rather than making something that 

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro  writes:
> This is a curious thing from dragonet's log:

> 2019-04-16 08:23:24.178 CEST [8335] LOG:  received fast shutdown request
> 2019-04-16 08:23:24.178 CEST [8335] LOG:  aborting any active transactions
> 2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
> process due to administrator command
> 2019-04-16 08:28:23.166 CEST [8337] LOG:  restartpoint starting: time

> LogCheckpointStart() is the thing that writes "starting: ...", and it
> prefers to report "shutdown" over "time", but it didn't.

Yeah, but since we don't see "shutting down", we know that the shutdown
checkpoint hasn't begun.  Here's another similar case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54

The relevant fragment of the standby server's log is

2018-11-30 05:09:22.996 PST [4229] LOG:  received fast shutdown request
2018-11-30 05:09:23.628 PST [4229] LOG:  aborting any active transactions
2018-11-30 05:09:23.649 PST [4231] LOG:  checkpoint complete: wrote 17 buffers 
(13.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=3.021 s, sync=0.000 
s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; 
distance=16563 kB, estimate=16563 kB
2018-11-30 05:09:23.679 PST [4229] LOG:  background worker "logical replication 
launcher" (PID 4276) exited with exit code 1
2018-11-30 05:11:23.757 PST [4288] master LOG:  unexpected EOF on standby 
connection
2018-11-30 05:11:23.883 PST [4229] LOG:  received immediate shutdown request
2018-11-30 05:11:23.907 PST [4229] LOG:  database system is shut down

To the extent that I've found logs in which the checkpointer prints
anything at all during this interval, it seems to be just quietly
plodding along with its usual business, without any hint that it's
aware of the pending shutdown request.  It'd be very easy to believe
that the postmaster -> checkpointer SIGUSR2 is simply getting dropped,
or never issued.

Hmm ... actually, looking at the postmaster's logic, it won't issue
SIGUSR2 to the checkpointer until the walreceiver (if any) is gone.
And now that I think about it, several of these logs contain traces
showing that the walreceiver is still live.  Like the one quoted above:
seems like the line from PID 4288 has to be from a walreceiver.

Maybe what we should be looking for is "why doesn't the walreceiver
shut down"?  But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit.  Tis a puzzlement.

I'm a bit tempted to add temporary debug logging to the postmaster so
that walreceiver start/stop is recorded at LOG level.  We'd have to wait
a few weeks to have any clear result from the buildfarm, but I'm not sure
how we'll get any hard data without some such measures.
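
Something as simple as this would probably do (a sketch only; where exactly
to put it in postmaster.c is to be decided):

    /* where the postmaster launches the walreceiver */
    ereport(LOG,
            (errmsg("walreceiver started, pid %d", (int) WalReceiverPID)));

    /* where the postmaster reaps a dead walreceiver */
    ereport(LOG,
            (errmsg("walreceiver exited, pid %d, exit status %d",
                    (int) pid, exitstatus)));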

regards, tom lane




Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Thomas Munro
On Fri, Apr 19, 2019 at 2:53 AM Tom Lane  wrote:
> Question is, what other theory has anybody got?

I wondered if there might be a way for PostmasterStateMachine() to be
reached without signals blocked, in the case where we fork a
fresh checkpointer, and it then misses the SIGUSR2 that we
immediately send because it hasn't installed its handler yet.  But I
can't see it.

This is a curious thing from dragonet's log:

2019-04-16 08:23:24.178 CEST [8335] LOG:  received fast shutdown request
2019-04-16 08:23:24.178 CEST [8335] LOG:  aborting any active transactions
2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
process due to administrator command
2019-04-16 08:28:23.166 CEST [8337] LOG:  restartpoint starting: time

LogCheckpointStart() is the thing that writes "starting: ...", and it
prefers to report "shutdown" over "time", but it didn't.

-- 
Thomas Munro
https://enterprisedb.com




Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Robert Haas  wrote (in the other thread):
> Any idea whether it's something newly-introduced or of long standing?

It's the latter.  I searched the buildfarm database for failure logs
including the string "server does not shut down" within the last three
years, and got all of the hits attached.  Not all of these look like
the failure pattern Michael pointed to, but enough of them do to say
that the problem has existed since at least mid-2017.  To be concrete,
we have quite a sample of cases where a standby server has received a
"fast shutdown" signal and acknowledged that in its log, but it never
gets to the expected "shutting down" message, meaning it never starts
the shutdown checkpoint let alone finishes it.  The oldest case that
clearly looks like that is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

A significant majority of the recent cases look just like the piculet
failure Michael pointed to, that is we fail to shut down the "london"
server while it's acting as standby in the recovery/t/009_twophase.pl
test.  But there are very similar failures in other tests.

I also notice that the population of machines showing the problem seems
heavily skewed towards, um, weird cases.  For instance, in the set
that have shown this type of failure since January, we have

dragonet: uses JIT
francolin: --disable-spinlocks
gull: armv7
mereswine: armv7
piculet: --disable-atomics
sidewinder: amd64, but running netbsd 7 (and this was 9.6, note)
spurfowl: fairly generic amd64

This leads me to suspect that the problem is (a) some very low-level issue
in spinlocks or or latches or the like, or (b) a timing problem that just
doesn't show up on generic Intel-oid platforms.  The timing theory is
maybe a bit stronger given that one test case shows this more often than
others.  I've not got any clear ideas beyond that.

Anyway, this is *not* new in v12.

regards, tom lane

  sysname  |      snapshot       |       stage       |                 l
-----------+---------------------+-------------------+------------------------------------
 jacana    | 2016-07-23 06:15:32 | pg_upgradeCheck   | pg_ctl: server does not shut down\r
 pademelon | 2016-08-14 03:49:36 | ECPG-Check        | pg_ctl: server does not shut down
 mereswine | 2017-02-13 14:24:37 | Check             | pg_ctl: server does not shut down
 arapaima  | 2017-03-04 20:06:10 | StopDb-C:4        | pg_ctl: server does not shut down
 nightjar  | 2017-06-02 18:54:29 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-02 19:54:11 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-03 15:54:05 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-03 17:54:18 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-03 21:54:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-04 00:54:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-04 16:34:32 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-04 17:54:16 | SubscriptionCheck | pg_ctl: server does not shut down
 hornet    | 2017-06-05 16:22:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-05 16:54:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-05 20:26:24 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-06 03:30:02 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-06 15:54:18 | SubscriptionCheck | pg_ctl: server does not shut down
 hornet    | 2017-06-06 17:10:02 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-06 18:54:27 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 00:54:07 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 02:54:06 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 15:12:15 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 17:54:07 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 18:54:06 | SubscriptionCheck | pg_ctl: server does not shut down
 sungazer  | 2017-06-07 19:46:53 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 21:03:43 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 01:54:07 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 15:54:10 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 16:57:03 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 17:54:09 | SubscriptionCheck

Re: finding changed blocks using WAL scanning

2019-04-18 Thread Andres Freund
On 2019-04-18 17:47:56 -0400, Bruce Momjian wrote:
> I can see a 1GB marker being used for that.  It would prevent an
> incremental backup from being done until the first 1G modblock files was
> written, since until then there is no record of modified blocks, but
> that seems fine.  A 1G marker would allow for consistent behavior
> independent of server restarts and base backups.

That doesn't seem like a good idea - it'd make writing regression tests
for this impractical.




Re: pg_dump is broken for partition tablespaces

2019-04-18 Thread Alvaro Herrera
On 2019-Apr-17, Alvaro Herrera wrote:

> On 2019-Apr-17, Tom Lane wrote:
> 
> > Alvaro Herrera  writes:
> > > 1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
> > >PARTITION when creating partitions, rather than CREATE TABLE
> > >PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
> > >this part mostly removes some code.  In order to make the partitions
> > >reach the correct tablespace, the "default_tablespace" GUC is used.
> > >No TABLESPACE clause is added to the dump.  This is David's patch
> > >near the start of the thread.
> > 
> > This idea seems reasonable independently of all else, simply on the grounds
> > of reducing code duplication.  It also has the advantage that if you try
> > to do a selective restore of just a partition, and the parent partitioned
> > table isn't around, you can still do it (with an ignorable error).
> 
> I'll get this part pushed, then.

After looking at it again, I found that there's no significant
duplication reduction -- the patch simply duplicates one block in a
different location, putting half of the original code in each. And if we
reject the idea of separating tablespaces, there's no reason to do
things that way.  So ISTM if we don't want the tablespace thing, we
should not apply this part.  FWIW, we got quite a few positive votes for
handling tablespaces this way for partitioned tables [1] [2], so I
resist the idea that we have to revert the initial commit, as some seem
to be proposing.


After re-reading the thread one more time, I found one more pretty
reasonable point that Andres was complaining about, and I made things
work the way he described.  Namely, if you do this:

SET default_tablespace TO 'foo';
CREATE TABLE part (a int) PARTITION BY LIST (a);
SET default_tablespace TO 'bar';
CREATE TABLE part1 PARTITION OF part FOR VALUES IN (1);

then the partition must end up in tablespace bar, not in tablespace foo:
the reason is that the default_tablespace is not "strong enough" to
stick with the partitioned table.  The partition would only end up in
tablespace foo in this case:

CREATE TABLE part (a int) PARTITION BY LIST (a) TABLESPACE foo;
CREATE TABLE part1 PARTITION OF part FOR VALUES IN (1);

i.e. when the tablespace is explicitly indicated in the CREATE TABLE
command for the partitioned table.  Of course, you can still add a
TABLESPACE clause to the partition to override it (and you can change
the parent table's tablespace later.)

So here's a proposed v5.

I would appreciate others' eyes on this patch.

[1] 
https://postgr.es/m/cakjs1f9sxvzqdrgd1teosfd6jbmm0ueaa14_8mrvcwe19tu...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CAKJS1f9PXYcT%2Bj%3DoyL-Lquz%3DScNwpRtmD7u9svLASUygBdbN8w%40mail.gmail.com

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 2e42c31985fd1ed38af7806c25539c6dba600dc1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 8 Apr 2019 16:40:05 -0400
Subject: [PATCH v5 1/2] Make pg_dump emit ATTACH PARTITION instead of
 PARTITION OF

ca41030 modified how the TABLESPACE option worked with partitioned tables
so that it was possible to specify and store a TABLESPACE for a partitioned
table.  This setting served only to mark what the default tablespace should
be for new partitions which were created with the CREATE TABLE PARTITION OF
syntax.

The problem with this was that pg_dump used the PARTITION OF syntax which
meant that if a partition had been explicitly defined to have pg_default
as the tablespace then since reltablespace is set to 0 in this case, pg_dump
would emit a CREATE TABLE .. PARTITION OF statement and cause the partition
to default to the partitioned table's tablespace rather than the one it was
meant to be located in.

We can work around this problem by simply performing the CREATE TABLE first
then perform an ATTACH PARTITION later.  This bypasses the check of the
parent partitioned table's tablespace in DefineRelation() as the table is
not a partition when it is defined.

Doing this also means that when restoring partitions they now maintain
their original column ordering rather than switch to their partitioned
table's column ordering. This is perhaps minor, but it is noteworthy

pg_dump in binary upgrade mode already worked this way, so it turns out
this commit removes more code than it adds.
---
 src/bin/pg_dump/pg_dump.c| 118 ++-
 src/bin/pg_dump/t/002_pg_dump.pl |  12 +++-
 2 files changed, 61 insertions(+), 69 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 7cfb67fd36e..bacb4de5c36 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8618,9 +8618,11 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
  * Normally this is always true, but it's false for dropped columns, as well
  * as those that were inherited without any local definition.  

Re: finding changed blocks using WAL scanning

2019-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote:
> On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian  wrote:
> > How would you choose the STARTLSN/ENDLSN?  If you could do it per
> > checkpoint, rather than per-WAL, I think that would be great.
> 
> I thought of that too.  It seems appealing, because you probably only
> really care whether a particular block was modified between one
> checkpoint and the next, not exactly when during that interval it was
> modified.  However, the simple algorithm of "just stop when you get to
> a checkpoint record" does not work, because the checkpoint record
> itself points back to a much earlier LSN, and I think that it's that
> earlier LSN that is interesting.  So if you want to make this work you
> have to be more clever, and I'm not sure I'm clever enough.

OK, so let's back up and study how this will be used.  Someone wanting
to make a useful incremental backup will need the changed blocks from
the time of the start of the base backup.  It is fine if they
incrementally back up some blocks modified _before_ the base backup, but
they need all blocks after, until some marker.  They will obviously
still use WAL to recover to a point after the incremental backup, so
there is no need to get every modified block up to current, just up to
some cut-off point where WAL can be discarded.

I can see a 1GB marker being used for that.  It would prevent an
incremental backup from being done until the first 1G modblock file was
written, since until then there is no record of modified blocks, but
that seems fine.  A 1G marker would allow for consistent behavior
independent of server restarts and base backups.

How would the modblock file record all the modified blocks across
restarts and crashes?  I assume that 1G of WAL would not be available
for scanning.  I suppose that writing a modblock file to some PGDATA
location when WAL is removed would work since during a crash the
modblock file could be updated with the contents of the existing pg_wal
files.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: block-level incremental backup

2019-04-18 Thread Stephen Frost
Greetings,

I wanted to respond to this point specifically as I feel like it'll
really help clear things up when it comes to the point of view I'm
seeing this from.

* Robert Haas (robertmh...@gmail.com) wrote:
> > Perhaps that's what we're both saying too and just talking past each
> > other, but I feel like the approach here is "make it work just for the
> > simple pg_basebackup case and not worry too much about the other tools,
> > since what we do for pg_basebackup will work for them too" while where
> > I'm coming from is "focus on what the other tools need first, and then
> > make pg_basebackup work with that if there's a sensible way to do so."
> 
> I think perhaps the disconnect is that I just don't see how it can
> fail to work for the external tools if it works for pg_basebackup.

The existing backup protocol that pg_basebackup uses *does* *not* *work*
for the external backup tools.  If it worked, they'd use it, but they
don't and that's because you can't do things like a parallel backup,
which we *know* users want because there's a number of tools which
implement that exact capability.

I do *not* want another piece of functionality added in this space which
is limited in the same way because it does *not* help the external
backup tools at all.

> Any given piece of functionality is either available in the
> replication stream, or it's not.  I suspect that for both BART and
> pg_backrest, they won't be able to completely give up on having their
> own backup engines solely because core has incremental backup, but I
> don't know what the alternative to adding features to core one at a
> time is.

This idea that it's either "in the replication system" or "not in the
replication system" is really bad, in my view, because it can be "in the
replication system" and at the same time not at all useful to the
existing external backup tools, but users and others will see the
"checkbox" as ticked and assume that it's available in a useful fashion
by the backend and then get upset when they discover the limitations.

The existing base backup/replication protocol that's used by
pg_basebackup is *not* useful to most of the backup tools, that's quite
clear since they *don't* use it.  Building on to that an incremental
backup solution that is similarly limited isn't going to make things
easier for the external tools.

If the goal is to make things easier for the external tools by providing
capability in the backend / replication protocol then we need to be
looking at what those tools require and not at what would be minimally
sufficient for pg_basebackup.  If we don't care about the external tools
and *just* care about making it work for pg_basebackup, then let's be
clear about that, and accept that it'll have to be, most likely, ripped
out and rewritten when we go to add parallel capabilities, for example,
to pg_basebackup down the road.  That's clearly the case for the
existing "base backup" protocol, so I don't see why it'd be different
for an incremental backup system that is similarly designed and
implemented.

To be clear, I'm all for adding feature to core one at a time, but
there's different ways to implement features and that's really what
we're talking about here- what's the best way to implement this
feature, ideally in a way that it's useful, practically, to both
pg_basebackup and the other external backup utilities.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread Tom Lane
Andres Freund  writes:
> My compromise suggestion would be to try to give John and Amit ~2 weeks
> to come up with a cleanup proposal, and then decide whether to 1) revert
> 2) apply the new patch, 3) decide to live with the warts for 12, and
> apply the patch in 13. As we would already have a patch, 3) seems like
> it'd be more tenable than without.

Seems reasonable.  I think we should shoot to have this resolved before
the end of the month, but it doesn't have to be done immediately.

regards, tom lane




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread Andres Freund
Hi,

On 2019-04-17 12:20:29 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
> >> OTOH, if we want to extend it later for whatever reason to a relation
> >> level cache, it shouldn't be that difficult as the implementation is
> >> mostly contained in freespace.c (fsm* functions) and I think the
> >> relation is accessible in most places.  We might need to rip out some
> >> calls to clearlocalmap.
> 
> > But it really isn't contained to freespace.c. That's my primary
> > concern. You added new parameters (undocumented ones!),
> > FSMClearLocalMap() needs to be called by callers and xlog, etc.
> 
> Given where we are in the release cycle, and the major architectural
> concerns that have been raised about this patch, should we just
> revert it and try again in v13, rather than trying to fix it under
> time pressure?  It's not like there's not anything else on our
> plates to fix before beta.

Hm. I'm of split mind here:

It's a nice improvement, and the fixes probably wouldn't be that
hard. And we could have piped up a bit earlier about these concerns (I
only noticed this when rebasing zheap onto the newest version of
postgres).

But as you say, it's also late, and there's other stuff to do. Although I
think neither Amit nor John is heavily involved in any...

My compromise suggestion would be to try to give John and Amit ~2 weeks
to come up with a cleanup proposal, and then decide whether to 1) revert
2) apply the new patch, 3) decide to live with the warts for 12, and
apply the patch in 13. As we would already have a patch, 3) seems like
it'd be more tenable than without.

Regards,

Andres




Re: Pluggable Storage - Andres's take

2019-04-18 Thread Andres Freund
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> The comments for relation_set_new_relfilenode() callback say that the AM can
> set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits
> this assertion:
> 
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)

Hm, that necessary change unfortunately escaped into the zheap tree
(which indeed doesn't set relfrozenxid). That's why I'd not noticed
this.  How about something like the attached?



I found a related problem in VACUUM FULL / CLUSTER while working on the
above, not fixed in the attached yet. Namely even if a relation doesn't
yet have a valid relfrozenxid/relminmxid before a VACUUM FULL / CLUSTER,
we'll set one after that. That's not great.

I suspect the easiest fix would be to make the relevant
relation_copy_for_cluster() FreezeXid, MultiXactCutoff arguments into
pointers, and allow the AM to reset them to an invalid value if that's
the appropriate one.

It'd probably be better if we just moved the entire xid limit
computation into the AM, but I'm worried that we actually need to move
it *further up* instead - independent of this change. I don't think it's
quite right to allow a table with a toast table to be independently
VACUUM FULL/CLUSTERed from the toast table. GetOldestXmin() can go
backwards for a myriad of reasons (or limited by
old_snapshot_threshold), and I'm fairly certain that e.g. VACUUM FULLing
the toast table, setting a lower old_snapshot_threshold, and VACUUM
FULLing the main table would result in failures.

I think we need to fix this for 12, rather than wait for 13. Does
anybody disagree?

Greetings,

Andres Freund
>From 5c84256ea5e41055b0cb9e0dc121a4daaca43336 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 18 Apr 2019 13:20:31 -0700
Subject: [PATCH] Allow pg_class xid & multixid horizons to not be set.

This allows table AMs that don't need these horizons. This was already
documented in the tableam relation_set_new_filenode callback, but an
assert prevented it from actually working (the test AM contained the
necessary change itself).

Reported-By: Heikki Linnakangas
Author: Andres Freund
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f...@iki.fi
---
 src/backend/access/heap/vacuumlazy.c |  4 +++
 src/backend/commands/vacuum.c| 53 
 src/backend/postmaster/autovacuum.c  |  4 +--
 3 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8dc76fa8583..9364cd4c33f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -213,6 +213,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 
+	/* not every AM requires these to be valid, but heap does */
+	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
+	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
+
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1a7291d94bc..94fb6f26063 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1313,36 +1313,61 @@ vac_update_datfrozenxid(void)
 
 		/*
 		 * Only consider relations able to hold unfrozen XIDs (anything else
-		 * should have InvalidTransactionId in relfrozenxid anyway.)
+		 * should have InvalidTransactionId in relfrozenxid anyway).
 		 */
 		if (classForm->relkind != RELKIND_RELATION &&
 			classForm->relkind != RELKIND_MATVIEW &&
 			classForm->relkind != RELKIND_TOASTVALUE)
+		{
+			Assert(!TransactionIdIsValid(classForm->relfrozenxid));
+			Assert(!MultiXactIdIsValid(classForm->relminmxid));
 			continue;
-
-		Assert(TransactionIdIsNormal(classForm->relfrozenxid));
-		Assert(MultiXactIdIsValid(classForm->relminmxid));
+		}
 
 		/*
+		 * Some table AMs might not need per-relation xid / multixid
+		 * horizons. It therefore seems reasonable to allow relfrozenxid and
+		 * relminmxid to not be set (i.e. set to their respective Invalid*Id)
+		 * independently. Thus validate and compute horizon for each only if
+		 * set.
+		 *
 		 * If things are working properly, no relation should have a
 		 * relfrozenxid or relminmxid that is "in the future".  However, such
 		 * cases have been known to arise due to bugs in pg_upgrade.  If we
 		 * see any entries that are "in the future", chicken out and don't do
-		 * anything.  This ensures we won't truncate clog before those
-		 * relations have been scanned and cleaned up.
+		 * anything.  This ensures we won't truncate clog & multixact SLRUs
+		 * before those relations have been scanned and cleaned up.
 		 */
-		if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
-			

Re: block-level incremental backup

2019-04-18 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost  wrote:
> > As I understand it, the problem is not with backing up an individual
> > database or cluster, but rather dealing with backing up thousands of
> > individual clusters with thousands of tables in each, leading to an
> > awful lot of tables with lots of FSMs/VMs, all of which end up having to
> > get copied and stored wholesale.  I'll point this thread out to him and
> > hopefully he'll have a chance to share more specific information.
> 
> Sounds good.

Ok, done.

> > I can agree with the idea of having multiple options for how to collect
> > up the set of changed blocks, though I continue to feel that a
> > WAL-scanning approach isn't something that we'd have implemented in the
> > backend at all since it doesn't require the backend and a given backend
> > might not even have all of the WAL that is relevant.  I certainly don't
> > think it makes sense to have a backend go get WAL from the archive to
> > then merge the WAL to provide the result to a client asking for it-
> > that's adding entirely unnecessary load to the database server.
> 
> My motivation for wanting to include it in the database server was twofold:
> 
> 1. I was hoping to leverage the background worker machinery.  The
> WAL-scanner would just run all the time in the background, and start
> up and shut down along with the server.  If it's a standalone tool,
> then it can run on a different server or when the server is down, both
> of which are nice.  The downside though is that now you probably have
> to put it in crontab or under systemd or something, instead of just
> setting a couple of GUCs and letting the server handle the rest.  For
> me that downside seems rather significant, but YMMV.

Background workers can be used to do pretty much anything.  I'm not
suggesting that's a bad thing- just that it's such a completely generic
tool that could be used to put anything/everything into the backend, so
I'm not sure how much it makes sense as an argument when it comes to
designing a new capability/feature.  Yes, there's an advantage there
when it comes to configuration since that means we don't need to set up
a cronjob and can, instead, just set a few GUCs...  but it also means
that it *must* be done on the server and there's no option to do it
elsewhere, as you say.

When it comes to "this is something that I can do on the DB server or on
some other server", the usual preference is to use another system for
it, to reduce load on the server.

If it comes down to something that needs to/should be an ongoing
process, then the packaging can package that as a daemon-type tool which
handles the systemd component to it, assuming the stand-alone tool
supports that, which it hopefully would.

> 2. In order for the information produced by the WAL-scanner to be
> useful, it's got to be available to the server when the server is
> asked for an incremental backup.  If the information is constructed by
> a standalone frontend tool, and stored someplace other than under
> $PGDATA, then the server won't have convenient access to it.  I guess
> we could make it the client's job to provide that information to the
> server, but I kind of liked the simplicity of not needing to give the
> server anything more than an LSN.

If the WAL-scanner tool is a stand-alone tool, and it handles picking
out all of the FPIs and incremental page changes for each relation, then
what does the tool to build out the "new" backup really need to tell the
backend?  I feel like it mainly needs to ask the backend for the
non-relation files, which gets into at least one approach that I've
thought about for redesigning the backup protocol:

1. Ask for a list of files and metadata about them
2. Allow asking for individual files
3. Support multiple connections asking for individual files

Quite a few of the existing backup tools for PG use a model along these
lines (or use tools underneath which do).

> > A thought that occurs to me is to have the functions for supporting the
> > WAL merging be included in libcommon and available to both the
> > independent executable that's available for doing WAL merging, and to
> > the backend to be able to WAL merging itself-
> 
> Yeah, that might be possible.

I feel like this would be necessary, as it's certainly delicate and
critical code and having multiple implementations of it will be
difficult to manage.

That said...  we already have independent work going on to do WAL
merging (WAL-G, at least), and if we insist that the WAL replay code
only exists in the backend, I strongly suspect we'll end up with
independent implementations of that too.  Sure, we can distance
ourselves from that and say that we don't have to deal with any bugs
from it... but it seems like the better approach would be to have a
common library that provides it.

> > but for a specific
> > purpose: having a way to reduce the amount of WAL that needs to be 

Re: finding changed blocks using WAL scanning

2019-04-18 Thread Robert Haas
On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian  wrote:
> How would you choose the STARTLSN/ENDLSN?  If you could do it per
> checkpoint, rather than per-WAL, I think that would be great.

I thought of that too.  It seems appealing, because you probably only
really care whether a particular block was modified between one
checkpoint and the next, not exactly when during that interval it was
modified.  However, the simple algorithm of "just stop when you get to
a checkpoint record" does not work, because the checkpoint record
itself points back to a much earlier LSN, and I think that it's that
earlier LSN that is interesting.  So if you want to make this work you
have to be more clever, and I'm not sure I'm clever enough.

I think it's important that a .modblock file not be too large, because
then it will use too much memory, and that it not cover too much WAL,
because then it will be too imprecise about when the blocks were
modified.  Perhaps we should have a threshold for each -- e.g. emit
the next .modblock file after finding 2^20 distinct block references
or scanning 1GB of WAL.  Then individual files would probably be in
the single-digit numbers of megabytes in size, assuming we do a decent
job with the compression, and you never need to scan more than 1GB of
WAL to regenerate one.  If the starting point for a backup falls in
the middle of such a file, and you include the whole file, at worst
you have ~8GB of extra blocks to read, but in most cases less, because
your writes probably have some locality and the file may not actually
contain the full 2^20 block references.  You could also make it more
fine-grained than that if you don't mind having more smaller files
floating around.
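
Expressed as code, that cutover rule is just the following (a sketch only;
the constants and names below merely restate the thresholds above and are
not taken from any patch):

#include <stdbool.h>
#include <stdint.h>

#define MODBLOCK_MAX_REFS   ((uint64_t) 1 << 20)    /* 2^20 distinct block refs */
#define MODBLOCK_MAX_WAL    ((uint64_t) 1 << 30)    /* 1GB of scanned WAL */

/* Start a new .modblock file once either threshold is reached. */
static bool
modblock_file_is_full(uint64_t distinct_refs, uint64_t wal_bytes_scanned)
{
    return distinct_refs >= MODBLOCK_MAX_REFS ||
           wal_bytes_scanned >= MODBLOCK_MAX_WAL;
}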

It would definitely be better if we could set things up so that we
could always switch to the next .modblock file when we cross a
potential redo start point, but they're not noted in the WAL so I
don't see how to do that.  I don't know if it would be possible to
insert some new kind of log record concurrently with fixing the redo
location, so that redo always started at a record of this new type.
That would certainly be helpful for this kind of thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: finding changed blocks using WAL scanning

2019-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2019 at 03:43:30PM -0400, Robert Haas wrote:
> You can make it kinda make sense by saying "the blocks modified by
> records *beginning in* segment XYZ" or alternatively "the blocks
> modified by records *ending in* segment XYZ", but that seems confusing
> to me.  For example, suppose you decide on the first one --
> 000100010068.modblock will contain all blocks modified by
> records that begin in 000100010068.  Well, that means that
> to generate the 000100010068.modblock, you will need
> access to 000100010068 AND probably also
> 000100010069 and in rare cases perhaps
> 00010001006A or even later files.  I think that's actually
> pretty confusing.
> 
> It seems better to me to give the files names like
> ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
> 0001.00016858.0001687DBBB8.modblock, so that you can
> see exactly which *records* are covered by that segment.

How would you choose the STARTLSN/ENDLSN?  If you could do it per
checkpoint, rather than per-WAL, I think that would be great.

> And I suspect it may also be a good idea to bunch up the records from
> several WAL files.  Especially if you are using 16MB WAL files,
> collecting all of the block references from a single WAL file is going
> to produce a very small file.  I suspect that the modified block files
> will end up being 100x smaller than the WAL itself, perhaps more, and
> I don't think anybody will appreciate us adding another PostgreSQL
> system that spews out huge numbers of tiny little files.  If, for
> example, somebody's got a big cluster that is churning out a WAL
> segment every second, they would probably still be happy to have a new
> modified block file only, say, every 10 seconds.

Agreed.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: finding changed blocks using WAL scanning

2019-04-18 Thread Robert Haas
On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier  wrote:
> On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:
> > That is pretty much exactly what I was intending to propose.
>
> Any caller of XLogWrite() could switch to a new segment once the
> current one is done, and I am not sure that we would want some random
> backend to potentially slow down to do that kind of operation.
>
> Or would a separate background worker do this work by itself?  An
> external tool can do that easily already:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks

I was thinking that a dedicated background worker would be a good
option, but Stephen Frost seems concerned (over on the other thread)
about how much load that would generate.  That never really occurred
to me as a serious issue and I suspect for many people it wouldn't be,
but there might be some.

It's cool that you have a command-line tool that does this as well.
Over there, it was also discussed that we might want to have both a
command-line tool and a background worker.  I think, though, that we
would want to get the output in some kind of compressed binary format,
rather than text.  e.g.

4-byte database OID
4-byte tablespace OID
any number of relation OID/block number pairings for that
database/tablespace combination
4-byte zero to mark the end of the relation OID/block number list
and then repeat all of the above any number of times

That might be too dumb and I suspect we want some headers and a
checksum, but we should try to somehow exploit the fact that there
aren't likely to be many distinct databases or many distinct
tablespaces mentioned -- whereas relation OID and block number will
probably have a lot more entropy.
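
For illustration, a minimal sketch of a writer for that layout (the function
and parameter names are invented here, and the header, checksum and
compression just mentioned are left out):

#include <stdio.h>
#include <stdint.h>

/* One group: a database/tablespace combination followed by its block list. */
static void
write_modblock_group(FILE *fp, uint32_t dboid, uint32_t tsoid,
                     const uint32_t *relids, const uint32_t *blknos, int n)
{
    uint32_t    zero = 0;

    fwrite(&dboid, sizeof(uint32_t), 1, fp);    /* 4-byte database OID */
    fwrite(&tsoid, sizeof(uint32_t), 1, fp);    /* 4-byte tablespace OID */
    for (int i = 0; i < n; i++)
    {
        fwrite(&relids[i], sizeof(uint32_t), 1, fp);    /* relation OID */
        fwrite(&blknos[i], sizeof(uint32_t), 1, fp);    /* block number */
    }
    fwrite(&zero, sizeof(uint32_t), 1, fp);     /* end-of-list marker */
}

A whole file would then be any number of such groups, one per
database/tablespace combination, repeated as described above.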

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Runtime pruning problem

2019-04-18 Thread Tom Lane
I wrote:
> [ let's fix this by reverting ruleutils back to using Plans not PlanStates ]

BTW, while I suspect the above wouldn't be a huge patch, it doesn't
seem trivial either.  Since the issue is (a) cosmetic and (b) not new
(v11 behaves the same way), I don't think we should consider it to be
an open item for v12.  I suggest leaving this as a to-do for v13.

regards, tom lane




Re: finding changed blocks using WAL scanning

2019-04-18 Thread Robert Haas
On Mon, Apr 15, 2019 at 10:22 PM Bruce Momjian  wrote:
> > > I am thinking tools could retain modblock files along with WAL, could
> > > pull full-page-writes from WAL, or from PGDATA.  It avoids the need to
> > > scan 16MB WAL files, and the WAL files and modblock files could be
> > > expired independently.
> >
> > That is pretty much exactly what I was intending to propose.
>
> OK, good.  Some of your wording was vague so I was unclear exactly what
> you were suggesting.

Well, I guess the part that isn't like what I was suggesting is the
idea that there should be exactly one modified block file per segment.
The biggest problem with that idea is that a single WAL record can be
split across two segments (or, in pathological cases, perhaps more).
I think it makes sense to talk about the blocks modified by WAL
between LSN A and LSN B, but it doesn't make much sense to talk about
the block modified by the WAL in segment XYZ.

You can make it kinda make sense by saying "the blocks modified by
records *beginning in* segment XYZ" or alternatively "the blocks
modified by records *ending in* segment XYZ", but that seems confusing
to me.  For example, suppose you decide on the first one --
000100010068.modblock will contain all blocks modified by
records that begin in 000100010068.  Well, that means that
to generate the 000100010068.modblock, you will need
access to 000100010068 AND probably also
000100010069 and in rare cases perhaps
00010001006A or even later files.  I think that's actually
pretty confusing.

It seems better to me to give the files names like
${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
0001.00016858.0001687DBBB8.modblock, so that you can
see exactly which *records* are covered by that segment.

And I suspect it may also be a good idea to bunch up the records from
several WAL files.  Especially if you are using 16MB WAL files,
collecting all of the block references from a single WAL file is going
to produce a very small file.  I suspect that the modified block files
will end up being 100x smaller than the WAL itself, perhaps more, and
I don't think anybody will appreciate us adding another PostgreSQL
system that spews out huge numbers of tiny little files.  If, for
example, somebody's got a big cluster that is churning out a WAL
segment every second, they would probably still be happy to have a new
modified block file only, say, every 10 seconds.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Fix plan created for inherited UPDATE/DELETE with all tables exc

2019-04-18 Thread Tom Lane
Amit Langote  writes:
> On 2019/02/23 2:23, Tom Lane wrote:
>> Fix plan created for inherited UPDATE/DELETE with all tables excluded.

> I noticed that we may have forgotten to fix one more thing in this commit
> -- nominalRelation may refer to a range table entry that's not referenced
> elsewhere in the finished plan:

> create table parent (a int);
> create table child () inherits (parent);
> explain verbose update parent set a = a where false;
> QUERY PLAN
> ───
>  Update on public.parent  (cost=0.00..0.00 rows=0 width=0)
>Update on public.parent
>->  Result  (cost=0.00..0.00 rows=0 width=0)
>  Output: a, ctid
>  One-Time Filter: false
> (5 rows)

> I think the "Update on public.parent" shown in the 2nd row is unnecessary,
> because it won't really be updated.

Well, perhaps, but nonetheless that's what the plan tree shows.
Moreover, even though it will receive no row changes, we're going to fire
statement-level triggers on it.  So I'm not entirely convinced that
suppressing it is a step forward ...

> As you may notice, Result node's targetlist is shown differently than
> before, that is, columns are shown prefixed with table name.

... and that definitely isn't one.

I also think that your patch largely falsifies the discussion at 1543ff:

 * Set the nominal target relation of the ModifyTable node if not
 * already done.  If the target is a partitioned table, we already set
 * nominalRelation to refer to the partition root, above.  For
 * non-partitioned inheritance cases, we'll use the first child
 * relation (even if it's excluded) as the nominal target relation.
 * Because of the way expand_inherited_rtentry works, that should be
 * the RTE representing the parent table in its role as a simple
 * member of the inheritance set.
 *
 * It would be logically cleaner to *always* use the inheritance
 * parent RTE as the nominal relation; but that RTE is not otherwise
 * referenced in the plan in the non-partitioned inheritance case.
 * Instead the duplicate child RTE created by expand_inherited_rtentry
 * is used elsewhere in the plan, so using the original parent RTE
 * would give rise to confusing use of multiple aliases in EXPLAIN
 * output for what the user will think is the "same" table.  OTOH,
 * it's not a problem in the partitioned inheritance case, because
 * there is no duplicate RTE for the parent.

We'd have to rethink exactly what the goals are if we want to change
the definition of nominalRelation like this.

One idea, perhaps, is to teach explain.c to not list partitioned tables
as subsidiary targets, independently of nominalRelation.  But I'm not
convinced that we need to do anything at all here.  The existing output
for this case is exactly like it was in v10 and v11.

regards, tom lane




Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Pavel Stehule
On Thu, Apr 18, 2019 at 21:12 Alvaro Herrera 
wrote:

> On 2019-Apr-18, Pavel Stehule wrote:
>
> > I don't know about any other cases. Other results in psql have a tabular
> > format.
>
> What about EXPLAIN?
>

I forgot it, thank you

Pavel


> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2019-04-18 Thread Robert Haas
On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier  wrote:
> I think that we have race conditions with checkpointing and shutdown
> on HEAD.

Any idea whether it's something newly-introduced or of long standing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Alvaro Herrera
On 2019-Apr-18, Pavel Stehule wrote:

> I don't know about any other cases. Other results in psql have a tabular
> format.

What about EXPLAIN?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ALTER TABLE with ADD COLUMN and ADD PRIMARY KEY USING INDEX throws spurious "column contains null values"

2019-04-18 Thread Robert Haas
On Wed, Apr 17, 2019 at 11:55 PM Tom Lane  wrote:
> * I'm not sure whether we want to try to back-patch this, or how
> far it should go.  The misbehavior has been there a long time
> (at least back to 8.4, I didn't check further); so the lack of
> previous reports means people seldom try to do it.  That may
> indicate that it's not worth taking any risks of new bugs to
> squash this one.  (Also, I suspect that it might take a lot of
> work to port this to before v10, because there are comments
> suggesting that this code worked a good bit differently before.)
> I do think we should shove this into v12 though.

Shoving it into v12 but not back-patching seems like a reasonable
compromise, although I have not reviewed the patch or tried to figure
out how risky it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-18 Thread Andres Freund
Hi,

> > Wow.  I have to admit that I feel completely opposite of that- I'd
> > *love* to have an independent tool (which ideally uses the same code
> > through the common library, or similar) that can be run to apply WAL.
> >
> > In other words, I don't agree that it's the server's problem at all to
> > solve that, or, at least, I don't believe that it needs to be.
> 
> I mean, I guess I'd love to have that if I could get it by waving a
> magic wand, but I wouldn't love it if I had to write the code or
> maintain it.  The routines for applying WAL currently all assume that
> you have a whole bunch of server infrastructure present; that code
> wouldn't run in a frontend environment, I think.  I wouldn't want to
> have a second copy of every WAL apply routine that might have its own
> set of bugs.

I'll fight tooth and nail not to have a second implementation of replay,
even if it's just portions.  The code we have is complicated and fragile
enough, having a [partial] second version would be way worse.  There's
already plenty improvements we need to make to speed up replay, and a
lot of them require multiple execution threads (be it processes or OS
threads), something not easily feasible in a standalone tool. And
without the already existing concurrent work during replay (primarily
checkpointer doing a lot of the necessary IO), it'd also be pretty
unattractive to use any separate tool.

Unless you just define the server binary as that "independent tool".
Which I think is entirely reasonable. With the 'consistent' and LSN
recovery targets one already can get most of what's needed from such a
tool, anyway.  I'd argue the biggest issue there is that there's no
equivalent to starting postgres with a private socket directory on
windows, and perhaps an option or two making it easier to start postgres
in a "private" mode for things like this.

Greetings,

Andres Freund




Re: Runtime pruning problem

2019-04-18 Thread Robert Haas
On Tue, Apr 16, 2019 at 10:49 PM Amit Langote
 wrote:
> Maybe, not show them?  That may be a bit inconsistent, because the point
> of VERBOSE is to show the targetlist among other things, but maybe the users
> wouldn't mind not seeing it on such empty Append nodes.  OTOH, they are
> more likely to think seeing a subplan that's clearly prunable as a bug of
> the pruning logic.

Or maybe we could show them, but the Append could also be flagged in
some way that indicates that its child is only a dummy.

Everything Pruned: Yes

Or something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-18 Thread Robert Haas
On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost  wrote:
> Sadly, I haven't got any great ideas today.  I do know that the WAL-G
> folks have specifically mentioned issues with the visibility map being
> large enough across enough of their systems that it kinda sucks to deal
> with.  Perhaps we could do something like the rsync binary-diff protocol
> for non-relation files?  This is clearly just hand-waving but maybe
> there's something reasonable in that idea.

I guess it all comes down to how complicated you're willing to make
the client-server protocol.  With the very simple protocol that I
proposed -- client provides a threshold LSN and server sends blocks
modified since then -- the client need not have access to the old
incremental backup to take a new one.  Of course, if it happens to
have access to the old backup then it can delta-compress however it
likes after-the-fact, but that doesn't help with the amount of network
transfer.  That problem could be solved by doing something like what
you're talking about (with some probably-negligible false match rate)
but I have no intention of trying to implement anything that
complicated, and I don't really think it's necessary, at least not for
a first version.  What I proposed would already allow, for most users,
a large reduction in transfer and storage costs; what you are talking
about here would help more, but also be a lot more work and impose
some additional requirements on the system.  I don't object to you
implementing the more complex system, but I'll pass.
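
For concreteness, one way the server side could make that per-block decision
is to compare each page's LSN with the client-supplied threshold (a sketch
only: PageGetLSN() is the existing macro, the wrapper function is invented
here, and the server could equally derive the set of changed blocks from
WAL-based information instead):

#include "postgres.h"
#include "storage/bufpage.h"

static bool
block_changed_since(Page page, XLogRecPtr threshold_lsn)
{
	/* >= rather than > errs on the side of including a block */
	return PageGetLSN(page) >= threshold_lsn;
}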

> There's something like 6 different backup tools, at least, for
> PostgreSQL that provide backup management, so I have a really hard time
> agreeing with this idea that users don't want a PG backup management
> system.  Maybe that's not what you're suggesting here, but that's what
> came across to me.

Let me be a little more clear.  Different users want different things.
Some people want a canned PostgreSQL backup solution, while other
people just want access to a reasonable set of facilities from which
they can construct their own solution.  I believe that the proposal I
am making here could be used either by backup tool authors to enhance
their offerings, or by individuals who want to build up their own
solution using facilities provided by core.

> Unless maybe I'm misunderstanding and what you're suggesting here is
> that the "existing solution" is something like the external PG-specific
> backup tools?  But then the rest doesn't seem to make sense, as only
> maybe one or two of those tools use pg_basebackup internally.

Well, what I'm really talking about is in two pieces: providing some
new facilities via the replication protocol, and making pg_basebackup
able to use those facilities.  Nothing would stop other tools from
using those facilities directly if they wish.

> ... but this is exactly the situation we're in already with all of the
> *other* features around backup (parallel backup, backup management, WAL
> management, etc).  Users want those features, pg_basebackup/PG core
> doesn't provide it, and therefore there's a bunch of other tools which
> have been written that do.  In addition, saying that PG has incremental
> backup but no built-in management of those full-vs-incremental backups
> and telling users that they basically have to build that themselves
> really feels a lot like we're trying to address a check-box requirement
> rather than making something that our users are going to be happy with.

I disagree.  Yes, parallel backup, like incremental backup, needs to
go in core.  And pg_basebackup should be able to do a parallel backup.
I will fight tooth, nail, and claw any suggestion that the server
should know how to do a parallel backup but pg_basebackup should not
have an option to exploit that capability.  And similarly for
incremental.

> I don't think that I was very clear in what my specific concern here
> was.  I'm not asking for pg_basebackup to have parallel backup (at
> least, not in this part of the discussion), I'm asking for the
> incremental block-based protocol that's going to be built-in to core to
> be able to be used in a parallel fashion.
>
> The existing protocol that pg_basebackup uses is basically, connect to
> the server and then say "please give me a tarball of the data directory"
> and that is then streamed on that connection, making that protocol
> impossible to use for parallel backup.  That's fine as far as it goes
> because only pg_basebackup actually uses that protocol (note that nearly
> all of the other tools for doing backups of PostgreSQL don't...).  If
> we're expecting the external tools to use the block-level incremental
> protocol then that protocol really needs to have a way to be
> parallelized, otherwise we're just going to end up with all of the
> individual tools doing their own thing for block-level incremental
> (though perhaps they'd reimplement whatever is done in core but in a way
> that they could parallelize it...), if 

Re: Runtime pruning problem

2019-04-18 Thread Tom Lane
Amit Langote  writes:
> On 2019/04/17 13:10, Tom Lane wrote:
>> No, the larger issue is that *any* plan node above the Append might
>> be recursing down to/through the Append to find out what to print for
>> a Var reference.  We have to be able to support that.

> I wonder why the original targetlist of these nodes, adjusted using just
> fix_scan_list(), wouldn't have been better for EXPLAIN to use?

So what I'm thinking is that I made a bad decision in 1cc29fe7c,
which did this:

... In passing, simplify the EXPLAIN code by
having it deal primarily in the PlanState tree rather than separately
searching Plan and PlanState trees.  This is noticeably cleaner for
subplans, and about a wash elsewhere.

It was definitely silly to have the recursion in explain.c passing down
both Plan and PlanState nodes, when the former is always easily accessible
from the latter.  So that was an OK change, but at the same time I changed
ruleutils.c to accept PlanState pointers not Plan pointers from explain.c,
and that is now looking like a bad idea.  If we were to revert that
decision, then instead of assuming that an AppendState always has at least
one live child, we'd only have to assume that an Append has at least one
live child.  Which is true.

I don't recall that there was any really strong reason for switching
ruleutils' API like that, although maybe if we look harder we'll find one.
I think it was mainly just for consistency with the way that explain.c
now looks at the world; which is not a negligible consideration, but
it's certainly something we could overrule.

> Another idea is to teach explain.c about this special case of run-time
> pruning having pruned all child subplans even though appendplans contains
> one element to cater for targetlist accesses.  That is, Append will be
> displayed with "Subplans Removed: All" and no child subplans listed below
> it, even though appendplans[] has one.  David already said he didn't do in
> the first place to avoid PartitionPruneInfo details creeping into other
> modules, but maybe there's no other way?

I tried simply removing the hack in nodeAppend.c (per quick-hack patch
below), and it gets through the core regression tests without a crash,
and with output diffs that seem fine to me.  However, that just shows that
we lack enough test coverage; we evidently have no regression cases where
an upper node needs to print Vars that are coming from a fully-pruned
Append.  Given the test case mentioned in this thread, I get

regression=# explain verbose select * from t1 where dt = current_date + 400;
 QUERY PLAN  
-
 Append  (cost=0.00..198.42 rows=44 width=8)
   Subplans Removed: 4
(2 rows)

which seems fine, but

regression=# explain verbose select * from t1 where dt = current_date + 400 
order by id;
psql: server closed the connection unexpectedly

It's dying trying to resolve Vars in the Sort node, of course.

regards, tom lane

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index f3be242..3cd8940 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -143,22 +143,13 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 			list_length(node->appendplans));
 
 			/*
-			 * The case where no subplans survive pruning must be handled
-			 * specially.  The problem here is that code in explain.c requires
-			 * an Append to have at least one subplan in order for it to
-			 * properly determine the Vars in that subplan's targetlist.  We
-			 * sidestep this issue by just initializing the first subplan and
-			 * setting as_whichplan to NO_MATCHING_SUBPLANS to indicate that
-			 * we don't really need to scan any subnodes.
+			 * If no subplans survive pruning, change as_whichplan to
+			 * NO_MATCHING_SUBPLANS, to indicate that we don't really need to
+			 * scan any subnodes.
 			 */
 			if (bms_is_empty(validsubplans))
-			{
 appendstate->as_whichplan = NO_MATCHING_SUBPLANS;
 
-/* Mark the first as valid so that it's initialized below */
-validsubplans = bms_make_singleton(0);
-			}
-
 			nplans = bms_num_members(validsubplans);
 		}
 		else
@@ -174,10 +165,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 		 * immediately, preventing later calls to ExecFindMatchingSubPlans.
 		 */
 		if (!prunestate->do_exec_prune)
-		{
-			Assert(nplans > 0);
 			appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
-		}
 	}
 	else
 	{


Re: Question about the holdable cursor

2019-04-18 Thread Andy Fan
On Thu, Apr 18, 2019 at 10:09 PM Tom Lane  wrote:

> Andy Fan  writes:
> > when I fetch from holdable cursor,  I found the fact is more complex than I
> > expected.
> > ...
> > why the 3rd time is necessary and will the performance be bad due to this
> > design?
>
> If you read the whole cursor output, then close the transaction and
> persist the cursor, yes we'll read it twice, and yes it's bad for that
> case.  The design is intended to perform well in these other cases:
>
Thank you Tom for the reply!!  It looks like this situation is really hard to
produce, but I just got there :(   Please help me to confirm my
understanding:

1.   we can have 2 methods to reproduce it:

Method 1:
a).  begin;   // begin the transaction explicitly
b).  declare c1 cursor WITH HOLD for select * from t;  // declare the
cursor with HOLD option.
c).  fetch n c1;  // this will run ExecutePlan the first time.
d).  commit;  // commit the transaction explicitly, which causes the 2nd
ExecutePlan and writes "ALL the records" into the tuplestore.

Method 2:

a).  declare c1 cursor WITH HOLD for select * from t; fetch n c1;   // send
1 query with 2 statements, with implicit transaction begin/commit;


(even though I don't know how to send "declare c1 cursor WITH HOLD for
select * from t; fetch n c1; " as one query in psql shell)


2.  A somewhat more normal case:

a). declare c1 cursor WITH HOLD for select * from t;  // declare the cursor
with HOLD option.  The transaction is started and committed implicitly.
During the commit, "ExecutePlan" is called for the first time to "GET ALL THE
RECORDS" and store ALL OF them (what if it is very big, write to a file)?

b).  fetch 10 c1;   // will not run ExecutePlan any more.

Even so, the "GET ALL THE RECORDS" at step a) is expensive.

3).  without hold option

a).   begin;
b).   declare c1 cursor  for select * from t;   // without hold option.
c).   fetch 1 c1; // this only scans 1 row.
d).   commit;

If so, the connection can't be used for other transactions until I commit
the transaction for the cursor (which is something I dislike for now).


Could you help me confirm that my understanding is correct regarding these 3
topics?   Thanks


> 1. The HOLD option isn't really being used, ie you just read and
> close the cursor within the original transaction.  This is important
> because applications are frequently sloppy about marking cursors as
> WITH HOLD.
>
> 2. You declare the cursor and persist it before reading anything from it.
> (This is really the typical use-case for held cursors, IMV.)
>
> FWIW, I don't see any intermediate tuplestore in there when
> dealing with a PORTAL_ONE_SELECT query, which is the only
> case that's possible with a cursor no?
>
> regards, tom lane
>


Re: block-level incremental backup

2019-04-18 Thread Andres Freund
Hi,

On 2019-04-18 11:34:32 -0400, Bruce Momjian wrote:
> On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
> > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> > > Also, instead of storing the file name and block number in the modblock
> > > file, using the database oid, relfilenode, and block number (3 int32
> > > values) should be sufficient.
> > 
> > Would doing it that way constrain the design of new table access
> > methods in some meaningful way?
> 
> I think these are the values used in WAL, so I assume table access
> methods already have to map to those, unless they use their own.
> I actually don't know.

I don't think it'd be a meaningful restriction. Given that we use those
for shared_buffer descriptors, WAL etc.

Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-18 Thread Robert Haas
On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost  wrote:
> As I understand it, the problem is not with backing up an individual
> database or cluster, but rather dealing with backing up thousands of
> individual clusters with thousands of tables in each, leading to an
> awful lot of tables with lots of FSMs/VMs, all of which end up having to
> get copied and stored wholesale.  I'll point this thread out to him and
> hopefully he'll have a chance to share more specific information.

Sounds good.

> I can agree with the idea of having multiple options for how to collect
> up the set of changed blocks, though I continue to feel that a
> WAL-scanning approach isn't something that we'd have implemented in the
> backend at all since it doesn't require the backend and a given backend
> might not even have all of the WAL that is relevant.  I certainly don't
> think it makes sense to have a backend go get WAL from the archive to
> then merge the WAL to provide the result to a client asking for it-
> that's adding entirely unnecessary load to the database server.

My motivation for wanting to include it in the database server was twofold:

1. I was hoping to leverage the background worker machinery.  The
WAL-scanner would just run all the time in the background, and start
up and shut down along with the server.  If it's a standalone tool,
then it can run on a different server or when the server is down, both
of which are nice.  The downside though is that now you probably have
to put it in crontab or under systemd or something, instead of just
setting a couple of GUCs and letting the server handle the rest.  For
me that downside seems rather significant, but YMMV.

2. In order for the information produced by the WAL-scanner to be
useful, it's got to be available to the server when the server is
asked for an incremental backup.  If the information is constructed by
a standalone frontend tool, and stored someplace other than under
$PGDATA, then the server won't have convenient access to it.  I guess
we could make it the client's job to provide that information to the
server, but I kind of liked the simplicity of not needing to give the
server anything more than an LSN.

> A thought that occurs to me is to have the functions for supporting the
> WAL merging be included in libcommon and available to both the
> independent executable that's available for doing WAL merging, and to
> the backend to be able to WAL merging itself-

Yeah, that might be possible.

> but for a specific
> purpose: having a way to reduce the amount of WAL that needs to be sent
> to a replica which has a replication slot but that's been disconnected
> for a while.  Of course, there'd have to be some way to handle the other
> files for that to work to update a long out-of-date replica.  Now, if we
> taught the backup tool about having a replication slot then perhaps we
> could have the backend effectively have the same capability proposed
> above, but without the need to go get the WAL from the archive
> repository.

Hmm, but you can't just skip over WAL records or segments because
there are checksums and previous-record pointers and things.

> I'm still not entirely sure that this makes sense to do in the backend
> due to the additional load, this is really just some brainstorming.

Would it really be that much load?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Pavel Stehule
On Thu, Apr 18, 2019 at 18:35 Bruce Momjian  wrote:

> On Thu, Apr 18, 2019 at 06:06:40PM +0200, Pavel Stehule wrote:
> >
> >
> > On Thu, Apr 18, 2019 at 17:58 Bruce Momjian 
> > wrote:
> >
> > On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote:
> > > On Thu, Apr 18, 2019 at 15:51 Bruce Momjian 
> > > wrote:
> > > In testing pspg, it seems to work fine with tabular and \
> > x-non-tabular
> > > data.  Are you asking for a pager option that is only used for
> non-\x
> > > display?  What do people want the non-pspg pager to do?
> > >
> > > My idea is following - pseudocode
> > >
> > > else /* for \h xxx */
> >
> > Well, normal output and \x looks fine in pspg, and \h doesn't use the
> > pager unless it is more than one screen.  If I do '\h *' it uses
> pspg,
> > but how often do people do that?  Most \h display doesn't use a
> pager,
> > so I don't see the point.
> >
> >
> > It depends on terminal size. On my terminal the pager is used almost every
> > time. \? is the same.
> >
> > pspg can work like a classic pager, but it is not optimized for this
> > purpose.
>
> Uh, the odd thing is that \? and sometimes \h are the only case I can
> see where using the classic pager has much value.  Are there more cases?
> If not, I don't see the value in having a separate configuration
> variable for this.
>

I don't know about any other cases. Other results in psql have a tabular
format.

Pavel


> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>


Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2019 at 06:06:40PM +0200, Pavel Stehule wrote:
> 
> 
> On Thu, Apr 18, 2019 at 17:58 Bruce Momjian  wrote:
> 
> On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote:
> > On Thu, Apr 18, 2019 at 15:51 Bruce Momjian 
> > wrote:
> >     In testing pspg, it seems to work fine with tabular and \
> x-non-tabular
> >     data.  Are you asking for a pager option that is only used for 
> non-\x
> >     display?  What do people want the non-pspg pager to do?
> >
> > My idea is following - pseudocode
> >
> > else /* for \h xxx */
> 
> Well, normal output and \x looks fine in pspg, and \h doesn't use the
> pager unless it is more than one screen.  If I do '\h *' it uses pspg,
> but how often do people do that?  Most \h display doesn't use a pager,
> so I don't see the point.
> 
> 
> It depends on terminal size. On my terminal the pager is used almost every
> time. \? is the same.
> 
> pspg can work like a classic pager, but it is not optimized for this purpose.

Uh, the odd thing is that \? and sometimes \h are the only case I can
see where using the classic pager has much value.  Are there more cases? 
If not, I don't see the value in having a separate configuration
variable for this.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Pavel Stehule
čt 18. 4. 2019 v 17:58 odesílatel Bruce Momjian  napsal:

> On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote:
> > čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian 
> napsal:
> > In testing pspg, it seems to work fine with tabular and
> \x-non-tabular
> > data.  Are you asking for a pager option that is only used for non-\x
> > display?  What do people want the non-pspg pager to do?
> >
> > My idea is following - pseudocode
> >
> > else /* for \h xxx */
>
> Well, normal output and \x looks fine in pspg, and \h doesn't use the
> pager unless it is more than one screen.  If I do '\h *' it uses pspg,
> but how often do people do that?  Most \h display doesn't use a pager,
> so I don't see the point.
>

It depends on terminal size. On my terminal the pager is used almost every
time. \? is the same.

pspg can work like a classic pager, but it is not optimized for this
purpose.





> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>


Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote:
> čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian  napsal:
> In testing pspg, it seems to work fine with tabular and \x-non-tabular
> data.  Are you asking for a pager option that is only used for non-\x
> display?  What do people want the non-pspg pager to do?
>
> My idea is following - pseudocode
> 
> else /* for \h xxx */

Well, normal output and \x looks fine in pspg, and \h doesn't use the
pager unless it is more than one screen.  If I do '\h *' it uses pspg,
but how often do people do that?  Most \h display doesn't use a pager,
so I don't see the point.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Pavel Stehule
čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian  napsal:

> On Thu, Apr 18, 2019 at 07:20:37AM +0200, Pavel Stehule wrote:
> > Hi
> >
> > I wrote a pspg pager https://github.com/okbob/pspg
> >
> > This pager is designed for tabular data. It can work in fallback mode as a
> > classic pager, but it is not designed for this purpose (and I don't plan to
> > do it). Can we extend the set of psql environment variables with a
> > PSQL_TABULAR_PAGER variable? This pager would be used when psql displays
> > tabular data.
>
> In testing pspg, it seems to work fine with tabular and \x-non-tabular
> data.  Are you asking for a pager option that is only used for non-\x
> display?  What do people want the non-pspg pager to do?
>

My idea is following - pseudocode


if view is a table
{
  if is_defined PSQL_TABULAR_PAGER
  {
pager = PSQL_TABULAR_PAGER
  }
  else if is_defined PSQL_PAGER
  {
pager = PSQL_PAGER
  }
  else
  {
pager = PAGER
  }
}
else /* for \h xxx */
{
  if is_defined PSQL_PAGER
  {
pager = PSQL_PAGER
  }
  else
  {
pager = PAGER
  }
}

I expect some configuration like

PSQL_TABULAR_PAGER=pspg
PSQL_PAGER="less -S"
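A minimal C sketch of that lookup order (choose_pager() is a hypothetical
helper written only to illustrate the proposal, not existing psql code):

```
#include <stdbool.h>
#include <stdlib.h>

/*
 * Illustrative sketch of the fallback order proposed above.  "is_tabular"
 * would be true for query results and false for things like \h output.
 */
static const char *
choose_pager(bool is_tabular)
{
	const char *pager = NULL;

	if (is_tabular)
		pager = getenv("PSQL_TABULAR_PAGER");	/* proposed new variable */
	if (pager == NULL || pager[0] == '\0')
		pager = getenv("PSQL_PAGER");			/* psql-specific variable */
	if (pager == NULL || pager[0] == '\0')
		pager = getenv("PAGER");				/* generic fallback */

	return pager;		/* NULL or "" means: do not use a pager */
}
```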

Regards

Pavel



> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>


Re: block-level incremental backup

2019-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
> On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> > Also, instead of storing the file name and block number in the modblock
> > file, using the database oid, relfilenode, and block number (3 int32
> > values) should be sufficient.
> 
> Would doing it that way constrain the design of new table access
> methods in some meaningful way?

I think these are the values used in WAL, so I assume table access
methods already have to map to those, unless they use their own.
I actually don't know.
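For illustration only, a fixed-width entry along those lines could look like
this (a sketch of the idea, not a proposed on-disk format; a real format
would presumably also need the tablespace OID and fork number):

```
#include <stdint.h>

/*
 * Sketch only: a fixed-width "modified block" entry as described above,
 * i.e. three 32-bit values identifying database, relation and block.
 */
typedef struct ModBlockEntry
{
	uint32_t	dboid;			/* database OID */
	uint32_t	relfilenode;	/* relation's relfilenode */
	uint32_t	blocknum;		/* block number within the relation */
} ModBlockEntry;
```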

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: block-level incremental backup

2019-04-18 Thread David Fetter
On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
> On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
> > Yeah, I really hope we don't end up with dueling patches.  I want to
> > come up with an approach that can be widely-endorsed and then have
> > everybody rowing in the same direction.  On the other hand, I do think
> > that we may support multiple options in certain places which may have
> > the kinds of trade-offs that Bruce mentions.  For instance,
> > identifying changed blocks by scanning the whole cluster and checking
> > the LSN of each block has an advantage in that it requires no prior
> > setup or extra configuration.  Like a sequential scan, it always
> > works, and that is an advantage.  Of course, for many people, the
> > competing advantage of a WAL-scanning approach that can save a lot of
> > I/O will appear compelling, but maybe not for everyone.  I think
> > there's room for two or three approaches there -- not in the sense of
> > competing patches, but in the sense of giving users a choice based on
> > their needs.
> 
> Well, by having a separate modblock file for each WAL file, you can keep
> both WAL and modblock files and use the modblock list to pull pages from
> each WAL file, or from the heap/index files, and it can be done in
> parallel.  Having WAL and modblock files in the same directory makes
> retention simpler.
> 
> In fact, you can do an incremental backup just using the modblock files
> and the heap/index files, so you don't even need the WAL.
> 
> Also, instead of storing the file name and block number in the modblock
> file, using the database oid, relfilenode, and block number (3 int32
> values) should be sufficient.

Would doing it that way constrain the design of new table access
methods in some meaningful way?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Apr 17, 2019 at 10:45 AM Tom Lane  wrote:
>> I think what we need to look for is reasons why (1) the postmaster
>> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
>> main loop doesn't get to noticing shutdown_requested.
>> 
>> A rather scary point for (2) is that said main loop seems to be
>> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
>> other purposes in the checkpointer process.  If there were something,
>> like say a condition variable wait, that would reset MyLatch at any
>> time during a checkpoint, then we could very easily go to sleep at the
>> bottom of the loop and not notice that there's a pending shutdown request.

> Agreed on the non-composability of that coding, but if there actually
> is anything in that loop that can reach ResetLatch(), it's well
> hidden...

Well, it's easy to see that there's no other ResetLatch call in
checkpointer.c.  It's much less obvious that there's no such
call anywhere in the code reachable from e.g. CreateCheckPoint().

To try to investigate that, I hacked things up to force an assertion
failure if ResetLatch was called from any other place in the
checkpointer process (dirty patch attached for amusement's sake).
This gets through check-world without any assertions.  That does not
really prove that there aren't corner timing cases where a latch
wait and reset could happen, but it does put a big dent in my theory.
Question is, what other theory has anybody got?

regards, tom lane

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a1e0423..b0ee1fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -166,6 +166,8 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
 
+extern bool reset_latch_ok;
+
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
@@ -343,9 +345,13 @@ CheckpointerMain(void)
 		int			elapsed_secs;
 		int			cur_timeout;
 
+		reset_latch_ok = true;
+
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
 
+		reset_latch_ok = false;
+
 		/*
 		 * Process any requests or signals received recently.
 		 */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 59fa917..1f0613d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -118,6 +118,8 @@ struct WaitEventSet
 #endif
 };
 
+bool reset_latch_ok = true;
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -521,6 +523,8 @@ ResetLatch(Latch *latch)
 	/* Only the owner should reset the latch */
 	Assert(latch->owner_pid == MyProcPid);
 
+	Assert(reset_latch_ok);
+
 	latch->is_set = false;
 
 	/*


Re: Question about the holdable cursor

2019-04-18 Thread Tom Lane
Andy Fan  writes:
> When I fetch from a holdable cursor, I found the behavior is more complex
> than I expected.
> ...
> Why is the 3rd time necessary, and will the performance be bad due to this
> design?

If you read the whole cursor output, then close the transaction and
persist the cursor, yes we'll read it twice, and yes it's bad for that
case.  The design is intended to perform well in these other cases:

1. The HOLD option isn't really being used, ie you just read and
close the cursor within the original transaction.  This is important
because applications are frequently sloppy about marking cursors as
WITH HOLD.

2. You declare the cursor and persist it before reading anything from it.
(This is really the typical use-case for held cursors, IMV.)
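To make those two patterns concrete, here is a small sketch (using
generate_series so it is self-contained; the second block is the pattern from
the question, where output is read before the cursor is persisted):

```
-- Case 2: declare WITH HOLD and persist (commit) before reading anything.
-- The query runs once into the held tuplestore; later FETCHes read from it.
BEGIN;
DECLARE c CURSOR WITH HOLD FOR SELECT * FROM generate_series(1, 1000) g;
COMMIT;              -- cursor contents are persisted here
FETCH 20 FROM c;     -- served from the held tuplestore
CLOSE c;

-- The pattern from the question: read some output first, then commit.
-- Persisting the cursor at COMMIT has to run the query again.
BEGIN;
DECLARE c2 CURSOR WITH HOLD FOR SELECT * FROM generate_series(1, 1000) g;
FETCH 20 FROM c2;    -- executed and sent to the client
COMMIT;              -- re-reads the output to fill the held tuplestore
FETCH 20 FROM c2;    -- continues from row 21, now from the tuplestore
CLOSE c2;
```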

FWIW, I don't see any intermediate tuplestore in there when
dealing with a PORTAL_ONE_SELECT query, which is the only
case that's possible with a cursor, no?

regards, tom lane




Re: proposal: psql PSQL_TABULAR_PAGER variable

2019-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2019 at 07:20:37AM +0200, Pavel Stehule wrote:
> Hi
> 
> I wrote a pspg pager https://github.com/okbob/pspg
> 
> This pager is designed for tabular data. It can work in fallback mode as a
> classic pager, but it is not designed for this purpose (and I don't plan to
> do it). Can we extend the set of psql environment variables with a
> PSQL_TABULAR_PAGER variable? This pager would be used when psql displays
> tabular data.

In testing pspg, it seems to work fine with tabular and \x-non-tabular
data.  Are you asking for a pager option that is only used for non-\x
display?  What do people want the non-pspg pager to do?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: bug in update tuple routing with foreign partitions

2019-04-18 Thread Etsuro Fujita

Amit-san,

Thanks for the comments!

(2019/04/18 14:08), Amit Langote wrote:

On 2019/04/18 14:06, Amit Langote wrote:

On 2019/04/17 21:49, Etsuro Fujita wrote:



I think the reason for that is: the row routed to remp is incorrectly
fetched as a to-be-updated row when updating remp as a subplan targetrel.


Yeah.  In the fully-local case, that is, where both the source and the
target partition of a row movement operation are local tables, heap AM
ensures that tuples that's moved into a given relation in the same command
(by way of row movement) are not returned as to-be-updated, because it
deems such tuples invisible.  The "same command" part being crucial for
that to work.

In the case where the target of a row movement operation is a foreign
table partition, the INSERT used as part of row movement and subsequent
UPDATE of the same foreign table are distinct commands for the remote
server.  So, the rows inserted by the 1st command (as part of the row
movement) are deemed visible by the 2nd command (UPDATE) even if both are
operating within the same transaction.


Yeah, I think so too.


I guess there's no easy way for postgres_fdw to make the remote server
consider them as a single command.  IOW, no way to make the remote server
not touch those "moved-in" rows during the UPDATE part of the local query.


I agree.


The right way to fix this would be to have some way in postgres_fdw in
which we don't fetch such rows when updating such subplan targetrels.  I
tried to figure out a (simple) way to do that, but I couldn't.


Yeah, that seems a bit hard to ensure with our current infrastructure.


Yeah, I think we should leave that for future work.


So what I'm thinking is to throw an error for cases like this.  (Though, I
think we should keep to allow rows to be moved from local partitions to a
foreign-table subplan targetrel that has been updated already.)


Hmm, how would you distinguish (totally inside postgred_fdw I presume) the
two cases?


One thing I came up with to do that is this:

@@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate,

 	initStringInfo(&sql);

+	/*
+	 * If the foreign table is an UPDATE subplan resultrel that hasn't yet
+	 * been updated, routing tuples to the table might yield incorrect
+	 * results, because if routing tuples, routed tuples will be mistakenly
+	 * read from the table and updated twice when updating the table --- it
+	 * would be nice if we could handle this case; but for now, throw an
+	 * error for safety.
+	 */
+	if (plan && plan->operation == CMD_UPDATE &&
+		(resultRelInfo->ri_usesFdwDirectModify ||
+		 resultRelInfo->ri_FdwState) &&
+		resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot route tuples into foreign table to be updated \"%s\"",
+						RelationGetRelationName(rel))));



Forgot to say that since this is a separate issue from the original bug
report, maybe we can first finish fixing the latter.  What do you think?


Yeah, I think we can do that, but my preference would be to fix the two 
issues in a single shot, because they seem to me rather close to each 
other.  So I prepared a new version as a single patch, which I'm attaching.


Notes:

* I kept all the changes in the previous patch, because otherwise 
postgres_fdw would fail to release resources for a foreign-insert 
operation created by postgresBeginForeignInsert() for a tuple-routable 
foreign table (ie, a foreign-table subplan resultrel that has been 
updated already) during postgresEndForeignInsert().


* I revised a comment according to your previous comment, though I 
changed a state name: s/sub_fmstate/aux_fmstate/


* DirectModify also has the latter issue.  Here is an example:

postgres=# create table p (a int, b text) partition by list (a);
postgres=# create table p1 partition of p for values in (1);
postgres=# create table p2base (a int check (a = 2 or a = 3), b text);
postgres=# create foreign table p2 partition of p for values in (2, 3) 
server loopback options (table_name 'p2base');


postgres=# insert into p values (1, 'foo');
INSERT 0 1
postgres=# explain verbose update p set a = a + 1;
                               QUERY PLAN
---------------------------------------------------------------------------
 Update on public.p  (cost=0.00..176.21 rows=2511 width=42)
   Update on public.p1
   Foreign Update on public.p2
   ->  Seq Scan on public.p1  (cost=0.00..25.88 rows=1270 width=42)
         Output: (p1.a + 1), p1.b, p1.ctid
   ->  Foreign Update on public.p2  (cost=100.00..150.33 rows=1241 width=42)
         Remote SQL: UPDATE public.p2base SET a = (a + 1)
(7 rows)

postgres=# update p set a = a + 1;
UPDATE 2
postgres=# select * from p;
 a |  b
---+-
 3 | foo
(1 row)

As shown in the 

Remove page-read callback from XLogReaderState.

2019-04-18 Thread Kyotaro HORIGUCHI
Hello. As mentioned before [1], the read_page callback in
XLogReaderState is a cause of headaches. Adding more
remote-controlling stuff to xlog readers makes things messier [2].

I refactored the XLOG reading functions so that we don't need the
callback. In short, with the attached patch set, ReadRecord now
calls XLogPageRead directly.

| while (XLogReadRecord(xlogreader, RecPtr, , )
|== XLREAD_NEED_DATA)
| XLogPageRead(xlogreader, fetching_ckpt, emode, randAccess);

On the other hand, XLogReadRecord became a bit complex. The patch
makes XLogReadRecord a state machine. I'm not confident that the
additional complexity is worth it. Anyway, I'll register this
in the next CF.

[1] 
https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e5...@iki.fi

[2] 
https://www.postgresql.org/message-id/20190412.122711.158276916.horiguchi.kyot...@lab.ntt.co.jp

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 3de14fb47987e9dd8189bfad3ca7264c26b719eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Apr 2019 10:22:49 +0900
Subject: [PATCH 01/10] Define macros to make XLogReadRecord a state machine

To minimize the apparent impact on the code, use some macros as syntactic sugar. This is similar to ExecInterpExpr but a bit different. The most significant difference is that this allows a function to be left in the middle of its work and then continued later. Roughly speaking, it is used as follows.

enum retval
some_func()
{
  static .. internal_variables;

  XLR_SWITCH();
  ...
  XLR_LEAVE(STATUS1, RETVAL_CONTINUE);
  ...
  XLR_LEAVE(STATUS2, RETVAL_CONTINUE2);
  ...
  XLR_SWITCH_END();

  XLR_RETURN(RETVAL_FINISH);
}

The caller uses the function as follows:

  while (some_func() != RETVAL_FINISH)
  {
;
  }
---
 src/backend/access/transam/xlogreader.c | 63 +
 src/include/access/xlogreader.h |  3 ++
 2 files changed, 66 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9196aa3aae..5299765040 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -29,6 +29,69 @@
 #include "utils/memutils.h"
 #endif
 
+/*
+ * Use computed-goto-based opcode dispatch when computed gotos are available.
+ * But use a separate symbol so that it's easy to adjust locally in this file
+ * for development and testing.
+ */
+#ifdef HAVE_COMPUTED_GOTO
+#define XLR_USE_COMPUTED_GOTO
+#endif			/* HAVE_COMPUTED_GOTO */
+
+/*
+ * Macros for opcode dispatch.
+ *
+ * XLR_SWITCH - just hides the switch if not in use.
+ * XLR_CASE - labels the implementation of named expression step type.
+ * XLR_DISPATCH - jump to the implementation of the step type for 'op'.
+ * XLR_LEAVE - leave the function and return here at the next call.
+ * XLR_RETURN - return from the function and set state to initial state.
+ * XLR_END - just hides the closing brace if not in use.
+ */
+#if defined(XLR_USE_COMPUTED_GOTO)
+#define XLR_SWITCH()		\
+	/* Don't call duplicatedly */			\
+	static int callcnt = 0 PG_USED_FOR_ASSERTS_ONLY;		\
+	do {	\
+		if ((XLR_STATE).j)	\
+			goto *((void *) (XLR_STATE).j);	\
+		XLR_CASE(XLR_INIT_STATE);			\
+		Assert(++callcnt == 1);\
+	} while (0)
+#define XLR_CASE(name)		name:
+#define XLR_DISPATCH()		goto *((void *) (XLR_STATE).j)
+#define XLR_LEAVE(name, code) do {\
+		(XLR_STATE).j = (&&name); return (code);	\
+		XLR_CASE(name);\
+	} while (0)
+#define XLR_RETURN(code) \
+  do {		\
+	  Assert(--callcnt == 0);\
+	  (XLR_STATE).j = (&&XLR_INIT_STATE); return (code);	\
+  } while (0)
+#define XLR_SWITCH_END()
+#else			/* !XLR_USE_COMPUTED_GOTO */
+#define XLR_SWITCH() \
+	/* Don't call duplicatedly */	 \
+	static int callcnt = 0 PG_USED_FOR_ASSERTS_ONLY; \
+	switch ((XLR_STATE).c) {		 \
+	XLR_CASE(XLR_INIT_STATE);		 \
+	Assert(++callcnt == 1);			 \
+#define XLR_CASE(name)		case name:
+#define XLR_DISPATCH()		goto starteval
+#define XLR_LEAVE(name, code) \
+  do {			\
+	  (XLR_STATE).c = (name); return (code);	\
+	  XLR_CASE(name);			\
+  } while (0)
+#define XLR_RETURN(code) \
+  do {		\
+	  Assert(--callcnt == 0);\
+	  (XLR_STATE).c = (XLR_INIT_STATE); return (code);		\
+  } while (0)
+#define XLR_SWITCH_END()	}
+#endif			/* XLR_USE_COMPUTED_GOTO */
+
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f3bae0bf49..30500c35c7 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -240,6 +240,9 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecBlockImageApply(decoder, block_id) \
 	((decoder)->blocks[block_id].apply_image)
 
+/* Reset the 

Question about the holdable cursor

2019-04-18 Thread Andy Fan
When I fetch from a holdable cursor, I found the behavior is more complex
than I expected.

Suppose we fetch 20 rows.

1) It first fills a portal store: the dest is not the client but
DestTupleStore; ExecutePlan is called once and receiveSlot is called 20
times.

2) The portal for the client then does RunFromStore and sends the result to
the client; receiveSlot is called 20 times again.

3) At last, when we HoldPortal, ExecutePlan is called once again and
receiveSlot is called 20 times:

```
0  in ExecutePlan of execMain.c:1696
1  in standard_ExecutorRun of execMain.c:366
2  in ExecutorRun of execMain.c:309
3  in PersistHoldablePortal of portalcmds.c:392
4  in HoldPortal of portalmem.c:639
5  in PreCommit_Portals of portalmem.c:733
6  in CommitTransaction of xact.c:2007
7  in CommitTransactionCommand of xact.c:2801
8  in finish_xact_command of postgres.c:2529
9  in exec_simple_query of postgres.c:1176
10 in exec_docdb_simple_query of postgres.c:5069
11 in _exec_query_with_intercept_exception of op_executor.c:38
12 in exec_op_query of op_executor.c:102
13 in exec_op_find of op_executor.c:204
14 in run_op_find_common of op_find_common.c:42
15 in _cmd_run_find of cmd_find.c:31
16 in run_commands of commands.c:610
17 in DocdbMain of postgres.c:4792
18 in DocdbBackendRun of postmaster.c:4715
19 in DocdbBackendStartup of postmaster.c:4196
20 in ServerLoop of postmaster.c:1760
21 in PostmasterMain of postmaster.c:1406
22 in main of main.c:228
```

Why is the 3rd time necessary, and will the performance be bad due to this
design?

Thanks for your help!


Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread John Naylor
On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila  wrote:
> I respect and will follow whatever will be the consensus after
> discussion.  However, I request you to wait for some time to let the
> discussion conclude.  If we can't get to an
> agreement or one of John or me can't implement what is decided, then
> we can anyway revert it.

Agreed. I suspect the most realistic way to address most of the
objections in a short amount of time would be to:

1. rip out the local map
2. restore hio.c to only checking the last block in the relation if
there is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgr_exists()

Thoughts?

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Fix handling of unlogged tables in FOR ALL TABLES publications

2019-04-18 Thread Peter Eisentraut
On 2019-03-25 09:04, Peter Eisentraut wrote:
> So perhaps push the check down to GetRelationPublicationActions()
> instead.  That way we don't have to patch up two places and everything
> "just works" even for possible other callers.  See attached patch.

This has been committed and backpatched to PG10.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread Amit Kapila
On Wed, Apr 17, 2019 at 9:50 PM Tom Lane  wrote:
>
> Andres Freund  writes:
> > On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
> >> OTOH, if we want to extend it later for whatever reason to a relation
> >> level cache, it shouldn't be that difficult as the implementation is
> >> mostly contained in freespace.c (fsm* functions) and I think the
> >> relation is accessible in most places.  We might need to rip out some
> >> calls to clearlocalmap.
>
> > But it really isn't contained to freespace.c. That's my primary
> > concern. You added new parameters (undocumented ones!),
> > FSMClearLocalMap() needs to be called by callers and xlog, etc.
>
> Given where we are in the release cycle, and the major architectural
> concerns that have been raised about this patch, should we just
> revert it and try again in v13, rather than trying to fix it under
> time pressure?
>

I respect and will follow whatever will be the consensus after
discussion.  However, I request you to wait for some time to let the
discussion conclude.  If we can't get to an
agreement or one of John or me can't implement what is decided, then
we can anyway revert it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-18 Thread Amit Kapila
On Wed, Apr 17, 2019 at 9:46 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
> > > *and* stash the bitmap of
> > > pages that we think are used/free as a bitmap somewhere below the
> > > relcache.
> >
> > I think maintaining at relcache level will be tricky when there are
> > insertions and deletions happening in the small relation.  We have
> > considered such a case during development wherein we don't want the
> > FSM to be created if there are insertions and deletions in small
> > relation.  The current mechanism addresses both this and normal case
> > where there are not many deletions.  Sure there is some overhead of
> > building the map again and rechecking each page.  The first one is a
> > memory operation and takes a few cycles
>
> Yea, I think creating / resetting the map is basically free.
>
> I'm not sure I buy the concurrency issue - ISTM it'd be perfectly
> reasonable to cache the local map (in the relcache) and use it for local
> FSM queries, and rebuild it from scratch once no space is found. That'd
> avoid a lot of repeated checking of relation size for small, but
> commonly changed relations.
>

Okay, so you mean to say that we need to perform an additional system
call (to get the number of blocks) only when no space is found in the
existing set of pages?  I think that is a fair point, but can't we
achieve that by updating relpages in the relation after a call to
RelationGetNumberOfBlocks?

>  Add a pre-check of smgr_fsm_nblocks (if >
> 0, there has to be an fsm), and there should be fewer syscalls.
>

Yes, that check is a good one and I see that we already do this check
in fsm code before calling smgrexists.

>
> > and for the second we optimized by checking the pages alternatively
> > which means we won't check more than two pages at-a-time.  This cost
> > is paid by not checking FSM and it could be somewhat better in some
> > cases [1].
>
> Well, it's also paid by potentially higher bloat, because the
> intermediate pages aren't tested.
>
>
> > >  If we cleared that variable at truncations, I think we should
> > > be able to make that work reasonably well?
> >
> > Not only that, I think it needs to be cleared whenever we create the
> > FSM as well which could be tricky as it can be created by the vacuum.
>
> ISTM normal invalidation logic should just take of that kind of thing.
>

Do you mean to say that we don't need to add any new invalidation call
and the existing invalidation calls will automatically take care of
that?

>
> > OTOH, if we want to extend it later for whatever reason to a relation
> > level cache, it shouldn't be that difficult as the implementation is
> > mostly contained in freespace.c (fsm* functions) and I think the
> > relation is accessible in most places.  We might need to rip out some
> > calls to clearlocalmap.
>
> But it really isn't contained to freespace.c. That's my primary
> concern.

Okay, I get that point.  I think, among those, the need to call
FSMClearLocalMap seems to be your main worry, which is fair, but OTOH,
there shouldn't be a ton of places where it needs to be called.

> You added new parameters (undocumented ones!),
>

I think this is mostly for compatibility with the old code.  I agree
that is a wart, but without much input during development, it
didn't seem advisable to change the old behavior; that is why we have
added a new parameter to GetPageWithFreeSpace.  However, if we want, we
can remove that parameter or document it in a better way.

> FSMClearLocalMap() needs to be called by callers and xlog, etc.
>

Agreed that this is an additional requirement, but we have documented
the cases atop this function where it needs to be called.  We might
have missed something, but we tried to cover all the cases that we are
aware of.  Can we make it clearer by adding comments atop the
freespace.c APIs where this map is used?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-04-18 Thread Sergei Kornilov
Hello

Well, actually we do change postgresql.conf.sample in back-branches. Recently there
was yet another commit in REL_10_STABLE: fea2cab70de8d190762996c7c447143fb47bcfa3.
I think we need to fix the inconsistent comment for bgwriter_lru_maxpages.

regards, Sergei




Re: Status of the table access method work

2019-04-18 Thread Simon Riggs
On Wed, 10 Apr 2019 at 18:14, Alexander Korotkov 
wrote:


> Alternative idea is to have MVCC-aware indexes.  This approach looks
> more attractive for me.  In this approach you basically need xmin,
> xmax fields in index tuples.  On insertion of index tuple you fill
> it's xmin.  On update, previous index tuple is marked with xmax.
>

+1

xmax can be provided through to the index by the indexam when 1) we mark killed
tuples, or 2) we do an index scan of an index entry without xmax set.
xmax can be set as a hint on normal scans, or set as part of an update, as
the index chooses.

> After that outdated index tuples might be deleted in the lazy manner
> when page space is required.

That is already done, so hardly any change there.

> Also, we can have special bit for "all visible" index tuples.  With "all
> visible" bit set this tuple can get rid of visibility fields.  We can do
> this for index tuples, because if index tuple requires extra space we can
> split the page, in spite of heap where tuples are fixed in pages and
> xmax needs to be updated in-place.

Keeping the xmin/xmax would also be useful for historical indexes, i.e.
indexes that can be used to search for data with historic snapshots.
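Purely to illustrate the shape being discussed, a hypothetical layout (names
invented here, not an existing or proposed format):

```
#include <stdint.h>

typedef uint32_t TransactionId;		/* stand-in for the PostgreSQL typedef */

#define MVCC_IDX_ALL_VISIBLE	0x0001	/* tuple known visible to everyone */

/*
 * Hypothetical header for an MVCC-aware index tuple as sketched in this
 * thread: per-tuple xmin/xmax plus an "all visible" flag.  A tuple marked
 * all-visible could be stored without the visibility fields at all.
 */
typedef struct MvccIndexTupleHeader
{
	uint16_t	  flags;	/* e.g. MVCC_IDX_ALL_VISIBLE */
	TransactionId xmin;		/* inserting transaction */
	TransactionId xmax;		/* deleting/updating transaction, or invalid */
	/* index key and heap TID payload would follow */
} MvccIndexTupleHeader;
```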

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: New vacuum option to do only freezing

2019-04-18 Thread Masahiko Sawada
On Thu, Apr 18, 2019 at 4:20 AM Peter Geoghegan  wrote:
>
> On Wed, Apr 17, 2019 at 10:46 AM Tom Lane  wrote:
> > Yeah, if we wanted to expose these complications more directly, we
> > could think about adding or changing the main counters.  I was wondering
> > about whether we should have it report "x bytes and y line pointers
> > freed", rather than counting tuples per se.
>

It looks like a good idea to me.

> I like that idea, but I'm pretty sure that there are very few users
> that are aware of these distinctions at all. It would be a good idea
> to clearly document them.

I completely agree. I'm sure that only a few users would take the action
of enabling index cleanup when the report says there are many dead
line pointers in the table.

It brought me another idea of reporting something like "x bytes
freed, y bytes can be freed after index cleanup". That is, we report
how many bytes, including tuples and line pointers, we freed and how
many bytes of line pointers can be freed after index cleanup. While
index cleanup is enabled, the latter value should always be 0. If the
latter value gets large, the user can become aware of the necessity of
doing index cleanup.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center