Re: Pulling up direct-correlated ANY_SUBLINK

2020-08-18 Thread Andy Fan
On Tue, Sep 17, 2019 at 4:41 PM Richard Guo  wrote:

>
> On Thu, Sep 12, 2019 at 11:35 PM Antonin Houska  wrote:
>
>> Richard Guo  wrote:
>>
>> > On Wed, Sep 11, 2019 at 3:25 PM Antonin Houska 
>> > wrote:
>> >
>> >
>> > Nevertheless, I don't know how to overcome the problems that I
>> > mentioned
>> > upthread.
>> >
>> >
>> > Do you mean the problem "the WHERE clause of the subquery didn't
>> > participate in the SEMI JOIN evaluation"? Good news is it has been
>> > fixed
>> > by commit 043f6ff0 as I mentioned upthread.
>>
>> Do you say that my old patch (rebased) no longer breaks the regression
>> tests?
>>
>
> I think so.
>
>
>>
>> (I noticed your other email in the thread which seems to indicate that
>> you're no longer interested in working on the feature, but I'm asking
>> out of curiosity.)
>>
>
> Tom pointed out that even if we pull up the subquery with the help of
> LATERAL, we cannot make sure we will end up with a better plan, since
> LATERAL pretty much constrains things to use a nestloop. Hmm, I think
> what he said makes sense.
>
> Thanks
> Richard
>
>

Even if we can't do this for the general case, I still think we can do
something for some special cases.  For example:

select count(*) from j1 where i in (select i from j2 where j2.im5 = j1.im5);

can be converted to

select count(*) from j1 where (i, im5) in (select i, im5 from j2);

The conversion can happen just before the convert_ANY_sublink_to_join.
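
For illustration, here is a self-contained sketch of the intended
equivalence (the table definitions are assumptions made up for this
example, mirroring the names above):

create table j1 (i int, im5 int);
create table j2 (i int, im5 int);

-- correlated form: currently not pulled up as a semi-join
select count(*) from j1
where i in (select i from j2 where j2.im5 = j1.im5);

-- rewritten form: eligible for convert_ANY_sublink_to_join
select count(*) from j1
where (i, im5) in (select i, im5 from j2);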

@@ -399,6 +483,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 			/* Is it a convertible ANY or EXISTS clause? */
 			if (sublink->subLinkType == ANY_SUBLINK)
 			{
+				reduce_sublink_correlation_exprs(root, sublink);
 				if ((j = convert_ANY_sublink_to_join(root, sublink,
 													 available_rels1)) != NULL)

However, we have to do a lot of pre-checking for this; the following is
what I can think of for now.

1). It must be an IN subquery.
2). The op in correlation_expr must be a mergeable op.
3). No aggregate call in subquery->targetList or subquery->havingQual.
4). No limit/offset clause.
5). No volatile function involved, for safety.

I can't tell how common this is; I ran into it myself, searched the
mailing list, and found only one report [1].  Is it worth doing, or do
we have a better strategy to handle it?  Thanks!

[1] https://www.postgresql.org/message-id/3691.1342650...@sss.pgh.pa.us

-- 
Best Regards
Andy Fan


RE: New statistics for tuning WAL buffer size

2020-08-18 Thread Masahiro Ikeda

On 2020-08-19 13:49, tsunakawa.ta...@fujitsu.com wrote:

From: Masahiro Ikeda 

If my understanding is correct, we have to measure the performance
impact first.
Do you know if HariBabu is still trying to solve it? If not, I will try
to modify the patches to apply to HEAD.


No, he's not doing it anymore.  It'd be great if you could resume it.


OK, thanks.


However, I recommend sharing your understanding of what the issues were
with those two threads and how you're trying to solve them.
Was the performance overhead the blocker in both threads?


In my understanding, some review comments remain unresolved in both
threads.

I think the following work remains:

1) Modify the patches to apply to HEAD.
2) Get consensus on what metrics to collect and how to use them for tuning.
3) Measure the performance impact and, if it degrades performance, fix it.


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: Creating foreign key on partitioned table is too slow

2020-08-18 Thread Amit Langote
Hi Alvaro,

On Thu, Aug 6, 2020 at 4:25 PM kato-...@fujitsu.com
 wrote:
> On Wednesday, August 5, 2020 9:43 AM I wrote:
> > I'll report the result before the end of August .
>
> I test v2-0001-build-partdesc-memcxt.patch at 9a9db08ae4 and it is ok.

Is this patch meant for HEAD or back-patching?  I ask because v13 got this:

commit 5b9312378e2f8fb35ef4584aea351c3319a10422
Author: Tom Lane 
Date:   Wed Dec 25 14:43:13 2019 -0500

Load relcache entries' partitioning data on-demand, not immediately.

which prevents a partitioned table's PartitionDesc from being rebuilt
repeatedly as would happen before this commit in Kato-san's case,
because it moves RelationBuildPartitionDesc out of the relcache flush
code path.

So, the OOM situation that Kato-san originally reported should not occur
as of v13.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




RE: New statistics for tuning WAL buffer size

2020-08-18 Thread tsunakawa.ta...@fujitsu.com
From: Masahiro Ikeda 
> If my understanding is correct, we have to measure the performance
> impact first.
> Do you know if HariBabu is still trying to solve it? If not, I will try
> to modify the patches to apply to HEAD.

No, he's not doing it anymore.  It'd be great if you could resume it.  However,
I recommend sharing your understanding of what the issues were with those
two threads and how you're trying to solve them.  Was the performance
overhead the blocker in both threads?


Regards
Takayuki Tsunakawa







RE: New statistics for tuning WAL buffer size

2020-08-18 Thread Masahiro Ikeda

On 2020-08-18 16:35, tsunakawa.ta...@fujitsu.com wrote:

From: Masahiro Ikeda 
It's important to provide metrics for tuning the size of WAL buffers.
For now, we lack statistics on how often processes wait to write WAL
because the WAL buffer is full.

If that situation occurs often, the WAL buffer is too small for the
workload, and DBAs must tune the WAL buffer size to improve performance.


Yes, it's helpful to know if we need to enlarge the WAL buffer.
That's why our colleague HariBabu proposed the patch.  We'd be happy
if it could be committed in some form.


There are related threads, but those patches were not merged.
https://www.postgresql.org/message-id/4ff824f3.5090...@uptime.jp
https://www.postgresql.org/message-id/flat/CAJrrPGc6APFUGYNcPe4qcNx
pL8gXKYv1KST%2BvwJcFtCSCEySnA%40mail.gmail.com


What's the difference between those patches?  What blocked them from
being committed?


Thanks for replying.

Since the above threads are not active now and those patches no longer
apply to HEAD, I started this thread. If it is better to reply to the
above threads, I will do so.
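
To make the goal concrete, here is the kind of check such a statistic
would enable. The view and column names below are purely illustrative
assumptions, not an existing API:

-- hypothetical: cumulative count of times a backend had to wait
-- because the WAL buffer was full; steady growth would suggest
-- increasing wal_buffers
SELECT wal_buffers_full FROM pg_stat_wal;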


If my understanding is correct, we have to measure the performance
impact first.
Do you know if HariBabu is still trying to solve it? If not, I will try
to modify the patches to apply to HEAD.


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-18 Thread Amit Kapila
On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar  wrote:
>
>
> In the last patch, v49-0001, there is one issue.  Basically, I have
> called BufFileFlush in all cases.  But ideally we cannot call it if the
> underlying files are deleted/truncated, because those files/blocks
> might not exist anymore.  So I think that if the truncate position is
> within the same buffer we just need to adjust the buffer; otherwise we
> just need to set currFile and currOffset to the absolute position and
> set pos and nbytes to 0.  The attached patch fixes this issue.
>

A few comments on the latest patch, v50-0001-Extend-the-BufFile-interface:
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

2.
+ int curFile = file->curFile;
+ off_t curOffset = file->curOffset;

I find the previous naming (newFile, newOffset) was better as it
distinguishes them from BufFile variables.

3.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
..
+ /* Delete all files in the set */
+ SharedFileSetDeleteAll(input_fileset);
..
}

I am not sure if this is completely correct, because we call this
function (SharedFileSetUnregister) from BufFileDeleteShared, which
would have already removed all the required files. This raises the
question in my mind of whether it is correct to call
SharedFileSetUnregister from BufFileDeleteShared from an API
perspective, as one might not want to remove the entire fileset at that
point in time. It will work for your use case (where, while removing
the buffile, you also want to remove the entire fileset) but I am not
sure it is generic enough. For your case, I wonder if we can directly
call SharedFileSetDeleteAll and have a call like SharedFileSetUnregister
which is called from it.

-- 
With Regards,
Amit Kapila.




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Ashutosh Sharma
On Tue, Aug 18, 2020 at 9:44 PM Alvaro Herrera  wrote:
>
> On 2020-Aug-17, Ashutosh Sharma wrote:
>
> > > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > > +   ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And the execution of any
> > of the functions provided by this module on a damaged relation must be
> > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> > This is necessary to bring back a damaged relation to the sane state
> > once a surgery is performed on it. I will try to add this note in the
> > documentation for this module.
>
> It makes sense to recommend VACUUM after fixing the page, but I agree
> with Sawada-san that it would be sensible to reset the VM bit while
> doing surgery, since that's the state that the page would be in.

Sure, I will try to make that change, but I would still recommend
always running VACUUM with the DISABLE_PAGE_SKIPPING option on a
relation that underwent surgery.

> We should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
> but if users fail to do so, then leaving the VM bit set just means that
> we know *for certain* that there will be further corruption as soon as
> the XID counter advances sufficiently.
>

Yeah, I've already added a note for this in the documentation:

Note: "After surgery is performed on a damaged relation using this
module, we must run VACUUM with the DISABLE_PAGE_SKIPPING option on that
relation to bring it back into a sane state."
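
For example (a minimal sketch; the table name and TID are illustrative,
following the usage shown elsewhere in this thread):

-- perform surgery on the damaged tuple(s) ...
select heap_force_freeze('tbl', array['(0,1)'::tid]);
-- ... and then always follow it with:
VACUUM (DISABLE_PAGE_SKIPPING) tbl;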

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-18 Thread Tom Lane
David Rowley  writes:
> I don't object to making the change. I just object to making it only
> to put it back again later when someone else speaks up that they'd
> prefer to keep nodes modular and not overload them in obscure ways.

> So other input is welcome.  Is it too weird to overload SubPlan and
> Nested Loop this way?  Or okay to do that if it squeezes out a dozen
> or so nanoseconds per tuple?

If you need somebody to blame it on, blame it on me - but I agree
that that is an absolutely horrid abuse of NestLoop.  We might as
well reduce explain.c to a one-liner that prints "Here Be Dragons",
because no one will understand what this display is telling them.

I'm also quite skeptical that adding overhead to nodeNestloop.c
to support this would actually be a net win once you account for
what happens in plans where the caching is of no value.

regards, tom lane




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-18 Thread Pavel Stehule
On Wed, 19 Aug 2020 at 5:48, David Rowley wrote:

> On Tue, 18 Aug 2020 at 21:42, David Rowley  wrote:
> >
> > On Tue, 11 Aug 2020 at 17:44, Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2020-08-11 17:23:42 +1200, David Rowley wrote:
> > > > On Tue, 11 Aug 2020 at 12:21, Andres Freund 
> wrote:
> > > > >
> > > > > On 2020-07-09 10:25:14 +1200, David Rowley wrote:
> > > > > > On Thu, 9 Jul 2020 at 04:53, Andres Freund 
> wrote:
> > > > > > > I'm not convinced it's a good idea to introduce a separate
> executor node
> > > > > > > for this. There's a fair bit of overhead in them, and they
> will only be
> > > > > > > below certain types of nodes afaict. It seems like it'd be
> better to
> > > > > > > pull the required calls into the nodes that do parametrized
> scans of
> > > > > > > subsidiary nodes. Have you considered that?
> > > > > >
> > > > > > I see 41 different node types mentioned in ExecReScan().  I don't
> > > > > > really think it would be reasonable to change all those.
> > > > >
> > > > > But that's because we dispatch ExecReScan mechanically down to
> every
> > > > > single executor node. That doesn't determine how many nodes would
> need
> > > > > to modify to include explicit caching? What am I missing?
> > > > >
> > > > > Wouldn't we need roughly just nodeNestloop.c and nodeSubplan.c
> > > > > integration?
> > > >
> > > > hmm, I think you're right there about those two node types.  I'm just
> > > > not sure you're right about overloading these node types to act as a
> > > > cache.
> > >
> > > I'm not 100% either, to be clear.  I am just acutely aware that adding
> > > entire nodes is pretty expensive, and that there's, afaict, no need to
> > > have arbitrary (i.e. pointer to function) type callbacks to point to
> the
> > > cache.
> >
> > Perhaps you're right, but I'm just not convinced of it.  I feel
> > there's a certain air of magic involved in any node that has a good
> > name and reputation for doing one thing that we suddenly add new
> > functionality to which causes it to perform massively differently.
> >
>
> [ my long babble removed]
>
> > I'm wondering if anyone else has any thoughts on this?
>
> Just for anyone following along at home. The two variations would
> roughly look like:
>
> Current method:
>
> regression=# explain (analyze, costs off, timing off, summary off)
> select count(*) from tenk1 t1 inner join tenk1 t2 on
> t1.twenty=t2.unique1;
>   QUERY PLAN
>
> ---
>  Aggregate (actual rows=1 loops=1)
>->  Nested Loop (actual rows=1 loops=1)
>  ->  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
>  ->  Result Cache (actual rows=1 loops=1)
>Cache Key: t1.twenty
>Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
>->  Index Scan using tenk1_unique1 on tenk1 t2 (actual
> rows=1 loops=20)
>  Index Cond: (unique1 = t1.twenty)
> (8 rows)
>
> Andres' suggestion:
>
> regression=# explain (analyze, costs off, timing off, summary off)
> select count(*) from tenk1 t1 inner join tenk1 t2 on
> t1.twenty=t2.unique1;
>   QUERY PLAN
>
> ---
>  Aggregate (actual rows=1 loops=1)
>->  Nested Loop (actual rows=1 loops=1)
>   Cache Key: t1.twenty  Hits: 9980  Misses: 20  Evictions: 0
> Overflows: 0
> ->  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
> ->  Index Scan using tenk1_unique1 on tenk1 t2 (actual rows=1
> loops=20)
>   Index Cond: (unique1 = t1.twenty)
> (6 rows)
>
> and for subplans:
>
> Current method:
>
> regression=# explain (analyze, costs off, timing off, summary off)
> select twenty, (select count(*) from tenk1 t2 where t1.twenty =
> t2.twenty) from tenk1 t1;
>  QUERY PLAN
> -
>  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
>SubPlan 1
>  ->  Result Cache (actual rows=1 loops=1)
>Cache Key: t1.twenty
>Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
>->  Aggregate (actual rows=1 loops=20)
>  ->  Seq Scan on tenk1 t2 (actual rows=500 loops=20)
>Filter: (t1.twenty = twenty)
>Rows Removed by Filter: 9500
> (9 rows)
>
> Andres' suggestion:
>
> regression=# explain (analyze, costs off, timing off, summary off)
> select twenty, (select count(*) from tenk1 t2 where t1.twenty =
> t2.twenty) from tenk1 t1;
>  QUERY PLAN
> -
>  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
>SubPlan 1
> Cache Key: t1.twenty  Hits: 9980  Misses: 20  Evictions: 0  Overflows:
> 0
> ->  Aggregate (actual 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Masahiko Sawada
On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  wrote:
>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
>  wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> > pg_standby  \
> > pg_stat_statements \
> > pg_trgm \
> > +   pg_surgery  \
> > pgcrypto\
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > +   ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +   /*
> > +* We do not mark the buffer dirty or do WAL logging for unmodifed
> > +* pages.
> > +*/
> > +   if (!did_modify_page)
> > +   goto skip_wal;
> > +
> > +   /* Mark buffer dirty before we write WAL. */
> > +   MarkBufferDirty(buf);
> > +
> > +   /* XLOG stuff */
> > +   if (RelationNeedsWAL(rel))
> > +   log_newpage_buffer(buf, true);
> > +
> > +skip_wal:
> > +   END_CRIT_SECTION();
> > +
> >
> > s/unmodifed/unmodified/
> >
>
> okay, will fix this typo.
>
> > Do we really need to use goto? I think we can modify it like follows:
> >
> > if (did_modify_page)
> > {
> >/* Mark buffer dirty before we write WAL. */
> >MarkBufferDirty(buf);
> >
> >/* XLOG stuff */
> >if (RelationNeedsWAL(rel))
> >log_newpage_buffer(buf, true);
> > }
> >
> > END_CRIT_SECTION();
> >
>
> No, we don't need it. We can achieve the same by checking the status
> of did_modify_page flag as you suggested. I will do this change in the
> next version.
>
> > ---
> > pg_force_freeze() can revival a tuple that is already deleted but not
> > vacuumed yet. Therefore, the user might need to reindex indexes after
> > using that function. For instance, with the following script, the last
> > two queries: index scan and seq scan, will return different results.
> >
> > set enable_seqscan to off;
> > set enable_bitmapscan to off;
> > set enable_indexonlyscan to off;
> > create table tbl (a int primary key);
> > insert into tbl values (1);
> >
> > update tbl set a = a + 100 where a = 1;
> >
> > explain analyze select * from tbl where a < 200;
> >
> > -- revive deleted tuple on heap
> > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> >
> > -- index scan returns 2 tuples
> > explain analyze select * from tbl where a < 200;
> >
> > -- seq scan returns 1 tuple
> > set enable_seqscan to on;
> > explain analyze select * from tbl;
> >
>
> I am not sure if this is the right use-case for the pg_force_freeze
> function. I think we should only be running pg_force_freeze
> on a tuple for which VACUUM reports a "found xmin ABC from before
> relfrozenxid PQR" sort of error; otherwise it might worsen things
> instead of making them better.

Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation, but we should at
least show when to use it.

> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either to reset ctid value as
> > well or to not freezing already-deleted tuple.
> >
>
> For this as well, the answer remains the same as above.

Perhaps the same is true when a tuple header, including xmin and ctid,
is corrupted for some reason and the user wants to fix it? I'm
concerned that a live tuple having the wrong ctid will cause a SEGV or
PANIC error in the future.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-18 Thread David Rowley
On Tue, 18 Aug 2020 at 21:42, David Rowley  wrote:
>
> On Tue, 11 Aug 2020 at 17:44, Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2020-08-11 17:23:42 +1200, David Rowley wrote:
> > > On Tue, 11 Aug 2020 at 12:21, Andres Freund  wrote:
> > > >
> > > > On 2020-07-09 10:25:14 +1200, David Rowley wrote:
> > > > > On Thu, 9 Jul 2020 at 04:53, Andres Freund  wrote:
> > > > > > I'm not convinced it's a good idea to introduce a separate executor 
> > > > > > node
> > > > > > for this. There's a fair bit of overhead in them, and they will 
> > > > > > only be
> > > > > > below certain types of nodes afaict. It seems like it'd be better to
> > > > > > pull the required calls into the nodes that do parametrized scans of
> > > > > > subsidiary nodes. Have you considered that?
> > > > >
> > > > > I see 41 different node types mentioned in ExecReScan().  I don't
> > > > > really think it would be reasonable to change all those.
> > > >
> > > > But that's because we dispatch ExecReScan mechanically down to every
> > > > single executor node. That doesn't determine how many nodes would need
> > > > to modify to include explicit caching? What am I missing?
> > > >
> > > > Wouldn't we need roughly just nodeNestloop.c and nodeSubplan.c
> > > > integration?
> > >
> > > hmm, I think you're right there about those two node types.  I'm just
> > > not sure you're right about overloading these node types to act as a
> > > cache.
> >
> > I'm not 100% either, to be clear.  I am just acutely aware that adding
> > entire nodes is pretty expensive, and that there's, afaict, no need to
> > have arbitrary (i.e. pointer to function) type callbacks to point to the
> > cache.
>
> Perhaps you're right, but I'm just not convinced of it.  I feel
> there's a certain air of magic involved in any node that has a good
> name and reputation for doing one thing that we suddenly add new
> functionality to which causes it to perform massively differently.
>

[ my long babble removed]

> I'm wondering if anyone else has any thoughts on this?

Just for anyone following along at home. The two variations would
roughly look like:

Current method:

regression=# explain (analyze, costs off, timing off, summary off)
select count(*) from tenk1 t1 inner join tenk1 t2 on
t1.twenty=t2.unique1;
  QUERY PLAN
---
 Aggregate (actual rows=1 loops=1)
   ->  Nested Loop (actual rows=1 loops=1)
 ->  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
 ->  Result Cache (actual rows=1 loops=1)
   Cache Key: t1.twenty
   Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
   ->  Index Scan using tenk1_unique1 on tenk1 t2 (actual
rows=1 loops=20)
 Index Cond: (unique1 = t1.twenty)
(8 rows)

Andres' suggestion:

regression=# explain (analyze, costs off, timing off, summary off)
select count(*) from tenk1 t1 inner join tenk1 t2 on
t1.twenty=t2.unique1;
  QUERY PLAN
---
 Aggregate (actual rows=1 loops=1)
   ->  Nested Loop (actual rows=1 loops=1)
  Cache Key: t1.twenty  Hits: 9980  Misses: 20  Evictions: 0
Overflows: 0
->  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
->  Index Scan using tenk1_unique1 on tenk1 t2 (actual rows=1 loops=20)
  Index Cond: (unique1 = t1.twenty)
(6 rows)

and for subplans:

Current method:

regression=# explain (analyze, costs off, timing off, summary off)
select twenty, (select count(*) from tenk1 t2 where t1.twenty =
t2.twenty) from tenk1 t1;
 QUERY PLAN
-
 Seq Scan on tenk1 t1 (actual rows=1 loops=1)
   SubPlan 1
 ->  Result Cache (actual rows=1 loops=1)
   Cache Key: t1.twenty
   Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
   ->  Aggregate (actual rows=1 loops=20)
 ->  Seq Scan on tenk1 t2 (actual rows=500 loops=20)
   Filter: (t1.twenty = twenty)
   Rows Removed by Filter: 9500
(9 rows)

Andres' suggestion:

regression=# explain (analyze, costs off, timing off, summary off)
select twenty, (select count(*) from tenk1 t2 where t1.twenty =
t2.twenty) from tenk1 t1;
 QUERY PLAN
-
 Seq Scan on tenk1 t1 (actual rows=1 loops=1)
   SubPlan 1
Cache Key: t1.twenty  Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
->  Aggregate (actual rows=1 loops=20)
  ->  Seq Scan on tenk1 t2 (actual rows=500 loops=20)
Filter: (t1.twenty = twenty)
Rows Removed by Filter: 9500
(7 rows)

I've spoken to one other person off-list about this and they suggested
that they prefer 

RE: [PATCH]Fix ja.po error

2020-08-18 Thread Lu, Chenyang
Ping: sorry, did Alvaro and Peter forget this email? (Maybe they didn't
see it.) I found that the ja.po patch has not been applied to the
Translation Repository.

-Original Message-
From: Michael Paquier  
Sent: Wednesday, July 29, 2020 5:25 PM
To: Lu, Chenyang/陆 晨阳 
Cc: pgsql-hack...@postgresql.org; alvhe...@2ndquadrant.com; 
peter.eisentr...@2ndquadrant.com
Subject: Re: [PATCH]Fix ja.po error

On Wed, Jul 29, 2020 at 08:42:27AM +, Lu, Chenyang wrote:
> When I was using PostgreSQL, I noticed that the output of the Japanese 
> messages was inconsistent with the English messages.
> The Japanese message needs to be modified, so I made the patch.

Indeed, good catch.  This needs to be applied to the translation repository 
first though:
https://git.postgresql.org/gitweb/?p=pgtranslation/messages.git;a=summary

I am adding Alvaro and Peter in CC as they take care of that usually (I don't 
think I have an access to this repo).
--
Michael




Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

2020-08-18 Thread Justin Pryzby
On Fri, Aug 07, 2020 at 02:30:01PM +0200, Pierre Giraud wrote:
> Hi all,
> 
> As far as I understand, in the upcoming version 13, information about
> buffers used during planning is now available in the explain plan.
> 
> […]
>  Planning Time: 0.203 ms
>Buffers: shared hit=14
> […]
> 
> For a matter of consistency, I wonder if it would be possible to format
> it like the following:
> 
> […]
>  Planning:
>Planning Time: 0.203 ms
>Buffers: shared hit=14

Thanks for reporting.  I added this here.
https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
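
For anyone reproducing this, the output under discussion comes from
EXPLAIN with the BUFFERS option on v13; a minimal sketch:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pg_class;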

-- 
Justin




Re: prepared transaction isolation tests

2020-08-18 Thread Andres Freund
Hi,

On 2020-08-18 22:24:20 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > ISTM we should just add an alternative output for disabled prepared
> > xacts, and re-add the test?
> 
> I believe the buildfarm runs the isolation step with "make installcheck",
> so if you're hoping to get buildfarm coverage that way, you're mistaken.

It seems like the buildfarm ought to configure the started server with a
bunch of prepared transactions, in that case? At least going forward?


> Having said that, it'd probably be good if "make check" did run this test.

Yea. It'd at least be run when we do check-world - which at least I do
before nearly every commit.

Greetings,

Andres Freund




Re: prepared transaction isolation tests

2020-08-18 Thread Tom Lane
Andres Freund  writes:
> ISTM we should just add an alternative output for disabled prepared
> xacts, and re-add the test?

I believe the buildfarm runs the isolation step with "make installcheck",
so if you're hoping to get buildfarm coverage that way, you're mistaken.

Having said that, it'd probably be good if "make check" did run this test.

regards, tom lane




prepared transaction isolation tests

2020-08-18 Thread Andres Freund
Hi,

This thread started on committers, at
https://www.postgresql.org/message-id/20200818234532.uiafo5br5lo6zhya%40alap3.anarazel.de

In it I wanted to add an isolation test around prepared transactions:

On 2020-08-18 16:45:32 -0700, Andres Freund wrote:
> I think it's worth adding an isolation test. But it doesn't seem like
> extending prepared-transactions.spec makes too much sense, it doesn't
> fit in well. It's a lot easier to reproduce the issue without
> SERIALIZABLE, for example. Generally the file seems more about
> serializable than 2PC...
>
> So unless somebody disagrees I'm gonna add a new
> prepared-transactions-2.spec.


But I noticed that the already existing prepared transactions test
wasn't in the normal schedule, since:

commit ae55d9fbe3871a5e6309d9b91629f1b0ff2b8cba
Author: Andrew Dunstan 
Date:   2012-07-20 15:51:40 -0400

Remove prepared transactions from main isolation test schedule.

There is no point in running this test when prepared transactions are
disabled, which is the default. New make targets that include the test
are provided. This will save some useless waste of cycles on buildfarm
machines.

Backpatch to 9.1 where these tests were introduced.

diff --git a/src/test/isolation/isolation_schedule 
b/src/test/isolation/isolation_schedule
index 669c0f220c4..2184975dcb1 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -9,7 +9,6 @@ test: ri-trigger
 test: partial-index
 test: two-ids
 test: multiple-row-versions
-test: prepared-transactions
 test: fk-contention
 test: fk-deadlock
 test: fk-deadlock2


The commit confuses me, cause I thought we explicitly enabled prepared
transactions during tests well before that? See

commit 8d4f2ecd41312e57422901952cbad234d293060b
Author: Tom Lane 
Date:   2009-04-23 00:23:46 +

Change the default value of max_prepared_transactions to zero, and add
documentation warnings against setting it nonzero unless active use of
prepared transactions is intended and a suitable transaction manager
has been installed.  This should help to prevent the type of scenario
we've seen several times now where a prepared transaction is forgotten
and eventually causes severe maintenance problems (or even
anti-wraparound shutdown).

The only real reason we had the default be nonzero in the first place
was to support regression testing of the feature.  To still be able to
do that, tweak pg_regress to force a nonzero value during "make check".
Since we cannot force a nonzero value in "make installcheck", add a
variant regression test "expected" file that shows the results that
will be obtained when max_prepared_transactions is zero.

Also, extend the HINT messages for transaction wraparound warnings to
mention the possibility that old prepared transactions are causing the
problem.

All per today's discussion.
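
As a quick sanity check against an installed server, reflecting the
default established by that commit:

SHOW max_prepared_transactions;  -- 0 means prepared xacts are disabled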


And indeed, including the test in the schedule works for make check, not
just an installcheck with explicitly enabled prepared xacts.


ISTM we should just add an alternative output for disabled prepared
xacts, and re-add the test?


My new test, without the alternative output for now, is attached.

Greetings,

Andres Freund
diff --git c/src/test/isolation/expected/prepared-transactions-2.out i/src/test/isolation/expected/prepared-transactions-2.out
new file mode 100644
index 000..f7fd2e7f989
--- /dev/null
+++ i/src/test/isolation/expected/prepared-transactions-2.out
@@ -0,0 +1,119 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_ins_before s1_begin s1_txid s2_txid s1_update_in s1_ins_in s1_show s1_prepare s1_show s1_commit s1_show
+step s1_ins_before: INSERT INTO preptest VALUES ('before');
+step s1_begin: BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
+step s1_txid: SELECT txid_current() IS NULL;
+?column?   
+
+f  
+step s2_txid: SELECT txid_current() IS NULL;
+?column?   
+
+f  
+step s1_update_in: UPDATE preptest SET data = 'in; '|| data;
+step s1_ins_in: INSERT INTO preptest VALUES ('in');
+step s1_show: SELECT * FROM preptest
+data   
+
+in; before 
+in 
+step s1_prepare: PREPARE TRANSACTION 'prep';
+step s1_show: SELECT * FROM preptest
+data   
+
+before 
+step s1_commit: COMMIT PREPARED 'prep';
+step s1_show: SELECT * FROM preptest
+data   
+
+in; before 
+in 
+
+starting permutation: s1_ins_before s1_begin s1_txid s2_txid s1_update_in s1_ins_in s1_show s1_prepare s2_show s1_commit s2_show
+step s1_ins_before: INSERT INTO preptest VALUES ('before');
+step s1_begin: BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
+step s1_txid: SELECT txid_current() IS NULL;
+?column?   
+
+f  
+step s2_txid: SELECT txid_current() IS NULL;
+?column?   
+
+f  
+step s1_update_in: UPDATE preptest SET data = 'in; '|| data;
+step s1_ins_in: INSERT INTO preptest VALUES 

Re: Implementing Incremental View Maintenance

2020-08-18 Thread Tatsuo Ishii
I have looked into this.

> Hi,
> 
> Attached is the rebased patch (v16) to add support for Incremental
> Materialized View Maintenance (IVM). It is able to be applied to
> current latest master branch.
> 
> This also includes the following small fixes:
> 
> - Add a query check for expressions containing aggregates in it
> - [doc] Add description about which situations IVM is effective or not in
> - Improve hint in log messages
> - Reorganize include directives in codes

- make check passed.
- make check-world passed.
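
For context, a minimal example in the syntax proposed by this patch
series, as I understand it (an assumption based on the thread, not
anything committed):

CREATE INCREMENTAL MATERIALIZED VIEW mv AS
    SELECT i, count(*) FROM tbl GROUP BY i;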

- 0004-Allow-to-prolong-life-span-of-transition-tables-unti.patch:
  This one needs a comment to describe what the function does etc.

  +void
  +SetTransitionTablePreserved(Oid relid, CmdType cmdType)
  +{


- 0007-Add-aggregates-support-in-IVM.patch
  "Check if the given aggregate function is supporting" shouldn't be
  "Check if the given aggregate function is supporting IVM"?

+ * check_aggregate_supports_ivm
+ *
+ * Check if the given aggregate function is supporting

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Creating a function for exposing memory usage of backend process

2020-08-18 Thread torikoshia

On 2020-08-18 22:54, Fujii Masao wrote:

On 2020/08/18 18:41, torikoshia wrote:

On 2020-08-17 21:19, Fujii Masao wrote:

On 2020/08/17 21:14, Fujii Masao wrote:

On 2020-08-07 16:38, Kasahara Tatsuhito wrote:
The following review has been posted through the commitfest 
application:

make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:    tested, passed

I tested the latest patch
(0007-Adding-a-function-exposing-memory-usage-of-local-backend.patch)
with the latest PG version
(199cec9779504c08aaa8159c6308283156547409)
and the test passed.
It looks good to me.

The new status of this patch is: Ready for Committer


Thanks for your testing!


Thanks for updating the patch! Here are the review comments.


Thanks for reviewing!



+ 
+  linkend="view-pg-backend-memory-contexts">pg_backend_memory_contexts

+  backend memory contexts
+ 

The above is located just after the pg_matviews entry, but it should be
located just after the pg_available_extension_versions entry, because
the rows in the "System Views" table should be in alphabetical order.


+ 
+  
pg_backend_memory_contexts


Same as above.


Modified both.




+   The view pg_backend_memory_contexts 
displays all

+   the local backend memory contexts.

This description seems a bit confusing because it could wrongly be
interpreted as "... displays the memory contexts of all the local
backends". Thoughts?

What about the following description, instead?


 The view pg_backend_memory_contexts 
displays all
 the memory contexts of the server process attached to the 
current session.


Thanks! it seems better.
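
As an aside, a minimal usage sketch of the view once the patch is
applied, assuming only the columns discussed in this thread (name,
ident, parent, level):

SELECT name, ident, parent, level
  FROM pg_backend_memory_contexts
 ORDER BY level;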


+    const char *name = context->name;
+    const char *ident = context->ident;
+
+    if (context == NULL)
+    return;

The above check "context == NULL" is useless? If "context" is 
actually NULL,
"context->name" would cause segmentation fault, so ISTM that the 
check

will never be performed.

If "context" can be NULL, the check should be performed before 
accessing
to "contect". OTOH, if "context" must not be NULL per the 
specification of

PutMemoryContextStatsTupleStore(), assertion test checking
"context != NULL" should be used here, instead?


Yeah, "context" cannot be NULL because "context" must be 
TopMemoryContext

or it is already checked as not NULL as follows(child != NULL).

I added the assertion check.


Isn't it better to add AssertArg(MemoryContextIsValid(context)), 
instead?


Thanks, that's better.



| for (child = context->firstchild; child != NULL; child = 
child->nextchild)

| {
|  ...
| PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
|   child, 
parentname, level + 1);

| }


Here is another comment.

+ if (parent == NULL)
+ nulls[2] = true;
+ else
+ /*
+  * We labeled dynahash contexts with just the hash 
table name.
+  * To make it possible to identify its parent, we also 
display

+  * parent's ident here.
+  */
+ if (parent->ident && strcmp(parent->name, "dynahash") 
== 0)

+ values[2] = CStringGetTextDatum(parent->ident);
+ else
+ values[2] = CStringGetTextDatum(parent->name);

PutMemoryContextsStatsTupleStore() doesn't need the "parent" memory
context, but uses only the name of the "parent" memory context. So
isn't it better to use "const char *parent" instead of "MemoryContext
parent" as the argument of the function? If we do that, we can simplify
the above code.


Thanks, the attached patch adopted the advice.

However, since PutMemoryContextsStatsTupleStore() used not only the
name but also the ident of the "parent", I could not help adding
similar code before calling the function.
The total amount of code and its complexity do not seem to change much.

Any thoughts? Am I misunderstanding something?


I was thinking that we can simplify the code as follows.
That is, we can just pass "name" as the argument of
PutMemoryContextsStatsTupleStore()
since "name" indicates context->name or ident (if name is "dynahash").

 	for (child = context->firstchild; child != NULL; child = 
child->nextchild)

{
-   const char *parentname;
-
-   /*
-* We labeled dynahash contexts with just the hash table name.
-* To make it possible to identify its parent, we also use
-* the hash table as its context name.
-*/
-   if (context->ident && strcmp(context->name, "dynahash") == 0)
-   parentname = context->ident;
-   else
-   parentname = context->name;
-
PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
- child, 
parentname, level + 1);
+ 

RE: please update ps display for recovery checkpoint

2020-08-18 Thread k.jami...@fujitsu.com
On Wednesday, August 19, 2020 7:53 AM (GMT+9), Justin Pryzby wrote: 

Hi, 

All the patches apply, although when applying them the following appears:
   (Stripping trailing CRs from patch; use --binary to disable.)

> During crash recovery, the server writes this to log:
> 
> < 2020-08-16 08:46:08.601 -03  >LOG:  redo done at 2299C/1EC6BA00
> < 2020-08-16 08:46:08.877 -03  >LOG:  checkpoint starting: end-of-recovery
> immediate
> 
> But runs a checkpoint, which can take a long time, while the "ps"
> display still says "recovering ".
> 
> Please change to say "recovery checkpoint" or similar, as I mentioned here.
> https://www.postgresql.org/message-id/2020011820.GP26045@telsasoft.com

Yes, I agree that it is helpful to tell users about that.

Regarding the 0003 patch, there are similar phrases in bgwriter_flush_after
and backend_flush_after. Should those be updated too?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3170,7 +3170,7 @@ include_dir 'conf.d'
 limit the amount of dirty data in the kernel's page cache, reducing
 the likelihood of stalls when an fsync is issued 
at the end of the
 checkpoint, or when the OS writes data back in larger batches in the
-background.  Often that will result in greatly reduced transaction
+background.  This feature will often result in greatly reduced 
transaction
 latency, but there also are some cases, especially with workloads
 that are bigger than , but smaller
 than the OS's page cache, where performance might degrade.  This


Regards,
Kirk Jamison




please update ps display for recovery checkpoint

2020-08-18 Thread Justin Pryzby
During crash recovery, the server writes this to log:

< 2020-08-16 08:46:08.601 -03  >LOG:  redo done at 2299C/1EC6BA00
< 2020-08-16 08:46:08.877 -03  >LOG:  checkpoint starting: end-of-recovery 
immediate

But runs a checkpoint, which can take a long time, while the "ps" display still
says "recovering ".

Please change to say "recovery checkpoint" or similar, as I mentioned here.
https://www.postgresql.org/message-id/2020011820.gp26...@telsasoft.com

-- 
Justin
From 93c7160f39bf63593e10d7730160668016d7 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 8 Feb 2020 09:16:14 -0600
Subject: [PATCH 1/3] Update PS display following replay of last xlog..

..otherwise it shows "recovering " for the duration of the recovery
checkpoint.
---
 src/backend/access/transam/xlog.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..8e08ce09da 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7765,7 +7765,10 @@ StartupXLOG(void)
   CHECKPOINT_WAIT);
 		}
 		else
+		{
+			set_ps_display("recovery checkpoint");
 			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+		}
 	}
 
 	if (ArchiveRecoveryRequested)
-- 
2.17.0

From 78e6ad73352240849982452ed1fb4cd0b8ed303e Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 18 Jan 2020 13:44:08 -0600
Subject: [PATCH 2/3] Document that checkpoint_flush_after applies to
 end-of-recovery checkpoint

---
 doc/src/sgml/config.sgml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a7177c550..c0b5422599 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3175,6 +3175,8 @@ include_dir 'conf.d'
 that are bigger than , but smaller
 than the OS's page cache, where performance might degrade.  This
 setting may have no effect on some platforms.
+This setting also applies to the checkpoint written at the end of crash
+recovery.
 If this value is specified without units, it is taken as blocks,
 that is BLCKSZ bytes, typically 8kB.
 The valid range is
-- 
2.17.0

From 8f27859d6ed5ee69773171c79cc9c26a18f65a67 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 18 Jan 2020 13:44:33 -0600
Subject: [PATCH 3/3] Avoid ambiguous "that"

---
 doc/src/sgml/config.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c0b5422599..84b40467b3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3170,7 +3170,7 @@ include_dir 'conf.d'
 limit the amount of dirty data in the kernel's page cache, reducing
 the likelihood of stalls when an fsync is issued at the end of the
 checkpoint, or when the OS writes data back in larger batches in the
-background.  Often that will result in greatly reduced transaction
+background.  This feature will often result in greatly reduced transaction
 latency, but there also are some cases, especially with workloads
 that are bigger than , but smaller
 than the OS's page cache, where performance might degrade.  This
-- 
2.17.0



Re: Print logical WAL message content

2020-08-18 Thread Alvaro Herrera
On 2020-Aug-18, Ashutosh Bapat wrote:

> Right now pg_waldump just prints whether the message is transactional
> or not and its size. That doesn't help much to understand the message
> itself. If it prints the contents of a logical WAL message, it helps
> debugging logical replication related problems. Prefix is a
> null-terminated ASCII string, so no problem printing that. Even the
> contents can be printed as a series of hex bytes. Here's a patch to do
> that.

Looks like a good idea.

I didn't like that you're documenting the message format in the new
function:

>   xl_logical_message *xlrec = (xl_logical_message *) rec;
> + /*
> +  * Per LogLogicalMessage() actual logical message follows a 
> null-terminated prefix of length
> +  * prefix_size.

I would prefer to remove this comment, and instead add a comment atop
xl_logical_message's struct definition in message.h to say that the
message has a valid C-string as prefix, whose length is prefix_size, and
please see logicalmesg_desc() if you change this.
This way, you don't need to blame LogLogicalMessage for this
restriction, but it's actually part of the definition of the WAL
message.

> + /*
> +  * Per LogLogicalMessage() actual logical message follows a 
> null-terminated prefix of length
> +  * prefix_size.
> +  */
> + char   *prefix = xlrec->message;
> + char   *message = xlrec->message + xlrec->prefix_size;
> + int cnt;
> + char   *sep = "";

This would cause a crash if the message actually fails to follow the
rule.  Let's test that prefix[xlrec->prefix_size] is a trailing zero,
and if not, avoid printing it.  Although, just Assert()'ing that it's a
trailing zero would seem to suffice.

> + appendStringInfo(buf, "%s message size %zu bytes, prefix %s; 
> mesage: ",
>xlrec->transactional ? 
> "transactional" : "nontransactional",
> -  xlrec->message_size);
> +  xlrec->message_size, prefix);

Misspelled "message", but also the line looks a bit repetitive -- the
word "message" would appear three times:

> lsn: 0/01570608, prev 0/015705D0, desc: MESSAGE nontransactional message size 
> 12 bytes, prefix some_prefix; mesage: 73 6F 6D 65 20 6D 65 73 73 61 67 65

I would reduce it to

> lsn: 0/01570608, prev 0/015705D0, desc: MESSAGE nontransactional, prefix 
> "some_prefix"; payload (12 bytes): 73 6F 6D 65 20 6D 65 73 73 61 67 65

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: BUG #16583: merge join on tables with different DB collation behind postgres_fdw fails

2020-08-18 Thread Tom Lane
I wrote:
>> So I think what is happening here is that postgres_fdw's version of
>> IMPORT FOREIGN SCHEMA translates "COLLATE default" on the remote
>> server to "COLLATE default" on the local one, which of course is
>> a big fail if the defaults don't match.  That allows the local
>> planner to believe that remote ORDER BYs on the two foreign tables
>> will give compatible results, causing the merge join to not work
>> very well at all.

Here's a full patch addressing this issue.  I decided that the best
way to address the test-instability problem is to explicitly give
collations to all the foreign-table columns for which it matters
in the postgres_fdw test.  (For portability's sake, that has to be
"C" or "POSIX"; I mostly used "C".)  Aside from ensuring that the
test still passes with some other prevailing locale, this seems like
a good idea since we'll then be testing the case we are encouraging
users to use.
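
For instance, a foreign-table column declared along these lines (a
minimal sketch; the table and server names are hypothetical) is what
the adjusted tests now exercise:

CREATE FOREIGN TABLE ft1 (
    c1 text COLLATE "C"
) SERVER loopback OPTIONS (table_name 't1');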

And indeed, it immediately turned up a new problem: if we explicitly
assign a collation to a foreign-table column c, the system won't
ship WHERE clauses as simple as "c = 'foo'" to the remote.  This
surprised me, but the reason turned out to be that what postgres_fdw
is actually seeing is something like

   {OPEXPR 
   :opno 98 
   :opfuncid 67 
   :opresulttype 16 
   :opretset false 
   :opcollid 0 
   :inputcollid 950 
   :args (
  {VAR 
  :varno 6 
  :varattno 4 
  :vartype 25 
  :vartypmod -1 
  :varcollid 950 
  :varlevelsup 0 
  :varnosyn 6 
  :varattnosyn 4 
  :location 171
  }
  {RELABELTYPE 
  :arg 
 {CONST 
 :consttype 25 
 :consttypmod -1 
 :constcollid 100 
 :constlen -1 
 :constbyval false 
 :constisnull false 
 :location 341 
 :constvalue 9 [ 36 0 0 0 48 48 48 48 49 ]
 }
  :resulttype 25 
  :resulttypmod -1 
  :resultcollid 950 
  :relabelformat 2 
  :location -1
  }
   )
   :location -1
   }

that is, the constant is being explicitly relabeled with the correct
collation, and thus is_foreign_expr() thinks the collation shown by
the RelabelType node is an unsafely-introduced collation.

What I did about this was to change the recursion rule in
foreign_expr_walker() so that merging a safely-derived collation with
the same collation unsafely derived is considered safe.  I think this
is all right, and it allows us to accept some cases that previously
were rejected as unsafe.  But I might be missing something.

(BTW, there's an independent bug here, which is that we're getting
a tree of the above shape rather than a simple Const with the
appropriate collation; that is, this tree isn't fully const-folded.
This is a bug in canonicalize_ec_expression, which I'll go fix
separately.  But it won't affect the problem at hand.)

This seems like a sufficiently large change in postgres_fdw's
behavior to require review, so I'll go add this to the next CF.

regards, tom lane

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..f0693a061c 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -320,7 +320,9 @@ foreign_expr_walker(Node *node,
 
 /*
  * If the Var is from the foreign table, we consider its
- * collation (if any) safe to use.  If it is from another
+ * collation (if any) safe to use, *unless* it's
+ * DEFAULT_COLLATION_OID.  We treat that as meaning "we don't
+ * know which collation this is".  If it is from another
  * table, we treat its collation the same way as we would a
  * Param's collation, ie it's not safe for it to have a
  * non-default collation.
@@ -342,7 +344,12 @@ foreign_expr_walker(Node *node,
 
 	/* Else check the collation */
 	collation = var->varcollid;
-	state = OidIsValid(collation) ? FDW_COLLATE_SAFE : FDW_COLLATE_NONE;
+	if (collation == InvalidOid)
+		state = FDW_COLLATE_NONE;
+	else if (collation == DEFAULT_COLLATION_OID)
+		state = FDW_COLLATE_UNSAFE;
+	else
+		state = FDW_COLLATE_SAFE;
 }
 else
 {
@@ -808,8 +815,24 @@ foreign_expr_walker(Node *node,
 
 	/*
 	 * Now, merge my collation information into my parent's state.
+	 *
+	 * If one branch of an expression derives a non-default collation safely
+	 * (that is, from a foreign Var) and another one derives the same
+	 * collation unsafely, we can consider the expression safe overall.  This
+	 * allows cases such as "foreign_var = ('foo' COLLATE x)" where x is the
+	 * same collation the foreign_var has anyway.  Note that we will not ship
+	 * any explicit COLLATE clause to the remote, but rely on it to re-derive
+	 * the correct collation based on the foreign_var.
 	 */
-	if (state > outer_cxt->state)
+	if (collation == outer_cxt->collation &&
+		((state == FDW_COLLATE_UNSAFE &&
+		  outer_cxt->state == FDW_COLLATE_SAFE) ||
+		 (state == FDW_COLLATE_SAFE &&
+		  outer_cxt->state 

Re: proposal: enhancing plpgsql debug API - returns text value of variable content

2020-08-18 Thread Pavel Stehule
On Mon, 17 Aug 2020 at 8:40, Pavel Stehule wrote:

> Hi
>
> I am working on tracing support to plpgsql_check
>
> https://github.com/okbob/plpgsql_check
>
> I would like to print the content of variables - and right now, I have
> to go deeper than I would like. I need to distinguish between scalar,
> row, and record variables. PLpgSQL has code for it - but it is private.
>
> The plpgsql debug API has support for expression evaluation - and it is
> working fine, but one needs to know the relevant namespace.
> Unfortunately, plpgsql variables have no info assigned about their
> related namespaces. This increases the work necessary to implement
> conditional breakpoints or just print all variables (and means
> maintaining a lot of plpgsql code outside the plpgsql core).
>
> So my proposals:
>
> 1. enhancing the debug API with a method
>
> char *get_cstring_value(PLpgSQL_variable *var, bool *isnull)
>
> 2. enhancing the PLpgSQL_var structure with the related namespace
> ("struct PLpgSQL_nsitem *ns") and a scope statement, "PLpgSQL_stmt
> *scope" (the statement that limits the scope of the variable's
> visibility). For use in debuggers and tracers it would be nice to have
> info about the kind of variable: function argument, local variable,
> automatic custom variable (FORC), or automatic internal variable
> (SQLERRM, FOUND, TG_OP, ...).
>
> Comments, notes?
>

There are two patches.

The first patch enhances the debug API with two functions, eval_datum and
cast_value, which provide an interface to the functions exec_eval_datum
and do_cast_value. With this API it is easy to take the value of any
PLpgSQL variable (without having to duplicate a lot of plpgsql's code),
and it is easy to transform this value to any expected type - usually a
cast to the text type.

The second patch injects a pointer to the related namespace into every
plpgsql statement. A reference to the namespace is required for building
custom expressions that can be evaluated by the assign_expr function. I
would like to use it for conditional breakpoints or conditional tracing.
Without this patch it is difficult to determine the correct namespace and
ensure correct variable visibility.

Regards

Pavel


> Regards
>
> Pavel
>
>
>
commit 47771637cefa8a283b23bb3b1ba692cbd560403b
Author: ok...@github.com 
Date:   Tue Aug 18 14:00:58 2020 +0200

enhancing dbg plpgsql API

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index d4a3d58daa..294008af88 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -4078,6 +4078,8 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
 	{
 		(*plpgsql_plugin_ptr)->error_callback = plpgsql_exec_error_callback;
 		(*plpgsql_plugin_ptr)->assign_expr = exec_assign_expr;
+		(*plpgsql_plugin_ptr)->eval_datum = exec_eval_datum;
+		(*plpgsql_plugin_ptr)->cast_value = do_cast_value;
 
 		if ((*plpgsql_plugin_ptr)->func_setup)
 			((*plpgsql_plugin_ptr)->func_setup) (estate, func);
diff --git a/src/pl/plpgsql/src/plpgsql.h b/src/pl/plpgsql/src/plpgsql.h
index 0c3d30fb13..bec2429555 100644
--- a/src/pl/plpgsql/src/plpgsql.h
+++ b/src/pl/plpgsql/src/plpgsql.h
@@ -1142,8 +1142,11 @@ typedef struct PLpgSQL_execstate
  *
  * Also, immediately before any call to func_setup, PL/pgSQL fills in the
  * error_callback and assign_expr fields with pointers to its own
- * plpgsql_exec_error_callback and exec_assign_expr functions.  This is
- * a somewhat ad-hoc expedient to simplify life for debugger plugins.
+ * plpgsql_exec_error_callback and exec_assign_expr functions. eval_datum
+ * is assigned the function exec_eval_datum, and cast_value the function
+ * do_cast_value. With the last two functions it is easy to get the content
+ * of any PLpgSQL_datum in any expected type. This is a somewhat ad-hoc
+ * expedient to simplify life for debugger plugins.
  */
 typedef struct PLpgSQL_plugin
 {
@@ -1158,6 +1161,13 @@ typedef struct PLpgSQL_plugin
 	void		(*error_callback) (void *arg);
 	void		(*assign_expr) (PLpgSQL_execstate *estate, PLpgSQL_datum *target,
 PLpgSQL_expr *expr);
+	void		(*eval_datum) (PLpgSQL_execstate *estate, PLpgSQL_datum *datum,
+Oid *typeid, int32 *typetypmod, Datum *value,
+bool *isnull);
+	Datum		(*cast_value) (PLpgSQL_execstate *estate,
+Datum value, bool *isnull,
+Oid valtype, int32 valtypmod,
+Oid reqtype, int32 reqtypmod);
 } PLpgSQL_plugin;
 
 /*
commit 0f82febd5aeddcf5a6e1a916e29d4b9b86bd483f
Author: ok...@github.com 
Date:   Tue Aug 18 17:47:12 2020 +0200

namespace for each statement

diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 5a7e1a..f492c49513 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -912,6 +912,7 @@ stmt_perform	: K_PERFORM expr_until_semi
 		new->cmd_type = PLPGSQL_STMT_PERFORM;
 		new->lineno   = plpgsql_location_to_lineno(@1);
 		new->stmtid = ++plpgsql_curr_compile->nstatements;
+		new->ns = plpgsql_ns_top();
 		new->expr  = $2;
 
 		$$ = (PLpgSQL_stmt *)new;
@@ -926,6 +927,7 

Re: doc review for v13

2020-08-18 Thread Justin Pryzby
I stand by these changes, which I have proposed a handful of times since
April but which have not yet been included in Michael's previous commits.

-- 
Justin
From f029efd79c4ad14ae003ed1a1c692931cdc33f1e Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 30 Mar 2020 19:43:22 -0500
Subject: [PATCH v6 01/10] doc: btree deduplication

commit 0d861bbb702f8aa05c2a4e3f1650e7e8df8c8c27
Author: Peter Geoghegan 
---
 doc/src/sgml/btree.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index d03ee4d6fa..69c1ee0e97 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -642,7 +642,7 @@ options(relopts local_relopts *) returns
   
   
Deduplication works by periodically merging groups of duplicate
-   tuples together, forming a single posting list tuple for each
+   tuples together, forming a single posting list tuple for each
group.  The column key value(s) only appear once in this
representation.  This is followed by a sorted array of
TIDs that point to rows in the table.  This
-- 
2.17.0

>From aff29ab082f8915bd5fdfc9a763e23c864e555e0 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 29 Mar 2020 21:22:43 -0500
Subject: [PATCH v6 02/10] doc: pg_stat_progress_basebackup

commit e65497df8f85ab9b9084c928ff69f384ea729b24
Author: Fujii Masao 
---
 doc/src/sgml/monitoring.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 304c49f07b..ea8780327f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6069,8 +6069,8 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   waiting for checkpoint to finish
   
The WAL sender process is currently performing
-   pg_start_backup to set up for
-   taking a base backup, and waiting for backup start
+   pg_start_backup to prepare to
+   take a base backup, and waiting for the start-of-backup
checkpoint to finish.
   
  
-- 
2.17.0

>From 0949708420623878995ef82d26fed2348edad941 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 5 Apr 2020 22:35:09 -0500
Subject: [PATCH v6 03/10] Fix docs: "time of the"

commit 9e257a181cc1dc5e19eb5d770ce09cc98f470f5f
Author: Andrew Dunstan 
Date:   Sun Mar 24 11:27:20 2013 -0400

Add parallel pg_dump option.

commit 5ab892c391c6bc54a00e7a8de5cab077cabace6a
Author: Michael Paquier 
Date:   Sat Jul 27 22:21:18 2019 +0900

Add support for --jobs in reindexdb

commit a17923204736d8842eade3517d6a8ee81290fca4
Author: Alvaro Herrera 
Date:   Fri Jan 23 15:02:45 2015 -0300

vacuumdb: enable parallel mode
---
 doc/src/sgml/ref/pg_dump.sgml   | 2 +-
 doc/src/sgml/ref/reindexdb.sgml | 4 ++--
 doc/src/sgml/ref/vacuumdb.sgml  | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7a37fd8045..77029defc5 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -322,7 +322,7 @@ PostgreSQL documentation
   

 Run the dump in parallel by dumping njobs
-tables simultaneously. This option reduces the time of the dump but it also
+tables simultaneously. This option may reduce the time needed to perform the dump but it also
 increases the load on the database server. You can only use this option with the
 directory output format because this is the only output format where multiple processes
 can write their data at the same time.
diff --git a/doc/src/sgml/ref/reindexdb.sgml b/doc/src/sgml/ref/reindexdb.sgml
index 026fd018d9..4f821c2095 100644
--- a/doc/src/sgml/ref/reindexdb.sgml
+++ b/doc/src/sgml/ref/reindexdb.sgml
@@ -174,8 +174,8 @@ PostgreSQL documentation

 Execute the reindex commands in parallel by running
 njobs
-commands simultaneously.  This option reduces the time of the
-processing but it also increases the load on the database server.
+commands simultaneously.  This option may reduce the processing time
+but it also increases the load on the database server.


 reindexdb will open
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 95d6894cb0..7b13950552 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -155,8 +155,8 @@ PostgreSQL documentation

 Execute the vacuum or analyze commands in parallel by running
 njobs
-commands simultaneously.  This option reduces the time of the
-processing but it also increases the load on the database server.
+commands simultaneously.  This option may reduce the processing time
+but it also increases the load on the database server.


 vacuumdb will open
-- 
2.17.0

>From 0457a0038e2c4fc473dbdbbeb3536426b5939bcd Mon Sep 17 00:00:00 2001
From: Justin Pryzby 

Re: use pg_get_functiondef() in pg_dump

2020-08-18 Thread Tom Lane
Stephen Frost  writes:
> So..  the code that's in pg_dump today works to go from "whatever the
> connected server's version is" to "whatever the version is of the
> pg_dump command itself".  If we had the code in libpgcommon, and
> functions in the backend to get at it along with psql having that code,
> you could then, using the code we have today, go from a bunch of
> 'source' versions to 'target' version of either the version of the psql
> command, or that of the server.

At this point, I think I need a high-power telescope even to see the
goalposts :-(

If we actually want to do something like this, we need a plan not just
some handwaving.  Let's start by enumerating the concerns that would
have to be solved.  I can think of:

* Execution context.  Stephen seems to be envisioning code that could be
compiled into the backend not just the frontend, but is that really worth
the trouble?  Could we share such code across FE/BE at all (it'd certainly
be a far more ambitious exercise in common code than we've done to date)?
What's the backend version actually doing, issuing queries over SPI?
(I suppose if you were rigid about that, it could offer a guarantee
that the results match your snapshot, which is pretty attractive.)

* Global vs. per-object activity.  pg_dump likes to query the entire state
of the database to start with, and then follow up by grabbing additional
details about objects it's going to dump.  That's not an operating mode
that most other clients would want, but if for no other reason than
performance, I don't think we can walk away from it for pg_dump ---
indeed, I think pg_dump probably needs to be fixed to do less per-object
querying, not more.  Meanwhile applications such as psql \d would only
want to investigate one object at a time.  What design can we create that
will handle that?  If there is persistent state involved, what in the
world does that mean for the case of a backend-side library?

* Context in which the output is valid.  Target server version was already
mentioned, but a quick examination of pg_dump output scripts will remind
you that there's a bunch more assumptions:

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'en_US.utf8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

not to mention special hackery for object ownership and tablespaces.
Some of these things probably don't matter for other use-cases, but
others definitely do.  In particular, I really doubt that psql and
other clients would find it acceptable to force search_path to a
particular thing.  Which brings us to

* Security.  How robust do the output commands need to be, and
what will we have to do that pg_dump doesn't need to?

* No doubt there are some other topics I didn't think of.

This certainly would be attractive if we had it, but the task
seems dauntingly large.  It's not going to happen without some
fairly serious investment of time.

regards, tom lane




Re: Nicer error when connecting to standby with hot_standby=off

2020-08-18 Thread Fujii Masao




On 2020/08/01 5:18, James Coleman wrote:

On Wed, Jul 29, 2020 at 11:24 AM Fujii Masao
 wrote:




On 2020/04/03 22:49, James Coleman wrote:

On Thu, Apr 2, 2020 at 5:53 PM David Zhang  wrote:


The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I applied the patch to the latest master branch and run a test below. The error 
messages have been separated. Below is the test steps.

### setup primary server
initdb -D /tmp/primary/data
mkdir /tmp/archive_dir
echo "archive_mode='on'" >> /tmp/primary/data/postgresql.conf
echo "archive_command='cp %p /tmp/archive_dir/%f'" >> 
/tmp/primary/data/postgresql.conf
pg_ctl -D /tmp/primary/data -l /tmp/primary-logs start

### setup host standby server
pg_basebackup -p 5432 -w -R -D /tmp/hotstandby
echo "primary_conninfo='host=127.0.0.1 port=5432 user=pgdev'" >> 
/tmp/hotstandby/postgresql.conf
echo "restore_command='cp /tmp/archive_dir/%f %p'" >> 
/tmp/hotstandby/postgresql.conf
echo "hot_standby = off" >> /tmp/hotstandby/postgresql.conf
pg_ctl -D /tmp/hotstandby -l /tmp/hotstandby-logs -o "-p 5433" start

### keep trying to connect to hot standby server in order to get the error 
messages in different stages.
while true; do echo "`date`"; psql postgres -p 5433 -c "SELECT
txid_current_snapshot();"; sleep 0.2; done

### before the patch
psql: error: could not connect to server: FATAL:  the database system is 
starting up
...

### after the patch, got different messages, one message indicates hot_standby 
is off
psql: error: could not connect to server: FATAL:  the database system is 
starting up
...
psql: error: could not connect to server: FATAL:  the database system is up, 
but hot_standby is off
...


Thanks for the review and testing!


Thanks for the patch! Here is the comment from me.

+   else if (!FatalError && pmState == PM_RECOVERY)
+   return CAC_STANDBY; /* connection disallowed on non-hot 
standby */

Even if hot_standby is enabled, pmState seems to indicate PM_RECOVERY
until recovery has reached a consistent state. No? So, if my understanding
is right, "FATAL:  the database system is up, but hot_standby is off" can
be logged even when hot_standby is on. Which sounds very confusing.


That's a good point. I've attached a corrected version.


Thanks for updating the patch! But it failed to be applied to the master branch
cleanly because of the recent commit 0038f94387. So could you update the patch
again?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Alvaro Herrera
On 2020-Aug-17, Ashutosh Sharma wrote:

> > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > +   ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> 
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.

It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in.  We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.
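
To make the recommended sequence concrete, it would look roughly like this
(function names and signatures as proposed upthread, so they may still
change):

    -- freeze (or kill) the damaged tuples by TID, then vacuum the table
    SELECT heap_force_freeze('damaged_tbl'::regclass,
                             ARRAY['(0,1)', '(0,2)']::tid[]);
    VACUUM (DISABLE_PAGE_SKIPPING) damaged_tbl;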

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Yet another fast GiST build

2020-08-18 Thread Pavel Borisov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I consider this patch almost ready for commit with minor corrections (see 
previous message)

The new status of this patch is: Waiting on Author


Re: track_planning causing performance regression

2020-08-18 Thread Hamid Akhtar
On Tue, Aug 18, 2020 at 8:43 PM Fujii Masao 
wrote:

>
> > Yes, I pushed the document_overhead_by_track_planning.patch, but this
> > CF entry is for pgss_lwlock_v1.patch which replaces spinlocks with
> lwlocks
> > in pg_stat_statements. The latter patch has not been committed yet.
> > Probably attaching the different patches in the same thread would cause
> > this confusing thing... Anyway, thanks for your comment!
>
> To avoid further confusion, I attached the rebased version of
> the patch that was registered at CF. I'd appreciate it if
> you review this version.
>

Thank you. Reviewing it now.


>
> Regards,
>
> --
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL: +923335449950  EMAIL: hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: track_planning causing performance regression

2020-08-18 Thread Fujii Masao



Yes, I pushed the document_overhead_by_track_planning.patch, but this
CF entry is for pgss_lwlock_v1.patch which replaces spinlocks with lwlocks
in pg_stat_statements. The latter patch has not been committed yet.
Probably attaching the different patches in the same thread would cause
this confusing thing... Anyway, thanks for your comment!


To avoid further confusion, I attached the rebased version of
the patch that was registered at CF. I'd appreciate it if
you review this version.
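
For anyone trying to reproduce the regression, note that the plan-time
counters are only populated once the GUC is enabled; a minimal check using
the existing pg_stat_statements columns looks like:

    SET pg_stat_statements.track_planning = on;
    SELECT query, plans, total_plan_time, calls, total_exec_time
      FROM pg_stat_statements
     ORDER BY total_plan_time DESC
     LIMIT 5;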

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
b/contrib/pg_stat_statements/pg_stat_statements.c
index 6b91c62c31..3035c14ed5 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -39,7 +39,7 @@
  * in an entry except the counters requires the same.  To look up an entry,
  * one must hold the lock shared.  To read or update the counters within
  * an entry, one must hold the lock shared or exclusive (so the entry doesn't
- * disappear!) and also take the entry's mutex spinlock.
+ * disappear!) and also take the entry's partition lock.
  * The shared state variable pgss->extent (the next free spot in the external
  * query-text file) should be accessed only while holding either the
  * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
@@ -115,6 +115,11 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 
 #define JUMBLE_SIZE            1024    /* query serialization buffer size */
 
+#define PGSS_NUM_LOCK_PARTITIONS()  (pgss_max)
+#define PGSS_HASH_PARTITION_LOCK(key)   \
+   (&(pgss->base + \
+      (get_hash_value(pgss_hash, key) % PGSS_NUM_LOCK_PARTITIONS()))->lock)
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -207,7 +212,7 @@ typedef struct pgssEntry
 Size        query_offset;   /* query text offset in external file */
 int         query_len;      /* # of valid bytes in query string, or -1 */
 int         encoding;       /* query text encoding */
-slock_t     mutex;          /* protects the counters only */
+LWLock     *lock;           /* protects the counters only */
 } pgssEntry;
 
 /*
@@ -216,6 +221,7 @@ typedef struct pgssEntry
 typedef struct pgssSharedState
 {
 LWLock     *lock;               /* protects hashtable search/modification */
+LWLockPadded *base;             /* base address of this LWLock */
 double      cur_median_usage;   /* current median usage in hashtable */
 Size        mean_query_len;     /* current mean entry text length */
 slock_t     mutex;              /* protects following fields only: */
@@ -468,7 +474,8 @@ _PG_init(void)
 * resources in pgss_shmem_startup().
 */
RequestAddinShmemSpace(pgss_memsize());
-   RequestNamedLWLockTranche("pg_stat_statements", 1);
+   RequestNamedLWLockTranche("pg_stat_statements",
+                             PGSS_NUM_LOCK_PARTITIONS() + 1);
 
/*
 * Install hooks.
@@ -547,7 +554,8 @@ pgss_shmem_startup(void)
if (!found)
{
/* First time through ... */
-       pgss->lock = &(GetNamedLWLockTranche("pg_stat_statements"))->lock;
+   pgss->base = GetNamedLWLockTranche("pg_stat_statements");
+   pgss->lock = &(pgss->base + pgss_max)->lock;
pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
pgss->mean_query_len = ASSUMED_LENGTH_INIT;
        SpinLockInit(&pgss->mutex);
@@ -1384,14 +1392,14 @@ pgss_store(const char *query, uint64 queryId,
if (!jstate)
{
/*
-        * Grab the spinlock while updating the counters (see comment about
-        * locking rules at the head of the file)
+        * Grab the partition lock while updating the counters (see comment
+        * about locking rules at the head of the file)
 */
volatile pgssEntry *e = (volatile pgssEntry *) entry;
 
Assert(kind == PGSS_PLAN || kind == PGSS_EXEC);
 
-       SpinLockAcquire(&e->mutex);
+   LWLockAcquire(e->lock, LW_EXCLUSIVE);
 
/* "Unstick" entry if it was previously sticky */
if (IS_STICKY(e->counters))
@@ -1443,7 +1451,7 @@ pgss_store(const char *query, uint64 queryId,
e->counters.wal_fpi += walusage->wal_fpi;
e->counters.wal_bytes += walusage->wal_bytes;
 
-       SpinLockRelease(&e->mutex);
+   LWLockRelease(e->lock);
}
 
 done:
@@ -1770,9 +1778,9 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
  

Re: [patch] demote

2020-08-18 Thread Jehan-Guillaume de Rorthais
Hi,

Please find in attachment v5 of the patch set rebased on master after various
conflicts.

Regards,

On Wed, 5 Aug 2020 00:04:53 +0200
Jehan-Guillaume de Rorthais  wrote:

> Demote now keeps backends with no active xid alive. Smart mode keeps all
> backends: it waits for them to finish their xact and enter read-only. Fast
> mode terminates backends with an active xid and keeps all other ones.
> Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> (check recovery state) once demoted.
> During demote, no new session is allowed.
> 
> As backends with no active xid survive, a new SQL admin function
> "pg_demote(fast bool, wait bool, wait_seconds int)" had been added.
> "pg_demote(fast bool, wait bool, wait_seconds int)" has been added.
> Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie().
> The resulting refactoring makes the code much simpler, cleaner, with better
> isolation of actions from the code point of view.
> 
> Thanks to the refactoring, the patch now only adds one state to the state
> machine: PM_DEMOTING. A second one could be use to replace:
> 
> /* Demoting: start the Startup Process */
> if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)
> 
> with eg.:
> 
> if (pmState == PM_DEMOTED)
> 
> I believe it might be a bit simpler to understand, but the existing comment
> might be good enough as well. The full state machine path for demote is:
> 
>  PM_DEMOTING  /* wait for active xid backend to finish */
>  PM_SHUTDOWN  /* wait for checkpoint shutdown and its 
>  various shutdown tasks */
>  PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */
>  PM_STARTUP
> 
> Tests in "recovery/t/021_promote-demote.pl" grow from 13 to 24 tests,
> adding tests on backend behaviors during demote and new function pg_demote().
> 
> On my todo:
> 
> * cancel running checkpoint for fast demote ?
> * forbid demote when PITR backup is in progress
> * user documentation
> * Robert's concern about snapshot during hot standby
> * anything else reported to me
> 
> Plus, I might be able to split the backend part and their signals of the patch
> 0002 in its own patch if it helps the review. It would apply after 0001 and
> before actual 0002.
> 
> As there was no consensus and the discussions seemed to conclude this patch
> set should keep growing to see where it goes, I wonder if/when I should add it
> to the commitfest. Advice? Opinion?
>From 90e26a7e2d53f7a8436de0b73eb57498f884de9d Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 31 Jul 2020 10:58:40 +0200
Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsert local
 variable

Adds extern functions LocalSetXLogInsertNotAllowed() and
LocalSetXLogInsertCheckRecovery() to set the local variable
LocalXLogInsertAllowed respectively to 0 and -1.

These functions are declared as extern for future need in
the demote patch.

Function LocalSetXLogInsertAllowed() already exists and is
declared as static as it is not needed outside of xlog.c.
---
 src/backend/access/transam/xlog.c | 27 +++
 src/include/access/xlog.h |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..c0d79f192c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7711,7 +7711,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
+	LocalSetXLogInsertCheckRecovery();
 
 	if (InRecovery)
 	{
@@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+/*
+ * Make XLogInsertAllowed() return false in the current process only.
+ */
+void
+LocalSetXLogInsertNotAllowed(void)
+{
+	LocalXLogInsertAllowed = 0;
+}
+
+/*
+ * Make XLogInsertCheckRecovery() return false in the current process only.
+ */
+void
+LocalSetXLogInsertCheckRecovery(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
+			LocalSetXLogInsertCheckRecovery(); /* return to "check" state */
 		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
+			LocalSetXLogInsertNotAllowed(); /* never again write WAL */
 	}
 
 	/*
@@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void)
 
 	END_CRIT_SECTION();
 
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
+	LocalSetXLogInsertCheckRecovery();	/* return to "check" state */
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..8c9cadc6da 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,8 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool 

Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-18 Thread Bruce Momjian
On Tue, Aug 18, 2020 at 11:38:38AM +0200, Magnus Hagander wrote:
> On Tue, Aug 18, 2020 at 11:24 AM Dave Page  wrote:
> Not necessarily. It's likely that not all indexes use ICU collations, and
> you still save time loading what may be large amounts of data.
> 
> I agree though, that it *could* be slow.
> 
> I agree it definitely could, but I'm not sure I see any case where it would
> actually be slower than the alternative (which would be dump/reload).

Well, given that pg_upgrade is more complex to run than pg_dump/reload,
you then have to weigh the complexity of using pg_upgrade with index
rebuild vs. the simpler pg_dump.  Right now, you know pg_upgrade in link
mode is going to be fast, but with the reindex, you have a much more
complex decision to make.
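
For what it's worth, the set of candidate indexes can be estimated up front.
A sketch (this only finds indexes with explicitly recorded dependencies on
ICU collations; indexes relying on an ICU default database collation would
not show up):

    SELECT DISTINCT d.objid::regclass AS index_name
      FROM pg_depend d
      JOIN pg_collation coll ON d.refclassid = 'pg_collation'::regclass
                            AND d.refobjid = coll.oid
      JOIN pg_class rel ON d.classid = 'pg_class'::regclass
                       AND d.objid = rel.oid
     WHERE coll.collprovider = 'i'
       AND rel.relkind = 'i';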

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: psql: add \si, \sm, \st and \sr functions to show CREATE commands for indexes, matviews, triggers and tables

2020-08-18 Thread Tom Lane
a.pervush...@postgrespro.ru writes:
> [ si_st_sm_sr_v2.patch ]

I hadn't particularly noticed this thread before, but I happened to
look through this patch, and I've got to say that this proposed feature
seems like an absolute disaster from a maintenance standpoint.  There
will be no value in an \st command that is only 90% accurate; the produced
DDL has to be 100% correct.  This means that, if we accept this feature,
psql will have to know everything pg_dump knows about how to construct the
DDL describing tables, indexes, views, etc.  That is a lot of code, and
it's messy, and it changes nontrivially on a very regular basis.  I can't
accept that we want another copy in psql --- especially one that looks
nothing like what pg_dump has.

There've been repeated discussions about somehow extracting pg_dump's
knowledge into a library that would also be available to other client
programs (see e.g. the concurrent thread at [1]).  That's quite a tall
order, which is why it's not happened yet.  But I think we really need
to have something like that before we can accept this feature for psql.

BTW, as an example of why this is far more difficult than it might
seem at first glance, this patch doesn't even begin to meet the
expectation stated at the top of describe.c:

 * Support for the various \d ("describe") commands.  Note that the current
 * expectation is that all functions in this file will succeed when working
 * with servers of versions 7.4 and up.  It's okay to omit irrelevant
 * information for an old server, but not to fail outright.

It might be okay for this to cut off at 8.0 or so, as I think pg_dump
does, but not to just fail on older servers.

Another angle, which I'm not even sure how we want to think about it, is
security.  It will not do for "\et" to allow some attacker to replace
function calls appearing in the table's CHECK constraints, for instance.
So this means you've got to be very aware of CVE-2018-1058-style attacks.
Our answer to that for pg_dump has partially depended on restricting the
search_path used at both dump and restore time ... but I don't think \et
gets to override the search path that the psql user is using.  I'm not
sure what that means in practice but it certainly requires some thought
before we add the feature, not after.

Anyway, I can see the attraction of having psql commands like these,
but "write a bunch of new code that we'll have to maintain" does not
seem like a desirable way to get them.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/9df8a3d3-13d2-116d-26ab-6a273c1ed38c%402ndquadrant.com




Re: Creating a function for exposing memory usage of backend process

2020-08-18 Thread Fujii Masao




On 2020/08/18 18:41, torikoshia wrote:

On 2020-08-17 21:19, Fujii Masao wrote:

On 2020/08/17 21:14, Fujii Masao wrote:

On 2020-08-07 16:38, Kasahara Tatsuhito wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:    tested, passed

I tested the latest
patch(0007-Adding-a-function-exposing-memory-usage-of-local-backend.patch)
with the latest PG-version (199cec9779504c08aaa8159c6308283156547409)
and test was passed.
It looks good to me.

The new status of this patch is: Ready for Committer


Thanks for your testing!


Thanks for updating the patch! Here are the review comments.


Thanks for reviewing!



+ 
+  pg_backend_memory_contexts
+  backend memory contexts
+ 

The above is located just after pg_matviews entry. But it should be located
just after pg_available_extension_versions entry. Because the rows in the table
"System Views" should be located in alphabetical order.


+ 
+  pg_backend_memory_contexts

Same as above.


Modified both.




+   The view pg_backend_memory_contexts displays all
+   the local backend memory contexts.

This description seems a bit confusing because maybe we can interpret this
as "... displays the memory contexts of all the local backends" wrongly. 
Thoughts?
What about the following description, instead?



 The view pg_backend_memory_contexts displays all
 the memory contexts of the server process attached to the current session.


Thanks! it seems better.


+    const char *name = context->name;
+    const char *ident = context->ident;
+
+    if (context == NULL)
+    return;

The above check "context == NULL" is useless? If "context" is actually NULL,
"context->name" would cause segmentation fault, so ISTM that the check
will never be performed.

If "context" can be NULL, the check should be performed before accessing
to "contect". OTOH, if "context" must not be NULL per the specification of
PutMemoryContextStatsTupleStore(), assertion test checking
"context != NULL" should be used here, instead?


Yeah, "context" cannot be NULL because "context" must be TopMemoryContext
or it is already checked as not NULL, as follows (child != NULL).

I added the assertion check.


Isn't it better to add AssertArg(MemoryContextIsValid(context)), instead?




| for (child = context->firstchild; child != NULL; child = child->nextchild)
| {
|  ...
| PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
|   child, parentname, 
level + 1);
| }


Here is another comment.

+ if (parent == NULL)
+ nulls[2] = true;
+ else
+ /*
+  * We labeled dynahash contexts with just the hash table name.
+  * To make it possible to identify its parent, we also display
+  * parent's ident here.
+  */
+ if (parent->ident && strcmp(parent->name, "dynahash") == 0)
+ values[2] = CStringGetTextDatum(parent->ident);
+ else
+ values[2] = CStringGetTextDatum(parent->name);

PutMemoryContextsStatsTupleStore() doesn't need "parent" memory context,
but uses only the name of "parent" memory context. So isn't it better to use
"const char *parent" instead of "MemoryContext parent", as the argument of
the function? If we do that, we can simplify the above code.


Thanks, the attached patch adopted the advice.

However, since PutMemoryContextsStatsTupleStore() used not only the name
but also the ident of the "parent", I could not help but adding similar
codes before calling the function.
The total amount of codes and complexity seem not to change so much.

Any thoughts? Am I misunderstanding something?


I was thinking that we can simplify the code as follows.
That is, we can just pass "name" as the argument of 
PutMemoryContextsStatsTupleStore()
since "name" indicates context->name or ident (if name is "dynahash").

for (child = context->firstchild; child != NULL; child = 
child->nextchild)
{
-   const char *parentname;
-
-   /*
-* We labeled dynahash contexts with just the hash table name.
-* To make it possible to identify its parent, we also use
-* the hash table as its context name.
-*/
-   if (context->ident && strcmp(context->name, "dynahash") == 0)
-   parentname = context->ident;
-   else
-   parentname = context->name;
-
PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
- child, 
parentname, level + 1);
+ child, name, 
level + 1);
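
Once this is in, the view can be queried like any other system view, for
example (column names are per the patch under review and may still change):

    SELECT name, parent, level, total_bytes, used_bytes
      FROM pg_backend_memory_contexts
     ORDER BY total_bytes DESC
     LIMIT 10;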

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and 

Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-08-18 Thread Amit Kapila
On Tue, Aug 18, 2020 at 1:37 PM Bharath Rupireddy
 wrote:
>
> On Tue, Jul 14, 2020 at 1:20 PM Dilip Kumar  wrote:
> >
> > On Mon, Jul 13, 2020 at 4:23 PM Amit Kapila  wrote:
> >
> > > I think we can do more than this by
> > > parallelizing the Insert part of this query as well as we have lifted
> > > group locking restrictions related to RelationExtension and Page lock
> > > in PG13.  It would be really cool to do that unless we see any
> > > fundamental problems with it.
> >
> > +1, I think it will be cool to support for insert part here as well as
> > insert part in 'Create Table As Select..' as well.
> >
>
> +1 to parallelize inserts. Currently, ExecInsert() and CTAS use
> table_tuple_insert(), if we parallelize these parts, each worker will
> be inserting it's tuples(one tuple at a time) into the same data page,
> until space is available, if not a new data page can be obtained by
> any of the worker, others might start inserting into it. This way,
> will there be lock contention on data pages?
>

It is possible but we need to check how much that is a bottleneck
because that should not be a big part of the operation. And, it won't
be any worse than inserts via multiple backends. I think it is
important to do that way, otherwise, some of the pages can remain
half-empty.

Right now, the plan for Insert ... Select is like
Insert on <target_table>
   ->  Seq Scan on <source_table>

In the above the scan could be index scan as well. What we want is:
Gather
  -> Insert on <target_table>
       ->  Seq Scan on <source_table>

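That is, for a statement like the one below (table names are just
illustrative), the goal is to run the Insert node itself under Gather
rather than only the scan:

    INSERT INTO target_tbl
    SELECT * FROM source_tbl WHERE val > 10;
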
>. Do we also need to make
> inserts to use table_multi_insert() (like the way "COPY" uses) instead
> of table_tuple_insert()?
>

I am not sure at this stage but if it turns out to be a big problem
then we might think of inventing some way to allow individual workers
to operate on different pages. I think even without that we should be
able to make a big gain as reads, filtering, etc can be done in
parallel.

-- 
With Regards,
Amit Kapila.




Re: use pg_get_functiondef() in pg_dump

2020-08-18 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > I'm sure there's a lot of folks who'd like to see more of the logic we
> > have in pg_dump for building objects from the catalog available to more
> > tools through libpgcommon- psql being one of the absolute first
> > use-cases for exactly that (there's certainly no shortage of people
> > who've asked how they can get a CREATE TABLE statement for a table by
> > using psql...).
> 
> I count myself among those folks (see
> https://www.postgresql.org/message-id/CADkLM%3DfxfsrHASKk_bY_A4uomJ1Te5MfGgD_rwwQfV8wP68ewg%40mail.gmail.com
> for
> discussion of doing DESCRIBE and SHOW CREATE-ish functions either on server
> side or client side).
> 
> I'm all for having this as "just" as set of pg_get_*def functions, because
> they allow for the results to be used in queries. Granted, the shape of the
> result set may not be stable, but that's the sort of thing we can warn for
> the same way we have warnings for changes to pg_stat_activity. At that
> point any DESCRIBE/SHOW CREATE server side functions essentially become
> just shells around the pg_get_*def(), with no particular requirement to
> make those new commands work inside a SELECT.

Another advantage of having this in libpgcommon is that the backend
*and* the frontend could then use it.

> Would it be totally out of left field to have the functions have an
> optional "version" parameter, defaulted to null, that would be used to give
> backwards compatible results if and when we do make a breaking change?

So..  the code that's in pg_dump today works to go from "whatever the
connected server's version is" to "whatever the version is of the
pg_dump command itself".  If we had the code in libpgcommon, and
functions in the backend to get at it along with psql having that code,
you could then, using the code we have today, go from a bunch of
'source' versions to 'target' version of either the version of the psql
command, or that of the server.

That is, consider a future where this is all done and all that crazy
version-specific code in pg_dump has been moved to libpgcommon in v14,
and then you have a v15 psql, so:

psql v15 connected to PG v14:

You do: \dct mytable -- psql internal command to get 'create table'
Result: a CREATE TABLE that works for v15

You do: DESCRIBE mytable; -- PG backend function to get 'create table'
Result: a CREATE TABLE that works for v14

Without having to add anything to what we're already doing (yes, yes,
beyond the complications of moving this stuff into libpgcommon, but at
least we're not having to create some kind of matrix of "source PG
version 10, target PG version 12" into PG14).

A bit crazy, sure, but would certainly be pretty useful.

Thanks,

Stephen




Documentation patch for backup manifests in protocol.sgml

2020-08-18 Thread Bernd Helmle
Hi,

protocol.sgml describes the protocol messages sent in response to a
BASE_BACKUP streaming command, but doesn't say anything about the additional
CopyResponse data message containing the contents of the backup
manifest (if requested) after the tar files have been received. So I
propose the attached to give a little more detail in this paragraph.

Thanks, Bernd

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 8b00235a516..31918144b37 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2665,8 +2665,10 @@ The commands accepted in replication mode are:
   ustar interchange format specified in the POSIX 1003.1-2008
   standard) dump of the tablespace contents, except that the two trailing
   blocks of zeroes specified in the standard are omitted.
-  After the tar data is complete, a final ordinary result set will be sent,
-  containing the WAL end position of the backup, in the same format as
+  After the tar data is complete, and if a backup manifest was requested,
+  another CopyResponse result is sent, containing the manifest data for the
+  current base backup. In any case, a final ordinary result set will be
+  sent, containing the WAL end position of the backup, in the same format as
   the start position.
  
 


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Rajkumar Raghuwanshi
Thanks for the suggestions, Ashutosh. I have done testing around these
suggestions and found no issues. I will continue the same testing with the
updated patch posted on this thread.

On Fri, Aug 7, 2020 at 12:45 PM Ashutosh Sharma 
wrote:

> Thanks Rajkumar for testing the patch.
>
> Here are some of the additional test-cases that I would suggest you to
> execute, if possible:
>
> 1) You may try running the test-cases that you have executed so far
> with SR setup and see if the changes are getting reflected on the
> standby.
>
> 2) You may also try running some concurrent test-cases for e.g. try
> running these functions with VACUUM or some other sql commands
> (preferable DML commands) in parallel.
>
> 3) See what happens when you pass some invalid tids (containing
> invalid block or offset number) to these functions. You may also try
> running these functions on the same tuple repeatedly and see the
> behaviour.
>
> ...
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com


Thanks & Regards,
Rajkumar Raghuwanshi


Re: jsonb, collection & postgres_fdw

2020-08-18 Thread Ashutosh Bapat
On Mon, Aug 17, 2020 at 7:32 PM Bharath Rupireddy
 wrote:
>
> On Fri, Aug 14, 2020 at 12:46 PM Konstantin Knizhnik 
>  wrote:
> >
> > Right now postgres_fdw treat as shippable only builtin functions or
> > functions from extensions explicitly specified as shippable extensions
> > in parameters of this FDW server. So I do no see a problem here. Yes,
> > foreign server may have different version of Postgres which doesn't have
> > this built-in function or its profile is different. It can happen if
> > postgres_fdw is used to connect two different servers which are
> > maintained independently. But in most cases I think, postgres_fdw is
> > used to organize some kind of cluster. In this case all nodes are
> > identical (hardware, OS, postgres version) and performance is very
> > critical (because scalability is one of the goals of replacing a single
> > node with a cluster).
> > This is why push down of predicates is very critical in this case.
> >
>
> Agree, push down of predicates(with functions) to the remote backend helps a 
> lot. But, is it safe to push all the functions? For instance, functions that 
> deal with time/time zones, volatile functions etc. I'm not exactly sure 
> whether we will have some issues here. Since postgres_fdw can also be used 
> for independently maintained postgres servers(may be with different 
> versions), we must have a mechanism to know the compatibility.
>
> >
> >  From my point of view, it will be nice to have flag in postgres_fdw
> > server indicating that foreign and remote servers are identical
> > and treat all functions as shippable in this case (not only built-in
> > ones are belonging to explicitly specified shippable extensions).
> > It will simplify using postres_fdw in clusters and makes it more efficient.
> >
>
> I think it's better not to have a flag for this. As we have to deal with the 
> compatibility not only at the server version level, but also at each function 
> level. We could have something like a configuration file which allows the 
> user to specify the list of functions that are safely pushable to remote in 
> his/her own postgres_fdw setup, and let the postgres_fdw refer this 
> configuration file, while checking the pushability of the functions to 
> remote. This way, the user has some control over what's pushed and what's 
> not. Of course, this pushability check can only happen after the mandatory 
> checks happening currently such as remote backend configuration settings such 
> as collations etc.

I agree with most of this. We need a way for a user to tell us which
function is safe to be executed on the foreign server (not just
postgres_fdw, but other kinds of FDWs as well). But maintaining that
as a configurable file and associating safety with an FDW isn't
sufficient. We should maintain that as a catalog. A function may be
safe to push down based on the FDW (a given function always behaves in
the same way on any of the servers of an FDW as its peer locally), or
may be associated with a server (a function is available and behaves
same as its local peer on certain server/s but not all). Going further
a local function may map to a function with a different name on the
remote server/fdw, so that same catalog may maintain the function
mapping. An FDW may decide to cache relevant information, update the
catalog using IMPORT FOREIGN SCHEMA(or ROUTINE), or add some defaults
when installing the extension.

More details need to be worked out, but here are my initial thoughts on this.
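
For comparison, the only user-visible knob today is postgres_fdw's
per-server "extensions" option, which declares the immutable functions and
operators of the listed extensions shippable (the server name here is just
an example):

    ALTER SERVER remote_srv OPTIONS (ADD extensions 'cube,seg');

A catalog as described above would generalize this to per-function, and
possibly per-server, granularity.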


-- 
Best Wishes,
Ashutosh Bapat




Re: Libpq support to connect to standby server as priority

2020-08-18 Thread Greg Nancarrow
Hi Peter,

I have updated the patch (attached) based on your comments, with
adjustments made for additional changes based on feedback (which I
tend to agree with) from Robert Haas and Tsunakawa san, who suggested
read-write/read-only should be functionally different to
primary/standby, and not just have "read-write" a synonym for
"primary".
I also thought it appropriate to remove "read-write", "standby" and
"prefer-standby" from accepted values for "target_server_type"
(instead just support "secondary" and "prefer-secondary") to match the
similar targetServerType PGJDBC option.
So currently have as supported option values:

target_session_attrs:
any/read-write/read-only/primary/standby(/secondary)/prefer-standby(/prefer-secondary)
target_server_type: any/primary/secondary/prefer-secondary
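
As an illustration, with the option values above a client preferring a
standby but accepting a primary as a fallback would use a connection string
like this (target_server_type and its values are as proposed in this patch
version):

    host=host1,host2 port=5432,5432 dbname=postgres target_server_type=prefer-secondary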

See my responses to your review comments below:

>GENERAL COMMENT 1 ("any")
>
>"any" should be included as valid option for target_server_type.
>
>IIUC target_server_type was added mostly to have better alignment with JDBC 
>options.
>Both Vladimir [1] and Dave [2] already said that JDBC does have an "any" 
>option.
>[1] - 
>https://www.postgresql.org/message-id/CAB%3DJe-FwOVE%3D8gR1UDDZRnWZR65fRG40e8zW_U_6mnUqbce68g%40mail.gmail.com
>[2] - 
>https://www.postgresql.org/message-id/CADK3HHJ9316ji7L-97cJBY%3Dwp4E3ddPMn8XdkNz6j8d9u0OhmQ%40mail.gmail.com
>
>Furthermore, the fe-connect.c function makeEmptyPGConn sets default:
>+   conn->requested_server_type = SERVER_TYPE_ANY;
>This means the default type of target_server_type is "any".
>Since this is default, it should also be possible to assign the same value to 
>explicitly.
>
>(Parts of the v17 patch affected by this are itemised below)


GN RESPONSE: After checking the PGJDBC source and previous comments, I agree.
Have updated the patch to allow "any" for target_server_type.



>GENERAL COMMENT 2 (Removal of pg_is_in_recovery)
>
>Around 22/3/2019 Hari added a lot of pg_is_in_recovery code in his patch 0006 
>[1]
>[1] - 
>https://www.postgresql.org/message-id/CAJrrPGd4YeA%2BN%3DxC%2B1XPVoGzMCATJZY4irVQEJ6i0aPqorUi7g%40mail.gmail.com
>
>Much later IIUC the latest v17 patch has taken onboard the recommendation from 
>Alvaro and removed all that code:
>"I would discard the whole thing about checking "SELECT pg_is_in_recovery()"" 
>[2]
>[2] - 
>https://www.postgresql.org/message-id/20191227130828.GA21647%40alvherre.pgsql
>
>However, it seems that not ALL parts of the original code got cleanly removed 
>in v17.
>There are a number of references to CONNECTION_CHECK_RECOVERY and 
>pg_is_in_recovery still lurking.
>
>(Parts of the v17 patch affected by this are itemised below)
>

GN RESPONSE: Agree. The calling code was removed but somehow the
CONNECTION_CHECK_RECOVERY case block (and enum) was not removed. Also,
part of the documentation was not updated, for the case where the
server version is prior to 14.
I have updated the patch to correct this.



>COMMENT libpq.sgml (para blocks)
>
>+   
>
>The v17 patch for target_session_attrs and target_server_type help is 
>currently using  blocks for each of the possible >option values.
>This format is inconsistent document style with other variables in this SGML.
>Other places are using sub-lists for option values. e.g. look at "six modes" 
>of sslmode.

GN RESPONSE: True, but this was the case BEFORE the patch, and these
options are more complex than ones where sub-lists for option values
are used - there needs to be common explanation of what the option
synonyms are, and how the behaviour is version dependent, so it
doesn't really lend itself to simple list items, that would need to
cross-reference other list items.



>COMMENT libpq.sgml (cut/paste parameter description)
>
>I don't think that target_server_type help should be just a cut/paste 
>duplicate of  target_session_attrs. It is confusing >because it leaves the 
>reader doubting the purpose of having such a duplication.
>
>Suggest to simplify the target_server_type help like as follows:
>--
>target_server_type
>The purpose of this parameter is to reflect the similar PGJDBC 
>targetServerType.
>The supported options are same as target_session_attrs.
>This parameter overrides any connection type specified by target_session_attrs.
>--

GN RESPONSE: Agree. Will update documentation, though with some
modifications to the wording because of changes in supported option
values already mentioned, and target_session_attrs could contain
non-server-type options in the future.




>COMMENT libpq.sgml (pg_is_in_recovery)
>
>(As noted in GENERAL COMMENT 2 there are still residual references to 
>pg_is_in_recovery)
>
>+   
>+If this parameter is set to standby, only a 
>connection in which
>+the server is in recovery mode is considered acceptable. If the 
>server is prior to version 14,
>+the query SELECT pg_is_in_recovery() will be sent 
>upon any successful
>+connection; if it returns t, it means the server 
>is in recovery 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Chris Travers
On Tue, Jul 14, 2020 at 12:28 AM Peter Geoghegan  wrote:

> On Mon, Jul 13, 2020 at 2:12 PM Robert Haas  wrote:
> > 1. There's nothing to identify the tuple that has the problem, and no
> > way to know how many more of them there might be. Back-patching
> > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > part of this.
>
> I am in favor of backpatching such changes in cases where senior
> community members feel that it could help with hypothetical
> undiscovered data corruption issues -- if they're willing to take
> responsibility for the change. It certainly wouldn't be the first
> time. A "defense in depth" mindset seems like the right one when it
> comes to data corruption bugs. Early detection is really important.
>
> > Moreover, not everyone is as
> > interested in an extended debugging exercise as they are in getting
> > the system working again, and VACUUM failing repeatedly is a pretty
> > serious problem.
>
> That's absolutely consistent with my experience. Most users want to
> get back to business as usual now, while letting somebody else do the
> hard work of debugging.
>

Also even if you do trace the problem you still have to recover.

And sometimes I have found latent corruption from times when dbs were
running on older versions and older servers, making debugging largely a
futile exercise.

>
> > Therefore, one of my colleagues has - at my request - created a couple
> > of functions called heap_force_kill() and heap_force_freeze() which
> > take an array of TIDs.
>
> > So I have these questions:
> >
> > - Do people think it would me smart/good/useful to include something
> > like this in PostgreSQL?
>
> I'm in favor of it.
>

+1

Would be worth extending it with some functions to grab rows that have
various TOAST oids too.

>
> > - If so, how? I would propose a new contrib module that we back-patch
> > all the way, because the VACUUM errors were back-patched all the way,
> > and there seems to be no advantage in making people wait 5 years for a
> > new version that has some kind of tooling in this area.
>
> I'm in favor of it being *possible* to backpatch tooling that is
> clearly related to correctness in a fundamental way. Obviously this
> would mean that we'd be revising our general position on backpatching
> to allow some limited exceptions around corruption. I'm not sure that
> this meets that standard, though. It's hardly something that we can
> expect all that many users to be able to use effectively.
>
> I may be biased, but I'd be inclined to permit it in the case of
> something like amcheck, or pg_visibility, on the grounds that they're
> more or less the same as the new VACUUM errcontext instrumentation you
> mentioned. The same cannot be said of something like this new
> heap_force_kill() stuff.
>
> > - Any ideas for additional things we should include, or improvements
> > on the sketch above?
>
> Clearly you should work out a way of making it very hard to
> accidentally (mis)use. For example, maybe you make the functions check
> for the presence of a sentinel file in the data directory.
>

Agreed.

>
>
> --
> Peter Geoghegan
>
>
>

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Ashutosh Sharma
Attached is the new version of patch that addresses the comments from
Asim Praveen and Masahiko-san. It also improves the documentation to
some extent.


On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma  wrote:
>
> Hello Masahiko-san,
>
> I've spent some more time trying to understand the code in
> lazy_scan_heap function to know under what all circumstances a VACUUM
> can fail with "found xmin ... before relfrozenxid ..." error for a
> tuple whose xmin is behind relfrozenxid. Here are my observations:
>
> 1) It can fail with this error for a live tuple
>
> OR,
>
> 2) It can also fail with this error if a tuple (that went through
> update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.
>
> OR,
>
> 3) If there are any concurrent transactions, then the tuple might be
> marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
> or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
> this error.
>
> Now, AFAIU, as we will be dealing with a damaged table, the chances of
> point #3 being the cause of this error looks impossible in our case
> because I don't think we will be doing anything in parallel when
> performing surgery on a damaged table, in fact we shouldn't be doing
> any such things. However, it is quite possible that reason #2 could
> cause VACUUM to fail with this sort of error, but, as we are already
> skipping redirected item pointers in heap_force_common(), I think, we
> would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
> see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
> may not need to handle point #2 as well.
>
> Further, I also don't see VACUUM reporting this error for a tuple that
> has been moved from one partition to another. So, I think we might not
> need to do any special handling for a tuple that got updated and its
> new version was moved to another partition.
>
> If you feel I am missing something here, please correct me. Thank you.
>
> Moreover, while I was exploring on above, I noticed that in
> lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
> for a redirected item pointers and if any redirected item pointer is
> detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
> HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
> with HEAP_HOT_UPDATED. I am referring to the following code in
> lazy_scan_heap().
>
> for (offnum = FirstOffsetNumber;
>  offnum <= maxoff;
>  offnum = OffsetNumberNext(offnum))
> {
> ItemId  itemid;
>
> itemid = PageGetItemId(page, offnum);
>
> .
> .
>
>
> /* Redirect items mustn't be touched */ <-- this check
> would bypass the redirected item pointers from being checked for
> HeapTupleSatisfiesVacuum.
> if (ItemIdIsRedirected(itemid))
> {
> hastup = true;  /* this page won't be truncatable */
> continue;
> }
>
> ..
> ..
>
> switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
> {
> case HEAPTUPLE_DEAD:
>
> if (HeapTupleIsHotUpdated(&tuple) ||
> HeapTupleIsHeapOnly(&tuple) ||
> params->index_cleanup == VACOPT_TERNARY_DISABLED)
> nkeep += 1;
> else
> tupgone = true; /* we can delete the tuple */
> ..
> ..
>  }
>
>
> So, the point is, would HeapTupleIsHotUpdated() ever be true?
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
>
> On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma  
> wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for the review. Please check the comments inline below:
> >
> > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> >  wrote:
> >
> > > Thank you for updating the patch! Here are my comments on v5 patch:
> > >
> > > --- a/contrib/Makefile
> > > +++ b/contrib/Makefile
> > > @@ -35,6 +35,7 @@ SUBDIRS = \
> > > pg_standby  \
> > > pg_stat_statements \
> > > pg_trgm \
> > > +   pg_surgery  \
> > > pgcrypto\
> > >
> > > I guess we use alphabetical order here. So pg_surgery should be placed
> > > before pg_trgm.
> > >
> >
> > Okay, will take care of this in the next version of patch.
> >
> > > ---
> > > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > > +   ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> > >
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-08-18 Thread Anna Akenteva

On 2020-07-13 14:21, Daniel Gustafsson wrote:
This patch require some rewording of documentation/comments and 
variable names
after the language change introduced by 229f8c219f8f..a9a4a7ad565b, the 
thread

below can be used as reference for how to change:

https://www.postgresql.org/message-id/flat/20200615182235.x7lch5n6kcjq4aue%40alap3.anarazel.de



Thank you for the heads up!

I updated the most recent patch and removed the use of "master" from it, 
replacing it with "primary".
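
For reference, with the grammar in the attached patch the feature would be
used on a standby like this (the LSN and timeout values are just examples):

    BEGIN WAIT FOR LSN '0/3F0000A8' TIMEOUT 10000;
    -- queries that must see all effects up to that LSN
    COMMIT;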


--
Anna Akenteva
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 1021b166c0cb4e4017d503eb481e787638842ab1
Author: Akenteva Anna 
Date:   Tue Aug 18 12:37:16 2020 +0300

wait for lsn

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index c23bbfb4e71..fbfadbac8e9 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR LSN lsn_value [TIMEOUT number_of_milliseconds ] ]
 
 where transaction_mode is one of:
 
@@ -63,6 +63,17 @@ BEGIN [ WORK | TRANSACTION ] [ transaction_mode
was executed.
   
+
+  
+   The WAIT FOR clause allows waiting for the target log
+   sequence number (LSN) to be replayed on standby before
+   starting the transaction in PostgreSQL databases
+   with primary-standby asynchronous replication. Wait time can be limited by
+   specifying a timeout, which is measured in milliseconds and must be a positive
+   integer. If LSN was not reached before timeout, transaction
+   doesn't begin. Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server.
+  
  
 
  
@@ -146,6 +157,10 @@ BEGIN;
different purpose in embedded SQL. You are advised to be careful
about the transaction semantics when porting database applications.
   
+
+  
+   There is no WAIT FOR clause in the SQL standard.
+  
  
 
  
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index d6cd1d41779..016e59930df 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] [ WAIT FOR LSN lsn_value [TIMEOUT number_of_milliseconds ] ]
 
 where transaction_mode is one of:
 
@@ -40,6 +40,17 @@ START TRANSACTION [ transaction_mode was executed. This is the same
as the  command.
   
+
+  
+   The WAIT FOR clause allows to wait for the target log
+   sequence number (LSN) to be replayed on standby before
+   starting the transaction in PostgreSQL databases
+   with primary-standby asynchronous replication. Wait time can be limited by
+   specifying a timeout, which is measured in milliseconds and must be a positive
+   integer. If LSN was not reached before timeout, transaction
+   doesn't begin. Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server.
+  
  
 
  
@@ -78,6 +89,10 @@ START TRANSACTION [ transaction_mode
 
+  
+   There is no WAIT FOR clause in the SQL standard.
+  
+
   
See also the compatibility section of .
   
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae4..19cb04869c1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/controldata_utils.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
@@ -7385,6 +7386,15 @@ StartupXLOG(void)
 	break;
 }
 
+/*
+ * If we replayed an LSN that someone was waiting for,
+ * set latches in shared memory array to notify the waiter.
+ */
+if (XLogCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+{
+	WaitSetLatch(XLogCtl->lastReplayedEndRecPtr);
+}
+
 /* Else, try to fetch the next WAL record */
 record = ReadRecord(xlogreader, LOG, false);
 			} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index d4815d3ce65..9b310926c12 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -57,6 +57,7 @@ OBJS = \
 	user.o \
 	vacuum.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 000..e1123df321e
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,282 @@
+/*-
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * 

Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-18 Thread Julien Rouhaud
On Tue, Aug 18, 2020 at 11:39 AM Magnus Hagander  wrote:
>
> On Tue, Aug 18, 2020 at 11:24 AM Dave Page  wrote:
>>
>> On Mon, Aug 17, 2020 at 7:23 PM Bruce Momjian  wrote:
>>>
>>> On Mon, Aug 17, 2020 at 04:55:13PM +0100, Dave Page wrote:
>> I wasn't aware that ICU had the concept of collation versions internally 
>> (which Michael seems to have confirmed downthread). That would potentially 
>> make the number of users needing a reindex even smaller, but as you point 
>> out won't help us for years as we don't store it anyway.
>
> It does -- and we track it in pg_collation at this point.
>
> I think the part that Michael is referring to is we don't track enough 
> details on a per-index basis. The suggested changes (in the separate thread) 
> are to get rid of it from pg_collation and move it to a per-object dependency.
>
> (And fwiw contains a patch to pg_upgrade to at least give it the ability to 
> for all old indexes say "i know that my icu is compatible". But yeah, the 
> full functionality won't be available until upgrading *from* 14)

Indeed, when upgrading from something older than 14, all indexes would
be marked as depending on an unknown collation version, i.e. treated as
possibly corrupted.




Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-18 Thread Thomas Kellerer
Magnus Hagander schrieb am 18.08.2020 um 11:38:
> It might be a slightly larger percentage on Windows who use it, but
> I'm willing to bet it's still quite low.

I have seen increasingly many questions around ICU collations on Windows,
because people migrating from SQL Server to Postgres very often keep
Windows as the operating system, and they want to reproduce SQL Server's
case-insensitivity (at least partially) using ICU collations.

Thomas






Re: Implementing Incremental View Maintenance

2020-08-18 Thread Yugo NAGATA
Hi,

Attached is the rebased patch (v16) adding support for Incremental
Materialized View Maintenance (IVM). It applies to the current master
branch.

This also includes the following small fixes:

- Add a check for queries with expressions containing aggregates
- [doc] Add a description of the situations in which IVM is and is not effective
- Improve hints in log messages
- Reorganize include directives in the code
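
For anyone trying the patch set, basic usage follows the syntax proposed
by these patches (the table and view names here are hypothetical):

    CREATE TABLE emp (empno int, deptno int);
    CREATE INCREMENTAL MATERIALIZED VIEW emp_counts AS
        SELECT deptno, count(*) FROM emp GROUP BY deptno;
    -- Subsequent INSERT/UPDATE/DELETE on emp updates emp_counts
    -- immediately, without a REFRESH MATERIALIZED VIEW.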

Regards,
Yugo Nagata

-- 
Yugo NAGATA 


IVM_patches_v16.tar.gz
Description: application/gzip


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-18 Thread David Rowley
On Tue, 11 Aug 2020 at 17:44, Andres Freund  wrote:
>
> Hi,
>
> On 2020-08-11 17:23:42 +1200, David Rowley wrote:
> > On Tue, 11 Aug 2020 at 12:21, Andres Freund  wrote:
> > >
> > > On 2020-07-09 10:25:14 +1200, David Rowley wrote:
> > > > On Thu, 9 Jul 2020 at 04:53, Andres Freund  wrote:
> > > > > I'm not convinced it's a good idea to introduce a separate executor 
> > > > > node
> > > > > for this. There's a fair bit of overhead in them, and they will only 
> > > > > be
> > > > > below certain types of nodes afaict. It seems like it'd be better to
> > > > > pull the required calls into the nodes that do parametrized scans of
> > > > > subsidiary nodes. Have you considered that?
> > > >
> > > > I see 41 different node types mentioned in ExecReScan().  I don't
> > > > really think it would be reasonable to change all those.
> > >
> > > But that's because we dispatch ExecReScan mechanically down to every
> > > single executor node. That doesn't determine how many nodes would need
> > > to modify to include explicit caching? What am I missing?
> > >
> > > Wouldn't we need roughly just nodeNestloop.c and nodeSubplan.c
> > > integration?
> >
> > hmm, I think you're right there about those two node types.  I'm just
> > not sure you're right about overloading these node types to act as a
> > cache.
>
> I'm not 100% either, to be clear.  I am just acutely aware that adding
> entire nodes is pretty expensive, and that there's, afaict, no need to
> have arbitrary (i.e. pointer to function) type callbacks to point to the
> cache.

Perhaps you're right, but I'm just not convinced of it.  I feel
there's a certain air of magic involved in any node that has a good
name and reputation for doing one thing that we suddenly add new
functionality to which causes it to perform massively differently.

A counterexample to your argument is that Materialize is a node type.
There's only a limited number of places where that node is used. One of
those places can be on the inside of a non-parameterized nested loop.
Your argument of having Nested Loop do caching would also indicate
that Materialize should be part of Nested Loop instead of a node
itself. There's a few other places Materialize is used, e.g scrollable
cursors, but in that regard, you could say that the caching should be
handled in ExecutePlan(). I just don't think it should be, as I don't
think Result Cache should be part of any other node or code.

Another problem I see with overloading nodeSubplan and nodeNestloop is
that we don't really document our executor nodes today, so unless this
patch starts a new standard for that, there's not exactly a good place
to mention that parameterized nested loops may now cache results from
the inner scan.

I do understand what you mean with the additional node overhead. I saw
that in my adventures of INNER JOIN removals a few years ago.  I hope it
helps that I've tried to code the planner so that, for nested loops, it
only uses a Result Cache node when it thinks it'll speed things up.
That decision is of course based on having good statistics, which
might not be the case. I don't quite have that luxury with subplans
due to lack of knowledge of the outer plan when planning the subquery.
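
For a concrete (made-up) illustration of the plan shape in question: with
a parameterized nested loop such as

    EXPLAIN SELECT *
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;

where the inner side is an index scan on customers.id and orders repeats
the same customer_id many times, a Result Cache node sitting between the
join and the inner scan can hand back the cached inner rows instead of
rescanning the index for every outer row. The tables here are
hypothetical, and whether the planner actually picks this shape depends
on the statistics, as mentioned above.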

> > How would you inform users via EXPLAIN ANALYZE of how many
> > cache hits/misses occurred?
>
> Similar to how we display memory for sorting etc.

I was more thinking of how bizarre it would be to see Nested Loop and
SubPlan report cache statistics.  It may appear quite magical to users
to see EXPLAIN ANALYZE mentioning that their Nested Loop is now
reporting something about cache hits.

> > What would you use to disable it for an
> > escape hatch for when the planner makes a bad choice about caching?
>
> Isn't that *easier* when embedding it into the node? There's no nice way
> to remove an intermediary executor node entirely, but it's trivial to
> have an if statement like
> if (node->cache && upsert_cache(node->cache, param))

What I meant, rather, is that it might not make sense to keep the
enable_resultcache GUC if the caching were part of the existing nodes.
I think people are pretty used to the enable_* GUCs corresponding to
an executor whose name roughly matches the name of the GUC. In this
case, without a Result Cache node, enable_resultcache would not assist
in self-documenting.   However, perhaps 2 new GUCs instead,
enable_nestloop_caching and enable_subplan_caching.   We're currently
short of any other enable_* GUCs that are node modifiers. We did have
enable_hashagg_disk until a few weeks ago. Nobody seemed to like that,
but perhaps there were other reasons for people not to like it other
than it was a node modifier GUC.

I'm wondering if anyone else has any thoughts on this?

David




Re: Creating a function for exposing memory usage of backend process

2020-08-18 Thread torikoshia

On 2020-08-17 21:19, Fujii Masao wrote:

On 2020/08/17 21:14, Fujii Masao wrote:

On 2020-08-07 16:38, Kasahara Tatsuhito wrote:
The following review has been posted through the commitfest 
application:

make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:    tested, passed

I tested the latest
patch(0007-Adding-a-function-exposing-memory-usage-of-local-backend.patch)
with the latest PG-version 
(199cec9779504c08aaa8159c6308283156547409)

and test was passed.
It looks good to me.

The new status of this patch is: Ready for Committer


Thanks for your testing!


Thanks for updating the patch! Here are the review comments.


Thanks for reviewing!



+ 
+  linkend="view-pg-backend-memory-contexts">pg_backend_memory_contexts

+  backend memory contexts
+ 

The above is located just after the pg_matviews entry. But it should be
located just after the pg_available_extension_versions entry, because the
rows in the table "System Views" should be in alphabetical order.


+ 
+  pg_backend_memory_contexts

Same as above.


Modified both.




+   The view pg_backend_memory_contexts displays all
+   the local backend memory contexts.

This description seems a bit confusing because we could wrongly interpret
it as "... displays the memory contexts of all the local backends".
Thoughts?

What about the following description, instead?

     The view pg_backend_memory_contexts displays all
     the memory contexts of the server process attached to the current
     session.


Thanks! It seems better.


+    const char *name = context->name;
+    const char *ident = context->ident;
+
+    if (context == NULL)
+    return;

The above check "context == NULL" is useless? If "context" is actually NULL,
"context->name" would cause a segmentation fault, so ISTM that the check
will never be performed.

If "context" can be NULL, the check should be performed before 
accessing
to "contect". OTOH, if "context" must not be NULL per the 
specification of

PutMemoryContextStatsTupleStore(), assertion test checking
"context != NULL" should be used here, instead?


Yeah, "context" cannot be NULL because "context" must be 
TopMemoryContext

or it is already checked as not NULL as follows(child != NULL).

I added the assertion check.

| for (child = context->firstchild; child != NULL; child = child->nextchild)
| {
|  ...
|     PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
|                                      child, parentname, level + 1);
| }


Here is another comment.

+ if (parent == NULL)
+     nulls[2] = true;
+ else
+     /*
+      * We labeled dynahash contexts with just the hash table name.
+      * To make it possible to identify its parent, we also display
+      * parent's ident here.
+      */
+     if (parent->ident && strcmp(parent->name, "dynahash") == 0)
+         values[2] = CStringGetTextDatum(parent->ident);
+     else
+         values[2] = CStringGetTextDatum(parent->name);

PutMemoryContextsStatsTupleStore() doesn't need "parent" memory context,
but uses only the name of "parent" memory context. So isn't it better to use
"const char *parent" instead of "MemoryContext parent", as the argument of
the function? If we do that, we can simplify the above code.


Thanks, the attached patch adopted the advice.

However, since PutMemoryContextsStatsTupleStore() used not only the name
but also the ident of the "parent", I could not help adding similar code
before calling the function.
The total amount of code and its complexity do not seem to change much.

Any thoughts? Am I misunderstanding something?
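
As a usage sketch against the view added by the patch (using only the
columns discussed above -- name, parent and level):

    SELECT name, parent, level
    FROM pg_backend_memory_contexts
    ORDER BY level, name;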


Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

From 055af903a3dbf146d97dd3fb01a6a7d3d3bd2ae0 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 18 Aug 2020 18:17:42 +0900
Subject: [PATCH] Add a function exposing memory usage of local backend.
 pg_get_backend_memory_contexts which exposes memory usage of the local
 backend. It also adds a new view pg_backend_memory_contexts for exposing
 local backend memory contexts.

---
 doc/src/sgml/catalogs.sgml   | 122 ++
 src/backend/catalog/system_views.sql |   3 +
 src/backend/utils/mmgr/mcxt.c| 147 +++
 src/include/catalog/pg_proc.dat  |   9 ++
 src/test/regress/expected/rules.out  |  10 ++
 5 files changed, 291 insertions(+)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index fc329c5cff..1232b24e74 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9226,6 +9226,11 @@ SCRAM-SHA-256$iteration count:
   available versions of extensions
  
 
+ 
+  pg_backend_memory_contexts
+  backend memory contexts
+ 
+
  
   pg_config
   compile-time configuration parameters
@@ -9577,6 

Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-18 Thread Magnus Hagander
On Tue, Aug 18, 2020 at 11:24 AM Dave Page  wrote:

>
>
> On Mon, Aug 17, 2020 at 7:23 PM Bruce Momjian  wrote:
>
>> On Mon, Aug 17, 2020 at 04:55:13PM +0100, Dave Page wrote:
>> > That was more if the installer actually handles the whole chain. It
>> clearly
>> > doesn't today (since it doesn't support upgrades), I agree this
>> might
>> > definitely be overkill. But then also I don't really see the
>> problem with
>> > just putting a new version of ICU in with the newer versions of
>> PostgreSQL.
>> > That's just puts the user in the same position as they are with any
>> other
>> > platform wrt manual pg_upgrade runs.
>> >
>> > Well we can certainly do that if everyone is happy in the knowledge
>> that it'll
>> > mean pg_upgrade users will need to reindex if they've used ICU
>> collations.
>> >
>> > Sandeep; can you have someone do a test build with the latest ICU
>> please (for
>> > background, this would be with the Windows and Mac installers)? If noone
>> > objects, we can push that into the v13 builds before GA. We'd also need
>> to
>> > update the README if we do so.
>>
>> Woh, we don't have any support in pg_upgrade to reindex just indexes
>> that use ICU collations, and frankly, if they have to reindex, they
>> might decide that they should just do pg_dump/reload of their cluster at
>> that point because pg_upgrade is going to be very slow, and they will be
>> surprised.
>
>
> Not necessarily. It's likely that not all indexes use ICU collations, and
> you still save time loading what may be large amounts of data.
>
> I agree though, that it *could* be slow.
>

I agree it definitely could, but I'm not sure I see any case where it would
actually be slower than the alternative (which would be dump/reload).

I'm also willing to say that given (1) the Windows installers don't
provide a way to do it automatically, and (2) the "command-line challenge"
of running pg_upgrade on Windows in general, I bet there's a larger
percentage of users who are using dump/reload rather than pg_upgrade on
Windows than on other platforms already in the first place.



> I can see a lot more people being disappointed by this than
>> will be happy to have Postgres using a newer ICU library.
>>
>
> Quite possibly, hence my hesitation to push ahead with anything more than
> a simple test build at this time.
>

My guess would be in the other direction :) But in particular, the vast
majority probably don't care at all, because they're not using ICU
collations.

It might be a slightly larger percentage on Windows who use it, but I'm
willing to bet it's still quite low.


Also, is it the ICU library version we should be tracking for reindex,
>> or each _collation_ version?  If the later, do we store the collation
>> version for each index?
>>
>
> I wasn't aware that ICU had the concept of collation versions internally
> (which Michael seems to have confirmed downthread). That would potentially
> make the number of users needing a reindex even smaller, but as you point
> out won't help us for years as we don't store it anyway.
>

It does -- and we track it in pg_collation at this point.

I think the part that Michael is referring to is we don't track enough
details on a per-index basis. The suggested changes (in the separate
thread) are to get rid of it from pg_collation and move it to a per-object
dependency.

(And fwiw contains a patch to pg_upgrade to at least give it the ability to
for all old indexes say "i know that my icu is compatible". But yeah, the
full functionality won't be available until upgrading *from* 14)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: EDB builds Postgres 13 with an obsolete ICU version

2020-08-18 Thread Dave Page
On Mon, Aug 17, 2020 at 7:23 PM Bruce Momjian  wrote:

> On Mon, Aug 17, 2020 at 04:55:13PM +0100, Dave Page wrote:
> > That was more if the installer actually handles the whole chain. It
> clearly
> > doesn't today (since it doesn't support upgrades), I agree this might
> > definitely be overkill. But then also I don't really see the problem
> with
> > just putting a new version of ICU in with the newer versions of
> PostgreSQL.
> > That's just puts the user in the same position as they are with any
> other
> > platform wrt manual pg_upgrade runs.
> >
> > Well we can certainly do that if everyone is happy in the knowledge that
> it'll
> > mean pg_upgrade users will need to reindex if they've used ICU
> collations.
> >
> > Sandeep; can you have someone do a test build with the latest ICU please
> (for
> > background, this would be with the Windows and Mac installers)? If noone
> > objects, we can push that into the v13 builds before GA. We'd also need
> to
> > update the README if we do so.
>
> Woh, we don't have any support in pg_upgrade to reindex just indexes
> that use ICU collations, and frankly, if they have to reindex, they
> might decide that they should just do pg_dump/reload of their cluster at
> that point because pg_upgrade is going to be very slow, and they will be
> surprised.


Not necessarily. It's likely that not all indexes use ICU collations, and
you still save time loading what may be large amounts of data.

I agree though, that it *could* be slow.
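
For what it's worth, a rough way to spot the affected indexes is to look
at which index columns reference an ICU-provided collation, e.g.:

    SELECT DISTINCT s.indexrelid::regclass AS index_name
    FROM (SELECT indexrelid, indcollation[i] AS coll
          FROM pg_index, generate_subscripts(indcollation, 1) g(i)) AS s
    JOIN pg_collation c ON c.oid = s.coll
    WHERE c.collprovider = 'i';

This is only an approximation, but it illustrates that the set of indexes
needing a REINDEX can be much smaller than "all indexes".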


> I can see a lot more people being disappointed by this than
> will be happy to have Postgres using a newer ICU library.
>

Quite possibly, hence my hesitation to push ahead with anything more than a
simple test build at this time.


>
> Also, is it the ICU library version we should be tracking for reindex,
> or each _collation_ version?  If the later, do we store the collation
> version for each index?
>

I wasn't aware that ICU had the concept of collation versions internally
(which Michael seems to have confirmed downthread). That would potentially
make the number of users needing a reindex even smaller, but as you point
out won't help us for years as we don't store it anyway.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-18 Thread Ashutosh Sharma
Hello Masahiko-san,

I've spent some more time trying to understand the code in
lazy_scan_heap function to know under what all circumstances a VACUUM
can fail with "found xmin ... before relfrozenxid ..." error for a
tuple whose xmin is behind relfrozenxid. Here are my observations:

1) It can fail with this error for a live tuple

OR,

2) It can also fail with this error if a tuple (that went through
update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.

OR,

3) If there are any concurrent transactions, then the tuple might be
marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
this error.

Now, AFAIU, as we will be dealing with a damaged table, point #3 being
the cause of this error looks impossible in our case, because I don't
think we will be doing anything in parallel when performing surgery on a
damaged table; in fact, we shouldn't be doing any such things. However,
it is quite possible that reason #2 could
cause VACUUM to fail with this sort of error, but, as we are already
skipping redirected item pointers in heap_force_common(), I think, we
would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
may not need to handle point #2 as well.

Further, I also don't see VACUUM reporting this error for a tuple that
has been moved from one partition to another. So, I think we might not
need to do any special handling for a tuple that got updated and its
new version was moved to another partition.

If you feel I am missing something here, please correct me. Thank you.
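
For reference, the intended workflow with this module (function names per
this thread; the exact signatures may differ between patch versions, and
the table name and TIDs here are hypothetical) would be something like:

    CREATE EXTENSION pg_surgery;
    -- force-freeze (or heap_force_kill to kill) the damaged tuples:
    SELECT heap_force_freeze('damaged_tbl'::regclass, ARRAY['(0,1)']::tid[]);
    -- then bring the relation back to a sane state:
    VACUUM (DISABLE_PAGE_SKIPPING) damaged_tbl;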

Moreover, while I was exploring on above, I noticed that in
lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
for a redirected item pointers and if any redirected item pointer is
detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
with HEAP_HOT_UPDATED. I am referring to the following code in
lazy_scan_heap().

for (offnum = FirstOffsetNumber;
 offnum <= maxoff;
 offnum = OffsetNumberNext(offnum))
{
ItemId  itemid;

itemid = PageGetItemId(page, offnum);

.
.


/* Redirect items mustn't be touched */ <-- this check
would bypass the redirected item pointers from being checked for
HeapTupleSatisfiesVacuum.
if (ItemIdIsRedirected(itemid))
{
hastup = true;  /* this page won't be truncatable */
continue;
}

..
..

switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
{
case HEAPTUPLE_DEAD:

if (HeapTupleIsHotUpdated(&tuple) ||
    HeapTupleIsHeapOnly(&tuple) ||
params->index_cleanup == VACOPT_TERNARY_DISABLED)
nkeep += 1;
else
tupgone = true; /* we can delete the tuple */
..
..
 }


So, the point is, would HeapTupleIsHotUpdated() ever be true?

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma  wrote:
>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
>  wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> > pg_standby  \
> > pg_stat_statements \
> > pg_trgm \
> > +   pg_surgery  \
> > pgcrypto\
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +   if (heap_force_opt == HEAP_FORCE_KILL)
> > +   ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +   /*
> > +* We do not mark the buffer dirty or do WAL logging for unmodifed
> > + 

Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-08-18 Thread Bharath Rupireddy
On Tue, Jul 14, 2020 at 1:20 PM Dilip Kumar  wrote:
>
> On Mon, Jul 13, 2020 at 4:23 PM Amit Kapila  wrote:
>
> > I think we can do more than this by
> > parallelizing the Insert part of this query as well as we have lifted
> > group locking restrictions related to RelationExtension and Page lock
> > in PG13.  It would be really cool to do that unless we see any
> > fundamental problems with it.
>
> +1, I think it will be cool to support for insert part here as well as
> insert part in 'Create Table As Select..' as well.
>

+1 to parallelizing inserts. Currently, ExecInsert() and CTAS use
table_tuple_insert(). If we parallelize these parts, each worker will
be inserting its tuples (one tuple at a time) into the same data page
until no space is left; then a new data page can be obtained by any of
the workers, and the others might start inserting into it. Given this,
will there be lock contention on data pages? Do we also need to make
inserts use table_multi_insert() (the way COPY does) instead of
table_tuple_insert()?
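
As a made-up illustration of the target statements (the table names are
hypothetical), these are the shapes where the SELECT part can run in
parallel and where parallelizing the insertion itself is being proposed:

    INSERT INTO sales_archive
    SELECT * FROM sales WHERE sold_at < '2019-01-01';

    CREATE TABLE sales_2018 AS
    SELECT * FROM sales WHERE sold_at >= '2018-01-01'
                          AND sold_at <  '2019-01-01';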

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Is it worth accepting multiple CRLs?

2020-08-18 Thread Kyotaro Horiguchi
Hello.

At Sat, 15 Aug 2020 13:18:22 -0400, Stephen Frost  wrote in 
> > Looking closer I realized that certificates are verified in each
> > backend so CRL cache doesn't work at all for the hashed directory
> > method. Therefore, all CRL files relevant to a certificate to be
> > verified are loaded every time a backend starts.
> > 
> > The only advantage of this is avoiding loading irrelevant CRLs, in
> > exchange for loading all relevant CRLs at every session start.
> > Session startup gets slower with many delta CRLs from the same CA.
> > 
> > Seems far from promising.
> 
> I agree that it's not ideal, but I don't know that this is a reason to
> not move forward with this feature..?

One of the significant advantages of the directory method is
differential loading of new CRLs. But it actually has other advantages,
like easier file handling and not needing a server reload.

> We could certainly have a later patch which improves this in some way
> (though exactly how isn't clear...  if we move the CRL loading into
> postmaster then we'd have to load *all* of them, and then we'd still
> need to check if they've changed since we loaded them, and presumably
> have some way to signal the postmaster to update its set from time to
> time..), but that can be a future effort.
> 
> I took a quick look through the patch and it seemed pretty straight
> forward to me and a good improvement.
> 
> Would love to hear other thoughts.  I hope you'll submit this for the
> September CF and ping me when you do and I'll see if I can get it
> committed.

Thank you very much. I'll do that after some polishing.
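
For reference, once polished, the directory method would be configured
with something like the following (I'm assuming the GUC name used by the
patch here, and the directory must hold CRLs in OpenSSL's hashed-name
layout, e.g. as produced by "openssl rehash"):

    ALTER SYSTEM SET ssl_crl_dir = '/etc/postgresql/crl';
    SELECT pg_reload_conf();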

A nearby discussion about OpenSSL 3.0 conflicts with this, but it's
easy to follow.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: New statistics for tuning WAL buffer size

2020-08-18 Thread tsunakawa.ta...@fujitsu.com
From: Masahiro Ikeda 
> It's important to provide metrics for tuning the size of WAL buffers.
> For now, we lack statistics on how often processes wait to write WAL
> because the WAL buffer is full.
> 
> If that situation occurs often, the WAL buffer is too small for the
> workload, and DBAs must tune the WAL buffer size for performance
> improvement.

Yes, it's helpful to know if we need to enlarge the WAL buffer.  That's why our 
colleague HariBabu proposed the patch.  We'd be happy if it could be committed 
in some form.
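
For context, the knob such a statistic would inform is the existing
wal_buffers GUC, e.g.:

    -- wal_buffers can only be set at server start; the default of -1
    -- sizes it automatically from shared_buffers.
    ALTER SYSTEM SET wal_buffers = '64MB';
    -- a server restart is then required for the new value to take effect

The proposed counters would tell DBAs whether such an increase is
actually warranted.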


> There are related threads, but those are not merged.
> https://www.postgresql.org/message-id/4ff824f3.5090...@uptime.jp
> https://www.postgresql.org/message-id/flat/CAJrrPGc6APFUGYNcPe4qcNx
> pL8gXKYv1KST%2BvwJcFtCSCEySnA%40mail.gmail.com

What's the difference between those patches?  What blocked them from being 
committed?


Regards
Takayuki Tsunakawa





Re: Display individual query in pg_stat_activity

2020-08-18 Thread Masahiro Ikeda

Hi,


I've attached a patch to display individual query in the
pg_stat_activity query field when multiple SQL statements are
currently displayed.

Motivation:

When multiple statements are displayed then we don’t know which
one is currently running.

I'm not sure I'd want that to happen, as it could make it much
harder to track the activity back to a query in the application
layer or server logs.

Perhaps a separate field could be added for the current statement,
or a value to indicate what the current statement number in the
query is?


As a user, I think this feature is useful.

It would be nice if pg_stat_activity also showed the query currently
running inside a user-defined function (PL/pgSQL).

I understand that this patch is not for user-defined functions.
Please let me know if it's better to start another thread.

In general, PL/pgSQL functions contain multiple queries,
and users want to know the progress of query execution, don't they?
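
To illustrate the original problem with a made-up example: if a client
sends several statements in one simple-protocol message (e.g. via
psql -c, or a driver that batches statements), such as

    UPDATE t SET x = x + 1; SELECT pg_sleep(60);

then another session running

    SELECT pid, state, query FROM pg_stat_activity WHERE state = 'active';

sees both statements in the single query field, with no indication of
which one is currently executing.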

--
Masahiro Ikeda
NTT DATA CORPORATION




Re: psql: add \si, \sm, \st and \sr functions to show CREATE commands for indexes, matviews, triggers and tables

2020-08-18 Thread a . pervushina

Anna Akenteva wrote 2020-08-11 13:37:

About the patch:

1) There is some code duplication for the exec_command_[sm|si|st|sr]
functions. Plus, it seems weird to separate sm (show matview) from sv
(show view). Perhaps it would be more convenient to combine some of
the code? Maybe by editing the already-existing exec_command_sf_sv()
function.


I've combined most of the functions into one, as the code was mostly
duplicated. I had to change the argument from is_func to an object type,
because the number of values has increased. I've attached a patch with
those changes.

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 560eacc7f0c..3faac9e25a6 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -49,7 +49,11 @@
 typedef enum EditableObjectType
 {
 	EditableFunction,
-	EditableView
+	EditableView,
+	EditableMatview,
+	EditableIndex,
+	EditableTrigger,
+	EditableTable
 } EditableObjectType;
 
 /* local function declarations */
@@ -117,8 +121,10 @@ static backslashResult exec_command_s(PsqlScanState scan_state, bool active_bran
 static backslashResult exec_command_set(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_setenv(PsqlScanState scan_state, bool active_branch,
 		   const char *cmd);
-static backslashResult exec_command_sf_sv(PsqlScanState scan_state, bool active_branch,
-		  const char *cmd, bool is_func);
+static backslashResult exec_command_sf_sv_sm_si_sr(PsqlScanState scan_state, bool active_branch,
+		  const char *cmd, EditableObjectType type);
+static backslashResult exec_command_st(PsqlScanState scan_state, bool active_branch,
+	   const char *cmd);
 static backslashResult exec_command_t(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_T(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_timing(PsqlScanState scan_state, bool active_branch);
@@ -154,6 +160,8 @@ static bool do_shell(const char *command);
 static bool do_watch(PQExpBuffer query_buf, double sleep);
 static bool lookup_object_oid(EditableObjectType obj_type, const char *desc,
 			  Oid *obj_oid);
+static bool lookup_trigger_oid(const char *table_name, const char *trigger_name,
+  Oid *obj_oid);
 static bool get_create_object_cmd(EditableObjectType obj_type, Oid oid,
   PQExpBuffer buf);
 static int	strip_lineno_from_objdesc(char *obj);
@@ -381,9 +389,17 @@ exec_command(const char *cmd,
 	else if (strcmp(cmd, "setenv") == 0)
 		status = exec_command_setenv(scan_state, active_branch, cmd);
 	else if (strcmp(cmd, "sf") == 0 || strcmp(cmd, "sf+") == 0)
-		status = exec_command_sf_sv(scan_state, active_branch, cmd, true);
+		status = exec_command_sf_sv_sm_si_sr(scan_state, active_branch, cmd, EditableFunction);
 	else if (strcmp(cmd, "sv") == 0 || strcmp(cmd, "sv+") == 0)
-		status = exec_command_sf_sv(scan_state, active_branch, cmd, false);
+		status = exec_command_sf_sv_sm_si_sr(scan_state, active_branch, cmd, EditableView);
+	else if (strcmp(cmd, "sm") == 0 || strcmp(cmd, "sm+") == 0)
+		status = exec_command_sf_sv_sm_si_sr(scan_state, active_branch, cmd, EditableMatview);
+	else if (strcmp(cmd, "si") == 0 || strcmp(cmd, "si+") == 0)
+		status = exec_command_sf_sv_sm_si_sr(scan_state, active_branch, cmd, EditableIndex);
+	else if (strcmp(cmd, "st") == 0 || strcmp(cmd, "st+") == 0)
+		status = exec_command_st(scan_state, active_branch, cmd);
+	else if (strcmp(cmd, "sr") == 0 || strcmp(cmd, "sr+") == 0)
+		status = exec_command_sf_sv_sm_si_sr(scan_state, active_branch, cmd, EditableTable);
 	else if (strcmp(cmd, "t") == 0)
 		status = exec_command_t(scan_state, active_branch);
 	else if (strcmp(cmd, "T") == 0)
@@ -2318,11 +2334,11 @@ exec_command_setenv(PsqlScanState scan_state, bool active_branch,
 }
 
 /*
- * \sf/\sv -- show a function/view's source code
+ * \sf/\sv/\sm/\si/\sr -- show a function/view's/matview's/index/table's source code
  */
 static backslashResult
-exec_command_sf_sv(PsqlScanState scan_state, bool active_branch,
-   const char *cmd, bool is_func)
+exec_command_sf_sv_sm_si_sr(PsqlScanState scan_state, bool active_branch,
+   const char *cmd, EditableObjectType type)
 {
 	backslashResult status = PSQL_CMD_SKIP_LINE;
 
@@ -2332,39 +2348,60 @@ exec_command_sf_sv(PsqlScanState scan_state, bool active_branch,
 		PQExpBuffer buf;
 		char	   *obj_desc;
 		Oid			obj_oid = InvalidOid;
-		EditableObjectType eot = is_func ? EditableFunction : EditableView;
 
 		buf = createPQExpBuffer();
 		obj_desc = psql_scan_slash_option(scan_state,
 		  OT_WHOLE_LINE, NULL, true);
-		if (pset.sversion < (is_func ? 80400 : 70400))
+		if ((pset.sversion < 80400 && type == EditableFunction) || (pset.sversion < 70400 && type == EditableView))
 		{
 			char		sverbuf[32];
 
 			formatPGVersionNumber(pset.sversion, false,
   sverbuf, sizeof(sverbuf));
-			if (is_func)
-pg_log_error("The server (version %s) does not support showing function 

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-18 Thread Amit Kapila
On Fri, Aug 7, 2020 at 9:33 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund  wrote:
> >> We could also just use pg_class.relpages. It'll probably mostly be
> >> accurate enough?
>
> > Don't we need the accurate 'number of blocks' if we want to invalidate
> > all the buffers? Basically, I think we need to perform BufTableLookup
> > for all the blocks in the relation and then Invalidate all buffers.
>
> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
> in the system, the checkpointer will eventually try to flush it, and fail
> (because there's no file to write it to), and then checkpointing will be
> stuck.  So we cannot afford to risk missing any buffers.
>

Today, thinking about this point again, it occurred to me that during
recovery we can reliably find the relation size, and after Thomas's
recent commit c5315f4f44 (Cache smgrnblocks() results in recovery) we
might not even need to incur the cost of an lseek. Why don't we fix this
first for recovery (along the lines of what Andres suggested), and then
later, once we have a generic mechanism for "caching the relation size"
[1], do it for the non-recovery paths as well? I think that will at
least address the reported use case with minimal changes.

[1] - 
https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.