date:20200819

On Tue, 18 Aug 2020 at 13:06, Amit Kapila  wrote:
>
> On Mon, Aug 17, 2020 at 11:38 AM Masahiko Sawada
>  wrote:
> >
> > On Sat, 15 Aug 2020 at 12:19, Amit Kapila  wrote:
> > >
> > > The reason why I have not included heap_page_prune related change in
> > > the patch is that I don't want to sprinkle this in every possible
> > > function (code path) called via vacuum especially if the probability
> > > of an error in that code path is low. But, I am fine if you and or
> > > others think that it is a good idea to update offset in
> > > heap_page_prune as well.
> >
> > I agree to not try sprinkling it many places than necessity.
> >
> > Regarding heap_page_prune(), I'm concerned a bit that
> > heap_page_prune() is typically the first function to check the tuple
> > visibility within the vacuum code. I've sometimes observed an error
> > with the message like "DETAIL: could not open file “pg_xact/00AB”: No
> > such file or directory" due to a tuple header corruption. I suspect
> > this message was emitted while checking tuple visibility in
> > heap_page_prune(). So I guess the likelihood of an error in that code
> > is not so low.
> >
>
> Fair point.
>
> > On the other hand, if we want to update the offset number in
> > heap_page_prune() we will need to expose some vacuum structs defined
> > as a static including LVRelStats.
> >
>
> I don't think we need to expose LVRelStats. We can just pass the
> address of vacrelstats->offset_num to achieve what we want. I have
> tried that and it works, see the
> v6-0002-additinal-error-context-information-in-heap_page_ patch
> attached. Do you see any problem with it?

Yes, you're right. I'm concerned a bit the number of arguments passing
to heap_page_prune() might get higher when we need other values to
update for errcontext, but I'm okay with the current patch.

Currently, we're in SCAN_HEAP phase in heap_page_prune() but should it
be VACUUM_HEAP instead?

Also, I've tested the patch with log_min_messages = 'info' and get the
following sever logs:

2020-08-19 14:28:09.917 JST [72912] INFO:  launched 1 parallel vacuum
worker for index vacuuming (planned: 1)
2020-08-19 14:28:09.917 JST [72912] CONTEXT:  while scanning block 973
of relation "public.tbl"
2020-08-19 14:28:09.959 JST [72912] INFO:  scanned index "i1" to
remove 109872 row versions
2020-08-19 14:28:09.959 JST [72912] DETAIL:  CPU: user: 0.04 s,
system: 0.00 s, elapsed: 0.04 s
2020-08-19 14:28:09.959 JST [72912] CONTEXT:  while vacuuming index
"i1" of relation "public.tbl"
2020-08-19 14:28:09.967 JST [72936] INFO:  scanned index "i2" to
remove 109872 row versions by parallel vacuum worker
2020-08-19 14:28:09.967 JST [72936] DETAIL:  CPU: user: 0.03 s,
system: 0.00 s, elapsed: 0.04 s
2020-08-19 14:28:09.967 JST [72936] CONTEXT:  while vacuuming index
"i2" of relation "public.tbl"
2020-08-19 14:28:09.967 JST [72912] INFO:  scanned index "i2" to
remove 109872 row versions by parallel vacuum worker
2020-08-19 14:28:09.967 JST [72912] DETAIL:  CPU: user: 0.03 s,
system: 0.00 s, elapsed: 0.04 s
2020-08-19 14:28:09.967 JST [72912] CONTEXT:  while vacuuming index
"i2" of relation "public.tbl"
parallel worker
while scanning block 973 of relation "public.tbl"
2020-08-19 14:28:09.968 JST [72912] INFO:  "tbl": removed 109872 row
versions in 487 pages
2020-08-19 14:28:09.968 JST [72912] DETAIL:  CPU: user: 0.00 s,
system: 0.00 s, elapsed: 0.00 s
2020-08-19 14:28:09.968 JST [72912] CONTEXT:  while vacuuming block
973 of relation "public.tbl"
2020-08-19 14:28:09.968 JST [72912] INFO:  index "i1" now contains
11 row versions in 578 pages
2020-08-19 14:28:09.968 JST [72912] DETAIL:  109872 index row versions
were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
2020-08-19 14:28:09.968 JST [72912] CONTEXT:  while scanning block 973
of relation "public.tbl"
2020-08-19 14:28:09.968 JST [72912] INFO:  index "i2" now contains
11 row versions in 578 pages
2020-08-19 14:28:09.968 JST [72912] DETAIL:  109872 index row versions
were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
2020-08-19 14:28:09.968 JST [72912] CONTEXT:  while scanning block 973
of relation "public.tbl"
2020-08-19 14:28:09.969 JST [72912] INFO:  "tbl": found 11
removable, 11 nonremovable row versions in 974 out of 974 pages
2020-08-19 14:28:09.969 JST [72912] DETAIL:  0 dead row versions
cannot be removed yet, oldest xmin: 519
There were 372 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.06 s.
2020-08-19 14:28:09.969 JST [72912] CONTEXT:  while scanning block 973
of relation "public.tbl"

This is not directly related to the patch but it looks like we can
improve the current errcontext settings. For instance, the message
from lazy_v

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-19 Thread Dilip Kumar

On Wed, Aug 19, 2020 at 10:11 AM Amit Kapila  wrote:
>
> On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar  wrote:
> >
> >
> > In last patch v49-0001, there is one issue,  Basically, I have called
> > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > the underlying files are deleted/truncated because those files/blocks
> > might not exist now.  So I think if the truncate position is within
> > the same buffer we just need to adjust the buffer,  otherwise we just
> > need to set the currFile and currOffset to the absolute number and set
> > the pos and nbytes 0.  Attached patch fixes this issue.
> >
>
> Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> 1.
> +
> + /*
> + * If the truncate point is within existing buffer then we can just
> + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> + * we don't need to do anything because we have already deleted/truncated
> + * the underlying files.
> + */
> + if (curFile == file->curFile &&
> + curOffset >= file->curOffset &&
> + curOffset <= file->curOffset + file->nbytes)
> + {
> + file->pos = (int) (curOffset - file->curOffset);
> + return;
> + }
>
> I think in this case you have set the position correctly but what
> about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> because the contents of the buffer are still valid but I don't think
> the same is true here.

Right, I think we need to set nbytes to new file->pos as shown below

> + file->pos = (int) (curOffset - file->curOffset);
>  file->nbytes = file->pos


> 2.
> + int curFile = file->curFile;
> + off_t curOffset = file->curOffset;
>
> I find the previous naming (newFile, newOffset) was better as it
> distinguishes them from BufFile variables.

Ok

> 3.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> ..
> + /* Delete all files in the set */
> + SharedFileSetDeleteAll(input_fileset);
> ..
> }
>
> I am not sure if this is completely correct because we call this
> function (SharedFileSetUnregister) from BufFileDeleteShared which
> would have already removed all the required files. This raises the
> question in my mind whether it is correct to call
> SharedFileSetUnregister from BufFileDeleteShared from the API
> perspective as one might not want to remove the entire fileset at that
> point of time. It will work for your use case (where while removing
> buffile you also want to remove the entire fileset) but not sure if it
> is generic enough. For your case, I wonder if we can directly call
> SharedFileSetDeleteAll and we can have a call like
> SharedFileSetUnregister which will be called from it.

Yeah this make more sense to me that we can directly call
SharedFileSetDeleteAll, instead of calling BufFileDeleteShared and we
can call SharedFileSetUnregister from SharedFileSetDeleteAll.

I will make these changes and send the patch after some testing.




--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-19 Thread Dilip Kumar

On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila  wrote:
>
> On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila  wrote:
> >
> > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar  wrote:
> > >
> > >
> > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > the underlying files are deleted/truncated because those files/blocks
> > > might not exist now.  So I think if the truncate position is within
> > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > need to set the currFile and currOffset to the absolute number and set
> > > the pos and nbytes 0.  Attached patch fixes this issue.
> > >
> >
> > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > 1.
> > +
> > + /*
> > + * If the truncate point is within existing buffer then we can just
> > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > + * we don't need to do anything because we have already deleted/truncated
> > + * the underlying files.
> > + */
> > + if (curFile == file->curFile &&
> > + curOffset >= file->curOffset &&
> > + curOffset <= file->curOffset + file->nbytes)
> > + {
> > + file->pos = (int) (curOffset - file->curOffset);
> > + return;
> > + }
> >
> > I think in this case you have set the position correctly but what
> > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > because the contents of the buffer are still valid but I don't think
> > the same is true here.
> >
>
> I think you need to set 'nbytes' to curOffset as per your current
> patch as that is the new size of the file.
> --- a/src/backend/storage/file/buffile.c
> +++ b/src/backend/storage/file/buffile.c
> @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> off_t offset)
> curOffset <= file->curOffset + file->nbytes)
> {
> file->pos = (int) (curOffset - file->curOffset);
> +   file->nbytes = (int) curOffset;
> return;
> }
>
> Also, what about file 'numFiles', that can also change due to the
> removal of certain files, shouldn't that be also set in this case

Right, we need to set the numFile.  I will fix this as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] - Provide robust alternatives for replace_string

2020-08-19 Thread Georgios

‐‐‐ Original Message ‐‐‐
On Friday, 7 August 2020 09:02, Asim Praveen  wrote:

> > On 05-Aug-2020, at 7:01 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> > On 2020-Aug-05, Asim Praveen wrote:
> >
> > > Please find attached a StringInfo based solution to this problem. It
> > > uses fgetln instead of fgets such that a line is read in full, without
> > > ever splitting it.
> >
> > never heard of fgetln, my system doesn't have a manpage for it, and we
> > don't use it anywhere AFAICS. Are you planning to add something to
> > src/common for it?
>
> Indeed! I noticed fgetln on the man page of fgets and used it without 
> checking. And this happened on a MacOS system.
>
> Please find a revised version that uses fgetc instead.

Although not an issue in the current branch, fgetc might become a bit slow
in large files. Please find v3 which simply continues reading the line if
fgets fills the buffer and there is still data to read.

Also this version, implements Alvaro's suggestion to break API compatibility.

To that extent, ecpg regress has been slightly modified to use the new version
of replace_string() where needed, or remove it all together where possible.

//Georgios

>
> Asim

v3-0001-Use-a-stringInfo-instead-of-a-char-for-replace_st.patch
Description: Binary data

Re: Creating a function for exposing memory usage of backend process

2020-08-19 Thread torikoshia


On 2020-08-19 15:48, Fujii Masao wrote:

On 2020/08/19 9:43, torikoshia wrote:

On 2020-08-18 22:54, Fujii Masao wrote:

On 2020/08/18 18:41, torikoshia wrote:

On 2020-08-17 21:19, Fujii Masao wrote:

On 2020/08/17 21:14, Fujii Masao wrote:

On 2020-08-07 16:38, Kasahara Tatsuhito wrote:
The following review has been posted through the commitfest 
application:

make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:    tested, passed

I tested the latest
patch(0007-Adding-a-function-exposing-memory-usage-of-local-backend.patch)
with the latest PG-version 
(199cec9779504c08aaa8159c6308283156547409)

and test was passed.
It looks good to me.

The new status of this patch is: Ready for Committer


Thanks for your testing!


Thanks for updating the patch! Here are the review comments.


Thanks for reviewing!



+ 
+  linkend="view-pg-backend-memory-contexts">pg_backend_memory_contexts

+  backend memory contexts
+ 

The above is located just after pg_matviews entry. But it should 
be located
just after pg_available_extension_versions entry. Because the rows 
in the table

"System Views" should be located in alphabetical order.


+ 
+ 
pg_backend_memory_contexts


Same as above.


Modified both.




+   The view pg_backend_memory_contexts 
displays all

+   the local backend memory contexts.

This description seems a bit confusing because maybe we can 
interpret this
as "... displays the memory contexts of all the local backends" 
wrongly. Thought?

What about the following description, instead?


 The view pg_backend_memory_contexts 
displays all
 the memory contexts of the server process attached to the 
current session.


Thanks! it seems better.


+    const char *name = context->name;
+    const char *ident = context->ident;
+
+    if (context == NULL)
+    return;

The above check "context == NULL" is useless? If "context" is 
actually NULL,
"context->name" would cause segmentation fault, so ISTM that the 
check

will never be performed.

If "context" can be NULL, the check should be performed before 
accessing
to "contect". OTOH, if "context" must not be NULL per the 
specification of

PutMemoryContextStatsTupleStore(), assertion test checking
"context != NULL" should be used here, instead?


Yeah, "context" cannot be NULL because "context" must be 
TopMemoryContext

or it is already checked as not NULL as follows(child != NULL).

I added the assertion check.


Isn't it better to add AssertArg(MemoryContextIsValid(context)), 
instead?


Thanks, that's better.



| for (child = context->firstchild; child != NULL; child = 
child->nextchild)

| {
|  ...
| PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
|   child, 
parentname, level + 1);

| }


Here is another comment.

+ if (parent == NULL)
+ nulls[2] = true;
+ else
+ /*
+  * We labeled dynahash contexts with just the hash 
table name.
+  * To make it possible to identify its parent, we 
also display

+  * parent's ident here.
+  */
+ if (parent->ident && strcmp(parent->name, "dynahash") 
== 0)
+ values[2] = 
CStringGetTextDatum(parent->ident);

+ else
+ values[2] = 
CStringGetTextDatum(parent->name);


PutMemoryContextsStatsTupleStore() doesn't need "parent" memory 
context,
but uses only the name of "parent" memory context. So isn't it 
better to use
"const char *parent" instead of "MemoryContext parent", as the 
argument of

the function? If we do that, we can simplify the above code.


Thanks, the attached patch adopted the advice.

However, since PutMemoryContextsStatsTupleStore() used not only the 
name
but also the ident of the "parent", I could not help but adding 
similar

codes before calling the function.
The total amount of codes and complexity seem not to change so much.

Any thoughts? Am I misunderstanding something?


I was thinking that we can simplify the code as follows.
That is, we can just pass "name" as the argument of
PutMemoryContextsStatsTupleStore()
since "name" indicates context->name or ident (if name is 
"dynahash").


 for (child = context->firstchild; child != NULL; child = 
child->nextchild)

 {
-    const char *parentname;
-
-    /*
- * We labeled dynahash contexts with just the hash table 
name.

- * To make it possible to identify its parent, we also use
- * the hash table as its context name.
- */
-    if (context->ident && strcmp(context->name, "dynahash") == 
0)

-    parentname = context->ident;
-    else
-    parentname = context->name;
-
 PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
-  child, parentname, level + 1);
+  child, name, level + 1);


I

Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

On Wed, 19 Aug 2020 at 19:22, Julien Rouhaud  wrote:
> Hearing no objection, here's a patch to change the output as suggested by
> Pierre:
>
> =# explain (analyze, buffers) select * from pg_class;
>QUERY PLAN 
>  >
> --->
>  Seq Scan on pg_class  (cost=0.00..16.86 rows=386 width=265) (actual 
> time=0.020..0.561 rows=386 loops=1)
>Buffers: shared hit=9 read=4
>  Planning:
>Planning Time: 4.345 ms
>Buffers: shared hit=103 read=12
>  Execution Time: 1.447 ms
> (6 rows)

I don't really have anything to say about the change in format, but on
looking at the feature, I do find it strange that I need to specify
ANALYZE to get EXPLAIN to output the buffer information for the
planner.

I'd expect that EXPLAIN (BUFFERS) would work just fine, but I get:

ERROR:  EXPLAIN option BUFFERS requires ANALYZE

Ths docs [1] also mention this is disallowed per:

"This parameter may only be used when ANALYZE is also enabled."

I just don't agree that it should be. What if I want to get an
indication of why the planner is slow but I don't want to wait for the
query to execute? or don't want to execute it at all, say it's a
DELETE!

It looks like we'd need to make BUFFERS imply SUMMARY, perhaps
something along the lines of what we do now with ANALYZE with:

/* if the summary was not set explicitly, set default value */
es->summary = (summary_set) ? es->summary : es->analyze;

However, I'm not quite sure how we should handle if someone does:
EXPLAIN (BUFFERS on, SUMMARY off). Without the summary, there's no
place to print the buffers, which seems bad as they asked for buffers.

David

[1] https://www.postgresql.org/docs/devel/sql-explain.html

Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

2020-08-19 Thread Julien Rouhaud

On Wed, Aug 19, 2020 at 08:49:48PM +1200, David Rowley wrote:
> On Wed, 19 Aug 2020 at 19:22, Julien Rouhaud  wrote:
> > Hearing no objection, here's a patch to change the output as suggested by
> > Pierre:
> >
> > =# explain (analyze, buffers) select * from pg_class;
> >QUERY PLAN   
> >>
> > --->
> >  Seq Scan on pg_class  (cost=0.00..16.86 rows=386 width=265) (actual 
> > time=0.020..0.561 rows=386 loops=1)
> >Buffers: shared hit=9 read=4
> >  Planning:
> >Planning Time: 4.345 ms
> >Buffers: shared hit=103 read=12
> >  Execution Time: 1.447 ms
> > (6 rows)
> 
> I don't really have anything to say about the change in format, but on
> looking at the feature, I do find it strange that I need to specify
> ANALYZE to get EXPLAIN to output the buffer information for the
> planner.
> 
> I'd expect that EXPLAIN (BUFFERS) would work just fine, but I get:
> 
> ERROR:  EXPLAIN option BUFFERS requires ANALYZE
> 
> Ths docs [1] also mention this is disallowed per:
> 
> "This parameter may only be used when ANALYZE is also enabled."
> 
> I just don't agree that it should be. What if I want to get an
> indication of why the planner is slow but I don't want to wait for the
> query to execute? or don't want to execute it at all, say it's a
> DELETE!


I quite agree, this restriction is unhelpful since we have planning buffer
information.


> 
> It looks like we'd need to make BUFFERS imply SUMMARY


+1


> 
> However, I'm not quite sure how we should handle if someone does:
> EXPLAIN (BUFFERS on, SUMMARY off). Without the summary, there's no
> place to print the buffers, which seems bad as they asked for buffers.


But this won't be as much a problem if ANALYZE is asked, and having different
behaviors isn't appealing.  So maybe it's better to let people get what they
asked for even if that's contradictory?

Re: Creating a function for exposing memory usage of backend process





On 2020/08/19 17:40, torikoshia wrote:

On 2020-08-19 15:48, Fujii Masao wrote:

On 2020/08/19 9:43, torikoshia wrote:

On 2020-08-18 22:54, Fujii Masao wrote:

On 2020/08/18 18:41, torikoshia wrote:

On 2020-08-17 21:19, Fujii Masao wrote:

On 2020/08/17 21:14, Fujii Masao wrote:

On 2020-08-07 16:38, Kasahara Tatsuhito wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:    tested, passed

I tested the latest
patch(0007-Adding-a-function-exposing-memory-usage-of-local-backend.patch)
with the latest PG-version (199cec9779504c08aaa8159c6308283156547409)
and test was passed.
It looks good to me.

The new status of this patch is: Ready for Committer


Thanks for your testing!


Thanks for updating the patch! Here are the review comments.


Thanks for reviewing!



+ 
+  pg_backend_memory_contexts
+  backend memory contexts
+ 

The above is located just after pg_matviews entry. But it should be located
just after pg_available_extension_versions entry. Because the rows in the table
"System Views" should be located in alphabetical order.


+ 
+ pg_backend_memory_contexts

Same as above.


Modified both.




+   The view pg_backend_memory_contexts displays all
+   the local backend memory contexts.

This description seems a bit confusing because maybe we can interpret this
as "... displays the memory contexts of all the local backends" wrongly. 
Thought?
What about the following description, instead?



 The view pg_backend_memory_contexts displays all
 the memory contexts of the server process attached to the current session.


Thanks! it seems better.


+    const char *name = context->name;
+    const char *ident = context->ident;
+
+    if (context == NULL)
+    return;

The above check "context == NULL" is useless? If "context" is actually NULL,
"context->name" would cause segmentation fault, so ISTM that the check
will never be performed.

If "context" can be NULL, the check should be performed before accessing
to "contect". OTOH, if "context" must not be NULL per the specification of
PutMemoryContextStatsTupleStore(), assertion test checking
"context != NULL" should be used here, instead?


Yeah, "context" cannot be NULL because "context" must be TopMemoryContext
or it is already checked as not NULL as follows(child != NULL).

I added the assertion check.


Isn't it better to add AssertArg(MemoryContextIsValid(context)), instead?


Thanks, that's better.



| for (child = context->firstchild; child != NULL; child = child->nextchild)
| {
|  ...
| PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
|   child, parentname, 
level + 1);
| }


Here is another comment.

+ if (parent == NULL)
+ nulls[2] = true;
+ else
+ /*
+  * We labeled dynahash contexts with just the hash table name.
+  * To make it possible to identify its parent, we also display
+  * parent's ident here.
+  */
+ if (parent->ident && strcmp(parent->name, "dynahash") == 0)
+ values[2] = CStringGetTextDatum(parent->ident);
+ else
+ values[2] = CStringGetTextDatum(parent->name);

PutMemoryContextsStatsTupleStore() doesn't need "parent" memory context,
but uses only the name of "parent" memory context. So isn't it better to use
"const char *parent" instead of "MemoryContext parent", as the argument of
the function? If we do that, we can simplify the above code.


Thanks, the attached patch adopted the advice.

However, since PutMemoryContextsStatsTupleStore() used not only the name
but also the ident of the "parent", I could not help but adding similar
codes before calling the function.
The total amount of codes and complexity seem not to change so much.

Any thoughts? Am I misunderstanding something?


I was thinking that we can simplify the code as follows.
That is, we can just pass "name" as the argument of
PutMemoryContextsStatsTupleStore()
since "name" indicates context->name or ident (if name is "dynahash").

 for (child = context->firstchild; child != NULL; child = child->nextchild)
 {
-    const char *parentname;
-
-    /*
- * We labeled dynahash contexts with just the hash table name.
- * To make it possible to identify its parent, we also use
- * the hash table as its context name.
- */
-    if (context->ident && strcmp(context->name, "dynahash") == 0)
-    parentname = context->ident;
-    else
-    parentname = context->name;
-
 PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
-  child, parentname, level + 1);
+  child, name, level + 1);


I got it, thanks for the clarification!

Attached

Re: recovering from "found xmin ... from before relfrozenxid ..."

On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  wrote:
>
> On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
>  wrote:
> >
> > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  wrote:
> > >
> > > > pg_force_freeze() can revival a tuple that is already deleted but not
> > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > using that function. For instance, with the following script, the last
> > > > two queries: index scan and seq scan, will return different results.
> > > >
> > > > set enable_seqscan to off;
> > > > set enable_bitmapscan to off;
> > > > set enable_indexonlyscan to off;
> > > > create table tbl (a int primary key);
> > > > insert into tbl values (1);
> > > >
> > > > update tbl set a = a + 100 where a = 1;
> > > >
> > > > explain analyze select * from tbl where a < 200;
> > > >
> > > > -- revive deleted tuple on heap
> > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > >
> > > > -- index scan returns 2 tuples
> > > > explain analyze select * from tbl where a < 200;
> > > >
> > > > -- seq scan returns 1 tuple
> > > > set enable_seqscan to on;
> > > > explain analyze select * from tbl;
> > > >
> > >
> > > I am not sure if this is the right use-case of pg_force_freeze
> > > function. I think we should only be running pg_force_freeze function
> > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > instead of making it better.
> >
> > Should this also be documented? I think that it's hard to force the
> > user to always use this module in the right situation but we need to
> > show at least when to use.
> >
>
> I've already added some examples in the documentation explaining the
> use-case of force_freeze function. If required, I will also add a note
> about it.
>
> > > > Also, if a tuple updated and moved to another partition is revived by
> > > > heap_force_freeze(), its ctid still has special values:
> > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > see a problem yet caused by a visible tuple having the special ctid
> > > > value, but it might be worth considering either to reset ctid value as
> > > > well or to not freezing already-deleted tuple.
> > > >
> > >
> > > For this as well, the answer remains the same as above.
> >
> > Perhaps the same is true when a tuple header is corrupted including
> > xmin and ctid for some reason and the user wants to fix it? I'm
> > concerned that a live tuple having the wrong ctid will cause SEGV or
> > PANIC error in the future.
> >
>
> If a tuple header itself is corrupted, then I think we must kill that
> tuple. If only xmin and t_ctid fields are corrupted, then probably we
> can think of resetting the ctid value of that tuple. However, it won't
> be always possible to detect the corrupted ctid value. It's quite
> possible that the corrupted ctid field has valid values for block
> number and offset number in it, but it's actually corrupted and it
> would be difficult to consider such ctid as corrupted. Hence, we can't
> do anything about such types of corruption. Probably in such cases we
> need to run VACUUM FULL on such tables so that new ctid gets generated
> for each tuple in the table.

Understood.

Perhaps such corruption will be able to be detected by new heapam
check functions discussed on another thread. My point was that it
might be better to attempt making the tuple header sane state as much
as possible when fixing a live tuple in order to prevent further
problems such as databases crash by malicious attackers.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

On Wed, 19 Aug 2020 at 21:05, Julien Rouhaud  wrote:
>
> On Wed, Aug 19, 2020 at 08:49:48PM +1200, David Rowley wrote:
> > However, I'm not quite sure how we should handle if someone does:
> > EXPLAIN (BUFFERS on, SUMMARY off). Without the summary, there's no
> > place to print the buffers, which seems bad as they asked for buffers.
>
>
> But this won't be as much a problem if ANALYZE is asked, and having different
> behaviors isn't appealing.  So maybe it's better to let people get what they
> asked for even if that's contradictory?

I'd say BUFFERS on, BUFFERS off is contradictory. I don't think
BUFFERS, SUMMARY OFF is. It's just that we show the buffer details for
the planner in the summary.  Since "summary" is not exactly a word
that describes what you're asking EXPLAIN to do, I wouldn't blame
users if they got confused as to why their BUFFERS on request was not
displayed.

We do use errors for weird combinations already, e.g:

postgres=# explain (timing on) select * from t1 where a > 400;
ERROR:  EXPLAIN option TIMING requires ANALYZE

so, maybe we can just error if analyze == off AND buffers == on AND
summary == off.  We likely should pay attention to analyze there as it
seems perfectly fine to EXPLAIN (ANALYZE, BUFFERS, SUMMARY off). We
quite often do SUMMARY off for the regression tests... I think that
might have been why it was added in the first place.

David

Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

On Wed, 19 Aug 2020 at 22:39, David Rowley  wrote:
> so, maybe we can just error if analyze == off AND buffers == on AND
> summary == off.  We likely should pay attention to analyze there as it
> seems perfectly fine to EXPLAIN (ANALYZE, BUFFERS, SUMMARY off). We
> quite often do SUMMARY off for the regression tests... I think that
> might have been why it was added in the first place.

I had something like the attached in mind.

postgres=# explain (buffers) select * from t1 where a > 400;
QUERY PLAN
--
 Index Only Scan using t1_pkey on t1  (cost=0.42..10.18 rows=100 width=4)
   Index Cond: (a > 400)
 Planning Time: 13.341 ms
   Buffers: shared hit=2735
(4 rows)

It does look a bit weirder if the planner didn't do any buffer work:

postgres=# explain (buffers) select * from pg_class;
  QUERY PLAN
--
 Seq Scan on pg_class  (cost=0.00..443.08 rows=408 width=768)
 Planning Time: 0.136 ms
(2 rows)

but that's not a combination that people were able to use before, so I
think it's ok to show the planning time there.

David


explain_buffers_without_analyze.patch
Description: Binary data

Re: Optimising compactify_tuples()

2020-08-19 Thread Thomas Munro

On Tue, Aug 18, 2020 at 6:53 AM Peter Geoghegan  wrote:
> I definitely think that we should have something like this, though.
> It's a relatively easy win. There are plenty of workloads that spend
> lots of time on pruning.

Alright then, here's an attempt to flesh the idea out a bit more, and
replace the three other copies of qsort() while I'm at it.
From 8d0b569fcf6141b622c63fc4bc102c762f01ca9e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 17 Aug 2020 21:31:56 +1200
Subject: [PATCH 1/4] Add sort_template.h for making fast sort functions.

Move our qsort implementation into a header that can be used to
define specialized functions for better performance.

Discussion: https://postgr.es/m/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com
---
 src/include/lib/sort_template.h | 428 
 1 file changed, 428 insertions(+)
 create mode 100644 src/include/lib/sort_template.h

diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
new file mode 100644
index 00..a279bcf959
--- /dev/null
+++ b/src/include/lib/sort_template.h
@@ -0,0 +1,428 @@
+/*-
+ *
+ * sort_template.h
+ *
+ *	  A template for a sort algorithm that supports varying degrees of
+ *	  specialization.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ *	  To generate functions specialized for a type, the following parameter
+ *	  macros should be #define'd before this file is included.
+ *
+ *	  - ST_SORT - the name of a sort function to be generated
+ *	  - ST_ELEMENT_TYPE - type of the referenced elements
+ *	  - ST_DECLARE - if defined the functions and types are declared
+ *	  - ST_DEFINE - if defined the functions and types are defined
+ *	  - ST_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ *	  Instead of ST_ELEMENT_TYPE, ST_ELEMENT_TYPE_VOID can be defined.  Then
+ *	  the generated functions will automatically gain an "element_size"
+ *	  parameter.  This allows us to generate a traditional qsort function.
+ *
+ *	  One of the following macros must be defined, to show how to compare
+ *	  elements.  The first two options are arbitrary expressions depending
+ *	  on whether an extra pass-through argument is desired, and the third
+ *	  option should be defined if the sort function should receive a
+ *	  function pointer at runtime.
+ *
+ * 	  - ST_COMPARE(a, b) - a simple comparison expression
+ *	  - ST_COMPARE(a, b, arg) - variant that takes an extra argument
+ *	  - ST_COMPARE_RUNTIME_POINTER - sort function takes a function pointer
+ *
+ *	  To say that the comparator and therefore also sort function should
+ *	  receive an extra pass-through argument, specify the type of the
+ *	  argument.
+ *
+ *	  - ST_COMPARE_ARG_TYPE - type of extra argument
+ *
+ *	  The prototype of the generated sort function is:
+ *
+ *	  void ST_SORT(ST_ELEMENT_TYPE *data, size_t n,
+ *   [size_t element_size,]
+ *   [ST_SORT_compare_function compare,]
+ *   [ST_COMPARE_ARG_TYPE *arg]);
+ *
+ *	  ST_SORT_compare_function is a function pointer of the following type:
+ *
+ *	  int (*)(const ST_ELEMENT_TYPE *a, const ST_ELEMENT_TYPE *b,
+ *			  [ST_COMPARE_ARG_TYPE *arg])
+ *
+ * HISTORY
+ *
+ *	  Modifications from vanilla NetBSD source:
+ *	  - Add do ... while() macro fix
+ *	  - Remove __inline, _DIAGASSERTs, __P
+ *	  - Remove ill-considered "swap_cnt" switch to insertion sort, in favor
+ *		of a simple check for presorted input.
+ *	  - Take care to recurse on the smaller partition, to bound stack usage
+ *	  - Convert into a header that can generate specialized functions
+ *
+ * IDENTIFICATION
+ *		src/include/lib/sort_template.h
+ *
+ *-
+ */
+
+/*	  $NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $   */
+
+/*-
+ * Copyright (c) 1992, 1993
+ *	  The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *	  notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *	  notice, this list of conditions and the following disclaimer in the
+ *	  documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *	  may be used to endorse or promote products derived from this software
+ *	  without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE D

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-19 Thread Ashutosh Sharma

On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
 wrote:
>
> On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  wrote:
> >
> > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  
> > > wrote:
> > > >
> > > > > pg_force_freeze() can revival a tuple that is already deleted but not
> > > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > > using that function. For instance, with the following script, the last
> > > > > two queries: index scan and seq scan, will return different results.
> > > > >
> > > > > set enable_seqscan to off;
> > > > > set enable_bitmapscan to off;
> > > > > set enable_indexonlyscan to off;
> > > > > create table tbl (a int primary key);
> > > > > insert into tbl values (1);
> > > > >
> > > > > update tbl set a = a + 100 where a = 1;
> > > > >
> > > > > explain analyze select * from tbl where a < 200;
> > > > >
> > > > > -- revive deleted tuple on heap
> > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > >
> > > > > -- index scan returns 2 tuples
> > > > > explain analyze select * from tbl where a < 200;
> > > > >
> > > > > -- seq scan returns 1 tuple
> > > > > set enable_seqscan to on;
> > > > > explain analyze select * from tbl;
> > > > >
> > > >
> > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > function. I think we should only be running pg_force_freeze function
> > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > instead of making it better.
> > >
> > > Should this also be documented? I think that it's hard to force the
> > > user to always use this module in the right situation but we need to
> > > show at least when to use.
> > >
> >
> > I've already added some examples in the documentation explaining the
> > use-case of force_freeze function. If required, I will also add a note
> > about it.
> >
> > > > > Also, if a tuple updated and moved to another partition is revived by
> > > > > heap_force_freeze(), its ctid still has special values:
> > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > value, but it might be worth considering either to reset ctid value as
> > > > > well or to not freezing already-deleted tuple.
> > > > >
> > > >
> > > > For this as well, the answer remains the same as above.
> > >
> > > Perhaps the same is true when a tuple header is corrupted including
> > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > PANIC error in the future.
> > >
> >
> > If a tuple header itself is corrupted, then I think we must kill that
> > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > can think of resetting the ctid value of that tuple. However, it won't
> > be always possible to detect the corrupted ctid value. It's quite
> > possible that the corrupted ctid field has valid values for block
> > number and offset number in it, but it's actually corrupted and it
> > would be difficult to consider such ctid as corrupted. Hence, we can't
> > do anything about such types of corruption. Probably in such cases we
> > need to run VACUUM FULL on such tables so that new ctid gets generated
> > for each tuple in the table.
>
> Understood.
>
> Perhaps such corruption will be able to be detected by new heapam
> check functions discussed on another thread. My point was that it
> might be better to attempt making the tuple header sane state as much
> as possible when fixing a live tuple in order to prevent further
> problems such as databases crash by malicious attackers.
>

Agreed. So, to handle the ctid related concern that you raised, I'm
planning to do the following changes to ensure that the tuple being
marked as frozen contains the correct item pointer value. Please let
me know if you are okay with these changes.

   HeapTupleHeader htup;
+  ItemPointerData ctid;

   Assert(heap_force_opt == HEAP_FORCE_FREEZE);

+  ItemPointerSet(&ctid, blkno, offnos[i]);
+
   htup = (HeapTupleHeader)
PageGetItem(page, itemid);

+   /*
+* Make sure that this tuple holds the
correct item pointer
+* value.
+*/
+   if
(!HeapTupleHeaderIndicatesMovedPartitions(htup) &&
+   !ItemPointerEquals(&ctid, &htup->t_ctid))
+   ItemPointerSet(&htup->t_ctid,
blkno, offnos[i]);
+
HeapTupleHeaderSetXmin(htup,
FrozenTransactionId);
H

Re: More tests with USING INDEX replident and dropped indexes

2020-08-19 Thread Rahila Syed


Hi,

I couldn't test the patch as it does not apply cleanly on master.

Please find below some review comments:

1. Would it better to throw a warning at the time of dropping the 
REPLICA IDENTITY


index that it would also  drop the REPLICA IDENTITY of the parent table?


2. CCI is used after calling SetRelationReplIdent from index_drop() but not

after following call

relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
  bool is_internal)

+   /* update relreplident if necessary */
+   SetRelationReplIdent(RelationGetRelid(rel), ri_type);

3.   I think the former variable names `pg_class_tuple` and 
`pg_class_form`provided better clarity

 +   HeapTuple tuple;

 +   Form_pg_class classtuple;


- relreplident is switched to REPLICA_IDENTITY_NOTHING, which is the
behavior we have now after an index is dropped, even if there is a
primary key.


4. I am not aware of the reason of the current behavior, however it 
seems better


to switch to REPLICA_IDENTITY_DEFAULT. Can't think of a reason why user 
would not be


fine with using primary key as replica identity when replica identity 
index is dropped



Thank you,

Re: SyncRepLock acquired exclusively in default configuration





On 2020/08/12 15:32, Masahiko Sawada wrote:

On Wed, 12 Aug 2020 at 14:06, Asim Praveen  wrote:





On 11-Aug-2020, at 8:57 PM, Robert Haas  wrote:

I think this gets to the root of the issue. If we check the flag
without a lock, we might see a slightly stale value. But, considering
that there's no particular amount of time within which configuration
changes are guaranteed to take effect, maybe that's OK. However, there
is one potential gotcha here: if the walsender declares the standby to
be synchronous, a user can see that, right? So maybe there's this
problem: a user sees that the standby is synchronous and expects a
transaction committing afterward to provoke a wait, but really it
doesn't. Now the user is unhappy, feeling that the system didn't
perform according to expectations.


Yes, pg_stat_replication reports a standby in sync as soon as walsender updates 
priority of the standby to something other than 0.

The potential gotcha referred above doesn’t seem too severe.  What is the 
likelihood of someone setting synchronous_standby_names GUC with either “*” or 
a standby name and then immediately promoting that standby?  If the standby is 
promoted before the checkpointer on master gets a chance to update 
sync_standbys_defined in shared memory, commits made during this interval on 
master may not make it to standby.  Upon promotion, those commits may be lost.


I think that if the standby is quite behind the primary and in case of
the primary crashes, the likelihood of losing commits might get
higher. The user can see the standby became synchronous standby via
pg_stat_replication but commit completes without a wait because the
checkpointer doesn't update sync_standbys_defined yet. If the primary
crashes before standby catching up and the user does failover, the
committed transaction will be lost, even though the user expects that
transaction commit has been replicated to the standby synchronously.
And this can happen even without the patch, right?


I think you're right. This issue can happen even without the patch.

Maybe we should not mark the standby as "sync" whenever sync_standbys_defined
is false even if synchronous_standby_names is actually set and walsenders have
already detect that? Or we need more aggressive approach;
make the checkpointer update sync_standby_priority values of
all the walsenders? ISTM that the latter looks overkill...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: track_planning causing performance regression

2020-08-19 Thread Hamid Akhtar

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Overall, the patch works fine. However, I have a few observations:

(1) Code Comments:
- The code comments should be added for the 2 new macros, in particular for 
PGSS_NUM_LOCK_PARTITIONS. As you explained in your email, this may be used to 
limit the number of locks if a very large value for pgss_max is specified.
- From the code I inferred that the number of locks can in future be less than 
pgss_max (per your email where in future this macro could be used to limit the 
number of locks). I suggest to perhaps add some notes helping future changes in 
this code area.

(2) It seems like that "pgss->lock = &(pgss->base + pgss_max)->lock;" statement 
should not use pgss_max directly and instead use PGSS_NUM_LOCK_PARTITIONS 
macro, as when a limit is imposed on number of locks, this statement will cause 
an overrun.


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL：+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus

The new status of this patch is: Waiting on Author

Refactor pg_rewind code and make it work against a standby

2020-08-19 Thread Heikki Linnakangas


Hi,

I started to hack on making pg_rewind crash-safe (see [1]), but I 
quickly got side-tracked into refactoring and tidying up up the code in 
general. I ended up with this series of patches:


The first four patches are just refactoring that make the code and the 
logic more readable. Tom Lane commented about the messy comments earlier 
(see [2]), and I hope these patches will alleviate that confusion. See 
commit messages for details.


The last patch refactors the logic in libpq_fetch.c, so that it no 
longer uses a temporary table in the source system. That allows using a 
hot standby server as the pg_rewind source.


This doesn't do anything about pg_rewind's crash-safety yet, but I'll 
try work on that after these patches.


[1] 
https://www.postgresql.org/message-id/d8dcc760-8780-5084-f066-6d663801d2e2%40iki.fi


[2] https://www.postgresql.org/message-id/30255.1522711675%40sss.pgh.pa.us

- Heikki
>From f2057f662e9c52c7392491831338a720a21a8ca3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 19 Aug 2020 15:34:35 +0300
Subject: [PATCH 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c

For consistency. All the other low-level functions that operate on the
target directory are in file_ops.c.
---
 src/bin/pg_rewind/file_ops.c  | 19 +++
 src/bin/pg_rewind/file_ops.h  |  1 +
 src/bin/pg_rewind/pg_rewind.c | 22 +-
 src/bin/pg_rewind/pg_rewind.h |  1 +
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index b3bf091c546..55439db20ba 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -19,6 +19,7 @@
 #include 
 
 #include "common/file_perm.h"
+#include "common/file_utils.h"
 #include "file_ops.h"
 #include "filemap.h"
 #include "pg_rewind.h"
@@ -266,6 +267,24 @@ remove_target_symlink(const char *path)
  dstpath);
 }
 
+/*
+ * Sync target data directory to ensure that modifications are safely on disk.
+ *
+ * We do this once, for the whole data directory, for performance reasons.  At
+ * the end of pg_rewind's run, the kernel is likely to already have flushed
+ * most dirty buffers to disk.  Additionally fsync_pgdata uses a two-pass
+ * approach (only initiating writeback in the first pass), which often reduces
+ * the overall amount of IO noticeably.
+ */
+void
+sync_target_dir(void)
+{
+	if (!do_sync || dry_run)
+		return;
+
+	fsync_pgdata(datadir_target, PG_VERSION_NUM);
+}
+
 
 /*
  * Read a file into memory. The file to be read is /.
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 025f24141c9..d8466385cf5 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -19,6 +19,7 @@ extern void remove_target_file(const char *path, bool missing_ok);
 extern void truncate_target_file(const char *path, off_t newsize);
 extern void create_target(file_entry_t *t);
 extern void remove_target(file_entry_t *t);
+extern void sync_target_dir(void);
 
 extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
 
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e445..c9b9e480c0f 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -20,7 +20,6 @@
 #include "catalog/pg_control.h"
 #include "common/controldata_utils.h"
 #include "common/file_perm.h"
-#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/string.h"
 #include "fe_utils/recovery_gen.h"
@@ -38,7 +37,6 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile, char *source,
 			  size_t size);
-static void syncTargetDirectory(void);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -455,7 +453,7 @@ main(int argc, char **argv)
 
 	if (showprogress)
 		pg_log_info("syncing target data directory");
-	syncTargetDirectory();
+	sync_target_dir();
 
 	if (writerecoveryconf && !dry_run)
 		WriteRecoveryConfig(conn, datadir_target,
@@ -803,24 +801,6 @@ digestControlFile(ControlFileData *ControlFile, char *src, size_t size)
 	checkControlFile(ControlFile);
 }
 
-/*
- * Sync target data directory to ensure that modifications are safely on disk.
- *
- * We do this once, for the whole data directory, for performance reasons.  At
- * the end of pg_rewind's run, the kernel is likely to already have flushed
- * most dirty buffers to disk.  Additionally fsync_pgdata uses a two-pass
- * approach (only initiating writeback in the first pass), which often reduces
- * the overall amount of IO noticeably.
- */
-static void
-syncTargetDirectory(void)
-{
-	if (!do_sync || dry_run)
-		return;
-
-	fsync_pgdata(datadir_target, PG_VERSION_NUM);
-}
-
 /*
  * Get value of GUC parameter restore_command from the target cluster.
  *
diff --git a/src/bin/pg_rewind/pg_r

Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-08-19 Thread Anastasia Lubennikova


On 18.08.2020 02:54, Alvaro Herrera wrote:

On 2020-Aug-14, Ibrar Ahmed wrote:


The table used for the test contains three columns (integer, text,
varchar).
The total number of rows is 1000 in total.

Unpatched (Master: 92c12e46d5f1e25fc85608a6d6a19b8f5ea02600)
COPY: 9069.432 ms vacuum; 2567.961ms
COPY: 9004.533 ms vacuum: 2553.075ms
COPY: 8832.422 ms vacuum: 2540.742ms

Patched (Master: 92c12e46d5f1e25fc85608a6d6a19b8f5ea02600)
COPY: 10031.723 ms vacuum: 127.524 ms
COPY: 9985.109  ms vacuum: 39.953 ms
COPY: 9283.373  ms vacuum: 37.137 ms

Time to take the copy slightly increased but the vacuum time significantly
decrease.

"Slightly"?  It seems quite a large performance drop to me -- more than
10%.  Where is that time being spent?  Andres said in [1] that he
thought the performance shouldn't be affected noticeably, but this
doesn't seem to hold true.  As I understand, the idea was that there
would be little or no additional WAL records .. only flags in the
existing record.  So what is happening?

[1] https://postgr.es/m/20190408010427.4l63qr7h2fjcy...@alap3.anarazel.de


I agree that 10% performance drop is not what we expect with this patch.
Ibrar, can you share more info about your tests? I'd like to reproduce 
this slowdown and fix it, if necessary.


I've run some tests on my laptop and COPY FREEZE shows the same time for 
both versions, while VACUUM is much faster on the patched version. I've 
also checked WAL generation and it shows that the patch works correctly 
as it doesn't add any records for COPY.


Not patched:

Time: 54883,356 ms (00:54,883)
Time: 65373,333 ms (01:05,373)
Time: 64684,592 ms (01:04,685)
VACUUM Time: 60861,670 ms (01:00,862)

COPY  wal_bytes 3765 MB
VACUUM wal_bytes 6015 MB
table size             5971 MB

Patched:

Time: 53142,947 ms (00:53,143)
Time: 65420,812 ms (01:05,421)
Time: 66600,114 ms (01:06,600)
VACUUM Time: 63,401 ms

COPY  wal_bytes 3765 MB
VACUUM wal_bytes 30 kB
table size             5971 MB

The test script is attached.


Also, when Andres posted this patch first, he said this was only for
heap_multi_insert because it was a prototype.  But I think we expect
that the table_insert path (CIM_SINGLE mode in copy) should also receive
that treatment.


I am afraid that extra checks for COPY FREEZE  in heap_insert() will 
slow down normal insertions.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index ca4b6e186b..0017e3415c 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -179,6 +179,69 @@ select pg_truncate_visibility_map('test_partition');
  
 (1 row)
 
+-- test copy freeze
+create table copyfreeze (a int, b char(1500));
+-- load all rows via COPY FREEZE and ensure that all pages are set all-visible
+-- and all-frozen.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | t   | t
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- load half the rows via regular COPY and rest via COPY FREEZE. The pages
+-- which are touched by regular COPY must not be set all-visible/all-frozen. On
+-- the other hand, pages allocated by COPY FREEZE should be marked
+-- all-frozen/all-visible.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | f   | f
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- Try a mix of regular COPY and COPY FREEZE.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
 -- cleanup
 drop table test_partitioned;
 drop view test_view;
@@ -188,3 +251,4 @@ drop server dummy_server;
 drop foreign data wrapper dummy;
 drop materialized view matview_visibility_test;
 drop table regular_table;
+drop table copyfreeze;
diff --git a/contrib/pg_visibility/sql/pg_visibility.sql b/contrib/pg_visibility/sql/pg_visibility.sql
index f79b54480b..ec1afd4906 100644
--- a/contrib/pg_visibility/sql/pg_visibility.sql
+++ b/contrib/pg_visibility/sql/pg_visibility.sql
@@ -94,6 +94,82 @@ select count(*) > 0 from pg_visibility_

Re: SyncRepLock acquired exclusively in default configuration

2020-08-19 Thread Asim Praveen



> On 12-Aug-2020, at 12:02 PM, Masahiko Sawada 
>  wrote:
> 
> On Wed, 12 Aug 2020 at 14:06, Asim Praveen  wrote:
>> 
>> 
>> 
>>> On 11-Aug-2020, at 8:57 PM, Robert Haas  wrote:
>>> 
>>> I think this gets to the root of the issue. If we check the flag
>>> without a lock, we might see a slightly stale value. But, considering
>>> that there's no particular amount of time within which configuration
>>> changes are guaranteed to take effect, maybe that's OK. However, there
>>> is one potential gotcha here: if the walsender declares the standby to
>>> be synchronous, a user can see that, right? So maybe there's this
>>> problem: a user sees that the standby is synchronous and expects a
>>> transaction committing afterward to provoke a wait, but really it
>>> doesn't. Now the user is unhappy, feeling that the system didn't
>>> perform according to expectations.
>> 
>> Yes, pg_stat_replication reports a standby in sync as soon as walsender 
>> updates priority of the standby to something other than 0.
>> 
>> The potential gotcha referred above doesn’t seem too severe.  What is the 
>> likelihood of someone setting synchronous_standby_names GUC with either “*” 
>> or a standby name and then immediately promoting that standby?  If the 
>> standby is promoted before the checkpointer on master gets a chance to 
>> update sync_standbys_defined in shared memory, commits made during this 
>> interval on master may not make it to standby.  Upon promotion, those 
>> commits may be lost.
> 
> I think that if the standby is quite behind the primary and in case of
> the primary crashes, the likelihood of losing commits might get
> higher. The user can see the standby became synchronous standby via
> pg_stat_replication but commit completes without a wait because the
> checkpointer doesn't update sync_standbys_defined yet. If the primary
> crashes before standby catching up and the user does failover, the
> committed transaction will be lost, even though the user expects that
> transaction commit has been replicated to the standby synchronously.
> And this can happen even without the patch, right?
> 

It is correct that the issue is orthogonal to the patch upthread and I’ve marked
the commitfest entry as ready-for-committer.

Regarding the issue described above, the amount by which the standby is lagging
behind the primary does not affect the severity.  A standby’s state will be
reported as “sync” to the user only after the standby has caught up (state ==
WALSNDSTATE_STREAMING).  The time it takes for the checkpointer to update the
sync_standbys_defined flag in shared memory is the important factor.  Once
checkpointer sets this flag, commits start waiting for the standby (as long as
it is in-sync).

Asim

Re: More tests with USING INDEX replident and dropped indexes

On Wed, Aug 19, 2020 at 05:33:36PM +0530, Rahila Syed wrote:
> I couldn't test the patch as it does not apply cleanly on master.

The CF bot is green, and I can apply v2 cleanly FWIW:
http://commitfest.cputube.org/michael-paquier.html

> Please find below some review comments:
> 
> 1. Would it better to throw a warning at the time of dropping the REPLICA
> IDENTITY index that it would also  drop the REPLICA IDENTITY of the
> parent table?

Not sure that's worth it.  Updating relreplident is a matter of
consistency.

> 2. CCI is used after calling SetRelationReplIdent from index_drop() but not
> after following call
> 
> relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
>   bool is_internal)
> 
> +   /* update relreplident if necessary */
> +   SetRelationReplIdent(RelationGetRelid(rel), ri_type);

Yes, not having a CCI here is the intention, because it is not
necessary.  That's not the case in index_drop() where the pg_class
entry of the parent table gets updated afterwards.

> 3.   I think the former variable names `pg_class_tuple` and
> `pg_class_form`provided better clarity
>  +   HeapTuple tuple;
> 
>  +   Form_pg_class classtuple;

This is chosen to be consistent with SetRelationHasSubclass().

> 4. I am not aware of the reason of the current behavior, however it seems
> better
> 
> to switch to REPLICA_IDENTITY_DEFAULT. Can't think of a reason why user
> would not be fine with using primary key as replica identity when
> replica identity index is dropped

Using NOTHING is more consistent with the current behavior we have
since 9.4 now.  There could be an argument for switching back to the
default, but that could be surprising to change an old behavior.
--
Michael


signature.asc
Description: PGP signature

Re: Print logical WAL message content

2020-08-19 Thread Ashutosh Bapat

Thanks Alvaro for review.

On Wed, Aug 19, 2020 at 3:21 AM Alvaro Herrera  wrote:
>
> I didn't like that you're documenting the message format in the new
> function:
>
> >   xl_logical_message *xlrec = (xl_logical_message *) rec;
> > + /*
> > +  * Per LogLogicalMessage() actual logical message follows a 
> > null-terminated prefix of length
> > +  * prefix_size.
>
> I would prefer to remove this comment, and instead add a comment atop
> xl_logical_message's struct definition in message.h to say that the
> message has a valid C-string as prefix, whose length is prefix_size, and
> please see logicalmesg_desc() if you change this.

It's documented in the struct definition. Added a note about logicalmesg_desc().

> This way, you don't need to blame LogLogicalMessage for this
> restriction, but it's actually part of the definition of the WAL
> message.
>
> > + /*
> > +  * Per LogLogicalMessage() actual logical message follows a 
> > null-terminated prefix of length
> > +  * prefix_size.
> > +  */
> > + char   *prefix = xlrec->message;
> > + char   *message = xlrec->message + xlrec->prefix_size;
> > + int cnt;
> > + char   *sep = "";
>
> This would cause a crash if the message actually fails to follow the
> rule.  Let's test that prefix[xlrec->prefix_size] is a trailing zero,
> and if not, avoid printing it.  Although, just Assert()'ing that it's a
> trailing zero would seem to suffice.

Added an Assert.

>
> > + appendStringInfo(buf, "%s message size %zu bytes, prefix %s; 
> > mesage: ",
> >xlrec->transactional ? 
> > "transactional" : "nontransactional",
> > -  xlrec->message_size);
> > +  xlrec->message_size, prefix);
>
> Misspelled "message", but also the line looks a bit repetitive -- the
> word "message" would appear three times:
>
> > lsn: 0/01570608, prev 0/015705D0, desc: MESSAGE nontransactional message 
> > size 12 bytes, prefix some_prefix; mesage: 73 6F 6D 65 20 6D 65 73 73 61 67 
> > 65
>
> I would reduce it to
>
> > lsn: 0/01570608, prev 0/015705D0, desc: MESSAGE nontransactional, prefix 
> > "some_prefix"; payload (12 bytes): 73 6F 6D 65 20 6D 65 73 73 61 67 65

I like this format as well. Done.

PFA the patch attached with your comments addressed.

Thanks for your review.

-- 
Best Wishes,
Ashutosh Bapat
From d32859c8368ac25366b156c3247a5b77d9b72ce8 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Tue, 18 Aug 2020 11:05:29 +0530
Subject: [PATCH] Print prefix and logical WAL message content in pg_waldump

Print logical WAL message prefix which is a null terminated ASCII
string. Print the actual message as a space separated hex byte string
since it may contain binary data. This is useful for debugging purposes.

Ashutosh Bapat, reviewed by Alvaro Herrera
---
 src/backend/access/rmgrdesc/logicalmsgdesc.c | 16 ++--
 src/include/replication/message.h|  4 +++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/rmgrdesc/logicalmsgdesc.c b/src/backend/access/rmgrdesc/logicalmsgdesc.c
index bff298c928..5da2625a0b 100644
--- a/src/backend/access/rmgrdesc/logicalmsgdesc.c
+++ b/src/backend/access/rmgrdesc/logicalmsgdesc.c
@@ -24,10 +24,22 @@ logicalmsg_desc(StringInfo buf, XLogReaderState *record)
 	if (info == XLOG_LOGICAL_MESSAGE)
 	{
 		xl_logical_message *xlrec = (xl_logical_message *) rec;
+		char   *prefix = xlrec->message;
+		char   *message = xlrec->message + xlrec->prefix_size;
+		int		cnt;
+		char   *sep = "";
 
-		appendStringInfo(buf, "%s message size %zu bytes",
+		Assert(prefix[xlrec->prefix_size] != '\0');
+
+		appendStringInfo(buf, "%s, prefix \"%s\"; payload (%zu bytes): ",
 		 xlrec->transactional ? "transactional" : "nontransactional",
-		 xlrec->message_size);
+		 prefix, xlrec->message_size);
+		/* Write actual message as a series of hex bytes. */
+		for (cnt = 0; cnt < xlrec->message_size; cnt++)
+		{
+			appendStringInfo(buf, "%s%02X", sep, (unsigned char)message[cnt]);
+			sep = " ";
+		}
 	}
 }
 
diff --git a/src/include/replication/message.h b/src/include/replication/message.h
index 937addde48..7b882296ec 100644
--- a/src/include/replication/message.h
+++ b/src/include/replication/message.h
@@ -25,7 +25,9 @@ typedef struct xl_logical_message
 	Size		message_size;	/* size of the message */
 	char		message[FLEXIBLE_ARRAY_MEMBER]; /* message including the null
  * terminated prefix of length
- * prefix_size */
+ * prefix_size. Please see
+ * logicalmsg_desc(), if you
+ * change this. */
 } xl_logical_message;
 
 #define SizeOfLogicalMessage	(offsetof(xl_logical_message, message))
-- 
2.17.1

Re: prepared transaction isolation tests

On Tue, Aug 18, 2020 at 07:34:00PM -0700, Andres Freund wrote:
> It seems like the buildfarm ought to configure the started server with a
> bunch of prepared transactions, in that case? At least going forward?

Agreed.  Testing with max_prepared_transactions > 0 has much more
value than not, for sure.  So I think that it could be a good thing,
particularly if we begin to add more isolation tests.
--
Michael


signature.asc
Description: PGP signature

Re: prepared transaction isolation tests

2020-08-19 Thread Andrew Dunstan



On 8/18/20 9:22 PM, Andres Freund wrote:
> Hi,
>
> This thread started on committers, at
> https://www.postgresql.org/message-id/20200818234532.uiafo5br5lo6zhya%40alap3.anarazel.de
>
> In it I wanted to add a isolation test around prepared transactions:
>
> On 2020-08-18 16:45:32 -0700, Andres Freund wrote:
>> I think it's worth adding an isolation test. But it doesn't seem like
>> extending prepared-transactions.spec makes too much sense, it doesn't
>> fit in well. It's a lot easier to reproduce the issue without
>> SERIALIZABLE, for example. Generally the file seems more about
>> serializable than 2PC...
>>
>> So unless somebody disagrees I'm gonna add a new
>> prepared-transactions-2.spec.
>
> But I noticed that the already existing prepared transactions test
> wasn't in the normal schedule, since:
>
> commit ae55d9fbe3871a5e6309d9b91629f1b0ff2b8cba
> Author: Andrew Dunstan 
> Date:   2012-07-20 15:51:40 -0400
>
> Remove prepared transactions from main isolation test schedule.
>
> There is no point in running this test when prepared transactions are 
> disabled,
> which is the default. New make targets that include the test are 
> provided. This
> will save some useless waste of cycles on buildfarm machines.
>
> Backpatch to 9.1 where these tests were introduced.
>
> diff --git a/src/test/isolation/isolation_schedule 
> b/src/test/isolation/isolation_schedule
> index 669c0f220c4..2184975dcb1 100644
> --- a/src/test/isolation/isolation_schedule
> +++ b/src/test/isolation/isolation_schedule
> @@ -9,7 +9,6 @@ test: ri-trigger
>  test: partial-index
>  test: two-ids
>  test: multiple-row-versions
> -test: prepared-transactions
>  test: fk-contention
>  test: fk-deadlock
>  test: fk-deadlock2
>
>
> The commit confuses me, cause I thought we explicitly enabled prepared
> transactions during tests well before that? See
>
> commit 8d4f2ecd41312e57422901952cbad234d293060b
> Author: Tom Lane 
> Date:   2009-04-23 00:23:46 +
>
> Change the default value of max_prepared_transactions to zero, and add
> documentation warnings against setting it nonzero unless active use of
> prepared transactions is intended and a suitable transaction manager has 
> been
> installed.  This should help to prevent the type of scenario we've seen
> several times now where a prepared transaction is forgotten and eventually
> causes severe maintenance problems (or even anti-wraparound shutdown).
> 
> The only real reason we had the default be nonzero in the first place was 
> to
> support regression testing of the feature.  To still be able to do that,
> tweak pg_regress to force a nonzero value during "make check".  Since we
> cannot force a nonzero value in "make installcheck", add a variant 
> regression
> test "expected" file that shows the results that will be obtained when
> max_prepared_transactions is zero.
> 
> Also, extend the HINT messages for transaction wraparound warnings to 
> mention
> the possibility that old prepared transactions are causing the problem.
> 
> All per today's discussion.
>
>
> And indeed, including the test in the schedule works for make check, not
> just an installcheck with explicitly enabled prepared xacts.
>
>
> ISTM we should just add an alternative output for disabled prepared
> xacts, and re-add the test?



here's the context for the 2012 commit.


https://www.postgresql.org/message-id/flat/50099220.2060005%40dunslane.net#8b189efc4920e1996ffa4d6a0ef81b9c


So I hope any changes that are made will not result in a major slowdown
of buildfarm animals.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Creating a function for exposing memory usage of backend process

On Wed, Aug 19, 2020 at 06:12:02PM +0900, Fujii Masao wrote:
> On 2020/08/19 17:40, torikoshia wrote:
>> Yes, I didn't add regression tests because of the unstability of the output.
>> I thought it would be OK since other views like pg_stat_slru and 
>> pg_shmem_allocations
>> didn't have tests for their outputs.
> 
> You're right.

If you can make a test with something minimal and with a stable
output, adding a test is helpful IMO, or how can you make easily sure
that this does not get broken, particularly in the event of future
refactorings, or even with platform-dependent behaviors?

By the way, I was looking at the code that has been committed, and I
think that it is awkward to have a SQL function in mcxt.c, which is a
rather low-level interface.  I think that this new code should be
moved to its own file, one suggestion for a location I have being
src/backend/utils/adt/mcxtfuncs.c.
--
Michael


signature.asc
Description: PGP signature

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-08-19 Thread Julien Rouhaud

On Tue, Jul 28, 2020 at 10:55:04AM +0200, Julien Rouhaud wrote:
> On Tue, Jul 28, 2020 at 10:07 AM torikoshia  
> wrote:
> >
> > Thanks for updating!
> > I tested the patch setting log_statement = 'all', but %Q in
> > log_line_prefix
> > was always 0 even when pg_stat_statements.queryid and
> > pg_stat_activity.queryid are not 0.
> >
> > Is this an intentional behavior?
> >
> >[...]
> 
> Thanks for the tests!  That's indeed an expected behavior (although I
> wasn't aware of it), which isn't documented in this patch (I'll fix
> it).  The reason for that is that log_statements is done right after
> parsing the query:
> 
> /*
>  * Do basic parsing of the query or queries (this should be safe even if
>  * we are in aborted transaction state!)
>  */
> parsetree_list = pg_parse_query(query_string);
> 
> /* Log immediately if dictated by log_statement */
> if (check_log_statement(parsetree_list))
> {
> ereport(LOG,
> (errmsg("statement: %s", query_string),
>  errhidestmt(true),
>  errdetail_execute(parsetree_list)));
> was_logged = true;
> }
> 
> As parse analysis is not yet done, no queryid can be computed at that
> point, so we always print 0.  That's a limitation that can't be
> removed without changing the semantics of log_statements, so we'll
> probably have to live with it.
> 
> > And here is a minor typo.
> >   optionnally -> optionally
> >
> >
> > > 753 +   /* query identifier, optionnally computed using
> > > post_parse_analyze_hook */
> 
> Thanks, I fixed it locally!


Recent conflict, rebased v11 attached.
>From 473d038a1b447d4569709c3a499fc7356af76452 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 18 Mar 2019 18:55:50 +0100
Subject: [PATCH v11] Expose queryid in pg_stat_activity and log_line_prefix

Similarly to other fields in pg_stat_activity, only the queryid from the top
level statements are exposed, and if the backends status isn't active then the
queryid from the last executed statements is displayed.

Also add a %Q placeholder to include the queryid in the log_line_prefix, which
will also only expose top level statements.

Author: Julien Rouhaud
Reviewed-by: Evgeny Efimkin, Michael Paquier, Yamada Tatsuro, Atsushi Torikoshi
Discussion: https://postgr.es/m/CA+8PKvQnMfOE-c3YLRwxOsCYXQDyP8VXs6CDtMZp1V4=d4l...@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 179 --
 doc/src/sgml/config.sgml  |   9 +-
 doc/src/sgml/monitoring.sgml  |  15 ++
 src/backend/catalog/system_views.sql  |   1 +
 src/backend/executor/execMain.c   |   8 +
 src/backend/executor/execParallel.c   |  14 +-
 src/backend/executor/nodeGather.c |   3 +-
 src/backend/executor/nodeGatherMerge.c|   4 +-
 src/backend/parser/analyze.c  |   5 +
 src/backend/postmaster/pgstat.c   |  65 +++
 src/backend/tcop/postgres.c   |   5 +
 src/backend/utils/adt/pgstatfuncs.c   |   7 +-
 src/backend/utils/error/elog.c|  10 +-
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/executor/execParallel.h   |   3 +-
 src/include/pgstat.h  |   5 +
 src/test/regress/expected/rules.out   |   9 +-
 18 files changed, 270 insertions(+), 79 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6b91c62c31..486d07f9de 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -115,6 +115,14 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 
 #define JUMBLE_SIZE1024	/* query serialization buffer size */
 
+/*
+ * Utility statements that pgss_ProcessUtility and pgss_post_parse_analyze
+ * ignores.
+ */
+#define PGSS_HANDLED_UTILITY(n)		(!IsA(n, ExecuteStmt) && \
+	!IsA(n, PrepareStmt) && \
+	!IsA(n, DeallocateStmt))
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -345,7 +353,8 @@ static void pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 ProcessUtilityContext context, ParamListInfo params,
 QueryEnvironment *queryEnv,
 DestReceiver *dest, QueryCompletion *qc);
-static uint64 pgss_hash_string(const char *str, int len);
+static const char *pgss_clean_querytext(const char *query, int *location, int *len);
+static uint64 pgss_compute_utility_queryid(const char *query, int query_len);
 static void pgss_store(const char *query, uint64 queryId,
 	   int query_location, int query_len,
 	   pgssStoreKind kind,
@@ -845,16 +854,34 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)
 		return;
 
 	/*
-	 * Utility statements get queryId zero.  We do this even in cases where
-	 * the statement co

Re: Creating a function for exposing memory usage of backend process

2020-08-19 Thread Tom Lane

Hadn't been paying attention to this thread up till now, but ...

Michael Paquier  writes:
> By the way, I was looking at the code that has been committed, and I
> think that it is awkward to have a SQL function in mcxt.c, which is a
> rather low-level interface.  I think that this new code should be
> moved to its own file, one suggestion for a location I have being
> src/backend/utils/adt/mcxtfuncs.c.

I agree with that, but I think this patch has a bigger problem:
why bother at all?  It seems like a waste of code space and future
maintenance effort, because there is no use-case.  In the situations
where you need to know where the memory went, you are almost never
in a position to leisurely execute a query and send the results over
to your client.  This certainly would be useless to figure out why
an already-running query is eating space, for instance.

The only situation I could imagine where this would have any use is
where there is long-term (cross-query) bloat in, say, CacheMemoryContext
--- but it's not even very helpful for that, since you can't examine
anything finer-grain than a memory context.  Plus you need to be
running an interactive session, or else be willing to hack up your
application to try to get it to inspect the view (and log the
results somewhere) at useful times.

On top of all that, the functionality has Heisenberg problems,
because simply using it changes what you are trying to observe,
in complex and undocumented ways (not that the documentation
would be of any use to non-experts anyway).

My own thoughts about improving the debugging situation would've been
to create a way to send a signal to a session to make it dump its
current memory map to the postmaster log (not the client, since the
client is unlikely to be prepared to receive anything extraneous).
But this is nothing like that.

Given the lack of clear use-case, and the possibility (admittedly
not strong) that this is still somehow a security hazard, I think
we should revert it.  If it stays, I'd like to see restrictions
on who can read the view.

regards, tom lane

Re: Performing partition pruning using row value

2020-08-19 Thread ahsan hadi

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I have performed testing of the patch with row comparison partition pruning 
scenarios, it is working well. I didn't code review hence not changing the 
status.

Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)




On 2020/08/19 19:39, David Rowley wrote:

On Wed, 19 Aug 2020 at 21:05, Julien Rouhaud  wrote:


On Wed, Aug 19, 2020 at 08:49:48PM +1200, David Rowley wrote:

However, I'm not quite sure how we should handle if someone does:
EXPLAIN (BUFFERS on, SUMMARY off). Without the summary, there's no
place to print the buffers, which seems bad as they asked for buffers.



But this won't be as much a problem if ANALYZE is asked, and having different
behaviors isn't appealing.  So maybe it's better to let people get what they
asked for even if that's contradictory?


I'd say BUFFERS on, BUFFERS off is contradictory. I don't think
BUFFERS, SUMMARY OFF is. It's just that we show the buffer details for
the planner in the summary.  Since "summary" is not exactly a word
that describes what you're asking EXPLAIN to do, I wouldn't blame
users if they got confused as to why their BUFFERS on request was not
displayed.


Displaying the planner's buffer usage under summary is the root cause of
the confusion? If so, what about displaying that outside summary?
Attached is the POC patch that I'm just thinking.

With the patch, for example, whatever "summary" settng is, "buffers on"
displays the planner's buffer usage if it happens.

=# explain (buffers on, summary off) select * from t;
 QUERY PLAN
-
 Seq Scan on t  (cost=0.00..32.60 rows=2260 width=8)
 Planning:
   Buffers: shared hit=16 read=6
(3 rows)


If "summary" is enabled, the planning time is also displayed.

=# explain (buffers on, summary on) select * from t;
 QUERY PLAN
-
 Seq Scan on t  (cost=0.00..32.60 rows=2260 width=8)
 Planning:
   Buffers: shared hit=16 read=6
 Planning Time: 0.904 ms
(4 rows)


If the planner's buffer usage doesn't happen, it's not displayed
(in text format).

=# explain (buffers on, summary on) select * from t;
 QUERY PLAN
-
 Seq Scan on t  (cost=0.00..32.60 rows=2260 width=8)
 Planning Time: 0.064 ms
(2 rows)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 30e0a7ee7f..c98c9b5547 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -116,7 +116,8 @@ static void show_instrumentation_count(const char *qlabel, 
int which,
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
+ bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
 static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir,

ExplainState *es);
@@ -221,11 +222,6 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 parser_errposition(pstate, 
opt->location)));
}
 
-   if (es->buffers && !es->analyze)
-   ereport(ERROR,
-   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("EXPLAIN option BUFFERS requires 
ANALYZE")));
-
if (es->wal && !es->analyze)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -586,8 +582,13 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, 
ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
 
-   if (es->summary && (planduration || bufusage))
+   /* Show buffer usage in planning */
+   if (bufusage)
+   {
ExplainOpenGroup("Planning", "Planning", true, es);
+   show_buffer_usage(es, bufusage, true);
+   ExplainCloseGroup("Planning", "Planning", true, es);
+   }
 
if (es->summary && planduration)
{
@@ -596,19 +597,6 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, 
ExplainState *es,
ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 
3, es);
}
 
-   /* Show buffer usage */
-   if (es->summary && bufusage)
-   {
-   if (es->format == EXPLAIN_FORMAT_TEXT)
-   es->indent++;
-   show_buffer_usage(es, bufusage);
-   if (es->format == EXPLAIN_FORMAT_TEXT)
-   es->indent--;
-   }
-
-   if (es->summary && (planduration || bufusage))
-   ExplainCloseGroup("Planning", "Planning", true, es);
-
/* Print info

Re: stress test for parallel workers

2020-08-19 Thread Tom Lane

Thomas Munro  writes:
> On Tue, Jul 28, 2020 at 3:27 PM Tom Lane  wrote:
>> Anyway, I guess the interesting question for us is how long it
>> will take for this fix to propagate into real-world systems.
>> I don't have much of a clue about the Linux kernel workflow,
>> anybody want to venture a guess?

> Me neither.  It just hit Torvalds' tree[1] marked "Cc:
> sta...@vger.kernel.org # v2.6.27+".  I looked at the time for a couple
> of other PowerPC-related commits of similar complexity involving some
> of the same names to get from there to a Debian stable kernel package
> and it seemed to be under a couple of months.
> [1] 
> https://github.com/torvalds/linux/commit/63dee5df43a31f3844efabc58972f0a206ca4534

For our archives' sake: today I got seemingly-automated mail informing me
that this patch has been merged into the 4.19-stable, 5.4-stable,
5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable,
4.9-stable, or 4.14-stable, because it failed to apply.

regards, tom lane

deb repo doesn't have latest. or possible to update web page?

2020-08-19 Thread Roger Pack

As a note I tried to use the deb repo today:

https://www.postgresql.org/download/linux/debian/

with an old box on Wheezy.
It only seems to have binaries up to postgres 10.

Might be nice to make a note on the web page so people realize some
distro's aren't supported fully instead of (if they're like me)
wondering "why don't these instructions work? It says to run  apt-get
install postgresql-12" 

Thanks!

Problem with accessing TOAST data in stored procedures

2020-08-19 Thread Konstantin Knizhnik


Hi hackers,

More than month ago I have sent bug report to pgsql-bugs:

https://www.postgresql.org/message-id/flat/5d335911-fb25-60cd-4aa7-a5bd0954aea0%40postgrespro.ru

with the proposed patch but have not received any response.

I wonder if there is some other way to fix this issue and does somebody 
working on it.
While the added check itself is trivial (just one line) the total patch 
is not so small because I have added walker for
plpgsql statements tree. It is not strictly needed in this case (it is 
possible to find some other way to determine that stored procedure
contains transaction control statements), but I hope such walker may be 
useful in other cases.


In any case, I will be glad to receive any response,
because this problem was reported by one of our customers and we need to 
provide some fix.

It is better to include it in vanilla, rather than in our pgpro-ee fork.

If it is desirable, I can add this patch to commitfest.

Thanks in advance,
Konstantin

Re: deb repo doesn't have latest. or possible to update web page?

2020-08-19 Thread Magnus Hagander

On Wed, Aug 19, 2020 at 7:04 PM Roger Pack  wrote:

> As a note I tried to use the deb repo today:
>
> https://www.postgresql.org/download/linux/debian/
>
> with an old box on Wheezy.
> It only seems to have binaries up to postgres 10.
>
> Might be nice to make a note on the web page so people realize some
> distro's aren't supported fully instead of (if they're like me)
> wondering "why don't these instructions work? It says to run  apt-get
> install postgresql-12" ...
>

The page lists which distros *are* supported. You can assume that anything
*not* listed is unsupported.

In the case of wheezy, whatever was the latest when it stopped being
supported, is still there. Which I guess can cause some confusing if you
just run the script without reading the note that's there. I'm unsure how
to fix that though.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: Creating foreign key on partitioned table is too slow

On 2020-Aug-19, Amit Langote wrote:

Hello

> On Thu, Aug 6, 2020 at 4:25 PM kato-...@fujitsu.com
>  wrote:
> > On Wednesday, August 5, 2020 9:43 AM I wrote:
> > > I'll report the result before the end of August .
> >
> > I test v2-0001-build-partdesc-memcxt.patch at 9a9db08ae4 and it is ok.
> 
> Is this patch meant for HEAD or back-patching?  I ask because v13 got this:
> 
> commit 5b9312378e2f8fb35ef4584aea351c3319a10422
> Author: Tom Lane 
> Date:   Wed Dec 25 14:43:13 2019 -0500
> 
> Load relcache entries' partitioning data on-demand, not immediately.
> 
> which prevents a partitioned table's PartitionDesc from being rebuilt
> repeatedly as would happen before this commit in Kato-san's case,
> because it moves RelationBuildPartitionDesc out of the relcache flush
> code path.

Hmm, so this is a problem only in v11 and v12?  It seems that putting
the patch in master *only* is pointless.  OTOH v11 had other severe
performance drawbacks with lots of partitions, so it might not be needed
there.

I admit I'm hesitant to carry code in only one or two stable branches
that exists nowhere else.  But maybe the problem is serious enough in
those branches (that will still live for quite a few years) that we
should get it there.

OTOH it could be argued that the coding in master is not all that great
anyway (given the willingness for memory to be leaked) that it should
apply to all three branches.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: remove spurious CREATE INDEX CONCURRENTLY wait

On 2020-Aug-11, James Coleman wrote:

> In [3] I'd brought up that a function index can do arbitrary
> operations (including, terribly, a query of another table), and Andres
> (in [4]) noted that:
> 
> > Well, even if we consider this an actual problem, we could still use
> > PROC_IN_CIC for non-expression non-partial indexes (index operator
> > themselves better ensure this isn't a problem, or they're ridiculously
> > broken already - they can get called during vacuum).
> 
> But went on to describe how this is a problem with all expression
> indexes (even if those expressions don't do dangerous things) because
> of syscache lookups, but that ideally for expression indexes we'd have
> a way to use a different (or more frequently taken) snapshot for the
> purpose of computing those expressions. That's a significantly more
> involved patch though.

So the easy first patch here is to add the flag as PROC_IN_SAFE_CIC,
which is set only for CIC when it's used for indexes that are neither
on expressions nor partial.  Then ignore those in WaitForOlderSnapshots.
The attached patch does it that way.  (Also updated per recent
conflicts).

I did not set the flag in REINDEX CONCURRENTLY, but as I understand it
can be done too, since in essence it's the same thing as a CIC from a
snapshot management point of view.

Also, per [1], ISTM this flag could be used to tell lazy VACUUM to
ignore the Xmin of this process too, which the previous formulation
(where all CICs were so marked) could not.  This patch doesn't do that
yet, but it seems the natural next step to take.

[1] https://postgr.es/m/20191101203310.GA12239@alvherre.pgsql

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 2a5088dfa35cbc800a87dc2154b6ebfa22837a66 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 4 Aug 2020 22:04:57 -0400
Subject: [PATCH v2] Avoid spurious CREATE INDEX CONCURRENTLY waits

---
 src/backend/commands/indexcmds.c | 50 ++--
 src/include/storage/proc.h   |  6 +++-
 2 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 254dbcdce5..459f6fa5db 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -372,7 +372,10 @@ CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)
+ * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
+ * on indexes that are neither expressional nor partial are also safe to
+ * ignore, since we know that those processes won't examine any data
+ * outside the table they're indexing.
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -393,7 +396,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+		  | PROC_IN_SAFE_CIC,
 		  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -413,7 +417,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 	true, false,
-	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+	| PROC_IN_SAFE_CIC,
 	&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -506,6 +511,7 @@ DefineIndex(Oid relationId,
 	bool		amcanorder;
 	amoptions_function amoptions;
 	bool		partitioned;
+	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1033,6 +1039,17 @@ DefineIndex(Oid relationId,
 		}
 	}
 
+	/*
+	 * When doing concurrent index builds, we can set a PGPROC flag to tell
+	 * concurrent VACUUM, CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY
+	 * to ignore us when waiting for concurrent snapshots.  That can only be
+	 * done for indexes that don't execute any expressions.  Determine that.
+	 * (The flag is reset automatically at transaction end, so it must be
+	 * set for each transaction.)
+	 */
+	safe_index = indexInfo->ii_Expressions == NIL &&
+		indexInfo->ii_Predicate == NIL;
+
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1419,6 +1436,15 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+	{
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		MyProc->vacuumFlags |= PROC

Re: Problem with accessing TOAST data in stored procedures

2020-08-19 Thread Pavel Stehule

Hi

st 19. 8. 2020 v 19:22 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

> Hi hackers,
>
> More than month ago I have sent bug report to pgsql-bugs:
>
>
> https://www.postgresql.org/message-id/flat/5d335911-fb25-60cd-4aa7-a5bd0954aea0%40postgrespro.ru
>
> with the proposed patch but have not received any response.
>
> I wonder if there is some other way to fix this issue and does somebody
> working on it.
> While the added check itself is trivial (just one line) the total patch
> is not so small because I have added walker for
> plpgsql statements tree. It is not strictly needed in this case (it is
> possible to find some other way to determine that stored procedure
> contains transaction control statements), but I hope such walker may be
> useful in other cases.
>
> In any case, I will be glad to receive any response,
> because this problem was reported by one of our customers and we need to
> provide some fix.
> It is better to include it in vanilla, rather than in our pgpro-ee fork.
>
> If it is desirable, I can add this patch to commitfest.
>


I don't like this design. It is not effective to repeat the walker for
every execution. Introducing a walker just for this case looks like
overengineering.
Personally I am not sure if a walker for plpgsql is a good idea (I thought
about it more times, when I wrote plpgsql_check). But anyway - there should
be good reason for introducing the walker and clean use case.

If you want to introduce stmt walker, then it should be a separate patch
with some benefit on plpgsql environment length.

Regards

Pavel


> Thanks in advance,
> Konstantin
>
>
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index d4a3d58daa..ccbddd4d7a 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -5997,6 +5997,12 @@ exec_for_query(PLpgSQL_execstate *estate, PLpgSQL_stmt_forq *stmt,
 	 */
 	PinPortal(portal);
 
+	/*
+	 * Disable prefetch if procedure contains COMMIT or ROLLBACK statements
+	 */
+	if (prefetch_ok && estate->func->fn_xactctrl)
+		prefetch_ok = false;
+
 	/*
 	 * Fetch the initial tuple(s).  If prefetching is allowed then we grab a
 	 * few more rows to avoid multiple trips through executor startup
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 5a7e1a..5b27311b95 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -2215,6 +2215,8 @@ stmt_commit		: K_COMMIT opt_transaction_chain ';'
 		new->stmtid = ++plpgsql_curr_compile->nstatements;
 		new->chain = $2;
 
+		plpgsql_curr_compile->fn_xactctrl = true;
+
 		$$ = (PLpgSQL_stmt *)new;
 	}
 ;
@@ -2229,6 +2231,8 @@ stmt_rollback	: K_ROLLBACK opt_transaction_chain ';'
 		new->stmtid = ++plpgsql_curr_compile->nstatements;
 		new->chain = $2;
 
+		plpgsql_curr_compile->fn_xactctrl = true;
+
 		$$ = (PLpgSQL_stmt *)new;
 	}
 ;
diff --git a/src/pl/plpgsql/src/plpgsql.h b/src/pl/plpgsql/src/plpgsql.h
index 0c3d30fb13..e9b9b0d335 100644
--- a/src/pl/plpgsql/src/plpgsql.h
+++ b/src/pl/plpgsql/src/plpgsql.h
@@ -1006,6 +1006,7 @@ typedef struct PLpgSQL_function
 	bool		fn_retisdomain;
 	bool		fn_retset;
 	bool		fn_readonly;
+	bool		fn_xactctrl;
 	char		fn_prokind;
 
 	int			fn_nargs;

Re: Problem with accessing TOAST data in stored procedures

2020-08-19 Thread Konstantin Knizhnik

On 19.08.2020 21:50, Pavel Stehule wrote:

st 19. 8. 2020 v 19:22 odesílatel Konstantin Knizhnik
mailto:k.knizh...@postgrespro.ru>> napsal:

Hi hackers,

More than month ago I have sent bug report to pgsql-bugs:

https://www.postgresql.org/message-id/flat/5d335911-fb25-60cd-4aa7-a5bd0954aea0%40postgrespro.ru

with the proposed patch but have not received any response.

I wonder if there is some other way to fix this issue and does
somebody
working on it.
While the added check itself is trivial (just one line) the total
patch
is not so small because I have added walker for
plpgsql statements tree. It is not strictly needed in this case
(it is
possible to find some other way to determine that stored procedure
contains transaction control statements), but I hope such walker
may be
useful in other cases.

In any case, I will be glad to receive any response,
because this problem was reported by one of our customers and we
need to
provide some fix.
It is better to include it in vanilla, rather than in our pgpro-ee
fork.

If it is desirable, I can add this patch to commitfest.

I don't like this design. It is not effective to repeat the walker for
every execution. Introducing a walker just for this case looks like
overengineering.
Personally I am not sure if a walker for plpgsql is a good idea (I
thought about it more times, when I wrote plpgsql_check). But anyway -
there should be good reason for introducing the walker and clean use case.

If you want to introduce stmt walker, then it should be a separate
patch with some benefit on plpgsql environment length.

If you think that plpgsql statement walker is not needed, then I do not
insist.

Are you going to commit your version of the patch?

Re: Problem with accessing TOAST data in stored procedures

2020-08-19 Thread Pavel Stehule

st 19. 8. 2020 v 20:59 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 19.08.2020 21:50, Pavel Stehule wrote:
>
> Hi
>
> st 19. 8. 2020 v 19:22 odesílatel Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> napsal:
>
>> Hi hackers,
>>
>> More than month ago I have sent bug report to pgsql-bugs:
>>
>>
>> https://www.postgresql.org/message-id/flat/5d335911-fb25-60cd-4aa7-a5bd0954aea0%40postgrespro.ru
>>
>> with the proposed patch but have not received any response.
>>
>> I wonder if there is some other way to fix this issue and does somebody
>> working on it.
>> While the added check itself is trivial (just one line) the total patch
>> is not so small because I have added walker for
>> plpgsql statements tree. It is not strictly needed in this case (it is
>> possible to find some other way to determine that stored procedure
>> contains transaction control statements), but I hope such walker may be
>> useful in other cases.
>>
>> In any case, I will be glad to receive any response,
>> because this problem was reported by one of our customers and we need to
>> provide some fix.
>> It is better to include it in vanilla, rather than in our pgpro-ee fork.
>>
>> If it is desirable, I can add this patch to commitfest.
>>
>
>
> I don't like this design. It is not effective to repeat the walker for
> every execution. Introducing a walker just for this case looks like
> overengineering.
> Personally I am not sure if a walker for plpgsql is a good idea (I thought
> about it more times, when I wrote plpgsql_check). But anyway - there should
> be good reason for introducing the walker and clean use case.
>
> If you want to introduce stmt walker, then it should be a separate patch
> with some benefit on plpgsql environment length.
>
> If you think that plpgsql statement walker is not needed, then I do not
> insist.
> Are you going to commit your version of the patch?
>

I am afraid so it needs significantly much more work :(. My version is
correct just for the case that you describe, but it doesn't fix the
possibility of the end of the transaction inside the nested CALL.

Some like

DO $$ DECLARE v_r record; BEGIN FOR v_r in SELECT data FROM toasted LOOP
INSERT INTO toasted(data) VALUES(v_r.data) CALL check_and_commit();END
LOOP;END;$$;

Probably my patch (or your patch) will fix on 99%, but still there will be
a possibility of this issue. It is very similar to fixing releasing expr
plans inside CALL statements. Current design of CALL statement is ugly
workaround - it is slow, and there is brutal memory leak. Fixing memory
leak is not hard. Fixing every time replaning (and sometimes useless) needs
depper fix. Please check patch
https://www.postgresql.org/message-id/attachment/112489/plpgsql-stmt_call-fix-2.patch
Maybe this mechanism can be used for a clean fix of the problem mentioned
in this thread.

Regards

Pavel

"ccold" left by reindex concurrently are droppable?

Hello

The REINDEX CONCURRENTLY documentation states that if a transient index
used lingers, the fix is to drop the invalid index and perform RC again;
and that this is to be done for "ccnew" indexes and also for "ccold"
indexes:

The recommended recovery method in such cases is to drop the invalid index
and try again to perform REINDEX CONCURRENTLY.  The
concurrent index created during the processing has a name ending in the
suffix ccnew, or ccold if it is an
old index definition which we failed to drop. Invalid indexes can be
dropped using DROP INDEX, including invalid toast
indexes.

But this seems misleading to me.  It is correct advice for "ccnew"
indexes, of course.  But if the index is named "ccold", then the rebuild
of the index actually succeeded, so you can just drop the ccold index
and not rebuild anything.

In other words I propose to reword this paragraph as follows:

   If the transient index created during the concurrent operation is
   suffixed ccnew, the recommended recovery method
   is to drop the invalid index using DROP INDEX,
   and try to perform REINDEX CONCURRENTLY again. 
   If the transient index is instead suffixed ccold,
   it corresponds to the original index which we failed to drop;
   the recommended recovery method is to just drop said index, since the
   rebuild proper has been successful.

(The original talks about "the concurrent index", which seems somewhat
sloppy thinking.  I used the term "transient index" instead.)

-- 
Álvaro Herrera39°49'30"S 73°17'W

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

On Wed, 19 Aug 2020 at 16:23, Tom Lane  wrote:
>
> David Rowley  writes:
> > I don't object to making the change. I just object to making it only
> > to put it back again later when someone else speaks up that they'd
> > prefer to keep nodes modular and not overload them in obscure ways.
>
> > So other input is welcome.  Is it too weird to overload SubPlan and
> > Nested Loop this way?  Or okay to do that if it squeezes out a dozen
> > or so nanoseconds per tuple?
>
> If you need somebody to blame it on, blame it on me - but I agree
> that that is an absolutely horrid abuse of NestLoop.  We might as
> well reduce explain.c to a one-liner that prints "Here Be Dragons",
> because no one will understand what this display is telling them.

Thanks for chiming in. I'm relieved it's not me vs everyone else anymore.

> I'm also quite skeptical that adding overhead to nodeNestloop.c
> to support this would actually be a net win once you account for
> what happens in plans where the caching is of no value.

Agreed.

David

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

On Wed, 19 Aug 2020 at 16:18, Pavel Stehule  wrote:
>
>
>
> st 19. 8. 2020 v 5:48 odesílatel David Rowley  napsal:
>> Current method:
>>
>> regression=# explain (analyze, costs off, timing off, summary off)
>> select twenty, (select count(*) from tenk1 t2 where t1.twenty =
>> t2.twenty) from tenk1 t1;
>>  QUERY PLAN
>> -
>>  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
>>SubPlan 1
>>  ->  Result Cache (actual rows=1 loops=1)
>>Cache Key: t1.twenty
>>Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
>>->  Aggregate (actual rows=1 loops=20)
>>  ->  Seq Scan on tenk1 t2 (actual rows=500 loops=20)
>>Filter: (t1.twenty = twenty)
>>Rows Removed by Filter: 9500
>> (9 rows)
>>
>> Andres' suggestion:
>>
>> regression=# explain (analyze, costs off, timing off, summary off)
>> select twenty, (select count(*) from tenk1 t2 where t1.twenty =
>> t2.twenty) from tenk1 t1;
>>  QUERY PLAN
>> -
>>  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
>>SubPlan 1
>> Cache Key: t1.twenty  Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
>> ->  Aggregate (actual rows=1 loops=20)
>>   ->  Seq Scan on tenk1 t2 (actual rows=500 loops=20)
>> Filter: (t1.twenty = twenty)
>> Rows Removed by Filter: 9500
>> (7 rows)

> I didn't do performance tests, that should be necessary, but I think Andres' 
> variant is a little bit more readable.

Thanks for chiming in on this.  I was just wondering about the
readability part and what makes the one with the Result Cache node
less readable?  I can think of a couple of reasons you might have this
view and just wanted to double-check what it is.

David

Re: Creating a function for exposing memory usage of backend process

2020-08-19 Thread Andres Freund

Hi,

On 2020-08-19 11:01:37 -0400, Tom Lane wrote:
> Hadn't been paying attention to this thread up till now, but ...
> 
> Michael Paquier  writes:
> > By the way, I was looking at the code that has been committed, and I
> > think that it is awkward to have a SQL function in mcxt.c, which is a
> > rather low-level interface.  I think that this new code should be
> > moved to its own file, one suggestion for a location I have being
> > src/backend/utils/adt/mcxtfuncs.c.
> 
> I agree with that, but I think this patch has a bigger problem:
> why bother at all?  It seems like a waste of code space and future
> maintenance effort, because there is no use-case.  In the situations
> where you need to know where the memory went, you are almost never
> in a position to leisurely execute a query and send the results over
> to your client.  This certainly would be useless to figure out why
> an already-running query is eating space, for instance.

I don't agree with this at all. I think there's plenty use cases. It's
e.g. very common to try to figure out why the memory usage of a process
is high. Is it memory not returned to the OS? Is it caches that have
grown too much etc.

I agree it's not perfect:

> Plus you need to be running an interactive session, or else be willing
> to hack up your application to try to get it to inspect the view (and
> log the results somewhere) at useful times.

and that we likely should address that by *also* allowing to view the
memory usage of another process. Which obviously isn't entirely
trivial. But some infrastructure likely could be reused.


> My own thoughts about improving the debugging situation would've been
> to create a way to send a signal to a session to make it dump its
> current memory map to the postmaster log (not the client, since the
> client is unlikely to be prepared to receive anything extraneous).
> But this is nothing like that.

That doesn't really work in a large number of environments, I'm
afraid. Many many users don't have access to the server log.


> If it stays, I'd like to see restrictions on who can read the view.

As long as access is grantable rather than needing a security definer
wrapper I'd be fine with that.

Greetings,

Andres Freund

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

On 2020-Aug-19, David Rowley wrote:

> Andres' suggestion:
> 
> regression=# explain (analyze, costs off, timing off, summary off)
> select count(*) from tenk1 t1 inner join tenk1 t2 on
> t1.twenty=t2.unique1;
>   QUERY PLAN
> ---
>  Aggregate (actual rows=1 loops=1)
>->  Nested Loop (actual rows=1 loops=1)
>   Cache Key: t1.twenty  Hits: 9980  Misses: 20  Evictions: 0 
> Overflows: 0
> ->  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
> ->  Index Scan using tenk1_unique1 on tenk1 t2 (actual rows=1 
> loops=20)
>   Index Cond: (unique1 = t1.twenty)
> (6 rows)

I think it doesn't look terrible in the SubPlan case -- it kinda makes
sense there -- but for nested loop it appears really strange.

On the performance aspect, I wonder what the overhead is, particularly
considering Tom's point of making these nodes more expensive for cases
with no caching.  And also, as the JIT saga continues, aren't we going
to get plan trees recompiled too, at which point it won't matter much?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Optimising compactify_tuples()

2020-08-19 Thread Thomas Munro

On Wed, Aug 19, 2020 at 11:41 PM Thomas Munro  wrote:
> On Tue, Aug 18, 2020 at 6:53 AM Peter Geoghegan  wrote:
> > I definitely think that we should have something like this, though.
> > It's a relatively easy win. There are plenty of workloads that spend
> > lots of time on pruning.
>
> Alright then, here's an attempt to flesh the idea out a bit more, and
> replace the three other copies of qsort() while I'm at it.

I fixed up the copyright messages, and removed some stray bits of
build scripting relating to the Perl-generated file.  Added to
commitfest.
From 940100ea1b2021e434976f6ce10aae75dd265d26 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 17 Aug 2020 21:31:56 +1200
Subject: [PATCH v2 1/5] Add sort_template.h for making fast sort functions.

Move our qsort implementation into a header that can be used to
define specialized functions for better performance.

Discussion: https://postgr.es/m/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com
---
 src/include/lib/sort_template.h | 431 
 1 file changed, 431 insertions(+)
 create mode 100644 src/include/lib/sort_template.h

diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
new file mode 100644
index 00..0e163d4d8a
--- /dev/null
+++ b/src/include/lib/sort_template.h
@@ -0,0 +1,431 @@
+/*-
+ *
+ * sort_template.h
+ *
+ *	  A template for a sort algorithm that supports varying degrees of
+ *	  specialization.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1992-1994, Regents of the University of California
+ *
+ * Usage notes:
+ *
+ *	  To generate functions specialized for a type, the following parameter
+ *	  macros should be #define'd before this file is included.
+ *
+ *	  - ST_SORT - the name of a sort function to be generated
+ *	  - ST_ELEMENT_TYPE - type of the referenced elements
+ *	  - ST_DECLARE - if defined the functions and types are declared
+ *	  - ST_DEFINE - if defined the functions and types are defined
+ *	  - ST_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ *	  Instead of ST_ELEMENT_TYPE, ST_ELEMENT_TYPE_VOID can be defined.  Then
+ *	  the generated functions will automatically gain an "element_size"
+ *	  parameter.  This allows us to generate a traditional qsort function.
+ *
+ *	  One of the following macros must be defined, to show how to compare
+ *	  elements.  The first two options are arbitrary expressions depending
+ *	  on whether an extra pass-through argument is desired, and the third
+ *	  option should be defined if the sort function should receive a
+ *	  function pointer at runtime.
+ *
+ * 	  - ST_COMPARE(a, b) - a simple comparison expression
+ *	  - ST_COMPARE(a, b, arg) - variant that takes an extra argument
+ *	  - ST_COMPARE_RUNTIME_POINTER - sort function takes a function pointer
+ *
+ *	  To say that the comparator and therefore also sort function should
+ *	  receive an extra pass-through argument, specify the type of the
+ *	  argument.
+ *
+ *	  - ST_COMPARE_ARG_TYPE - type of extra argument
+ *
+ *	  The prototype of the generated sort function is:
+ *
+ *	  void ST_SORT(ST_ELEMENT_TYPE *data, size_t n,
+ *   [size_t element_size,]
+ *   [ST_SORT_compare_function compare,]
+ *   [ST_COMPARE_ARG_TYPE *arg]);
+ *
+ *	  ST_SORT_compare_function is a function pointer of the following type:
+ *
+ *	  int (*)(const ST_ELEMENT_TYPE *a, const ST_ELEMENT_TYPE *b,
+ *			  [ST_COMPARE_ARG_TYPE *arg])
+ *
+ * HISTORY
+ *
+ *	  Modifications from vanilla NetBSD source:
+ *	  - Add do ... while() macro fix
+ *	  - Remove __inline, _DIAGASSERTs, __P
+ *	  - Remove ill-considered "swap_cnt" switch to insertion sort, in favor
+ *		of a simple check for presorted input.
+ *	  - Take care to recurse on the smaller partition, to bound stack usage
+ *	  - Convert into a header that can generate specialized functions
+ *
+ * IDENTIFICATION
+ *		src/include/lib/sort_template.h
+ *
+ *-
+ */
+
+/*	  $NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $   */
+
+/*-
+ * Copyright (c) 1992, 1993
+ *	  The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *	  notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *	  notice, this list of conditions and the following disclaimer in the
+ *	  documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *	  may be used to endorse or promote products derived from this software
+

Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

On Thu, 20 Aug 2020 at 03:31, Fujii Masao  wrote:
>
>
>
> On 2020/08/19 19:39, David Rowley wrote:
> > On Wed, 19 Aug 2020 at 21:05, Julien Rouhaud  wrote:
> >>
> >> On Wed, Aug 19, 2020 at 08:49:48PM +1200, David Rowley wrote:
> >>> However, I'm not quite sure how we should handle if someone does:
> >>> EXPLAIN (BUFFERS on, SUMMARY off). Without the summary, there's no
> >>> place to print the buffers, which seems bad as they asked for buffers.
> >>
> >>
> >> But this won't be as much a problem if ANALYZE is asked, and having 
> >> different
> >> behaviors isn't appealing.  So maybe it's better to let people get what 
> >> they
> >> asked for even if that's contradictory?
> >
> > I'd say BUFFERS on, BUFFERS off is contradictory. I don't think
> > BUFFERS, SUMMARY OFF is. It's just that we show the buffer details for
> > the planner in the summary.  Since "summary" is not exactly a word
> > that describes what you're asking EXPLAIN to do, I wouldn't blame
> > users if they got confused as to why their BUFFERS on request was not
> > displayed.
>
> Displaying the planner's buffer usage under summary is the root cause of
> the confusion? If so, what about displaying that outside summary?
> Attached is the POC patch that I'm just thinking.

I had a look at this and I like it better than what I proposed earlier.

The change to show_buffer_usage() is a bit ugly, but I'm not really
seeing a better way to do it. Perhaps that can be improved later if we
ever find that there's some other special case to add.

David

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

On Thu, 20 Aug 2020 at 10:58, Alvaro Herrera  wrote:
> On the performance aspect, I wonder what the overhead is, particularly
> considering Tom's point of making these nodes more expensive for cases
> with no caching.

It's likely small. I've not written any code but only thought about it
and I think it would be something like if (node->tuplecache != NULL).
I imagine that in simple cases the branch predictor would likely
realise the likely prediction fairly quickly and predict with 100%
accuracy, once learned. But it's perhaps possible that some other
branch shares the same slot in the branch predictor and causes some
conflicting predictions. The size of the branch predictor cache is
limited, of course.  Certainly introducing new branches that
mispredict and cause a pipeline stall during execution would be a very
bad thing for performance.  I'm unsure what would happen if there's
say, 2 Nested loops, one with caching = on and one with caching = off
where the number of tuples between the two is highly variable.  I'm
not sure a branch predictor would handle that well given that the two
branches will be at the same address but have different predictions.
However, if the predictor was to hash in the stack pointer too, then
that might not be a problem. Perhaps someone with a better
understanding of modern branch predictors can share their insight
there.

> And also, as the JIT saga continues, aren't we going
> to get plan trees recompiled too, at which point it won't matter much?

I was thinking batch execution would be our solution to the node
overhead problem.  We'll get there one day... we just need to finish
with the infinite other optimisations there are to do first.

David

RE: Implement UNLOGGED clause for COPY FROM

2020-08-19 Thread osumi.takami...@fujitsu.com

Hello.

Apologies for the delay.
> > When the server crash occurs during data loading of COPY UNLOGGED,
> > it's a must to keep index consistent of course.
> > I'm thinking that to rebuild the indexes on the target table would work.
> >
> > In my opinion, UNLOGGED clause must be designed to guarantee that
> > where the data loaded by this clause is written starts from the end of all 
> > other
> data blocks.
> > Plus, those blocks needs to be protected by any write of other transactions
> during the copy.
> > Apart from that, the server must be aware of which block is the first
> > block, or the range about where it started or ended in preparation for the 
> > crash.
> >
> > During the crash recovery, those points are helpful to recognize and
> > detach such blocks in order to solve a situation that the loaded data is 
> > partially
> synced to the disk and the rest isn't.
> 
> How do online backup and archive recovery work ?
> 
> Suppose that the user executes pg_basebackup during COPY UNLOGGED running,
> the physical backup might have the portion of tuples loaded by COPY UNLOGGED
> but these data are not recovered. It might not be a problem because the 
> operation
> is performed without WAL records. But what if an insertion happens after COPY
> UNLOGGED but before pg_stop_backup()? I think that a new tuple could be
> inserted at the end of the table, following the data loaded by COPY UNLOGGED.
> With your approach described above, the newly inserted tuple will be recovered
> during archive recovery, but it either will be removed if we replay the 
> insertion
> WAL then truncate the table or won’t be inserted due to missing block if we
> truncate the table then replay the insertion WAL, resulting in losing the 
> tuple
> although the user got successful of insertion.
I consider that from the point in time when COPY UNLOGGED is executed,
any subsequent operations to the data which comes from UNLOGGED operation
also cannot be recovered even if those issued WAL.

This is basically inevitable because subsequent operations 
after COPY UNLOGGED depend on blocks of loaded data without WAL,
which means we cannot replay exact operations.

Therefore, all I can do is to guarantee that 
when one recovery process ends, the target table returns to the state
immediately before the COPY UNLOGGED is executed.
This could be achieved by issuing and notifying the server of an invalidation 
WAL,
an indicator to stop WAL application toward one specific table after this new 
type of WAL.
I think I need to implement this mechanism as well for this feature.
Thus, I'll take a measure against your concern of confusing data loss.

For recovery of the loaded data itself, the user of this clause,
like DBA or administrator of data warehouse for instance, 
would need to make a backup just after the data loading.
For some developers, this behavior would seem incomplete because of the heavy 
user's burden.

On the other hand, I'm aware of a fact that Oracle Database has a feature of 
UNRECOVERABLE clause,
which is equivalent to what I'm suggesting now in this thread.

This data loading without REDO log by the clause is more convenient than what I 
said above,
because it's supported by a tool named Recovery Manager which enables users to 
make an incremental backup.
This works to back up only the changed blocks since the previous backup and
remove the manual burden from the user like above.
Here, I have to admit that I cannot design and implement 
this kind of synergistic pair of all features at once for data warehousing.
So I'd like to make COPY UNLOGGED as the first step.

This is the URL of how Oracle database for data warehouse achieves the backup 
of no log operation while acquiring high speed of data loading.
https://docs.oracle.com/database/121/VLDBG/GUID-42825ED1-C4C5-449B-870F-D2C8627CBF86.htm#VLDBG1578

Best,
Takamichi Osumi

Re: Creating a function for exposing memory usage of backend process

2020-08-19 Thread Tom Lane

Andres Freund  writes:
> On 2020-08-19 11:01:37 -0400, Tom Lane wrote:
>> I agree with that, but I think this patch has a bigger problem:
>> why bother at all?  It seems like a waste of code space and future
>> maintenance effort, because there is no use-case.

> I don't agree with this at all. I think there's plenty use cases. It's
> e.g. very common to try to figure out why the memory usage of a process
> is high. Is it memory not returned to the OS? Is it caches that have
> grown too much etc.

Oh, I agree completely that there are lots of use-cases for finding
out what a process' memory map looks like.  But this patch fails to
address any of them in a usable way.

>> My own thoughts about improving the debugging situation would've been
>> to create a way to send a signal to a session to make it dump its
>> current memory map to the postmaster log (not the client, since the
>> client is unlikely to be prepared to receive anything extraneous).

> That doesn't really work in a large number of environments, I'm
> afraid. Many many users don't have access to the server log.

My rationale for that was (a) it can be implemented without a lot
of impact on the memory map, and (b) requiring access to the server
log eliminates questions about whether you have enough privilege to
examine the map.  I'm prepared to compromise about (b), but less so
about (a).  Having to run a SQL query to find this out is a mess.

regards, tom lane

Re: Creating a function for exposing memory usage of backend process

2020-08-19 Thread Andres Freund

Hi,

On 2020-08-19 21:29:06 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-08-19 11:01:37 -0400, Tom Lane wrote:
> >> I agree with that, but I think this patch has a bigger problem:
> >> why bother at all?  It seems like a waste of code space and future
> >> maintenance effort, because there is no use-case.
> 
> > I don't agree with this at all. I think there's plenty use cases. It's
> > e.g. very common to try to figure out why the memory usage of a process
> > is high. Is it memory not returned to the OS? Is it caches that have
> > grown too much etc.
> 
> Oh, I agree completely that there are lots of use-cases for finding
> out what a process' memory map looks like.  But this patch fails to
> address any of them in a usable way.

Even just being able to see the memory usage in a queryable way is a
huge benefit. The difference over having to parse the log, then parse
the memory usage dump, and then aggregate the data in there in a
meaningful way is *huge*.  We've been slacking around lowering our
memory usage, and I think the fact that it's annoying to analyze is a
partial reason for that.

I totally agree that it's not *enough*, but in contrast to you I think
it's a good step. Subsequently we should add a way to get any backends
memory usage.
It's not too hard to imagine how to serialize it in a way that can be
easily deserialized by another backend. I am imagining something like
sending a procsignal that triggers (probably at CFR() time) a backend to
write its own memory usage into pg_memusage/ or something roughly
like that.

Greetings,

Andres Freund

Re: Creating foreign key on partitioned table is too slow

2020-08-19 Thread Amit Langote

Hi,

On Thu, Aug 20, 2020 at 3:06 AM Alvaro Herrera  wrote:
> On 2020-Aug-19, Amit Langote wrote:
> > On Thu, Aug 6, 2020 at 4:25 PM kato-...@fujitsu.com
> >  wrote:
> > > On Wednesday, August 5, 2020 9:43 AM I wrote:
> > > > I'll report the result before the end of August .
> > >
> > > I test v2-0001-build-partdesc-memcxt.patch at 9a9db08ae4 and it is ok.
> >
> > Is this patch meant for HEAD or back-patching?  I ask because v13 got this:
> >
> > commit 5b9312378e2f8fb35ef4584aea351c3319a10422
> > Author: Tom Lane 
> > Date:   Wed Dec 25 14:43:13 2019 -0500
> >
> > Load relcache entries' partitioning data on-demand, not immediately.
> >
> > which prevents a partitioned table's PartitionDesc from being rebuilt
> > repeatedly as would happen before this commit in Kato-san's case,
> > because it moves RelationBuildPartitionDesc out of the relcache flush
> > code path.
>
> Hmm, so this is a problem only in v11 and v12?  It seems that putting
> the patch in master *only* is pointless.  OTOH v11 had other severe
> performance drawbacks with lots of partitions, so it might not be needed
> there.
>
> I admit I'm hesitant to carry code in only one or two stable branches
> that exists nowhere else.  But maybe the problem is serious enough in
> those branches (that will still live for quite a few years) that we
> should get it there.
>
> OTOH it could be argued that the coding in master is not all that great
> anyway (given the willingness for memory to be leaked) that it should
> apply to all three branches.

Fwiw, I am fine with applying the memory-leak fix in all branches down
to v12 if we are satisfied with the implementation.

That said, I don't offhand know any real world use case beside
Kato-san's that's affected by this leak.  Kato-san's case is creating
a foreign key referencing a partitioned table with many partitions,
something that's only supported from v12.  (You may have noticed that
the leak that occurs when rebuilding referencing table's PartitionDesc
accumulates while addFkRecurseReferenced is looping on referenced
table's partitions.)

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: Creating a function for exposing memory usage of backend process





On 2020/08/20 0:01, Tom Lane wrote:

Hadn't been paying attention to this thread up till now, but ...

Michael Paquier  writes:

By the way, I was looking at the code that has been committed, and I
think that it is awkward to have a SQL function in mcxt.c, which is a
rather low-level interface.  I think that this new code should be
moved to its own file, one suggestion for a location I have being
src/backend/utils/adt/mcxtfuncs.c.


I agree with that, but I think this patch has a bigger problem:
why bother at all?  It seems like a waste of code space and future
maintenance effort, because there is no use-case.  In the situations
where you need to know where the memory went, you are almost never
in a position to leisurely execute a query and send the results over
to your client.  This certainly would be useless to figure out why
an already-running query is eating space, for instance.

The only situation I could imagine where this would have any use is
where there is long-term (cross-query) bloat in, say, CacheMemoryContext


Yes, this feature is useful to check a cross-query memory bloat like
the bloats of relcache, prepared statements, PL/pgSQL cache,
SMgrRelationHash, etc. For example, several years before, my colleague
investigated the cause of the memory bloat by using the almost same
feature that pg_cheat_funcs extension provides, and then found that
the cause was that the application forgot to release lots of SAVEPONT.



--- but it's not even very helpful for that, since you can't examine
anything finer-grain than a memory context.


Yes, but even that information can be good hint when investigating
the memory bloat.



Plus you need to be
running an interactive session, or else be willing to hack up your
application to try to get it to inspect the view (and log the
results somewhere) at useful times.

On top of all that, the functionality has Heisenberg problems,
because simply using it changes what you are trying to observe,
in complex and undocumented ways (not that the documentation
would be of any use to non-experts anyway).

My own thoughts about improving the debugging situation would've been
to create a way to send a signal to a session to make it dump its
current memory map to the postmaster log (not the client, since the
client is unlikely to be prepared to receive anything extraneous).
But this is nothing like that.


I agree that we need something like this, i.e., the way to monitor the memory
usage of any process even during query running. OTOH, I think that the added
feature has a use case and is good as the first step.



Given the lack of clear use-case, and the possibility (admittedly
not strong) that this is still somehow a security hazard, I think
we should revert it.  If it stays, I'd like to see restrictions
on who can read the view.


For example, allowing only the role with pg_monitor to see this view?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Creating a function for exposing memory usage of backend process





On 2020/08/20 10:43, Andres Freund wrote:

Hi,

On 2020-08-19 21:29:06 -0400, Tom Lane wrote:

Andres Freund  writes:

On 2020-08-19 11:01:37 -0400, Tom Lane wrote:

I agree with that, but I think this patch has a bigger problem:
why bother at all?  It seems like a waste of code space and future
maintenance effort, because there is no use-case.



I don't agree with this at all. I think there's plenty use cases. It's
e.g. very common to try to figure out why the memory usage of a process
is high. Is it memory not returned to the OS? Is it caches that have
grown too much etc.


Oh, I agree completely that there are lots of use-cases for finding
out what a process' memory map looks like.  But this patch fails to
address any of them in a usable way.


Even just being able to see the memory usage in a queryable way is a
huge benefit.


+1


The difference over having to parse the log, then parse
the memory usage dump, and then aggregate the data in there in a
meaningful way is *huge*.  We've been slacking around lowering our
memory usage, and I think the fact that it's annoying to analyze is a
partial reason for that.


Agreed.

 

I totally agree that it's not *enough*, but in contrast to you I think
it's a good step. Subsequently we should add a way to get any backends
memory usage.
It's not too hard to imagine how to serialize it in a way that can be
easily deserialized by another backend. I am imagining something like
sending a procsignal that triggers (probably at CFR() time) a backend to
write its own memory usage into pg_memusage/ or something roughly
like that.


Sounds good. Maybe we can also provide the SQL-callable function
or view to read pg_memusage/, to make the analysis easier.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: SyncRepLock acquired exclusively in default configuration

2020-08-19 Thread Kyotaro Horiguchi

At Wed, 19 Aug 2020 13:20:46 +, Asim Praveen  wrote in 
> 
> 
> > On 12-Aug-2020, at 12:02 PM, Masahiko Sawada 
> >  wrote:
> > 
> > On Wed, 12 Aug 2020 at 14:06, Asim Praveen  wrote:
> >> 
> >> 
> >> 
> >>> On 11-Aug-2020, at 8:57 PM, Robert Haas  wrote:
> >>> 
> >>> I think this gets to the root of the issue. If we check the flag
> >>> without a lock, we might see a slightly stale value. But, considering
> >>> that there's no particular amount of time within which configuration
> >>> changes are guaranteed to take effect, maybe that's OK. However, there
> >>> is one potential gotcha here: if the walsender declares the standby to
> >>> be synchronous, a user can see that, right? So maybe there's this
> >>> problem: a user sees that the standby is synchronous and expects a
> >>> transaction committing afterward to provoke a wait, but really it
> >>> doesn't. Now the user is unhappy, feeling that the system didn't
> >>> perform according to expectations.
> >> 
> >> Yes, pg_stat_replication reports a standby in sync as soon as walsender 
> >> updates priority of the standby to something other than 0.
> >> 
> >> The potential gotcha referred above doesn’t seem too severe.  What is the 
> >> likelihood of someone setting synchronous_standby_names GUC with either 
> >> “*” or a standby name and then immediately promoting that standby?  If the 
> >> standby is promoted before the checkpointer on master gets a chance to 
> >> update sync_standbys_defined in shared memory, commits made during this 
> >> interval on master may not make it to standby.  Upon promotion, those 
> >> commits may be lost.
> > 
> > I think that if the standby is quite behind the primary and in case of
> > the primary crashes, the likelihood of losing commits might get
> > higher. The user can see the standby became synchronous standby via
> > pg_stat_replication but commit completes without a wait because the
> > checkpointer doesn't update sync_standbys_defined yet. If the primary
> > crashes before standby catching up and the user does failover, the
> > committed transaction will be lost, even though the user expects that
> > transaction commit has been replicated to the standby synchronously.
> > And this can happen even without the patch, right?
> > 
> 
> It is correct that the issue is orthogonal to the patch upthread and I’ve 
> marked
> the commitfest entry as ready-for-committer.

I find the name of SyncStandbysDefined macro is very confusing with
the struct member sync_standbys_defined, but that might be another
issue..

-* Fast exit if user has not requested sync replication.
+* Fast exit if user has not requested sync replication, or there are no
+* sync replication standby names defined.

This comment sounds like we just do that twice. The reason for the
check is to avoid wasteful exclusive locks on SyncRepLock, or to form
double-checked locking on the variable. I think we should explain that
here.

> Regarding the issue described above, the amount by which the standby is 
> lagging
> behind the primary does not affect the severity.  A standby’s state will be
> reported as “sync” to the user only after the standby has caught up (state ==
> WALSNDSTATE_STREAMING).  The time it takes for the checkpointer to update the
> sync_standbys_defined flag in shared memory is the important factor.  Once
> checkpointer sets this flag, commits start waiting for the standby (as long as
> it is in-sync).

CATCHUP state is only entered at replication startup. It stays at
STREAMING when sync_sby_names is switched from '' to a valid name,
thus sync_state shows 'sync' even if checkpointer hasn't changed
sync_standbys_defined. If the standby being switched had a big lag,
the chance of losing commits getting larger (up to certain extent) for
the same extent of checkpointer lag.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

The number of logical replication slot or wal sender recommend

2020-08-19 Thread 胡常齐

Hello

I have a publisher instance, cpu 32 core, ram 64GB, with SSD and  installed
PostgreSQL 10 on it.  I want to use logical replication  with it  , I
create publication on it, add about 20 tables in the pub, each tables will
have about 1 million line data,.

I wonder how many subscriber instance the publisher can affort, what if i
have 100 sub,
which means there will be 100 logical replication slot and 100 wal sender
process on the publisher instance, will so many sub slow down the publisher
performance?

I have test on my own machine cpu 8core, ram 16GB as a publisher with 10
sub (anthoer machine start 10 postgres instance, every one create a sub),
and everything works fine. I test pgbench on my  8core, 16GB machine,
without sub or with 10 sub, the tps is the same. Even if there is one
million line data on the pub, and on sub side the table is empty, and when
i create subscription on the 10 postgres sub instance, it copy data from
pub is quickly enough.

I wonder how many logical replication slot or wal sender is ok for the
32core, 64GB machine. 100 sub？ 500 sub?

Re: Creating a function for exposing memory usage of backend process

2020-08-19 Thread Kasahara Tatsuhito

On 2020/08/20 0:01, Tom Lane wrote:
> The only situation I could imagine where this would have any use is
> where there is long-term (cross-query) bloat in, say, CacheMemoryContext
Yeah, in cases where a very large number of sessions are connected to
the DB for very long periods of time, the memory consumption of the
back-end processes may increase slowly, and such a feature is useful
for analysis.

And ,as Fujii said, this feature very useful to see which contexts are
consuming a lot of memory and to narrow down the causes.

On Thu, Aug 20, 2020 at 11:18 AM Fujii Masao
 wrote:
> On 2020/08/20 10:43, Andres Freund wrote:
> > Hi,
> > Even just being able to see the memory usage in a queryable way is a
> > huge benefit.
>
> +1
+1

I think this feature is very useful in environments where gdb is not
available or access to server logs is limited.
And it is cumbersome to extract and analyze the memory information
from the very large server logs.


> > I totally agree that it's not *enough*, but in contrast to you I think
> > it's a good step. Subsequently we should add a way to get any backends
> > memory usage.
> > It's not too hard to imagine how to serialize it in a way that can be
> > easily deserialized by another backend. I am imagining something like
> > sending a procsignal that triggers (probably at CFR() time) a backend to
> > write its own memory usage into pg_memusage/ or something roughly
> > like that.
>
> Sounds good. Maybe we can also provide the SQL-callable function
> or view to read pg_memusage/, to make the analysis easier.
+1

Best regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Re: SyncRepLock acquired exclusively in default configuration

2020-08-19 Thread Kyotaro Horiguchi

At Wed, 19 Aug 2020 21:41:03 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/08/12 15:32, Masahiko Sawada wrote:
> > On Wed, 12 Aug 2020 at 14:06, Asim Praveen  wrote:
> >>
> >>
> >>
> >>> On 11-Aug-2020, at 8:57 PM, Robert Haas  wrote:
> >>>
> >>> I think this gets to the root of the issue. If we check the flag
> >>> without a lock, we might see a slightly stale value. But, considering
> >>> that there's no particular amount of time within which configuration
> >>> changes are guaranteed to take effect, maybe that's OK. However, there
> >>> is one potential gotcha here: if the walsender declares the standby to
> >>> be synchronous, a user can see that, right? So maybe there's this
> >>> problem: a user sees that the standby is synchronous and expects a
> >>> transaction committing afterward to provoke a wait, but really it
> >>> doesn't. Now the user is unhappy, feeling that the system didn't
> >>> perform according to expectations.
> >>
> >> Yes, pg_stat_replication reports a standby in sync as soon as
> >> walsender updates priority of the standby to something other than 0.
> >>
> >> The potential gotcha referred above doesn’t seem too severe.  What is
> >> the likelihood of someone setting synchronous_standby_names GUC with
> >> either “*” or a standby name and then immediately promoting that
> >> standby?  If the standby is promoted before the checkpointer on master
> >> gets a chance to update sync_standbys_defined in shared memory,
> >> commits made during this interval on master may not make it to
> >> standby.  Upon promotion, those commits may be lost.
> > I think that if the standby is quite behind the primary and in case of
> > the primary crashes, the likelihood of losing commits might get
> > higher. The user can see the standby became synchronous standby via
> > pg_stat_replication but commit completes without a wait because the
> > checkpointer doesn't update sync_standbys_defined yet. If the primary
> > crashes before standby catching up and the user does failover, the
> > committed transaction will be lost, even though the user expects that
> > transaction commit has been replicated to the standby synchronously.
> > And this can happen even without the patch, right?
> 
> I think you're right. This issue can happen even without the patch.
> 
> Maybe we should not mark the standby as "sync" whenever
> sync_standbys_defined
> is false even if synchronous_standby_names is actually set and
> walsenders have
> already detect that? Or we need more aggressive approach;
> make the checkpointer update sync_standby_priority values of
> all the walsenders? ISTM that the latter looks overkill...

It seems to me that the issue here is what
pg_stat_replication.sync_status doens't show "the working state of the
walsdner", but "the state the walsender is commanded". Non-zero
WalSnd.sync_standby_priority is immediately considered as "I am in
sync" but actually it is "I am going to sync from async or am already
in sync". And it is precisely "..., or am already in sync if
checkpointer already notices any sync walsender exists".

1. if a walsender changes its state from async to sync, it should once
  change its state back to "CATCHUP" or something like that.

2. pg_stat_replication.sync_status need to consider WalSnd.state
  and WalSndCtlData.sync_standbys_defined.

We might be able to let SyncRepUpdateSyncStandbysDefined postpone
changing sync_standbys_defined until any sync standby actually comes,
but it would be complex.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-19 Thread Pavel Stehule

čt 20. 8. 2020 v 0:04 odesílatel David Rowley  napsal:

> On Wed, 19 Aug 2020 at 16:18, Pavel Stehule 
> wrote:
> >
> >
> >
> > st 19. 8. 2020 v 5:48 odesílatel David Rowley 
> napsal:
> >> Current method:
> >>
> >> regression=# explain (analyze, costs off, timing off, summary off)
> >> select twenty, (select count(*) from tenk1 t2 where t1.twenty =
> >> t2.twenty) from tenk1 t1;
> >>  QUERY PLAN
> >> -
> >>  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
> >>SubPlan 1
> >>  ->  Result Cache (actual rows=1 loops=1)
> >>Cache Key: t1.twenty
> >>Hits: 9980  Misses: 20  Evictions: 0  Overflows: 0
> >>->  Aggregate (actual rows=1 loops=20)
> >>  ->  Seq Scan on tenk1 t2 (actual rows=500 loops=20)
> >>Filter: (t1.twenty = twenty)
> >>Rows Removed by Filter: 9500
> >> (9 rows)
> >>
> >> Andres' suggestion:
> >>
> >> regression=# explain (analyze, costs off, timing off, summary off)
> >> select twenty, (select count(*) from tenk1 t2 where t1.twenty =
> >> t2.twenty) from tenk1 t1;
> >>  QUERY PLAN
> >> -
> >>  Seq Scan on tenk1 t1 (actual rows=1 loops=1)
> >>SubPlan 1
> >> Cache Key: t1.twenty  Hits: 9980  Misses: 20  Evictions: 0
> Overflows: 0
> >> ->  Aggregate (actual rows=1 loops=20)
> >>   ->  Seq Scan on tenk1 t2 (actual rows=500 loops=20)
> >> Filter: (t1.twenty = twenty)
> >> Rows Removed by Filter: 9500
> >> (7 rows)
>
> > I didn't do performance tests, that should be necessary, but I think
> Andres' variant is a little bit more readable.
>
> Thanks for chiming in on this.  I was just wondering about the
> readability part and what makes the one with the Result Cache node
> less readable?  I can think of a couple of reasons you might have this
> view and just wanted to double-check what it is.
>

It is more compact - less rows, less nesting levels


> David
>

Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

On Thu, 20 Aug 2020 at 03:31, Fujii Masao  wrote:
> With the patch, for example, whatever "summary" settng is, "buffers on"
> displays the planner's buffer usage if it happens.

I forgot to mention earlier, you'll also need to remove the part in
the docs that mentions BUFFERS requires ANALYZE.

David

Re: Asynchronous Append on postgres_fdw nodes.

2020-08-19 Thread Justin Pryzby

On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
> As the result of a discussion with Fujita-san off-list, I'm going to
> hold off development until he decides whether mine or Thomas' is
> better.

The latest patch doesn't apply so I set as WoA.
https://commitfest.postgresql.org/29/2491/

-- 
Justin

Re: Include access method in listTables output

2020-08-19 Thread Justin Pryzby

On Mon, Aug 17, 2020 at 11:10:05PM +0530, vignesh C wrote:
> On Sat, Aug 1, 2020 at 8:12 AM vignesh C  wrote:
> > > >
> > > > +-- access method column should not be displayed for sequences
> > > > +\ds+
> > > >
> > > > -  List of relations
> > > >
> > > >
> > > > -   Schema | Name | Type | Owner | Persistence | Size | Description
> > > > ++--+--+---+-+--+-
> > > > +(0 rows)
> > > >
> > > > We can include one test for view.
> 
> I felt adding one test for view is good and added it.
> Attached a new patch for the same.
> 
> I felt patch is in shape for committer to have a look at this.

The patch tester shows it's failing xmllint ; could you send a fixed patch ?

/usr/bin/xmllint --path . --noout --valid postgres.sgml
ref/psql-ref.sgml:1189: parser error : Opening and ending tag mismatch: link 
line 1187 and para

http://cfbot.cputube.org/georgios-kokolatos.html

-- 
Justin

Re: display offset along with block number in vacuum errors

2020-08-19 Thread Amit Kapila

On Wed, Aug 19, 2020 at 12:54 PM Masahiko Sawada
 wrote:
>
> On Tue, 18 Aug 2020 at 13:06, Amit Kapila  wrote:
> >
> >
> > I don't think we need to expose LVRelStats. We can just pass the
> > address of vacrelstats->offset_num to achieve what we want. I have
> > tried that and it works, see the
> > v6-0002-additinal-error-context-information-in-heap_page_ patch
> > attached. Do you see any problem with it?
>
> Yes, you're right. I'm concerned a bit the number of arguments passing
> to heap_page_prune() might get higher when we need other values to
> update for errcontext, but I'm okay with the current patch.
>

Yeah, we might need to think if we want to increase the number of
parameters but not sure we need to worry at this stage. If required, I
think we can either expose LVRelStats or extract a few parameters from
it and form a separate structure that could be passed to
heap_page_prune.

> Currently, we're in SCAN_HEAP phase in heap_page_prune() but should it
> be VACUUM_HEAP instead?
>

I think it is currently similar to what we do in progress reporting.
We set to VACUUM_HEAP phase where the progress reporting is also set
to *HEAP_BLKS_VACUUMED. Currently, heap_page_prune() is covered under
*HEAP_BLKS_SCANNED, so I don't see a pressing need to change the error
context phase for heap_page_prune(). And also, we need to add some
more smarts in heap_page_prune() for this which I want to avoid.

> Also, I've tested the patch with log_min_messages = 'info' and get the
> following sever logs:
>
..
>
> This is not directly related to the patch but it looks like we can
> improve the current errcontext settings. For instance, the message
> from lazy_vacuum_index(): there are two messages reporting the phases.
> I've attached the patch that improves the current errcontext setting,
> which can be applied before the patch adding offset number.
>

After your patch, I see output like below with log_min_messages=info,

2020-08-20 10:11:46.769 IST [2640] INFO:  scanned index "idx_test_c1"
to remove 1 row versions
2020-08-20 10:11:46.769 IST [2640] DETAIL:  CPU: user: 0.06 s, system:
0.01 s, elapsed: 0.06 s
2020-08-20 10:11:46.769 IST [2640] CONTEXT:  while vacuuming index
"idx_test_c1" of relation "public.test_vac"

2020-08-20 10:11:46.901 IST [2640] INFO:  scanned index "idx_test_c2"
to remove 1 row versions
2020-08-20 10:11:46.901 IST [2640] DETAIL:  CPU: user: 0.10 s, system:
0.01 s, elapsed: 0.13 s
2020-08-20 10:11:46.901 IST [2640] CONTEXT:  while vacuuming index
"idx_test_c2" of relation "public.test_vac"

2020-08-20 10:11:46.917 IST [2640] INFO:  "test_vac": removed 1
row versions in 667 pages
2020-08-20 10:11:46.917 IST [2640] DETAIL:  CPU: user: 0.01 s, system:
0.00 s, elapsed: 0.01 s

2020-08-20 10:11:46.928 IST [2640] INFO:  index "idx_test_c1" now
contains 5 row versions in 276 pages
2020-08-20 10:11:46.928 IST [2640] DETAIL:  1 index row versions
were removed.
136 index pages have been deleted, 109 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
2020-08-20 10:11:46.928 IST [2640] CONTEXT:  while cleaning up index
"idx_test_c1" of relation "public.test_vac"

Here, we can notice that for the index, we are getting context
information but not for the heap. The reason is that in
vacuum_error_callback, we are not printing additional information for
phases VACUUM_ERRCB_PHASE_SCAN_HEAP and VACUUM_ERRCB_PHASE_VACUUM_HEAP
when block number is invalid. If we want to cover the 'info' messages
then won't it be better if we print a message in those phases even
block number is invalid (something like 'while scanning relation
\"%s.%s\"")

-- 
With Regards,
Amit Kapila.

Re: "ccold" left by reindex concurrently are droppable?

On Wed, Aug 19, 2020 at 05:13:12PM -0400, Alvaro Herrera wrote:
> In other words I propose to reword this paragraph as follows:
> 
>If the transient index created during the concurrent operation is
>suffixed ccnew, the recommended recovery method
>is to drop the invalid index using DROP INDEX,
>and try to perform REINDEX CONCURRENTLY again. 
>If the transient index is instead suffixed ccold,
>it corresponds to the original index which we failed to drop;
>the recommended recovery method is to just drop said index, since the
>rebuild proper has been successful.

Yes, that's an improvement.  It would be better to backpatch that.  So
+1 from me.

> (The original talks about "the concurrent index", which seems somewhat
> sloppy thinking.  I used the term "transient index" instead.)

Using transient to refer to an index aimed at being ephemeral sounds
fine to me in this context.
--
Michael


signature.asc
Description: PGP signature

Re: recovering from "found xmin ... from before relfrozenxid ..."

On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma  wrote:
>
> On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
>  wrote:
> >
> > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  wrote:
> > >
> > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  
> > > > wrote:
> > > > >
> > > > > > pg_force_freeze() can revival a tuple that is already deleted but 
> > > > > > not
> > > > > > vacuumed yet. Therefore, the user might need to reindex indexes 
> > > > > > after
> > > > > > using that function. For instance, with the following script, the 
> > > > > > last
> > > > > > two queries: index scan and seq scan, will return different results.
> > > > > >
> > > > > > set enable_seqscan to off;
> > > > > > set enable_bitmapscan to off;
> > > > > > set enable_indexonlyscan to off;
> > > > > > create table tbl (a int primary key);
> > > > > > insert into tbl values (1);
> > > > > >
> > > > > > update tbl set a = a + 100 where a = 1;
> > > > > >
> > > > > > explain analyze select * from tbl where a < 200;
> > > > > >
> > > > > > -- revive deleted tuple on heap
> > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > >
> > > > > > -- index scan returns 2 tuples
> > > > > > explain analyze select * from tbl where a < 200;
> > > > > >
> > > > > > -- seq scan returns 1 tuple
> > > > > > set enable_seqscan to on;
> > > > > > explain analyze select * from tbl;
> > > > > >
> > > > >
> > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > function. I think we should only be running pg_force_freeze function
> > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > > instead of making it better.
> > > >
> > > > Should this also be documented? I think that it's hard to force the
> > > > user to always use this module in the right situation but we need to
> > > > show at least when to use.
> > > >
> > >
> > > I've already added some examples in the documentation explaining the
> > > use-case of force_freeze function. If required, I will also add a note
> > > about it.
> > >
> > > > > > Also, if a tuple updated and moved to another partition is revived 
> > > > > > by
> > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > > value, but it might be worth considering either to reset ctid value 
> > > > > > as
> > > > > > well or to not freezing already-deleted tuple.
> > > > > >
> > > > >
> > > > > For this as well, the answer remains the same as above.
> > > >
> > > > Perhaps the same is true when a tuple header is corrupted including
> > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > PANIC error in the future.
> > > >
> > >
> > > If a tuple header itself is corrupted, then I think we must kill that
> > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > can think of resetting the ctid value of that tuple. However, it won't
> > > be always possible to detect the corrupted ctid value. It's quite
> > > possible that the corrupted ctid field has valid values for block
> > > number and offset number in it, but it's actually corrupted and it
> > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > do anything about such types of corruption. Probably in such cases we
> > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > for each tuple in the table.
> >
> > Understood.
> >
> > Perhaps such corruption will be able to be detected by new heapam
> > check functions discussed on another thread. My point was that it
> > might be better to attempt making the tuple header sane state as much
> > as possible when fixing a live tuple in order to prevent further
> > problems such as databases crash by malicious attackers.
> >
>
> Agreed. So, to handle the ctid related concern that you raised, I'm
> planning to do the following changes to ensure that the tuple being
> marked as frozen contains the correct item pointer value. Please let
> me know if you are okay with these changes.

Given that a live tuple never indicates to ve moved partitions, I
guess the first condition in the if statement is not necessary. The
rest looks good to me, although other hackers might think differently.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: remove spurious CREATE INDEX CONCURRENTLY wait

On Wed, Aug 19, 2020 at 02:16:46PM -0400, Alvaro Herrera wrote:
> I did not set the flag in REINDEX CONCURRENTLY, but as I understand it
> can be done too, since in essence it's the same thing as a CIC from a
> snapshot management point of view.

Yes, I see no problems for REINDEX CONCURRENTLY as well as long as
there are no predicates and expressions involved.  The transactions
that should be patched are all started in ReindexRelationConcurrently.
The transaction of index_concurrently_swap() cannot set up that
though.  Only thing to be careful is to make sure that safe_flag is
correct depending on the list of indexes worked on.

> Also, per [1], ISTM this flag could be used to tell lazy VACUUM to
> ignore the Xmin of this process too, which the previous formulation
> (where all CICs were so marked) could not.  This patch doesn't do that
> yet, but it seems the natural next step to take.
> 
> [1] https://postgr.es/m/20191101203310.GA12239@alvherre.pgsql

Could we consider renaming vacuumFlags?  With more flags associated to
a PGPROC entry that are not related to vacuum, the current naming
makes things confusing.  Something like statusFlags could fit better
in the picture?
--
Michael

signature.asc
Description: PGP signature

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-19 Thread Ashutosh Sharma

On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada
 wrote:
>
> On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma  wrote:
> >
> > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma  
> > > wrote:
> > > >
> > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > > >  wrote:
> > > > >
> > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma  
> > > > > wrote:
> > > > > >
> > > > > > > pg_force_freeze() can revival a tuple that is already deleted but 
> > > > > > > not
> > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes 
> > > > > > > after
> > > > > > > using that function. For instance, with the following script, the 
> > > > > > > last
> > > > > > > two queries: index scan and seq scan, will return different 
> > > > > > > results.
> > > > > > >
> > > > > > > set enable_seqscan to off;
> > > > > > > set enable_bitmapscan to off;
> > > > > > > set enable_indexonlyscan to off;
> > > > > > > create table tbl (a int primary key);
> > > > > > > insert into tbl values (1);
> > > > > > >
> > > > > > > update tbl set a = a + 100 where a = 1;
> > > > > > >
> > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > >
> > > > > > > -- revive deleted tuple on heap
> > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > > >
> > > > > > > -- index scan returns 2 tuples
> > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > >
> > > > > > > -- seq scan returns 1 tuple
> > > > > > > set enable_seqscan to on;
> > > > > > > explain analyze select * from tbl;
> > > > > > >
> > > > > >
> > > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > > function. I think we should only be running pg_force_freeze function
> > > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > > > instead of making it better.
> > > > >
> > > > > Should this also be documented? I think that it's hard to force the
> > > > > user to always use this module in the right situation but we need to
> > > > > show at least when to use.
> > > > >
> > > >
> > > > I've already added some examples in the documentation explaining the
> > > > use-case of force_freeze function. If required, I will also add a note
> > > > about it.
> > > >
> > > > > > > Also, if a tuple updated and moved to another partition is 
> > > > > > > revived by
> > > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I 
> > > > > > > don't
> > > > > > > see a problem yet caused by a visible tuple having the special 
> > > > > > > ctid
> > > > > > > value, but it might be worth considering either to reset ctid 
> > > > > > > value as
> > > > > > > well or to not freezing already-deleted tuple.
> > > > > > >
> > > > > >
> > > > > > For this as well, the answer remains the same as above.
> > > > >
> > > > > Perhaps the same is true when a tuple header is corrupted including
> > > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > > PANIC error in the future.
> > > > >
> > > >
> > > > If a tuple header itself is corrupted, then I think we must kill that
> > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > > can think of resetting the ctid value of that tuple. However, it won't
> > > > be always possible to detect the corrupted ctid value. It's quite
> > > > possible that the corrupted ctid field has valid values for block
> > > > number and offset number in it, but it's actually corrupted and it
> > > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > > do anything about such types of corruption. Probably in such cases we
> > > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > > for each tuple in the table.
> > >
> > > Understood.
> > >
> > > Perhaps such corruption will be able to be detected by new heapam
> > > check functions discussed on another thread. My point was that it
> > > might be better to attempt making the tuple header sane state as much
> > > as possible when fixing a live tuple in order to prevent further
> > > problems such as databases crash by malicious attackers.
> > >
> >
> > Agreed. So, to handle the ctid related concern that you raised, I'm
> > planning to do the following changes to ensure that the tuple being
> > marked as frozen contains the correct item pointer value. Please let
> > me know if you are okay with these changes.
>
> Given that a live tuple never indicates to ve moved partitions, I
> guess the first condition in the if statement is not necessary. The
> rest looks good to me, although other hackers might think differently.
>

Okay, thanks for confirming. I am planning to go ahead with this
approach. Will later

Re: Implement UNLOGGED clause for COPY FROM

2020-08-19 Thread Kyotaro Horiguchi

At Thu, 20 Aug 2020 00:18:52 +, "osumi.takami...@fujitsu.com" 
 wrote in 
> Hello.
> 
> Apologies for the delay.
> > > When the server crash occurs during data loading of COPY UNLOGGED,
> > > it's a must to keep index consistent of course.
> > > I'm thinking that to rebuild the indexes on the target table would work.
> > >
> > > In my opinion, UNLOGGED clause must be designed to guarantee that
> > > where the data loaded by this clause is written starts from the end of 
> > > all other
> > data blocks.
> > > Plus, those blocks needs to be protected by any write of other 
> > > transactions
> > during the copy.
> > > Apart from that, the server must be aware of which block is the first
> > > block, or the range about where it started or ended in preparation for 
> > > the crash.
> > >
> > > During the crash recovery, those points are helpful to recognize and
> > > detach such blocks in order to solve a situation that the loaded data is 
> > > partially
> > synced to the disk and the rest isn't.
> > 
> > How do online backup and archive recovery work ?
> > 
> > Suppose that the user executes pg_basebackup during COPY UNLOGGED running,
> > the physical backup might have the portion of tuples loaded by COPY UNLOGGED
> > but these data are not recovered. It might not be a problem because the 
> > operation
> > is performed without WAL records. But what if an insertion happens after 
> > COPY
> > UNLOGGED but before pg_stop_backup()? I think that a new tuple could be
> > inserted at the end of the table, following the data loaded by COPY 
> > UNLOGGED.
> > With your approach described above, the newly inserted tuple will be 
> > recovered
> > during archive recovery, but it either will be removed if we replay the 
> > insertion
> > WAL then truncate the table or won’t be inserted due to missing block if we
> > truncate the table then replay the insertion WAL, resulting in losing the 
> > tuple
> > although the user got successful of insertion.
> I consider that from the point in time when COPY UNLOGGED is executed,
> any subsequent operations to the data which comes from UNLOGGED operation
> also cannot be recovered even if those issued WAL.
> 
> This is basically inevitable because subsequent operations 
> after COPY UNLOGGED depend on blocks of loaded data without WAL,
> which means we cannot replay exact operations.
> 
> Therefore, all I can do is to guarantee that 
> when one recovery process ends, the target table returns to the state
> immediately before the COPY UNLOGGED is executed.
> This could be achieved by issuing and notifying the server of an invalidation 
> WAL,
> an indicator to stop WAL application toward one specific table after this new 
> type of WAL.
> I think I need to implement this mechanism as well for this feature.
> Thus, I'll take a measure against your concern of confusing data loss.
> 
> For recovery of the loaded data itself, the user of this clause,
> like DBA or administrator of data warehouse for instance, 
> would need to make a backup just after the data loading.
> For some developers, this behavior would seem incomplete because of the heavy 
> user's burden.
> 
> On the other hand, I'm aware of a fact that Oracle Database has a feature of 
> UNRECOVERABLE clause,
> which is equivalent to what I'm suggesting now in this thread.
> 
> This data loading without REDO log by the clause is more convenient than what 
> I said above,
> because it's supported by a tool named Recovery Manager which enables users 
> to make an incremental backup.
> This works to back up only the changed blocks since the previous backup and
> remove the manual burden from the user like above.
> Here, I have to admit that I cannot design and implement 
> this kind of synergistic pair of all features at once for data warehousing.
> So I'd like to make COPY UNLOGGED as the first step.
> 
> This is the URL of how Oracle database for data warehouse achieves the backup 
> of no log operation while acquiring high speed of data loading.
> https://docs.oracle.com/database/121/VLDBG/GUID-42825ED1-C4C5-449B-870F-D2C8627CBF86.htm#VLDBG1578

Anyway, if the target table is turned back to LOGGED, the succeedeing
WAL stream can be polluted by the logs on the table. So any operations
relying on WAL records is assumed not to be continuable. Not only
useless, it is harmful. I think we don't accept emitting an
insconsistent WAL stream intentionally while wal_level > minimal.

You assert that we could prevent WAL from redoed by the "invalidation"
but it is equivalent to turning the table into UNLOGGED. Why do you
insist that a table should be labeled as "LOGGED" whereas it is
virtually UNLOGGED? That costs nothing (if the table is almost empty).

If you want to get the table back to LOGGED without emitting WAL,
wal_level=minimal works. That requires restart twice, and table
copying, though. It seems like that we can skip copying in this case
but I'm not sure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Cente

Re: display offset along with block number in vacuum errors