Re: A wrong index choose issue because of inaccurate statistics

2020-06-08 Thread Andy Fan
On Mon, Jun 8, 2020 at 10:16 PM Ashutosh Bapat 
wrote:

> I know one project where they used PostgreSQL code base to detect
> "robust plans". https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of
> the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf
> describe the idea.

> In short, the idea is to annotate a plan with a "bandwidth", i.e. how
> the plan fares as statistics degrade. A plan which has a slightly
> higher cost but doesn't degrade much with degrading statistics is
> preferred over a low-cost plan whose cost rises sharply as statistics
> degrade. This is similar to what David is suggesting.
>
>
Great to know about these; thank you for sharing. The links have been bookmarked.


Re: Add -Wold-style-definition to CFLAGS?

2020-06-08 Thread Andres Freund
Hi,

On 2020-05-09 19:11:56 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-05-09 14:15:01 -0400, Tom Lane wrote:
> >> Andres Freund  writes:
> >>> Since gcc has a warning detecting such definition, I think we ought to
> >>> automatically add it when available?
> 
> >> +1
> 
> > Any opinion about waiting for branching or not?
> 
> I'd be OK with pushing it now, but I dunno about other people.

I did run into a small bit of trouble doing so, which seems to make it
a mistake to target 13.

Unfortunately it turns out that our CFLAG configure tests don't reliably
work with -Wold-style-definition. The problem is that the generated
program contains 'int main() {...}', which obviously is an old-style
definition. Which then causes a warning, which in turn causes the cflag
tests to fail because we run them with ac_c_werror_flag=yes.
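
For reference, a minimal example of the kind of definition the warning
flags (illustrative only, not from the tree):

/* -Wold-style-definition warns about any function definition whose
 * parameter list is not a prototype. */
int
add(a, b)			/* K&R-style definition: warns */
	int			a;
	int			b;
{
	return a + b;
}

/* Empty parentheses are not a prototype either, which is why the
 * generated 'int main() {...}' test program trips the warning too. */
int
main()
{
	return add(1, 2);
}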

There's a pretty easy local fix: we can just use AC_LANG_SOURCE()
instead of AC_LANG_PROGRAM() in PGAC_PROG_VARCC_VARFLAGS_OPT(). There
seems to be little reason not to do so.

But that still leaves us with a lot of unnecessary subsequent warnings
for other tests in config.log. They don't cause problems afaics, as
ac_c_werror_flag=yes isn't widely used, but it's still more noisy than
I'd like. And the likelihood of silent failures down the line seems
higher than I'd like.

Upstream autoconf has fixed this in 2014 (1717921a), but since they've
not bothered to release since then...

The easiest way that I can see to deal with that is to simply redefine
the relevant autoconf macro. For me that solves the vast majority of
these bleats in config.log.  That's not particularly pretty, but we have
precedent for it... Since it's just 16 lines, I think we can live with
that?

Comments?

Greetings,

Andres Freund




Re: Intermittent test plan change in "privileges" test on BF animal prion

2020-06-08 Thread Tom Lane
I wrote:
> I'm trying to reproduce this now, but it's sounding pretty plausible.

Yeah, that's definitely it.  I was able to reproduce the failure
semi-reliably (every two or three tries) after adding -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE and inserting a "pg_sleep(1)" just after the
manual vacuum in privileges.sql; and as you'd guessed, the stats arrays
were just normal size in the failing runs.  After disabling autovac on
the table, the failure went away.

Thanks for the insight!

regards, tom lane




Re: A wrong index choose issue because of inaccurate statistics

2020-06-08 Thread Andy Fan
On Fri, Jun 5, 2020 at 2:30 PM Pavel Stehule 
wrote:

>
>
> On Fri 5. 6. 2020 at 8:19, David Rowley wrote:
>
>> On Mon, 1 Jun 2020 at 01:24, Andy Fan wrote:
>> > The one-line fix describes the exact idea in my mind:
>> >
>> > +++ b/src/backend/optimizer/path/costsize.c
>> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>> >
>> > cpu_run_cost += cpu_per_tuple * tuples_fetched;
>> >
>> > +   /*
>> > +    * To make the planner more robust against inaccurate statistics,
>> > +    * we add an extra cost per qpqual, so that the fewer qpquals a
>> > +    * path has, the lower its cost.
>> > +    */
>> > +   cpu_run_cost += 0.01 * list_length(qpquals);
>> > +
>> >
>> > This change does fix the issue above, but will it make some other
>> > cases worse? My answer is no, based on my current knowledge, and this
>> > is the most important point on which I want advice. The main impact I
>> > can identify is: it not only changes the cost difference between
>> > (a, b) and (a, c), it also changes the cost difference between an
>> > index scan on (a, c) and a seq scan. However, the cost difference
>> > between an index scan and a seq scan is huge in practice, so I
>> > don't think this impact is harmful.
>>
>> Didn't that idea already get shot down in the final paragraph on [1]?
>>
>> I understand that you wish to increase the cost by some seemingly
>> innocent constant to fix your problem case.  Here are my thoughts
>> about that: Telling lies is not really that easy to pull off. Bad
>> liars think it's easy and good ones know it's hard. The problem is
>> that the lies can start small, but then at some point the future you
>> must fashion some more lies to account for your initial lie. Rinse and
>> repeat that a few times and before you know it, your course is set
>> well away from the point of truth.  I feel the part about "rinse and
>> repeat" applies reasonably well to how join costing works.  The lie is
>> likely to be amplified as the join level gets deeper.
>>
>> I think you need to think of a more generic solution and propose that
>> instead.  There are plenty of other quirks in the planner that can
>> cause suffering due to inaccurate or non-existing statistics. For
>> example, due to how we multiply individual selectivity estimates,
>> having a few correlated columns in a join condition can cause the
>> number of join rows to be underestimated. Subsequent joins can then
>> end up making bad choices on which join operator to use based on those
>> inaccurate row estimates.  There's also a problem with a WHERE clause
>> plus ORDER BY col LIMIT n; we sometimes choose an index that provides
>> pre-sorted input to the ORDER BY but cannot use the WHERE clause as an
>> indexable condition.  We don't record any stats to make better choices
>> there, maybe we should, but either way, we're taking a big risk there
>> as all the rows matching the WHERE clause might be right at the end of
>> the index and we might need to scan the entire thing before hitting
>> the LIMIT. For now, we just assume a completely even distribution of
>> rows, i.e. if there are 50 rows estimated in the path and the limit is
>> for 5 rows, then we'll assume we need to read 10% of those before
>> finding all the ones we need. In reality, everything matching the
>> WHERE clause might be 95% through the index and we could end up
>> reading 100% of rows. That particular problem is not just caused by
>> the uneven distribution of rows in the index, but also by selectivity
>> underestimation.
>>
>> I'd more recently wondered if we shouldn't have some sort of "risk"
>> factor to the cost model. I really don't have ideas on how exactly we
>> would calculate the risk factor in all cases, but in this case,  say
>> the risk factor always starts out as 1. We could multiply that risk
>> factor by some >1 const each time we find another index filter qual.
>> add_path() can prefer lower risk paths when the costs are similar.
>> Unsure what the exact add_path logic would be. Perhaps a GUC would
>> need to assist with the decision there.   Likewise, with
>> NestedLoopPaths which have a large number of join quals, the risk
>> factor could go up a bit with those so that we take a stronger
>> consideration for hash or merge joins instead.
>>
>>
> I thought about these ideas too. And I am not alone.
>
> https://hal.archives-ouvertes.fr/hal-01316823/document
>
>
Thanks for the document. After checking it, that looks more like
Oracle's statistics feedback [1]. Hope we can have it someday :)

[1] https://blogs.oracle.com/optimizer/cardinality-feedback
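
As a purely illustrative aside, the risk-factor idea quoted above might
look something like this (the constant, the function, and the add_path
behavior are all hypothetical, not an actual proposal):

/* Hypothetical sketch: compound a risk multiplier per index filter
 * qual while costing an index path.  Each qpqual depends on
 * selectivity estimates that may be off, so the more filter quals,
 * the riskier the path. */
#define RISK_PER_FILTER_QUAL 1.05	/* some constant > 1 */

static double
index_path_risk(int num_filter_quals)
{
	double		risk = 1.0;

	while (num_filter_quals-- > 0)
		risk *= RISK_PER_FILTER_QUAL;

	return risk;
}

/* add_path() could then keep the lower-risk path when two paths have
 * total costs within some fuzz factor of each other, perhaps under
 * the control of a GUC. */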

-- 
Best Regards
Andy Fan


Re: Intermittent test plan change in "privileges" test on BF animal prion

2020-06-08 Thread Tom Lane
David Rowley  writes:
> On Tue, 9 Jun 2020 at 15:41, Tom Lane  wrote:
>> Hmm ... that's a plausible theory, perhaps.  I forget: does autovac
>> recheck, after acquiring the requisite table lock, whether the table
>> still needs to be processed?

> It does, but I wondered if there was a window between when the manual
> vacuum resets n_ins_since_vacuum and when autovacuum looks at it.

Oh, there surely is, because of the lag in the stats collection mechanism.
I'm trying to reproduce this now, but it's sounding pretty plausible.

BTW, it looks like I managed to trim the reference off my prior message,
but I meant [1] to refer to
https://www.postgresql.org/message-id/79.1591138428%40sss.pgh.pa.us

regards, tom lane




Re: Intermittent test plan change in "privileges" test on BF animal prion

2020-06-08 Thread David Rowley
On Tue, 9 Jun 2020 at 15:41, Tom Lane  wrote:
>
> David Rowley  writes:
> > It does seem plausible, given how slow prion is, that autovacuum might
> > be triggered after the manual vacuum somehow, building stats with
> > just 1k buckets instead of 10k.
>
> Hmm ... that's a plausible theory, perhaps.  I forget: does autovac
> recheck, after acquiring the requisite table lock, whether the table
> still needs to be processed?

It does, but I wondered if there was a window between when the manual
vacuum resets n_ins_since_vacuum and when autovacuum looks at it.

David




Re: Intermittent test plan change in "privileges" test on BF animal prion

2020-06-08 Thread Tom Lane
David Rowley  writes:
> I see 0c882e52a did change the number of statistics targets on that
> table. The first failure was on the commit directly after that one.
> I'm not sure what instability Tom meant when he wrote "-- results
> below depend on having quite accurate stats for atest12".

See [1], particularly the para about "When I went to test 0002".
At least one of those test cases fails if the planner estimates more
than one row being selected by the user-defined operator, and since the
table has 10K rows, that means we need 1/10000 selectivity precision.

> It does seem plausible, given how slow prion is, that autovacuum might
> be triggered after the manual vacuum somehow, building stats with
> just 1k buckets instead of 10k.

Hmm ... that's a plausible theory, perhaps.  I forget: does autovac
recheck, after acquiring the requisite table lock, whether the table
still needs to be processed?

> 0936d1b6 made some changes to disable
> autovacuum because it was sometimes coming in and messing with the
> statistics, maybe we need to do the same here, or at least do
> something less temporary than changing default_statistics_target.

Yeah, setting that as a table parameter seems like a better idea than
setting default_statistics_target.

regards, tom lane




Re: valgrind versus pg_atomic_init()

2020-06-08 Thread Tom Lane
Andres Freund  writes:
> And done.

LGTM, thanks!

regards, tom lane




Re: Intermittent test plan change in "privileges" test on BF animal prion

2020-06-08 Thread David Rowley
On Tue, 9 Jun 2020 at 14:27, Thomas Munro  wrote:
> Two recent failures show plan changes in RLS queries on master.  Based
> on nearby comments, the choice of plan is being used to verify access (or
> lack of access) to row estimates, so I guess that means something
> could be amiss here.  (Or it could be due to the dropped UDP flaky
> stats problem, but then why in the same place twice, and why twice in
> a week, only on master, and not for months before that?)

I see 0c882e52a did change the number of statistics targets on that
table. The first failure was on the commit directly after that one.
I'm not sure what instability Tom meant when he wrote "-- results
below depend on having quite accurate stats for atest12".

It does seem plausible, given how slow prion is, that autovacuum might
be triggered after the manual vacuum somehow, building stats with
just 1k buckets instead of 10k. 0936d1b6 made some changes to disable
autovacuum because it was sometimes coming in and messing with the
statistics, maybe we need to do the same here, or at least do
something less temporary than changing default_statistics_target.

select attname,array_length(histogram_bounds,1) from pg_stats where
tablename = 'atest12' order by attname;

should mention the array length is 10001 if it's working as intended.
Is it worth sticking that query in there before and after the failures
to ensure we're working with the stats we think we are?

David




Re: valgrind versus pg_atomic_init()

2020-06-08 Thread Andres Freund
Hi,

On 2020-06-08 18:21:06 -0400, Tom Lane wrote:
> > Yea, it could just do that. It seemed slightly easier/clearer, back when
> > I wrote it, to just use pg_atomic_write* for the initialization, but
> > this seems enough of a reason to stop doing so. Will change it in all
> > branches, unless somebody sees a reason to not do so?
>
> Works for me.

And done.

- Andres




Re: Intermittent test plan change in "privileges" test on BF animal prion

2020-06-08 Thread Tom Lane
Thomas Munro  writes:
> Two recent failures show plan changes in RLS queries on master.

Yeah.  I assume this is related to 0c882e52a, but I'm not sure how.
The fact that we've only seen it on prion (which runs
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE) is suggestive,
but it's not clear why those options would lead to unstable
planner estimates.  I've been waiting to see if we start to get
similar reports from the full-fledged CLOBBER_CACHE_ALWAYS critters.

regards, tom lane




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-06-08 Thread Michael Paquier
On Mon, Jun 08, 2020 at 02:16:32PM +0300, Alexey Kondratov wrote:
> BTW, most of 'common' is really common code, with only four exceptions
> like logging.c, which is frontend-only. Is it there for historical
> reasons only or something else?
> 
> Personally, I would prefer that everything in 'common' was actually
> common. I am also not sure about moving older code, because of possible
> backward compatibility breakage, but doing so for newer code seems to be
> a good idea.

src/fe_utils/ was created well after src/common/ (588d963 vs
8396447).  I wonder if we should add a note to src/common/'s Makefile
not to add more stuff to OBJS_FRONTEND, telling people to use
src/fe_utils/ instead.  Moving things from common/ to fe_utils/ would
mean breakage, which may be hard to justify just for the sake of being
clean and more consistent, and src/common/'s APIs are usually quite
popular with external plugins.
--
Michael




Re: [HACKERS] Multiple synchronous_standby_names rules

2020-06-08 Thread James Sewell
On Thu, 12 Jan 2017 at 12:06, Michael Paquier 
wrote:

> On Thu, Jan 12, 2017 at 9:53 AM, James Sewell 
> wrote:
> > What is needed to support this is the ability to configure Px with
> something like:
> >
> >  1 (P1, P2, P3), 1 (D1, D2, D3)
> >
> > Would there be any appetite for this - or would it be seen as over
> complication of the current rules?
>
> There have been discussions about being able to do that and there are
> really use cases where that would be useful. As lately quorum commit
> has been committed, we have a better idea of the grammar to use
> (yeah!), though there are a couple of things remaining regarding the
> design of node subsets:
> - How to define group names? Making them mandatory would likely be the
> way to go.
> - How to represent that intuitively in pg_stat_replication? Perhaps
> the answer here is an extra column in this system view.
>

I'm coming back round to this as it's becoming increasingly discussed as we
look into systems with an RPO of (close to) 0 spanning multiple locations.

Before I start hacking - does anyone have any preference for syntax?

To me the best approach seems to be a list of items like:

ANY 2 (a,b) AS SiteA, ANY 2(c,d) AS SiteB
FIRST 2 (a,b,c) AS SiteA, d AS SiteB, e AS SiteC

You need a group name for the replication view as Michael noted; I suppose
you'd need to also allow no group name (one would be autogenerated??) for
backwards compatibility, OR enforce group names when there is more than one
element in the list.

I don't want to do embedded groups like:

ANY 2 (ANY 2 (a,b,c) AS SiteA, d as SiteB, c as SiteB)

Because that just seems like it's overcomplicating the issue.

thoughts?

James



Re: BufFileRead() error signalling

2020-06-08 Thread Michael Paquier
On Mon, Jun 08, 2020 at 05:50:44PM +1200, Thomas Munro wrote:
> On Fri, Jun 5, 2020 at 8:14 PM Michael Paquier wrote:
>> Anyway, why are we sure that it is fine to not complain even if
>> BufFileWrite() does a partial write?  fwrite() is mentioned at the
>> top of the thread, but why is that OK?
>
> It's not OK.  If any system call fails, we'll now ereport()
> immediately.  Now there can't be unhandled or unreported errors, and
> it's no longer possible for the caller to confuse EOF with errors.
> Hence the change in descriptions:

Oh, OK.  I looked at that again this morning and I see your point now.
I was wondering if it could be possible for BufFileWrite() to write
less data than expected with errno=0.  The code on HEAD would issue a
confusing error message like "could not write: Success" in such a
case, though it would still fail with an ERROR.  And I thought that
your patch would behave differently and cause this code path to not
fail in such a case, but the point I missed on my first read of your
patch is that BufFileWrite() is written in such a way that we just
keep looping until the full amount of data is written, meaning that we
should never again see the case where BufFileWrite() returns a size
that does not match the expected write size.
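
To illustrate that looping, a simplified sketch (copy_into_buffer is a
hypothetical stand-in for buffile.c's internal buffer-append logic, not
a real function, and this is not the verbatim implementation):

void
BufFileWrite(BufFile *file, void *ptr, size_t size)
{
	while (size > 0)
	{
		/* append as much as fits into the internal block buffer */
		size_t		nthistime = copy_into_buffer(file, ptr, size);

		/* flushing a full buffer raises ERROR on any syscall failure,
		 * so callers can never observe a short write */
		if (file->pos >= BLCKSZ)
			BufFileDumpBuffer(file);

		ptr = (char *) ptr + nthistime;
		size -= nthistime;
	}
}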

On Tue, Jun 09, 2020 at 12:21:53PM +1200, Thomas Munro wrote:
> On Tue, Jun 9, 2020 at 2:49 AM Alvaro Herrera  
> wrote:
>> I think using our standard "exception" mechanism makes sense.  As for
>> additional context, I think usefulness of the error messages would be
>> improved by showing the file path (because then user knows which
>> filesystem/tablespace was full, for example), but IMO any additional
>> context on top of that is of marginal additional benefit.  If we really
>> cared, we could have errcontext() callbacks in the sites of interest,
>> but that would be a separate patch and perhaps not backpatchable.
> 
> Cool.  It does show the path, so that'll tell you which file system is
> full or broken.

There are some places in logtape.c, *tuplestore.c and gist where there
is no file path.  That would be nice to have, but that's not really
the problem of this patch.

> I thought a bit about the ENOSPC thing, and took that change out.
> Since commit 1173344e we handle write() returning a positive number
> less than the full size by predicting that a follow-up call to write()
> would surely return ENOSPC, without the hassle of trying to write
> more, or having a separate error message sans %m.  But
> BufFileDumpBuffer() does try again, and only raises an error if the
> system call returns < 0 (well, it says <= 0, but 0 is impossible
> according to POSIX, at least assuming you didn't try to write zero
> bytes, and we already exclude that).  So ENOSPC-prediction is
> unnecessary here.

+1.  Makes sense.

>> The wording we use is "could not seek TO block N".  (Or used to use,
>> before switching to pread/pwrite in most places, it seems).
> 
> Fixed in a couple of places.

Looks fine to me.  Are you planning to send an extra patch to switch
BufFileWrite() to void for 14~?
--
Michael




Intermittent test plan change in "privileges" test on BF animal prion

2020-06-08 Thread Thomas Munro
Hello,

Two recent failures show plan changes in RLS queries on master.  Based
on nearby comments, the choice of plan is being used to verify access (or
lack of access) to row estimates, so I guess that means something
could be amiss here.  (Or it could be due to the dropped UDP flaky
stats problem, but then why in the same place twice, and why twice in
a week, only on master, and not for months before that?)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-06-08%2002%3A58%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-06-06%2003%3A18%3A03




Re: BufFileRead() error signalling

2020-06-08 Thread Thomas Munro
On Tue, Jun 9, 2020 at 2:49 AM Alvaro Herrera  wrote:
> I think using our standard "exception" mechanism makes sense.  As for
> additional context, I think usefulness of the error messages would be
> improved by showing the file path (because then user knows which
> filesystem/tablespace was full, for example), but IMO any additional
> context on top of that is of marginal additional benefit.  If we really
> cared, we could have errcontext() callbacks in the sites of interest,
> but that would be a separate patch and perhaps not backpatchable.

Cool.  It does show the path, so that'll tell you which file system is
full or broken.

I thought a bit about the ENOSPC thing, and took that change out.
Since commit 1173344e we handle write() returning a positive number
less than the full size by predicting that a follow-up call to write()
would surely return ENOSPC, without the hassle of trying to write
more, or having a separate error message sans %m.  But
BufFileDumpBuffer() does try again, and only raises an error if the
system call returns < 0 (well, it says <= 0, but 0 is impossible
according to POSIX, at least assuming you didn't try to write zero
bytes, and we already exclude that).  So ENOSPC-prediction is
unnecessary here.

> The wording we use is "could not seek TO block N".  (Or used to use,
> before switching to pread/pwrite in most places, it seems).

Fixed in a couple of places.
From 3632531a2e6f15bd4cd28fd1eb1dcc0fd341f014 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 8 Jun 2020 17:01:06 +1200
Subject: [PATCH v3] Redesign error handling in buffile.c.

Convert error handling in buffile.c functions to raise errors rather
than expecting callers to check the return value, and improve the error
messages.

This fixes various cases that forgot to check for errors, or couldn't
check for errors because EOF was indistinguishable from error, or
checked for errors unless they fell on certain boundaries which they
assumed to be EOF, or reported errno when it's not clear that it's been
set.

The problems were initially identified by code inspection, and then
reported as the probable cause of a crash report involving a corrupted
hash join spill file when disk space ran out.

Back-patch to all releases.  For now, BufFileWrite() still returns a
redundant value so as not to change the function prototype for the
benefit of any extensions that might be using it.

Reported-by: Amit Khandekar 
Reported-by: Alvaro Herrera 
Reviewed-by: Melanie Plageman 
Reviewed-by: Alvaro Herrera 
Reviewed-by: Robert Haas 
Reviewed-by: Ibrar Ahmed 
Reviewed-by: Michael Paquier 
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
---
 src/backend/access/gist/gistbuildbuffers.c | 23
 src/backend/executor/nodeHashjoin.c        | 24
 src/backend/replication/backup_manifest.c  |  9
 src/backend/storage/file/buffile.c         | 45
 src/backend/utils/sort/logtape.c           | 18
 src/backend/utils/sort/sharedtuplestore.c  | 21
 src/backend/utils/sort/tuplestore.c        | 56
 7 files changed, 86 insertions(+), 110 deletions(-)

diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4b562d8d3f..26bbc9b316 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -757,26 +757,19 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 static void
 ReadTempFileBlock(BufFile *file, long blknum, void *ptr)
 {
+	size_t		nread;
+
 	if (BufFileSeekBlock(file, blknum) != 0)
-		elog(ERROR, "could not seek temporary file: %m");
-	if (BufFileRead(file, ptr, BLCKSZ) != BLCKSZ)
-		elog(ERROR, "could not read temporary file: %m");
+		elog(ERROR, "could not seek to block %ld in temporary file", blknum);
+	if ((nread = BufFileRead(file, ptr, BLCKSZ)) != BLCKSZ)
+		elog(ERROR, "could not read temporary file: read only %zu of %zu bytes",
+			 nread, (size_t) BLCKSZ);
 }
 
 static void
 WriteTempFileBlock(BufFile *file, long blknum, void *ptr)
 {
 	if (BufFileSeekBlock(file, blknum) != 0)
-		elog(ERROR, "could not seek temporary file: %m");
-	if (BufFileWrite(file, ptr, BLCKSZ) != BLCKSZ)
-	{
-		/*
-		 * the other errors in Read/WriteTempFileBlock shouldn't happen, but
-		 * an error at write can easily happen if you run out of disk space.
-		 */
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not write block %ld of temporary file: %m",
-		blknum)));
-	}
+		elog(ERROR, "could not seek to block %ld in temporary file", blknum);
+	BufFileWrite(file, ptr, BLCKSZ);
 }
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2cdc38a601..9bb23fef1a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1043,7 +1043,7 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 		if (BufFileSeek(innerFile, 0, 

Re: Remove SpinLockFree() / S_LOCK_FREE()?

2020-06-08 Thread Tom Lane
Andres Freund  writes:
> We currently have
>  *	bool SpinLockFree(slock_t *lock)
>  *		Tests if the lock is free. Returns true if free, false if locked.
>  *		This does *not* change the state of the lock.
> [ which isn't used ]
> Thus: Let's just remove SpinLockFree() / S_LOCK_FREE()?

Yeah.  I think they were included in the original design on the
theory that we'd need 'em someday.  But if we haven't found a use
yet we probably never will.  So +1 for narrowing the API a tad.

(We'd lose some error checking ability in the S_LOCK_TEST code,
but probably that's not worth worrying about.)

regards, tom lane




Remove SpinLockFree() / S_LOCK_FREE()?

2020-06-08 Thread Andres Freund
Hi,

We currently have
 *	bool SpinLockFree(slock_t *lock)
 *		Tests if the lock is free. Returns true if free, false if locked.
 *		This does *not* change the state of the lock.
and its underlying S_LOCK_FREE() operation:
 *
 *	bool S_LOCK_FREE(slock_t *lock)
 *		Tests if the lock is free. Returns true if free, false if locked.
 *		This does *not* change the state of the lock.

They are currently unused and, as far as I can tell, have never been
used outside test code /asserts. We also don't currently implement them
in the spinlock fallback code:

bool
s_lock_free_sema(volatile slock_t *lock)
{
/* We don't currently use S_LOCK_FREE anyway */
elog(ERROR, "spin.c does not support S_LOCK_FREE()");
return false;
}


I also find the "free" in the name very confusing. Every time I look at
them (which, I grant, is not that often), I have to think about what
they mean.

Thus: Let's just remove SpinLockFree() / S_LOCK_FREE()?

Greetings,

Andres Freund




Re: [PATCH] Add support for choosing huge page size

2020-06-08 Thread Thomas Munro
On Tue, Jun 9, 2020 at 4:13 AM Odin Ugedal  wrote:
> This adds support for using non-default huge page sizes for shared
> memory. This is achived via the new "huge_page_size" config entry.
> The config value defaults to 0, meaning it will use the system default.
> ---
>
> This would be very helpful when running in kubernetes since nodes may
> support multiple huge page sizes, and have pre-allocated huge page memory
> for each size. This lets the user select huge page size without having
> to change the default huge page size on the node. This will also be
> useful when doing benchmarking with different huge page sizes, since it
> wouldn't require a full system reboot.

+1

> Since the default value of the new config is 0 (resulting in using the
> default huge page size) this should be backwards compatible with old
> configs.

+1

> Feel free to comment on the phrasing (both in docs and code) and on the
> overall change.

This change seems good to me, because it will make testing easier and
certain mixed page size configurations possible.  I haven't tried your
patch yet; I'll take it for a spin when I'm benchmarking some other
relevant stuff soon.
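
(For anyone curious about the mechanism underneath: on Linux, a specific
huge page size is requested by encoding log2 of the size into the mmap()
flags, which is presumably what the new GUC maps onto. A sketch, not the
patch itself:)

#include <sys/mman.h>	/* MAP_HUGETLB, MAP_HUGE_SHIFT; may need
						 * <linux/mman.h> on older systems */
#include <stddef.h>

/* Request 1GB huge pages explicitly: 1GB = 2^30, so pass shift 30. */
static void *
alloc_shmem_1gb(size_t size)
{
	int			shift = 30;

	return mmap(NULL, size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB |
				(shift << MAP_HUGE_SHIFT), -1, 0);
}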

Minor comments on wording:

> +   
> +Most modern linux systems support 2MB and 1GB
> +huge pages, and some architectures supports other sizes as well. For
> +more information on how to check for support and usage, see
> +<xref linkend="linux-huge-pages"/>.

Linux with a capital L.  Hmm, I don't especially like saying "Most
modern Linux systems" as code for Intel.  I wonder if we should
instead say something like: "Some commonly available page sizes on
modern 64 bit server architectures include: 2MB and 1GB (Intel and
AMD), 16MB and 16GB (IBM POWER), and ... (ARM)."

> +   
> +   
> +Controling huge page size is not supported on Windows.

Controlling

Just by the way, some googling is telling me that very recent versions
of Windows *could* support this (search keywords:
"NtAllocateVirtualMemoryEx 1GB"), so that could be a project for
someone who understands Windows to look into later.




Re: valgrind versus pg_atomic_init()

2020-06-08 Thread Tom Lane
Andres Freund  writes:
> On 2020-06-07 00:23:35 -0400, Tom Lane wrote:
>> so my first thought was that we just needed an architecture-specific
>> variant of that.  But on thinking more about this, it seems like
>> generic.h's version of pg_atomic_init_u64_impl is just fundamentally
>> misguided.  Why isn't it simply assigning the value with an ordinary
>> unlocked write?  By definition, we had better not be using this function
>> in any circumstance where there might be conflicting accesses, so I don't
>> see why we should need to tolerate a valgrind exception here.  Moreover,
>> if a simple assignment *isn't* good enough, then surely the spinlock
>> version in atomics.c is 100% broken.

> Yea, it could just do that. It seemed slightly easier/clearer, back when
> I wrote it, to just use pg_atomic_write* for the initialization, but
> this seems enough of a reason to stop doing so. Will change it in all
> branches, unless somebody sees a reason to not do so?

Works for me.

regards, tom lane




Re: valgrind versus pg_atomic_init()

2020-06-08 Thread Andres Freund
Hi,

On 2020-06-07 00:23:35 -0400, Tom Lane wrote:
> I experimented with running "make check" on ARM64 under a reasonably
> bleeding-edge valgrind (3.16.0).  One thing I ran into is that
> regress.c's test_atomic_ops fails; valgrind shows the stack trace
> 
>fun:__aarch64_cas8_acq_rel
>fun:pg_atomic_compare_exchange_u64_impl
>fun:pg_atomic_exchange_u64_impl
>fun:pg_atomic_write_u64_impl
>fun:pg_atomic_init_u64_impl
>fun:pg_atomic_init_u64
>fun:test_atomic_uint64
>fun:test_atomic_ops
>fun:ExecInterpExpr
> 
> Now, this is basically the same thing as is already memorialized in
> src/tools/valgrind.supp:
> 
> # Atomic writes to 64bit atomic vars uses compare/exchange to
> # guarantee atomic writes of 64bit variables. pg_atomic_write is used
> # during initialization of the atomic variable; that leads to an
> # initial read of the old, undefined, memory value. But that's just to
> # make sure the swap works correctly.
> {
>   uninitialized_atomic_init_u64
>   Memcheck:Cond
>   fun:pg_atomic_exchange_u64_impl
>   fun:pg_atomic_write_u64_impl
>   fun:pg_atomic_init_u64_impl
> }
> 
> so my first thought was that we just needed an architecture-specific
> variant of that.  But on thinking more about this, it seems like
> generic.h's version of pg_atomic_init_u64_impl is just fundamentally
> misguided.  Why isn't it simply assigning the value with an ordinary
> unlocked write?  By definition, we had better not be using this function
> in any circumstance where there might be conflicting accesses, so I don't
> see why we should need to tolerate a valgrind exception here.  Moreover,
> if a simple assignment *isn't* good enough, then surely the spinlock
> version in atomics.c is 100% broken.

Yea, it could just do that. It seemed slightly easier/clearer, back when
I wrote it, to just use pg_atomic_write* for the initialization, but
this seems enough of a reason to stop doing so. Will change it in all
branches, unless somebody sees a reason to not do so?
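
(The generic fallback would then presumably look something like this; a
sketch, not the committed change:)

/* Initialization can use a plain store: by definition nothing may be
 * accessing the variable concurrently at init time, so no atomic
 * exchange (and thus no read of the old, undefined value that
 * valgrind complains about) is needed. */
static inline void
pg_atomic_init_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val_)
{
	ptr->value = val_;
}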

Greetings,

Andres Freund




Re: Bump default wal_level to logical

2020-06-08 Thread Andres Freund
Hi,

On 2020-06-08 13:27:50 -0400, Tom Lane wrote:
> If we can allow wal_level to be changed on the fly, I agree that would
> help reduce the pressure to make the default setting more expensive.
> I don't recall why it's PGC_POSTMASTER right now, but I suppose there
> was a reason for that ...

There's reasons, but IIRC they're all solvable with reasonable effort. I
think most of it boils down to only being able to rely on the new
wal_level after a while. For minimal->replica we basically need a
checkpoint started after the change in configuration, and for
replica->logical we need to wait until all sessions have a) read the
new config setting b) finished the transaction that used the old
setting.

Greetings,

Andres Freund




Re: proposal: possibility to read dumped table's name from file

2020-06-08 Thread Justin Pryzby
On Mon, Jun 08, 2020 at 07:18:49PM +0200, Pavel Stehule wrote:
> On Fri 29. 5. 2020 at 20:25, Justin Pryzby wrote:
> 
> > On Fri, May 29, 2020 at 04:21:00PM +0200, Pavel Stehule wrote:
> > one of my customers has to specify dumped tables name by name. After
> > years of increasing database size and table counts he has a problem
> > with a too-short command line. He needs to read the list of tables
> > from a file (or from stdin).
> >
> > +1 - we would use this.
> >
> > We put a regex (actually a pg_dump pattern) of tables to skip
> > (timeseries partitions which are older than a few days, which are
> > dumped once, are not expected to change, and are typically not
> > redumped).  We're nowhere near the execve() limit, but it'd be nice
> > if the command was primarily a list of options and not a long regex.
> >
> > Please also support reading from file for --exclude-table=pattern.
> >
> > I'm drawing a parallel between this and rsync --include/--exclude and
> > --filter.
> >
> > We'd be implementing a new --filter, which might have similar syntax to 
> > rsync
> > (which I always forget).
> 
> I implemented support for all "repeated" pg_dump options.
> 
> I invite any help with the docs; there is just very raw text
> 
> +Do not dump data of tables spefified in file.

*specified

I still wonder if a better syntax would use a unified --filter option, whose
argument would allow including/excluding any type of object:

+[tn] include (t)table/(n)namespace/...
-[tn] exclude (t)table/(n)namespace/...

In the past, I looked for a way to exclude extended stats objects, and ended up
using a separate schema.  An "extensible" syntax might be better (although
reading a file of just patterns has the advantage that the function can just be
called once for each option for each type of object).

-- 
Justin




Re: Bump default wal_level to logical

2020-06-08 Thread Andres Freund
Hi,

On 2020-06-08 14:58:03 -0400, Robert Haas wrote:
> On Mon, Jun 8, 2020 at 1:16 PM Alvaro Herrera  
> wrote:
> > I think it's reasonable to push our default limits for slots,
> > walsenders, max_bgworkers etc a lot higher than current value (say 10 ->
> > 100).  An unused slot wastes essentially no resources; an unused
> > walsender is just one PGPROC entry.  If we did that, and also allowed
> > wal_level to be changed on the fly, we wouldn't need to restart in order
> > to enable logical replication, so there would be little or no pressure
> > to change the wal_level default.
> 
> Wouldn't having a whole bunch of extra PGPROC entries have negative
> implications for the performance of GetSnapshotData() and other things
> that don't scale well at high connection counts?

Some things yes, but I don't think it should have a significant effect
on GetSnapshotData():

We currently don't touch unused PGPROCs for it (by virtue of
procArray->pgprocnos), and we wouldn't with my proposed / pending
changes (because the relevant arrays contain data for connected backends
at the "front").
You can have some effect if you have temporary spikes to very high
connection counts, and then reduce that again. That can lead to a lot of
unused PGXACT entries being interleaved with used ones, leading to
higher cache miss ratios (data cache as well as tlb). But that cannot
become a problem without those PGPROCs ever being used, because IIRC we
otherwise ensure they're used "densely".
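
(For context, a rough sketch of the mechanism, simplified from the
procarray code:)

/* Snapshot building iterates only the dense array of connected
 * backends, so PGPROC entries that were never used are never
 * touched, no matter how high the limits are set. */
for (int i = 0; i < arrayP->numProcs; i++)
{
	int			pgprocno = arrayP->pgprocnos[i];
	PGPROC	   *proc = &allProcs[pgprocno];

	/* ... gather proc's xid/xmin into the snapshot ... */
}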

There are a few places where we actually look over all PGPROCs, or size
resources according to that, but I think most of those shouldn't be in
hot paths.

There are also effects like the lock hashtables being sized larger,
which then can reduce the cache hit ratio. I guess we could check
whether it'd make sense to charge less than max_locks_per_transaction
for everything but user processes, but I'm a bit doubtful it's worth it.

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-08 Thread Andres Freund
Hi,

On 2020-06-08 13:41:29 -0700, Jeff Davis wrote:
> On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> > Before there was basically one call from nodeAgg.c to execGrouping.c
> > for
> > each tuple and hash table. Now it's a lot more complicated:
> > 1) nodeAgg.c: prepare_hash_slot()
> > 2) execGrouping.c: TupleHashTableHash()
> > 3) nodeAgg.c: lookup_hash_entry()
> > 4) execGrouping.c: LookupTupleHashEntryHash()
> 
> The reason that I did it that way was to be able to store the hash
> along with the saved tuple (similar to what HashJoin does), which
> avoids recalculation.

That makes sense. But then you can just use a separate call into
execGrouping for that purpose.


> > Why isn't the flow more like this:
> > 1) prepare_hash_slot()
> > 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> > 3) entry = LookupTupleHashEntry(); if (!entry)
> > hashagg_spill_tuple();
> > 4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)
> 
> I'll work up a patch to refactor this. I'd still like to see if we can
> preserve the calculate-hash-once behavior somehow.

Cool!

Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-06-08 Thread Jeff Davis
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Why isn't the flow more like this:
> 1) prepare_hash_slot()
> 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> 3) entry = LookupTupleHashEntry(); if (!entry)
> hashagg_spill_tuple();
> 4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)

I see, you are suggesting that I change around the execGrouping.c
signatures to return the hash, which will avoid the extra call. That
makes more sense.

Regards,
Jeff Davis








Re: snowball release

2020-06-08 Thread Peter Eisentraut

On 2020-06-08 17:19, Tom Lane wrote:
> Daniel Gustafsson writes:
>> On 8 Jun 2020, at 16:33, Tom Lane wrote:
>>> Hm, I don't see any documentation change in that commit --- don't
>>> we have (at least) a list of the stemmers somewhere in the SGML docs?
>
>> IIRC we refer to the Snowball site, and only have a list in the \dFd
>> output, but that can be argued to be an example and not expected to be
>> updated to match.  Perhaps we should though?
>
> Looking in the commit logs, our past updates 7b925e127 and
> fd582317e just updated that \dFd sample.  So I guess that's the
> minimum expectation.

done

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: hashagg slowdown due to spill changes

2020-06-08 Thread Jeff Davis
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:
> Before there was basically one call from nodeAgg.c to execGrouping.c
> for
> each tuple and hash table. Now it's a lot more complicated:
> 1) nodeAgg.c: prepare_hash_slot()
> 2) execGrouping.c: TupleHashTableHash()
> 3) nodeAgg.c: lookup_hash_entry()
> 4) execGrouping.c: LookupTupleHashEntryHash()

The reason that I did it that way was to be able to store the hash
along with the saved tuple (similar to what HashJoin does), which
avoids recalculation.

That could be a nice savings for some cases, like when work_mem is
small but the data still fits in system memory, which I expect to be
fairly common. But based on your numbers, it might be a bad trade-off
overall.

> Why isn't the flow more like this:
> 1) prepare_hash_slot()
> 2) if (aggstate->hash_spill_mode) goto 3; else goto 4
> 3) entry = LookupTupleHashEntry(); if (!entry)
> hashagg_spill_tuple();
> 4) InsertTupleHashEntry(, ); if (isnew) initialize(entry)

I'll work up a patch to refactor this. I'd still like to see if we can
preserve the calculate-hash-once behavior somehow.
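
A rough sketch of that refactored per-tuple flow (signatures
hypothetical, assuming the execGrouping.c entry points are changed to
hand back the computed hash):

prepare_hash_slot(aggstate);

if (aggstate->hash_spill_mode)
{
	/* Probe only; keep the returned hash so the spilled tuple can
	 * carry it and avoid recomputation when it is reloaded. */
	entry = LookupTupleHashEntry(hashtable, slot, &hash);
	if (entry == NULL)
		hashagg_spill_tuple(spill, slot, hash);
}
else
{
	/* Find-or-insert; initialize groups not seen before. */
	entry = InsertTupleHashEntry(hashtable, slot, &isnew, &hash);
	if (isnew)
		initialize_hash_entry(aggstate, hashtable, entry);
}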

Regards,
Jeff Davis
 





Re: Bump default wal_level to logical

2020-06-08 Thread Peter Geoghegan
On Mon, Jun 8, 2020 at 12:28 PM Alvaro Herrera  wrote:
> On a quantum-mechanics level, sure, but after Andres's snapshot
> scalability patches, will it be measurable?  (Besides, if your workload
> is so high that you're measurably affected by the additional unused
> PGPROC entries, you can always tune it lower.)

The point that Robert went on to make about the increased WAL volume
from logging old primary key (or replica identity) values was a
stronger argument, IMV.

-- 
Peter Geoghegan




Re: hashagg slowdown due to spill changes

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-05, Andres Freund wrote:

> While preparing my pgcon talk I noticed that our hash-agg performance
> degraded noticeably. Looks to me like it's due to the spilling-hashagg
> changes.

Jeff, what are your thoughts on this?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Bump default wal_level to logical

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-08, Robert Haas wrote:

> On Mon, Jun 8, 2020 at 1:16 PM Alvaro Herrera  
> wrote:
> > I think it's reasonable to push our default limits for slots,
> > walsenders, max_bgworkers etc a lot higher than current value (say 10 ->
> > 100).  An unused slot wastes essentially no resources; an unused
> > walsender is just one PGPROC entry.  If we did that, and also allowed
> > wal_level to be changed on the fly, we wouldn't need to restart in order
> > to enable logical replication, so there would be little or no pressure
> > to change the wal_level default.
> 
> Wouldn't having a whole bunch of extra PGPROC entries have negative
> implications for the performance of GetSnapshotData() and other things
> that don't scale well at high connection counts?

On a quantum-mechanics level, sure, but after Andres's snapshot
scalability patches, will it be measurable?  (Besides, if your workload
is so high that you're measurably affected by the additional unused
PGPROC entries, you can always tune it lower.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Bump default wal_level to logical

2020-06-08 Thread Peter Geoghegan
On Mon, Jun 8, 2020 at 12:09 PM Robert Haas  wrote:
> I think the big overhead is that you log the old version of each row's
> primary key (or whatever the replica identity is) when performing an
> UPDATE or DELETE. So if you test it with integer keys probably it's
> not bad, and I suspect (though I haven't looked) that we don't do the
> extra logging when the key hasn't changed. But if you have wide text
> columns as keys and you update them a lot then things might not look
> so good. I think in the bad cases for this feature the overhead is
> vastly more than going from minimal to replica.
>
> As many people here probably know, I am in general skeptical of this
> kind of change. It's based on the premise that reconfiguring the
> system is either too hard for users to figure out, or too disruptive
> because they'll need a restart.

I completely agree with your analysis, and your conclusions.

-- 
Peter Geoghegan




Re: Bump default wal_level to logical

2020-06-08 Thread Robert Haas
On Mon, Jun 8, 2020 at 5:11 AM Magnus Hagander  wrote:
> I agree that we should consider changing it *if* it does not come with a 
> substantial overhead, but that has to be shown.

I think the big overhead is that you log the old version of each row's
primary key (or whatever the replica identity is) when performing an
UPDATE or DELETE. So if you test it with integer keys probably it's
not bad, and I suspect (though I haven't looked) that we don't do the
extra logging when the key hasn't changed. But if you have wide text
columns as keys and you update them a lot then things might not look
so good. I think in the bad cases for this feature the overhead is
vastly more than going from minimal to replica.

As many people here probably know, I am in general skeptical of this
kind of change. It's based on the premise that reconfiguring the
system is either too hard for users to figure out, or too disruptive
because they'll need a restart. I tend to feel that the first problem
should be solved by making it easier to figure out, and the second one
should be solved by not requiring a restart. I don't think that's easy
engineering, because while I think barriers help, they only address
one relatively small aspect of what's probably a pretty difficult
engineering problem. But the real-life analogue of what's being
proposed here seems to be "the people who are buying this house might
not be able to figure out how to turn the lights on if they need more
light, so let's just turn on all the lights before we hand over the
keys, and that way if they just leave them on forever it'll be cool."
To which any reasonable person would say - "if your electrical
switches are that hard to locate and use, that house has got a serious
design problem."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Bump default wal_level to logical

2020-06-08 Thread Kenneth Marshall
On Mon, Jun 08, 2020 at 02:58:03PM -0400, Robert Haas wrote:
> On Mon, Jun 8, 2020 at 1:16 PM Alvaro Herrera  
> wrote:
> > I think it's reasonable to push our default limits for slots,
> > walsenders, max_bgworkers etc a lot higher than current value (say 10 ->
> > 100).  An unused slot wastes essentially no resources; an unused
> > walsender is just one PGPROC entry.  If we did that, and also allowed
> > wal_level to be changed on the fly, we wouldn't need to restart in order
> > to enable logical replication, so there would be little or no pressure
> > to change the wal_level default.
> 
> Wouldn't having a whole bunch of extra PGPROC entries have negative
> implications for the performance of GetSnapshotData() and other things
> that don't scale well at high connection counts?
> 

+1

I think just raising the defaults enough to allow even a couple of DB
replication slots would be advantageous, and would let them be used to
address spur-of-the-moment needs on systems that need to stay up.
Supporting large numbers of slots by default does seem wasteful, and
seems contrary to the project's stance on initial limits.

Regards,
Ken




Re: Bump default wal_level to logical

2020-06-08 Thread Robert Haas
On Mon, Jun 8, 2020 at 1:16 PM Alvaro Herrera  wrote:
> I think it's reasonable to push our default limits for slots,
> walsenders, max_bgworkers etc a lot higher than current value (say 10 ->
> 100).  An unused slot wastes essentially no resources; an unused
> walsender is just one PGPROC entry.  If we did that, and also allowed
> wal_level to be changed on the fly, we wouldn't need to restart in order
> to enable logical replication, so there would be little or no pressure
> to change the wal_level default.

Wouldn't having a whole bunch of extra PGPROC entries have negative
implications for the performance of GetSnapshotData() and other things
that don't scale well at high connection counts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Bump default wal_level to logical

2020-06-08 Thread Tom Lane
Alvaro Herrera  writes:
> I think it's reasonable to push our default limits for slots,
> walsenders, max_bgworkers etc a lot higher than current value (say 10 ->
> 100).  An unused slot wastes essentially no resources; an unused
> walsender is just one PGPROC entry.  If we did that, and also allowed
> wal_level to be changed on the fly, we wouldn't need to restart in order
> to enable logical replication, so there would be little or no pressure
> to change the wal_level default.

Unused PGPROC entries will still consume semaphores, which is problematic
on at least some OSes.  It's not really clear to me why the default for
walsenders would need to be O(100) anyway.  The existing default of 10
already ought to be enough to cover approximately 99.999% of use cases.

If we can allow wal_level to be changed on the fly, I agree that would
help reduce the pressure to make the default setting more expensive.
I don't recall why it's PGC_POSTMASTER right now, but I suppose there
was a reason for that ...

regards, tom lane




Re: proposal: possibility to read dumped table's name from file

2020-06-08 Thread Pavel Stehule
Hi

On Fri 29. 5. 2020 at 20:25, Justin Pryzby wrote:

> On Fri, May 29, 2020 at 04:21:00PM +0200, Pavel Stehule wrote:
> > one of my customers has to specify dumped tables name by name. After
> > years of increasing database size and table counts he has a problem
> > with a too-short command line. He needs to read the list of tables
> > from a file (or from stdin).
>
> +1 - we would use this.
>
> We put a regex (actually a pg_dump pattern) of tables to skip
> (timeseries partitions which are older than a few days, which are
> dumped once, are not expected to change, and are typically not
> redumped).  We're nowhere near the execve() limit, but it'd be nice
> if the command was primarily a list of options and not a long regex.
>
> Please also support reading from file for --exclude-table=pattern.
>
> I'm drawing a parallel between this and rsync --include/--exclude and
> --filter.
>
> We'd be implementing a new --filter, which might have similar syntax to
> rsync
> (which I always forget).
>

I implemented support for all "repeated" pg_dump options.

--exclude-schemas-file=FILENAME
--exclude-tables-data-file=FILENAME
--exclude-tables-file=FILENAME
--include-foreign-data-file=FILENAME
--include-schemas-file=FILENAME
--include-tables-file=FILENAME

Regards

Pavel

I invite any help with the docs; there is just very raw text



> --
> Justin
>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 2f0807e912..5e39268d51 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -725,6 +725,16 @@ PostgreSQL documentation
   
  
 
+ 
+  --exclude-schemas-file=filename
+  
+   
+This option allows to specify file with schemas that will not be
+dumped. Any row of is one schema's name.
+   
+  
+ 
+
  
   --exclude-table-data=pattern
   
@@ -743,6 +753,24 @@ PostgreSQL documentation
   
  
 
+ 
+  --exclude-tables-data-file=filename
+  
+   
+Do not dump data of tables spefified in file.
+   
+  
+ 
+
+ 
+  --exclude-tables-file=filename
+  
+   
+Do not dump tables spefified in file.
+   
+  
+ 
+
  
   --extra-float-digits=ndigits
   
@@ -795,6 +823,33 @@ PostgreSQL documentation
   
  
 
+ 
+  --include-foreign-data-file=filename
+  
+   
+Include data of foreign servers specified in file.
+   
+  
+ 
+
+ 
+  --include-schemas-file=filename
+  
+   
+Dump schema(s) specified in file.
+   
+  
+ 
+
+ 
+  --include-tables-file=filename
+  
+   
+Dump table(s) specified in file.
+   
+  
+ 
+
  
   --inserts
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index dfe43968b8..db9fb10f68 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -290,6 +290,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
 static TableInfo *getRootTableInfo(TableInfo *tbinfo);
+static bool read_options_from_file(SimpleStringList *slist, char *filename);
 
 
 int
@@ -362,8 +363,13 @@ main(int argc, char **argv)
 		{"disable-dollar-quoting", no_argument, _dollar_quoting, 1},
 		{"disable-triggers", no_argument, _triggers, 1},
 		{"enable-row-security", no_argument, _row_security, 1},
+		{"exclude-schemas-file", required_argument, NULL, 12},
 		{"exclude-table-data", required_argument, NULL, 4},
+		{"exclude-tables-data-file", required_argument, NULL, 14},
+		{"exclude-tables-file", required_argument, NULL, 13},
 		{"extra-float-digits", required_argument, NULL, 8},
+		{"include-schemas-file", required_argument, NULL, 15},
+		{"include-tables-file", required_argument, NULL, 16},
 		{"if-exists", no_argument, _exists, 1},
 		{"inserts", no_argument, NULL, 9},
 		{"lock-wait-timeout", required_argument, NULL, 2},
@@ -386,6 +392,7 @@ main(int argc, char **argv)
 		{"on-conflict-do-nothing", no_argument, _nothing, 1},
 		{"rows-per-insert", required_argument, NULL, 10},
 		{"include-foreign-data", required_argument, NULL, 11},
+		{"include-foreign-data-file", required_argument, NULL, 17},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -603,6 +610,32 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:/* read exclude schema names from file */
+(void) read_options_from_file(&schema_exclude_patterns, optarg);
+break;
+
+			case 13:/* read exclude table names from file */
+(void) read_options_from_file(&table_exclude_patterns, optarg);
+break;
+
+			case 14:/* read exclude table data names from file */
+(void) read_options_from_file(&tabledata_exclude_patterns, optarg);
+break;
+
+			case 15:/* read include schema names from file */
+if (read_options_from_file(&schema_include_patterns, optarg))
+	dopt.include_everything = false;
+	

Re: Bump default wal_level to logical

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-08, Tomas Vondra wrote:

> Not sure if it's sufficient, though, because switching to logical may
> require bumping up number of slots, walsenders, etc. At which point you
> actually need a restart. Not to mention that extensions using logical
> decoding (like pglogical) need to allocate shared memory etc. But for
> the built-in logical replication that is not an issue, ofc.

I think it's reasonable to push our default limits for slots,
walsenders, max_bgworkers etc a lot higher than current value (say 10 ->
100).  An unused slot wastes essentially no resources; an unused
walsender is just one PGPROC entry.  If we did that, and also allowed
wal_level to be changed on the fly, we wouldn't need to restart in order
to enable logical replication, so there would be little or no pressure
to change the wal_level default.
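
In postgresql.conf terms, that suggestion amounts to shipping defaults along
the lines of (illustrative values, using the actual GUC names):

max_replication_slots = 100
max_wal_senders = 100
max_worker_processes = 100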

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Getting ERROR with FOR UPDATE/SHARE for partitioned table.

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-03, Amit Langote wrote:

> Are you saying that the planner should take into account the state of
> the cursor specified in WHERE CURRENT OF to determine which of the
> tables to scan for the UPDATE?  Note that neither partition pruning
> nor constraint exclusion knows that CurrentOfExpr can possibly allow
> excluding children of the UPDATE target.

I think from a user POV this is pretty obvious.  The user doesn't really
care that there are partitions that were pruned, because obviously
UPDATE WHERE CURRENT OF cannot refer to a tuple in those partitions.
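
For reference, the case under discussion looks roughly like this (a minimal
sketch; table and cursor names are invented):

BEGIN;
DECLARE c CURSOR FOR SELECT * FROM pt FOR UPDATE;
FETCH c;   -- cursor is now positioned on a row in exactly one partition
UPDATE pt SET v = v + 1 WHERE CURRENT OF c;

The UPDATE can only affect the partition holding the cursor's current row,
yet the planner currently considers all partitions.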

> > I am possibly shooting in dark, but this puzzles me. And it looks like
> > we can cause wrong rows to be updated in non-partition inheritance
> > where the ctids match?
> 
> I don't think that hazard exists, because the table OID is matched
> before the TID.

It sounds like CURRENT OF should somehow inform pruning that the
partition OID is to be matched as well.  I don't know offhand if this is
easily implementable, though.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_upgrade fails with non-standard ACL

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-08, Anastasia Lubennikova wrote:

> In this version I rebased test patches,  added some more comments, fixed
> memory allocation issue and also removed code that handles ACLs on
> languages. They require a lot of specific handling, while I doubt that their
> signatures, which consist of language name only, are subject to change in
> any future versions.

I'm thinking what's a good way to have a test that's committable.  Maybe
it would work to add a TAP test to pg_upgrade that runs initdb, does a
few GRANTS as per your attachment, then runs pg_upgrade?  Currently the
pg_upgrade tests are not TAP ... we'd have to revive 
https://postgr.es/m/20180126080026.gi17...@paquier.xyz
(Some progress was already made on the buildfarm front to permit this)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: should CREATE INDEX ON partitioned_table call PreventInTransactionBlock() ?

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-08, Justin Pryzby wrote:

> On Mon, Jun 08, 2020 at 11:27:26AM -0400, Alvaro Herrera wrote:

> > Well, that would also require that transactions are committed and
> > started for each partition.  Merely adding PreventInTransactionBlock
> > would not do that.  Moreover, since this would break DDL-in-transactions
> > that would otherwise work, it should be optional and thus need a keyword
> > in the command.  But CONCURRENTLY isn't it (because that means something
> > else) so we'd have to discuss what it would be.
> 
> I wasn't thinking of a new feature but rather if it would be desirable to
> change behavior for v14 to always start/commit transaction for each partition.

Well, I was saying that I don't think a blanket behavior change is
desirable.  For example, if you have a script that creates a partitioned
table and a few partitions and a few indexes, and it does all that in a
transaction, it'll break.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Bump default wal_level to logical

2020-06-08 Thread David Fetter
On Mon, Jun 08, 2020 at 11:10:38AM +0200, Magnus Hagander wrote:
> On Mon, Jun 8, 2020 at 8:46 AM Michael Paquier  wrote:
> 
> > On Mon, Jun 08, 2020 at 11:59:14AM +0530, Amit Kapila wrote:
> > > I think we should first do performance testing to see what is the
> > > overhead of making this default.  I think pgbench read-write at
> > > various scale factors would be a good starting point.  Also, we should
> > > see how much additional WAL it generates as compared to current
> > > default.
> >
> > +1.  I recall that changing wal_level to logical has been discussed in
> > the past and performance was the actual take to debate on.
> >
> 
> That was at least the discussion (long-going and multi-repeated) before we
> upped it from minimal to replica. There were some pretty extensive
> benchmarking done to prove that the difference was very small, and this was
> weighed against the ability to take basic backups of the system (which
> arguably is more important than being able to do logical replication).

I'd argue this from a different direction.

Logical replication has been fundamental to how a lot of systems
operate since Slony came out for the very good reason that it was far
and away the simplest way to accomplish a bunch of design goals. There
are now, and have been for some years, both free and proprietary
systems whose sole purpose is change data capture. PostgreSQL can play
nicely with those systems with this switch flipped on by default.

Looking into the future of PostgreSQL itself, there are things we've
been unable to do thus far that logical replication makes tractable.
These include:

- Zero downtime version changes
- Substantive changes to our on-disk representations between versions
  because upgrading in place puts sharp limits on what we can do.

> I agree that we should consider changing it *if* it does not come
> with a substantial overhead, but that has to be shown.

What overhead would be substantial enough to require more work than
changing the default, and under what circumstances?

I ask this because on a heavily loaded system, the kind where
differences could be practical as opposed to merely statistically
significant, statement logging at even the most basic level is a much
bigger burden than the maxed-out WALs are.  Any overhead those WALs
might impose simply disappears in the noise.  The difference is even
more stark in systems subject to audit.

> Of course, what would be even neater would be if it could be changed
> so you don't have to bounce the server to change the wal_level.
> That's a bigger change though, but perhaps it is now possible once
> we have the "global barriers" in 13?

As much as I would love to have this capability, I was hoping to keep
the scope of this contained.  As pointed out down-thread, there's lots
more to doing this dynamically than just turning up the wal_level.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




[PATCH] Add support for choosing huge page size

2020-06-08 Thread Odin Ugedal
This adds support for using non-default huge page sizes for shared
memory. This is achieved via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---

This would be very helpful when running in Kubernetes, since nodes may
support multiple huge page sizes and have pre-allocated huge page memory
for each size. This lets the user select huge page size without having
to change the default huge page size on the node. This will also be
useful when doing benchmarking with different huge page sizes, since it
wouldn't require a full system reboot.

Since the default value of the new config is 0 (resulting in using the
default huge page size) this should be backwards compatible with old
configs.

Feel free to comment on the phrasing (both in docs and code) and on the
overall change.
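
For example, with the patch applied, a configuration using 1GB pages might
look like this (assuming the new GUC accepts the usual memory units and that
1GB pages have been pre-allocated on the host):

huge_pages = on
huge_page_size = 1GB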

 doc/src/sgml/config.sgml  | 25 ++
 doc/src/sgml/runtime.sgml | 41 +
 src/backend/port/sysv_shmem.c | 88 ---
 src/backend/utils/misc/guc.c  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/storage/pg_shmem.h|  1 +
 6 files changed, 120 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..6177b819ce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1582,6 +1582,31 @@ include_dir 'conf.d'
   
  
 
+ 
+  huge_page_size (integer)
+  
+   huge_page_size configuration parameter
+  
+  
+  
+   
+Controls what size of huge pages is used in conjunction with
+.
+The default is zero (0).
+When set to 0, the default huge page size on the system will
+be used.
+   
+   
+Most modern Linux systems support 2MB and 1GB
+huge pages, and some architectures support other sizes as well. For more information
+on how to check for support and usage, see .
+   
+   
+Controlling huge page size is not supported on Windows.
+   
+  
+ 
+
  
   temp_buffers (integer)
   
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 88210c4a5d..cbdbcb4fdf 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1391,41 +1391,50 @@ export PG_OOM_ADJUST_VALUE=0
 using large values of .  To use this
 feature in PostgreSQL you need a kernel
 with CONFIG_HUGETLBFS=y and
-CONFIG_HUGETLB_PAGE=y. You will also have to adjust
-the kernel setting vm.nr_hugepages. To estimate the
-number of huge pages needed, start PostgreSQL
-without huge pages enabled and check the
-postmaster's anonymous shared memory segment size, as well as the system's
-huge page size, using the /proc file system.  This might look like:
+CONFIG_HUGETLB_PAGE=y. You will also have to pre-allocate
+huge pages with the desired huge page size. To estimate the number of
+huge pages needed, start PostgreSQL without huge
+pages enabled and check the postmaster's anonymous shared memory segment size,
+as well as the system's supported huge page sizes, using the
+/sys file system.  This might look like:
 
 $ head -1 $PGDATA/postmaster.pid
 4170
 $ pmap 4170 | awk '/rw-s/  /zero/ {print $2}'
 6490428K
+$ ls /sys/kernel/mm/hugepages
+hugepages-1048576kB  hugepages-2048kB
+
+
+ You can now choose between the supported sizes, 2MiB and 1GiB in this case.
+ By default PostgreSQL will use the default huge
+ page size on the system, but that can be configured via
+ .
+ The default huge page size can be found with:
+
 $ grep ^Hugepagesize /proc/meminfo
 Hugepagesize:   2048 kB
 
+
+ For 2MiB,
  6490428 / 2048 gives approximately
  3169.154, so in this example we need at
  least 3170 huge pages, which we can set with:
 
-$ sysctl -w vm.nr_hugepages=3170
+$ echo 3170 | tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 
 A larger setting would be appropriate if other programs on the machine
-also need huge pages.  Don't forget to add this setting
-to /etc/sysctl.conf so that it will be reapplied
-after reboots.
+also need huge pages. It is also possible to pre-allocate huge pages on boot
+by adding the kernel parameters hugepagesz=2M hugepages=3170.

 

 Sometimes the kernel is not able to allocate the desired number of huge
-pages immediately, so it might be necessary to repeat the command or to
-reboot.  (Immediately after a reboot, most of the machine's memory
-should be available to convert into huge pages.)  To verify the huge
-page allocation situation, use:
+pages immediately due to external fragmentation, so it might be necessary to
+repeat the command or to reboot. To verify the huge page allocation situation
+for a given size, use:
 
-$ grep Huge 

Re: pg_upgrade fails with non-standard ACL

2020-06-08 Thread Anastasia Lubennikova

On 06.04.2020 19:40, Anastasia Lubennikova wrote:

On 06.04.2020 16:49, Anastasia Lubennikova wrote:
New version of the patch is attached. I fixed several issues in the 
check_for_changed_signatures().


Now it passes check without "test_rename_catalog_objects" and fails 
(generates script) with it. Test script pg_upgrade_ACL_test.sh 
demonstrates this.


The only known issue left is the usage of pg_identify_object(), 
though I don't see a problem here with object types that this patch 
involves.

As I updated the code, I will leave this patch in Need Review.


One more fix for free_acl_infos().
Thanks to cfbot.


One more update.
In this version I rebased test patches,  added some more comments, fixed 
memory allocation issue and also removed code that handles ACLs on 
languages. They require a lot of specific handling, while I doubt that 
their signatures, which consist of language name only, are subject to 
change in any future versions.
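
As an illustration, the kind of setup this check targets is a cluster where
someone ran something like (role name invented):

GRANT EXECUTE ON FUNCTION pg_catalog.pg_stat_get_activity(integer) TO grantee;

If the new version no longer has a function with exactly that signature, the
patch makes pg_upgrade report it and emit revoke_objects.sql up front instead
of failing midway through the upgrade.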


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit c7ce324b4a213184c9b5fec18055921fc846ccad
Author: anastasia 
Date:   Mon Jun 8 18:34:14 2020 +0300

pg_upgrade_ACL_check_v9.patch

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 00aef855dc..94e8528a3a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -16,6 +16,7 @@
 
 static void check_new_cluster_is_empty(void);
 static void check_databases_are_compatible(void);
+static void check_for_changed_signatures(void);
 static void check_locale_and_encoding(DbInfo *olddb, DbInfo *newdb);
 static bool equivalent_locale(int category, const char *loca, const char *locb);
 static void check_is_install_user(ClusterInfo *cluster);
@@ -142,6 +143,8 @@ check_and_dump_old_cluster(bool live_check)
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804)
		new_9_0_populate_pg_largeobject_metadata(&old_cluster, true);
 
+	get_non_default_acl_infos(&old_cluster);
+
 	/*
 	 * While not a check option, we do this now because this is the only time
 	 * the old server is running.
@@ -161,6 +164,7 @@ check_new_cluster(void)
 
 	check_new_cluster_is_empty();
 	check_databases_are_compatible();
+	check_for_changed_signatures();
 
 	check_loadable_libraries();
 
@@ -443,6 +447,223 @@ check_databases_are_compatible(void)
 	}
 }
 
+/*
+ * Find the location of the last dot, return NULL if not found.
+ */
+static char *
+last_dot_location(const char *identity)
+{
+	const char *p,
+			   *ret = NULL;
+
+	for (p = identity; *p; p++)
+		if (*p == '.')
+			ret = p;
+	return unconstify(char *, ret);
+}
+
+/*
+ * check_for_changed_signatures()
+ *
+ * Check that the old cluster doesn't have non-default ACL's for system objects
+ * (relations, attributes, functions and procedures) which have different
+ * signatures in the new cluster. Otherwise generate revoke_objects.sql.
+ */
+static void
+check_for_changed_signatures(void)
+{
+	PGconn	   *conn;
+	char		subquery[QUERY_ALLOC];
+	PGresult   *res;
+	int			ntups;
+	int			i_obj_ident;
+	int			dbnum;
+	bool		need_check = false;
+	FILE	   *script = NULL;
+	bool		found_changed = false;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for system objects with non-default ACL");
+
+	for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
+		if (old_cluster.dbarr.dbs[dbnum].non_def_acl_arr.nacls > 0)
+		{
+			need_check = true;
+			break;
+		}
+	/*
+	 * The old cluster doesn't have system objects with non-default ACL so
+	 * quickly exit.
+	 */
+	if (!need_check)
+	{
+		check_ok();
+		return;
+	}
+
+	snprintf(output_path, sizeof(output_path), "revoke_objects.sql");
+
+	snprintf(subquery, sizeof(subquery),
		/* Get system relations which are created in pg_catalog */
+		"SELECT 'pg_class'::regclass classid, oid objid, 0 objsubid "
+		"FROM pg_catalog.pg_class "
+		"WHERE relnamespace = 'pg_catalog'::regnamespace "
+		"UNION ALL "
		/* Get attributes of system relations created in pg_catalog */
+		"SELECT 'pg_class'::regclass, att.attrelid, att.attnum "
+		"FROM pg_catalog.pg_class rel "
+		"INNER JOIN pg_catalog.pg_attribute att ON  rel.oid = att.attrelid "
+		"WHERE rel.relnamespace = 'pg_catalog'::regnamespace "
+		"UNION ALL "
		/* Get system functions and procedures created in pg_catalog */
+		"SELECT 'pg_proc'::regclass, oid, 0 "
+		"FROM pg_catalog.pg_proc "
+		"WHERE pronamespace = 'pg_catalog'::regnamespace ");
+
+	conn = connectToServer(&new_cluster, "template1");
+	res = executeQueryOrDie(conn,
+		"SELECT ident.type, ident.identity "
+		"FROM (%s) obj, "
+		"LATERAL pg_catalog.pg_identify_object("
+		"obj.classid, obj.objid, obj.objsubid) ident "
+		/*
+		 * Don't rely on database collation, since we use strcmp
+		 * comparison to find non-default ACLs.
+		 */
+		"ORDER BY ident.identity COLLATE \"C\";", subquery);
+	ntups = PQntuples(res);
+
+	i_obj_ident = PQfnumber(res, "identity");
+
+	for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
+	{
+		DbInfo	   *dbinfo = 

Re: should CREATE INDEX ON partitioned_table call PreventInTransactionBlock() ?

2020-06-08 Thread Justin Pryzby
On Mon, Jun 08, 2020 at 11:27:26AM -0400, Alvaro Herrera wrote:
> On 2020-Jun-08, Justin Pryzby wrote:
> 
> > This blocks writes to all partitions until commit:
> > 
> > postgres=# begin; CREATE INDEX ON pt(i);
> > BEGIN
> > CREATE INDEX
> > 
> > Compare with CLUSTER rel1, rel2, ..., and REINDEX {SCHEMA|DATABASE|SYSTEM},
> > which release their locks as soon as each rel is processed.

(Correcting myself, I guess I mean "CLUSTER;" - it doesn't accept multiple
relation arguments.)

> Well, that would also require that transactions are committed and
> started for each partition.  Merely adding PreventInTransactionBlock
> would not do that.  Moreover, since this would break DDL-in-transactions
> that would otherwise work, it should be optional and thus need a keyword
> in the command.  But CONCURRENTLY isn't it (because that means something
> else) so we'd have to discuss what it would be.

I wasn't thinking of a new feature but rather if it would be desirable to
change behavior for v14 to always start/commit transaction for each partition.

-- 
Justin




Re: should CREATE INDEX ON partitioned_table call PreventInTransactionBlock() ?

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-08, Justin Pryzby wrote:

> This blocks writes to all partitions until commit:
> 
> postgres=# begin; CREATE INDEX ON pt(i);
> BEGIN
> CREATE INDEX
> 
> Compare with CLUSTER rel1, rel2, ..., and REINDEX {SCHEMA|DATABASE|SYSTEM},
> which release their locks as soon as each rel is processed.

Well, that would also require that transactions are committed and
started for each partition.  Merely adding PreventInTransactionBlock
would not do that.  Moreover, since this would break DDL-in-transactions
that would otherwise work, it should be optional and thus need a keyword
in the command.  But CONCURRENTLY isn't it (because that means something
else) so we'd have to discuss what it would be.

> I noticed while implementing REINDEX for partitioned tables, which, it seems
> clear, should also avoid slowly accumulating more and more write locks across
> an entire partition hierarchy.

Right.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snowball release

2020-06-08 Thread Tom Lane
Daniel Gustafsson  writes:
> On 8 Jun 2020, at 16:33, Tom Lane  wrote:
>> Hm, I don't see any documentation change in that commit --- don't
>> we have (at least) a list of the stemmers somewhere in the SGML docs?

> IIRC we refer to the Snowball site, and only have a list in the \dFd output,
> but that can be argued to be an example and not expected to be updated to
> match.  Perhaps we should though?

Looking in the commit logs, our past updates 7b925e127 and
fd582317e just updated that \dFd sample.  So I guess that's the
minimum expectation.  Maybe we should think about having a more
formal list in the actual Snowball section (12.6.6)?

regards, tom lane




Re: BufFileRead() error signalling

2020-06-08 Thread Alvaro Herrera
On 2020-Jun-08, Thomas Munro wrote:

> Stepping back a bit, one of the problems here is that we tried to
> model the functions on <stdio.h> fread(), but we didn't provide the
> companion ferror() and feof() functions, and then we were a bit fuzzy
> on when errno is set, even though the <stdio.h> functions don't
> document that.  There were various ways forward, but the one that this
> patch follows is to switch to our regular error reporting system.  The
> only thing that really costs us is marginally more vague error
> messages.  Perhaps that could eventually be fixed by passing in some
> more context into calls like BufFileCreateTemp(), for use in error
> messages and perhaps also path names.

I think using our standard "exception" mechanism makes sense.  As for
additional context, I think usefulness of the error messages would be
improved by showing the file path (because then user knows which
filesystem/tablespace was full, for example), but IMO any additional
context on top of that is of marginal additional benefit.  If we really
cared, we could have errcontext() callbacks in the sites of interest,
but that would be a separate patch and perhaps not backpatchable.
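
As a sketch of what a path-carrying message could look like (illustrative
only; BufFile tracks its segment files internally, and FilePathName() exists
for virtual file descriptors):

ereport(ERROR,
        (errcode_for_file_access(),
         errmsg("could not read block %ld of temporary file \"%s\": %m",
                blknum, FilePathName(file->files[file->curFile]))));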

> > +   elog(ERROR, "could not seek block %ld temporary file", blknum);
> >
> > You mean "in temporary file" in the new message, no?
> 
> Fixed.

The wording we use is "could not seek TO block N".  (Or used to use,
before switching to pread/pwrite in most places, it seems).

Thanks!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snowball release

2020-06-08 Thread Daniel Gustafsson
> On 8 Jun 2020, at 16:33, Tom Lane  wrote:
> 
> Peter Eisentraut  writes:
>> On 2020-05-22 14:40, Peter Eisentraut wrote:
>>> I think some consideration could be given for including this into PG13.
>>> Otherwise, I'll park it for PG14.
> 
>> committed to master
> 
> Hm, I don't see any documentation change in that commit --- don't
> we have (at least) a list of the stemmers somewhere in the SGML docs?

IIRC we refer to the Snowball site, and only have a list in the \dFd output,
but that can be argued to be an example and not expected to be updated to
match.  Perhaps we should though?

cheers ./daniel




Re: snowball release

2020-06-08 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-05-22 14:40, Peter Eisentraut wrote:
>> I think some consideration could be given for including this into PG13.
>> Otherwise, I'll park it for PG14.

> committed to master

Hm, I don't see any documentation change in that commit --- don't
we have (at least) a list of the stemmers somewhere in the SGML docs?

regards, tom lane




Re: Speeding up parts of the planner using a binary search tree structure for nodes

2020-06-08 Thread Ashutosh Bapat
On Tue, 2 Jun 2020 at 03:13, David Rowley  wrote:

>
>
> It does seem likely to me that hash tables would be a more efficient
> structure, but the gains might not be as big as you think. To perform
> a lookup in the table we need to hash the entire node to get the hash
> value, then perform at least one equal comparison.  With the Binary
> Search Lists, we'd need more comparisons, but those can be partial
> comparisons as they can abort early when we find the first mismatching
> field in the Node. When the number of items in the list is fairly
> small that might actually be less work, especially when the Node type
> varies (since that's the first field we compare). Hash tables are
> likely to become more efficient when the number of items is larger.
> The general case is that we'll just have a small number of items. I'd
> like to improve the less common situation when the number of items is
> large with minimal impact for when the number of items is small.
>

Agreed. I think we can again borrow from the way we search for a join:
when the number of joins is small, use a list, and for a larger number use
a hash table. I am not suggesting that we use a list in this case, but the
idea is to use two data structures. Maybe every eclass will use one of them
depending upon the number of members. Queries involving partitioned tables
with lots of partitions will have a large number of child eclass members.
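
A standalone toy sketch of the two-structure idea (the threshold, table size
and all names are invented for illustration; the real thing would hash whole
expression nodes and grow dynamically):

#include <stdio.h>
#include <string.h>

#define HASH_THRESHOLD 32	/* switch to hashing above this member count */
#define TABLE_SIZE 64		/* fixed-size table, enough for the sketch */

typedef struct Members
{
	const char *items[TABLE_SIZE];	/* insertion-ordered, like a List */
	int			nitems;
	int			slots[TABLE_SIZE];	/* hash slots: index into items, -1 = empty */
	int			hash_valid;
} Members;

static unsigned
hash_str(const char *s)
{
	unsigned	h = 5381;

	while (*s)
		h = h * 33 + (unsigned char) *s++;
	return h;
}

static void
build_hash(Members *m)
{
	for (int i = 0; i < TABLE_SIZE; i++)
		m->slots[i] = -1;
	for (int i = 0; i < m->nitems; i++)
	{
		unsigned	pos = hash_str(m->items[i]) % TABLE_SIZE;

		while (m->slots[pos] != -1)	/* linear probing */
			pos = (pos + 1) % TABLE_SIZE;
		m->slots[pos] = i;
	}
	m->hash_valid = 1;
}

static int
member_index(Members *m, const char *key)
{
	if (m->nitems < HASH_THRESHOLD)
	{
		/* small set: linear scan; node comparisons can abort early */
		for (int i = 0; i < m->nitems; i++)
			if (strcmp(m->items[i], key) == 0)
				return i;
		return -1;
	}
	if (!m->hash_valid)
		build_hash(m);
	for (unsigned pos = hash_str(key) % TABLE_SIZE;
		 m->slots[pos] != -1;
		 pos = (pos + 1) % TABLE_SIZE)
		if (strcmp(m->items[m->slots[pos]], key) == 0)
			return m->slots[pos];
	return -1;
}

int
main(void)
{
	Members		m = {{"a.x", "b.y", "c.z"}, 3, {0}, 0};

	printf("%d\n", member_index(&m, "b.y"));	/* prints 1 */
	return 0;
}

Note that the items[] array keeps insertion order regardless of which lookup
path is used, which is the property needed for tlist lookups below.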


>
> > > tlist_member()
> >
> > hash table by expression as key here as well?
>
> The order of the tlist does matter in many cases. I'm unsure how we
> could track the order that items were added to the hash table and then
> obtain the items back in that order in an efficient way. I imagine
> we'd need to store the item in some subsequent data structure, such as
> a List.


If we use a hash table then we retain the list as well. But your idea below
looks better.


> There's obviously going to be some overhead to doing that.
> The idea with the Binary Search Lists was that items would be stored
> in an array, similar to List, but the cells of the list would contain
> the offset index to its parent and left and right child.  New items
> would always take the next free slot in the list, just like List does
> so we'd easily be able to get the insert order by looping over the
> array of elements in order.
>
That seems like a good idea. I am worried that the expression nodes do not
have any inherent ordering and we are proposing to use a structure which
relies on order.

-- 
Best Wishes,
Ashutosh


Re: A wrong index choose issue because of inaccurate statistics

2020-06-08 Thread Ashutosh Bapat
I know one project where they used PostgreSQL code base to detect
"robust plans". https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of
the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf
describe the idea.

In short, the idea is to annotate a plan with a "bandwidth" i.e. how
does the plan fair with degradation of statistics. A plan which has a
slightly higher cost which doesn't degrade much with degradation of
statistics is preferred over a low cost plan whose cost rises sharply
with degradation of statistics. This is similar to what David is
suggesting.


On Fri, Jun 5, 2020 at 12:00 PM Pavel Stehule  wrote:
>
>
>
> pá 5. 6. 2020 v 8:19 odesílatel David Rowley  napsal:
>>
>> On Mon, 1 Jun 2020 at 01:24, Andy Fan  wrote:
>> > The one-line fix describe the exact idea in my mind:
>> >
>> > +++ b/src/backend/optimizer/path/costsize.c
>> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double 
>> > loop_count,
>> >
>> > cpu_run_cost += cpu_per_tuple * tuples_fetched;
>> >
>> > +   /*
>> > +* To make the planner more robust to handle some inaccurate 
>> > statistics
>> > +* issue, we will add a extra cost to qpquals so that the less 
>> > qpquals
>> > +* the lower cost it has.
>> > +*/
>> > +   cpu_run_cost += 0.01 * list_length(qpquals);
>> > +
>> >
>> > This change do fix the issue above, but will it make some other cases 
>> > worse? My
>> > answer is no based on my current knowledge, and this is most important 
>> > place
>> > I want to get advised. The mainly impact I can figure out is: it not only
>> > change the cost difference between (a, b) and (a, c) it also cause the cost
>> > difference between Index scan on (a, c) and seq scan.  However the
>> > cost different between index scan and seq scan are huge by practice so
>> > I don't think this impact is harmful.
>>
>> Didn't that idea already get shot down in the final paragraph on [1]?
>>
>> I understand that you wish to increase the cost by some seemingly
>> innocent constant to fix your problem case.  Here are my thoughts
>> about that: Telling lies is not really that easy to pull off. Bad
>> liers think it's easy and good ones know it's hard. The problem is
>> that the lies can start small, but then at some point the future you
>> must fashion some more lies to account for your initial lie. Rinse and
>> repeat that a few times and before you know it, your course is set
>> well away from the point of truth.  I feel the part about "rinse and
>> repeat" applies reasonably well to how join costing works.  The lie is
>> likely to be amplified as the join level gets deeper.
>>
>> I think you need to think of a more generic solution and propose that
>> instead.  There are plenty of other quirks in the planner that can
>> cause suffering due to inaccurate or non-existing statistics. For
>> example, due to how we multiply individual selectivity estimates,
>> having a few correlated columns in a join condition can cause the
>> number of join rows to be underestimated. Subsequent joins can then
>> end up making bad choices on which join operator to use based on those
>> inaccurate row estimates.  There's also a problem with WHERE  ORDER
>> BY col LIMIT n; sometimes choosing an index that provides pre-sorted
>> input to the ORDER BY but cannot use  as an indexable condition.
>> We don't record any stats to make better choices there, maybe we
>> should, but either way, we're taking a bit risk there as all the rows
>> matching  might be right at the end of the index and we might need
>> to scan the entire thing before hitting the LIMIT. For now, we just
>> assume completely even distribution of rows. i.e. If there are 50 rows
>> estimated in the path and the limit is for 5 rows, then we'll assume
>> we need to read 10% of those before finding all the ones we need. In
>> reality, everything matching  might be 95% through the index and we
>> could end up reading 100% of rows. That particular problem is not just
>> caused by the uneven distribution of rows in the index, but also from
>> selectivity underestimation.
>>
>> I'd more recently wondered if we shouldn't have some sort of "risk"
>> factor to the cost model. I really don't have ideas on how exactly we
>> would calculate the risk factor in all cases, but in this case,  say
>> the risk factor always starts out as 1. We could multiply that risk
>> factor by some >1 const each time we find another index filter qual.
>> add_path() can prefer lower risk paths when the costs are similar.
>> Unsure what the exact add_path logic would be. Perhaps a GUC would
>> need to assist with the decision there.   Likewise, with
>> NestedLoopPaths which have a large number of join quals, the risk
>> factor could go up a bit with those so that we take a stronger
>> consideration for hash or merge joins instead.
>>
>
> I thought about these ideas too. And I am not alone.
>
> https://hal.archives-ouvertes.fr/hal-01316823/document
>
> Regards
>

Re: Vacuuming the operating system documentation

2020-06-08 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-06-07 17:00, Tom Lane wrote:
>> Relevant to the current discussion: this creates a possible positive
>> reason for setting dynamic_shared_memory_type to "sysv", namely if it's
>> the best available way to get around RemoveIPC in a particular situation.
>> Should we document that?

> It sounds like both shared_memory_type and dynamic_shared_memory_type 
> ought to default to "sysv" on Linux.

Per the discussion in the older thread, that would only fix things if we
held at least one attach count constantly on every shared segment.  IIUC,
that's not guaranteed for DSAs.  So changing dynamic_shared_memory_type
would reduce the risk but not really fix anything.

For the primary shm segment, we don't (without EXEC_BACKEND) really
care if somebody unlinks the file prematurely, since backends inherit
the mapping via fork.  Hence, no need to change shared_memory_type.

regards, tom lane




Re: Bump default wal_level to logical

2020-06-08 Thread Tomas Vondra

On Mon, Jun 08, 2020 at 11:10:38AM +0200, Magnus Hagander wrote:

On Mon, Jun 8, 2020 at 8:46 AM Michael Paquier  wrote:


On Mon, Jun 08, 2020 at 11:59:14AM +0530, Amit Kapila wrote:
> I think we should first do performance testing to see what is the
> overhead of making this default.  I think pgbench read-write at
> various scale factors would be a good starting point.  Also, we should
> see how much additional WAL it generates as compared to current
> default.

+1.  I recall that changing wal_level to logical has been discussed in
the past and performance was the actual take to debate on.



That was at least the discussion (long-going and multi-repeated) before we
upped it from minimal to replica. There were some pretty extensive
benchmarking done to prove that the difference was very small, and this was
weighed against the ability to take basic backups of the system (which
arguably is more important than being able to do logical replication).

I agree that we should consider changing it *if* it does not come with a
substantial overhead, but that has to be shown.



I agree performance evaluation is necessary, and I'm willing to spend
some time on it. But I don't think the difference will be much worse
than for the wal_level=replica, at least for common workloads. It's
certainly possible to construct workloads with significant impact, due
to the extra stuff (assignments, cache invalidations and so on).

In general I think the case is somewhat weaker compared to the replica
case, which was required for such basic things as physical backups.



Of course, what would be even neater would be if it could be changed so
you don't have to bounce the server to change the wal_level. That's a
bigger change though, but perhaps it is now possible once we have the
"global barriers" in 13?



Yeah. That would sidestep a lot of the performance concerns, because if
switching from replica to logical is fairly easy / without restart, we
could keep the current default.

Not sure if it's sufficient, though, because switching to logical may
require bumping up number of slots, walsenders, etc. At which point you
actually need a restart. Not to mention that extensions using logical
decoding (like pglogical) need to allocate shared memory etc. But for
the built-in logical replication that is not an issue, ofc.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Wrong width of UNION statement

2020-06-08 Thread Ashutosh Bapat
On Mon, Jun 1, 2020 at 8:35 PM Tom Lane  wrote:
>
> Kenichiro Tanaka  writes:
> > I think the table column width of a UNION statement should be equal to
> > that of UNION ALL.
>
> I don't buy that argument, because there could be type coercions involved,
> so that the result width isn't necessarily equal to any one of the inputs.
>
> Having said that, the example you give shows that we make use of
> pg_statistic.stawidth values when estimating the width of immediate
> relation outputs, but that data isn't available by the time we get to
> a UNION output.  So we fall back to get_typavgwidth, which in this
> case is going to produce something involving the typmod times the
> maximum encoding length.  (I suppose you're using UTF8 encoding...)
>
> There's room for improvement there, but this is all bound up in the legacy
> mess that we have in prepunion.c.  For example, because we don't have
> RelOptInfo nodes associated with individual set-operation outputs,

We already have that infrastructure, IIUC through commit

commit c596fadbfe20ff50a8e5f4bc4b4ff5b7c302ecc0
Author: Robert Haas 
Date:   Mon Mar 19 11:55:38 2018 -0400

Generate a separate upper relation for each stage of setop planning.

Can we use that to fix this bug?
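
For anyone reproducing the original report, a minimal illustration
(hypothetical table; under UTF-8, get_typavgwidth has to assume up to
4 bytes per character of the typmod):

CREATE TABLE t (a varchar(100));
INSERT INTO t SELECT 'short' FROM generate_series(1, 1000);
ANALYZE t;
EXPLAIN SELECT a FROM t UNION ALL SELECT a FROM t;  -- width from stawidth
EXPLAIN SELECT a FROM t UNION SELECT a FROM t;      -- width from get_typavgwidth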

-- 
Best Wishes,
Ashutosh Bapat




TAP tests and symlinks on Windows

2020-06-08 Thread Peter Eisentraut

The tests

src/bin/pg_basebackup/t/010_pg_basebackup.pl
src/bin/pg_rewind/t/004_pg_xlog_symlink.pl

both contain a TAP skip notice "symlinks not supported on Windows".

This is untrue.  Symlinks certainly work on Windows, and we have other 
TAP tests using them, for example for tablespaces.


pg_rewind/t/004_pg_xlog_symlink.pl passes for me on Windows if I just 
remove the skip stuff.  My attached patch does that.


If I remove the skip stuff in pg_basebackup/t/010_pg_basebackup.pl, it 
fails in various interesting ways.  Apparently, some more work is needed 
to get the various paths handled correctly on Windows.  (At least a 
TestLib::perl2host call appears to be required.)  I don't have the 
enthusiasm to fix this right now, but my patch at least updates the 
comment from "this isn't supported" to "this doesn't work correctly yet".


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 822583a01dc4f16d73390e266735d758e1507636 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 6 Jun 2020 09:38:54 +0200
Subject: [PATCH 1/2] Enable some symlink tests on Windows

---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  6 +++---
 src/bin/pg_rewind/t/004_pg_xlog_symlink.pl   | 11 +--
 2 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl 
b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 208df557b8..a10d201378 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -211,11 +211,11 @@
'pg_basebackup tar with long name fails');
 unlink "$pgdata/$superlongname";
 
-# The following tests test symlinks. Windows doesn't have symlinks, so
-# skip on Windows.
+# The following tests need some work to work correctly on Windows, so
+# skip for now.
 SKIP:
 {
-   skip "symlinks not supported on Windows", 18 if ($windows_os);
+   skip "symlink tests not supported on Windows", 18 if ($windows_os);
 
# Move pg_replslot out of $pgdata and create a symlink to it.
$node->stop;
diff --git a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl 
b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
index 639eeb9c91..e271a4ab16 100644
--- a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
+++ b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
@@ -6,16 +6,7 @@
 use File::Copy;
 use File::Path qw(rmtree);
 use TestLib;
-use Test::More;
-if ($windows_os)
-{
-   plan skip_all => 'symlinks not supported on Windows';
-   exit;
-}
-else
-{
-   plan tests => 5;
-}
+use Test::More tests => 5;
 
 use FindBin;
 use lib $FindBin::RealBin;
-- 
2.27.0



Re: Is it useful to record whether plans are generic or custom?

2020-06-08 Thread Masahiro Ikeda

On 2020-06-04 17:04, Atsushi Torikoshi wrote:

On Mon, May 25, 2020 at 10:54 AM Atsushi Torikoshi 
wrote:


On Thu, May 21, 2020 at 5:10 PM Kyotaro Horiguchi
 wrote:


Cost numbers would look better if they are cooked a bit.  Is it worth
being in core?


I didn't come up with ideas about how to use them.
IMHO they might not be so necessary.



Perhaps plan_generation is not needed there.


+1.
Now 'calls' is sufficient and 'plan_generation' may confuse users.

BTW, considering 'calls' in pg_stat_statements is the number of
times
statements were EXECUTED and 'plans' is the number of times
statements were PLANNED,  how about substituting 'calls' for
'plans'?


I've modified the above points and also exposed the counts of
generic plans and custom plans.

I'm now a little bit worried about the change below, which removed
the overflow checking for num_custom_plans that was introduced
in 0001-Expose-counters-of-plancache-to-pg_prepared_statement.patch,
but I've left it because the maximum of int64 seems large enough
for counters.
Also, looking at other counters in pg_stat_user_tables, they don't
seem to care about it.

```
-   /* Accumulate total costs of custom plans, but 'ware
overflow */
-   if (plansource->num_custom_plans < INT_MAX)
-   {
-   plansource->total_custom_cost +=
cached_plan_cost(plan, true);
-   plansource->num_custom_plans++;
-   }

```

Regards,

Atsushi Torikoshi





As a user, I think this feature is useful for performance analysis.
I hope that this feature is merged.

BTW, I found that the function's comment is out of sync with
the modified code in the latest patch. Before this is
committed, please fix it.

```
diff --git a/src/backend/commands/prepare.c 
b/src/backend/commands/prepare.c

index 990782e77f..b63d3214df 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -694,7 +694,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, 
IntoClause *into, ExplainState *es,


 /*
  * This set returning function reads all the prepared statements and
- * returns a set of (name, statement, prepare_time, param_types, 
from_sql).
+ * returns a set of (name, statement, prepare_time, param_types, 
from_sql,

+ * generic_plans, custom_plans, last_plan).
  */
 Datum
 pg_prepared_statement(PG_FUNCTION_ARGS)
```
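
With the patch applied, exercising the new counters might look like this
(column names taken from the patch; output illustrative):

```
PREPARE q (oid) AS SELECT relname FROM pg_class WHERE oid = $1;
EXECUTE q('pg_class'::regclass);
SELECT name, generic_plans, custom_plans FROM pg_prepared_statements;
```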

Regards,

--
Masahiro Ikeda




Re: proposal - plpgsql - FOR over unbound cursor

2020-06-08 Thread Pavel Stehule
po 8. 6. 2020 v 13:40 odesílatel Asif Rehman 
napsal:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:tested, passed
>
> The patch applies cleanly and AFAICS there are no issues with the patch.
>
> The new status of this patch is: Ready for Committer
>

Thank you

Pavel


Re: proposal - plpgsql - FOR over unbound cursor

2020-06-08 Thread Asif Rehman
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

The patch applies cleanly and AFAICS there are no issues with the patch.

The new status of this patch is: Ready for Committer


Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-06-08 Thread Alexey Kondratov

On 2020-06-08 09:14, Michael Paquier wrote:

On Sun, Jun 07, 2020 at 10:05:03PM +0300, Alexander Korotkov wrote:

On Sat, Jun 6, 2020 at 8:49 PM Peter Eisentraut
 wrote:
Why is fe_archive.c in src/common/ and not in src/fe_utils/?  It is 
not

"common" to frontend and backend.


Yep, this seems wrong to me.  I can propose following file rename.

src/common/fe_archive.c => src/fe_utils/archive.c
include/common/fe_archive.h => include/fe_utils/archive.h


Is it technically a problem though?  fe_archive.c is listed as a
frontend-only file with OBJS_FRONTEND in src/common/Makefile.  Anyway,
for this one I would not mind doing the move, so please see the
attached, but I am not completely sure either why src/fe_utils/ had
better be chosen than src/common/.





Perhaps we have more things that are frontend-only in src/common/ that
could be moved to src/fe_utils/?  I am looking at restricted_token.c,
fe_memutils.c, logging.c and file_utils.c here, but that would mean
breaking a couple of declarations, something never free for plugin
developers.



I raised this before in the thread [1], but I guess it went unnoticed:


"I just went through the both patches and realized that I cannot get 
into

semantics of splitting frontend code between common and fe_utils. This
applies only to 0002, where we introduce fe_archive.c. Should it be
placed into fe_utils alongside of the recent recovery_gen.c also used by
pg_rewind? This is a frontend-only code intended to be used by frontend
applications, so fe_utils feels like the right place, doesn't it? Just
tried to do so and everything went fine, so it seems that there is no
obstacles from the build system.

BTW, most of 'common' is really common code with only four exceptions
like logging.c, which is frontend-only. Is it there for historical
reasons only or something else?"

Personally, I would prefer that everything in 'common' was actually
common. I am also not sure about moving older code, because of
possible backward-compatibility breakage, but doing so for newer code
seems to be a good idea.




It actually defines functions that clash with functions in the 
backend,

so this seems doubly wrong.


Let's have frontend version of RestoreArchivedFile() renamed as well.
What about RestoreArchivedFileFrontend()?


-1 from me for the renaming, which was intentional to ease grepping
with the backend counterpart.  We have other cases like that, say
palloc(), fsync_fname(), etc.



I do not like it either. Having the same naming, and a guarantee that this
code is always compiled separately, seems more convenient to me.


[1] 
https://www.postgresql.org/message-id/784fa7dc-414b-9dc9-daae-138033db298c%40postgrespro.ru



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: BUG #16481: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events

2020-06-08 Thread Euler Taveira
On Mon, 8 Jun 2020 at 05:27, Kyotaro Horiguchi 
wrote:

>
> That can be fixed by calling ProcessCompletedNotifies() in
> apply_handle_commit. The function has code to write out
> notifications to connected clients but it does nothing on logical
> replication workers.
>
>
This bug was already reported some time ago (#15293) but it slipped through
the cracks. I don't think you should simply call ProcessCompletedNotifies [1].

[1] https://www.postgresql.org/message-id/13844.1532468610%40sss.pgh.pa.us


-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: race condition when writing pg_control

2020-06-08 Thread amul sul
On Fri, May 29, 2020 at 12:54 PM Fujii Masao 
wrote:

>
>
> On 2020/05/27 16:10, Michael Paquier wrote:
> > On Tue, May 26, 2020 at 07:30:54PM +, Bossart, Nathan wrote:
> >> While an assertion in UpdateControlFile() would not have helped us
> >> catch the problem I initially reported, it does seem worthwhile to add
> >> it.  I have attached a patch that adds this assertion and also
> >> attempts to fix XLogReportParameters().  Since there is only one place
> >> where we feel it is safe to call UpdateControlFile() without a lock, I
> >> just changed it to take the lock.  I don't think this adds any sort of
> >> significant contention risk, and IMO it is a bit cleaner than the
> >> boolean flag.
> >
> > Let's see what Fujii-san and Thomas think about that.  I'd rather
> > avoid taking a lock here because we don't need it and because it makes
> > things IMO confusing with the beginning of StartupXLOG() where a lot
> > of the fields are read, even if we go without this extra assertion.
>
> I have no strong opinion about this, but I tend to agree with Michael here.
>
>
I don't have a strong opinion about this either, but I like Nathan's
approach more: just take the lock in the startup process as well, for
simplicity, if that does not hurt much. Since everywhere apart from the
startup process we have to take the lock to update the control file,
having separate treatment for the startup process looks confusing
to me, IMHO.

Regards,
Amul


Re: Improving psql slash usage help message

2020-06-08 Thread ahsan hadi
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:tested, passed

This small documentation patch makes the psql terminal help more accurate.

The new status of this patch is: Waiting on Author


should CREATE INDEX ON partitioned_table call PreventInTransactionBlock() ?

2020-06-08 Thread Justin Pryzby
This blocks writes to all partitions until commit:

postgres=# begin; CREATE INDEX ON pt(i);
BEGIN
CREATE INDEX

Compare with CLUSTER rel1, rel2, ..., and REINDEX {SCHEMA|DATABASE|SYSTEM},
which release their locks as soon as each rel is processed.
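
The accumulation is easy to observe from a second session while the above
transaction is still open, e.g.:

SELECT relation::regclass, mode, granted
FROM pg_locks
WHERE locktype = 'relation'
ORDER BY 1;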

I noticed while implementing REINDEX for partitioned tables, which, it seems
clear, should also avoid slowly accumulating more and more write locks across
an entire partition hierarchy.

-- 
Justin




RE: BUG #16481: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events

2020-06-08 Thread Vianello Fabio
Hi Kyotaro Horiguchi, thanks for you helps.
We have a question about the bug: why isn't there any fix in HEAD?

This bug has existed since version 10.4 and I can't understand why it is not
fixed in HEAD yet.

BR.
Fabio Vianello.


-Original Message-
From: Kyotaro Horiguchi [mailto:horikyota@gmail.com]
Sent: lunedì 8 giugno 2020 10:28
To: Vianello Fabio ; 
pgsql-b...@lists.postgresql.org; pgsql-hackers@lists.postgresql.org
Subject: Re: BUG #16481: Stored Procedure Triggered by Logical Replication is 
Unable to use Notification Events

Hello.

This seems like a bug to me.

At Fri, 05 Jun 2020 11:05:14 +, PG Bug reporting form 
 wrote in
> The following bug has been logged on the website:
>
> Bug reference:  16481
> Logged by:  Fabio Vianello
> Email address:  fabio.viane...@salvagninigroup.com
> PostgreSQL version: 12.3
> Operating system:   Windows 10
> Description:
>
> About the bug BUG #15293, on PostgreSQL version 10.4 and 11.2 as
> describe below, we found the same issue on the PostgreSQL version 12.3.

The HEAD behaves the same way.

> Is it a feature?
> Because in the documentation we didn't find any constraint that says
> that we cannot use NOTIFY/LISTEN on logical replication tables.
>
> "When using logical replication a stored procedure executed on the
> replica is unable to use NOTIFY to send messages to other listeners.
> The stored procedure does execute as expected but the pg_notify()
> doesn't appear to have any effect. If an insert is run on the replica
> side the trigger executes the stored procedure as expected and the
> NOTIFY correctly notifies listeners.

The message is actually queued, but logical replication worker doesn't signal 
that to listener backends. If any ordinary session sent a message to the same 
listener after that, both messages would be shown at once.

That can be fixed by calling ProcessCompletedNotifies() in apply_handle_commit.
The function has code to write out notifications to connected clients but it
does nothing on logical replication workers.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
SALVAGNINI ITALIA S.p.A.
Via Guido Salvagnini, 51 - IT - 36040 Sarego (VI)
T. +39 0444 725111 | F. +39 0444 43 6404
salvagninigroup.com | salvagnini.it


Re: Bump default wal_level to logical

2020-06-08 Thread Magnus Hagander
On Mon, Jun 8, 2020 at 8:46 AM Michael Paquier  wrote:

> On Mon, Jun 08, 2020 at 11:59:14AM +0530, Amit Kapila wrote:
> > I think we should first do performance testing to see what is the
> > overhead of making this default.  I think pgbench read-write at
> > various scale factors would be a good starting point.  Also, we should
> > see how much additional WAL it generates as compared to current
> > default.
>
> +1.  I recall that changing wal_level to logical has been discussed in
> the past and performance was the actual take to debate on.
>

That was at least the discussion (long-going and multi-repeated) before we
upped it from minimal to replica. There was some pretty extensive
benchmarking done to prove that the difference was very small, and this was
weighed against the ability to take basic backups of the system (which
arguably is more important than being able to do logical replication).

I agree that we should consider changing it *if* it does not come with a
substantial overhead, but that has to be shown.

Of course, what would be even neater would be if it could be changed so you
don't have to bounce the server to change the wal_level. That's a bigger
change though, but perhaps it is now possible once we have the "global
barriers" in 13?
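
(Today the equivalent manual dance is roughly:

ALTER SYSTEM SET wal_level = logical;
-- followed by a full server restart, since wal_level is PGC_POSTMASTER

which is exactly the bounce we would like to avoid.)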

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Read access for pg_monitor to pg_replication_origin_status view

2020-06-08 Thread Kyotaro Horiguchi
At Mon, 8 Jun 2020 16:21:45 +0900, Masahiko Sawada 
 wrote in 
> I've looked at these patches and have one question:
> 
>  REVOKE ALL ON pg_replication_origin_status FROM public;
> 
> +GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats;
> 
> +REVOKE EXECUTE ON FUNCTION pg_show_replication_origin_status() FROM public;
> +
> +GRANT EXECUTE ON FUNCTION pg_show_replication_origin_status() TO
> pg_read_all_stats;
> 
> I thought that this patch makes the pg_replication_origin_status view
> behave like other pg_stat_* views in terms of privileges, but it's slightly
> different. For instance, since we grant all privileges on
> pg_stat_replication to public by default, only a user who either is
> a member of pg_read_all_stats or is a superuser can see all values, but
> other users not having such privileges can still access that view and
> see part of the statistics. On the other hand, with this patch, we
> allow only a user who either is a member of pg_read_all_stats or is
> a superuser to access the pg_replication_origin_status view. Other users
> cannot even access that view. Is there any reason why we grant
> select privilege only to pg_read_all_stats? I wonder if we can have
> pg_replication_origin_status accessible by public and filter some
> column data in pg_show_replication_origin_status() that we don't want
> to show to users who are neither a member of pg_read_all_stats nor
> a superuser.

Yeah, I agree with this (and wrote something like that before).

On the other hand Martín seems to just want to allow other users to
see it while preserving the current behavior.  I also understand that
thought.

> There is a typo in 0001 patch:
> 
> +--
> +-- Permision to execute Replication Origin functions should be
> revoked from public
> +--
> 
> s/Permision/Permission/

Mmm. Right.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BUG #16481: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events

2020-06-08 Thread Kyotaro Horiguchi
Hello.

This seems to me to be a bug.

At Fri, 05 Jun 2020 11:05:14 +, PG Bug reporting form 
 wrote in 
> The following bug has been logged on the website:
> 
> Bug reference:  16481
> Logged by:  Fabio Vianello
> Email address:  fabio.viane...@salvagninigroup.com
> PostgreSQL version: 12.3
> Operating system:   Windows 10
> Description:
> 
> About bug BUG #15293, on PostgreSQL versions 10.4 and 11.2 as described
> below, we found the same issue on PostgreSQL version 12.3.

The HEAD behaves the same way.

> Is it a feature?
> Because in the documentation we didn't find any constraint saying that
> we cannot use NOTIFY/LISTEN on logically replicated tables.
> 
> "When using logical replication a stored procedure executed on the replica
> is
> unable to use NOTIFY to send messages to other listeners. The stored
> procedure does execute as expected but the pg_notify() doesn't appear to
> have any effect. If an insert is run on the replica side the trigger
> executes the stored procedure as expected and the NOTIFY correctly
> notifies
> listeners.

The message is actually queued, but the logical replication worker
doesn't signal it to listener backends. If an ordinary session later
sent a message to the same listener, both messages would be shown at
once.
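
A sketch that reproduces this (table, column, channel, and trigger
names hypothetical; the trigger must be ENABLE ALWAYS so it fires
during replica apply):

-- On the subscriber:
CREATE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('tbl_changes', NEW.id::text);
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER tbl_notify AFTER INSERT ON tbl
  FOR EACH ROW EXECUTE FUNCTION notify_change();
ALTER TABLE tbl ENABLE ALWAYS TRIGGER tbl_notify;

-- In another subscriber session:
LISTEN tbl_changes;
-- Rows arriving via logical replication queue a notification, but the
-- listener is not signaled until some unrelated session sends one.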

That can be fixed by calling ProcessCompletedNotifies() in
apply_handle_commit. The function also has code to write out
notifications to a connected client, but that part does nothing in a
logical replication worker.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b8df8c0cf6ae6c2bcd78ffd7d9bd629f51ab3bee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 8 Jun 2020 16:07:41 +0900
Subject: [PATCH] Signal notifications from logical replication workers

Notifications need to be signaled to listeners, but the logical
replication worker fails to do so. Fix that by signaling notifications
after committing a transaction.
---
 src/backend/replication/logical/worker.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d..28ae89c574 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -33,6 +33,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
+#include "commands/async.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
@@ -517,6 +518,9 @@ apply_handle_commit(StringInfo s)
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
+
+		/* Send out notify signals */
+		ProcessCompletedNotifies();
 	}
 	else
 	{
-- 
2.18.2



Re: Read access for pg_monitor to pg_replication_origin_status view

2020-06-08 Thread Kyotaro Horiguchi
Hello, Martín.

Thanks for the new version.

At Thu, 4 Jun 2020 09:17:18 -0300, Martín Marqués  
wrote in 
> > I'm not sure where to put the GRANT on
> > pg_show_replication_origin_status(), but maybe it also should be at
> > the same place.
> 
> Yes, I agree that it makes the revoking/granting easier to read if
> it's grouped by objects, or groups of objects.
> 
> Done.

0002 looks fine to me.

> > In the previous comment, one point I meant is that the "to the
> > superuser" should be "to superusers", because a PostgreSQL server
> > (cluster) can define multiple superusers. Another is that "permitted
> > to other users by using the GRANT command." might be obscure for
> > users. In this regard I found a more specific description in the same
> > file:
> 
> OK, now I understand what you were saying. :-)

I'm happy to hear that:)

> > So, as a result, it would be like the following (note that, as you
> > know, I'm not good at this kind of task):
> >
> >   Use of functions for replication origin is restricted to superusers.
> >   Use of these functions may be permitted to other users by granting
> >   EXECUTE privilege on the functions.
> >
> > And in regard to the view, granting privileges on both the view and
> > the function to individual users is not practical, so we should
> > mention only granting pg_read_all_stats to users, like the attached
> > patch.
> 
> I did some re-writing of the doc, which is pretty close to what you
> proposed above.

(0003) Unfortunately, the closing tag of EXECUTE is missing its
leading '/'.

I see many nearby occurrences of "This function is restricted to
superusers by default, but other users can be granted EXECUTE to run
the function".  It might be better to use the same wording here, but I
don't insist on that; it's not changed in the attached.

> > By the way, the attachments of your mail are out of order. I'm not
> > sure whether that causes any problems, though.
> 
> That's likely Gmail giving them random order when you attach multiple
> files all at once.
> 
> New patches attached.

- I'm fine with the direction of this patch. It works as expected:
  it makes no behavior changes for the replication origin functions,
  and pg_read_all_stats can read the pg_replication_origin_status
  view.

- The patches apply cleanly to the current HEAD and compile with a
  minor fix (fixed in the attached v5).

- The patches should be merged, but I'll leave that to the committer.

- The commit titles are too long. Each of them should be split into a
  brief title and a description. But I expect the committer will
  rewrite them for the final commit, so we don't need to rewrite them
  right now.

I'll wait a couple of days for comments or opinions from others before
moving this to Ready for Committer.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 857be52f29cf244b7c217bedf87bd4507e49a388 Mon Sep 17 00:00:00 2001
From: Martín Marqués
Date: Tue, 2 Jun 2020 10:44:42 -0300
Subject: [PATCH v5 1/4] Access to the pg_replication_origin_status view
 is restricted to superusers, because it uses
 pg_show_replication_origin_status() and all replication origin
 functions require superuser rights.

This patch removes the hardcoded superuser check when executing
replication origin functions, relying instead on ACL checks.

This patch also revokes execution of such functions from PUBLIC.
---
 src/backend/catalog/system_views.sql     | 16 ++++++++++++++++
 src/backend/replication/logical/origin.c |  5 -----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..6658f0c2eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1469,6 +1469,22 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+--
+-- Permision to execute Replication Origin functions should be revoked from public
+--
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_advance(text, pg_lsn) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_create(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_drop(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_oid(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_progress(text, boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_is_setup() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_progress(boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_reset() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_setup(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_reset() FROM public;
+REVOKE 

Re: Read access for pg_monitor to pg_replication_origin_status view

2020-06-08 Thread Masahiko Sawada
On Thu, 4 Jun 2020 at 21:17, Martín Marqués  wrote:
>
> Hi Kyotaro-san,
>
> > Sorry for not mentioning it at that time, but about the following diff:
> >
> > +GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats;
> >
> > system_views.sql already has a REVOKE command on the view. We should
> > put the above just below the REVOKE command.
> >
> > I'm not sure where to put the GRANT on
> > pg_show_replication_origin_status(), but maybe it also should be at
> > the same place.
>
> Yes, I agree that it makes the revoking/granting easier to read if
> it's grouped by objects, or groups of objects.
>
> Done.
>
> > In the previous comment, one point I meant is that the "to the
> > superuser" should be "to superusers", because a PostgreSQL server
> > (cluster) can define multiple superusers. Another is that "permitted
> > to other users by using the GRANT command." might be obscure for
> > users. In this regard I found a more specific description in the same
> > file:
>
> OK, now I understand what you were saying. :-)
>
> >   Computes the total disk space used by the database with the specified
> >   name or OID.  To use this function, you must
> >   have CONNECT privilege on the specified database
> >   (which is granted by default) or be a member of
> >   the pg_read_all_stats role.
> >
> > So, as a result, it would be like the following (note that, as you
> > know, I'm not good at this kind of task):
> >
> >   Use of functions for replication origin is restricted to superusers.
> >   Use of these functions may be permitted to other users by granting
> >   EXECUTE privilege on the functions.
> >
> > And in regard to the view, granting privileges on both the view and
> > the function to individual users is not practical, so we should
> > mention only granting pg_read_all_stats to users, like the attached
> > patch.
>
> I did some re-writing of the doc, which is pretty close to what you
> proposed above.
>
> > By the way, the attachments of your mail are out of order. I'm not
> > sure whether that causes any problems, though.
>
> That's likely Gmail giving them random order when you attach multiple
> files all at once.
>
> New patches attached.
>

I've looked at these patches and have one question:

 REVOKE ALL ON pg_replication_origin_status FROM public;

+GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats;

+REVOKE EXECUTE ON FUNCTION pg_show_replication_origin_status() FROM public;
+
+GRANT EXECUTE ON FUNCTION pg_show_replication_origin_status() TO
pg_read_all_stats;

I thought this patch would make the pg_replication_origin_status view
behave like the other pg_stat_* views in terms of privileges, but it's
slightly different. For instance, since we grant select on
pg_stat_replication to public by default, only users who are members of
pg_read_all_stats or are superusers can see all values, but users
without such privileges can still access the view and see part of the
statistics. With this patch, on the other hand, only users who are
members of pg_read_all_stats or are superusers can access
pg_replication_origin_status at all; other users cannot even select
from the view. Is there any reason why we grant select privilege only
to pg_read_all_stats? I wonder if we could instead make
pg_replication_origin_status accessible to public and filter, in
pg_show_replication_origin_status(), the column data we don't want to
show to users who are neither members of pg_read_all_stats nor
superusers.

There is a typo in 0001 patch:

+--
+-- Permision to execute Replication Origin functions should be
revoked from public
+--

s/Permision/Permission/

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Debian Sid broke Perl

2020-06-08 Thread Michael Paquier
On Sun, Jun 07, 2020 at 11:06:27AM -0400, Tom Lane wrote:
> The maintainer says he'll revert the change, so I suppose the
> buildfarm will go back to normal without extra effort on our part.

The buildfarm has moved back to a green state as of this writing (see
serinus, for example).
--
Michael


signature.asc
Description: PGP signature


Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-08 Thread Michael Paquier
On Fri, Jun 05, 2020 at 10:25:00PM +0900, Inoue, Hiroshi wrote:
> Keyset-driven cursors always detect changes made by other applications
> (and themselves). currtid() is necessary to detect the changes.
> CTIDs are changed by updates unfortunately.

You mean currtid2() here and not currtid(), right?  We have two
problems here then:
1) We cannot really remove currtid2() from the backend yet without
removing the dependency in the driver, as that may break some users.
2) The driver does not include tests for that functionality yet.
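
For reference, a sketch of the kind of call the driver issues (table
name and TID hypothetical):

-- currtid2(relname, tid) returns the row's current ctid, following any
-- update chain, so a keyset-driven cursor can re-fetch the row:
SELECT currtid2('mytable'::text, '(0,1)'::tid);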
--
Michael


signature.asc
Description: PGP signature


Re: Bump default wal_level to logical

2020-06-08 Thread Michael Paquier
On Mon, Jun 08, 2020 at 11:59:14AM +0530, Amit Kapila wrote:
> I think we should first do performance testing to see what the
> overhead of making this the default would be.  I think pgbench
> read-write at various scale factors would be a good starting point.
> Also, we should see how much additional WAL it generates compared to
> the current default.

+1.  I recall that changing wal_level to logical has been discussed in
the past, and performance was the main point of debate.
--
Michael


signature.asc
Description: PGP signature


Re: Bump default wal_level to logical

2020-06-08 Thread Amit Kapila
On Mon, Jun 8, 2020 at 10:08 AM David Fetter  wrote:
>
> Hi,
>
> I'd like to propose $subject, as embodied in the attached patch. This
> makes it possible to discover and fulfill a need for logical
> replication that can arise at a time when bouncing the server has
> become impractical, i.e. when there is already high demand on it.
>

I think we should first do performance testing to see what the
overhead of making this the default would be.  I think pgbench
read-write at various scale factors would be a good starting point.
Also, we should see how much additional WAL it generates compared to
the current default.
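
For the WAL-volume part, something like this sketch could be run around
the pgbench workload (psql, using \gset for the variable):

-- Before the workload:
SELECT pg_current_wal_lsn() AS before_lsn \gset
-- ... run the pgbench read-write workload here ...
-- Afterwards, show the WAL generated in between:
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn'));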

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: race condition when writing pg_control

2020-06-08 Thread Michael Paquier
On Mon, Jun 08, 2020 at 03:25:31AM +, Bossart, Nathan wrote:
> On 6/7/20, 7:50 PM, "Thomas Munro"  wrote:
>> I pushed 0001 and 0002, squashed into one commit.  I'm not sure about
>> 0003.  If we're going to do that, wouldn't it be better to just
>> acquire the lock in that one extra place in StartupXLOG(), rather than
>> introducing the extra parameter?
> 
> Thanks!  The approach for 0003 was discussed a bit upthread [0].  I do
> not have a strong opinion, but I lean towards just acquiring the lock.

Fujii-san has provided an answer upthread, which can maybe be
translated as a +0.3~0.4:
https://www.postgresql.org/message-id/fc796148-7d63-47bb-e91d-e09b62a50...@oss.nttdata.com

FWIW, if I were doing it, I'd rather just add the parameter and not
take the lock, as the lock is not necessary.  That said, I would also
be fine with just taking the lock if you decide that's simpler, as long
as we add this new assertion as a safety net for future changes.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-08 Thread Amit Kapila
On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar  wrote:
>
> On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila  wrote:
> >
> >
> > Let me know what you think of the changes?  If you find them okay,
> > then feel to include them in the next patch-set.
> >
> > [1] - 
> > https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com
>
> Thanks for the patch, I will review it and include it in my next version.
>

Okay, I have reviewed
0002-Issue-individual-invalidations-with-wal_level-lo.patch, and below
are my comments:

1. I don't think it is a good idea for logical decoding to process both
the new XLOG_XACT_INVALIDATIONS record and the existing WAL records for
invalidations, i.e. XLOG_INVALIDATIONS and what we do in DecodeCommit
(see the code in the check "if (parsed->nmsgs > 0)").  If that is
required for some particular reason, we should write detailed comments
about it.  I have run some experiments to see whether both are really
required:
a. After applying patch 0002, I tried commenting out the processing of
invalidations via DecodeCommit and found that some regression tests
failed, but the reason was that we were not setting
RBTXN_HAS_CATALOG_CHANGES for the toptxn when a subtxn has catalog
changes; once I did that, all regression tests passed.  See the
attached diff patch
(v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop the
0002 patch.
b. The processing of invalidations for XLOG_INVALIDATIONS was added by
commit c6ff84b06a for xid-less transactions.  See
https://postgr.es/m/CAB-SwXY6oH=9twbkxjtgr4uc1nqt-vpyatxcseme62adwyk...@mail.gmail.com
for why it was added.  Now, after this patch, we will process the same
invalidations via both XLOG_XACT_INVALIDATIONS and XLOG_INVALIDATIONS,
which doesn't seem warranted.  Also, the below assertion will fail for
xid-less transactions (try a CREATE INDEX CONCURRENTLY statement):
+        case XLOG_XACT_INVALIDATIONS:
+            {
+                TransactionId xid;
+                xl_xact_invalidations *invals;
+
+                xid = XLogRecGetXid(r);
+                invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+                Assert(TransactionIdIsValid(xid));
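
(A sketch of an xid-less case that should trip this Assert; table and
column names hypothetical:)

-- CREATE INDEX CONCURRENTLY emits invalidation messages from a
-- transaction that has no assigned xid:
CREATE INDEX CONCURRENTLY idx_t_a ON t (a);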

I feel we don't need the processing of XLOG_INVALIDATIONS in logical
decoding after this patch, but to prove that we first need a test case
that exercises XLOG_INVALIDATIONS on HEAD, as commit c6ff84b06a doesn't
add one.  I think we need two code paths for XLOG_XACT_INVALIDATIONS:
if it is for an xid-less transaction, execute the actions immediately,
as we do when processing XLOG_INVALIDATIONS; otherwise, do what the
patch currently does.  If the above point (b) is correct, I am not sure
it is a good idea to use RM_XACT_ID as the resource manager for this
WAL record in LogLogicalInvalidations.  What do you think?

I think one of the usages we still need is in ReorderBufferForget,
because it can be called when we skip processing the txn.  See the
comments in DecodeCommit where we call this function.  If I am correct,
we probably need to collect all invalidations in ReorderBufferTxn, as
we do for tuplecids, and use them here.  We can do the same while
processing XLOG_XACT_INVALIDATIONS.

I had also thought a bit about removing the logging of invalidations
at commit time altogether, but hot-standby processing seems somewhat
tightly coupled with the existing WAL logging.  See xact_redo_commit
(the comment atop the call to ProcessCommittedInvalidationMessages); it
says we need to maintain the order in which we process invalidations.
If we later find a way to avoid that, we can probably remove it, but
for now we can live with it.

2.
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALSMGR_ID)
+ appendStringInfoString(buf, " smgr");
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALRELMAP_ID)

I think the above comment is no longer valid now that we log at each
CommandCounterIncrement (CCI).

3.
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Here, it should check !ctx->forward as we do in DecodeCommit; do we
have any reason not to?  We can test this by changing it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v27-0001-Immediately-WAL-log-subtransaction-and-top-level.patch
Description: Binary data


v27-0002-Issue-individual-invalidations-with-wal_level-lo.patch
Description: Binary data


v27-0003-Incremental-patch-for-0002-to-test-removal-of-du.patch
Description: Binary data


Re: snowball release

2020-06-08 Thread Peter Eisentraut

On 2020-05-22 14:40, Peter Eisentraut wrote:

Snowball has made a release!  With a tag!

I have prepared a patch to update PostgreSQL's copy.  (not attached
here, 566230 bytes, but see
https://github.com/petere/postgresql/commit/52a6133b58c77ada4210a96e5155cbe4da5e5583)

Since we last updated our copy from their commit date 2019-06-24 and the
release is from 2019-10-02, the changes are pretty small and mostly
reformatting.  But there are three new stemmers: Basque, Catalan, Hindi.

I think some consideration could be given to including this in PG13.
Otherwise, I'll park it for PG14.


committed to master
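
A quick way to exercise the new stemmers, assuming the usual
<language>_stem dictionary naming (sample word hypothetical):

SELECT dictname FROM pg_ts_dict
 WHERE dictname IN ('basque_stem', 'catalan_stem', 'hindi_stem');
SELECT ts_lexize('basque_stem', 'etxeak');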

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-06-08 Thread Michael Paquier
On Sun, Jun 07, 2020 at 10:05:03PM +0300, Alexander Korotkov wrote:
> On Sat, Jun 6, 2020 at 8:49 PM Peter Eisentraut
>  wrote:
>> Why is fe_archive.c in src/common/ and not in src/fe_utils/?  It is not
>> "common" to frontend and backend.
> 
> Yep, this seems wrong to me.  I can propose following file rename.
> 
> src/common/fe_archive.c => src/fe_utils/archive.c
> include/common/fe_archive.h => include/fe_utils/archive.h

Is it technically a problem, though?  fe_archive.c is listed as a
frontend-only file with OBJS_FRONTEND in src/common/Makefile.  Anyway,
I would not mind doing the move for this one, so please see the
attached, but I am not completely sure why src/fe_utils/ is a better
choice than src/common/.

>> It actually defines functions that clash with functions in the backend,
>> so this seems doubly wrong. 
> 
> Let's have frontend version of RestoreArchivedFile() renamed as well.
> What about RestoreArchivedFileFrontend()?

-1 from me on the renaming; keeping the same name was intentional, to
ease grepping for the backend counterpart.  We have other cases like
that, say palloc(), fsync_fname(), etc.

Perhaps we have more things that are frontend-only in src/common/ that
could be moved to src/fe_utils/?  I am looking at restricted_token.c,
fe_memutils.c, logging.c and file_utils.c here, but that would mean
breaking a couple of declarations, which is never free for plugin
developers.
--
Michael
From ca75b91498c71b6558bd6befa3ad1510e00cf5fa Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 8 Jun 2020 15:06:02 +0900
Subject: [PATCH] Move frontend-side archive APIs to src/fe_utils/

---
 .../{common/fe_archive.h => fe_utils/archive.h}|  4 ++--
 src/common/Makefile|  1 -
 src/fe_utils/Makefile  |  1 +
 src/{common/fe_archive.c => fe_utils/archive.c}| 10 +++---
 src/bin/pg_rewind/parsexlog.c  |  2 +-
 src/tools/msvc/Mkvcbuild.pm|  8 
 6 files changed, 11 insertions(+), 15 deletions(-)
 rename src/include/{common/fe_archive.h => fe_utils/archive.h} (91%)
 rename src/{common/fe_archive.c => fe_utils/archive.c} (94%)

diff --git a/src/include/common/fe_archive.h b/src/include/fe_utils/archive.h
similarity index 91%
rename from src/include/common/fe_archive.h
rename to src/include/fe_utils/archive.h
index 495b560d24..a6beaf04ea 100644
--- a/src/include/common/fe_archive.h
+++ b/src/include/fe_utils/archive.h
@@ -1,12 +1,12 @@
 /*-
  *
- * fe_archive.h
+ * archive.h
  *	  Routines to access WAL archives from frontend
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * src/include/common/fe_archive.h
+ * src/include/fe_utils/archive.h
  *
  *-
  */
diff --git a/src/common/Makefile b/src/common/Makefile
index d0be882cca..16619e4ba8 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -89,7 +89,6 @@ endif
 # (Mkvcbuild.pm has a copy of this list, too)
 OBJS_FRONTEND = \
 	$(OBJS_COMMON) \
-	fe_archive.o \
 	fe_memutils.o \
 	file_utils.o \
 	logging.o \
diff --git a/src/fe_utils/Makefile b/src/fe_utils/Makefile
index 9eb4417690..dd20663604 100644
--- a/src/fe_utils/Makefile
+++ b/src/fe_utils/Makefile
@@ -20,6 +20,7 @@ include $(top_builddir)/src/Makefile.global
 override CPPFLAGS := -DFRONTEND -I$(libpq_srcdir) $(CPPFLAGS)
 
 OBJS = \
+	archive.o \
 	cancel.o \
 	conditional.o \
 	mbprint.o \
diff --git a/src/common/fe_archive.c b/src/fe_utils/archive.c
similarity index 94%
rename from src/common/fe_archive.c
rename to src/fe_utils/archive.c
index b0d68870db..c4cb213198 100644
--- a/src/common/fe_archive.c
+++ b/src/fe_utils/archive.c
@@ -1,6 +1,6 @@
 /*-
  *
- * fe_archive.c
+ * archive.c
  *	  Routines to access WAL archives from frontend
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -8,15 +8,11 @@
  *
  *
  * IDENTIFICATION
- *	  src/common/fe_archive.c
+ *	  src/fe_utils/archive.c
  *
  *-
  */
 
-#ifndef FRONTEND
-#error "This file is not expected to be compiled for backend code"
-#endif
-
 #include "postgres_fe.h"
 
 #include 
@@ -24,8 +20,8 @@
 
 #include "access/xlog_internal.h"
 #include "common/archive.h"
-#include "common/fe_archive.h"
 #include "common/logging.h"
+#include "fe_utils/archive.h"
 
 
 /*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index d637f5eb77..bc6f976994 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -19,7 +19,7 @@
 #include "catalog/pg_control.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
-#include