Re: backup manifests

2020-03-30 Thread Noah Misch
On Mon, Mar 30, 2020 at 12:16:31PM -0700, Andres Freund wrote:
> On 2020-03-30 15:04:55 -0400, Robert Haas wrote:
> > I guess I'd like to be clear here that I have no fundamental
> > disagreement with taking this tool in any direction that people would
> > like it to go. For me it's just a question of timing. Feature freeze
> > is now a week or so away, and nothing complicated is going to get done
> > in that time. If we can all agree on something simple based on
> > Andres's recent proposal, cool, but I'm not yet sure that will be the
> > case, so what's plan B? We could decide that what I have here is just
> > too little to be a viable facility on its own, but I think Stephen is
> > the only one taking that position. We could release it as
> > pg_validatemanifest with a plan to rename it if other backup-related
> > checks are added later. We could release it as pg_validatebackup with
> > the idea to avoid having to rename it when more backup-related checks
> > are added later, but with a greater possibility of confusion in the
> > meantime and no hard guarantee that anyone will actually develop such
> > checks. We could put it in to pg_checksums, but I think that's really
> > backing ourselves into a corner: if backup validation develops other
> > checks that are not checksum-related, what then? I'd much rather
> > gamble on keeping things together by topic (backup) than technology
> > used internally (checksum). Putting it into pg_basebackup is another
> > option, and would avoid that problem, but it's not my preferred
> > option, because as I noted before, I think the command-line options
> > will get confusing.
> 
> I'm mildly inclined to name it pg_validate, pg_validate_dbdir or
> such. And eventually (definitely not this release) subsume pg_checksums
> in it. That way we can add other checkers too.

Works for me; of those two, I prefer pg_validate.




Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

2020-03-30 Thread Fabien COELHO


Hello,


As I wrote about an earlier version of the patch, ISTM that instead of
reinventing, extending, adapting various ls variants (with/without
metadata, which show only files, which shows target of links, which shows
directory, etc.) we would just need *one* postgres "ls" implementation
which would be like "ls -la arg" (returns file type, dates), and then
everything else is a wrapper around that with appropriate filtering that
can be done at the SQL level, like you started with recurse.


Yeah, I agree that some new function that can represent symlinks
explicitly in its output is the place to deal with this, for
people who want to deal with it.

In the meantime, there's still the question of what pg_ls_dir_files
should do exactly.  Are we content to have it ignore symlinks?
I remain inclined to think that's the right thing given its current
brief.


My 0.02€:

I agree that it is enough to reproduce the current behavior of the various 
existing pg_ls* functions, but on the other hand outputting a column type 
char like ls does (-, d, l…) really looks like no big deal. I'd say that the 
only reason not to do it would be to get this in before feature freeze.
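
For illustration, here is a minimal standalone sketch of such a type-char
column, assuming plain POSIX lstat() and nothing from the patch under
discussion:

    #include <stdio.h>
    #include <sys/stat.h>

    /* Map a path to an ls-style type character: '-', 'd', 'l', or '?'. */
    static char
    file_type_char(const char *path)
    {
        struct stat st;

        if (lstat(path, &st) != 0)
            return '?';             /* could not stat the entry */
        if (S_ISDIR(st.st_mode))
            return 'd';             /* directory */
        if (S_ISLNK(st.st_mode))
            return 'l';             /* symbolic link */
        return '-';                 /* regular file or anything else */
    }

    int
    main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%c %s\n", file_type_char(argv[i]), argv[i]);
        return 0;
    }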


--
Fabien.

Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-30 Thread Kyotaro Horiguchi
Thank you for looking at this and for taking the trouble to rebase!

At Mon, 30 Mar 2020 20:03:27 -0300, Alvaro Herrera  
wrote in 
> I rebased this patch; it's failing to apply due to minor concurrent
> changes in PostgresNode.pm.  I squashed the patches in a series that
> made the most sense to me.
> 
> I have a question about static variable lastFoundOldestSeg in
> FindOldestXLogFileSegNo.  It may be set the first time the function
> runs; if it is, the function never again does anything, it just returns
> that value.  In other words, the static value is never reset; it never
> advances either.  Isn't that strange?  I think the coding is to assume
> that XLogCtl->lastRemovedSegNo will always be set, so its code will
> almost never run ... except when the very first wal file has not been
> removed yet.  This seems weird and pointless.  Maybe we should think
> about this differently -- example: if XLogGetLastRemovedSegno returns
> zero, then the oldest file is the zeroth one.  In what cases this is
> wrong?  Maybe we should fix those.

That's right, but without the static variable, every call to the
pg_replication_slots view before the first checkpoint would cause a scan of
pg_xlog. XLogCtl->lastRemovedSegNo advances only at a checkpoint, so
it is actually right that the return value from
FindOldestXLogFileSegNo doesn't change until the first checkpoint.

We could also set XLogCtl->lastRemovedSegNo at startup, but that scan of
pg_xlog would be useless in most cases.

I avoided updating XLogCtl->lastRemovedSegNo directly, but a third
way would be: if XLogGetLastRemovedSegno() returns 0, set
XLogCtl->lastRemovedSegNo by scanning the WAL directory. The attached
patch takes this approach.
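
Roughly, that fallback could look like the sketch below; ScanWalDirForOldestSegNo()
and XLogSetLastRemovedSegno() are hypothetical helpers standing in for whatever
the attached patch actually does:

    /*
     * Sketch only: if the shared last-removed segment number is still unset
     * (0), derive it once by scanning the WAL directory, so that later calls
     * can rely on the shared value.
     */
    static XLogSegNo
    GetOldestKeptSegNo(void)
    {
        XLogSegNo   last_removed = XLogGetLastRemovedSegno();

        if (last_removed == 0)
        {
            /* hypothetical helper: scan pg_wal for the oldest existing segment */
            XLogSegNo   oldest = ScanWalDirForOldestSegNo();

            /* hypothetical setter for XLogCtl->lastRemovedSegNo */
            XLogSetLastRemovedSegno(oldest - 1);
            return oldest;
        }

        /* the oldest remaining segment follows the last removed one */
        return last_removed + 1;
    }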

> Regarding the PostgresNode change in 0001, I think adding a special
> parameter for primary_slot_name is limited.  I'd like to change the
> definition so that anything that you give as a parameter that's not one
> of the recognized keywords (has_streaming, etc) is tested to see if it's
> a GUC; and if it is, then put it in postgresql.conf.  This would have to
> apply both to PostgresNode::init() as well as
> PostgresNode::init_from_backup(), obviously, since it would make no
> sense for the APIs to diverge on this point.  So you'd be able to do
>   $node->init_from_backup(allow_streaming => 1, work_mem => "4MB");
> without having to add code to init_from_backup to handle work_mem
> specifically.  This could be done by having a Perl hash with all the GUC
> names, that we could read lazily from "postmaster --describe-config" the
> first time we see an unrecognized keyword as an option to init() /
> init_from_backup().

Done that way. We could exclude "known" parameters by explicitly
deleting the key when reading it, but I chose to enumerate the known
keywords.  Although it could be used widely, I actually changed only
018_repslot_limit.pl to use the feature.

> I edited the doc changes a bit.
> 
> I don't know what to think of 0003 yet.  Has this been agreed to be a
> good idea?

That's why it is a separate patch. I think it has been neither approved nor
rejected.  The main objective of the patch is to prevent
pg_replication_slots.wal_status from strangely coming back from the
"lost" state to other states. However, in the first place I doubt that
it's right for logical replication to send the content of a WAL segment
that has already been recycled.

> I also made a few small edits to the code; all cosmetic so far:
> 
> * added long_desc to the new GUC; it now reads:
> 
> {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
> gettext_noop("Sets the maximum size of WAL space reserved by 
> replication slots."),
> gettext_noop("Replication slots will be marked as failed, and 
> segments released "
>  "for deletion or recycling, if this much space is 
> occupied by WAL "
>  "on disk."),
> 
> * updated the comment to ConvertToXSegs() which is now being used for
>   this purpose
> 
> * remove outdated comment to GetWalAvailability; it was talking about
>   restBytes parameter that no longer exists

Thank you for the fixes. All of them look fine.

I fixed several typos. (s/requred/required/, s/devinitly/definitely/,
s/errror/error/)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 95165d218d633e354c8136d2e200d83685ef3799 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 19 Dec 2018 12:43:57 +0900
Subject: [PATCH v20 1/3] Allow arbitrary GUC parameter setting in init and
 init_from_backup in TAP test.

It is convenient that arbitrary GUC parameters can be specified when
initializing a node or taking a backup.
---
 src/test/perl/PostgresNode.pm | 32 
 1 file changed, 32 insertions(+)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..4671dc5eb1 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -416,6 +416,13 @@ parameter allows_streaming => 'logical' or 'physical' 

Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-03-30 Thread Masahiko Sawada
On Tue, 31 Mar 2020 at 12:58, Amit Kapila  wrote:
>
> On Mon, Mar 30, 2020 at 12:31 PM Masahiko Sawada
>  wrote:
> >
> > The patch for vacuum conflicts with recent changes in vacuum. So I've
> > attached rebased one.
> >
>
> + /*
> + * Next, accumulate buffer usage.  (This must wait for the workers to
> + * finish, or we might get incomplete data.)
> + */
> + for (i = 0; i < nworkers; i++)
> + InstrAccumParallelQuery(&lps->buffer_usage[i]);
> +
>
> This should be done for launched workers aka
> lps->pcxt->nworkers_launched.  I think a similar problem exists in
> create index related patch.

You're right. Fixed in the new patches.

On Mon, 30 Mar 2020 at 17:00, Julien Rouhaud  wrote:
>
> Just minor nitpicking:
>
> +   int i;
>
> Assert(!IsParallelWorker());
> Assert(ParallelVacuumIsActive(lps));
> @@ -2166,6 +2172,13 @@ lazy_parallel_vacuum_indexes(Relation *Irel, 
> IndexBulkDeleteResult **stats,
> /* Wait for all vacuum workers to finish */
> WaitForParallelWorkersToFinish(lps->pcxt);
>
> +   /*
> +* Next, accumulate buffer usage.  (This must wait for the workers to
> +* finish, or we might get incomplete data.)
> +*/
> +   for (i = 0; i < nworkers; i++)
> +   InstrAccumParallelQuery(&lps->buffer_usage[i]);
>
> We now allow declaring a variable in those loops, so it may be better to avoid
> declaring i outside the for scope?

We can do that, but I was not sure if it's good since other code
around there doesn't do that. So I'd like to leave it to the committer;
it's a trivial change.
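
Putting the two review comments from this thread together, the accumulation
loop would end up looking roughly like this (a sketch assembled from the
snippets quoted above, not the committed code):

    /*
     * Next, accumulate buffer usage.  (This must wait for the workers to
     * finish, or we might get incomplete data.)  Loop only over the workers
     * that were actually launched, with a loop-scoped counter.
     */
    for (int i = 0; i < lps->pcxt->nworkers_launched; i++)
        InstrAccumParallelQuery(&lps->buffer_usage[i]);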

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


bufferusage_vacuum_v3.patch
Description: Binary data


bufferusage_create_index_v2.patch
Description: Binary data


Re: Internal key management system

2020-03-30 Thread Masahiko Sawada
On Tue, 31 Mar 2020 at 09:36, Cary Huang  wrote:
>
> Hi
> I had a look on kms_v9 patch and have some comments
>
> --> pg_upgrade.c
> keys are copied correctly, but as pg_upgrade progresses further, it will try 
> to start the new_cluster from "issue_warnings_and_set_wal_level()" function, 
> which is called after key copy. The new cluster will fail to start due to the 
> mismatch between cluster_passphrase_command and the newly copied keys. This 
> causes pg_upgrade to always finish with failure. We could move 
> "copy_master_encryption_key()" to be called after 
> "issue_warnings_and_set_wal_level()" and this will make pg_upgrade to finish 
> with success, but user will still have to manually correct the 
> "cluster_passphrase_command" param on the new cluster in order for it to 
> start up correctly. Should pg_upgrade also take care of copying 
> "cluster_passphrase_command" param from old to new cluster after it has 
> copied the encryption keys so users don't have to do this step? If the 
> expectation is for users to manually correct "cluster_passphrase_command" 
> param after successful pg_upgrade and key copy, then there should be a 
> message to remind the users to do so.

I think both the old cluster and the new cluster must be initialized
with the same passphrase at initdb. Specifying a different
passphrase command for the new cluster at initdb and changing it after
pg_upgrade doesn't make sense. Also I don't think we need to copy
cluster_passphrase_command the way we do other GUC parameters.

I've changed the patch so that pg_upgrade copies the crypto keys only
if both the new and old clusters enable key management. The user must
specify the same passphrase command for both the old and new clusters, which
is not cumbersome, I think. I also added a description of this to
the docs.

>
> -->Kmgr.c
> + /*
> + * If there is only temporary directory, it means that the previous
> + * rotation failed after wrapping the all internal keys by the new
> + * passphrase.  Therefore we use the new cluster passphrase.
> + */
> + if (stat(KMGR_DIR, ) != 0)
> + {
> + ereport(DEBUG1,
> + (errmsg("both directories %s and %s exist, use the newly wrapped keys",
> + KMGR_DIR, KMGR_TMP_DIR)));
>
> I think the error message should say "there is only temporary directory 
> exist" instead of "both directories exist"

You're right. Fixed.

I've attached the new version patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


kms_v10.patch
Description: Binary data


Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-03-30 Thread Amit Kapila
On Mon, Mar 30, 2020 at 12:31 PM Masahiko Sawada
 wrote:
>
> The patch for vacuum conflicts with recent changes in vacuum. So I've
> attached rebased one.
>

+ /*
+ * Next, accumulate buffer usage.  (This must wait for the workers to
+ * finish, or we might get incomplete data.)
+ */
+ for (i = 0; i < nworkers; i++)
+ InstrAccumParallelQuery(&lps->buffer_usage[i]);
+

This should be done for launched workers aka
lps->pcxt->nworkers_launched.  I think a similar problem exists in
create index related patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-30 Thread David Rowley
On Sat, 28 Mar 2020 at 22:22, David Rowley  wrote:
> I'm unsure yet if this has caused an instability on lousyjack's run in
> [1].

pogona has just joined in on the fun [1], so we're not out of the woods
on this yet. I'll start having a look at this in more detail.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona=2020-03-30%2023%3A10%3A03




Re: error context for vacuum to include block number

2020-03-30 Thread Justin Pryzby
On Tue, Mar 31, 2020 at 07:50:45AM +0530, Amit Kapila wrote:
> On Mon, Mar 30, 2020 at 9:56 PM Justin Pryzby  wrote:
> >
> > On Mon, Mar 30, 2020 at 02:31:53PM +0530, Amit Kapila wrote:
> > > The v37-0003-Avoid-some-calls-to-RelationGetRelationName.patch looks
> > > good to me.  I have added the commit message in the patch.
> >
> > I realized the 0003 patch has an error in lazy_vacuum_index; it should be:
> >
> > -   RelationGetRelationName(indrel),
> > +   vacrelstats->indname,
> >
> 
> Hmm, it is like that in the patch I have sent yesterday.  Are you
> referring to the patch I have sent yesterday or some older version?

Oh good.  That was a recent fix I made, and I was afraid I'd never sent it, and
not sure if you'd used it.  Looks like it was fixed since v36...  As you can
see, I'm losing track of my branches.  It will be nice to finally put this to
rest.

> One thing I have noticed is that there is some saving by using
> vacrelstats->relnamespace as that avoids sys cache lookup.  OTOH,
> using vacrelstats->relname doesn't save much, but maybe for the sake
> of consistency, we can use it.

Mostly I wrote that to avoid repeatedly calling functions/macros with long
names.  I consider it a minor cleanup.  I think we should put them to use.  The
LVRelStats comments describe them as not being specifically for the error context.

-- 
Justin




Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-30 Thread Fujii Masao




On 2020/03/31 3:16, Julien Rouhaud wrote:

On Mon, Mar 30, 2020 at 6:36 PM Fujii Masao  wrote:


On 2020/03/30 17:03, Julien Rouhaud wrote:

On Mon, Mar 30, 2020 at 01:56:43PM +0900, Fujii Masao wrote:



On 2020/03/29 15:15, Julien Rouhaud wrote:

On Fri, Mar 27, 2020 at 03:42:50PM +0100, Julien Rouhaud wrote:

On Fri, Mar 27, 2020 at 2:01 PM Fujii Masao  wrote:





So what I'd like to say is that the information that users are interested
in would vary on each situation and case. At least for me it seems
enough for pgss to report only the basic information. Then users
can calculate to get the numbers (like total_time) they're interested in,
from those basic information.

But of course, I'd like to hear more opinions about this...


+1

Unless someone chimes in by tomorrow, I'll just drop the sum, as it
seems less controversial and not a blocker in userland if users are
interested.


Done in attached v11, with also the s/querytext/query_text/ discrepancy noted
previously.


Thanks for updating the patch! But I still think query_string is a better
name because it's used in several other places, for the sake of consistency.


You're absolutely right.  That's what I actually wanted to do given your
previous comment, but somehow managed to miss it, sorry about that and thanks
for fixing.


So I changed the argument name that way and committed the 0001 patch.
If you think query_text is better, let's keep discussing this topic!

Anyway many thanks for your great job!


Thanks a lot!




I also exported BufferUsageAccumDiff as mentioned previously, as it seems
cleaner and will avoid useless code churn in the future, and ran pgindent.


Many thanks!! I'm thinking of committing this part separately.
So I made that patch based on your patch. Attached.


Thanks! It looks good to me.


I also kept that part in a distinct commit for convenience.


I also pushed 0002 patch. Thanks!

I will review 0003 patch again.


And thanks for that too :)


While testing the patched pgss, I found that the patched version
may track the statements that the original version doesn't.
Please imagine the case where the following queries are executed,
with pgss.track = top.

  PREPARE hoge AS SELECT * FROM t;
  EXPLAIN EXECUTE hoge;

The pgss view returned "PREPARE hoge AS SELECT * FROM t"
in the patched version, but not in the original version.

Is this problematic?


Oh indeed. That's a side effect of the executed query and the planned
query being different.

I guess the question is to decide, when the top-level executed query is a
utility statement containing an optimisable query, whether the top-level
planner call for that optimisable statement should be considered top level
or not.  I tend to think that's the correct behavior here, as this is
also what would happen if a regular DML statement was provided. What do you
think?


TBH, I'm not sure if that's ok yet...

I'm now just wondering whether both plan_nested_level and
exec_nested_level should be incremented in pgss_ProcessUtility().
This is just a guess, so I need to investigate this more.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-03-30 Thread Tom Lane
[ not a review, just some drive-by comments on David's comments ]

David Rowley  writes:
> 2. The following change does not seem like it should be part of this
> patch.  I understand you perhaps have done as you think it will
> improve the performance of checking if an expression is in a list of
> expressions.

> - COMPARE_SCALAR_FIELD(varno);
> + /* Compare varattno first since it has higher selectivity than varno */
>   COMPARE_SCALAR_FIELD(varattno);
> + COMPARE_SCALAR_FIELD(varno);

> If you think that is true, then please do it as a separate effort and
> provide benchmarks with your findings.

By and large, I'd reject such micro-optimizations on their face.
The rule in the nodes/ support files is to list fields in the same
order they're declared in.  There is no chance that it's worth
deviating from that for this.

I can believe that there'd be value in, say, comparing all
scalar fields before all non-scalar ones.  But piecemeal hacks
wouldn't be the way to handle that either.  In any case, I'd
prefer to implement such a plan within the infrastructure to
auto-generate these files that Andres keeps muttering about.

> a. including a function call in the foreach macro is not a practise
> that we really follow. It's true that the macro now assigns the 2nd
> param to a variable. Previous to 1cff1b95ab6 this was not the case and
> it's likely best not to leave any bad examples around that code which
> might get backported might follow.

No, I think you're misremembering.  foreach's second arg is
single-evaluation in all branches.  There were some preliminary
versions of 1cff1b95ab6 in which it would not have been, but that
was sufficiently dangerous that I found a way to get rid of it.

> b. We generally subtract InvalidAttrNumber from varattno when
> including in a Bitmapset.

ITYM FirstLowInvalidHeapAttributeNumber, but yeah.  Otherwise
the code fails on system columns, and there's seldom a good
reason to risk that.
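
For reference, the usual idiom Tom is describing looks roughly like this (a
sketch, not the patch under review):

    /*
     * Offset attribute numbers by FirstLowInvalidHeapAttributeNumber so that
     * system columns (which have negative attnos) map to valid, non-negative
     * bitmapset members.
     */
    rel->not_null_cols =
        bms_add_member(rel->not_null_cols,
                       var->varattno - FirstLowInvalidHeapAttributeNumber);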

regards, tom lane




Re: improve transparency of bitmap-only heap scans

2020-03-30 Thread Amit Kapila
On Mon, Mar 30, 2020 at 9:59 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Sat, Mar 28, 2020 at 8:02 PM James Coleman  wrote:
> >> I'm curious if Tom's objection is mostly on the grounds that we should
> >> be consistent in what's displayed, or that he thinks the information
> >> is likely to be useless.
>
> > Yeah, it would be good if he clarifies his position.
>
> Some of both: it seems like these ought to be consistent, and the
> lack of complaints so far about regular index-only scans suggests
> that people don't need the info.  But perhaps we ought to add
> similar info in both places.
>

Fair enough.  I have marked this CF entry as RWF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Opclass parameters

2020-03-30 Thread Justin Pryzby
On Sat, Mar 28, 2020 at 06:05:51PM +0300, Alexander Korotkov wrote:
> On Wed, Mar 18, 2020 at 3:28 AM Nikita Glukhov  
> wrote:
> > Attached new version of reordered patches.
> 
> I'm going to push this if no objections.

Find attached patch with editorial corrections to docs for this commit.

--word-diff to follow.

commit d3f077b813efa90b25a162bf8d227f3e4218c248
Author: Justin Pryzby 
Date:   Mon Mar 30 20:55:06 2020 -0500

Doc review: Implement operator class parameters

commit 911e70207703799605f5a0e8aad9f06cff067c63
Author: Alexander Korotkov 

diff --git a/doc/src/sgml/hstore.sgml b/doc/src/sgml/hstore.sgml
index f1f2b08cd7..b2e04d0815 100644
--- a/doc/src/sgml/hstore.sgml
+++ b/doc/src/sgml/hstore.sgml
@@ -468,13 +468,13 @@ CREATE INDEX hidx ON testhstore USING GIN (h);


  
   gist_hstore_ops GiST opclass approximates {+a+} set of
   key/value pairs as a bitmap signature.  [-Optional-]{+Its optional+} integer 
parameter
   siglen[-of gist_hstore_ops-] 
determines {+the+}
   signature length in bytes.  [-Default signature-]{+The default+} length is 
16 bytes.
   Valid values of signature length are between 1 and 2024 bytes.  Longer
   signatures [-leads-]{+lead+} to {+a+} more precise search [-(scan 
less-]{+(scanning a smaller+} fraction of [-index, scan-]
[-   less-]{+the index and+}
{+   fewer+} heap pages), [-but-]{+at the cost of a+} larger index.
  

  
diff --git a/doc/src/sgml/intarray.sgml b/doc/src/sgml/intarray.sgml
index 72b4b23c15..7956a746a6 100644
--- a/doc/src/sgml/intarray.sgml
+++ b/doc/src/sgml/intarray.sgml
@@ -265,7 +265,7 @@
  

  
   Two [-parametrized-]{+parameterized+} GiST index operator classes are 
provided:
   gist__int_ops (used by default) is suitable for
   small- to medium-size data sets, while
   gist__intbig_ops uses a larger signature and is more
@@ -276,22 +276,23 @@
  
   
  
   gist__int_ops approximates {+an+} integer set as an array 
of
   integer ranges.  [-Optional-]{+Its optional+} integer parameter 
numranges[-of-]
[-   gist__int_ops-]
   determines {+the+} maximum number of ranges in
   one index key.  [-Default-]{+The default+} value of 
numranges is 100.
   Valid values are between 1 and 253.  Using larger arrays as GiST index
   keys leads to {+a+} more precise search [-(scan less-]{+(scaning a smaller+} 
fraction of [-index, scan less-]{+the index and+}
{+   fewer+} heap pages), [-but-]{+at the cost of a+} larger index.
  
   
  
   gist__intbig_ops approximates {+an+} integer set as a 
bitmap
   [-signature.  Optional-]{+signature XXX.  Its optional+} integer parameter 
siglen[-of-]
[-   gist__intbig_ops-]
   determines {+the+} signature length in bytes.
   [-Default-]{+The default+} signature length is 16 bytes.  Valid values of 
signature length
   are between 1 and 2024 bytes.  Longer signatures [-leads-]{+lead+} to {+a+} 
more precise
   search [-(scan less-]{+(scanning a smaller+} fraction of [-index, scan 
less-]{+the index and fewer+} heap pages), [-but-]{+at+}
{+   the cost of a+} larger index.
  

  
diff --git a/doc/src/sgml/ltree.sgml b/doc/src/sgml/ltree.sgml
index ae4b33ec85..4971b71524 100644
--- a/doc/src/sgml/ltree.sgml
+++ b/doc/src/sgml/ltree.sgml
@@ -506,16 +506,16 @@ Europe  Russia*@  !Transportation
 @, ~, ?


 gist_ltree_ops GiST opclass approximates {+a+} set of
 path labels as a bitmap signature.  [-Optional-]{+Its optional+} integer 
parameter
 siglen[-of gist_ltree_ops-] 
determines {+the+}
 signature length in bytes.  [-Default-]{+The default+} signature length is 
8 bytes.
 Valid values of signature length are between 1 and 2024 bytes.  Longer
 signatures [-leads-]{+lead+} to {+a+} more precise search [-(scan 
less-]{+(scanning a smaller+} fraction of [-index, scan-]
[- less-]{+the index and+}
{+ fewer+} heap pages), [-but-]{+at the cost of a+} larger index.


 Example of creating such an index with [-a-]{+the+} default signature 
length of 8 bytes:


CREATE INDEX path_gist_idx ON test USING GIST (path);
@@ -535,13 +535,13 @@ CREATE INDEX path_gist_idx ON test USING GIST (path 
gist_ltree_ops(siglen=100));
 @, ~, ?


 gist__ltree_ops GiST opclass works 
[-similar-]{+similarly+} to
 gist_ltree_ops and also takes signature length as
 a parameter.  [-Default-]{+The default+} value of 
siglen in
  gist__ltree_ops is 28 bytes.


 Example of creating such an index with [-a-]{+the+} default signature 
length of 28 bytes:


CREATE INDEX path_gist_idx ON test USING GIST (array_path);
diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
index dde02634ae..97b3d13a88 100644
--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@@ -391,13 +391,13 @@ CREATE INDEX trgm_idx ON test_trgm USING GIN (t 
gin_trgm_ops);
  

  
   gist_trgm_ops GiST opclass approximates {+a+} set of
   trigrams as a bitmap signature.  [-Optional-]{+Its optional+} integer 
parameter
   siglen[-of gist_trgm_ops-] determines 

Re: error context for vacuum to include block number

2020-03-30 Thread Amit Kapila
On Mon, Mar 30, 2020 at 9:56 PM Justin Pryzby  wrote:
>
> On Mon, Mar 30, 2020 at 02:31:53PM +0530, Amit Kapila wrote:
> > The v37-0003-Avoid-some-calls-to-RelationGetRelationName.patch looks
> > good to me.  I have added the commit message in the patch.
>
> I realized the 0003 patch has an error in lazy_vacuum_index; it should be:
>
> -   RelationGetRelationName(indrel),
> +   vacrelstats->indname,
>

Hmm, it is like that in the patch I have sent yesterday.  Are you
referring to the patch I have sent yesterday or some older version?
One thing I have noticed is that there is some saving by using
vacrelstats->relnamespace as that avoids sys cache lookup.  OTOH,
using vacrelstats->relname doesn't save much, but maybe for the sake
of consistency, we can use it.

> That was maybe due to originally using a separate errinfo for each phase, with
> one "char *relname" and no "char *indrel".
>
> > I don't think the above change is correct.  How will vacrelstats have
> > correct values when vacuum_one_index is called via parallel workers
> > (via parallel_vacuum_main)?
>
> You're right: parallel main's vacrelstats was added by this patchset and only
> the error context fields were initialized.  I fixed it up in the attached by
> also setting vacrelstats->new_rel_tuples and old_live_tuples.  It's not clear
> if this is worth it just to save an argument to two functions?
>

Right, it is not clear to me whether that is an improvement, so I
suggest we leave that patch for now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-03-30 Thread David Rowley
On Sun, 29 Mar 2020 at 20:50, Andy Fan  wrote:
> Some other changes made in the new patch:
> 1.   Fixed bug for UniqueKey calculation for OUTER join.
> 2.   Fixed some typo error in comments.
> 3.   Renamed the field "grantee" as "guarantee".

I've had a look over this patch. Thank you for doing further work on it.

I've noted down the following during my read of the code:

1. There seem to be some cases where joins are no longer being
detected as unique. This is evident in postgres_fdw.out. We shouldn't
be regressing any of these cases.

2. The following change does not seem like it should be part of this
patch.  I understand you perhaps have done as you think it will
improve the performance of checking if an expression is in a list of
expressions.

- COMPARE_SCALAR_FIELD(varno);
+ /* Compare varattno first since it has higher selectivity than varno */
  COMPARE_SCALAR_FIELD(varattno);
+ COMPARE_SCALAR_FIELD(varno);

If you think that is true, then please do it as a separate effort and
provide benchmarks with your findings.

3. list_all_members_in. I think this would be better named as
list_is_subset. Please follow the lead of bms_is_subset().
Additionally, you should Assert that IsPointerList is true as there's
nothing else to indicate that it can't be used for an int or Oid list.

4. guarantee is not a very good name for the field in UniqueKey.
Maybe something like is_not_null?

5. I think you should be performing a bms_del_member during join
removal rather than removing this Assert()

- Assert(bms_equal(rel->relids, root->all_baserels));

FWIW, it's far from perfect that you've needed to delay the left join
removal, but I do understand why you've done it. It's also far from
perfect that you're including removed relations in the
total_table_pages calculation. c6e4133fae1 took some measures to
improve this calculation and this is making it worse again.

6. Can you explain why you moved the populate_baserel_uniquekeys()
call out of set_plain_rel_size()?

7. I don't think the removal of rel_supports_distinctness() is
warranted.  Is it not ok to check if the relation has any uniquekeys?
It's possible, particularly in join_is_removable that this can save
quite a large amount of effort.

8. Your spelling of unique is incorrect in many places:

src/backend/nodes/makefuncs.c: * makeUnqiueKey
src/backend/optimizer/path/uniquekeys.c:static List
*initililze_unqiuecontext_for_joinrel(RelOptInfo *joinrel,
src/backend/optimizer/path/uniquekeys.c: * check if combination of
unqiuekeys from both side is still useful for us,
src/backend/optimizer/path/uniquekeys.c:outerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, outerrel);
src/backend/optimizer/path/uniquekeys.c:innerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, innerrel);
src/backend/optimizer/path/uniquekeys.c: * we need to convert the
UnqiueKey from sub_final_rel to currel via the positions info in
src/backend/optimizer/path/uniquekeys.c:ctx->pos =
pos; /* the position in current targetlist,  will be used to set
UnqiueKey */
src/backend/optimizer/path/uniquekeys.c: * Check if Unqiue key of the
innerrel is valid after join. innerrel's UniqueKey
src/backend/optimizer/path/uniquekeys.c: * initililze_unqiuecontext_for_joinrel
src/backend/optimizer/path/uniquekeys.c: * all the unqiuekeys which
are not possible to use later
src/backend/optimizer/path/uniquekeys.c:initililze_unqiuecontext_for_joinrel(RelOptInfo
*joinrel,  RelOptInfo *inputrel)
src/backend/optimizer/plan/analyzejoins.c:  /*
This UnqiueKey is what we want */
src/backend/optimizer/plan/planner.c:   /* If we the result if unqiue
already, we just return the input_rel directly */
src/include/nodes/pathnodes.h: * exprs is a list of exprs which is
unqiue on current RelOptInfo.
src/test/regress/expected/join.out:-- : since b.id is unqiue now
so the group by cluase is erased, so
src/test/regress/expected/select_distinct.out:-- create unqiue index on dist_p
src/test/regress/expected/select_distinct.out:-- we also support
create unqiue index on each child tables
src/test/regress/sql/join.sql:-- : since b.id is unqiue now so the
group by cluase is erased, so
src/test/regress/sql/select_distinct.sql:-- create unqiue index on dist_p
src/test/regress/sql/select_distinct.sql:-- we also support create
unqiue index on each child tables

9. A few things wrong with the following fragment:

/* set the not null info now */
ListCell *lc;
foreach(lc, find_nonnullable_vars(qual))
{
Var *var = lfirst_node(Var, lc);
RelOptInfo *rel = root->simple_rel_array[var->varno];
if (var->varattno > InvalidAttrNumber)
rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
}

a. including a function call in the foreach macro is not a practise
that we really follow. It's true that the macro now assigns the 2nd
param to a variable. Previous to 1cff1b95ab6 this was not the case and
it's likely best not to leave any bad examples 

Very outdated comments in heapam_index_build_range_scan.

2020-03-30 Thread Andres Freund
Hi,

heapam_index_build_range_scan() has the following, long standing,
comment:

		/*
		 * When dealing with a HOT-chain of updated tuples, we want to index
		 * the values of the live tuple (if any), but index it under the TID
		 * of the chain's root tuple.  This approach is necessary to preserve
		 * the HOT-chain structure in the heap. So we need to be able to find
		 * the root item offset for every tuple that's in a HOT-chain.  When
		 * first reaching a new page of the relation, call
		 * heap_get_root_tuples() to build a map of root item offsets on the
		 * page.
		 *
		 * It might look unsafe to use this information across buffer
		 * lock/unlock.  However, we hold ShareLock on the table so no
		 * ordinary insert/update/delete should occur; and we hold pin on the
		 * buffer continuously while visiting the page, so no pruning
		 * operation can occur either.
		 *
		 * Also, although our opinions about tuple liveness could change while
		 * we scan the page (due to concurrent transaction commits/aborts),
		 * the chain root locations won't, so this info doesn't need to be
		 * rebuilt after waiting for another transaction.
		 *
		 * Note the implied assumption that there is no more than one live
		 * tuple per HOT-chain --- else we could create more than one index
		 * entry pointing to the same root tuple.
		 */

I don't think the second paragraph has been true for a *long* time. At
least since CREATE INDEX CONCURRENTLY was introduced.

There's also:
			/*
			 * We could possibly get away with not locking the buffer here,
			 * since caller should hold ShareLock on the relation, but let's
			 * be conservative about it.  (This remark is still correct even
			 * with HOT-pruning: our pin on the buffer prevents pruning.)
			 */
			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);

and
					/*
					 * Since caller should hold ShareLock or better, normally
					 * the only way to see this is if it was inserted earlier
					 * in our own transaction.  However, it can happen in
					 * system catalogs, since we tend to release write lock
					 * before commit there.  Give a warning if neither case
					 * applies.
					 */


Greetings,

Andres Freund




Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-03-30 Thread Justin Pryzby
On Wed, Jan 29, 2020 at 12:15:59PM +0100, Julien Rouhaud wrote:
> Rebase due to conflict with 3ec20c7091e97.

This is failing to apply probably since 
4a539a25ebfc48329fd656a95f3c1eb2cda38af3.
Could you rebase?   (Also, not sure if this can be set as RFC?)

-- 
Justin




Re: pgsql: Improve handling of parameter differences in physical replicatio

2020-03-30 Thread Fujii Masao




On 2020/03/31 3:10, Andres Freund wrote:

Hi,

On 2020-03-30 19:41:43 +0900, Fujii Masao wrote:

On 2020/03/30 16:58, Peter Eisentraut wrote:

Improve handling of parameter differences in physical replication

When certain parameters are changed on a physical replication primary,
this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
record.  The standby then checks whether its own settings are at least
as big as the ones on the primary.  If not, the standby shuts down
with a fatal error.

The correspondence of settings between primary and standby is required
because those settings influence certain shared memory sizings that
are required for processing WAL records that the primary might send.
For example, if the primary sends a prepared transaction, the standby
must have had max_prepared_transaction set appropriately or it won't
be able to process those WAL records.

However, fatally shutting down the standby immediately upon receipt of
the parameter change record might be a bit of an overreaction.  The
resources related to those settings are not required immediately at
that point, and might never be required if the activity on the primary
does not exhaust all those resources.  If we just let the standby roll
on with recovery, it will eventually produce an appropriate error when
those resources are used.

So this patch relaxes this a bit.  Upon receipt of
XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
warning and set a global flag if there is a problem.  Then when we
actually hit the resource issue and the flag was set, we issue another
warning message with relevant information.


I find it somewhat hostile that we don't display the actual resource
error once the problem is hit - we just pause. Sure, there's going to be
some previous log entry explaining what the actual parameter difference
is - but that could have been weeks ago. So either hard to find, or even
rotated out.



At that point we pause recovery, so a hot standby remains usable.
We also repeat the last warning message once a minute so it is
harder to miss or ignore.



I can't really imagine that the adjustments made in this patch are
sufficient.

One important issue seems to me to be the size of the array that
TransactionIdIsInProgress() allocates:
	/*
	 * If first time through, get workspace to remember main XIDs in. We
	 * malloc it permanently to avoid repeated palloc/pfree overhead.
	 */
	if (xids == NULL)
	{
		/*
		 * In hot standby mode, reserve enough space to hold all xids in the
		 * known-assigned list. If we later finish recovery, we no longer need
		 * the bigger array, but we don't bother to shrink it.
		 */
		int			maxxids = RecoveryInProgress() ?
			TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;

		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
		if (xids == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
	}

Which I think means we'll just overrun the xids array in some cases,
e.g. if KnownAssignedXids overflowed.  Obviously we should have a
crosscheck in the code (which we don't), but it was previously a
supposedly unreachable path.

Similarly, the allocation in GetSnapshotData() will be too small, I
think:
	if (snapshot->xip == NULL)
	{
		/*
		 * First call for this snapshot. Snapshot is same size whether or not
		 * we are in recovery, see later comments.
		 */
		snapshot->xip = (TransactionId *)
			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
		if (snapshot->xip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		Assert(snapshot->subxip == NULL);
		snapshot->subxip = (TransactionId *)
			malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
		if (snapshot->subxip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
	}

I think basically all code using TOTAL_MAX_CACHED_SUBXIDS,
GetMaxSnapshotSubxidCount(), PROCARRAY_MAXPROCS needs to be reviewed
much more carefully than done here.


Also, shouldn't dynahash be adjusted as well? There's e.g. the
following HASH_ENTER path:
/* report a generic message */
if (hashp->isshared)
ereport(ERROR,

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-30 Thread Tomas Vondra

On Mon, Mar 30, 2020 at 06:53:47PM -0400, James Coleman wrote:

On Mon, Mar 30, 2020 at 8:24 AM Tomas Vondra
 wrote:


On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:
>On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
> wrote:
>>
>> Hi,
>>
>> Attached is a slightly reorganized patch series. I've merged the fixes
>> into the appropriate matches, and I've also combined the two patches
>> adding incremental sort paths to additional places in planner.
>>
>> A couple more comments:
>>
>>
>> 1) I think the GUC documentation in src/sgml/config.sgml is a bit too
>> detailed, compared to the other enable_* GUCs. I wonder if there's a
>> better place where to move the details. What about adding some examples
>> and explanation to perform.sgml?
>
>I'll take a look at that and include in a patch series tomorrow.


Attached.


>> 2) Looking at the explain output, the verbose mode looks like this:
>>
>> test=# explain (verbose, analyze) select a from t order by a, b, c;
>>
QUERY PLAN
>> 
--
>>   Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual 
time=4.787..20092.555 rows=1000 loops=1)
>> Output: a, b, c
>> Workers Planned: 2
>> Workers Launched: 2
>> ->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) 
(actual time=1.308..14021.575 rows=333 loops=3)
>>   Output: a, b, c
>>   Sort Key: t.a, t.b, t.c
>>   Presorted Key: t.a, t.b
>>   Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB 
peak=30kB
>>   Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB 
peak=138kB
>>   Worker 0:  actual time=0.766..16122.368 rows=3841573 loops=1
>>   Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
>> Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB 
peak=141kB
>>   Worker 1:  actual time=1.986..16189.831 rows=3845490 loops=1
>>   Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
>> Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB 
peak=139kB
>>   ->  Parallel Index Scan using t_a_b_idx on public.t  
(cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 rows=333 
loops=3)
>> Output: a, b, c
>> Worker 0:  actual time=0.048..11275.178 rows=3841573 loops=1
>> Worker 1:  actual time=0.041..11314.133 rows=3845490 loops=1
>>   Planning Time: 0.166 ms
>>   Execution Time: 25135.029 ms
>> (22 rows)
>>
>> There seems to be missing indentation for the first line of worker info.
>
>Working on that too.


See attached. I've folded in the original "explain fixes" patch into
the main series, and the "explain fixes" patch in this series contains
only the changes for the above.



Thanks. I'll take a look at those changes tomorrow.


>> I'm still not quite convinced we should be printing two lines - I know
>> you mentioned the lines might be too long, but see how long the other
>> lines may get ...
>
>All right, I give in :)
>
>Do you think non-workers (both the leader and non-parallel plans)
>should also move to one line?
>

I think we should use the same formatting for both cases, so yes.

FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
bit, by moving the if condition in it. That makes the calls easier.


Ah, that actually fixed some of the compile warnings. The other is
fixed in my explain fixes patch.


>> 3) I see the new nodes (plan state, ...) have "presortedCols" which does
>> not indicate it's a "number of". I think we usually prefix names of such
>> fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)
>
>I can fix this too.


Changed everywhere we used this var name. I'm tempted to change to
nPresortedKeys, but a cursory glance suggests some cases might
actually be consistent with other var names reference columns, so I'm
not sure if we want to go down that path (and change more than just
this).



Not sure. We use "sort keys" and "path keys" for this, but I think
"columns" is good enough.


The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

  create table t (a int, b int, c int, d int, e int, f int, g int);

  insert into t select 100*random(), 100*random(), 100*random(),
100*random(), 100*random(), 100*random(), 100*random()
  from generate_series(1,10) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 

Re: Internal key management system

2020-03-30 Thread Cary Huang
Hi

I had a look on kms_v9 patch and have some comments



--> pg_upgrade.c

keys are copied correctly, but as pg_upgrade progresses further, it will try to 
start the new_cluster from the "issue_warnings_and_set_wal_level()" function, which 
is called after the key copy. The new cluster will fail to start due to the 
mismatch between cluster_passphrase_command and the newly copied keys. This 
causes pg_upgrade to always finish with failure. We could move 
"copy_master_encryption_key()" to be called after 
"issue_warnings_and_set_wal_level()" and this will make pg_upgrade finish 
successfully, but the user will still have to manually correct the 
"cluster_passphrase_command" param on the new cluster in order for it to start 
up correctly. Should pg_upgrade also take care of copying the 
"cluster_passphrase_command" param from the old to the new cluster after it has 
copied the encryption keys, so users don't have to do this step? If the 
expectation is for users to manually correct the "cluster_passphrase_command" 
param after a successful pg_upgrade and key copy, then there should be a message 
to remind the users to do so. 



-->Kmgr.c 

+   /*
+* If there is only temporary directory, it means that the previous
+* rotation failed after wrapping the all internal keys by the new
+* passphrase.  Therefore we use the new cluster passphrase.
+*/
+   if (stat(KMGR_DIR, ) != 0)
+   {
+   ereport(DEBUG1,
+   (errmsg("both directories %s and %s exist, use the newly wrapped keys",
+   KMGR_DIR, KMGR_TMP_DIR)));



I think the error message should say "there is only temporary directory exist" 
instead of "both directories exist"



thanks!



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca






 On Wed, 25 Mar 2020 01:51:08 -0700 Masahiko Sawada 
 wrote 



On Tue, 24 Mar 2020 at 23:15, Bruce Momjian  wrote: 
> 
> On Tue, Mar 24, 2020 at 02:29:57PM +0900, Masahiko Sawada wrote: 
> > That seems to work fine. 
> > 
> > So we will have pg_cryptokeys within PGDATA and each key is stored 
> > into separate file named the key id such as "sql", "tde-wal" and 
> > "tde-block". I'll update the patch and post. 
> 
> Yes, that makes sense to me. 
> 
 
I've attached the updated patch. With the patch, we have three 
internal keys: SQL key, TDE-block key and TDE-wal key. Only SQL key 
can be used so far to wrap and unwrap user secret via pg_wrap and 
pg_unwrap SQL functions. Each keys is saved to the single file located 
at pg_cryptokeys. After initdb with enabling key manager, the 
pg_cryptokeys directory has the following files: 
 
$ ll data/pg_cryptokeys 
total 12K 
-rw--- 1 masahiko staff 132 Mar 25 15:45  
-rw--- 1 masahiko staff 132 Mar 25 15:45 0001 
-rw--- 1 masahiko staff 132 Mar 25 15:45 0002 
 
I used the integer id rather than string id to make the code simple. 
 
When cluster passphrase rotation, we update all keys atomically using 
temporary directory as follows: 
 
1. Derive the new passphrase 
2. Wrap all internal keys with the new passphrase 
3. Save all internal keys to the temp directory 
4. Remove the original directory, pg_cryptokeys 
5. Rename the temp directory to pg_cryptokeys 
 
In case of failure during rotation, pg_cyrptokeys and 
pg_cyrptokeys_tmp can be left in an incomplete state. We recover it by 
checking if the temporary directory exists and the wrapped keys in the 
temporary directory are valid. 
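
An illustrative outline of that rotation protocol follows (not the posted
patch; wrap_and_write_keys() and remove_dir_recursively() are hypothetical
stand-ins, the latter for something like PostgreSQL's rmtree(), and directory
fsyncs are omitted):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* hypothetical helpers, declared only to keep the sketch readable */
    extern bool wrap_and_write_keys(const char *dir);
    extern void remove_dir_recursively(const char *dir);

    static void
    rotate_keys(const char *keydir, const char *tmpdir)
    {
        /*
         * Steps 1-3: wrap all internal keys with the new passphrase and
         * write them into the temporary directory.
         */
        if (!wrap_and_write_keys(tmpdir))
        {
            fprintf(stderr, "could not write newly wrapped keys\n");
            exit(1);
        }

        /* Step 4: remove the original key directory. */
        remove_dir_recursively(keydir);

        /* Step 5: move the temporary directory into place. */
        if (rename(tmpdir, keydir) != 0)
        {
            perror("rename");
            exit(1);
        }
    }

A crash between steps 4 and 5 leaves only the temporary directory behind,
which is exactly the case the recovery check described above looks for.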
 
Regards, 
 
-- 
Masahiko Sawada http://www.2ndQuadrant.com/ 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Corruption during WAL replay

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti  
> wrote in 
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and 
> > then truncates the file, but a crash before the truncate and after the 
> > buffer-drop is causing the corruption. Patch reverses the order i.e. 
> > truncate the file and drop the buffer later.
> 
> BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?


> I'm not sure it's acceptable to remember all to-be-deleted buffers
> while truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).


> 
> +  /*START_CRIT_SECTION();*/

> Is this a point of argument?  It is not needed if we choose the
> strategy (c) in [1], since the protocol is aiming to allow server to
> continue running after truncation failure.
> 
> [1]: 
> https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the trunction, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.


If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.
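
The ordering being argued for is roughly the following fragment (a sketch, not
the actual patch; xlrec, reln, forknum and nblocks are assumed to be set up as
in the existing truncation code):

    START_CRIT_SECTION();

    /* First WAL-log the truncation ... */
    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, sizeof(xlrec));
    XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE);

    /*
     * ... then drop the buffers and shrink the file.  A failure here must
     * PANIC: the WAL record is already out, and replicas will replay it.
     */
    smgrtruncate(reln, &forknum, 1, &nblocks);

    END_CRIT_SECTION();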

Greetings,

Andres Freund




Re: backup manifests

2020-03-30 Thread David Steele

On 3/30/20 4:16 PM, Robert Haas wrote:

On Fri, Mar 27, 2020 at 3:51 PM David Steele  wrote:


  > { "Path": "backup_label", "Size": 224, "Last-Modified": "2020-03-27
18:33:18 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "b914bec9" },

Storing the checksum type with each file seems pretty redundant.
Perhaps that could go in the header?  You could always override if a
specific file had a different checksum type, though that seems unlikely.

In general it might be good to go with shorter keys: "mod", "chk", etc.
Manifests can get pretty big and that's a lot of extra bytes.

I'm also partial to using epoch time in the manifest because it is
generally easier for programs to work with.  But, human-readable doesn't
suck, either.


It doesn't seem impossible for it to come up; for example, consider a
file-level incremental backup facility. You might retain whatever
checksums you have for the unchanged files (to avoid rereading them)
and add checksums for modified or added files.


OK.


I am not convinced that minimizing the size of the file here is a
particularly important goal, because I don't think it's going to get
that big in normal cases. I also think having the keys and values be
easily understandable by human being is a plus. If we really want a
minimal format without redundancy, we should've gone with what I
proposed before (though admittedly that could've been tamped down even
further if we'd cared to squeeze, which I didn't think was important
then either).


Well, normal cases is the key.  But fine, in general we have found that 
the in memory representation is more important in terms of supporting 
clusters with very large numbers of files.



When I ran pg_validatebackup I expected to use -D to specify the backup
dir since pg_basebackup does.  On the other hand -D is weird because I
*really* expect that to be the pg data dir.

But, do we want this to be different from pg_basebackup?


I think it's pretty distinguishable, because pg_basebackup needs an
input (server) and an output (directory), whereas pg_validatebackup
only needs one. I don't really care if we want to change it, but I was
thinking of this as being more analogous to, say, pg_resetwal.
Granted, that's a danger-don't-use-this tool and this isn't, but I
don't think we want the -D-is-optional behavior that tools like pg_ctl
have, because having a tool that isn't supposed to be used on a
running cluster default to $PGDATA seems inadvisable. And if the
argument is mandatory then it's not clear to me why we should make
people type -D in front of it.


Honestly I think pg_basebackup is the confusing one, because in most 
cases -D points at the running cluster dir. So, OK.



  > +checksum_length = checksum_string_length / 2;

This check is defeated if a single character is added the to checksum.

Not too big a deal since you still get an error, but still.


I don't see what the problem is here. We speculatively divide by two
and allocate memory on the assumption that the value was even, but then
before doing anything critical we bail out if it was actually odd.
That's harmless. We could get around it by saying:

if (checksum_string_length % 2 != 0)
 context->error_cb(...);
checksum_length = checksum_string_length / 2;
checksum_payload = palloc(checksum_length);
if (!hexdecode_string(...))
 context->error_cb(...);

...but that would be adding additional code, and error messages, for
what's basically a can't-happen-unless-the-user-is-messing-with-us
case.


Sorry, pasted the wrong code and even then still didn't get it quite 
right.


The problem:

If I remove an even number of characters from a checksum, it appears the 
checksum passes but the manifest checksum fails:


$ pg_basebackup -D test/backup5 --manifest-checksums=SHA256

$ vi test/backup5/backup_manifest
* Remove two characters from the checksum of backup_label

$ pg_validatebackup test/backup5

pg_validatebackup: fatal: manifest checksum mismatch

But if I add any number of characters or remove an odd number of 
characters I get:


pg_validatebackup: fatal: invalid checksum for file "backup_label": 
"a98e9164fd59d498d14cfdf19c67d1c2208a30e7b939d1b4a09f524c7adfc11fXX"


and no manifest checksum failure.
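
One way to close that gap, sketched here rather than taken from
pg_validatebackup, is to validate the checksum string length against the
digest size implied by the declared algorithm instead of only requiring an
even number of hex characters:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /*
     * Sketch: return true only when hex_len matches the expected digest size
     * (two hex characters per byte) for the named algorithm.
     */
    static bool
    checksum_length_ok(const char *algorithm, size_t hex_len)
    {
        size_t      expected_bytes;

        if (strcmp(algorithm, "CRC32C") == 0)
            expected_bytes = 4;
        else if (strcmp(algorithm, "SHA224") == 0)
            expected_bytes = 28;
        else if (strcmp(algorithm, "SHA256") == 0)
            expected_bytes = 32;
        else if (strcmp(algorithm, "SHA384") == 0)
            expected_bytes = 48;
        else if (strcmp(algorithm, "SHA512") == 0)
            expected_bytes = 64;
        else
            return false;       /* unknown algorithm name */

        return hex_len == expected_bytes * 2;
    }

With a check like that, a SHA256 entry shortened by two characters would be
rejected as malformed instead of merely failing the value comparison.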


  > + * Verify that the manifest checksum is correct.

This is not working the way I would expect -- I could freely modify the
manifest without getting a checksum error on the manifest.  For example:

$ /home/vagrant/test/pg/bin/pg_validatebackup test/backup3
pg_validatebackup: fatal: invalid checksum for file "backup_label":
"408901e0814f40f8ceb7796309a59c7248458325a21941e7c55568e381f53831?"

So, if I deleted the entry above, I got a manifest checksum error.  But
if I just modified the checksum I get a file checksum error with no
manifest checksum error.

I would prefer a manifest checksum error in all cases where it is wrong,
unless --exit-on-error is specified.


I think I would too, but I'm confused as to what you're doing, because
if I just modified the manifest -- 

Re: [HACKERS] Restricting maximum keep segments by repslots

2020-03-30 Thread Alvaro Herrera
I rebased this patch; it's failing to apply due to minor concurrent
changes in PostgresNode.pm.  I squashed the patches in a series that
made the most sense to me.

I have a question about static variable lastFoundOldestSeg in
FindOldestXLogFileSegNo.  It may be set the first time the function
runs; if it is, the function never again does anything, it just returns
that value.  In other words, the static value is never reset; it never
advances either.  Isn't that strange?  I think the coding is to assume
that XLogCtl->lastRemovedSegNo will always be set, so its code will
almost never run ... except when the very first wal file has not been
removed yet.  This seems weird and pointless.  Maybe we should think
about this differently -- example: if XLogGetLastRemovedSegno returns
zero, then the oldest file is the zeroth one.  In what cases this is
wrong?  Maybe we should fix those.

Regarding the PostgresNode change in 0001, I think adding a special
parameter for primary_slot_name is limited.  I'd like to change the
definition so that anything that you give as a parameter that's not one
of the recognized keywords (has_streaming, etc) is tested to see if it's
a GUC; and if it is, then put it in postgresql.conf.  This would have to
apply both to PostgresNode::init() as well as
PostgresNode::init_from_backup(), obviously, since it would make no
sense for the APIs to diverge on this point.  So you'd be able to do
  $node->init_from_backup(allow_streaming => 1, work_mem => "4MB");
without having to add code to init_from_backup to handle work_mem
specifically.  This could be done by having a Perl hash with all the GUC
names, that we could read lazily from "postmaster --describe-config" the
first time we see an unrecognized keyword as an option to init() /
init_from_backup().

I edited the doc changes a bit.

I don't know what to think of 0003 yet.  Has this been agreed to be a
good idea?

I also made a few small edits to the code; all cosmetic so far:

* added long_desc to the new GUC; it now reads:

{"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the maximum size of WAL space reserved by 
replication slots."),
gettext_noop("Replication slots will be marked as failed, and 
segments released "
 "for deletion or recycling, if this much space is 
occupied by WAL "
 "on disk."),

* updated the comment to ConvertToXSegs() which is now being used for
  this purpose

* remove outdated comment to GetWalAvailability; it was talking about
  restBytes parameter that no longer exists

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 2b49b112987379edc76d4226989a26432c34fcaf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 19 Dec 2018 12:43:57 +0900
Subject: [PATCH v19 1/3] Add primary_slot_name to init_from_backup in TAP
 test.

It is convenient that primary_slot_name can be specified when taking a
base backup. This adds a new parameter of that name to the perl
function.
---
 src/test/perl/PostgresNode.pm | 5 +
 1 file changed, 5 insertions(+)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..b80305b6c5 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -702,6 +702,11 @@ port = $port
 		$self->append_conf('postgresql.conf',
 			"unix_socket_directories = '$host'");
 	}
+	if (defined $params{primary_slot_name})
+	{
+		$self->append_conf('postgresql.conf',
+			"primary_slot_name = $params{primary_slot_name}");
+	}
 	$self->enable_streaming($root_node) if $params{has_streaming};
 	$self->enable_restoring($root_node, $params{standby}) if $params{has_restoring};
 	return;
-- 
2.20.1

>From 34c030876b54144e2c59b14d8b589f8fd840e558 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Dec 2017 21:20:20 +0900
Subject: [PATCH v19 2/3] Add WAL relief vent for replication slots

Replication slots are useful to maintain a replication connection in
configurations where replication is so delayed that the connection
would otherwise be broken. On the other hand, a long delay can let so
many WAL files accumulate that the disk fills up and the master goes
down. This feature, which is activated by the GUC
"max_slot_wal_keep_size", protects master servers from disk-full
failures by limiting the number of WAL files reserved by replication
slots.
---
 contrib/test_decoding/expected/ddl.out|   4 +-
 contrib/test_decoding/sql/ddl.sql |   2 +
 doc/src/sgml/catalogs.sgml|  48 +++
 doc/src/sgml/config.sgml  |  23 ++
 doc/src/sgml/high-availability.sgml   |   8 +-
 src/backend/access/transam/xlog.c | 354 --
 src/backend/catalog/system_views.sql  |   4 +-
 src/backend/replication/slot.c|   1 +
 src/backend/replication/slotfuncs.c   |  39 +-
 src/backend/utils/misc/guc.c 

Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Tom Lane
Nikita Glukhov  writes:
> I think now it looks as simple as the whole algorithm is.

Yeah, I think we've gotten checkCond to the point of "there's no
longer anything to take away".

I've marked this RFC, and will push tomorrow unless somebody wants
to object to the loss of backwards compatibility.

regards, tom lane




Re: backup manifests

2020-03-30 Thread David Steele

On 3/30/20 5:08 PM, Andres Freund wrote:


The data in the backup label isn't sufficient though. Without having
parsed the timeline file there's no way to verify that the correct WAL
is present. I guess we can also add client side tools to parse
timelines, add a command to fetch all of the required files, and then
interpret that somehow.

But that seems much more complicated.

Imo it makes sense to want to be able to verify that WAL looks correct even
when transporting WAL using another method (say archiving) and thus using
pg_basebackup's -Xnone.

Having the manifest actually list what's required for the base backup
doesn't seem redundant to me. Imo it makes the manifest file make a good
bit more sense, since afterwards it actually describes the whole base
backup.


FWIW, pgBackRest stores the backup WAL stop/start in the manifest. To 
get this information after the backup is complete requires parsing the 
.backup file which doesn't get stored in the backup directory by 
pg_basebackup. As far as I know, this is only accessibly to solutions 
that implement archive_command. So, pgBackRest could do that but it 
seems far more trouble than it is worth.


Regards,
--
-David
da...@pgmasters.net




Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Nikita Glukhov


On 31.03.2020 1:35, Tom Lane wrote:

Nikita Glukhov  writes:

And we even can simply transform this tail call into a loop:

-if (tlen > 0 && qlen > 0)
+while (tlen > 0 && qlen > 0)

Yeah, the same occurred to me ... and then we can drop the other loop too.


I think now it looks as simple as the whole algorithm is.


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Tom Lane
Nikita Glukhov  writes:
> And we even can simply transform this tail call into a loop:

> -if (tlen > 0 && qlen > 0)
> +while (tlen > 0 && qlen > 0)

Yeah, the same occurred to me ... and then we can drop the other loop too.
I've got it down to this now:

/*
 * Try to match an lquery (of qlen items) to an ltree (of tlen items)
 */
static bool
checkCond(lquery_level *curq, int qlen,
  ltree_level *curt, int tlen)
{
/* Since this function recurses, it could be driven to stack overflow */
check_stack_depth();

/* Loop while we have query items to consider */
while (qlen > 0)
{
int low,
high;
lquery_level *nextq;

/*
 * Get min and max repetition counts for this query item, dealing with
 * the backwards-compatibility hack that the low/high fields aren't
 * meaningful for non-'*' items unless LQL_COUNT is set.
 */
if ((curq->flag & LQL_COUNT) || curq->numvar == 0)
low = curq->low, high = curq->high;
else
low = high = 1;

/*
 * We may limit "high" to the remaining text length; this avoids
 * separate tests below.
 */
if (high > tlen)
high = tlen;

/* Fail if a match of required number of items is impossible */
if (high < low)
return false;

/*
 * Recursively check the rest of the pattern against each possible
 * start point following some of this item's match(es).
 */
nextq = LQL_NEXT(curq);
qlen--;

for (int matchcnt = 0; matchcnt < high; matchcnt++)
{
/*
 * If we've consumed an acceptable number of matches of this item,
 * and the rest of the pattern matches beginning here, we're good.
 */
if (matchcnt >= low && checkCond(nextq, qlen, curt, tlen))
return true;

/*
 * Otherwise, try to match one more text item to this query item.
 */
if (!checkLevel(curq, curt))
return false;

curt = LEVEL_NEXT(curt);
tlen--;
}

/*
 * Once we've consumed "high" matches, we can succeed only if the rest
 * of the pattern matches beginning here.  Loop around (if you prefer,
 * think of this as tail recursion).
 */
curq = nextq;
}

/*
 * Once we're out of query items, we match only if there's no remaining
 * text either.
 */
return (tlen == 0);
}


regards, tom lane




Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Nikita Glukhov

On 31.03.2020 1:12, Tom Lane wrote:


I wrote:

I dunno, that doesn't really seem clearer to me (although some of it
might be that you expended no effort on making the comments match
the new code logic).

... although looking closer, this formulation does have one very nice
advantage: for the typical non-star case with high = low = 1, the
only recursive call is a tail recursion, so it ought to consume less
stack space than what I wrote.


And we even can simply transform this tail call into a loop:

-if (tlen > 0 && qlen > 0)
+while (tlen > 0 && qlen > 0)


Let me see what I can do with the comments.


Thanks.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Tom Lane
I wrote:
> I dunno, that doesn't really seem clearer to me (although some of it
> might be that you expended no effort on making the comments match
> the new code logic).

... although looking closer, this formulation does have one very nice
advantage: for the typical non-star case with high = low = 1, the
only recursive call is a tail recursion, so it ought to consume less
stack space than what I wrote.

Let me see what I can do with the comments.

regards, tom lane




Re: [PATCH] remove condition always true (/src/backend/access/heap/vacuumlazy.c)

2020-03-30 Thread Ranier Vilela
On Mon, Mar 30, 2020 at 18:14, Andres Freund wrote:

> Hi,
>
> On 2020-03-30 14:10:29 -0700, Andres Freund wrote:
> > On 2020-03-30 17:08:01 -0300, Ranier Vilela wrote:
> > > On Mon, Mar 30, 2020 at 16:05, Andres Freund <and...@anarazel.de>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > On 2020-03-30 15:07:40 -0300, Ranier Vilela wrote:
> > > > > I'm not sure that the patch is 100% correct.
> > > >
> > > > This is *NOT* correct.
> > > >
> > > Anyway, the original source, still wrong.
> > > What is the use of testing PageIsNew (page) twice in a row, if nothing
> > > has changed.
> >
> > Yea, that can be reduced. It's pretty harmless though.
> >
> > We used to require a cleanup lock (which requires dropping the lock,
> > acquiring a cleanup lock - which allows for others to make the page be
> > not empty) before acting on the empty page in vacuum. That's why
> > PageIsNew() had to be checked again.

Well, this is what the patch does: it reduces the duplication and continues
to check PageIsNew after unlocking.

regards,
Ranier Vilela


Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Tom Lane
Nikita Glukhov  writes:
> On 30.03.2020 21:00, Tom Lane wrote:
>> Hence, new patch versions that do it like that.  (0002 is unchanged.)

> I tried to simplify the loops in checkCond() a bit by merging two of them into
> one with an explicit exit condition.  I also added a return statement after
> this loop, so it's now clear that we can't fall into the next "while" loop.

I dunno, that doesn't really seem clearer to me (although some of it
might be that you expended no effort on making the comments match
the new code logic).

regards, tom lane




Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Nikita Glukhov

On 30.03.2020 21:00, Tom Lane wrote:


Hence, new patch versions that do it like that.  (0002 is unchanged.)


I tried to simplify the loops in checkCond() a bit by merging two of them into
one with an explicit exit condition.  I also added a return statement after
this loop, so it's now clear that we can't fall into the next "while" loop.

The rest code in 0001 and 0002 is unchanged.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>From b30c07ec318edefdcc9faa6c2158f4bb56114789 Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Tue, 31 Mar 2020 00:13:40 +0300
Subject: [PATCH v3 1/2] Rationalize ltree input errors

---
 contrib/ltree/expected/ltree.out   | 13 ++-
 contrib/ltree/ltree_io.c   | 99 --
 contrib/ltree/sql/ltree.sql|  1 +
 contrib/ltree_plpython/expected/ltree_plpython.out |  2 +-
 4 files changed, 63 insertions(+), 52 deletions(-)

diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out
index c78d372..610cb6f 100644
--- a/contrib/ltree/expected/ltree.out
+++ b/contrib/ltree/expected/ltree.out
@@ -464,7 +464,7 @@ SELECT nlevel(('1' || repeat('.1', 65534))::ltree);
 (1 row)
 
 SELECT nlevel(('1' || repeat('.1', 65535))::ltree);
-ERROR:  number of ltree levels (65536) exceeds the maximum allowed (65535)
+ERROR:  number of ltree labels (65536) exceeds the maximum allowed (65535)
 SELECT nlevel(('1' || repeat('.1', 65534))::ltree || '1');
 ERROR:  number of ltree levels (65536) exceeds the maximum allowed (65535)
 SELECT ('1' || repeat('.1', 65534))::lquery IS NULL;
@@ -474,7 +474,7 @@ SELECT ('1' || repeat('.1', 65534))::lquery IS NULL;
 (1 row)
 
 SELECT ('1' || repeat('.1', 65535))::lquery IS NULL;
-ERROR:  number of lquery levels (65536) exceeds the maximum allowed (65535)
+ERROR:  number of lquery items (65536) exceeds the maximum allowed (65535)
 SELECT '*{65535}'::lquery;
   lquery  
 --
@@ -485,7 +485,7 @@ SELECT '*{65536}'::lquery;
 ERROR:  lquery syntax error
 LINE 1: SELECT '*{65536}'::lquery;
^
-DETAIL:  Low limit (65536) exceeds the maximum allowed (65535).
+DETAIL:  Low limit (65536) exceeds the maximum allowed (65535), at character 3.
 SELECT '*{,65534}'::lquery;
   lquery   
 ---
@@ -502,7 +502,12 @@ SELECT '*{,65536}'::lquery;
 ERROR:  lquery syntax error
 LINE 1: SELECT '*{,65536}'::lquery;
^
-DETAIL:  High limit (65536) exceeds the maximum allowed (65535).
+DETAIL:  High limit (65536) exceeds the maximum allowed (65535), at character 4.
+SELECT '*{4,3}'::lquery;
+ERROR:  lquery syntax error
+LINE 1: SELECT '*{4,3}'::lquery;
+   ^
+DETAIL:  Low limit (4) is greater than high limit (3), at character 5.
 SELECT '1.2'::ltree  < '2.2'::ltree;
  ?column? 
 --
diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 2503d47..e806a14 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -17,12 +17,6 @@ PG_FUNCTION_INFO_V1(lquery_in);
 PG_FUNCTION_INFO_V1(lquery_out);
 
 
-#define UNCHAR ereport(ERROR, \
-	   (errcode(ERRCODE_SYNTAX_ERROR), \
-		errmsg("syntax error at position %d", \
-		pos)));
-
-
 typedef struct
 {
 	char	   *start;
@@ -47,7 +41,12 @@ ltree_in(PG_FUNCTION_ARGS)
 	ltree	   *result;
 	ltree_level *curlevel;
 	int			charlen;
-	int			pos = 0;
+	int			pos = 1;		/* character position for error messages */
+
+#define UNCHAR ereport(ERROR, \
+	   errcode(ERRCODE_SYNTAX_ERROR), \
+	   errmsg("ltree syntax error at character %d", \
+			  pos))
 
 	ptr = buf;
 	while (*ptr)
@@ -61,7 +60,7 @@ ltree_in(PG_FUNCTION_ARGS)
 	if (num + 1 > LTREE_MAX_LEVELS)
 		ereport(ERROR,
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("number of ltree levels (%d) exceeds the maximum allowed (%d)",
+ errmsg("number of ltree labels (%d) exceeds the maximum allowed (%d)",
 		num + 1, LTREE_MAX_LEVELS)));
 	list = lptr = (nodeitem *) palloc(sizeof(nodeitem) * (num + 1));
 	ptr = buf;
@@ -88,10 +87,10 @@ ltree_in(PG_FUNCTION_ARGS)
 if (lptr->wlen > LTREE_LABEL_MAX_CHARS)
 	ereport(ERROR,
 			(errcode(ERRCODE_NAME_TOO_LONG),
-			 errmsg("name of level is too long"),
-			 errdetail("Name length is %d, must "
-	   "be < 256, in position %d.",
-	   lptr->wlen, pos)));
+			 errmsg("label string is too long"),
+			 errdetail("Label length is %d, must be at most %d, at character %d.",
+	   lptr->wlen, LTREE_LABEL_MAX_CHARS,
+	   pos)));
 
 totallen += MAXALIGN(lptr->len + LEVEL_HDRSIZE);
 lptr++;
@@ -115,10 +114,9 @@ ltree_in(PG_FUNCTION_ARGS)
 		if (lptr->wlen > LTREE_LABEL_MAX_CHARS)
 			ereport(ERROR,
 	(errcode(ERRCODE_NAME_TOO_LONG),
-	 errmsg("name of level is too long"),
-	 errdetail("Name length is %d, must "
-			   "be < 256, in position %d.",
-			   lptr->wlen, pos)));
+	 errmsg("label string is too long"),
+	 

Re: extension patch of CREATE OR REPLACE TRIGGER

2020-03-30 Thread Tom Lane
"osumi.takami...@fujitsu.com"  writes:
> Also, I'm waiting for other kind of feedbacks from anyone.

As David pointed out, this needs to be rebased, though it looks like
the conflict is pretty trivial.

A few other notes from a quick look:

* You missed updating equalfuncs.c/copyfuncs.c.  Pretty much any change in
a Node struct will require touching backend/nodes/ functions, and in
general it's a good idea to grep for uses of the struct to see what else
might be affected.

* Did you use a dartboard while deciding where to add the new field
in struct CreateTrigger?  Its placement certainly seems quite random.
Maybe we should put both "replace" and "isconstraint" up near the
front, to match up with the statement's syntax.

* The patch doesn't appear to have any defenses against being asked to
replace the definition of, say, a foreign key trigger.  It might be
sufficient to refuse to replace an entry that has tgisinternal set,
though I'm not sure if that covers all cases that we'd want to disallow.

* Speaking of which, I think you broke the isInternal case by insisting
on doing a lookup first.  isInternal should *not* do a lookup, period,
especially not with the name it's initially given which will not be the
final trigger name.  A conflict on that name is irrelevant, and so is
the OID of any pre-existing trigger.

* I'm not entirely sure that this patch interacts gracefully with
the provisions for per-partition triggers, either.  Is the change
correctly cascaded to per-partition triggers if there are any?
Do we disallow making a change on a child partition trigger rather
than its parent?  (Checking tgisinternal is going to be bound up
in that, since it looks like somebody decided to set that for child
triggers.  I'm inclined to think that that was a dumb idea; we
may need to break out a separate tgischild flag so that we can tell
what's what.)

* I'm a little bit concerned about the semantics of changing the
tgdeferrable/tginitdeferred properties of an existing trigger.  If there
are trigger events pending, and the trigger is redefined in such a way
that those events should already have been fired, what then?  This doesn't
apply in other sessions, because taking ShareRowExclusiveLock should be
enough to ensure that no other session has uncommitted updates pending
against the table.  But it *does* apply in our own session, because
ShareRowExclusiveLock won't conflict against our own locks.  One answer
would be to run CheckTableNotInUse() once we discover that we're modifying
an existing trigger.  Or we could decide that it doesn't matter --- if you
do that and it breaks, tough.  For comparison, I notice that there doesn't
seem to be any guard against dropping a trigger that has pending events
in our own session, though that doesn't work out too well:

regression=# create constraint trigger my_trig after insert on trig_table 
deferrable initially deferred for each row execute procedure 
before_replacement();
CREATE TRIGGER
regression=# begin;
BEGIN
regression=*# insert into trig_table default values;
INSERT 0 1
regression=*# drop trigger my_trig on trig_table;
DROP TRIGGER
regression=*# commit;
ERROR:  relation 38489 has no triggers

But arguably that's a bug to be fixed, not desirable behavior to emulate.

* Not the fault of this patch exactly, but trigger.c seems to have an
annoyingly large number of copies of the code to look up a trigger by
name.  I wonder if we could refactor that, say by extracting the guts of
get_trigger_oid() into an internal function that's passed an already-open
pg_trigger relation (see the sketch after this list).

* Upthread you were complaining about ShareRowExclusiveLock not being a
strong enough lock, but I think that's nonsense, for the same reason that
it's a sufficient lock for plain CREATE TRIGGER: if we have that lock then
no other session can have pending trigger events of any sort on the
relation, nor can new ones get made before we commit.  But there's no
reason to lock out SELECTs on the relation, since those don't interact
with triggers.
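
To illustrate the get_trigger_oid() refactoring idea mentioned in the list
above, a rough sketch of such an internal helper could look like this (the
function name and error wording are invented; the scan itself just mirrors
what get_trigger_oid() already does):

/*
 * Assumes the usual trigger.c headers: access/genam.h, access/htup_details.h,
 * catalog/indexing.h, catalog/pg_trigger.h, utils/fmgroids.h, utils/lsyscache.h.
 */
static Oid
get_trigger_oid_internal(Relation tgrel, Oid relid, const char *trigname,
						 bool missing_ok)
{
	ScanKeyData skey[2];
	SysScanDesc tgscan;
	HeapTuple	tup;
	Oid			oid;

	ScanKeyInit(&skey[0],
				Anum_pg_trigger_tgrelid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(relid));
	ScanKeyInit(&skey[1],
				Anum_pg_trigger_tgname,
				BTEqualStrategyNumber, F_NAMEEQ,
				CStringGetDatum(trigname));

	/* The caller has already opened and suitably locked pg_trigger. */
	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
								NULL, 2, skey);
	tup = systable_getnext(tgscan);
	if (!HeapTupleIsValid(tup))
	{
		if (!missing_ok)
			ereport(ERROR,
					(errcode(ERRCODE_UNDEFINED_OBJECT),
					 errmsg("trigger \"%s\" for table \"%s\" does not exist",
							trigname, get_rel_name(relid))));
		oid = InvalidOid;
	}
	else
		oid = ((Form_pg_trigger) GETSTRUCT(tup))->oid;
	systable_endscan(tgscan);
	return oid;
}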

regards, tom lane




Re: backup manifests

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-30 15:23:08 -0400, Robert Haas wrote:
> On Mon, Mar 30, 2020 at 2:59 PM Andres Freund  wrote:
> > I wonder if it'd not be best, independent of whether we build in this
> > verification, to include that metadata in the manifest file. That's for
> > sure better than having to build a separate tool to parse timeline
> > history files.
> 
> I don't think that's better, or at least not "for sure better". The
> backup_label going to include the START TIMELINE, and if -Xfetch is
> used, we're also going to have all the timeline history files. If the
> backup manifest includes those same pieces of information, then we've
> got two sources of truth: one copy in the files the server's actually
> going to read, and another copy in the backup_manifest which we're
> going to potentially use for validation but ignore at runtime. That
> seems not great.

The data in the backup label isn't sufficient though. Without having
parsed the timeline file there's no way to verify that the correct WAL
is present. I guess we can also add client side tools to parse
timelines, add a command to fetch all of the required files, and then
interpret that somehow.

But that seems much more complicated.

Imo it makes sense to want to be able to verify that WAL looks correct even
when transporting WAL using another method (say archiving) and thus using
pg_basebackup's -Xnone.

Having the manifest actually list what's required for the base backup
doesn't seem redundant to me. Imo it makes the manifest file make a good
bit more sense, since afterwards it actually describes the whole base
backup.

Taking the redundancy agreement a bit further you can argue that we
don't need a list of relation files at all, since they're in the catalog
:P. Obviously going to that extreme doesn't make all that much
sense... But I do think it's a second source of truth that's independent
of what the backends actually are going to read.

Greetings,

Andres Freund




Re: Error on failed COMMIT

2020-03-30 Thread Shay Rojansky
Apologies for not responding earlier, busy times.


>> Fourth, it is not clear how many applications would break if COMMIT
>> started issuing an error rather than return success with a ROLLBACK tag.
>> Certainly SQL scripts would be fine.  They would have one additional
>> error in the script output, but if they had ON_ERROR_STOP enabled, they
>> would have exited before the commit.  Applications that track statement
>> errors and issue rollbacks will be fine.  So, we are left with
>> applications that issue COMMIT and expect success after a transaction
>> block has failed.  Do we know how other database systems handle this?
>>
>
> Well I know pgjdbc handles my patch fine without any changes to the code
> As I mentioned upthread 2 of the 3 go drivers already error if rollback is
> returned. 1 of them does not.
>
> I suspect npgsql would be fine. Shay ?
>

Npgsql would be fine. In fact, Npgsql doesn't have any specific
expectations nor any specific logic around commit; it assumes errors may be
returned for any command (COMMIT or otherwise), and surfaces those errors
as .NET exceptions. The transaction status is tracked via CommandComplete
only, and as mentioned several times, PostgreSQL can already error on
commit for various other reasons (e.g. deferred constraint checks). This
direction makes a lot of sense to me.


Re: backup manifests

2020-03-30 Thread Robert Haas
On Sun, Mar 29, 2020 at 9:05 PM David Steele  wrote:
> Yeah, that seems reasonable.
>
> In our case backups are nearly always compressed and/or encrypted so
> even checking the original size is a bit of work. Getting the checksum
> at the same time seems like an obvious win.

Makes sense. If this ever got extended so it could read from tar-files
instead of the filesystem directly, we'd surely want to take the
opposite approach and just make a single pass. I'm not sure whether
it's worth doing that at some point in the future, but it might be. If
we're going to add the capability to compress or encrypt backups to
pg_basebackup, we might want to do that first, and then make this tool
handle all of those formats in one go.

(As always, I don't have the ability to control how arbitrary
developers spend their development time... so this is just a thought.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Atomic pgrename on Windows

2020-03-30 Thread Andres Freund
Hi,

On March 30, 2020 9:17:00 AM PDT, Tom Lane  wrote:
>So far as pg_stat_tmp is concerned, I think there is reasonable hope
>that that problem is just going to go away in the near future.
>I've not been paying attention to the shared-memory stats collector
>thread so I'm not sure if that's anywhere near committable, but
>I think that's clearly something we'll want once it's ready.

I'd give it a ~30-40%  chance for 13 at this point. The patch improved a lot 
after the review cycles in the last ~10 days, but still needs a good bit more 
polish.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: backup manifests

2020-03-30 Thread Robert Haas
On Mon, Mar 30, 2020 at 2:59 PM Andres Freund  wrote:
> I wonder if it'd not be best, independent of whether we build in this
> verification, to include that metadata in the manifest file. That's for
> sure better than having to build a separate tool to parse timeline
> history files.

I don't think that's better, or at least not "for sure better". The
backup_label is going to include the START TIMELINE, and if -Xfetch is
used, we're also going to have all the timeline history files. If the
backup manifest includes those same pieces of information, then we've
got two sources of truth: one copy in the files the server's actually
going to read, and another copy in the backup_manifest which we're
going to potentially use for validation but ignore at runtime. That
seems not great.

> Btw, just in case somebody suggests it: I don't think it's possible to
> compute the WAL checksums at this point. In stream mode WAL very well
> might already have been removed.

Right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-30 15:04:55 -0400, Robert Haas wrote:
> I guess I'd like to be clear here that I have no fundamental
> disagreement with taking this tool in any direction that people would
> like it to go. For me it's just a question of timing. Feature freeze
> is now a week or so away, and nothing complicated is going to get done
> in that time. If we can all agree on something simple based on
> Andres's recent proposal, cool, but I'm not yet sure that will be the
> case, so what's plan B? We could decide that what I have here is just
> too little to be a viable facility on its own, but I think Stephen is
> the only one taking that position. We could release it as
> pg_validatemanifest with a plan to rename it if other backup-related
> checks are added later. We could release it as pg_validatebackup with
> the idea to avoid having to rename it when more backup-related checks
> are added later, but with a greater possibility of confusion in the
> meantime and no hard guarantee that anyone will actually develop such
> checks. We could put it in to pg_checksums, but I think that's really
> backing ourselves into a corner: if backup validation develops other
> checks that are not checksum-related, what then? I'd much rather
> gamble on keeping things together by topic (backup) than technology
> used internally (checksum). Putting it into pg_basebackup is another
> option, and would avoid that problem, but it's not my preferred
> option, because as I noted before, I think the command-line options
> will get confusing.

I'm mildly inclined to name it pg_validate, pg_validate_dbdir or
such. And eventually (definitely not this release) subsume pg_checksums
in it. That way we can add other checkers too.

I don't really see a point in ending up with lots of different commands
over time. Partially because there's probably plenty checks where the
overall cost can be drastically reduced by combining IO. Partially
because there's probably plenty shareable infrastructure. And partially
because I think it makes discovery for users a lot easier.

Greetings,

Andres Freund




Re: tweaking perfect hash multipliers

2020-03-30 Thread John Naylor
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> > Then I used the attached program to measure various combinations of
> > compiled instructions using two constant multipliers iterating over
> > bytes similar to a generated hash function.
>
> It looks like you didn't attach the program?

Funny, I did, but then decided to rename the files. Here they are. I
tried to make the loop similar to how it'd be in the actual hash
function, but leaving out the post-loop modulus and array access. Each
loop iteration is dependent on the last one's result.
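
For a rough idea of the shape of such a micro-benchmark (the multipliers 31
and 127 and the key string below are placeholders, not the values from the
attached files):

#include <stdio.h>
#include <time.h>

int
main(void)
{
	static const char keys[] = "abcdefghijklmnopqrstuvwxyz";
	volatile unsigned int sink = 0;
	clock_t		start = clock();

	for (long i = 0; i < 10 * 1000 * 1000; i++)
	{
		/* seed with i so the work cannot be hoisted out of the outer loop */
		unsigned int a = (unsigned int) i;
		unsigned int b = (unsigned int) i;

		for (const char *p = keys; *p; p++)
		{
			/* two constant multipliers; each step depends on the previous one */
			a = a * 31 + (unsigned char) *p;
			b = b * 127 + (unsigned char) *p;
		}
		sink += a + b;			/* keep the result live */
	}

	printf("elapsed: %.3fs (sink=%u)\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC, sink);
	return 0;
}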

> It's a bit complicated by the fact that there's more execution ports to
> execute shift/add than there ports to compute some form of leas. And
> some of that won't easily be measurable in a micro-benchmark, because
> there'll be dependencies between the instruction preventing any
> instruction level parallelism.
>
> I think the form of lea generated here is among the ones that can only
> be executed on port 1. Whereas e.g. an register+register/immediate add
> can be executed on four different ports.

That's interesting, I'll have to look into that.

Thanks for the info!

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


[Attachments: test-const-mult.c, test-const-mult-2.c]


Re: [PATCH] remove condition always true (/src/backend/access/heap/vacuumlazy.c)

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-30 15:07:40 -0300, Ranier Vilela wrote:
> I'm not sure that the patch is 100% correct.

This is *NOT* correct.


> The fix is about an expression that is always true.
> If this patch is correct, it fixes one possible bug.
> 
> The comment says:
> * Perform checking of FSM after releasing lock, the fsm is
> * approximate, after all.
> 
> But this is not what the code does, apparently it checks before unlocking.

No, it doesn't. The freespace check isn't the PageIsNew(), it's the
GetRecordedFreeSpace() call. Which happens after unlocking.

Greetings,

Andres Freund




Re: backup manifests

2020-03-30 Thread Robert Haas
On Mon, Mar 30, 2020 at 2:24 AM Amit Kapila  wrote:
> > Between those two, I would use "pg_validatebackup" if there's a fair chance 
> > it
> > will end up doing the pg_waldump check.  Otherwise, I would use
> > "pg_validatemanifest".
>
> +1.

I guess I'd like to be clear here that I have no fundamental
disagreement with taking this tool in any direction that people would
like it to go. For me it's just a question of timing. Feature freeze
is now a week or so away, and nothing complicated is going to get done
in that time. If we can all agree on something simple based on
Andres's recent proposal, cool, but I'm not yet sure that will be the
case, so what's plan B? We could decide that what I have here is just
too little to be a viable facility on its own, but I think Stephen is
the only one taking that position. We could release it as
pg_validatemanifest with a plan to rename it if other backup-related
checks are added later. We could release it as pg_validatebackup with
the idea to avoid having to rename it when more backup-related checks
are added later, but with a greater possibility of confusion in the
meantime and no hard guarantee that anyone will actually develop such
checks. We could put it in to pg_checksums, but I think that's really
backing ourselves into a corner: if backup validation develops other
checks that are not checksum-related, what then? I'd much rather
gamble on keeping things together by topic (backup) than technology
used internally (checksum). Putting it into pg_basebackup is another
option, and would avoid that problem, but it's not my preferred
option, because as I noted before, I think the command-line options
will get confusing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-30 14:35:40 -0400, Robert Haas wrote:
> On Sun, Mar 29, 2020 at 10:08 PM Andres Freund  wrote:
> > See the attached minimal prototype for what I am thinking of.
> >
> > This would not correctly handle the case where the timeline changes
> > while taking a base backup. But I'm not sure that'd be all that serious
> > a limitation for now?
> >
> > I'd personally not want to use a base backup that included a timeline
> > switch...
>
> Interesting concept. I've never (or almost never) used the -s and -e
> options to pg_waldump, so I didn't think about using those.

Oh - it's how I use it most of the time when investigating a specific
problem. I just about always use -s, and often -e. Besides just reducing
the logging output, and avoiding spurious errors, it makes it a lot
easier to iteratively expand the logging for records that are
problematic for the case at hand.


> I think
> having a --just-parse option to pg_waldump is a good idea, though
> maybe not with that name e.g. we could call it --quiet.

Yea, I didn't like the option's name. It's just the first thing that
came to mind.


> It is less obvious to me what to do about all that as it pertains to
> the current patch.

FWIW, I personally think we can live with this not validating WAL in the
first release. But I also think it'd be within reach to do better and
allow for WAL verification.


> If we want pg_validatebackup to run pg_waldump in that mode or print
> out a hint about how to run pg_waldump in that mode, it would need to
> obtain the relevant LSNs.

We could just include those in the manifest. Seems like good information
to have in there to me, as it allows to build the complete list of files
needed for a restore.


> It's not clear to me what we would do if the backup crosses a timeline
> switch, assuming that's even a case pg_basebackup allows.

I've not tested it, but it sure looks like it's possible. Both by having
a standby replaying from a node that promotes (multiple timeline
switches possible too, I think, if the WAL source follows timelines),
and by backing up from a standby that's being promoted.


> If we don't want to do anything in pg_validatebackup automatically but
> just want to document this as a a possible technique, we could finesse
> that problem with some weasel-wording.

It'd probably not be too hard to simply emit multiple commands, one for
each timeline "segment".

I wonder if it'd not be best, independent of whether we build in this
verification, to include that metadata in the manifest file. That's for
sure better than having to build a separate tool to parse timeline
history files.

I think it wouldn't be too hard to compute that information while taking
the base backup. We know the end timeline (ThisTimeLineID), so we can
just call readTimeLineHistory(ThisTimeLineID). Which should then allow
for something pretty trivial along the lines of

timelines = readTimeLineHistory(ThisTimeLineID);
last_start = InvalidXLogRecPtr;
foreach(lc, timelines)
{
	TimeLineHistoryEntry *he = lfirst(lc);

	if (he->end < startptr)
		continue;

	/* emit the part of the backup's WAL range that falls on this timeline */
	manifest_emit_wal_range(Max(he->begin, startptr), he->end);
	last_start = he->end;
}

if (last_start == InvalidXLogRecPtr)
	start = startptr;
else
	start = last_start;

manifest_emit_wal_range(start, endptr);


Btw, just in case somebody suggests it: I don't think it's possible to
compute the WAL checksums at this point. In stream mode WAL very well
might already have been removed.

Greetings,

Andres Freund




Re: backup manifests

2020-03-30 Thread Robert Haas
On Sun, Mar 29, 2020 at 10:08 PM Andres Freund  wrote:
> See the attached minimal prototype for what I am thinking of.
>
> This would not correctly handle the case where the timeline changes
> while taking a base backup. But I'm not sure that'd be all that serious
> a limitation for now?
>
> I'd personally not want to use a base backup that included a timeline
> switch...

Interesting concept. I've never (or almost never) used the -s and -e
options to pg_waldump, so I didn't think about using those. I think
having a --just-parse option to pg_waldump is a good idea, though
maybe not with that name e.g. we could call it --quiet.

It is less obvious to me what to do about all that as it pertains to
the current patch. If we want pg_validatebackup to run pg_waldump in
that mode or print out a hint about how to run pg_waldump in that
mode, it would need to obtain the relevant LSNs. I guess that would
require reading the backup_label file. It's not clear to me what we
would do if the backup crosses a timeline switch, assuming that's even
a case pg_basebackup allows. If we don't want to do anything in
pg_validatebackup automatically but just want to document this as a a
possible technique, we could finesse that problem with some
weasel-wording.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-30 Thread Justin Pryzby
On Mon, Mar 30, 2020 at 09:02:22PM +0300, Alexey Kondratov wrote:
> Hmm, I went through the SQL commands in Postgres that are well known to me,
> and a bit more. A parenthesized options list is mostly used in two common cases:

There's also ANALYZE(VERBOSE), REINDEX(VERBOSE).
There was debate a year ago [0] as to whether to make "reindex CONCURRENTLY" a
separate command, or to use parenthesized syntax "REINDEX (CONCURRENTLY)".  I
would propose to support that now (and implemented that locally).

..and explain(...)

> - In the beginning for boolean options only, e.g. VACUUM

You're right that those are currently boolean, but note that explain(FORMAT ..)
is not boolean.

> Putting it into the WITH (...) options list looks like an option to me.
> However, doing it only for VACUUM will ruin the consistency, while doing it
> for CLUSTER and REINDEX is not necessary, so I do not like it either.

It's not necessary but I think it's a more flexible way to add new
functionality (requiring no changes to the grammar for vacuum, and for
REINDEX/CLUSTER it would allow future options to avoid changing the grammar).

If we use parenthesized syntax for vacuum, my proposal is to do it for
REINDEX, and consider adding parenthesized syntax for cluster, too.

> To summarize, currently I see only 2 + 1 extra options:
> 
> 1) Keep everything with syntax as it is in 0001-0002
> 2) Implement tail syntax for VACUUM, but with limitation for VACUUM FULL of
> the entire database + TABLESPACE change
> 3) Change TABLESPACE to a fully reserved word

+ 4) Use parenthesized syntax for all three.

Note, I mentioned that maybe VACUUM/CLUSTER should support not only "TABLESPACE
foo" but also "INDEX TABLESPACE bar" (I would use that, too).  I think that
would be easy to implement, and for sure it would suggest using () for both.
(For sure we don't want to implement "VACUUM t TABLESPACE foo" now, and then
later implement "INDEX TABLESPACE bar" and realize that for consistency we
cannot parenthesize it.)

Michael ? Alvaro ? Robert ?

-- 
Justin




Re: tweaking perfect hash multipliers

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> Then I used the attached program to measure various combinations of
> compiled instructions using two constant multipliers iterating over
> bytes similar to a generated hash function.

It looks like you didn't attach the program?


>  -O2 -Wall test-const-mult.c test-const-mult-2.c
> ./a.out
> Median of 3 with clang 10:
> 
> lea, lea 0.181s
> 
> lea, lea+add 0.248s
>   lea, shift+add 0.251s
> 
>   lea+add, shift+add 0.273s
> shift+add, shift+add 0.276s
> 
>   2 leas, 2 leas 0.290s
>  shift+add, imul 0.329s
> 
> Taking this with a grain of salt, it nonetheless seems plausible that
> a single lea could be faster than any two instructions here.

It's a bit complicated by the fact that there's more execution ports to
execute shift/add than there ports to compute some form of leas. And
some of that won't easily be measurable in a micro-benchmark, because
there'll be dependencies between the instruction preventing any
instruction level parallelism.

I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. an register+register/immediate add
can be executed on four different ports.

There's also a significant difference in latency that you might not see
in your benchmark. E.g. on coffee lake the relevant form of lea has a
latency of three cycles, but one independent lea can be "started" per
cycle (agner calls this "reciprocal throughput"). Whereas a shift has a
latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better),
and an add has a latency of 1 and a reciprocal throughput of 0.25.

See the tables in  https://www.agner.org/optimize/instruction_tables.pdf
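
As a toy illustration of how the multiplier choice drives the generated code
(the constants here are arbitrary examples, not taken from the benchmark):
a small 2^k+1 multiplier is usually lowered to an lea, while an arbitrary
constant usually needs an imul:

unsigned int
step_cheap(unsigned int h, unsigned char c)
{
	return h * 9 + c;			/* typically an lea plus an add */
}

unsigned int
step_imul(unsigned int h, unsigned char c)
{
	return h * 257005723 + c;	/* typically an imul plus an add */
}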

I'm not really sure my musings above matter terribly much, but I just
wanted to point out why I'd not take too much stock in the above timings
in isolation. Even a very high latency wouldn't necessarily be penalized
in a benchmark with one loop iteration independent from each other, but
would matter in the real world.


Cool work!

Greetings,

Andres Freund




Re: Recognizing superuser in pg_hba.conf

2020-03-30 Thread Tom Lane
Tomas Vondra  writes:
> I see this patch is marked as RFC since 12/30, but there seems to be
> quite a lot of discussion about the syntax, keywords and how exactly to
> identify the superuser. So I'll switch it back to needs review, which I
> think is a better representation of the current state.

Somebody switched it to RFC again, despite the facts that

(a) there is absolutely no consensus about what syntax to use
(and some of the proposals imply very different patches),

(b) there's been no discussion at all since the last CF, and

(c) the patch is still failing in the cfbot (src/test/ssl fails).

While resolving (c) would seem to be the author's problem, I don't
think it's worth putting effort into that detail until we have
some meeting of the minds about (a).  So I'll put this back to
"needs review".

regards, tom lane




Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-30 Thread Julien Rouhaud
On Mon, Mar 30, 2020 at 6:36 PM Fujii Masao  wrote:
>
> On 2020/03/30 17:03, Julien Rouhaud wrote:
> > On Mon, Mar 30, 2020 at 01:56:43PM +0900, Fujii Masao wrote:
> >>
> >>
> >> On 2020/03/29 15:15, Julien Rouhaud wrote:
> >>> On Fri, Mar 27, 2020 at 03:42:50PM +0100, Julien Rouhaud wrote:
>  On Fri, Mar 27, 2020 at 2:01 PM Fujii Masao 
>   wrote:
> >
> 
> > So what I'd like to say is that the information that users are 
> > interested
> > in would vary on each situation and case. At least for me it seems
> > enough for pgss to report only the basic information. Then users
> > can calculate to get the numbers (like total_time) they're interested 
> > in,
> > from those basic information.
> >
> > But of course, I'd like to hear more opinions about this...
> 
>  +1
> 
>  Unless someone chime in by tomorrow, I'll just drop the sum as it
>  seems less controversial and not a blocker in userland if users are
>  interested.
> >>>
> >>> Done in attached v11, with also the s/querytext/query_text/ discrepancy 
> >>> noted
> >>> previously.
> >>
> >> Thanks for updating the patch! But I still think query_string is better
> >> name because it's used in other several places, for the sake of 
> >> consistency.
> >
> > You're absolutely right.  That's what I actually wanted to do given your
> > previous comment, but somehow managed to miss it, sorry about that and 
> > thanks
> > for fixing.
> >
> >> So I changed the argument name that way and commit the 0001 patch.
> >> If you think query_text is better, let's keep discussing this topic!
> >>
> >> Anyway many thanks for your great job!
> >
> > Thanks a lot!
> >
> >>
> >> I also exported BufferUsageAccumDiff as mentioned previously, as it 
> >> seems
> >> clearner and will avoid future useless code churn, and run pgindent.
> >
> > Many thanks!! I'm thinking to commit this part separately.
> > So I made that patch based on your patch. Attached.
> 
>  Thanks! It looks good to me.
> >>>
> >>> I also kept that part in a distinct commit for convenience.
> >>
> >> I also pushed 0002 patch. Thanks!
> >>
> >> I will review 0003 patch again.
> >
> > And thanks for that too :)
>
> While testing the patched pgss, I found that the patched version
> may track the statements that the original version doesn't.
> Please imagine the case where the following queries are executed,
> with pgss.track = top.
>
>  PREPARE hoge AS SELECT * FROM t;
>  EXPLAIN EXECUTE hoge;
>
> The pgss view returned "PREPARE hoge AS SELECT * FROM t"
> in the patched version, but not in the orignal version.
>
> Is this problematic?

Oh indeed. That's a side effect of the executed query and the planned
query being different.

I guess the question is whether, when the top-level executed query is a
utility statement containing an optimisable query, the planner call for
that optimisable statement should be considered top level or not.  I tend
to think that's the correct behavior here, as this is also what would
happen if a regular DML statement was provided.  What do you think?




Re: pgsql: Improve handling of parameter differences in physical replicatio

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-30 19:41:43 +0900, Fujii Masao wrote:
> On 2020/03/30 16:58, Peter Eisentraut wrote:
> > Improve handling of parameter differences in physical replication
> > 
> > When certain parameters are changed on a physical replication primary,
> > this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
> > record.  The standby then checks whether its own settings are at least
> > as big as the ones on the primary.  If not, the standby shuts down
> > with a fatal error.
> > 
> > The correspondence of settings between primary and standby is required
> > because those settings influence certain shared memory sizings that
> > are required for processing WAL records that the primary might send.
> > For example, if the primary sends a prepared transaction, the standby
> > must have had max_prepared_transactions set appropriately or it won't
> > be able to process those WAL records.
> > 
> > However, fatally shutting down the standby immediately upon receipt of
> > the parameter change record might be a bit of an overreaction.  The
> > resources related to those settings are not required immediately at
> > that point, and might never be required if the activity on the primary
> > does not exhaust all those resources.  If we just let the standby roll
> > on with recovery, it will eventually produce an appropriate error when
> > those resources are used.
> > 
> > So this patch relaxes this a bit.  Upon receipt of
> > XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
> > warning and set a global flag if there is a problem.  Then when we
> > actually hit the resource issue and the flag was set, we issue another
> > warning message with relevant information.

I find it somewhat hostile that we don't display the actual resource
error once the problem is hit - we just pause. Sure, there's going to be
some previous log entry explaining what the actual parameter difference
is - but that could have been weeks ago. So either hard to find, or even
rotated out.


> > At that point we pause recovery, so a hot standby remains usable.
> > We also repeat the last warning message once a minute so it is
> > harder to miss or ignore.


I can't really imagine that the adjustments made in this patch are
sufficient.

One important issue seems to me to be the size of the array that
TransactionIdIsInProgress() allocates:
	/*
	 * If first time through, get workspace to remember main XIDs in. We
	 * malloc it permanently to avoid repeated palloc/pfree overhead.
	 */
	if (xids == NULL)
	{
		/*
		 * In hot standby mode, reserve enough space to hold all xids in the
		 * known-assigned list. If we later finish recovery, we no longer need
		 * the bigger array, but we don't bother to shrink it.
		 */
		int			maxxids = RecoveryInProgress() ?
		TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;

		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
		if (xids == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
	}

Which I think means we'll just overrun the xids array in some cases,
e.g. if KnownAssignedXids overflowed.  Obviously we should have a
crosscheck in the code (which we don't), but it was previously a
supposedly unreachable path.
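
For illustration, the missing crosscheck would be something along these lines
at the point where the function remembers another XID (xids and maxxids appear
in the code quoted above; nxids and pxid are placeholders for the names used at
that point, and the message text is made up):

	/* hypothetical guard before appending to the workspace array */
	if (nxids >= maxxids)
		elog(ERROR, "too many in-progress XIDs (%d) for workspace of %d entries",
			 nxids, maxxids);
	xids[nxids++] = pxid;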

Similarly, the allocation in GetSnapshotData() will be too small, I
think:
	if (snapshot->xip == NULL)
	{
		/*
		 * First call for this snapshot. Snapshot is same size whether or not
		 * we are in recovery, see later comments.
		 */
		snapshot->xip = (TransactionId *)
			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
		if (snapshot->xip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		Assert(snapshot->subxip == NULL);
		snapshot->subxip = (TransactionId *)
			malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
		if (snapshot->subxip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
	}

I think basically all code using TOTAL_MAX_CACHED_SUBXIDS,
GetMaxSnapshotSubxidCount(), PROCARRAY_MAXPROCS needs to be reviewed
much more carefully than done here.


Also, shouldn't dynahash be adjusted as well? There's e.g. the
following HASH_ENTER path:
/* report a generic message */
if (hashp->isshared)

[PATCH] remove condition always true (/src/backend/access/heap/vacuumlazy.c)

2020-03-30 Thread Ranier Vilela
Hi,
I'm not sure that the patch is 100% correct.
The fix is about an expression that is always true.
If this patch is correct, it fixes one possible bug.

The comment says:
* Perform checking of FSM after releasing lock, the fsm is
* approximate, after all.

But this is not what the code does, apparently it checks before unlocking.

best regards,
Ranier Vilela


[Attachment: remove_condition_always_true.patch]


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-30 Thread Alexey Kondratov

On 2020-03-28 03:11, Justin Pryzby wrote:

On Thu, Mar 26, 2020 at 11:01:06PM -0500, Justin Pryzby wrote:

> Another issue is this:
> > +VACUUM ( FULL [, ...] ) [ TABLESPACE new_tablespace 
] [ table_and_columns [, ...] ]
> As you mentioned in your v1 patch, in the other cases, "tablespace
> [tablespace]" is added at the end of the command rather than in the middle.  I
> wasn't able to make that work, maybe because "tablespace" isn't a fully
> reserved word (?).  I didn't try with "SET TABLESPACE", although I understand
> it'd be better without "SET".




SET does not change anything in my experience. The problem is that 
opt_vacuum_relation_list is... optional and TABLESPACE is not a fully 
reserved word (why?) as you correctly noted. I have managed to put 
TABLESPACE to the end, but with vacuum_relation_list, like:


| VACUUM opt_full opt_freeze opt_verbose opt_analyze 
vacuum_relation_list TABLESPACE name
| VACUUM '(' vac_analyze_option_list ')' vacuum_relation_list TABLESPACE 
name


It means that one would not be able to do VACUUM FULL of the entire 
database + TABLESPACE change. I do not think that it is a common 
scenario, but this limitation would be very annoying, wouldn't it?






I think we should use the parenthesized syntax for vacuum - it seems
clear in hindsight.

Possibly REINDEX should use that, too, instead of adding OptTablespace
at the end.  I'm not sure.

The attached mostly implements generic parenthesized options to
REINDEX(...), so I'm soliciting opinions: should TABLESPACE be
implemented in parenthesized syntax or non?

CLUSTER doesn't support parenthesized syntax, but .. maybe it should?

And this ?



Hmm, I went through the SQL commands in Postgres that are well known to
me, and a bit more. A parenthesized options list is mostly used in two common cases:


- In the beginning for boolean options only, e.g. VACUUM
- In the end for options of a various type, but accompanied by WITH, 
e.g. COPY, CREATE SUBSCRIPTION


Moreover, TABLESPACE is already used in CREATE TABLE/INDEX in the same 
way I did in 0001-0002. That way, putting TABLESPACE option into the 
parenthesized options list does not look to be convenient and 
semantically correct, so I do not like it. Maybe others will have a 
different opinion.


Putting it into the WITH (...) options list looks like an option to me. 
However, doing it only for VACUUM will ruin the consistency, while doing 
it for CLUSTER and REINDEX is not necessary, so I do not like it either.


To summarize, currently I see only 2 + 1 extra options:

1) Keep everything with syntax as it is in 0001-0002
2) Implement tail syntax for VACUUM, but with limitation for VACUUM FULL 
of the entire database + TABLESPACE change

3) Change TABLESPACE to a fully reserved word


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: fix for BUG #3720: wrong results at using ltree

2020-03-30 Thread Tom Lane
I wrote:
> Hence, attached are two revised patches that attack the problem
> this way.  The first one is somewhat unrelated to the original
> point --- it's trying to clean up the error messages in ltree_in
> and lquery_in so that they are more consistent and agree with
> the terminology used in the documentation.  (Notably, the term
> "level" is used nowhere in the ltree docs, except in the legacy
> function name nlevel().)

One thing I changed in that patch was to change the syntax error
reports to say "at character %d" not "in position %d", because
I thought the latter was pretty confusing --- it's not obvious
whether it's counting characters or labels or what.  However,
I now notice that what the code is providing is a zero-based
character index, which is out of line with our practice
elsewhere: core parser errors are reported using 1-based indexes.
If we reword these messages then we should adopt that practice too.
Hence, new patch versions that do it like that.  (0002 is unchanged.)

regards, tom lane

diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out
index c78d372..610cb6f 100644
--- a/contrib/ltree/expected/ltree.out
+++ b/contrib/ltree/expected/ltree.out
@@ -464,7 +464,7 @@ SELECT nlevel(('1' || repeat('.1', 65534))::ltree);
 (1 row)
 
 SELECT nlevel(('1' || repeat('.1', 65535))::ltree);
-ERROR:  number of ltree levels (65536) exceeds the maximum allowed (65535)
+ERROR:  number of ltree labels (65536) exceeds the maximum allowed (65535)
 SELECT nlevel(('1' || repeat('.1', 65534))::ltree || '1');
 ERROR:  number of ltree levels (65536) exceeds the maximum allowed (65535)
 SELECT ('1' || repeat('.1', 65534))::lquery IS NULL;
@@ -474,7 +474,7 @@ SELECT ('1' || repeat('.1', 65534))::lquery IS NULL;
 (1 row)
 
 SELECT ('1' || repeat('.1', 65535))::lquery IS NULL;
-ERROR:  number of lquery levels (65536) exceeds the maximum allowed (65535)
+ERROR:  number of lquery items (65536) exceeds the maximum allowed (65535)
 SELECT '*{65535}'::lquery;
   lquery  
 --
@@ -485,7 +485,7 @@ SELECT '*{65536}'::lquery;
 ERROR:  lquery syntax error
 LINE 1: SELECT '*{65536}'::lquery;
^
-DETAIL:  Low limit (65536) exceeds the maximum allowed (65535).
+DETAIL:  Low limit (65536) exceeds the maximum allowed (65535), at character 3.
 SELECT '*{,65534}'::lquery;
   lquery   
 ---
@@ -502,7 +502,12 @@ SELECT '*{,65536}'::lquery;
 ERROR:  lquery syntax error
 LINE 1: SELECT '*{,65536}'::lquery;
^
-DETAIL:  High limit (65536) exceeds the maximum allowed (65535).
+DETAIL:  High limit (65536) exceeds the maximum allowed (65535), at character 4.
+SELECT '*{4,3}'::lquery;
+ERROR:  lquery syntax error
+LINE 1: SELECT '*{4,3}'::lquery;
+   ^
+DETAIL:  Low limit (4) is greater than high limit (3), at character 5.
 SELECT '1.2'::ltree  < '2.2'::ltree;
  ?column? 
 --
diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 2503d47..e806a14 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -17,12 +17,6 @@ PG_FUNCTION_INFO_V1(lquery_in);
 PG_FUNCTION_INFO_V1(lquery_out);
 
 
-#define UNCHAR ereport(ERROR, \
-	   (errcode(ERRCODE_SYNTAX_ERROR), \
-		errmsg("syntax error at position %d", \
-		pos)));
-
-
 typedef struct
 {
 	char	   *start;
@@ -47,7 +41,12 @@ ltree_in(PG_FUNCTION_ARGS)
 	ltree	   *result;
 	ltree_level *curlevel;
 	int			charlen;
-	int			pos = 0;
+	int			pos = 1;		/* character position for error messages */
+
+#define UNCHAR ereport(ERROR, \
+	   errcode(ERRCODE_SYNTAX_ERROR), \
+	   errmsg("ltree syntax error at character %d", \
+			  pos))
 
 	ptr = buf;
 	while (*ptr)
@@ -61,7 +60,7 @@ ltree_in(PG_FUNCTION_ARGS)
 	if (num + 1 > LTREE_MAX_LEVELS)
 		ereport(ERROR,
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("number of ltree levels (%d) exceeds the maximum allowed (%d)",
+ errmsg("number of ltree labels (%d) exceeds the maximum allowed (%d)",
 		num + 1, LTREE_MAX_LEVELS)));
 	list = lptr = (nodeitem *) palloc(sizeof(nodeitem) * (num + 1));
 	ptr = buf;
@@ -88,10 +87,10 @@ ltree_in(PG_FUNCTION_ARGS)
 if (lptr->wlen > LTREE_LABEL_MAX_CHARS)
 	ereport(ERROR,
 			(errcode(ERRCODE_NAME_TOO_LONG),
-			 errmsg("name of level is too long"),
-			 errdetail("Name length is %d, must "
-	   "be < 256, in position %d.",
-	   lptr->wlen, pos)));
+			 errmsg("label string is too long"),
+			 errdetail("Label length is %d, must be at most %d, at character %d.",
+	   lptr->wlen, LTREE_LABEL_MAX_CHARS,
+	   pos)));
 
 totallen += MAXALIGN(lptr->len + LEVEL_HDRSIZE);
 lptr++;
@@ -115,10 +114,9 @@ ltree_in(PG_FUNCTION_ARGS)
 		if (lptr->wlen > LTREE_LABEL_MAX_CHARS)
 			ereport(ERROR,
 	(errcode(ERRCODE_NAME_TOO_LONG),
-	 errmsg("name of level is too long"),
-	 errdetail("Name length is %d, must "
-			   "be < 256, in position %d.",
-			   

Re: snapper vs. HEAD

2020-03-30 Thread Tom Lane
"Tom Turelinckx"  writes:
> Tom Lane wrote:
>> Thanks! But it doesn't seem to have taken: snapper just did a new run
>> that still failed, and it still seems to be using -O2.

> Snapper did build using -O1 a few hours ago, but it failed the check stage 
> very early with a different error:
> FATAL:  null value in column "classid" of relation "pg_depend" violates 
> not-null constraint

Ugh.  Compiler bugs coming out of the woodwork?

Not sure what to suggest here.  It certainly is useful to us to have
sparc buildfarm testing, but probably sparc64 would be more useful
than sparc32.  (It looks like FreeBSD and OpenBSD have dropped sparc32
altogether, and NetBSD has bumped it to tier II status.)  One idea
is to try -O0 ...

regards, tom lane




Re: snapper vs. HEAD

2020-03-30 Thread Tom Lane
Andres Freund  writes:
> On 2020-03-30 12:24:06 -0400, Tom Lane wrote:
>> (As of this weekend, it seemed to be impossible to find the wheezy sparc
>> distribution images on-line anymore.

> The installer downloads are still available at:
> https://www.debian.org/releases/wheezy/debian-installer/

Ah, I should have clarified.  That's 7.11, which I'd tried last time
I was interested in duplicating snapper, and I found that it does
not work under qemu's sparc emulation (my notes don't have much detail,
but some installer component was core-dumping).  What did work, after
some hair pulling, was 7.6 ... and that's the version I can no longer
find on-line.  But it has what seems to be the same gcc release as
snapper is using, so I figured it was close enough.

regards, tom lane




Re: snapper vs. HEAD

2020-03-30 Thread Tom Turelinckx
Hi,

Tom Lane wrote:
> Thanks! But it doesn't seem to have taken: snapper just did a new run
> that still failed, and it still seems to be using -O2.

Snapper did build using -O1 a few hours ago, but it failed the check stage very 
early with a different error:

FATAL:  null value in column "classid" of relation "pg_depend" violates 
not-null constraint

I then cleared out the ccache and forced a build of HEAD: same issue.

Next I cleared out the ccache and forced a build of HEAD with -O2: this is the 
one you saw.

Finally, I've cleared out both the ccache and the accache and forced a build of 
HEAD with -O1. It failed the check stage again very early with the above error.

> to move to a newer Debian LTS release? Or have they dropped Sparc
> support altogether?

Wheezy was the last stable release for Debian sparc. Sparc64 is a Debian ports 
architecture, but there are no stable releases for sparc64. I do maintain 
private sparc64 repositories for Stretch and Buster, and I could configure 
buildfarm animals for those (on faster hardware too), but those releases are 
not officially available.

Best regards,
Tom Turelinckx




Re: snapper vs. HEAD

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-30 12:24:06 -0400, Tom Lane wrote:
> "Tom Turelinckx"  writes:
> Yeah, I've got a couple of those myself.  But perhaps it'd be sensible
> to move to a newer Debian LTS release?  Or have they dropped Sparc
> support altogether?

I think the 32-bit sparc support has been phased out. Sparc64 isn't an
"official port", but there's a port:
https://wiki.debian.org/Sparc64
including seemingly regularly updated images.


> (As of this weekend, it seemed to be impossible to find the wheezy sparc
> distribution images on-line anymore.  Fortunately I still had a download
> of the dvd-1 image stashed away, or I would not have been able to recreate
> my qemu VM for the purpose.  It's going to be very hard for any other PG
> hackers to investigate that platform in future.)

They've been moved to archive.debian.org, but they should still be
downloadable. Seems like the website hasn't been quite updated to that
fact...

The installer downloads are still available at:
https://www.debian.org/releases/wheezy/debian-installer/

but sources.list would need to be pointed at something like

deb http://archive.debian.org/debian/ wheezy contrib main non-free

See also https://www.debian.org/distrib/archive

Greetings,

Andres Freund




Re: materialization blocks hash join

2020-03-30 Thread Tom Lane
Tomas Vondra  writes:
> That's because eqjoinsel_inner won't have any statistics for either side
> of the join, so it'll use default ndistinct values (200), resulting in
> estimate of 0.5% for the join condition.

Right.

> But this should not affect the choice of join algorithm, I think,
> because that's only the output of the join.

Lack of stats will also discourage use of a hash join, because the
default assumption in the absence of stats is that the join column
has a pretty non-flat distribution, risking clumping into a few
hash buckets.  Merge join is less sensitive to the data distribution
so it tends to come out as preferred in such cases.
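
If anyone wants to poke at this, a quick diagnostic (a sketch, assuming the
same test tables as upthread) is to disable merge joins for the session and
see what the planner picks, and at what cost, instead:

BEGIN;
SET LOCAL enable_mergejoin = off;
EXPLAIN
  WITH b AS MATERIALIZED (SELECT * FROM bigger),
       s AS MATERIALIZED (SELECT * FROM smaller)
  SELECT * FROM b JOIN s ON b.a = s.a;
ROLLBACK;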

regards, tom lane




Re: Add A Glossary

2020-03-30 Thread Corey Huinker
On Sun, Mar 29, 2020 at 5:29 AM Jürgen Purtz  wrote:

> On 27.03.20 21:12, Justin Pryzby wrote:
> > On Fri, Mar 20, 2020 at 11:32:25PM +0100, Jürgen Purtz wrote:
>  +Archiver
> >>> Can you change that to archiver process ?
> >> I prefer the short term without the addition of 'process' - concerning
> >> 'Archiver' as well as the other cases. But I'm not an native English
> >> speaker.
> > I didn't like it due to lack of context.
> >
> > What about "wal archiver" ?
> >
> > It occured to me when I read this.
> >
> https://www.postgresql.org/message-id/20200327.163007.128069746774242774.horikyota.ntt%40gmail.com
> >
> "WAL archiver" is ok for me. In the current documentation we have 2
> places with "WAL archiver" and 4 with "archiver"-only
> (high-availability.sgml, monitoring.sgml).
>
> "backend process" is an exception to the other terms because the
> standalone term "backend" is sensibly used in diverse situations.
>
> Kind regards, Jürgen
>

I've taken Alvaro's fixes and done my best to incorporate the feedback
into a new patch, which Roger's (tech writer) reviewed yesterday.

The changes are too numerous to list, but the highlights are:

New definitions:
* All four ACID terms
* Vacuum (split off from Autovacuum)
* Tablespace
* WAL Archiver (replaces Archiver)

Changes to existing terms:
* Implemented most wording changes recommended by Justin
* all remaining links were either made into xrefs or edited out of existence

* de-tagged most second uses of a term within a definition


Did not do
* Addressed the " Process" suffix suggested by Justin. There isn't
consensus on these changes, and I'm neutral on the matter
* change the Cast definition. I think it's important to express that a cast
has a FROM datatype as well as a TO
* anything host/server related as I couldn't see a consensus reached

Other thoughts:
* Trivial definitions that are just see-other-definition are ok with me, as
the goal of this glossary is to aid in discovery of term meanings, so
knowing that two terms are interchangable is itself helpful


It is my hope that this revision represents the final _structural_ change
to the glossary. New definitions and edits to existing definitions will, of
course, go on forever.
From 8a163603102f51a3eddfb05c51baf3b840c5d7f7 Mon Sep 17 00:00:00 2001
From: coreyhuinker 
Date: Mon, 30 Mar 2020 13:08:27 -0400
Subject: [PATCH] glossary v4

---
 doc/src/sgml/filelist.sgml |1 +
 doc/src/sgml/glossary.sgml | 1551 
 doc/src/sgml/postgres.sgml |1 +
 3 files changed, 1553 insertions(+)
 create mode 100644 doc/src/sgml/glossary.sgml

diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 1043d0f7ab..cf21ef857e 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -170,6 +170,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 00..eab14f3c9b
--- /dev/null
+++ b/doc/src/sgml/glossary.sgml
@@ -0,0 +1,1551 @@
+
+ Glossary
+ 
+  This is a list of terms and their meaning in the context of
+  PostgreSQL and relational database
+  systems in general.
+ 
+  
+   
+Aggregating
+
+ 
+  The act of combining a collection of data (input) values into
+  a single output value, which may not be of the same type as the
+  input values.
+ 
+
+   
+
+   
+Aggregate Function
+
+ 
+  A Function that combines multiple input values,
+  for example by counting, averaging or adding them all together,
+  yielding a single output value.
+ 
+ 
+  For more information, see
+  .
+ 
+ 
+  See also Window Function.
+ 
+
+   
+
+   
+Analytic
+
+ 
+  A Function whose computed value can reference
+  values found in nearby Rows of the same
+  Result Set.
+ 
+ 
+  For more information, see
+  .
+ 
+
+   
+
+   
+Atomic
+
+ 
+  In reference to the value of an Attribute or
+  Datum: an item that cannot be broken down
+  into smaller components.
+ 
+ 
+  In reference to an operation: an event that cannot be completed in
+  part; it must either entirely succeed or entirely fail. For
+  example, a series of SQL statements can be
+  combined into a Transaction, and that
+  transaction is said to be atomic.
+  Atomic.
+ 
+
+   
+
+   
+Atomicity
+
+ 
+  One of the ACID properties. This is the state of 
+  being Atomic in the operational/transactional sense.
+ 
+
+   
+
+   
+Attribute
+
+ 
+  An element with a certain name and data type found within a
+  Tuple or Table.
+ 
+
+   
+
+   
+Autovacuum
+
+ 
+  Background Worker processes that routinely
+  perform Vacuum operations.
+ 
+ 
+  For more information, see
+  .
+ 
+
+   
+
+   
+Backend Process
+
+ 
+  Processes of an 

Re: ECPG: proposal for new DECLARE STATEMENT

2020-03-30 Thread David Steele

On 1/11/20 10:41 PM, Tomas Vondra wrote:

On Sun, Jan 12, 2020 at 03:52:48AM +0100, Tomas Vondra wrote:

...

I'm not an ecpg expert (in fact I've never even used it), so my review
is pretty superficial, but I only found a couple of minor whitespace
issues (adding/removing a line/tab) - see the attached file.



Meh, forgot to attach the file ...


Any thoughts on Tomas' comments?

A big part of moving a patch forward is keeping the thread active and 
answering comments/review.


Regards,
--
-David
da...@pgmasters.net




Re: materialization blocks hash join

2020-03-30 Thread Tomas Vondra

On Mon, Mar 30, 2020 at 06:14:42PM +0200, Pavel Stehule wrote:

On Mon, 30 Mar 2020 at 18:06, Pavel Stehule wrote:


Hi

when I was in talk with Silvio Moioli, I found strange hash join. Hash was
created from bigger table.


https://www.postgresql.org/message-id/79dd683d-3296-1b21-ab4a-28fdc2d98807%40suse.de

Now it looks like a materialized CTE disallows the hash join


create table bigger(a int);
create table smaller(a int);
insert into bigger select random()* 1 from generate_series(1,10);
insert into smaller select i from generate_series(1,10) g(i);

analyze bigger, smaller;

-- no problem
explain analyze select * from bigger b join smaller s on b.a = s.a;

postgres=# explain analyze select * from bigger b join smaller s on b.a =
s.a;
 QUERY PLAN



 Hash Join  (cost=3084.00..7075.00 rows=10 width=8) (actual
time=32.937..87.276 rows=4 loops=1)
   Hash Cond: (b.a = s.a)
   ->  Seq Scan on bigger b  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.028..8.546 rows=10 loops=1)
   ->  Hash  (cost=1443.00..1443.00 rows=10 width=4) (actual
time=32.423..32.423 rows=10 loops=1)
 Buckets: 131072  Batches: 2  Memory Usage: 2785kB
 ->  Seq Scan on smaller s  (cost=0.00..1443.00 rows=10
width=4) (actual time=0.025..9.931 rows=10 loops=1)
 Planning Time: 0.438 ms
 Execution Time: 91.193 ms
(8 rows)

but with materialized CTE

postgres=# explain analyze with b as materialized (select * from bigger),
s as materialized (select * from smaller) select * from b join s on b.a =
s.a;
  QUERY PLAN


--
 Merge Join  (cost=23495.64..773995.64 rows=5000 width=8) (actual
time=141.242..193.375 rows=4 loops=1)
   Merge Cond: (b.a = s.a)
   CTE b
 ->  Seq Scan on bigger  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.026..11.083 rows=10 loops=1)
   CTE s
 ->  Seq Scan on smaller  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.015..9.161 rows=10 loops=1)
   ->  Sort  (cost=10304.82..10554.82 rows=10 width=4) (actual
time=78.775..90.953 rows=10 loops=1)
 Sort Key: b.a
 Sort Method: external merge  Disk: 1376kB
 ->  CTE Scan on b  (cost=0.00..2000.00 rows=10 width=4)
(actual time=0.033..39.274 rows=10 loops=1)
   ->  Sort  (cost=10304.82..10554.82 rows=10 width=4) (actual
time=62.453..74.004 rows=6 loops=1)
 Sort Key: s.a
 Sort Method: external sort  Disk: 1768kB
 ->  CTE Scan on s  (cost=0.00..2000.00 rows=10 width=4)
(actual time=0.018..31.669 rows=10 loops=1)
 Planning Time: 0.303 ms
 Execution Time: 199.919 ms
(16 rows)

It doesn't use hash join - the estimations are perfect, but plan is
suboptimal



I was wrong, the estimation on CTE is ok, but JOIN estimation is bad

Merge Join  (cost=23495.64..773995.64 rows=5000 width=8) (actual
time=141.242..193.375 rows=4 loops=1)



That's because eqjoinsel_inner won't have any statistics for either side
of the join, so it'll use default ndistinct values (200), resulting in
estimate of 0.5% for the join condition.

But this should not affect the choice of join algorithm, I think,
because that's only the output of the join.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-30 Thread Fujii Masao




On 2020/03/30 17:03, Julien Rouhaud wrote:

On Mon, Mar 30, 2020 at 01:56:43PM +0900, Fujii Masao wrote:



On 2020/03/29 15:15, Julien Rouhaud wrote:

On Fri, Mar 27, 2020 at 03:42:50PM +0100, Julien Rouhaud wrote:

On Fri, Mar 27, 2020 at 2:01 PM Fujii Masao  wrote:





So what I'd like to say is that the information users are interested
in varies with each situation and case. At least to me it seems
enough for pgss to report only the basic information. Then users
can calculate the numbers they're interested in (like total_time)
from that basic information.

But of course, I'd like to hear more opinions about this...


+1

Unless someone chimes in by tomorrow, I'll just drop the sum, as it
seems less controversial and is not a blocker in userland if users are
interested.


Done in attached v11, with also the s/querytext/query_text/ discrepancy noted
previously.


Thanks for updating the patch! But I still think query_string is a better
name because it's used in several other places, for the sake of consistency.


You're absolutely right.  That's what I actually wanted to do given your
previous comment, but somehow managed to miss it, sorry about that and thanks
for fixing.


So I changed the argument name that way and commit the 0001 patch.
If you think query_text is better, let's keep discussing this topic!

Anyway many thanks for your great job!


Thanks a lot!




I also exported BufferUsageAccumDiff as mentioned previously, as it seems
clearner and will avoid future useless code churn, and run pgindent.


Many thanks!! I'm thinking to commit this part separately.
So I made that patch based on your patch. Attached.


Thanks! It looks good to me.


I also kept that part in a distinct commit for convenience.


I also pushed 0002 patch. Thanks!

I will review 0003 patch again.


And thanks for that too :)


While testing the patched pgss, I found that the patched version
may track the statements that the original version doesn't.
Please imagine the case where the following queries are executed,
with pgss.track = top.

PREPARE hoge AS SELECT * FROM t;
EXPLAIN EXECUTE hoge;

The pgss view returned "PREPARE hoge AS SELECT * FROM t"
in the patched version, but not in the original version.

Is this problematic?
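
For reference, a minimal way to reproduce this, assuming pg_stat_statements
is installed and pg_stat_statements.track = 'top':

CREATE TABLE t (a int);
SELECT pg_stat_statements_reset();
PREPARE hoge AS SELECT * FROM t;
EXPLAIN EXECUTE hoge;
SELECT query, calls FROM pg_stat_statements ORDER BY query;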

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Error on failed COMMIT

2020-03-30 Thread Vik Fearing
On 3/30/20 6:05 PM, Dave Cramer wrote:
> So it appears this is currently languishing as unresolved and feature
> freeze is imminent.
> 
> What has to be done to get a decision one way or another before feature
> freeze?
> 
> I have provided a patch that could be reviewed and at least be considered
> in the commitfest.
> 
> Perhaps someone can review the patch and I can do whatever it takes to get
> it presentable ?


I don't know enough about that part of the code to give a meaningful
review, but I will give my full support to the patch.  (I hadn't
expressed an opinion either way yet.)
-- 
Vik Fearing




Re: error context for vacuum to include block number

2020-03-30 Thread Justin Pryzby
On Mon, Mar 30, 2020 at 02:31:53PM +0530, Amit Kapila wrote:
> Now that the main patch is committed, I have reviewed the other two patches.

Thanks for that

On Mon, Mar 30, 2020 at 02:31:53PM +0530, Amit Kapila wrote:
> The v37-0003-Avoid-some-calls-to-RelationGetRelationName.patch looks
> good to me.  I have added the commit message in the patch.

I realized the 0003 patch has an error in lazy_vacuum_index; it should be:

-   RelationGetRelationName(indrel),
+   vacrelstats->indname,

That was maybe due to originally using a separate errinfo for each phase, with
one "char *relname" and no "char *indrel".

> I don't think the above change is correct.  How will vacrelstats have
> correct values when vacuum_one_index is called via parallel workers
> (via parallel_vacuum_main)?

You're right: parallel main's vacrelstats was added by this patchset and only
the error context fields were initialized.  I fixed it up in the attached by
also setting vacrelstats->new_rel_tuples and old_live_tuples.  It's not clear
if this is worth it just to save an argument to two functions?

-- 
Justin
>From 85672d7f071c91f3ec9190be7feb293f0e49cf8a Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 26 Feb 2020 19:22:55 -0600
Subject: [PATCH v39 1/2] Avoid some calls to RelationGetRelationName

---
 src/backend/access/heap/vacuumlazy.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0d2e724a7d..803e7660f7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -655,8 +655,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			}
 			appendStringInfo(, msgfmt,
 			 get_database_name(MyDatabaseId),
-			 get_namespace_name(RelationGetNamespace(onerel)),
-			 RelationGetRelationName(onerel),
+			 vacrelstats->relnamespace,
+			 vacrelstats->relname,
 			 vacrelstats->num_index_scans);
 			appendStringInfo(, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
 			 vacrelstats->pages_removed,
@@ -827,7 +827,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			if (params->nworkers > 0)
 ereport(WARNING,
 		(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
-RelationGetRelationName(onerel;
+vacrelstats->relname)));
 		}
 		else
 			lps = begin_parallel_vacuum(RelationGetRelid(onerel), Irel,
@@ -1722,7 +1722,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (vacuumed_pages)
 		ereport(elevel,
 (errmsg("\"%s\": removed %.0f row versions in %u pages",
-		RelationGetRelationName(onerel),
+		vacrelstats->relname,
 		tups_vacuumed, vacuumed_pages)));
 
 	/*
@@ -1751,7 +1751,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 	ereport(elevel,
 			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
-	RelationGetRelationName(onerel),
+	vacrelstats->relname,
 	tups_vacuumed, num_tuples,
 	vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
@@ -1883,7 +1883,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 
 	ereport(elevel,
 			(errmsg("\"%s\": removed %d row versions in %d pages",
-	RelationGetRelationName(onerel),
+	vacrelstats->relname,
 	tupindex, npages),
 			 errdetail_internal("%s", pg_rusage_show(;
 
@@ -2431,7 +2431,7 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 
 	ereport(elevel,
 			(errmsg(msg,
-	RelationGetRelationName(indrel),
+	vacrelstats->indname,
 	dead_tuples->num_tuples),
 			 errdetail_internal("%s", pg_rusage_show(;
 
@@ -2602,7 +2602,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 vacrelstats->lock_waiter_detected = true;
 ereport(elevel,
 		(errmsg("\"%s\": stopping truncate due to conflicting lock request",
-RelationGetRelationName(onerel;
+vacrelstats->relname)));
 return;
 			}
 
@@ -2668,7 +2668,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 
 		ereport(elevel,
 (errmsg("\"%s\": truncated %u to %u pages",
-		RelationGetRelationName(onerel),
+		vacrelstats->relname,
 		old_rel_pages, new_rel_pages),
  errdetail_internal("%s",
 	pg_rusage_show(;
@@ -2733,7 +2733,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
 	ereport(elevel,
 			(errmsg("\"%s\": suspending truncate due to conflicting lock request",
-	RelationGetRelationName(onerel;
+	vacrelstats->relname)));
 
 	vacrelstats->lock_waiter_detected = true;
 	return blkno;
-- 
2.17.0

>From e69a3feb5dcf3860a34771c826cd3a1bdfbf9c83 Mon Sep 17 

Re: snapper vs. HEAD

2020-03-30 Thread Tom Lane
"Tom Turelinckx"  writes:
> In the past, I've already switched from gcc 4.6 to gcc 4.7 as a workaround 
> for a similar compiler bug, but I can't upgrade to a newer gcc without 
> backporting it myself, so for the moment I've switched snapper to use -O1 
> instead of -O2, for HEAD only.

Thanks!  But it doesn't seem to have taken: snapper just did a new run
that still failed, and it still seems to be using -O2.

> Not sure whether wheezy on sparc 32-bit is very relevant today, but it's
> an exotic platform, so I try to keep those buildfarm animals alive as
> long as it's possible.

Yeah, I've got a couple of those myself.  But perhaps it'd be sensible
to move to a newer Debian LTS release?  Or have they dropped Sparc
support altogether?

(As of this weekend, it seemed to be impossible to find the wheezy sparc
distribution images on-line anymore.  Fortunately I still had a download
of the dvd-1 image stashed away, or I would not have been able to recreate
my qemu VM for the purpose.  It's going to be very hard for any other PG
hackers to investigate that platform in future.)

regards, tom lane




Re: ALTER TABLE ... SET STORAGE does not propagate to indexes

2020-03-30 Thread Adam Brusselback
> ALTER TABLE ... SET STORAGE does not propagate to indexes, even though
> indexes created afterwards get the new storage setting.  So depending on
> the order of commands, you can get inconsistent storage settings between
> indexes and tables.

I've absolutely noticed this behavior, I just thought it was intentional
for some reason.

Having this behavior changed as stated above would be very welcome in my
opinion. It's always something I've had to think about manually in my
migration scripts, so it would be welcome from my point of view.

-Adam


Re: Make autovacuum sort tables in descending order of xid_age

2020-03-30 Thread Mark Dilger



> On Mar 30, 2020, at 7:09 AM, David Steele  wrote:
> 
> On 1/11/20 12:53 PM, David Fetter wrote:
>> I agree that it's a complex situation, and that many different
>> approaches will eventually need to be brought to bear.
>> What concerns me about introducing a big lump of complexity here is
>> disentangling the effects of each part and of their interaction terms.
>> We're not, to put it mildly, set up to do ANOVA
>> (https://en.wikipedia.org/wiki/Analysis_of_variance ) , ANCOVA (
>> https://en.wikipedia.org/wiki/Analysis_of_covariance ), etc. on
>> changes.
>> Given the above, I'd like to make the case for changing just this one
>> thing at first and seeing whether the difference it makes is generally
>> positive.
> 
> Mark, Robert, thoughts on this?

I have not been working on this issue lately, but as I recall, my concern was 
that changing the behavior of autovacuum could introduce regressions for some 
users, so we should be careful to get it right before we rush to release 
anything.  It didn't seem like the proposed changes took enough into account.  
But that's clearly a judgement call, having to do with how cautious any 
particular person thinks we should be.  I don't feel strongly enough to stand 
in the way if the general concensus is that this is a good enough 
implementation.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: ALTER TABLE ... SET STORAGE does not propagate to indexes

2020-03-30 Thread Alvaro Herrera
On 2020-Feb-25, Peter Eisentraut wrote:

> An alternative would be that we make this situation fully supported. Then
> we'd probably need at least ALTER INDEX ... ALTER COLUMN ... SET STORAGE,
> and some pg_dump support.

I think this is a more promising direction.
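
Presumably something along these lines (purely illustrative, the object
names are made up; pg_dump would also need to emit it):

ALTER INDEX my_expr_idx ALTER COLUMN expr SET STORAGE PLAIN;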

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Atomic pgrename on Windows

2020-03-30 Thread Tom Lane
David Steele  writes:
> On 1/11/20 5:13 PM, Alexander Korotkov wrote:
>> Regarding "pg_stat_tmp/global.stat", which is a problem in particular
>> case, we may evade file renaming altogether.  Instead, we may
>> implement shared-memory counter for filename.  So, instead of
>> renaming, new reads will just come to new file.

> I tend to agree with Tom on the question of portability. But it seems 
> upthread we have determined that this can't be sensibly isolated into a 
> Windows-specific rename() function.
> Does anyone have any further ideas? If not I feel like this patch is 
> going to need to be RWF again.

So far as pg_stat_tmp is concerned, I think there is reasonable hope
that that problem is just going to go away in the near future.
I've not been paying attention to the shared-memory stats collector
thread so I'm not sure if that's anywhere near committable, but
I think that's clearly something we'll want once it's ready.

So if that's the main argument why we need this, it's a pretty
thin argument ...

regards, tom lane




Re: materialization blocks hash join

2020-03-30 Thread Pavel Stehule
On Mon, 30 Mar 2020 at 18:06, Pavel Stehule wrote:

> Hi
>
> when I was in talk with Silvio Moioli, I found strange hash join. Hash was
> created from bigger table.
>
>
> https://www.postgresql.org/message-id/79dd683d-3296-1b21-ab4a-28fdc2d98807%40suse.de
>
> Now it looks like a materialized CTE disallows the hash join
>
>
> create table bigger(a int);
> create table smaller(a int);
> insert into bigger select random()* 1 from generate_series(1,10);
> insert into smaller select i from generate_series(1,10) g(i);
>
> analyze bigger, smaller;
>
> -- no problem
> explain analyze select * from bigger b join smaller s on b.a = s.a;
>
> postgres=# explain analyze select * from bigger b join smaller s on b.a =
> s.a;
>  QUERY PLAN
>
>
> 
>  Hash Join  (cost=3084.00..7075.00 rows=10 width=8) (actual
> time=32.937..87.276 rows=4 loops=1)
>Hash Cond: (b.a = s.a)
>->  Seq Scan on bigger b  (cost=0.00..1443.00 rows=10 width=4)
> (actual time=0.028..8.546 rows=10 loops=1)
>->  Hash  (cost=1443.00..1443.00 rows=10 width=4) (actual
> time=32.423..32.423 rows=10 loops=1)
>  Buckets: 131072  Batches: 2  Memory Usage: 2785kB
>  ->  Seq Scan on smaller s  (cost=0.00..1443.00 rows=10
> width=4) (actual time=0.025..9.931 rows=10 loops=1)
>  Planning Time: 0.438 ms
>  Execution Time: 91.193 ms
> (8 rows)
>
> but with materialized CTE
>
> postgres=# explain analyze with b as materialized (select * from bigger),
> s as materialized (select * from smaller) select * from b join s on b.a =
> s.a;
>   QUERY PLAN
>
>
> --
>  Merge Join  (cost=23495.64..773995.64 rows=5000 width=8) (actual
> time=141.242..193.375 rows=4 loops=1)
>Merge Cond: (b.a = s.a)
>CTE b
>  ->  Seq Scan on bigger  (cost=0.00..1443.00 rows=10 width=4)
> (actual time=0.026..11.083 rows=10 loops=1)
>CTE s
>  ->  Seq Scan on smaller  (cost=0.00..1443.00 rows=10 width=4)
> (actual time=0.015..9.161 rows=10 loops=1)
>->  Sort  (cost=10304.82..10554.82 rows=10 width=4) (actual
> time=78.775..90.953 rows=10 loops=1)
>  Sort Key: b.a
>  Sort Method: external merge  Disk: 1376kB
>  ->  CTE Scan on b  (cost=0.00..2000.00 rows=10 width=4)
> (actual time=0.033..39.274 rows=10 loops=1)
>->  Sort  (cost=10304.82..10554.82 rows=10 width=4) (actual
> time=62.453..74.004 rows=6 loops=1)
>  Sort Key: s.a
>  Sort Method: external sort  Disk: 1768kB
>  ->  CTE Scan on s  (cost=0.00..2000.00 rows=10 width=4)
> (actual time=0.018..31.669 rows=10 loops=1)
>  Planning Time: 0.303 ms
>  Execution Time: 199.919 ms
> (16 rows)
>
> It doesn't use hash join - the estimations are perfect, but plan is
> suboptimal
>

I was wrong, the estimation on CTE is ok, but JOIN estimation is bad

Merge Join  (cost=23495.64..773995.64 rows=5000 width=8) (actual
time=141.242..193.375 rows=4 loops=1)


> Regards
>
> Pavel
>
>


materialization blocks hash join

2020-03-30 Thread Pavel Stehule
Hi

when I was in talk with Silvio Moioli, I found strange hash join. Hash was
created from bigger table.

https://www.postgresql.org/message-id/79dd683d-3296-1b21-ab4a-28fdc2d98807%40suse.de

Now it looks like a materialized CTE disallows the hash join


create table bigger(a int);
create table smaller(a int);
insert into bigger select random()* 1 from generate_series(1,10);
insert into smaller select i from generate_series(1,10) g(i);

analyze bigger, smaller;

-- no problem
explain analyze select * from bigger b join smaller s on b.a = s.a;

postgres=# explain analyze select * from bigger b join smaller s on b.a =
s.a;
 QUERY PLAN


 Hash Join  (cost=3084.00..7075.00 rows=10 width=8) (actual
time=32.937..87.276 rows=4 loops=1)
   Hash Cond: (b.a = s.a)
   ->  Seq Scan on bigger b  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.028..8.546 rows=10 loops=1)
   ->  Hash  (cost=1443.00..1443.00 rows=10 width=4) (actual
time=32.423..32.423 rows=10 loops=1)
 Buckets: 131072  Batches: 2  Memory Usage: 2785kB
 ->  Seq Scan on smaller s  (cost=0.00..1443.00 rows=10
width=4) (actual time=0.025..9.931 rows=10 loops=1)
 Planning Time: 0.438 ms
 Execution Time: 91.193 ms
(8 rows)

but with materialized CTE

postgres=# explain analyze with b as materialized (select * from bigger), s
as materialized (select * from smaller) select * from b join s on b.a = s.a;
  QUERY PLAN

--
 Merge Join  (cost=23495.64..773995.64 rows=5000 width=8) (actual
time=141.242..193.375 rows=4 loops=1)
   Merge Cond: (b.a = s.a)
   CTE b
 ->  Seq Scan on bigger  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.026..11.083 rows=10 loops=1)
   CTE s
 ->  Seq Scan on smaller  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.015..9.161 rows=10 loops=1)
   ->  Sort  (cost=10304.82..10554.82 rows=10 width=4) (actual
time=78.775..90.953 rows=10 loops=1)
 Sort Key: b.a
 Sort Method: external merge  Disk: 1376kB
 ->  CTE Scan on b  (cost=0.00..2000.00 rows=10 width=4)
(actual time=0.033..39.274 rows=10 loops=1)
   ->  Sort  (cost=10304.82..10554.82 rows=10 width=4) (actual
time=62.453..74.004 rows=6 loops=1)
 Sort Key: s.a
 Sort Method: external sort  Disk: 1768kB
 ->  CTE Scan on s  (cost=0.00..2000.00 rows=10 width=4)
(actual time=0.018..31.669 rows=10 loops=1)
 Planning Time: 0.303 ms
 Execution Time: 199.919 ms
(16 rows)

It doesn't use hash join - the estimations are perfect, but plan is
suboptimal

Regards

Pavel


Re: Error on failed COMMIT

2020-03-30 Thread Dave Cramer
On Tue, 17 Mar 2020 at 19:32, Dave Cramer  wrote:

>
>
> On Tue, 17 Mar 2020 at 19:23, Bruce Momjian  wrote:
>
>> On Tue, Mar 17, 2020 at 07:15:05PM -0400, Dave Cramer wrote:
>> > On Tue, 17 Mar 2020 at 16:47, Bruce Momjian  wrote:
>> > Third, the idea that individual interfaces, e.g. JDBC, should throw
>> an
>> > error in this case while the server just changes the COMMIT return
>> tag
>> > to ROLLBACK is confusing.  People regularly test SQL commands in the
>> > server before writing applications or while debugging, and a
>> behavior
>> > mismatch would cause confusion.
>> >
>> >
>> > I'm not sure what you mean by this. The server would throw an error.
>>
>> I am saying it is not wise to have interfaces behaving differently than
>> the server, for the reasons stated above.
>>
>> Agreed and this is why I think it is important for the server to be
> defining the behaviour instead of each interface deciding how to handle
> this situation.
>
>
>
So it appears this is currently languishing as unresolved and feature
freeze is imminent.

What has to be done to get a decision one way or another before feature
freeze?

I have provided a patch that could be reviewed and at least be considered
in the commitfest.

Perhaps someone can review the patch and I can do whatever it takes to get
it presentable ?
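
For anyone picking up the review, the behavior in question boils down to
this minimal sketch (the patch would make the final COMMIT raise an error
instead of quietly reporting ROLLBACK):

BEGIN;
SELECT 1/0;   -- fails, the transaction is now in an aborted state
COMMIT;       -- today this completes and returns a ROLLBACK command tag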

Dave Cramer
www.postgres.rocks


Re: [PATCH] Atomic pgrename on Windows

2020-03-30 Thread David Steele

On 1/11/20 5:13 PM, Alexander Korotkov wrote:

On Tue, Jan 7, 2020 at 11:04 PM Tom Lane  wrote:

"If the link named by the new argument exists and the file's link
count becomes 0 when it is removed and no process has the file open,
the space occupied by the file shall be freed and the file shall no
longer be accessible. If one or more processes have the file open when
the last link is removed, the link shall be removed before rename()
returns, but the removal of the file contents shall be postponed until
all references to the file are closed."

But issue is that on Windows POSIX rename() is kind of impossible to
implement.  And I suspect other platforms may have issues too.

Regarding "pg_stat_tmp/global.stat", which is a problem in particular
case, we may evade file renaming altogether.  Instead, we may
implement shared-memory counter for filename.  So, instead of
renaming, new reads will just come to new file.


I tend to agree with Tom on the question of portability. But it seems 
upthread we have determined that this can't be sensibly isolated into a 
Windows-specific rename() function.


Does anyone have any further ideas? If not I feel like this patch is 
going to need to be RWF again.


Regards,
--
-David
da...@pgmasters.net




Re: Unix-domain socket support on Windows

2020-03-30 Thread Peter Eisentraut

On 2020-03-27 18:52, Andrew Dunstan wrote:

I have tested this on drongo/fairywren and it works fine. The patches
apply cleanly (with a bit of fuzz) and a full buildfarm run is happy in
both cases.

Unfortunately I don't have a Windows machine that's young enough to
support git master and old enough not to support Unix Domain sockets, so
i can't test that with socket-enabled binaries.

On inspection the patches seem fine.

Let's commit this and keep working on the pg_upgrade and test issues.


I have committed this in chunks over the last couple of days.  It's done 
now.


I didn't commit the initdb auto-detection patch.  As I mentioned 
earlier, that one is probably not necessary.


Btw., the default AppVeyor images are too old to support this.  You need 
something like 'image: Visual Studio 2019' to get a new enough image. 
So that's one way to test what happens when it's not supported at run 
time.  (I did test it and you get a sensible error message.)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: adding partitioned tables to publications

2020-03-30 Thread Amit Langote
I have updated the comments in apply_handle_tuple_routing() (see 0002)
to better explain what's going on with UPDATE handling.  I also
rearranged the tests a bit for clarity.

Attached updated patches.

-- 
Thank you,

Amit Langote
EnterpriseDB: http://www.enterprisedb.com


v1-0001-worker.c-refactor-code-to-look-up-local-tuple.patch
Description: Binary data


v15-0003-Publish-partitioned-table-inserts-as-its-own.patch
Description: Binary data


v15-0002-Add-subscription-support-to-replicate-into-parti.patch
Description: Binary data


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-03-30 Thread Tomas Vondra

On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:

On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
 wrote:


On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
>On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
> wrote:
>>
>> Ummm, how is that different from what the patch is doing now? I mean, we
>> only write the top-level XID for the first WAL record in each subxact,
>> right? Or what would be the difference with your approach?
>>
>
>We have to do what the patch is currently doing and additionally, we
>will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
>us to call ProcArrayApplyXidAssignment during WAL replay only after
>PGPROC_MAX_CACHED_SUBXIDS number of subxacts.  It will help us in
>clearing the KnownAssignedXids at the same time as we do now, so no
>additional performance overhead.
>

Hmmm. So we'd still log assignment twice? Or would we keep just the
immediate assignments (embedded into xlog records), and cache the
subxids on the replica somehow?



I think we need to cache the subxids on the replica somehow but I
don't have a very good idea for it.  Basically, there are two ways to
do it (a) Change the KnownAssignedXids in some way so that we can
easily find this information without losing on the current benefits of
it.  I can't think of a good way to do that and even if we come up
with something, it could easily be a lot of work, (b) Cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids.  This is doable but now we have two data-structures
(one in shared memory and other in local memory) managing the same
information in different ways.

Do you have any other ideas?


I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as hash key)?

I think the single array would be sufficient, but the hash table would
allow keeping the apply logic more or less as it's today. See the
attached patch that adds such cache - I do admit I haven't tested this,
but hopefully it's a sufficient illustration of the idea.

It does not handle cleanup of the cache, but I think that should not be
difficult - we simply need to remove entries for transactions that got
committed or rolled back. And do something about transactions without an
explicit commit/rollback record, but that can be done by also handling
XLOG_RUNNING_XACTS (by removing anything preceding oldestRunningXid).

I don't think this is particularly complicated or a lot of code, and I
don't see why would it require data structures in shared memory. Only
the walreceiver on standby needs to worry about this, no?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 1951103b26..b85c046b41 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7300,6 +7300,19 @@ StartupXLOG(void)
TransactionIdIsValid(record->xl_xid))

RecordKnownAssignedTransactionIds(record->xl_xid);
 
+   /*
+   * Assign subtransaction xids to the top level 
xid if the
+   * record has that information. This is required 
at most
+   * once per subtransactions.
+   */
+   if 
(TransactionIdIsValid(xlogreader->toplevel_xid) &&
+   standbyState >= STANDBY_INITIALIZED)
+   {
+   Assert(XLogStandbyInfoActive());
+   
ProcArrayCacheXidAssignment(xlogreader->toplevel_xid,
+   
record->xl_xid);
+   }
+
/* Now apply the WAL record itself */
RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
diff --git a/src/backend/storage/ipc/procarray.c 
b/src/backend/storage/ipc/procarray.c
index cfb88db4a4..efddd0e1e6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -958,6 +958,53 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
LWLockRelease(ProcArrayLock);
 }
 
+static HTAB *xidAssignmentsHash = NULL;
+
+typedef struct XidAssignmentEntry
+{
+   /* the hash lookup key MUST BE FIRST */
+   TransactionId   topXid;
+
+   int nsubxids;
+   TransactionId   subxids[PGPROC_MAX_CACHED_SUBXIDS];
+} XidAssignmentEntry;
+
+void
+ProcArrayCacheXidAssignment(TransactionId topXid, TransactionId subXid)
+{
+   XidAssignmentEntry *entry;
+   boolfound;
+
+   if 

Re: Can we get rid of GetLocaleInfoEx() yet?

2020-03-30 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Mar 29, 2020 at 07:06:56PM +0200, Juan José Santamaría Flecha wrote:
>> It works for the issue just fine, and more comments make a better a
>> patch, so no objections from me.

> +1 from me.  And yes, we are still missing lc_codepage in newer
> versions of VS.  Locales + Windows != 2, business as usual.

OK, pushed.

regards, tom lane




Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

2020-03-30 Thread Tom Lane
Fabien COELHO  writes:
> As I wrote about an earlier version of the patch, ISTM that instead of 
> reinventing, extending, adapting various ls variants (with/without 
> metadata, which show only files, which shows target of links, which shows 
> directory, etc.) we would just need *one* postgres "ls" implementation 
> which would be like "ls -la arg" (returns file type, dates), and then 
> everything else is a wrapper around that with appropriate filtering that 
> can be done at the SQL level, like you started with recurse.

Yeah, I agree that some new function that can represent symlinks
explicitly in its output is the place to deal with this, for
people who want to deal with it.

In the meantime, there's still the question of what pg_ls_dir_files
should do exactly.  Are we content to have it ignore symlinks?
I remain inclined to think that's the right thing given its current
brief.
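
For concreteness, the kind of single generic function being suggested might
look roughly like this, with the filtering done in SQL; the function name and
columns here are purely hypothetical:

SELECT name, size, modification, type
  FROM pg_ls_dir_metadata('pg_wal')   -- hypothetical function
 WHERE type <> 'l';                   -- e.g. hide symlinks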

regards, tom lane




[PATCH] Remove some reassigned values.

2020-03-30 Thread Ranier Vilela
Hi,
This patch removes reassigned values, safely.

plancat.c needs a more careful review.

Best regards
Ranier Vilela


remove_reassigned_values.patch
Description: Binary data


PostgreSQL 13 Feature Freeze + Release Management Team (RMT)

2020-03-30 Thread Jonathan S. Katz
Hi,

The Release Management Team (RMT) for PostgreSQL 13 is assembled and
has determined that the feature freeze date for the PostgreSQL 13
release will be April 7, 2020. This means that any feature for the
PostgreSQL 13 release **must be committed by April 7, 2020 AOE**
("anywhere on earth")[1]. In other words, by April 8, it is too late.

This naturally extends the March 2020 Commitfest to April 7, 2020. After
the freeze is in effect, any open feature in the current Commitfest will
be moved into the subsequent one.

Open items for the PostgreSQL 13 release will be tracked here:

https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items

For the PostgreSQL 13 release, the release management team is composed
of:


  Peter Geoghegan 
  Alvaro Herrera 
  Jonathan Katz 

For the time being, if you have any questions about the process, please
feel free to email any member of the RMT.  We will send out notes with
updates and additional guidance in the near future.

Thanks!

Jonathan

[1] https://en.wikipedia.org/wiki/Anywhere_on_Earth



signature.asc
Description: OpenPGP digital signature


Re: Make autovacuum sort tables in descending order of xid_age

2020-03-30 Thread David Steele

On 1/11/20 12:53 PM, David Fetter wrote:


I agree that it's a complex situation, and that many different
approaches will eventually need to be brought to bear.

What concerns me about introducing a big lump of complexity here is
disentangling the effects of each part and of their interaction terms.
We're not, to put it mildly, set up to do ANOVA
(https://en.wikipedia.org/wiki/Analysis_of_variance ) , ANCOVA (
https://en.wikipedia.org/wiki/Analysis_of_covariance ), etc. on
changes.

Given the above, I'd like to make the case for changing just this one
thing at first and seeing whether the difference it makes is generally
positive.


Mark, Robert, thoughts on this?

--
-David
da...@pgmasters.net




Re: Improving connection scalability: GetSnapshotData()

2020-03-30 Thread Alexander Korotkov
On Sun, Mar 29, 2020 at 9:50 PM Andres Freund  wrote:
> On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov 
>  wrote:
> > clearly a big win on majority
> >of workloads, I think we still need to investigate different workloads
> >on different hardware to ensure there is no regression.
>
> Definitely. Which workloads are you thinking of? I can think of those 
> affected facets: snapshot speed, commit speed with writes, connection 
> establishment, prepared transaction speed. All in the small and large 
> connection count cases.

The following pgbench scripts come to my mind first:
1) SELECT txid_current(); (artificial but good for checking corner case)
2) Single insert statement (as example of very short transaction)
3) Plain pgbench read-write (you already did it for sure)
4) pgbench read-write script with an increased number of SELECTs: repeat the
select from pgbench_accounts, say, 10 times with different aids (see the
sketch below).
5) 10% pgbench read-write, 90% of pgbench read-only
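
A sketch of what scenario 4 could look like as a pgbench custom script
(variable names follow the built-in tpcb-like script; the number of extra
SELECTs is arbitrary):

\set aid1 random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
-- ... repeat with more aids as needed ...
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid1, :delta, CURRENT_TIMESTAMP);
END;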

> I did measurements on all of those but prepared xacts, fwiw

Great, it would be nice to see the results in the thread.

> That definitely needs to be measured, due to the locking changes around 
> procarrayaddd/remove.
>
> I don't think regressions besides perhaps 2pc are likely - there's nothing 
> really getting more expensive but procarray add/remove.

I agree that ProcArrayAdd()/Remove() should be the first subject of
investigation, but other cases should be checked as well IMHO.
Regarding 2pc, the following scenarios come to my mind:
1) pgbench read-write modified so that every transaction is prepared
first, then commit prepared.
2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write
3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-30 Thread Alexey Kondratov

On 2020-03-30 04:56, Michael Paquier wrote:

On Fri, Mar 27, 2020 at 07:47:29AM +0900, Michael Paquier wrote:

On Thu, Mar 26, 2020 at 06:56:36PM -0300, Alvaro Herrera wrote:
I don't think any such cleanup should hamper the patch at hand 
anyway.


I don't think either, so I would do the split for the archive routines
at least.  It still feels strange to me to have a different routine
name for the frontend-side of RestoreArchivedFile().  That's not
really consistent with the current practice we have palloc() & co, as
well as the sync routines.


Okay.  Hearing nothing, I have rebased the patch set this morning,
improving some comments and the code format while on it.  I would like
to commit both 0001 (creation of xlogarchiver.h) and 0002 attached in
the next couple of days.  If you have an objection, please feel free.



0001:

What do think about adding following sanity check into xlogarchive.c?

+#ifdef FRONTEND
+#error "This file is not expected to be compiled for frontend code"
+#endif

It would prevent someone from adding typedefs and any other common 
definitions into xlogarchive.h in the future, leading to the accidental 
inclusion of both xlogarchive.h and fe_archive.h at the same time.


0002:

+ */
+int
+RestoreArchivedFile(const char *path, const char *xlogfname,
+   off_t expectedSize, 
const char *restoreCommand)
+{

There are a couple of extra tabs here, IMO.

+extern int RestoreArchivedFile(const char *path,
+   const char *xlogfname,
+   off_t expectedSize,
+   const char *restoreCommand);

And the same here.

+ * This uses a logic based on "postgres -C" to get the value from
+ * from the cluster.

Repetitive 'from'.

 extractPageMap(const char *datadir, XLogRecPtr startpoint, int 
tliIndex,

-  XLogRecPtr endpoint)
+  XLogRecPtr endpoint, const char *restore_command)

Let us use camel case 'restoreCommand' here as in the header for 
tidiness.


I have left 0001 intact, but fixed all these small remarks in the 0002. 
Please, find it attached.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From f94990da84844841a35e80ca30e2b3d8c3bdaded Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 30 Mar 2020 10:51:51 +0900
Subject: [PATCH v6 2/2] Add -c/--restore-target-wal to pg_rewind

pg_rewind needs to copy from the source server to the target a set of
relation blocks changed from the previous checkpoint where WAL forked
between both up to the end of WAL on the target.  This may require WAL
segments that are not present anymore on the target's pg_wal, causing
pg_rewind to fail.  It is possible to fix that by copying manually the
WAL segments needed but this may lead to extra and useless work.

Instead, this commit introduces a new option allowing pg_rewind to use a
restore_command when running by grabbing it from the target
configuration.  This allows the rewind operation to be more reliable,
and only the necessary segments are fetched from the archives in this
case.

In order to do that, a new routine is added to src/common/ to allow
frontend tools to retrieve a WAL segment using an already-built restore
command.  This version is much more simple than the backend equivalent.

Author: Alexey Kondratov
Reviewed-by: Andrey Borodin, Andres Freund, Alvaro Herrera, Alexander
Korotkov, Michael Paquier
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee3616895...@postgrespro.ru
---
 doc/src/sgml/ref/pg_rewind.sgml   |  28 +--
 src/bin/pg_rewind/parsexlog.c |  33 +++-
 src/bin/pg_rewind/pg_rewind.c |  87 ++--
 src/bin/pg_rewind/pg_rewind.h |   6 +-
 src/bin/pg_rewind/t/001_basic.pl  |   3 +-
 src/bin/pg_rewind/t/RewindTest.pm |  70 +++-
 src/common/Makefile   |   1 +
 src/common/exec.c |   3 +-
 src/common/fe_archive.c   | 128 ++
 src/include/common/fe_archive.h   |  21 +
 src/include/port.h|   3 +-
 src/tools/msvc/Mkvcbuild.pm   |   4 +-
 12 files changed, 360 insertions(+), 27 deletions(-)
 create mode 100644 src/common/fe_archive.c
 create mode 100644 src/include/common/fe_archive.h

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index f64d659522..07c49e4719 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -66,11 +66,11 @@ PostgreSQL documentation
can be found either on the target timeline, the source timeline, or their common
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
-   target cluster ran for a long time after the divergence, the old WAL
-   files might no longer be present. In that case, 

Re: Conflict handling for COPY FROM

2020-03-30 Thread Surafel Temesgen
On Fri, Mar 27, 2020 at 3:27 PM Tom Lane  wrote:

> Surafel Temesgen  writes:
> > [ conflict-handling-copy-from-v16.patch ]
>
> I took a quick look at this patch, since it was marked "ready for
> committer", but I don't see how it can possibly be considered committable.
>
> 1. Covering only the errors that are thrown in DoCopy itself doesn't
> seem to me to pass the smell test.  Yes, I'm sure there's some set of
> use-cases for which that'd be helpful, but I think most people would
> expect a "skip errors" option to be able to handle cases like malformed
> numbers or bad encoding.  I understand the implementation reasons that
> make it impractical to cover other errors, but do we really want a
> feature that most people will see as much less than half-baked?  I fear
> it'll be an embarrassment.
>
I did some research, and most major database management systems don't claim
to handle every problem when loading a file, at least not in every usage
scenario.


> 2. If I'm reading the patch correctly, (some) rejected rows are actually
> sent back to the client.  This is a wire protocol break of the first
> magnitude, and can NOT be accepted.  At least not without some provisions
> for not doing it with a client that isn't prepared for it.  I also am
> fairly worried about the possibilities for deadlock (ie, both ends stuck
> waiting for the other one to absorb data) if the return traffic volume is
> high enough.
>
>
If my understanding is correct, COPY IN occurs as a sub-protocol inside the
simple or extended query protocol, which knows about and handles the
responses.


> 3. I don't think enough thought has been put into the reporting, either.
>
> +ereport(WARNING,
> +(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> + errmsg("skipping \"%s\" --- extra data after
> last expected column",
> +cstate->line_buf.data)));
>
> That's not going to be terribly helpful if the input line runs to many
> megabytes.  Or even if no individual line is very long, but you get
> millions of such warnings.  It's pretty much at variance with our
> message style guidelines (among other things, those caution you to keep
> the primary error message short); and it's randomly different from
> COPY's existing practice, which is to show the faulty line as CONTEXT.
> Plus it seems plenty weird that some errors are reported this way while
> others are reported by sending back the bad tuple (with, it looks like,
> no mention of what the specific problem is ... what if you have a lot of
> unique indexes?).
>
Currently we can't get a problem description from the speculative insertion
infrastructure, and I am afraid that adding a problem description to the
returned tuple would make the data less usable without further processing.
A warning is raised for errors that happen before the tuple is constructed.
The other option is to skip those records silently, but reporting them gives
the user the chance to correct them.


> BTW, while I don't know much about the ON CONFLICT (speculative
> insertion) infrastructure, I wonder how well it really works to
> not specify an arbiter index.  I see that you're getting away with
> it in a trivial test case that has exactly one index, but that's
> not stressing the point very hard.
>
If an arbiter index is not specified, that means all indexes are checked, as
the comment on ExecCheckIndexConstraints states:

/*
 * ExecCheckIndexConstraints
 *
 * This routine checks if a tuple violates any unique or
 * exclusion constraints.  Returns true if there is no conflict.
 * Otherwise returns false, and the TID of the conflicting
 * tuple is returned in *conflictTid.
 *
 * If 'arbiterIndexes' is given, only those indexes are checked.
 * NIL means all indexes.


regards
Surafel


Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-30 Thread Alexander Korotkov
On Mon, Mar 30, 2020 at 4:57 AM Michael Paquier  wrote:
> On Fri, Mar 27, 2020 at 07:47:29AM +0900, Michael Paquier wrote:
> > On Thu, Mar 26, 2020 at 06:56:36PM -0300, Alvaro Herrera wrote:
> >> I don't think any such cleanup should hamper the patch at hand anyway.
> >
> > I don't think either, so I would do the split for the archive routines
> > at least.  It still feels strange to me to have a different routine
> > name for the frontend-side of RestoreArchivedFile().  That's not
> > really consistent with the current practice we have palloc() & co, as
> > well as the sync routines.
>
> Okay.  Hearing nothing, I have rebased the patch set this morning,
> improving some comments and the code format while on it.  I would like
> to commit both 0001 (creation of xlogarchiver.h) and 0002 attached in
> the next couple of days.  If you have an objection, please feel free.

I'm fine with patchset attached.  Sorry for late reply.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




tweaking perfect hash multipliers

2020-03-30 Thread John Naylor
Hi all,

While playing around with Peter E.'s unicode normalization patch [1],
I found that HEAD failed to build a perfect hash function for any of
the four sets of 4-byte keys ranging from 1k to 17k in number. It
probably doesn't help that codepoints have nul bytes and often cluster
into consecutive ranges. In addition, I found that a couple of the
candidate hash multipliers don't compile to shift-and-add
instructions, although they were chosen with that intent in mind. It
seems compilers will only do that if the number is exactly 2^n +/- 1.

Using the latest gcc and clang, I tested all prime numbers up to 5000
(plus 8191 for good measure), and found a handful that are compiled
into non-imul instructions. Dialing back the version, gcc 4.8 and
clang 7.0 are the earliest I found that have the same behavior as
newer ones. For reference:

https://gcc.godbolt.org/z/bxcXHu

In addition to shift-and-add, there are also a few using lea,
lea-and-add, or 2 leas.

Then I used the attached program to measure various combinations of
compiled instructions using two constant multipliers iterating over
bytes similar to a generated hash function.

 -O2 -Wall test-const-mult.c test-const-mult-2.c
./a.out
Median of 3 with clang 10:

            lea, lea  0.181s
        lea, lea+add  0.248s
      lea, shift+add  0.251s
  lea+add, shift+add  0.273s
shift+add, shift+add  0.276s
      2 leas, 2 leas  0.290s
     shift+add, imul  0.329s

Taking this with a grain of salt, it nonetheless seems plausible that
a single lea could be faster than any two instructions here. The only
primes that compile to a single lea are 3 and 5, but as a demonstration
I've found those multipliers can build hash functions for all our
keyword lists. None of the other primes we don't already use are
particularly interesting from a performance point of view.

With the unicode quick check, I found that the larger sets need (257,
8191) as multipliers to build the hash table, and none of the smaller
special primes I tested will work.

Keeping these two properties in mind, I came up with the scheme in the
attached patch that tries adjacent pairs in this array:

(3, 5, 17, 31, 127, 257, 8191)

so that we try (3,5) first, next (5,17), and then all the pure
shift-and-adds with (257,8191) last.

The main motivation is to be able to build the unicode quick check
tables, but if we ever use this functionality in a hot code path, we
may as well try to shave a few more cycles while we're at it.
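
A minimal sketch (not the generator's actual output) of the kind of inner
loop PerfectHash.pm emits, using the single-lea multipliers 3 and 5
discussed above; the seeds and final table lookups are omitted:

#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for a generated hash function.  At -O2, recent
 * gcc/clang compile "a * 3" and "b * 5" to a single lea each
 * (a + a*2, b + b*4), while 2^n +/- 1 multipliers such as 31 or 8191
 * become shift-and-add/sub, and other primes fall back to imul.
 */
static uint32_t
hash_sketch(const unsigned char *key, size_t keylen)
{
	uint32_t	a = 0;
	uint32_t	b = 0;

	while (keylen--)
	{
		unsigned char c = *key++;

		a = a * 3 + c;			/* lea + add */
		b = b * 5 + c;			/* lea + add */
	}

	/* the real generated code combines a and b via table lookups */
	return a + b;
}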

[1] 
https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8ca...@2ndquadrant.com

--
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index 74fb1f2ef6..4a029621b2 100644
--- a/src/tools/PerfectHash.pm
+++ b/src/tools/PerfectHash.pm
@@ -78,18 +78,25 @@ sub generate_hash_function
 
 	# Try different hash function parameters until we find a set that works
 	# for these keys.  The multipliers are chosen to be primes that are cheap
-	# to calculate via shift-and-add, so don't change them without care.
+	# to calculate via lea or shift, so don't change them without care.
+	# The list of multipliers is designed such that by iterating through
+	# adjacent pairs, we first try a couple combinations where at least one
+	# multiplier is compiled to a single lea (3 or 5). Also, we eventually
+	# want to include the largest numbers in our search for the sake of
+	# finicky key sets such as a large number of unicode codepoints.
 	# (Commonly, random seeds are tried, but we want reproducible results
 	# from this program so we don't do that.)
-	my $hash_mult1 = 31;
+	my @HASH_MULTS = (3, 5, 17, 31, 127, 257, 8191);
+	my $hash_mult1;
 	my $hash_mult2;
 	my $hash_seed1;
 	my $hash_seed2;
 	my @subresult;
   FIND_PARAMS:
-	foreach (127, 257, 521, 1033, 2053)
+	foreach my $i (0 .. $#HASH_MULTS - 1)
 	{
-		$hash_mult2 = $_;# "foreach $hash_mult2" doesn't work
+		$hash_mult1 = $HASH_MULTS[$i];
+		$hash_mult2 = $HASH_MULTS[$i+1];
 		for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++)
 		{
 			for ($hash_seed2 = 0; $hash_seed2 < 10; $hash_seed2++)


Re: WIP/PoC for parallel backup

2020-03-30 Thread Ahsan Hadi
On Mon, Mar 30, 2020 at 3:44 PM Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

> Thanks Asif,
>
> I have re-verified the reported issues. Except for the standby backup, the others are fixed.
>

Yes, as Asif mentioned, he is working on the standby issue and adding
bandwidth throttling functionality to parallel backup.

It would be good to get some feedback from Robert on Asif's previous email
about the design considerations for standby server support and throttling. I
believe all the other points mentioned by Robert in this thread have been
addressed by Asif, so it would be good to hear about any other concerns that
are not addressed.

Thanks,

-- Ahsan


> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman 
> wrote:
>
>>
>>
>> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>
>>> Hi Asif,
>>>
>>> While testing further I observed parallel backup is not able to take
>>> backup of standby server.
>>>
>>> mkdir /tmp/archive_dir
>>> echo "archive_mode='on'">> data/postgresql.conf
>>> echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf
>>>
>>> ./pg_ctl -D data -l logs start
>>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>>
>>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">>
>>> /tmp/slave/postgresql.conf
>>> echo "restore_command='cp /tmp/archive_dir/%f %p'">>
>>> /tmp/slave/postgresql.conf
>>> echo "promote_trigger_file='/tmp/failover.log'">>
>>> /tmp/slave/postgresql.conf
>>>
>>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>>
>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "select
>>> pg_is_in_recovery();"
>>>  pg_is_in_recovery
>>> ---
>>>  f
>>> (1 row)
>>>
>>> [edb@localhost bin]$ ./psql postgres -p 5433 -c "select
>>> pg_is_in_recovery();"
>>>  pg_is_in_recovery
>>> ---
>>>  t
>>> (1 row)
>>>
>>>
>>>
>>>
>>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>>> pg_basebackup: error: could not list backup files: ERROR:  the standby was promoted during online backup
>>> HINT:  This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
>>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>>
>>> #same is working fine without parallel backup
>>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>>> [edb@localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>>> /tmp/bkp_s/PG_VERSION
>>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>>
>>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>
 Hi Asif,

 In another scenario, bkp data is corrupted for the tablespace. Again, this
 is not reproducible every time, but if I run the same set of commands I get
 the same error.

 [edb@localhost bin]$ ./pg_ctl -D data -l logfile start
 waiting for server to start done
 server started
 [edb@localhost bin]$
 [edb@localhost bin]$ mkdir /tmp/tblsp
 [edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace
 tblsp location '/tmp/tblsp';"
 CREATE TABLESPACE
 [edb@localhost bin]$ ./psql postgres -p 5432 -c "create database
 testdb tablespace tblsp;"
 CREATE DATABASE
 [edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a
 text);"
 CREATE TABLE
 [edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl
 values ('parallel_backup with tablespace');"
 INSERT 0 1
 [edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
 /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
 [edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p
 " start
 waiting for server to start done
 server started
 [edb@localhost bin]$ ./psql postgres -p  -c "select * from
 pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
   oid  |  spcname   | spcowner | spcacl | spcoptions
 ---++--++
   1663 | pg_default |   10 ||
  16384 | tblsp  |   10 ||
 (2 rows)

 [edb@localhost bin]$ ./psql testdb -p  -c "select * from testtbl";
 psql: error: could not connect to server: FATAL:
  "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
 DETAIL:  File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
 missing.
 [edb@localhost bin]$
 [edb@localhost bin]$ ls
 data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
 data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
 [edb@localhost bin]$ ls
 /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
 ls: cannot access
 /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
 directory


 Thanks & Regards,
 Rajkumar Raghuwanshi


 On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <

Re: WAL usage calculation patch

2020-03-30 Thread Julien Rouhaud
On Mon, Mar 30, 2020 at 03:52:38PM +0530, Amit Kapila wrote:
> On Sun, Mar 29, 2020 at 5:49 PM Julien Rouhaud  wrote:
> >
> 
> @@ -1249,6 +1250,16 @@ XLogInsertRecord(XLogRecData *rdata,
>   ProcLastRecPtr = StartPos;
>   XactLastRecEnd = EndPos;
> 
> + /* Provide WAL update data to the instrumentation */
> + if (inserted)
> + {
> + pgWalUsage.wal_bytes += rechdr->xl_tot_len;
> + if (doPageWrites && fpw_lsn <= RedoRecPtr)
> + pgWalUsage.wal_fpw_records++;
> + else
> + pgWalUsage.wal_records++;
> + }
> +
> 
> I think the above code has multiple problems. (a) fpw_lsn can be
> InvalidXLogRecPtr and still there could be full-page image (for ex.
> when REGBUF_FORCE_IMAGE flag for buffer is set).  (b) There could be
> multiple FPW records while inserting a record; consider when there are
> multiple registered buffers.  I think the right place to figure this
> out is XLogRecordAssemble. (c) There are cases when we also attach the
> record data even when we decide to write FPW (cf. REGBUF_KEEP_DATA),
> so we might want to increment wal_fpw_records and wal_records for such
> cases.
> 
> I think the right place to compute this information is
> XLogRecordAssemble even though we update it at the place where you
> have it in the patch.  You can probably compute that in local
> variables and then transfer to pgWalUsage in XLogInsertRecord.  I am
> fine if you can think of some other way but the current patch doesn't
> seem correct to me.

My previous approach was indeed totally broken.  v8 attached which hopefully
will be ok.
From 3b8757a46aff847e5b36bf30a5e1f8d6662d0465 Mon Sep 17 00:00:00 2001
From: Kirill Bychik 
Date: Tue, 17 Mar 2020 14:41:50 +0100
Subject: [PATCH v8 1/4] Add infrastructure to track WAL usage.

Author: Kirill Bychik, Julien Rouhaud
Reviewed-by: Fuji Masao
Discussion: 
https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4zlxws_cza3...@mail.gmail.com
---
 src/backend/access/transam/xlog.c   | 12 +-
 src/backend/access/transam/xloginsert.c | 13 +--
 src/backend/executor/execParallel.c | 38 ++-
 src/backend/executor/instrument.c   | 50 ++---
 src/include/access/xlog.h   |  3 +-
 src/include/executor/execParallel.h |  1 +
 src/include/executor/instrument.h   | 19 +-
 src/tools/pgindent/typedefs.list|  1 +
 8 files changed, 113 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1951103b26..6fb82c6e6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
 #include "commands/progress.h"
 #include "commands/tablespace.h"
 #include "common/controldata_utils.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -995,7 +996,8 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 XLogRecPtr
 XLogInsertRecord(XLogRecData *rdata,
 XLogRecPtr fpw_lsn,
-uint8 flags)
+uint8 flags,
+int num_fpw)
 {
XLogCtlInsert *Insert = &XLogCtl->Insert;
pg_crc32c   rdata_crc;
@@ -1251,6 +1253,14 @@ XLogInsertRecord(XLogRecData *rdata,
ProcLastRecPtr = StartPos;
XactLastRecEnd = EndPos;
 
+   /* Provide WAL update data to the instrumentation */
+   if (inserted)
+   {
+   pgWalUsage.wal_bytes += rechdr->xl_tot_len;
+   pgWalUsage.wal_records++;
+   pgWalUsage.wal_fpw_records += num_fpw;
+   }
+
return EndPos;
 }
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index a618dec776..413750948b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -25,6 +25,7 @@
 #include "access/xloginsert.h"
 #include "catalog/pg_control.h"
 #include "common/pg_lzcompress.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "replication/origin.h"
@@ -108,7 +109,7 @@ static MemoryContext xloginsert_cxt;
 
 static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
                                        XLogRecPtr RedoRecPtr, bool doPageWrites,
-                                       XLogRecPtr *fpw_lsn);
+                                       XLogRecPtr *fpw_lsn, int *num_fpw);
 static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
                                     uint16 hole_length, char *dest, uint16 *dlen);
 
@@ -448,6 +449,7 @@ XLogInsert(RmgrId rmid, uint8 info)
booldoPageWrites;
XLogRecPtr  fpw_lsn;
XLogRecData *rdt;
+   int num_fpw = 0;
 
  

Re: Possible copy and past error? (\usr\backend\commands\analyze.c)

2020-03-30 Thread Ranier Vilela
On Mon, Mar 30, 2020 at 06:06, Magnus Hagander wrote:

> On Sat, Mar 28, 2020 at 11:49 AM Ranier Vilela 
> wrote:
> >
> > On Fri, Mar 27, 2020 at 20:49, Tom Lane wrote:
> >>
> >> Ranier Vilela  writes:
> >> > Can someone check if there is a copy and paste error, at file:
> >> > \usr\backend\commands\analyze.c, at lines 2225 and 2226?
> >> > int num_mcv = stats->attr->attstattarget;
> >> > int num_bins = stats->attr->attstattarget;
> >>
> >> No, that's intentional I believe.  Those are independent variables
> >> that just happen to start out with the same value.
> >
> > Neither you nor I can say with 100% certainty what the original author's
> > intention was.
>
> Given that Tom is the original author, I think it's a lot more likely
> that he knows what the original author's intention was. It's certainly
> been a few years, so it probably isn't 100%, but the likelihood is
> pretty good.
>
Of course, now we all know.


>
>
> >> > To silence this alert.
> >>
> >> If you have a tool that complains about that coding, I think the
> >> tool needs a solid whack upside the head.  There's nothing wrong
> >> with the code, and it clearly expresses the intent, which the other
> >> way doesn't.  (Or in other words: it's the compiler's job to
> >> optimize away the duplicate fetch.  Not the programmer's.)
> >
> > I completely disagree. My tools have proven their worth, including
> finding serious errors in the code, which fortunately have been fixed by
> other committers.
> > When issuing this alert, the tool does not value judgment regarding
> performance or optimization, but it does an excellent job of finding
> similar patterns in adjacent lines, and the only thing it asked for was to
> be asked if this was really the case. original author's intention.
>
> All tools will give false positives. This simply seems one of those --
> it certainly could have been indicating a problem, but in this case it
> didn't.
>
That's what you said: it could have been a big problem, if it had really been
a copy-and-paste error.
I do not consider it a false positive, since the tool did not claim it was
a bug; it warned and asked the question.

regards,
Ranier Vilela


Re: truncating timestamps on arbitrary intervals

2020-03-30 Thread John Naylor
I wrote:

> I'm going to look into implementing timezone while awaiting comments on v6.

I attempted this in the attached v7. There are 4 new functions for
truncating timestamptz on an interval -- with and without origin, and
with and without time zone.

Parts of it are hackish, and need more work, but I think it's in
passable enough shape to get feedback on. The origin parameter logic
was designed with timestamps-without-time-zone in mind, and
retrofitting time zone on top of that was a bit messy. There might be
bugs.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v7-datetrunc_interval.patch
Description: Binary data


Re: Possible copy and past error? (\usr\backend\commands\analyze.c)

2020-03-30 Thread Ranier Vilela
On Mon, Mar 30, 2020 at 05:16, Michael Paquier wrote:

> On Sat, Mar 28, 2020 at 07:48:22AM -0300, Ranier Vilela wrote:
> > I completely disagree. My tools have proven their worth, including
> finding
> > serious errors in the code, which fortunately have been fixed by other
> > committers.
>
> FWIW, I think that the rule to always take Coverity's reports with a
> pinch of salt applies for any report.
>
I have certainly taken this advice seriously, since I have received all
kinds of, say, "words of discouragement".
I understand perfectly that the list is very busy and that patience with
mistakes may be in short supply, but these attitudes do not help new people
start working here.
I don't get paid to work with PostgreSQL, so consideration and recognition
are the only rewards for now.


>
> > When issuing this alert, the tool does not value judgment regarding
> > performance or optimization, but it does an excellent job of finding
> > similar patterns in adjacent lines, and the only thing it asked for was
> to
> > be asked if this was really the case. original author's intention.
>
> The code context matters a lot, but here let's leave this code alone.
> There is nothing wrong with it.
>
That is the question. Looking only at the code, there is no way to know
immediately that there is nothing wrong; there is not even a comment saying so.
That's all the tool asked for: to check whether there is really nothing wrong.

regards,
Ranier Vilela


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-30 Thread Tomas Vondra

On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:

On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
 wrote:


Hi,

Attached is a slightly reorganized patch series. I've merged the fixes
into the appropriate patches, and I've also combined the two patches
adding incremental sort paths to additional places in planner.

A couple more comments:


1) I think the GUC documentation in src/sgml/config.sgml is a bit too
detailed, compared to the other enable_* GUCs. I wonder if there's a
better place where to move the details. What about adding some examples
and explanation to perform.sgml?


I'll take a look at that and include in a patch series tomorrow.


2) Looking at the explain output, the verbose mode looks like this:

test=# explain (verbose, analyze) select a from t order by a, b, c;
   
QUERY PLAN
--
  Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual 
time=4.787..20092.555 rows=1000 loops=1)
Output: a, b, c
Workers Planned: 2
Workers Launched: 2
->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) (actual 
time=1.308..14021.575 rows=333 loops=3)
  Output: a, b, c
  Sort Key: t.a, t.b, t.c
  Presorted Key: t.a, t.b
  Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB 
peak=30kB
  Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB 
peak=138kB
  Worker 0:  actual time=0.766..16122.368 rows=3841573 loops=1
  Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB 
peak=141kB
  Worker 1:  actual time=1.986..16189.831 rows=3845490 loops=1
  Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB 
peak=139kB
  ->  Parallel Index Scan using t_a_b_idx on public.t  
(cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 
rows=333 loops=3)
Output: a, b, c
Worker 0:  actual time=0.048..11275.178 rows=3841573 loops=1
Worker 1:  actual time=0.041..11314.133 rows=3845490 loops=1
  Planning Time: 0.166 ms
  Execution Time: 25135.029 ms
(22 rows)

There seems to be missing indentation for the first line of worker info.


Working on that too.


I'm still not quite convinced we should be printing two lines - I know
you mentioned the lines might be too long, but see how long the other
lines may get ...


All right, I give in :)

Do you think non-workers (both the leader and non-parallel plans)
should also move to one line?



I think we should use the same formatting for both cases, so yes.

FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
bit, by moving the if condition in it. That makes the calls easier.


3) I see the new nodes (plan state, ...) have "presortedCols" which does
not indicate it's a "number of". I think we usually prefix names of such
fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)


I can fix this too.

Also I noticed a few compiler warnings I'll fixup in tomorrow's reply.



OK

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PostgreSQL proposal of Google Summer of Code

2020-03-30 Thread Fabrízio de Royes Mello
On Sun, Mar 29, 2020 at 4:48 PM Maryam Farrukh 
wrote:
>
> Hi Fabrizio,
>

Hi Maryam, please try to avoid top posting!!

Returning the discussion to pgsql-hackers.


> Thank you for reaching out. I have a question. I went through the links
> you provided me. There are already some project ideas over there. My
> question is: do we have to select a project from there or come up with an
> idea of our own.
>

If you have a good idea and that idea fits GSoC, feel free to propose
it... the listed ideas are just ideas.

Regards,

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


Re: [PATCH] Redudant initilization

2020-03-30 Thread Ranier Vilela
On Sun, Mar 29, 2020 at 21:57, Kyotaro Horiguchi <horikyota@gmail.com> wrote:

> Hello.
>
> At Sat, 28 Mar 2020 19:04:00 -0300, Ranier Vilela 
> wrote in
> > Hi,
> > This patch fixes some redundant initilization, that are safe to remove.
>
> > diff --git a/src/backend/access/gist/gistxlog.c
> b/src/backend/access/gist/gistxlog.c
> > index d3f3a7b803..ffaa2b1ab4 100644
> > --- a/src/backend/access/gist/gistxlog.c
> > +++ b/src/backend/access/gist/gistxlog.c
> > @@ -396,7 +396,7 @@ gistRedoPageReuse(XLogReaderState *record)
> >   if (InHotStandby)
> >   {
> >   FullTransactionId latestRemovedFullXid =
> xlrec->latestRemovedFullXid;
> > - FullTransactionId nextFullXid =
> ReadNextFullTransactionId();
> > + FullTransactionId nextFullXid;
>
> I'd prefer to preserve this and remove the latter.
>

Ok.


> > diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> > index 9d9e915979..795cf349eb 100644
> > --- a/src/backend/catalog/heap.c
> > +++ b/src/backend/catalog/heap.c
> > @@ -3396,7 +3396,7 @@ List *
> >  heap_truncate_find_FKs(List *relationIds)
> >  {
> >   List   *result = NIL;
> > - List   *oids = list_copy(relationIds);
> > + List   *oids;
>
> This was just a waste of memory, the fix looks fine.
>
> > diff --git a/src/backend/storage/smgr/md.c
> b/src/backend/storage/smgr/md.c
> > index c5b771c531..37fbeef841 100644
> > --- a/src/backend/storage/smgr/md.c
> > +++ b/src/backend/storage/smgr/md.c
> > @@ -730,9 +730,11 @@ mdwrite(SMgrRelation reln, ForkNumber forknum,
> BlockNumber blocknum,
> >  BlockNumber
> >  mdnblocks(SMgrRelation reln, ForkNumber forknum)
> >  {
> > - MdfdVec*v = mdopenfork(reln, forknum, EXTENSION_FAIL);
> > + MdfdVec*v;
> >   BlockNumber nblocks;
> > - BlockNumber segno = 0;
> > + BlockNumber segno;
> > +
> > +mdopenfork(reln, forknum, EXTENSION_FAIL);
> >
> >   /* mdopen has opened the first segment */
> >   Assert(reln->md_num_open_segs[forknum] > 0);
>
> It doesn't seems *to me* an issue.
>

Not a big deal, but the assignment of the variable v here is a small waste,
since it is assigned again right after.


> > diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
> > index 8bb00abb6b..7a6a2ecbe9 100644
> > --- a/src/backend/utils/adt/json.c
> > +++ b/src/backend/utils/adt/json.c
> > @@ -990,7 +990,7 @@ catenate_stringinfo_string(StringInfo buffer, const
> char *addon)
> >  Datum
> >  json_build_object(PG_FUNCTION_ARGS)
> >  {
> > - int nargs = PG_NARGS();
> > + int nargs;
>
> This part looks fine.
>
> >   int i;
> >   const char *sep = "";
> >   StringInfo  result;
> > @@ -998,6 +998,8 @@ json_build_object(PG_FUNCTION_ARGS)
> >   bool   *nulls;
> >   Oid*types;
> >
> > +PG_NARGS();
> > +
> >   /* fetch argument values to build the object */
> >   nargs = extract_variadic_args(fcinfo, 0, false, &args, &types, &nulls);
>
> PG_NARGS() doesn't have a side-effect so no need to call independently.
>
 Sorry, does that mean we can remove it completely?


>
> > diff --git a/src/backend/utils/mmgr/mcxt.c
> b/src/backend/utils/mmgr/mcxt.c
> > index 9e24fec72d..fb0e833b2d 100644
> > --- a/src/backend/utils/mmgr/mcxt.c
> > +++ b/src/backend/utils/mmgr/mcxt.c
> > @@ -475,7 +475,7 @@ MemoryContextMemAllocated(MemoryContext context,
> bool recurse)
> >
> >   if (recurse)
> >   {
> > - MemoryContext child = context->firstchild;
> > + MemoryContext child;
> >
> >   for (child = context->firstchild;
> >child != NULL;
>
> This looks fine.
>

Thank you for the review and consideration.

best regards,
Ranier Vilela


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-03-30 Thread k.jami...@fujitsu.com
On Wednesday, March 25, 2020 3:25 PM, Kirk Jamison wrote:
> As for the performance and how it affects the read-only workloads.
> Using pgbench, I've kept the overhead to a minimum, less than 1%.
> I'll post follow-up results.

Here are the follow-up results.
I executed tests similar to those at the top of the thread.
I hope the performance test results shown below will suffice.
If not, I'd appreciate any feedback with regard to the tests or the patch itself.

A. VACUUM execution + Failover test
- 100GB shared_buffers

1. 1000 tables (18MB)
1.1. Execution Time
- [MASTER] 77755.218 ms (01:17.755)
- [PATCH] Execution Time:   2147.914 ms (00:02.148)
1.2. Failover Time (Recovery WAL Replay):
- [MASTER] 01:37.084 (1 min 37.884 s)
- [PATCH] 1627 ms (1.627 s)

2. 1 tables (110MB)
2.1. Execution Time
- [MASTER] 844174.572 ms (14:04.175) ~14 min 4.175 s
- [PATCH] 75678.559 ms (01:15.679) ~1 min 15.679 s

2.2. Failover Time:
- [MASTER] est. 14 min++
(I didn't measure anymore because recovery takes
as much as the execution time.)
- [PATCH] 01:25.559 (1 min 25.559 s)

Significant performance results for VACUUM.


B. TPS Regression for READ-ONLY workload
(PREPARED QUERY MODE, NO VACUUM)

# [16 Clients]
- pgbench -n -S -j 16 -c 16 -M prepared -T 60 cachetest

|shbuf|Master  |Patch |% reg|
|--|--|---|--|
|128MB| 77,416.76 | 77,162.78 |0.33% |
|1GB | 81,941.30 | 81,812.05 |0.16% |
|2GB | 84,273.69 | 84,356.38 |-0.10%|
|100GB| 83,807.30 | 83,924.68 |-0.14%|

# [1 Client]
- pgbench -n -S -c 1 -M prepared -T 60 cachetest

|shbuf|Master  |Patch |% reg|
|--|--|---|--|
|128MB| 12,044.54 | 12,037.13 |0.06% |
|1GB | 12,736.57 | 12,774.77 |-0.30%|
|2GB | 12,948.98 | 13,159.90 |-1.63%|
|100GB| 12,982.98 | 13,064.04 |-0.62%|

Both were run 10 times; the average tps and % regression are
shown above. In some cases only minimal overhead was caused by
the patch, and in the other cases it has higher tps compared to master.

If it does not make it into this CF, I hope to receive feedback in the future
on how to proceed. Thanks in advance!

Regards,
Kirk Jamison


Re: Some problems of recovery conflict wait events

2020-03-30 Thread Masahiko Sawada
On Fri, 27 Mar 2020 at 17:54, Fujii Masao  wrote:
>
>
>
> On 2020/03/04 14:31, Masahiko Sawada wrote:
> > On Wed, 4 Mar 2020 at 13:48, Fujii Masao  
> > wrote:
> >>
> >>
> >>
> >> On 2020/03/04 13:27, Michael Paquier wrote:
> >>> On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:
>  Yeah, so 0001 patch sets existing wait events to recovery conflict
>  resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
>  to the recovery conflict on a snapshot. 0003 patch improves these wait
>  events by adding the new type of wait event such as
>  WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
>  is the fix for existing versions and 0003 patch is an improvement for
>  only PG13. Did you mean even 0001 patch doesn't fit for back-patching?
> >>
> >> Yes, it looks like a improvement rather than bug fix.
> >>
> >
> > Okay, understand.
> >
> >>> I got my eyes on this patch set.  The full patch set is in my opinion
> >>> just a set of improvements, and not bug fixes, so I would refrain from
> >>> back-backpatching.
> >>
> >> I think that the issue (i.e., "waiting" is reported twice needlessly
> >> in PS display) that 0002 patch tries to fix is a bug. So it should be
> >> fixed even in the back branches.
> >
> > So we need only two patches: one fixes process title issue and another
> > improve wait event. I've attached updated patches.
>
> I started reading 
> v2-0002-Improve-wait-events-for-recovery-conflict-resolut.patch.
>
> -   ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
> +   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_BUFFER_PIN);
>
> Currently the wait event indicating the wait for buffer pin has already
> been reported. But the above change in the patch changes the name of
> wait event for buffer pin only in the startup process. Is this really useful?
> Isn't the existing wait event for buffer pin enough?
>
> -   /* Wait to be signaled by the release of the Relation Lock */
> -   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
> +   /* Wait to be signaled by the release of the Relation Lock */
> +   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_LOCK);
>
> Same as above. Isn't the existing wait event enough?

Yeah, we can use the existing wait events for buffer pin and lock.

>
> -   /*
> -* Progressively increase the sleep times, but not to more than 1s, 
> since
> -* pg_usleep isn't interruptible on some platforms.
> -*/
> -   standbyWait_us *= 2;
> -   if (standbyWait_us > 100)
> -   standbyWait_us = 100;
> +   WaitLatch(MyLatch,
> + WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
> + STANDBY_WAIT_MS,
> + wait_event_info);
> +   ResetLatch(MyLatch);
>
> ResetLatch() should be called before WaitLatch()?

Fixed.

>
> Could you tell me why you dropped the "increase-sleep-times" logic?

I thought we could remove it because WaitLatch is interruptible, but my
observation was not correct. The waiting startup process is not
necessarily woken up by a signal. I think it's still better not to wait
more than 1 second, even if it's an interruptible wait.
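
As an illustration only, a sketch of the canonical latch-wait shape with a
1-second cap (this is not the attached patch; WaitForConflictToClear and
StillBlocked are made-up names, and the usual backend headers such as
storage/latch.h and miscadmin.h are assumed):

static void
WaitForConflictToClear(uint32 wait_event_info)
{
	for (;;)
	{
		/* Reset first, so a latch set after the check below is not lost. */
		ResetLatch(MyLatch);

		if (!StillBlocked())	/* placeholder for the real exit condition */
			break;

		/* Interruptible wait, but never sleep longer than 1s at a time. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L,
						 wait_event_info);

		CHECK_FOR_INTERRUPTS();
	}
}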

The attached patch fixes the above and introduces only two wait events for
conflict resolution: snapshot and tablespace. I also removed the wait
event for conflict resolution on a database, since it's unlikely to be
user-visible or a long sleep, as we discussed before.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


recovery_conflict_wait_event_v3.patch
Description: Binary data


Re: WIP/PoC for parallel backup

2020-03-30 Thread Rajkumar Raghuwanshi
Thanks Asif,

I have re-verified the reported issues. Except for the standby backup, the others are fixed.

Thanks & Regards,
Rajkumar Raghuwanshi


On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman  wrote:

>
>
> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Hi Asif,
>>
>> While testing further I observed parallel backup is not able to take
>> backup of standby server.
>>
>> mkdir /tmp/archive_dir
>> echo "archive_mode='on'">> data/postgresql.conf
>> echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf
>>
>> ./pg_ctl -D data -l logs start
>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>
>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">>
>> /tmp/slave/postgresql.conf
>> echo "restore_command='cp /tmp/archive_dir/%f %p'">>
>> /tmp/slave/postgresql.conf
>> echo "promote_trigger_file='/tmp/failover.log'">>
>> /tmp/slave/postgresql.conf
>>
>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>
>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "select
>> pg_is_in_recovery();"
>>  pg_is_in_recovery
>> ---
>>  f
>> (1 row)
>>
>> [edb@localhost bin]$ ./psql postgres -p 5433 -c "select
>> pg_is_in_recovery();"
>>  pg_is_in_recovery
>> ---
>>  t
>> (1 row)
>>
>>
>>
>>
>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>> pg_basebackup: error: could not list backup files: ERROR:  the standby was promoted during online backup
>> HINT:  This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>
>> #same is working fine without parallel backup
>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>> [edb@localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>> /tmp/bkp_s/PG_VERSION
>>
>> Thanks & Regards,
>> Rajkumar Raghuwanshi
>>
>>
>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>
>>> Hi Asif,
>>>
>>> In another scenario, bkp data is corrupted for the tablespace. Again, this
>>> is not reproducible every time, but if I run the same set of commands I
>>> get the same error.
>>>
>>> [edb@localhost bin]$ ./pg_ctl -D data -l logfile start
>>> waiting for server to start done
>>> server started
>>> [edb@localhost bin]$
>>> [edb@localhost bin]$ mkdir /tmp/tblsp
>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace
>>> tblsp location '/tmp/tblsp';"
>>> CREATE TABLESPACE
>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create database testdb
>>> tablespace tblsp;"
>>> CREATE DATABASE
>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a
>>> text);"
>>> CREATE TABLE
>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl
>>> values ('parallel_backup with tablespace');"
>>> INSERT 0 1
>>> [edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
>>> /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>> [edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p "
>>> start
>>> waiting for server to start done
>>> server started
>>> [edb@localhost bin]$ ./psql postgres -p  -c "select * from
>>> pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>>   oid  |  spcname   | spcowner | spcacl | spcoptions
>>> ---++--++
>>>   1663 | pg_default |   10 ||
>>>  16384 | tblsp  |   10 ||
>>> (2 rows)
>>>
>>> [edb@localhost bin]$ ./psql testdb -p  -c "select * from testtbl";
>>> psql: error: could not connect to server: FATAL:
>>>  "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>> DETAIL:  File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
>>> missing.
>>> [edb@localhost bin]$
>>> [edb@localhost bin]$ ls
>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> [edb@localhost bin]$ ls
>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> ls: cannot access
>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
>>> directory
>>>
>>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>>
>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>
 Hi Asif,

 On testing further, I found that when taking a backup with -R, pg_basebackup
 crashed. This crash is not consistently reproducible.

 [edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a
 text);"
 CREATE TABLE
 [edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test
 values ('parallel_backup with -R recovery-conf');"
 INSERT 0 1
 [edb@localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp
 -R
 Segmentation fault (core dumped)

 stack trace looks the same as it was on earlier reported crash with
 tablespace.
 --stack trace

Re: WAL usage calculation patch

2020-03-30 Thread Amit Kapila
On Sun, Mar 29, 2020 at 5:49 PM Julien Rouhaud  wrote:
>

@@ -1249,6 +1250,16 @@ XLogInsertRecord(XLogRecData *rdata,
  ProcLastRecPtr = StartPos;
  XactLastRecEnd = EndPos;

+ /* Provide WAL update data to the instrumentation */
+ if (inserted)
+ {
+ pgWalUsage.wal_bytes += rechdr->xl_tot_len;
+ if (doPageWrites && fpw_lsn <= RedoRecPtr)
+ pgWalUsage.wal_fpw_records++;
+ else
+ pgWalUsage.wal_records++;
+ }
+

I think the above code has multiple problems. (a) fpw_lsn can be
InvalidXLogRecPtr and still there could be full-page image (for ex.
when REGBUF_FORCE_IMAGE flag for buffer is set).  (b) There could be
multiple FPW records while inserting a record; consider when there are
multiple registered buffers.  I think the right place to figure this
out is XLogRecordAssemble. (c) There are cases when we also attach the
record data even when we decide to write FPW (cf. REGBUF_KEEP_DATA),
so we might want to increment wal_fpw_records and wal_records for such
cases.

I think the right place to compute this information is
XLogRecordAssemble even though we update it at the place where you
have it in the patch.  You can probably compute that in local
variables and then transfer to pgWalUsage in XLogInsertRecord.  I am
fine if you can think of some other way but the current patch doesn't
seem correct to me.
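
A rough sketch of that division of labour (illustration only, not actual
patch code; include_image and num_fpw stand in for whatever the real
bookkeeping ends up being called):

	/* In XLogRecordAssemble(): count each registered buffer for which a
	 * full-page image is included, whether or not rdata is also kept. */
	if (include_image)
		(*num_fpw)++;

	/* In XLogInsertRecord(): fold the totals into the instrumentation only
	 * once the record has actually been inserted. */
	if (inserted)
	{
		pgWalUsage.wal_bytes += rechdr->xl_tot_len;
		pgWalUsage.wal_records++;
		pgWalUsage.wal_fpw_records += num_fpw;
	}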

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-30 Thread David Rowley
On Mon, 30 Mar 2020 at 19:49, David Rowley  wrote:
> I'll see if I can come up with some way to do this in a more
> deterministic way to determine which tables to add vacuums for, rather
> than waiting for and reacting post-failure.

I ended up running make installcheck on an instance with
autovacuum_naptime set to 1s with a small additional debug line in
autovacuum.c, namely:

diff --git a/src/backend/postmaster/autovacuum.c
b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..ad81e321dc 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3099,6 +3099,9 @@ relation_needs_vacanalyze(Oid relid,
*dovacuum = force_vacuum || (vactuples > vacthresh) ||
(vac_ins_base_thresh >= 0 &&
instuples > vacinsthresh);
*doanalyze = (anltuples > anlthresh);
+
+   if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
+   elog(LOG, " %s", NameStr(classForm->relname));
}
else
{

I grepped the log after the installcheck to grab the table names that
saw an insert vacuum during the test then grepped the test output to
see if the table appears to pose a risk of test instability.
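
For reference, the condition in the debug line fires when the tuples
inserted since the last vacuum exceed the insert threshold; assuming that
threshold is formed like the existing vacuum threshold (a base GUC plus a
scale factor times reltuples, with defaults of 1000 and 0.2 for the new
insert GUCs, if memory serves), it is roughly:

	/* sketch only, not the committed code */
	vacinsthresh = (float4) vac_ins_base_thresh +
		vac_ins_scale_factor * reltuples;
	insert_vacuum = (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);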

I've classed each table with a risk factor. "VeryLow" seems like
there's almost no risk because we don't ever look at EXPLAIN.  Low
risk tables look at EXPLAIN, but I feel they're not looking in
enough detail to cause issues. Medium risk tables look at EXPLAIN and I
feel there's a risk of some change; I think these are all Append nodes,
which do order subnodes based on their cost. High risk tables are the
ones I'm about to look into changing.

The full results of my analysis are:

Table: agg_group_1 aggregates.out. Nothing looks at EXPLAIN. Risk:VeryLow
Table: agg_hash_1 aggregates.out. Nothing looks at EXPLAIN. Risk:VeryLow
Table: atest12 privileges.out. Lots of looking at EXPLAIN, but nothing
appears to look into row estimates in detail. Risk:Low
Table: brin_test brin.out.  Test already does VACUUM ANALYZE. Risk:VeryLow
Table: bt_f8_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: bt_i4_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: bt_name_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: bt_txt_heap btree_index.out, create_index.out. Rows loaded in
copy.source. Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: dupindexcols create_index.out. Some looking at EXPLAIN plans,
but nothing appears to look into row estimates in detail. Risk:Low
Table: fast_emp4000 create_am.out, create_index.out, create_misc.out.
Lots of looking at EXPLAIN, but nothing appears to look into row
estimates in detail. Risk:Low
Table: functional_dependencies stats_ext.out. Lots of looking at
EXPLAIN output. Test looks at row estimates. Risk:High
Table: gist_tbl gist.out. Lots of looking at EXPLAIN, but nothing
appears to look into row estimates in detail. Risk:Low
Table: hash_f8_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: hash_i4_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: hash_name_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: hash_txt_heap hash_index.out. Rows loaded in copy.source.
Nothing appears to look at EXPLAIN. Risk:VeryLow
Table: kd_point_tbl create_index_spgist.out. Lots of looking at
EXPLAIN, but nothing appears to look into row estimates in detail.
Risk:Low
Table: mcv_lists stats_ext.out. Lots of looking at EXPLAIN, but tests
appear to VACUUM after loading rows. Risk:Low
Table: mcv_lists_arrays stats_ext.out. Nothing appears to look at
EXPLAIN. Risk:VeryLow
Table: mcv_lists_bool stats_ext.out.  Lots of looking at EXPLAIN
output. Test looks at row estimates. Risk:High
Table: ndistinct stats_ext.out.  Lots of looking at EXPLAIN output.
Test looks at row estimates. Only 1000 rows are loaded initially and
then 5000 after a truncate. 1000 rows won't trigger the auto-vacuum.
Risk:High
Table: onek Lots of files. Sees a VACUUM in sanity_check test,
however, some tests run before sanity_check, e.g. create_index,
select, copy, none of which appear to pay particular attention to
anything vacuum might change. Risk:Low
Table: pagg_tab_ml_p2_s1 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_ml_p2_s2 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_ml_p3_s1 partition_aggregate.out Appears to be some
risk of Append reordering partitions based on cost. Risk:Medium
Table: pagg_tab_ml_p3_s2 partition_aggregate.out Appears to be some
risk of Append 

Re: snapper vs. HEAD

2020-03-30 Thread Tom Turelinckx
Hi,

Tom Lane wrote:
> Confirmed: building gistget with --enable-cassert, and all of snapper's
> compile/link options, produces something that passes regression.

Skate uses buildfarm default configuration, whereas snapper uses settings which 
are used when building postgresql debian packages. Debian packages are built 
without --enable-cassert, but most buildfarm animals build with 
--enable-cassert. I specifically configured skate and snapper like that because 
I ran into such issues in the past (where debian packages would fail to build 
on sparc, but buildfarm animals on debian sparc did not highlight the issue).

In the past, I've already switched from gcc 4.6 to gcc 4.7 as a workaround for 
a similar compiler bug, but I can't upgrade to a newer gcc without backporting 
it myself, so for the moment I've switched snapper to use -O1 instead of -O2, 
for HEAD only.

Not sure whether wheezy on sparc 32-bit is very relevant today, but it's an 
exotic platform, so I try to keep those buildfarm animals alive as long as it's 
possible.

Best regards,
Tom Turelinckx




Re: Possible copy and past error? (\usr\backend\commands\analyze.c)

2020-03-30 Thread Magnus Hagander
On Sat, Mar 28, 2020 at 11:49 AM Ranier Vilela  wrote:
>
> On Fri, Mar 27, 2020 at 20:49, Tom Lane wrote:
>>
>> Ranier Vilela  writes:
>> > Can someone check if there is a copy and paste error, at file:
>> > \usr\backend\commands\analyze.c, at lines 2225 and 2226?
>> > int num_mcv = stats->attr->attstattarget;
>> > int num_bins = stats->attr->attstattarget;
>>
>> No, that's intentional I believe.  Those are independent variables
>> that just happen to start out with the same value.
>
> Neither you nor I can say with 100% certainty what the original author's
> intention was.

Given that Tom is the original author, I think it's a lot more likely
that he knows what the original author's intention was. It's certainly
been a few years, so it probably isn't 100%, but the likelihood is
pretty good.


>> > To silence this alert.
>>
>> If you have a tool that complains about that coding, I think the
>> tool needs a solid whack upside the head.  There's nothing wrong
>> with the code, and it clearly expresses the intent, which the other
>> way doesn't.  (Or in other words: it's the compiler's job to
>> optimize away the duplicate fetch.  Not the programmer's.)
>
> I completely disagree. My tools have proven their worth, including finding 
> serious errors in the code, which fortunately have been fixed by other 
> committers.
> When issuing this alert, the tool does not value judgment regarding 
> performance or optimization, but it does an excellent job of finding similar 
> patterns in adjacent lines, and the only thing it asked for was to be asked 
> if this was really the case. original author's intention.

All tools will give false positives. This simply seems one of those --
it certainly could have been indicating a problem, but in this case it
didn't.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: error context for vacuum to include block number

2020-03-30 Thread Amit Kapila
On Fri, Mar 27, 2020 at 5:03 AM Justin Pryzby  wrote:
>

Now that the main patch is committed, I have reviewed the other two patches.

v37-0002-Drop-reltuples
1.
@@ -2289,11 +2289,10 @@ vacuum_one_index(Relation indrel,
IndexBulkDeleteResult **stats,

  /* Do vacuum or cleanup of the index */
  if (lvshared->for_cleanup)
- lazy_cleanup_index(indrel, stats, lvshared->reltuples,
-lvshared->estimated_count, vacrelstats);
+ lazy_cleanup_index(indrel, stats, vacrelstats);
  else
  lazy_vacuum_index(indrel, stats, dead_tuples,
-   lvshared->reltuples, vacrelstats);
+   vacrelstats);

I don't think the above change is correct.  How will vacrelstats have
correct values when vacuum_one_index is called via parallel workers
(via parallel_vacuum_main)?

The v37-0003-Avoid-some-calls-to-RelationGetRelationName.patch looks
good to me.  I have added the commit message in the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v38-0001-Avoid-some-calls-to-RelationGetRelationName-and-.patch
Description: Binary data

