Re: intermittent failures in Cygwin from select_parallel tests

2021-06-21 Thread Thomas Munro
On Tue, Jun 22, 2021 at 5:21 PM Noah Misch  wrote:
> On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote:
> > On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch  wrote:
> > > postmaster algorithms rely on the PG_SETMASK() calls preventing that.
> > > Without such protection, duplicate bgworkers are an understandable
> > > result.  I caught several other assertions; the PMChildFlags failure is
> > > another case of duplicate postmaster children:
> > >
> > >   6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
> > >   3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
> > >  20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
> > >  21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)
> > >  Also, got a few "select() failed in postmaster: Bad address"
> > >
> > > I suspect a Cygwin signals bug.  I'll try to distill a self-contained test
> > > case for the Cygwin hackers.  The lack of failures on buildfarm member 
> > > brolga
> > > argues that older Cygwin is not affected.
> >
> > Nice detective work.
>
> Thanks.  http://marc.info/?t=15018329641 has my upstream report.  The
> Cygwin project lead reproduced this, but a fix remained elusive.
>
> I guess we'll ignore weird postmaster-associated lorikeet failures for the
> foreseeable future.

While reading a list of recent build farm assertion failures I learned that
this is still broken in Cygwin 3.2, and eventually found my way back
to this thread.  I was wondering about suggesting some kind of
official warning, but I guess the manual already covers it with this
10 year old notice.  I don't know much about Windows or Cygwin so I'm
not sure if it needs updating or not, but I would guess that there are
no longer any such systems?

  Cygwin is not recommended for running a
  production server, and it should only be used for running on
  older versions of Windows where
  the native build does not work.




Re: [HACKERS] logical decoding of two-phase transactions

2021-06-21 Thread Greg Nancarrow
On Mon, Jun 21, 2021 at 4:37 PM Peter Smith  wrote:
>
> Please find attached the latest patch set v88*
>

Some minor comments:

(1)
v88-0002

doc/src/sgml/logicaldecoding.sgml

"examples shows" is not correct.
I think there is only ONE example being referred to.

BEFORE:
+The following examples shows how logical decoding is controlled over the
AFTER:
+The following example shows how logical decoding is controlled over the


(2)
v88 - 0003

doc/src/sgml/ref/create_subscription.sgml

(i)

BEFORE:
+  to the subscriber on the PREPARE TRANSACTION. By default,
the transaction
+  prepared on publisher is decoded as a normal transaction at commit.
AFTER:
+  to the subscriber on the PREPARE TRANSACTION. By default,
the transaction
+  prepared on the publisher is decoded as a normal
transaction at commit time.

(ii)

src/backend/access/transam/twophase.c

The double-bracketing is unnecessary:

BEFORE:
+ if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
AFTER:
+ if (gxact->valid && strcmp(gxact->gid, gid) == 0)

(iii)

src/backend/replication/logical/snapbuild.c

Need to add some commas to make the following easier to read, and
change "needs" to "need":

BEFORE:
+ * The prepared transactions that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot needs
+ * to be sent later along with commit prepared and they must be before
+ * this point.
AFTER:
+ * The prepared transactions, that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot, need
+ * to be sent later along with commit prepared and they must be before
+ * this point.

(iv)

src/backend/replication/logical/tablesync.c

I think the convention used in Postgres code is to check for empty
Lists using "== NIL" and non-empty Lists using "!= NIL".

BEFORE:
+ if (table_states_not_ready && !last_start_times)
AFTER:
+ if (table_states_not_ready != NIL && !last_start_times)


BEFORE:
+ else if (!table_states_not_ready && last_start_times)
AFTER:
+ else if (table_states_not_ready == NIL && last_start_times)


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Toast compression method options

2021-06-21 Thread Dilip Kumar
On Fri, Jun 18, 2021 at 12:13 PM Michael Paquier  wrote:
>
> On Thu, May 06, 2021 at 07:23:48PM +0530, Dilip Kumar wrote:
> > I have fixed some comments offlist reported by Justin.  Apart from
> > that, I have also improved documentation and test case.  Stil it has a
> > lot of cleanup to be done but I am not planning to do that
> > immediately.
>
> I was testing the various compression algos we touch in core, and I am
> not really convinced that we need more code to control that.  First,
> pglz is living as-is in the backend code for a very long time and no
> one has expressed an interest in controlling the compression strategy
> used AFAIK.  On top of that, LZ4 outclasses it easily, so if there is
> a need to worry about performance with the TOAST data, users could
> just move to use LZ4.
>
> +   if (strcmp(def->defname, "acceleration") == 0)
> +   {
> +   int32 acceleration =
> +   pg_atoi(defGetString(def), sizeof(acceleration), 0);
> +
> +   if (acceleration < INT32_MIN || acceleration > INT32_MAX)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("unexpected value for lz4 compression acceleration: \"%s\"",
> +   defGetString(def)),
> +errhint("expected value between INT32_MIN and INT32_MAX")
> +   ));
>
> Then comes the part with LZ4 and its acceleration.  The default
> compression level used by LZ4 compresses data the most, and it is
> already pretty cheap in CPU.  Do you have cases where this would be
> useful?  Increasing the acceleration reduces the compression to be
> close to zero, but if one cares about the compression cost, he/she
> could fall back to the external storage for basically the same
> effect.  Is there really a use-case for something in-between?

IMHO there is certainly a use case: basically, we compress the data so
that we can avoid storing it externally.  Now suppose that for some
data, with default LZ4 compression, the compression ratio is so high
that you are able to compress it to a size which is way under the
limit.  For such data, the acceleration can be increased such that
compression is fast and the compression ratio is still good enough that
the data does not go to external storage.  I agree it will be difficult
for the user to make such a decision and select the acceleration value,
but based on the data pattern and its compressed length the admin can
make such a decision.  So, in short, select the acceleration value such
that you can compress fast and the compression ratio is good enough to
keep the data from being stored externally.
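
For anyone unfamiliar with the knob being discussed, here is a minimal
standalone sketch (not the patch itself) of LZ4's acceleration parameter,
which trades compression ratio for speed:

#include <stdio.h>
#include <string.h>
#include <lz4.h>

/*
 * Illustration only: LZ4_compress_fast() takes an "acceleration" argument;
 * acceleration == 1 behaves like LZ4_compress_default(), while larger
 * values compress faster but less tightly.
 */
int
main(void)
{
	char		src[8192];
	char		dst[LZ4_COMPRESSBOUND(8192)];
	int			len;

	memset(src, 'x', sizeof(src));	/* trivially compressible input */

	len = LZ4_compress_fast(src, dst, sizeof(src), sizeof(dst), 8);
	printf("compressed %zu bytes down to %d bytes (acceleration = 8)\n",
		   sizeof(src), len);
	return 0;
}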

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Detecting File Damage & Inconsistencies

2021-06-21 Thread Craig Ringer
On Tue, 22 Jun 2021 at 00:24, Simon Riggs 
wrote:

> On Thu, Mar 18, 2021 at 6:20 AM Craig Ringer
>  wrote:
> >
> > On Mon, 15 Mar 2021 at 21:01, David Steele  wrote:
> >>
> >> On 11/18/20 5:23 AM, Simon Riggs wrote:
> >> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer
> >> >  wrote:
> >> >>
> >> >> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs 
> wrote:
> >> >>>
> >> >>>
> >> >>> What I'm proposing is an option to add 16 bytes onto each COMMIT
> >> >>> record
> >> >>
> >> >>
> >> >> Would it make sense to write this at the time we write a topxid
> assignment to WAL instead?
> >> >>
> >> >> Otherwise it won't be accessible to streaming-mode logical decoding.
> >> >
> >> > Do you mean extend the xl_xact_assignment record? My understanding is
> >> > that is not sent in all cases, so not sure what you mean by "instead".
> >>
> >> Craig, can you clarify?
> >
> >
> > Right. Or write a separate WAL record when the feature is enabled. But
> it's probably sufficient to write it as an optional chunk on
> xl_xact_assignment records. We often defer writing them so we can optimise
> away xacts that never actually wrote anything, but IIRC we still write one
> before we write any WAL that references the xid. That'd be fine, since we
> don't need the info any sooner than that during decoding. I'd have to
> double check that we write it in all cases and won't get to that too soon,
> but I'm pretty sure we do...
>
> The commit record is optimized away if no xid is assigned, though is
> still present if we didn't write any WAL records.
>
> But if a commit record exists in the WAL stream, we want to know where
> it came from.
>
> A later patch will add PITR capability based on this information so
> attaching it directly to the commit record is fairly important, IMHO.
>

Why?

All the proposed info:

* 8-byte session start time (from MyStartTime)
* 2-byte pid (from MyProcPid)
* 4-byte user oid

are available at topxid assignment time. If we defer writing them until
commit, we lose the ability to use this information during streaming
logical decoding. That's something I believe you've wanted for other
functionality in the past, such as logical decoding based audit
functionality.
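
For concreteness, a hypothetical layout of that payload (the struct and
field names below are invented for illustration and not taken from the
patch):

#include "postgres.h"
#include "pgtime.h"

/*
 * Illustration only: roughly 8 + 2 + 4 = 14 bytes of origin information,
 * matching the list above; the actual on-disk layout and padding would be
 * up to the patch.
 */
typedef struct xl_xact_session_info
{
	pg_time_t	session_start_time;	/* from MyStartTime, 8 bytes */
	uint16		session_pid;		/* from MyProcPid, truncated to 2 bytes
									 * as proposed */
	Oid			user_oid;			/* 4 bytes */
} xl_xact_session_info;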

IIRC the restart_lsn horizon already ensures that we can't miss the
xl_xact_assignment at the start of a txn. We would ensure that the desired
info is available throughout decoding of the txn, including at commit
record processing time, by adding it to the toplevel ReorderBufferTxn.

The only advantage I can see to annotating the commit record instead is
that we don't have to spend a few bytes per reorder-buffered topxid to
track this info between start of decoding for the tx and processing of the
commit record. I don't think that's worth caring about.  The advantages that
having it earlier would give us are much more significant.

A few examples:

* Skip reorder buffering of non-target transactions early, so we can decode
the WAL stream to find the target transactions much faster using less
memory and I/O;

* Read the database change stream and use the session info to stream info
into an intrusion detection system and/or audit engine in real time, using
txn streaming to avoid the need to create huge reorder buffers;

* Re-decode the WAL stream to identify a target txn you know was aborted,
and commit it instead, so you can recover data from aborted txns from the
WAL stream using logical decoding. (Only possible if the catalog_xmin
hasn't advanced past that point already though)

So yeah. I think it'd be better to log the info you want at start-of-txn
unless there's a compelling reason not to, and I don't see one yet.


Re: Doc chapter for Hash Indexes

2021-06-21 Thread Amit Kapila
On Tue, Jun 22, 2021 at 4:25 AM Justin Pryzby  wrote:
>
> On Mon, Jun 21, 2021 at 02:08:12PM +0100, Simon Riggs wrote:
> > New chapter for Hash Indexes, designed to help users understand how
> > they work and when to use them.
> >
> > Mostly newly written, but a few paras lifted from README when they were 
> > helpful.
>
>
..
> +  the column value also makes all hash index scans lossy. Hash indexes may
> +  take part in bitmap index scans and backward scans.
>
> Isn't it more correct to say that it must use a bitmap scan?
>

Why? Hash indexes do support regular index scan.

-- 
With Regards,
Amit Kapila.




subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE

2021-06-21 Thread Thomas Munro
Hi,

While scanning for assertion failures on the build farm that don't
appear to have been discussed, I found this[1] in
010_truncate_publisher.log on the 13 branch:

TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File:
"/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c",
Line: 321)
2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:4] LOG:  server process
(PID 1014658) was terminated by signal 11: Segmentation fault
2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:5] DETAIL:  Failed
process was running: SELECT pg_catalog.set_config('search_path', '',
false);

The last thing the segfaulting process said was:

2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:8] LOG:  logical decoding
found consistent point at 0/157D538
2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:9] DETAIL:  There are no
running transactions.

Unfortunately 13 doesn't log PIDs for assertion failures (hmm, commit
18c170a08ee could be back-patched?) so it's not clear which process
that was, and there's no backtrace.

I don't know if "pg_catalog.set_config('search_path', '', false)" is a
clue that this is related to another recent crash[2] I mentioned, also
from the 13 branch.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes=2021-06-16%2023:58:47
[2] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BqdF6QE6rcj_Zj5h2qVARM--%2B92sqdmr-0DUSM_0Qu_g%40mail.gmail.com




Re: fdatasync performance problem with large number of DB files

2021-06-21 Thread Thomas Munro
On Fri, Jun 18, 2021 at 1:11 PM Justin Pryzby  wrote:
> Thomas, could you comment on this ?

Sorry, I missed that.  It is initially a confusing proposal, but after
trying it out (that is: making recovery_init_sync_method PGC_SIGHUP
and testing a scenario where you want to make the next crash use it
that way and without the change), I agree.  +1 from me.




Re: fdatasync performance problem with large number of DB files

2021-06-21 Thread Michael Paquier
On Fri, Jun 18, 2021 at 06:18:59PM +0900, Fujii Masao wrote:
> On 2021/06/04 23:39, Justin Pryzby wrote:
>> You said switching to SIGHUP "would have zero effect"; but, actually it
>> allows an admin whose DB took a long time in recovery/startup to change
>> the parameter without shutting down the service.  This mitigates the
>> downtime if it crashes again.  I think that's at least 50% of how this
>> feature might end up being used.
> 
> Yes, it would have an effect when the server is automatically restarted
> after crash when restart_after_crash is enabled. At least for me +1 to
> your proposed change.

Good point about restart_after_crash, I did not consider that.
Switching recovery_init_sync_method to SIGHUP could be helpful with
that.
--
Michael




Re: Added schema level support for publication.

2021-06-21 Thread Bharath Rupireddy
On Tue, Jun 22, 2021 at 9:45 AM vignesh C  wrote:
> I have removed the skip table patches to keep the focus on the main
> patch, once this patch gets into committable shape, I will focus on
> the skip table patch.

IMO it's a good idea to start a new thread for the "skip table"
feature so that we can discuss it separately. If required you can
specify in that thread that the idea of the "skip table" can be
applied to the "add schema level support" feature.

With Regards,
Bharath Rupireddy.




Re: SSL/TLS instead of SSL in docs

2021-06-21 Thread Michael Paquier
On Mon, Jun 21, 2021 at 01:23:56PM +0200, Daniel Gustafsson wrote:
> On 18 Jun 2021, at 07:37, Michael Paquier  wrote:
>> It looks inconsistent to me to point to the libpq documentation to get
>> the details about SNI.  Wouldn't it be better to have an item in the
>> glossary that refers to the bits of RFC 6066, and remove the reference
>> of the RFC from the libpq page?
> 
> I opted for a version with SNI in the glossary but without a link to the RFC
> since no definitions in the glossary have ulinks.

Okay, but why make all this complicated if it can be simple?  If I
read the top of the page, the description of the glossary refers more
to terms that apply to PostgreSQL and RDBMSs in general.  I think that
something like the attached is actually more suitable, where there are
acronyms for SNI and MITM, and where the references we have in the
libpq docs are moved to the page for acronyms.  That's consistent with
what we do now for the existing acronyms SSL and TLS, and there does
not seem to be a need for any references to all those terms in the glossary.

>> -   to present a valid (trusted) SSL certificate, while
>> +   to present a valid (trusted) 
>> SSL/TLS certificate, while
>> This style with two acronyms for what we want to be one thing is
>> heavy.  Could it be better to just have one single acronym called
>> SSL/TLS that references both parts?
> 
> Maybe, I don't know.  I certainly don't prefer the way which is in the patch
> but I also think it's the most "correct" way so I opted for that to start the
> discussion.  If we're fine with one acronym tag for the combination then I'm
> happy to change to that.

It feels a bit more natural to me to have SSL/TLS as a single keyword in
the contexts where they apply.  Do we actually have sections in the docs
where only one of them applies, like some protocol references?

> A slightly more invasive idea would be to not have acronyms at all and instead
> move the ones that do benefit from clarification to the glossary?  ISTM that
> having a brief description of terms and not just the expansion is beneficial 
> to
> the users.  That would however be for another thread, but perhaps thats worth
> discussing?

Not sure about this bit.  That's a more general topic for sure, but I
also like the separation we have now between the glossary and the
acronyms.  Having an entry in one does not make the same entry necessary
in the other, and vice-versa.

>> Patch 0003, for the  markups with OpenSSL, included one
>> SSL/TLS entry.
> 
> Ugh, sorry, that must've been a git add -p fat-finger.

The extra SSL/TLS entry was in one of the files changed by f80979f, so
git add has been working just fine :)
--
Michael
diff --git a/doc/src/sgml/acronyms.sgml b/doc/src/sgml/acronyms.sgml
index 13bd819eb1..658000b29c 100644
--- a/doc/src/sgml/acronyms.sgml
+++ b/doc/src/sgml/acronyms.sgml
@@ -410,6 +410,17 @@
 

 
+   
+MITM
+
+ 
+  https://en.wikipedia.org/wiki/Man-in-the-middle_attack;>
+  Man-in-the-middle
+ 
+
+   
+

 MSVC
 
@@ -590,6 +601,18 @@
 

 
+   
+SNI
+
+ 
+  https://en.wikipedia.org/wiki/Server_Name_Indication;>
+   Server Name Indication,
+  https://tools.ietf.org/html/rfc6066#section-3;>RFC 6066
+ 
+
+   
+

 SPI
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index eeba2caa43..f82ae4e461 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1378,7 +1378,7 @@ include_dir 'conf.d'
   

 Disables anonymous cipher suites that do no authentication.  Such
-cipher suites are vulnerable to man-in-the-middle attacks and
+cipher suites are vulnerable to MITM attacks and
 therefore should not be used.

   
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 441cc0da3a..641970f2a6 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1783,18 +1783,17 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   

 By default, libpq sets the TLS extension Server Name
-Indication (SNI) on SSL-enabled connections.  See https://tools.ietf.org/html/rfc6066#section-3;>RFC 6066
-for details.  By setting this parameter to 0, this is turned off.
+Indication (SNI) on SSL-enabled connections.
+By setting this parameter to 0, this is turned off.

 

 The Server Name Indication can be used by SSL-aware proxies to route
 connections without having to decrypt the SSL stream.  (Note that this
 requires a proxy that is aware of the PostgreSQL protocol handshake,
-not just any SSL proxy.)  However, SNI makes the destination host name
-appear in cleartext in the network traffic, so it might be undesirable
-in some cases.
+not just any SSL proxy.)  However, SNI makes the
+destination host name appear in cleartext in the 

Re: Different compression methods for FPI

2021-06-21 Thread Michael Paquier
On Mon, Jun 21, 2021 at 10:13:58PM -0500, Justin Pryzby wrote:
> @Michael: I assume that if you merge this patch, you'd set your animals to use
> wal_compression=lz4, and then they would fail the recovery tests.

Yes, I'd like to do that on my animal dangomushi.

> So the patches that you say are unrelated still seem to me to be a
> prerequisite .

I may be missing something, of course, but I still don't see why
that's necessary?  We don't get any test failures on HEAD by switching
wal_compression to on, no?  That's easy enough to test with a two-line
change that changes the default.

> +/* compression methods supported */
> +#define BKPIMAGE_COMPRESS_PGLZ 0x04
> +#define BKPIMAGE_COMPRESS_ZLIB 0x08
> +#define BKPIMAGE_COMPRESS_LZ4  0x10
> +#define BKPIMAGE_COMPRESS_ZSTD 0x20
> +#define BKPIMAGE_IS_COMPRESSED(info) \
> +   ((info & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_ZLIB | \
> + BKPIMAGE_COMPRESS_LZ4 | BKPIMAGE_COMPRESS_ZSTD)) != 0)
> 
> You encouraged saving bits here, so I'm surprised to see that your patches
> use one bit per compression method: 2 bits to support no/pglz/lz4, 3 to add
> zstd, and the previous patch used 4 bits to also support zlib.

Yeah, I know.  I just finished it that way to get something readable
for the sake of the tests.  As you say, the point is moot if we add
just one new method anyway, as we need just one new bit.  And that's
what I would like to do for v15 with LZ4, as the resulting patch is
simple.  It would be an idea to discuss more compression methods here
once we hear more from users after this is released in the field,
reconsidering at that point whether more is necessary or not.
--
Michael




RE: Fix for segfault in logical replication on master

2021-06-21 Thread osumi.takami...@fujitsu.com
On Tuesday, June 22, 2021 11:08 AM Japin Li  wrote:
> On Mon, 21 Jun 2021 at 19:06, Amit Kapila  wrote:
> > On Mon, Jun 21, 2021 at 4:09 PM Japin Li  wrote:
> >>
> >> On Mon, 21 Jun 2021 at 17:54, Amit Kapila 
> wrote:
> >> > On Mon, Jun 21, 2021 at 2:06 PM Japin Li  wrote:
> >> >>
> >> >> On Mon, 21 Jun 2021 at 16:22, Amit Kapila
>  wrote:
> >> >> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li 
> wrote:
> >> >> >>
> >> >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila
>  wrote:
> >> >> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila
>  wrote:
> >> >> >>
> >> >> >> Or we can free the memory owned by indexoidlist after check
> >> >> >> whether it is NIL, because we do not use it in the later.
> >> >> >>
> >> >> >
> >> >> > Valid point. But I am thinking do we really need to fetch and
> >> >> > check indexoidlist here?
> >> >>
> >> >> IMO, we should not fetch and check the indexoidlist here, since we
> >> >> do not use it.  However, we should use RelationGetIndexList() to
> >> >> update the relation->rd_replidindex, so we should fetch the
> >> >> indexoidlist, maybe we can use the following code:
> >> >>
> >> >> indexoidlist = RelationGetIndexList(relation);
> >> >> list_free(indexoidlist);
> >> >>
> >> >> Or does there any function that only update the
> >> >> relation->rd_replidindex or related fields, but do not fetch the
> indexoidlist?
> >> >>
> >> >
> >> > How about RelationGetReplicaIndex? It fetches the indexlist only
> >> > when required and frees it immediately. But otherwise, currently,
> >> > there shouldn't be any memory leak because we allocate this in
> >> > "logical replication output context" which is reset after
> >> > processing each change message, see pgoutput_change.
> >>
> >> Thanks for your explanation.  It might not be a memory leak, however
> >> it's a little confusing when we free the memory of the indexoidlist in
> >> one place, but not free it in another place.
> >>
> >> I attached a patch to fix this.  Any thoughts?
> >>
> >
> > Your patch appears to be on the lines we discussed but I would prefer
> > to get it done after Beta2 as this is just a minor code improvement.
> > Can you please send the change as a patch file instead of copy-pasting
> > the diff at the end of the email?
> 
> Thanks for your review! Attached v1 patch.
Your patch applies cleanly to HEAD.
I also reviewed your patch, and it seems OK.
make check-world passed with your patch in my environment as well.

Best Regards,
Takamichi Osumi





Re: Different compression methods for FPI

2021-06-21 Thread Justin Pryzby
On Tue, May 25, 2021 at 12:05:19PM +0530, Dilip Kumar wrote:
> +++ b/src/test/recovery/t/011_crash_recovery.pl
> @@ -14,7 +14,7 @@ use Config;
>  plan tests => 3;
> 
>  my $node = get_new_node('primary');
> -$node->init(allows_streaming => 1);
> +$node->init();
>  $node->start;
> 
> How this change is relevant?

It's necessary for the tests to pass - see the prior discussions.
Revert them and the tests fail.

time make -C src/test/recovery check
#   Failed test 'new xid after restart is greater'

@Michael: I assume that if you merge this patch, you'd set your animals to use
wal_compression=lz4, and then they would fail the recovery tests.  So the
patches that you say are unrelated still seem to me to be a prerequisite.

From: Kyotaro Horiguchi
Subject: [PATCH v8 2/9] Run 011_crash_recovery.pl with wal_level=minimal

From: Kyotaro Horiguchi
Subject: [PATCH v8 3/9] Make sure published XIDs are persistent

+/* compression methods supported */
+#define BKPIMAGE_COMPRESS_PGLZ 0x04
+#define BKPIMAGE_COMPRESS_ZLIB 0x08
+#define BKPIMAGE_COMPRESS_LZ4  0x10
+#define BKPIMAGE_COMPRESS_ZSTD 0x20
+#define BKPIMAGE_IS_COMPRESSED(info) \
+   ((info & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_ZLIB | \
+ BKPIMAGE_COMPRESS_LZ4 | BKPIMAGE_COMPRESS_ZSTD)) != 0)

You encouraged saving bits here, so I'm surprised to see that your patches
use one bit per compression method: 2 bits to support no/pglz/lz4, 3 to add
zstd, and the previous patch used 4 bits to also support zlib.

There are spare bits available for that, but now there can be an inconsistency
if two bits are set.  Also, 2 bits could support 4 methods (including "no").
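
For illustration, a self-contained sketch of the kind of 2-bit encoding
alluded to here (all names and bit positions below are hypothetical, not
taken from any posted patch):

/*
 * Hypothetical: pack the compression method into two spare info bits as a
 * small enum instead of one flag bit per method, so inconsistent
 * combinations cannot be represented.
 */
#define BKPIMAGE_COMPRESS_SHIFT		2
#define BKPIMAGE_COMPRESS_MASK		(0x03 << BKPIMAGE_COMPRESS_SHIFT)

#define BKPIMAGE_COMPRESSION_NONE	0
#define BKPIMAGE_COMPRESSION_PGLZ	1
#define BKPIMAGE_COMPRESSION_LZ4	2
#define BKPIMAGE_COMPRESSION_ZSTD	3

#define BKPIMAGE_GET_COMPRESSION(info) \
	(((info) & BKPIMAGE_COMPRESS_MASK) >> BKPIMAGE_COMPRESS_SHIFT)
#define BKPIMAGE_IS_COMPRESSED(info) \
	(BKPIMAGE_GET_COMPRESSION(info) != BKPIMAGE_COMPRESSION_NONE)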

-- 
Justin




Re: Two patches to speed up pg_rewind.

2021-06-21 Thread Paul Guo
On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier  wrote:
>
> On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > > you cross a filesystem boundary, is that something we need to worry
> > > about there?
> >
> > Hmm.  Good point.  That may justify having a switch to control that.
>
> Paul, the patch set still needs some work, so I am switching it as
> waiting on author.  I am pretty sure that we had better have a
> fallback implementation of copy_file_range() in src/port/, and that we
> are going to need an extra switch in pg_rewind to allow users to
> bypass copy_file_range()/EXDEV if they do a local rewind operation
> across different FSes with a kernel < 5.3.
> --

I modified the copy_file_range() patch yesterday to simply fall back to
read()+write(), but I think it could be improved further.

We may add a function to determine whether two files/paths are
copy_file_range() capable or not (using the POSIX-standard statvfs()
f_fsid?) - that could be used by other copy_file_range() users, although
in the long run the function is not needed.
And even with this we may still need the fallback code.
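
For illustration, a minimal sketch of the fallback pattern being described,
assuming Linux copy_file_range(2); the function name and error handling are
simplified and not taken from the patch:

#define _GNU_SOURCE
#include <unistd.h>
#include <errno.h>

/*
 * Try copy_file_range() first and fall back to plain read()+write() when
 * the kernel or filesystem cannot do it (EXDEV on Linux < 5.3 across
 * filesystems, ENOTSUP, or the syscall being absent).
 */
static ssize_t
copy_range_with_fallback(int srcfd, int dstfd, size_t len)
{
	ssize_t		nwritten = copy_file_range(srcfd, NULL, dstfd, NULL, len, 0);

	if (nwritten >= 0)
		return nwritten;
	if (errno != EXDEV && errno != ENOTSUP && errno != ENOSYS)
		return -1;				/* a real error: let the caller report it */

	/* Fallback: copy one chunk with read()+write(). */
	{
		char		buf[65536];
		ssize_t		nread;

		nread = read(srcfd, buf, len < sizeof(buf) ? len : sizeof(buf));
		if (nread <= 0)
			return nread;
		return write(dstfd, buf, nread);
	}
}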

- For pg_rewind, we may just determine that ability once on the src/dst
  pgdata, but since there might be soft links (tablespace/wal) in pgdata,
  we should still allow a fallback for copying files that are not
  copy_file_range() capable.
- Also it seems that sometimes copy_file_range() could return
  ENOTSUP/EOPNOTSUPP (the file system does not support it and the kernel
  does not fall back to simple copying?), although this is not documented
  and it seems unusual.

Any idea?




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-06-21 Thread Bruce Momjian
On Mon, Jun 21, 2021 at 07:43:59PM -0700, Peter Geoghegan wrote:
> On Mon, Jun 21, 2021 at 5:06 PM Tom Lane  wrote:
> > Agreed.  I think I'd previously suggested something under src/tools,
> > but we might as well do like others are doing; especially since
> > we have .gitattributes and the like there already.
> 
> Great.
> 
> Attached is a patch file that puts it all together. I would like to
> commit this in the next couple of days.

+1

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Optionally automatically disable logical replication subscriptions on error

2021-06-21 Thread Mark Dilger



> On Jun 21, 2021, at 5:57 PM, Peter Smith  wrote:
> 
> * Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
> at some point in future and then going into a relaunch loop for
> days/weeks and causing 1000's of errors without the user noticing. In
> that case, this patch seems to be quite useful, but for this goal
> maybe you don't want to be checking the tablesync workers at all, but
> should only be checking the apply worker like your original v1 patch
> did.

Yeah, my motivation was preventing an infinite loop, and providing a clean way 
for the users to know that replication they are waiting for won't ever 
complete, rather than having to infer that it will never halt. 

> * Is the goal just to be a convenient way to disable the subscription
> during the CREATE SUBSCRIPTION phase so that the user can make
> corrections in peace without the workers re-launching and making more
> error logs?

No.  This is not and never was my motivation.  It's an interesting question, 
but that idea never crossed my mind.  I'm not sure what changes somebody would 
want to make *after* creating the subscription.  Certainly, there may be 
problems with how they have things set up, but they won't know that until the 
first error happens.

> Here the patch is helpful, but only for simple scenarios
> like 1 faulty table. Imagine if there are 10 tables (all with PK
> violations at DATASYNC copy) then you will encounter them one at a
> time and have to re-enable the subscription 10 times, after fixing
> each error in turn.

You are assuming disable_on_error=true.  It is false by default.  But ok, let's 
accept that assumption for the sake of argument.  Now, will you have to 
manually go through the process 10 times?  I'm not sure.  The user might figure 
out their mistake after seeing the first error.

> So in this scenario the new option might be more
> of a hindrance than a help because it would be easier if the user just
> did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the
> problems in one sitting before re-enabling.

Yeah, but since the new option is off by default, I don't see any sensible 
complaint.

> 
> * etc
> 
> //
> 
> Finally, here is one last (crazy?) thought-bubble just for
> consideration. I might be wrong, but my gut feeling is that the Stats
> Collector is intended more for "tracking" and for "metrics" rather
> than for holding duplicates of logged error messages. At the same
> time, I felt that disabling an entire subscription due to a single
> rogue error might be overkill sometimes.

I'm happy to entertain criticism of the particulars of how my patch approaches 
this problem, but it is already making a distinction between transient errors 
(resources, network, etc.) vs. ones that are non-transient.  Again, I might not 
have drawn the line in the right place, but the patch is not intended to 
disable subscriptions in response to transient errors.

> But I wonder if there is a
> way to combine those two ideas so that the Stats Collector gets some
> new counter for tracking the number of worker re-launches that have
> occurred, meanwhile there could be a subscription option which gives a
> threshold above which you would disable the subscription.
> e.g.
> "disable_on_error_threshold=0" default, relaunch forever
> "disable_on_error_threshold=1" disable upon first error encountered.
> (This is how your patch behaves now I think.)
> "disable_on_error_threshold=500" disable if the re-launch errors go
> unattended and happen 500 times.

That sounds like a misfeature to me.  You could have a subscription that works 
fine for a month, surviving numerous short network outages, but then gets 
autodisabled after a longer network outage.  I'm not sure why anybody would 
want that.  You might argue for exponential backoff, where it never gets 
autodisabled on transient errors, but retries less frequently.  But I don't 
want to expand the scope of this patch to include that, at least not without a 
lot more evidence that it is needed.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Use simplehash.h instead of dynahash in SMgr

2021-06-21 Thread Thomas Munro
On Tue, Jun 22, 2021 at 1:55 PM David Rowley  wrote:
> On Tue, 22 Jun 2021 at 03:43, Tom Lane  wrote:
> > I kind of wonder if we really need four different hash table
> > implementations (this being the third "generic" one, plus hash join
> > has its own, and I may have forgotten others).  Should we instead
> > think about revising simplehash to gain the benefits of this patch?
>
> hmm, yeah. I definitely agree with trying to have as much reusable
> code as we can when we can. It certainly reduces maintenance and bugs
> tend to be found more quickly too. It's a very worthy cause.

It is indeed really hard to decide when you have a new thing, and when
you need a new way to parameterise the existing generic thing.  I
wondered about this how-many-hash-tables-does-it-take question a lot
when writing dshash.c (a chaining hash table that can live in weird
dsa.c memory, backed by DSM segments created on the fly that may be
mapped at different addresses in each backend, and has dynahash-style
partition locking), and this was around the time Andres was talking
about simplehash.  In retrospect, I'd probably kick out the built-in
locking and partitions and instead let callers create their own
partitioning scheme on top from N tables, and that'd remove one quirk,
leaving only the freaky pointers and allocator.  I recall from a
previous life that Boost's unordered_map template is smart enough to
support running in shared memory mapped at different addresses just
through parameterisation that controls the way it deals with internal
pointers (boost::unordered_map<..., ShmemAllocator>), which seemed
pretty clever to me, and it might be achievable to do the same with a
generic hash table for us that could take over dshash's specialness.

One idea I had at the time is that the right number of hash table
implementations in our tree is two: one for chaining (like dynahash)
and one for open addressing/probing (like simplehash), and that
everything else should be hoisted out (locking, partitioning) or made
into template parameters through the generic programming technique
that simplehash.h has demonstrated (allocators, magic pointer type for
internal pointers, plus of course the inlinable ops).  But that was
before we'd really fully adopted the idea of this style of template
code.  (I also assumed the weird memory stuff would be temporary and
we'd move to threads, but that's another topic for another thread.)
It seems like you'd disagree with this, and you'd say the right number
is three.  But it's also possible to argue for one...
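
(For readers unfamiliar with the technique being referenced: simplehash.h
is parameterised with preprocessor defines before it is included, roughly
as in the sketch below, which is modelled on existing callers such as
tidbitmap.c. The element type, key type, and hash function here are only
an example, not code from the tree.)

#include "postgres.h"
#include "common/hashfn.h"

/* Example element; simplehash.h requires a member named "status". */
typedef struct DemoEntry
{
	uint32		key;			/* hash key */
	uint32		count;			/* payload */
	char		status;			/* entry state, managed by simplehash.h */
} DemoEntry;

#define SH_PREFIX demo
#define SH_ELEMENT_TYPE DemoEntry
#define SH_KEY_TYPE uint32
#define SH_KEY key
#define SH_HASH_KEY(tb, k) murmurhash32(k)
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"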

A more superficial comment: I don't like calling hash tables "hash".
I blame perl.




Re: Optionally automatically disable logical replication subscriptions on error

2021-06-21 Thread Amit Kapila
On Tue, Jun 22, 2021 at 6:27 AM Peter Smith  wrote:
>
> #3. There is another suggestion to use the Stats Collector to hold the
> error message [Amit-2]. For me, this felt like blurring too much the
> distinction between "stats tracking/metrics" and "logs". ERROR logs
> must be flushed, whereas for stats (IIUC) there is no guarantee that
> everything you need to see would be present. Indeed Amit wrote "But in
> this case, if the stats collector missed updating the information, the
> user may have to manually update the subscription and let the error
> happen again to see it." [Amit-3]. Requesting the user to cause the
> same error again just in case it was not captured a first time seems
> too strange to me.
>

I don't think it will often be the case that the stats collector will
miss updating the information. I am not feeling comfortable storing
error information in system catalogs. We have some other views which
capture somewhat similar conflict information
(pg_stat_database_conflicts) or failed transactions information. So, I
thought here we are extending the similar concept by storing some
additional information about errors.

-- 
With Regards,
Amit Kapila.




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 5:06 PM Tom Lane  wrote:
> Agreed.  I think I'd previously suggested something under src/tools,
> but we might as well do like others are doing; especially since
> we have .gitattributes and the like there already.

Great.

Attached is a patch file that puts it all together. I would like to
commit this in the next couple of days.

-- 
Peter Geoghegan


v1-0001-Add-list-of-ignorable-pgindent-commits-for-git-bl.patch
Description: Binary data


Re: Optionally automatically disable logical replication subscriptions on error

2021-06-21 Thread Masahiko Sawada
On Mon, Jun 21, 2021 at 7:48 PM Amit Kapila  wrote:
>
> On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila  wrote:
> >
> > On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
> >  wrote:
> >
> > >  I don't mind if you want to store more information, and maybe that needs 
> > > to be stored somewhere else.  Do you believe pg_subscription_rel is a 
> > > suitable location?
> > >
> > It won't be sufficient to store information in either
> > pg_subscription_rel or pg_susbscription. I think if we want to store
> > the required information in a catalog then we need to define a new
> > catalog (pg_subscription_conflicts or something like that) with
> > information corresponding to each rel in subscription (srsubid oid
> > (Reference to subscription), srrelid oid (Reference to relation),
> > ). OTOH, we can choose to send the error
> > information to stats collector which will then be available via stat
> > view and update system catalog to disable the subscription but there
> > will be a risk that we might send info of failed transaction to stats
> > collector but then fail to update system catalog to disable the
> > subscription.
> >
>
> I think we should store the input from the user (like disable_on_error
> flag or xid to skip) in the system catalog pg_subscription and send
> the error information (subscrtion_id, rel_id, xid of failed xact,
> error_code, error_message, etc.) to the stats collector which can be
> used to display such information via a stat view.
>
> The disable_on_error flag handling could be that on error it sends the
> required error info to stats collector and then updates the subenabled
> in pg_subscription. In rare conditions, where we are able to send the
> message but couldn't update the subenabled info in pg_subscription
> either due to some error or server restart, the apply worker would
> again try to apply the same change and would hit the same error again
> which I think should be fine because it will ultimately succeed.
>
> The skip xid handling will also be somewhat similar where on an error,
> we will send the error information to stats collector which will be
> displayed via stats view. Then the user is expected to ask for skip
> xid (Alter Subscription ... SKIP ) based on information
> displayed via stat view. Now, the apply worker can skip changes from
> such a transaction, and then during processing of commit record of the
> skipped transaction, it should update xid to invalid value, so that
> next time that shouldn't be used. I think it is important to update
> xid to an invalid value as part of the skipped transaction because
> otherwise, after the restart, we won't be able to decide whether we
> still want to skip the xid stored for a subscription.

Sounds reasonable.

The feature that sends the error information to the stats collector is
common to both and is itself also useful. As discussed in the skip
transaction patch thread, it would also be good if we wrote the error
information (relation, action, xid, etc.) to the server log too.
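
For clarity, a very rough sketch of the disable-on-error flow described
above, as it might sit in the apply worker; the helper names
(pgstat_report_subscription_error, DisableSubscription) and the
disableonerr flag are hypothetical here, not the actual patch:

/*
 * Hypothetical wrapper around the existing apply_dispatch(): on error,
 * report to the stats collector first, then disable the subscription if
 * the user asked for that, then let the error propagate as usual.
 */
static void
apply_dispatch_with_disable_on_error(StringInfo s)
{
	PG_TRY();
	{
		apply_dispatch(s);
	}
	PG_CATCH();
	{
		/* 1. Ship the error details to the stats collector first. */
		pgstat_report_subscription_error(MySubscription->oid);	/* hypothetical */

		/* 2. Then clear subenabled in pg_subscription if requested. */
		if (MySubscription->disableonerr)	/* hypothetical option */
			DisableSubscription(MySubscription->oid);	/* hypothetical */

		PG_RE_THROW();
	}
	PG_END_CATCH();
}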

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: row filtering for logical replication

2021-06-21 Thread Peter Smith
On Fri, Jun 18, 2021 at 9:40 PM Amit Kapila  wrote:
>
[...]
> I have rebased the patch so that you can try it out. The main thing I
> have done is to remove changes in worker.c and created a specialized
> function to create estate for pgoutput.c as I don't think we need what
> is done in worker.c.

Thanks for the recent rebase.

- The v15 patch applies OK (albeit with whitespace warning)
- make check is passing OK
- the new TAP tests 020_row_filter is passing OK.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Optionally automatically disable logical replication subscriptions on error

2021-06-21 Thread Mark Dilger



> On Jun 21, 2021, at 5:57 PM, Peter Smith  wrote:
> 
> * Is the goal mainly to help automated (TAP) testing?

Absolutely, that was my original motivation.  But I don't think that is the 
primary reason the patch would be accepted.  There is a cost to having the 
logical replication workers attempt ad infinitum to apply a transaction that 
will never apply.

Also, if you are waiting for a subscription to catch up, it is far from obvious 
that you will wait forever.

> In that case,
> then maybe you do want to store the error message somewhere other than
> the log files. But still I wonder if results would be unpredictable
> anyway - e.g if there are multiple tables all with errors then it
> depends on the tablesync order of execution which error you see caused
> the auto-disable, right? And if it is not predictable maybe it is less
> useful.

But if you are writing a TAP test, you should be the one controlling whether 
that is the case.  I don't think it would be unpredictable from the point of 
view of the test author.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: RFC: Logging plan of the running query

2021-06-21 Thread torikoshia

On 2021-06-16 20:36, torikoshia wrote:

other background or parallel worker.


As far as I looked around, there seems no easy ways to do so.


If we were to invent a new
mechanism just for addressing the above comment, I would rather choose
to not do that as it seems like an overkill. We can leave it up to the
user whether or not to unnecessarily signal those processes which are
bound to print "PID XXX is not executing queries now" message.


+1. I'm going to proceed in this direction.


Updated the patch.


On Thu, Jun 10, 2021 at 11:09 AM torikoshia  
wrote:

On 2021-06-09 23:04, Fujii Masao wrote:



> auto_explain can log the plan of even nested statement
> if auto_explain.log_nested_statements is enabled.
> But ISTM that pg_log_current_plan() cannot log that plan.
> Is this intentional?

> I think that it's better to make pg_log_current_plan() log
> the plan of even nested statement.

+1. It would be better.
But currently the plan information is obtained from ActivePortal and ISTM
there is no easy way to retrieve the plan information of nested statements
from ActivePortal.
Anyway I'll do some more research.


I haven't found a proper way yet but it seems necessary to use something 
other than ActivePortal and I'm now thinking this could be a separate 
patch in the future.


--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATIONFrom 7ad9a280b6f74e4718293863716046c02b0a3835 Mon Sep 17 00:00:00 2001
From: atorik 
Date: Tue, 22 Jun 2021 10:03:39 +0900
Subject: [PATCH v4] Add function to log the untruncated query string and its
 plan for the query currently running on the backend with the specified
 process ID.

Currently, we have to wait for the query execution to finish
before we check its plan. This is not so convenient when
investigating long-running queries on production environments
where we cannot use debuggers.
To improve this situation, this patch adds
pg_log_current_query_plan() function that requests to log the
plan of the specified backend process.

Only superusers are allowed to request to log the plans because
allowing any users to issue this request at an unbounded rate
would cause lots of log messages and which can lead to denial
of service.

On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its plan at LOG_SERVER_ONLY level, so
that these plans will appear in the server log but not be sent
to the client.

Since some codes and comments of pg_log_current_query_plan() 
are the same with pg_log_backend_memory_contexts(), this
patch also refactors them to make them common.

Reviewd-by: Bharath Rupireddy, Fujii Masao, Dilip Kumar
---
 doc/src/sgml/func.sgml   | 44 
 src/backend/commands/explain.c   | 75 
 src/backend/storage/ipc/procsignal.c | 66 +
 src/backend/tcop/postgres.c  |  4 ++
 src/backend/utils/adt/mcxtfuncs.c| 51 +
 src/backend/utils/init/globals.c |  1 +
 src/include/catalog/pg_proc.dat  |  6 ++
 src/include/commands/explain.h   |  2 +
 src/include/miscadmin.h  |  1 +
 src/include/storage/procsignal.h |  3 +
 src/test/regress/expected/misc_functions.out | 12 +++-
 src/test/regress/sql/misc_functions.sql  |  8 +--
 12 files changed, 217 insertions(+), 56 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6388385edc..dad2d34a04 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24940,6 +24940,27 @@ SELECT collation for ('foo' COLLATE "de_DE");

   
 
+  
+   
+
+ pg_log_current_query_plan
+
+pg_log_current_query_plan ( pid integer )
+boolean
+   
+   
+Requests to log the untruncated query string and its plan for
+the query currently running on the backend with the specified
+process ID.
+They will be logged at LOG message level and
+will appear in the server log based on the log
+configuration set (See 
+for more information), but will not be sent to the client
+regardless of .
+Only superusers can request to log plan of the running query.
+   
+  
+
   

 
@@ -25053,6 +25074,29 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 because it may generate a large number of log messages.

 
+   
+pg_log_current_query_plan can be used
+to log the plan of a backend process. For example:
+
+postgres=# SELECT pg_log_current_query_plan(201116);
+ pg_log_current_query_plan
+---
+ t
+(1 row)
+
+The format of the query plan is the same as when FORMAT TEXT
+and VEBOSE are used in the EXPLAIN command.
+For example:
+
+LOG:  plan of the query running on backend with PID 201116 is:
+Query Text: SELECT * FROM pgbench_accounts;
+Seq Scan on public.pgbench_accounts  

Re: Optionally automatically disable logical replication subscriptions on error

2021-06-21 Thread Mark Dilger



> On Jun 21, 2021, at 5:57 PM, Peter Smith  wrote:
> 
> #5. Document to refer to the logs. All ERROR details are already in
> the logs, and this seems to me the intuitive place to look for them.

My original motivation came from writing TAP tests to check that the 
permissions systems would properly deny the apply worker when running under a 
non-superuser role.  The idea is that the user with the responsibility for 
managing subscriptions won't have enough privilege to read the logs.  Whatever 
information that user needs (if any) must be someplace else.

> Searching for specific errors becomes difficult programmatically (is
> this really a problem other than complex TAP tests?).

I believe there is a problem, because I remain skeptical that these errors will 
be both existent and rare.  Either you've configured your system correctly and 
you get zero of these, or you've misconfigured it and you get some non-zero 
number of them.  I don't see any reason to assume that number will be small.

The best way to deal with that is to be able to tell the system what to do with 
them, like "if the error has this error code and the error message matches this 
regular expression, then do this, else do that."  That's why I think allowing 
triggers to be created on subscriptions makes the most sense (though is 
probably the hardest system being proposed so far.)

> But here there
> is no risk of missing or insufficient information captured in the log
> files ("but still there will be some information related to ERROR
> which we wanted the user to see unless we ask them to refer to logs
> for that." [Amit-4}).

Not only is there a problem if the user doesn't have permission to view the 
logs, but also, if we automatically disable the subscription until the error is 
manually cleared, the logs might be rotated out of existence before the user 
takes any action.  In that case, the logs will be entirely missing, and not 
even the error message will remain.  At least with the patch I submitted, the 
error message will remain, though I take Amit's point that there are 
deficiencies in handling parallel tablesync workers, etc.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Fix for segfault in logical replication on master

2021-06-21 Thread Japin Li

On Mon, 21 Jun 2021 at 19:06, Amit Kapila  wrote:
> On Mon, Jun 21, 2021 at 4:09 PM Japin Li  wrote:
>>
>> On Mon, 21 Jun 2021 at 17:54, Amit Kapila  wrote:
>> > On Mon, Jun 21, 2021 at 2:06 PM Japin Li  wrote:
>> >>
>> >> On Mon, 21 Jun 2021 at 16:22, Amit Kapila  wrote:
>> >> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li  wrote:
>> >> >>
>> >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila  
>> >> >> wrote:
>> >> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila 
>> >> >> >  wrote:
>> >> >>
>> >> >> Or we can free the memory owned by indexoidlist after check whether it 
>> >> >> is NIL,
>> >> >> because we do not use it in the later.
>> >> >>
>> >> >
>> >> > Valid point. But I am thinking do we really need to fetch and check
>> >> > indexoidlist here?
>> >>
>> >> IMO, we should not fetch and check the indexoidlist here, since we do not
>> >> use it.  However, we should use RelationGetIndexList() to update the
>> >> relation->rd_replidindex, so we should fetch the indexoidlist, maybe we
>> >> can use the following code:
>> >>
>> >> indexoidlist = RelationGetIndexList(relation);
>> >> list_free(indexoidlist);
>> >>
>> >> Or does there any function that only update the relation->rd_replidindex
>> >> or related fields, but do not fetch the indexoidlist?
>> >>
>> >
>> > How about RelationGetReplicaIndex? It fetches the indexlist only when
>> > required and frees it immediately. But otherwise, currently, there
>> > shouldn't be any memory leak because we allocate this in "logical
>> > replication output context" which is reset after processing each
>> > change message, see pgoutput_change.
>>
>> Thanks for your explanation.  It might not be a memory leak, however it's
>> a little confusing when we free the memory of the indexoidlist in one place,
>> but not free it in another place.
>>
>> I attached a patch to fix this.  Any thoughts?
>>
>
> Your patch appears to be on the lines we discussed but I would prefer
> to get it done after Beta2 as this is just a minor code improvement.
> Can you please send the change as a patch file instead of copy-pasting
> the diff at the end of the email?

Thanks for your review! Attached v1 patch.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

>From 0582d92ba4df517f3e67717751cddb8916c8ab9f Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Tue, 22 Jun 2021 09:59:38 +0800
Subject: [PATCH v1] Minor code beautification for RelationGetIdentityKeyBitmap

---
 src/backend/utils/cache/relcache.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d55ae016d0..94fbf1aa19 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5244,9 +5244,9 @@ Bitmapset *
 RelationGetIdentityKeyBitmap(Relation relation)
 {
 	Bitmapset  *idindexattrs = NULL;	/* columns in the replica identity */
-	List	   *indexoidlist;
 	Relation	indexDesc;
 	int			i;
+	Oid			replidindex;
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the result */
@@ -5260,18 +5260,14 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	/* Historic snapshot must be set. */
 	Assert(HistoricSnapshotActive());
 
-	indexoidlist = RelationGetIndexList(relation);
-
-	/* Fall out if no indexes (but relhasindex was set) */
-	if (indexoidlist == NIL)
-		return NULL;
+	replidindex = RelationGetReplicaIndex(relation);
 
 	/* Fall out if there is no replica identity index */
-	if (!OidIsValid(relation->rd_replidindex))
+	if (!OidIsValid(replidindex))
 		return NULL;
 
 	/* Look up the description for the replica identity index */
-	indexDesc = RelationIdGetRelation(relation->rd_replidindex);
+	indexDesc = RelationIdGetRelation(replidindex);
 
 	if (!RelationIsValid(indexDesc))
 		elog(ERROR, "could not open relation with OID %u",
@@ -5295,7 +5291,6 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	}
 
 	RelationClose(indexDesc);
-	list_free(indexoidlist);
 
 	/* Don't leak the old values of these bitmaps, if any */
 	bms_free(relation->rd_idattr);
-- 
2.30.2



Re: Use simplehash.h instead of dynahash in SMgr

2021-06-21 Thread David Rowley
On Tue, 22 Jun 2021 at 03:43, Tom Lane  wrote:
> I kind of wonder if we really need four different hash table
> implementations (this being the third "generic" one, plus hash join
> has its own, and I may have forgotten others).  Should we instead
> think about revising simplehash to gain the benefits of this patch?

hmm, yeah. I definitely agree with trying to have as much reusable
code as we can when we can. It certainly reduces maintenance and bugs
tend to be found more quickly too. It's a very worthy cause.

I did happen to think of this when I was copying swathes of code out
of simplehash.h.  However, I decided that the two implementations are
sufficiently different that if I tried to merge them both into one .h
file, we'd have some unreadable and unmaintainable mess.  I just don't
think their DNA is compatible enough for the two to be mated
successfully.  For example, simplehash uses open addressing and
generichash does not.  This means that things like iterating over the
table works completely differently.  Lookups in generichash need to
perform an extra step to fetch the actual data from the segment
arrays.  I think it would certainly be possible to merge the two, but
I just don't think it would be easy code to work on if we did that.

The good thing is that that the API between the two is very similar
and it's quite easy to swap one for the other.  I did make changes
around memory allocation due to me being too cheap to zero memory when
I didn't need to and simplehash not having any means of allocating
memory without zeroing it.

I also think that there's just no one-size-fits-all hash table type.
simplehash will not perform well when the size of the stored element
is very large. There's simply too much memcpying to move data around
during insert/delete.  simplehash will also have terrible iteration
performance in sparsely populated tables.  However, simplehash will be
pretty much unbeatable for lookups where the element type is very
small, e.g single Datum, or an int.  The CPU cache efficiency there
will be pretty much unbeatable.

I tried to document the advantages of each in the file header
comments.  I should probably also add something to simplehash.h's
comments to mention generichash.h

David




Re: Use simplehash.h instead of dynahash in SMgr

2021-06-21 Thread David Rowley
On Tue, 22 Jun 2021 at 02:53, Robert Haas  wrote:
> At the risk of kibitzing the least-important detail of this proposal,
> I'm not very happy with the names of our hash implementations.
> simplehash is not especially simple, and dynahash is not particularly
> dynamic, especially now that the main place we use it is for
> shared-memory hash tables that can't be resized. Likewise, generichash
> doesn't really give any kind of clue about how this hash table is
> different from any of the others. I don't know how possible it is to
> do better here; naming things is one of the two hard problems in
> computer science. In a perfect world, though, our hash table
> implementations would be named in such a way that somebody might be
> able to look at the names and guess on that basis which one is
> best-suited to a given task.

I'm certainly open to better names.  I did almost call it stablehash,
in regards to the pointers to elements not moving around like they do
with simplehash.

I think more generally, hash table implementations are complex enough
that it's pretty much impossible to give them a short enough
meaningful name.  Most papers just end up assigning a name to some
technique. e.g Robinhood, Cuckoo etc.

Both simplehash and generichash use a variant of Robinhood hashing.
simplehash uses open addressing and generichash does not.  If Andres
had called it "robinhoodhash" instead of simplehash, then someone might
come along and complain that his implementation is broken because it
does not implement tombstoning.
Maybe Andres thought he'd avoid that by not claiming that it's an
implementation of a Robinhood hash table.  That seems pretty wise to
me. Naming it simplehash was a pretty simple way of avoiding that
problem.
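
For anyone who hasn't stared at this stuff recently, the Robinhood part
boils down to comparing probe distances on insert.  A toy open-addressing
version looks something like this -- purely a self-contained illustration,
not code from simplehash.h or generichash.h; it ignores duplicate keys and
assumes the table never fills up:

#include <stdint.h>

#define TABSIZE 1024            /* power of two, for cheap masking */

typedef struct RHSlot
{
    uint32_t    key;
    uint32_t    value;
    int         used;           /* 0 = empty */
} RHSlot;

static RHSlot slots[TABSIZE];

static uint32_t
rh_hash(uint32_t key)
{
    /* any reasonable integer mixer will do for the illustration */
    key ^= key >> 16;
    key *= 0x85ebca6b;
    key ^= key >> 13;
    return key;
}

static void
rh_insert(uint32_t key, uint32_t value)
{
    uint32_t    pos = rh_hash(key) & (TABSIZE - 1);
    uint32_t    dist = 0;       /* probe distance of the entry we carry */
    RHSlot      cur = {key, value, 1};

    for (;;)
    {
        if (!slots[pos].used)
        {
            slots[pos] = cur;   /* free slot found, done */
            return;
        }

        /* how far is the current occupant from its own ideal bucket? */
        uint32_t    ideal = rh_hash(slots[pos].key) & (TABSIZE - 1);
        uint32_t    occ_dist = (pos - ideal) & (TABSIZE - 1);

        if (occ_dist < dist)
        {
            /* occupant is "richer": displace it and keep probing with it */
            RHSlot      tmp = slots[pos];

            slots[pos] = cur;
            cur = tmp;
            dist = occ_dist;
        }

        pos = (pos + 1) & (TABSIZE - 1);
        dist++;
    }
}

The point of the occ_dist < dist comparison is just to keep probe lengths
short and evenly distributed; deletion is where the tombstone-or-not
question comes in, and that's exactly the kind of detail people would argue
about if the name promised a particular paper's algorithm.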

Anyway, I'm open to better names, but I don't think the name should
drive the implementation.  If the implementation does not fit the name
perfectly, then the name should change rather than the implementation.

Personally, I think we should call it RowleyHash, but I think others
might object. ;-)

David




Re: Added schema level support for publication.

2021-06-21 Thread Ajin Cherian
On Mon, Jun 21, 2021 at 3:16 PM vignesh C  wrote:


> I felt this is ok as we specify the keycount to be 1, so only the
> key[0] will be used. Thoughts?
> ScanKeyInit(&key[0],
> Anum_pg_class_relkind,
> BTEqualStrategyNumber, F_CHAREQ,
> CharGetDatum(RELKIND_PARTITIONED_TABLE));
>
> scan = table_beginscan_catalog(classRel, 1, key);
>

It may be fine; it just doesn't look correct when you look at the function
as a whole.

> > =
> >
> > in UpdatePublicationTypeTupleValue():
> >
> > + tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls,
> > + replaces);
> > +
> > + /* Update the catalog. */
> > + CatalogTupleUpdate(rel, &tup->t_self, tup);
> >
> > Not sure if this tup needs to be freed or if the memory context will
> > take care of it.
>
> I felt this is ok, as the cleanup is handled in the caller function
> "AlterPublication", thoughts?
> /* Cleanup. */
> heap_freetuple(tup);
> table_close(rel, RowExclusiveLock);

that should be fine.

regards,
Ajin Cherian
Fujitsu Australia




Re: Different compression methods for FPI

2021-06-21 Thread Michael Paquier
On Mon, Jun 21, 2021 at 07:19:27PM -0500, Justin Pryzby wrote:
> The two similar, existing messages are:
> 
> +#define NO_LZ4_SUPPORT() \
> +   ereport(ERROR, \
> +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \
> +    errmsg("unsupported LZ4 compression method"), \
> +    errdetail("This functionality requires the server to be built with lz4 support."), \
> +    errhint("You need to rebuild PostgreSQL using --with-lz4.")))
> 
> src/bin/pg_dump/pg_backup_archiver.c:   fatal("cannot restore from compressed archive (compression not supported in this installation)");
> src/bin/pg_dump/pg_backup_archiver.c:   pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
> src/bin/pg_dump/pg_dump.c:  pg_log_warning("requested compression not available in this installation -- archive will be uncompressed");

The difference between the first message and the rest is that the
backend has much more room in terms of error verbosity while
xlogreader.c needs to worry also about the frontend.  In this case, we
need to worry about the block involved and its LSN.  Perhaps you have
a suggestion?
--
Michael




Re: Optionally automatically disable logical replication subscriptions on error

2021-06-21 Thread Peter Smith
Much of the discussion above seems to be related to where to store the
error information and how much information is needed to be useful.

As a summary, the 5 alternatives I have seen mentioned are:

#1. Store some simple message in the pg_subscription ("I wasn't trying
to capture everything, just enough to give the user a reasonable
indication of what went wrong" [Mark-1]). Storing the error message
was also seen as a convenience for writing TAP tests ("I originally
ran into the motivation to write this patch when frustrated that TAP
tests needed to parse the apply worker log file" [Mark-2]). It also
can sometimes provide a simple clue for the error (e.g. PK violation
for table TBL) but still the user will have to look elsewhere for
details to resolve the error. So while this implementation seems good
for simple scenarios, it appears to have been shot down because the
non-trivial scenarios either have insufficient or wrong information in
the error message. Some DETAILS could have been added to give more
information but that would maybe bloat the catalog ("I have not yet
included the DETAIL field because I didn't want to bloat the catalog."
[Mark-3])

#2. Similarly another idea was to use another existing catalog table
pg_subscription_rel. This could have the same problems ("It won't be
sufficient to store information in either pg_subscription_rel or
pg_susbscription." [Amit-1])

#3. There is another suggestion to use the Stats Collector to hold the
error message [Amit-2]. For me, this felt like blurring too much the
distinction between "stats tracking/metrics" and "logs". ERROR logs
must be flushed, whereas for stats (IIUC) there is no guarantee that
everything you need to see would be present. Indeed Amit wrote "But in
this case, if the stats collector missed updating the information, the
user may have to manually update the subscription and let the error
happen again to see it." [Amit-3]. Requesting the user to cause the
same error again just in case it was not captured the first time seems
too strange to me.

#4. The next idea was to have an entirely new catalog for holding the
subscription error information. I feel that storing/duplicating lots
of error information in another table seems like a bridge too far.
What about the risks of storing incorrect or insufficient information?
What is the advantage of duplicating errors over just referring to the
log files for ERROR details?

#5. Document to refer to the logs. All ERROR details are already in
the logs, and this seems to me the intuitive place to look for them.
Searching for specific errors becomes difficult programmatically (is
this really a problem other than complex TAP tests?). But here there
is no risk of missing or insufficient information captured in the log
files ("but still there will be some information related to ERROR
which we wanted the user to see unless we ask them to refer to logs
for that." [Amit-4}).

---

My preferred alternative is #5. ERRORs are logged in the log file, so
there is nothing really for this patch to do in this regard (except
documentation), and there is no risk of missing any information, no
ambiguity of having duplicated errors, and it is the intuitive place
the user would look.

So I felt the current best combination is just this:
a) A tri-state indicating the state of the subscription: e.g.
something like "enabled" ('e')/ "disabled" ('d') / "auto-disabled"
('a') [Amit-5]
b) For "auto-disabled" the PG docs would be updated tell the user to
check the logs to resolve the problem before re-enabling the
subscription
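
For (a), I'm imagining nothing fancier than a single char column, e.g.
(values purely illustrative, not from any existing patch):

/* possible states for a tri-state "subscription enabled" flag */
#define SUBSCRIPTION_ENABLED        'e'
#define SUBSCRIPTION_DISABLED       'd'
#define SUBSCRIPTION_AUTO_DISABLED  'a'    /* disabled after an error */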

//

IMO it is not exactly clear to me what the main goal of this patch is.
Because of this, I feel that you can't really judge whether this new
option is actually useful except in hindsight. It seems like whatever
you implement can be made to look good or bad, just by citing different
test scenarios.

e.g.

* Is the goal mainly to help automated (TAP) testing? In that case,
then maybe you do want to store the error message somewhere other than
the log files. But still I wonder if results would be unpredictable
anyway - e.g if there are multiple tables all with errors then it
depends on the tablesync order of execution which error you see caused
the auto-disable, right? And if it is not predictable maybe it is less
useful.

* Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
at some point in future and then going into a relaunch loop for
days/weeks and causing 1000's of errors without the user noticing? In
that case, this patch seems to be quite useful, but for this goal
maybe you don't want to be checking the tablesync workers at all, but
should only be checking the apply worker like your original v1 patch
did.

* Is the goal just to be a convenient way to disable the subscription
during the CREATE SUBSCRIPTION phase so that the user can make
corrections in peace without the workers re-launching and making more
error logs? Here the patch is helpful, but only for simple 

Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian  wrote:
> There were a lot of interesting ideas in this thread and I want to
> analyze some of them.  First, there is the common assumption (not
> stated) that over-estimating by 5% and underestimating by 5% cause the
> same harm, which is clearly false.  If I go to a restaurant and estimate
> the bill to be 5% higher or 5% lower, assuming I have sufficient funds,
> under or over estimating is probably fine.  If I am driving and
> under-estimate the traction of my tires, I am probably fine, but if I
> over-estimate their traction by 5%, I might crash.

My favorite analogy is the home insurance one:

It might make sense to buy home insurance because losing one's home
(say through fire) is a loss that usually just cannot be tolerated --
you are literally ruined. We can debate how likely it is to happen,
but in the end it's not so unlikely that it can't be ruled out. At the
same time I may be completely unwilling to buy insurance for personal
electronic devices. I can afford to replace all of them if I truly
have to. And the chances of all of them breaking or being stolen on
the same day is remote (unless my home burns down!). If I drop my cell
phone and crack the screen, I'll be annoyed, but it's certainly not
the end of the world.

This behavior will make perfect sense to most people. But it doesn't
scale up or down. I have quite a few electronic devices, but only a
single home, so technically I'm taking risks way more often than I am
playing it safe here. Am I risk tolerant when it comes to insurance?
Conservative?

I myself don't think that it is sensible to apply either term here.
It's easier to just look at the specifics. A home is a pretty
important thing to almost everybody; we can afford to treat it as a
special case.

> If that is accurate, I think the big question is how common are cases
> where the outer side can't be proven to have zero or one row and nested
> loops are enough of a win to risk its greater sensitivity to
> misestimation.  If it is uncommon, seems we could just code the
> optimizer to use hash joins in those cases without a user-visible knob,
> beyond the knob that already turns off nested loop joins.

I think it's possible that Robert's proposal will lead to very
slightly slower plans in the vast majority of cases that are affected,
while still being a very good idea. Why should insurance be 100% free,
though? Maybe it can be in some cases where we get lucky, but why
should that be the starting point? It just has to be very cheap
relative to what we do today for us to come out ahead, certainly, but
that seems quite possible in at least this case.

-- 
Peter Geoghegan




Re: pgbench logging broken by time logic changes

2021-06-21 Thread Michael Paquier
On Sun, Jun 20, 2021 at 10:15:55AM -0400, Andrew Dunstan wrote:
> If this were core server code threatening data integrity I would be
> inclined to be more strict, but after all pg_bench is a utility program,
> and I think we can allow a little more latitude.

+1.  Let's be flexible here.  It looks better to not rush a fix, and
we still have some time ahead.
--
Michael




Re: PXGS vs TAP tests

2021-06-21 Thread Michael Paquier
On Sun, Jun 20, 2021 at 01:24:04PM -0400, Andrew Dunstan wrote:
> Tests pass with the attached patch, which puts the setting in the
> Makefile for the recovery tests. The script itself doesn't need any
> changing.

+REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)
+export REGRESS_SHLIB
It may be better to add a comment here explaining why REGRESS_SHLIB is
required in this Makefile then?

While on it, could we split those commands into multiple lines and
reduce the noise of future diffs?  Something as simple as that would
make those prove commands easier to follow:
+cd $(srcdir) && TESTDIR='$(CURDIR)' \
+   $(with_temp_install) \
+   PGPORT='6$(DEF_PGPORT)' \
+   PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' \
+   REGRESS_SHLIB= '$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)' \
+   $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)

There are other places where this could happen, but the TAP commands
are particularly long.
--
Michael




Re: Different compression methods for FPI

2021-06-21 Thread Justin Pryzby
On Tue, Jun 22, 2021 at 09:11:26AM +0900, Michael Paquier wrote:
> On Sun, Jun 20, 2021 at 11:15:08PM +0500, Andrey Borodin wrote:
> > I have some small questions.
> > 
> > 1.
> > +   report_invalid_record(record, "image at %X/%X compressed with %s not supported, block %d",
> > + (uint32) (record->ReadRecPtr >> 32),
> > + (uint32) record->ReadRecPtr,
> > + "lz4",
> > + block_id);
> > Can we point out to user that the problem is in the build?
> 
> What about the following error then?  Say:
> "image at %X/%X compressed with LZ4 not supported by build, block
> %d".

The two similar, existing messages are:

+#define NO_LZ4_SUPPORT() \
+   ereport(ERROR, \
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \
+    errmsg("unsupported LZ4 compression method"), \
+    errdetail("This functionality requires the server to be built with lz4 support."), \
+    errhint("You need to rebuild PostgreSQL using --with-lz4.")))

src/bin/pg_dump/pg_backup_archiver.c:   fatal("cannot restore from compressed archive (compression not supported in this installation)");
src/bin/pg_dump/pg_backup_archiver.c:   pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
src/bin/pg_dump/pg_dump.c:  pg_log_warning("requested compression not available in this installation -- archive will be uncompressed");

-- 
Justin




Re: Different compression methods for FPI

2021-06-21 Thread Michael Paquier
On Sun, Jun 20, 2021 at 11:15:08PM +0500, Andrey Borodin wrote:
> I have some small questions.
> 
> 1.
> + report_invalid_record(record, "image at %X/%X compressed with %s not supported, block %d",
> +   (uint32) (record->ReadRecPtr >> 32),
> +   (uint32) record->ReadRecPtr,
> +   "lz4",
> +   block_id);
> Can we point out to user that the problem is in the build?

What about the following error then?  Say:
"image at %X/%X compressed with LZ4 not supported by build, block
%d".

> Also, maybe %s can be inlined to lz4 in this case.

I think that's a remnant of the zstd part of the patch set, where I
wanted to have only one translatable message.  Sure, I can align lz4
with the message.

> 2.
> > const char *method = "???";
> Maybe we can use something like "unknown" for unknown compression
> methods? Or is it too long string for waldump output?

A couple of extra bytes for pg_waldump will not matter much.  Using
"unknown" is fine by me.

> 3. Can we exclude lz4 from config if it's not supported by the build?

Enforcing the absence of this value at GUC level is enough IMO:
+static const struct config_enum_entry wal_compression_options[] = {
+   {"pglz", WAL_COMPRESSION_PGLZ, false},
+#ifdef USE_LZ4
+   {"lz4", WAL_COMPRESSION_LZ4, false},
+#endif
[...]
+typedef enum WalCompression
+{
+   WAL_COMPRESSION_NONE = 0,
+   WAL_COMPRESSION_PGLZ,
+   WAL_COMPRESSION_LZ4
+} WalCompression;

It is not possible to set the GUC, still it is listed in the enum that
allows us to track it.  That's the same thing as
default_toast_compression with its ToastCompressionId and its
default_toast_compression_options.
--
Michael




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> The convention seems to be that it is located in the top-level
> directory. ISTM that we should follow that convention, since following
> the convention is good, and does not in itself force anybody to ignore
> any of the listed commits. Any thoughts on that?

Agreed.  I think I'd previously suggested something under src/tools,
but we might as well do like others are doing; especially since
we have .gitattributes and the like there already.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Bruce Momjian
On Mon, Jun 21, 2021 at 12:52:39PM -0400, Tom Lane wrote:
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them.  This sounds less like a sound
> argument to me than like a witch trial.
>
> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others.  (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.)  We already understand about
> provably-empty rels, so this is just generalizing that idea a little.
> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.
> 
> Another thought that occurs to me here is that maybe we could get rid of
> the enable_material knob in favor of forcing (or at least encouraging)
> materialization when the outer side isn't provably one row.

There were a lot of interesting ideas in this thread and I want to
analyze some of them.  First, there is the common assumption (not
stated) that over-estimating by 5% and underestimating by 5% cause the
same harm, which is clearly false.  If I go to a restaurant and estimate
the bill to be 5% higher or 5% lower, assuming I have sufficient funds,
under or over estimating is probably fine.  If I am driving and
under-estimate the traction of my tires, I am probably fine, but if I
over-estimate their traction by 5%, I might crash.

Closer to home, Postgres is more tolerant of memory usage
under-estimation than over-estimation:

https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018

What I hear Robert saying is that unparameterized nested loops are much
more sensitive to misestimation than hash joins, and only slightly
faster than hash joins when they use accurate row counts, so we might
want to avoid them by default.  Tom is saying that if we know the outer
side will have zero or one row, we should still do unparameterized
nested loops because those are not more sensitive to misestimation than
hash joins, and slightly faster.

If that is accurate, I think the big question is how common are cases
where the outer side can't be proven to have zero or one row and nested
loops are enough of a win to risk its greater sensitivity to
misestimation.  If it is uncommon, seems we could just code the
optimizer to use hash joins in those cases without a user-visible knob,
beyond the knob that already turns off nested loop joins.

Peter's comment about having nodes in the executor that adjust to the
row counts it finds is interesting, and eventually might be necessary
once we are sure we have gone as far as we can without it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 8:34 AM Tom Lane  wrote:
> It's possible that some of these touch few enough lines that they
> are not worth listing; I did not check the commit delta sizes.

Commits that touch very few lines weren't included originally, just
because it didn't seem necessary. Even still, I've looked through the
extra commits now, and everything that you picked out looks in scope.
I'm just going to include these extra commits.

Attached is a new version of the same file, based on your feedback
(with those extra commits, and some commits from the last version
removed). I'll produce a conventional patch file in the next revision,
most likely.

> Meh.  I can get on board with the idea of adding commit+revert pairs
> to this list, but I'd like a more principled selection filter than
> "which ones bother Peter".  Maybe the size of the reverted patch
> should factor into it

I have to admit that you're right. That was why I picked those two
out. Of course I can defend this choice in detail, but in the interest
of not setting a terrible precedent I won't do that. The commits in
question have been removed from this revised version.

I think it's important that we not get into the business of adding
stuff to this willy-nilly. Inevitably everybody will have their own
pet peeve noisy commit, and will want to add to the list -- just like
I did. Naturally nobody will be interested in arguing against
including whatever individual pet peeve commit each time this comes
up. Regardless of the merits of the case. Let's just not go there
(unless perhaps it's truly awful for almost everybody).

> Do we have an idea of how much adding entries to this list slows
> down "git blame"?  If we include commit+revert pairs more than
> very sparingly, I'm afraid we'll soon have an enormous list, and
> I wonder what that will cost us.

I doubt it costs us much, at least in any way that has a very
noticeable relationship as new commits are added. I've now settled on
68 commits, and expect that this won't need to grow very quickly, so
that seems fine. From my point of view it makes "git blame" far more
useful.

LLVM uses a file with fewer entries, and has had such a file since last year:

https://github.com/llvm/llvm-project/blob/main/.git-blame-ignore-revs

The list of commit hashes in the file that the Blender project uses is
about the same size:

https://github.com/blender/blender/blob/master/.git-blame-ignore-revs

We're using more commits than either project here, but it's within an
order of magnitude. Note that this is going to be opt-in, not opt-out.
It won't do anything unless an individual hacker decides to enable it
locally.
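
(For anyone who wants to opt in, I believe it's just a matter of pointing
git at the file, e.g.:

git config blame.ignoreRevsFile .git-blame-ignore-revs

or passing --ignore-revs-file to individual "git blame" runs; both need
Git 2.23 or later.)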

The convention seems to be that it is located in the top-level
directory. ISTM that we should follow that convention, since following
the convention is good, and does not in itself force anybody to ignore
any of the listed commits. Any thoughts on that?

-- 
Peter Geoghegan


git-blame-ignore-revs
Description: Binary data


Re: Doc chapter for Hash Indexes

2021-06-21 Thread Justin Pryzby
On Mon, Jun 21, 2021 at 02:08:12PM +0100, Simon Riggs wrote:
> New chapter for Hash Indexes, designed to help users understand how
> they work and when to use them.
> 
> Mostly newly written, but a few paras lifted from README when they were 
> helpful.

+ 
+  PostgreSQL includes an implementation of persistent on-disk hash indexes,
+  which are now fully crash recoverable. Any data type can be indexed by a

I don't see any need to mention that they're "now" crash safe.

+  Each hash index tuple stores just the 4-byte hash value, not the actual
+  column value. As a result, hash indexes may be much smaller than B-trees
+  when indexing longer data items such as UUIDs, URLs etc.. The absence of

comma:
URLs, etc.

+  the column value also makes all hash index scans lossy. Hash indexes may
+  take part in bitmap index scans and backward scans.

Isn't it more correct to say that it must use a bitmap scan?

+  through the tree until the leaf page is found. In tables with millions
+  of rows this descent can increase access time to data. The equivalent

rows comma

+  that hash value. When scanning a hash bucket during queries we need to

queries comma

+ 
+  As a result of the overflow cases, we can say that hash indexes are
+  most suitable for unique, nearly unique data or data with a low number
+  of rows per hash bucket will be suitable for hash indexes. One

The beginning and end of the sentence duplicate "suitable".

+  Each row in the table indexed is represented by a single index tuple in
+  the hash index. Hash index tuples are stored in the bucket pages, and if
+  they exist, the overflow pages. 

"the overflow pages" didn't sound right, but I was confused by the comma.  
I think it should say ".. in bucket pages and overflow pages, if any."

-- 
Justin




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 1:52 PM Tom Lane  wrote:
> You're ignoring the fact that the plan shape we generate now is in fact
> *optimal*, and easily proven to be so, in some very common cases.

As I've said I don't reject the idea that there is room for
disagreement on the specifics. For example perhaps it'll turn out that
only a restricted subset of the cases that Robert originally had in
mind will truly turn out to work as well as hoped. But that just seems
like a case of Robert refining a very preliminary proposal. I
absolutely expect there to be some need to iron out the wrinkles.

> I don't
> think the people whose use-cases get worse are going to be mollified by
> the argument that you reduced their risk, when there is provably no risk.

By definition what we're doing here is throwing away slightly cheaper
plans when the potential downside is much higher than the potential
upside of choosing a reasonable alternative. I don't think that the
downside is particularly likely. In fact I believe that it's fairly
unlikely in general. This is an imperfect trade-off, at least in
theory. I fully own that.

> I'm willing to take some flak if there's not an easy proof that the outer
> scan is single-row, but I don't think we should just up and change it
> for cases where there is.

Seems reasonable to me.

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Jun 21, 2021 at 10:14 AM Robert Haas  wrote:
>> The other thing is - the risk of a particular path doesn't matter in
>> an absolute sense, only a relative one. In the particular case I'm on
>> about here, we *know* there's a less-risky alternative.

> Exactly! This, a thousand times.

This is a striking oversimplification.

You're ignoring the fact that the plan shape we generate now is in fact
*optimal*, and easily proven to be so, in some very common cases.  I don't
think the people whose use-cases get worse are going to be mollified by
the argument that you reduced their risk, when there is provably no risk.
Obviously the people whose use-cases are currently hitting the wrong end
of the risk will be happy with any change whatever, but those may not be
the same people.

I'm willing to take some flak if there's not an easy proof that the outer
scan is single-row, but I don't think we should just up and change it
for cases where there is.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 10:14 AM Robert Haas  wrote:
> Hmm, maybe I need to see an example of the sort of plan shape that you
> have in mind. To me it feels like a comparison on a unique key ought
> to use a *parameterized* nested loop. And it's also not clear to me
> why a nested loop is actually better in a case like this. If the
> nested loop iterates more than once because there are more rows on the
> outer side, then you don't want to have something on the inner side
> that might be expensive, and either an aggregate or an unparameterized
> search for a unique value are potentially quite expensive. Now if you
> put a materialize node on top of the inner side, then you don't have
> to worry about that problem, but how much are you saving at that point
> vs. just doing a hash join?

I suspected that that was true, but even that doesn't seem like the
really important thing. While it may be true that the simple heuristic
can't be quite as simple as we'd hoped at first, ISTM that this is
ultimately not much of a problem. The basic fact remains: some more or
less simple heuristic makes perfect sense, and should be adopted.

This conclusion is counterintuitive because it's addressing a very
complicated problem with a very simple solution. However, if we lived
in a world where things that sound too good to be true always turned
out to be false, we'd also live in a world where optimizers were
completely impractical and useless. Optimizers have that quality
already, whether or not we officially acknowledge it.

> I don't understand how to generate a risk assessment or what we ought
> to do with it if we had one. I don't even understand what units we
> would use. We measure costs using abstract cost units, but those
> abstract cost units are intended to be a proxy for runtime. If it's
> not the case that a plan that runs for longer has a higher cost, then
> something's wrong with the costing model or the settings. In the case
> of risk, the whole thing seems totally subjective. We're talking about
> the risk that our estimate is bogus, but how do we estimate the risk
> that we don't know how to estimate?

Clearly we need a risk estimate for our risk estimate!

> The other thing is - the risk of a particular path doesn't matter in
> an absolute sense, only a relative one. In the particular case I'm on
> about here, we *know* there's a less-risky alternative.

Exactly! This, a thousand times.

This reminds me of how people behave in the real world. In the real
world people deal with this without too much difficulty. Everything is
situational and based on immediate trade-offs, with plenty of
uncertainty at every step. For example, if you think that there is
even a tiny chance of a piece of fruit being poisonous, you don't eat
the piece of fruit -- better to wait until lunchtime. This is one of
the *easiest* decisions I can think of, despite the uncertainty.
(Except perhaps if you happen to be in danger of dying of starvation,
in which case it might be a very different story. And so on.)

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 21, 2021 at 1:38 PM Tom Lane  wrote:
>> NestLoop Join
>>   Join Filter: t1.x op t2.y
>>   -> Index Scan on t1_pkey
>>        Index Cond: t1.id = constant
>>   -> Seq Scan on t2

> Hmm, yeah, I guess that's possible. How much do you think this loses
> as compared with:

> Hash Join
> Hash Cond: t1.x op t2.y
> -> Seq Scan on t2
> -> Hash
>   -> Index Scan on t1_pkey

Hard to say.  The hash overhead might or might not pay for itself.
If the equality operator proper is expensive and we get to avoid
applying it to most t2 rows, then this might be an OK alternative;
otherwise not so much.

In any case, the former looks like plans that we generate now,
the second not.  Do you really want to field a lot of questions
about why we suddenly changed to a not-clearly-superior plan?

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 1:38 PM Tom Lane  wrote:
> > Hmm, maybe I need to see an example of the sort of plan shape that you
> > have in mind. To me it feels like a comparison on a unique key ought
> > to use a *parameterized* nested loop.
>
> The unique-key comparison would be involved in the outer scan in
> the cases I'm thinking of.  As an example,
>
> select * from t1, t2 where t1.id = constant and t1.x op t2.y;
>
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>
> NestLoop Join
>   Join Filter: t1.x op t2.y
>   -> Index Scan on t1_pkey
>        Index Cond: t1.id = constant
>   -> Seq Scan on t2
>
> and if we can detect that the pkey indexscan produces just one row,
> this is very possibly the best available plan.

Hmm, yeah, I guess that's possible. How much do you think this loses
as compared with:

Hash Join
Hash Cond: t1.x op t2.y
-> Seq Scan on t2
-> Hash
  -> Index Scan on t1_pkey

(If the operator is not hashable then this plan is impractical, but in
such a case the question of preferring the hash join over the nested
loop doesn't arise anyway.)

> BTW, it strikes me that there might be an additional consideration
> here: did parameterization actually help anything?  That is, the
> proposed rule wants to reject the above but allow
>
> NestLoop Join
>   -> Index Scan on t1_pkey
>        Index Cond: t1.id = constant
>   -> Seq Scan on t2
>        Filter: t1.x op t2.y
>
> even though the latter isn't meaningfully better.  It's possible
> this won't arise because we don't consider parameterized paths
> except where the parameter is used in an indexqual or the like,
> but I'm not confident of that.  See in particular reparameterize_path
> and friends before you assert there's no such issue.  So we might
> need to distinguish essential from incidental parameterization,
> or something like that.

Hmm, perhaps. I think it won't happen in the normal cases, but I can't
completely rule out the possibility that there are corner cases where
it does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
I wrote:
> ... As an example,
>   select * from t1, t2 where t1.id = constant and t1.x op t2.y;
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>   NestLoop Join
> Join Filter: t1.x op t2.y
> -> Index Scan on t1_pkey
>  Index Cond: t1.id = constant
> -> Seq Scan on t2

Also, to enlarge on that example: if "op" isn't hashable then the
original argument is moot.  However, it'd still be useful to know
if the outer scan is sure to return no more than one row, as that
could inform the choice whether to plaster a Materialize node on
the inner scan.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 21, 2021 at 11:55 AM Tom Lane  wrote:
>> There are certainly cases where the optimizer can prove (in principle;
>> it doesn't do so today) that a plan node will produce at most one row.
>> They're hardly uncommon either: an equality comparison on a unique
>> key, or a subquery with a simple aggregate function, come to mind.

> Hmm, maybe I need to see an example of the sort of plan shape that you
> have in mind. To me it feels like a comparison on a unique key ought
> to use a *parameterized* nested loop.

The unique-key comparison would be involved in the outer scan in
the cases I'm thinking of.  As an example,

select * from t1, t2 where t1.id = constant and t1.x op t2.y;

where I'm not assuming much about the properties of "op".
This could be amenable to a plan like

NestLoop Join
  Join Filter: t1.x op t2.y
  -> Index Scan on t1_pkey
   Index Cond: t1.id = constant
  -> Seq Scan on t2

and if we can detect that the pkey indexscan produces just one row,
this is very possibly the best available plan.  Nor do I think this
is an unusual situation that we can just ignore.

BTW, it strikes me that there might be an additional consideration
here: did parameterization actually help anything?  That is, the
proposed rule wants to reject the above but allow

NestLoop Join
  -> Index Scan on t1_pkey
   Index Cond: t1.id = constant
  -> Seq Scan on t2
   Filter: t1.x op t2.y

even though the latter isn't meaningfully better.  It's possible
this won't arise because we don't consider parameterized paths
except where the parameter is used in an indexqual or the like,
but I'm not confident of that.  See in particular reparameterize_path
and friends before you assert there's no such issue.  So we might
need to distinguish essential from incidental parameterization,
or something like that.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 1:11 PM Peter Geoghegan  wrote:
> On Mon, Jun 21, 2021 at 9:52 AM Tom Lane  wrote:
> > > I'm not so sure that it is. The point isn't the risk, even if it could
> > > be calculated. The point is the downsides of being wrong are huge and
> > > pretty much unbounded, whereas the benefits of being right are tiny
> > > and bounded. It almost doesn't matter what the underlying
> > > probabilities are.
> >
> > You're throwing around a lot of pejorative adjectives there without
> > having bothered to quantify any of them.  This sounds less like a sound
> > argument to me than like a witch trial.
>
> I'm not sure what you mean by pejorative. Isn't what I said about
> downsides and upsides pretty accurate?

Yeah, I don't see why Peter's characterization deserves to be labelled
as pejorative here. A hash join or merge join or parameterized nested
loop can turn out to be slower than some other algorithm, but all of
those approaches have some component that tends to make the asymptotic
cost less than the product of the sizes of the inputs. I don't think
that's true in absolutely every case; for example, if a merge join has
every row duplicated on both sides, it will have to scan every inner
tuple once per outer tuple, just like a nested loop, and the other
algorithms also are going to degrade toward O(NM) performance in the
face of many duplicates. Also, a hash join can be pretty close to that
if it needs a shazillion batches. But in normal cases, any algorithm
other than an unparameterized nested loop tends to read each input
tuple on each side ONCE, so the cost is more like the sum of the input
sizes than the product. And there's nothing pejorative in saying that
N + M can be less than N * M by an unbounded amount. That's just the
facts.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Out-of-bounds (src/backend/utils/misc/queryjumble.c)

2021-06-21 Thread Tom Lane
Ranier Vilela  writes:
> Per Coverity.
> 3 out-of-bounds at function AppendJumble.

> They have the face, smell and color of typo.
> And we usually increment the character count after a memcpy.

> Coverity no longer complained after the patch.

> Thoughts?

This patch is incorrect on its face, as you would know if you'd
spent even a couple minutes absorbing the comment in that function.

I wonder about Coverity here ... independently of whether the
hash-accumulation logic does what we want, it looks to me like
the proposed change doesn't so much remove a buffer overrun as
create one.  It would break the property jumble_len < JUMBLE_SIZE
that the subsequent lines rely on.
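
To spell out the pattern being discussed, here's a simplified, generic
sketch of an accumulate-then-fold hash buffer (not the actual
AppendJumble() code; the names are invented):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 1024

typedef struct Accumulator
{
    uint8_t     buf[BUF_SIZE];
    size_t      buf_len;        /* may reach BUF_SIZE; folded on next append */
    uint64_t    running_hash;
} Accumulator;

/* stand-in for a real hash function such as hash_any_extended() */
static uint64_t
fold_hash(uint64_t seed, const uint8_t *data, size_t len)
{
    uint64_t    h = seed ^ 0xcbf29ce484222325ULL;

    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * 0x100000001b3ULL;
    return h;
}

static void
append_bytes(Accumulator *acc, const uint8_t *item, size_t size)
{
    while (size > 0)
    {
        size_t      part;

        if (acc->buf_len >= BUF_SIZE)
        {
            /* buffer full: fold it into the running hash and reset */
            acc->running_hash = fold_hash(acc->running_hash,
                                          acc->buf, acc->buf_len);
            acc->buf_len = 0;
        }

        /* here buf_len < BUF_SIZE, so this copy stays in bounds */
        part = size < BUF_SIZE - acc->buf_len ? size : BUF_SIZE - acc->buf_len;
        memcpy(acc->buf + acc->buf_len, item, part);
        acc->buf_len += part;
        item += part;
        size -= part;
    }
}

Any reordering that lets buf_len reach BUF_SIZE before the memcpy turns a
correct loop like this into an overrun, which is the kind of invariant
referred to above.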

Please stop sending us random patches and expecting us to sort
out which ones are valid.  You're rapidly approaching the status
of "boy who cried wolf too many times".

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 11:55 AM Tom Lane  wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

Hmm, maybe I need to see an example of the sort of plan shape that you
have in mind. To me it feels like a comparison on a unique key ought
to use a *parameterized* nested loop. And it's also not clear to me
why a nested loop is actually better in a case like this. If the
nested loop iterates more than once because there are more rows on the
outer side, then you don't want to have something on the inner side
that might be expensive, and either an aggregate or an unparameterized
search for a unique value are potentially quite expensive. Now if you
put a materialize node on top of the inner side, then you don't have
to worry about that problem, but how much are you saving at that point
vs. just doing a hash join?

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I don't understand how to generate a risk assessment or what we ought
to do with it if we had one. I don't even understand what units we
would use. We measure costs using abstract cost units, but those
abstract cost units are intended to be a proxy for runtime. If it's
not the case that a plan that runs for longer has a higher cost, then
something's wrong with the costing model or the settings. In the case
of risk, the whole thing seems totally subjective. We're talking about
the risk that our estimate is bogus, but how do we estimate the risk
that we don't know how to estimate? Given quals (x + 0) = x, x = some
MCV, and x = some non-MCV, we can say that we're most likely to be
wrong about the first one and least likely to be wrong about the
second one, but how much more likely? I don't know how you can decide
that, even in principle. We can also say that an unparameterized
nested loop is more risky than some other plan because it could turn
out to be crazy expensive, but is that more or less risky than
scanning the index on b as a way to implement SELECT * FROM foo WHERE
a = 1 ORDER BY b LIMIT 1? How much more risky, and why?

And then, even supposing we had a risk metric for every path, what
exactly would we do with it? Suppose path A is cheaper than path B,
but also more risky. Which should we keep? We could keep both, but
that seems to be just kicking the can down the road. If plan B is
likely to blow up in our face, we should probably just get rid of it,
or not even generate it in the first place. Even if we do keep both,
at some point we're going to have to make a cost-vs-risk tradeoff, and
I don't see how to do that intelligently, because the point is
precisely that if the risk is high, the cost number might be totally
wrong. If we know that plan A is cheaper than plan B, we should choose
plan A. But if all we know is that plan A would be cheaper than plan B
if our estimate of the cost were correct, but also that it probably
isn't, then what we actually know is nothing. We have no principled
basis for deciding anything based on cost unless we're reasonably
confident that the cost estimate is pretty good. So AFAICT the only
principled strategy here is to throw away high risk paths as early as
we possibly can. What am I missing?

The other thing is - the risk of a particular path doesn't matter in
an absolute sense, only a relative one. In the particular case I'm on
about here, we *know* there's a less-risky alternative. We don't need
to quantify the risk to know which of the two options has more. In
many other cases, the risk is irreducible e.g. a default estimate
could be totally bogus, but switching paths is of no help in getting
rid of it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 9:52 AM Tom Lane  wrote:
> > I'm not so sure that it is. The point isn't the risk, even if it could
> > be calculated. The point is the downsides of being wrong are huge and
> > pretty much unbounded, whereas the benefits of being right are tiny
> > and bounded. It almost doesn't matter what the underlying
> > probabilities are.
>
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them.  This sounds less like a sound
> argument to me than like a witch trial.

I'm not sure what you mean by pejorative. Isn't what I said about
downsides and upsides pretty accurate?

> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others.  (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.)  We already understand about
> provably-empty rels, so this is just generalizing that idea a little.

It sounds like you're concerned about properly distinguishing between:

1. Cases where the only non-reckless choice is a hash join over an
unparameterized nested loop join

2. Cases that look like that at first, that don't really have that
quality on closer examination.

This seems like a reasonable concern.

> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.

I thought that you were arguing against the premise itself. It's now
clear that you weren't, though.

I don't have an opinion for or against bringing the provably-empty
rels stuff into the picture. At least not right now.

-- 
Peter Geoghegan




Out-of-bounds (src/backend/utils/misc/queryjumble.c)

2021-06-21 Thread Ranier Vilela
Hi,

Per Coverity.
3 out-of-bounds at function AppendJumble.

They have the face, smell and color of typo.
And we usually increment the character count after a memcpy.

Coverity no longer complained after the patch.

Thoughts?

regards,
Ranier Vilela


fix_out_of_bounds_queryjumble.patch
Description: Binary data


Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Tomas Vondra  writes:
> I've been thinking about this a bit recently and searching for papers 
> talking about this, but it's not clear to me how to calculate the 
> risk (and propagate it through the plan) without making the whole cost 
> evaluation way more complicated / expensive :-(

Yeah, a truly complete approach using confidence intervals or the
like seems frighteningly complicated.

> But maybe you have thought about some much simpler approach, addressing 
> this sufficiently well?

See my nearby response to Peter.  The main case that's distressing me
is the possibility of forcing a hash join even when the outer side
is obviously only one row.  If we could avoid that, at least for
large values of "obvious", I'd be far more comfortable.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Jun 21, 2021 at 8:55 AM Tom Lane  wrote:
>> I'd be a lot happier if this proposal were couched around some sort
>> of estimate of the risk of the outer side producing more than the
>> expected number of rows.  The arguments so far seem like fairly lame
>> rationalizations for not putting forth the effort to do that.

> I'm not so sure that it is. The point isn't the risk, even if it could
> be calculated. The point is the downsides of being wrong are huge and
> pretty much unbounded, whereas the benefits of being right are tiny
> and bounded. It almost doesn't matter what the underlying
> probabilities are.

You're throwing around a lot of pejorative adjectives there without
having bothered to quantify any of them.  This sounds less like a sound
argument to me than like a witch trial.

Reflecting for a bit on the ancient principle that "the only good numbers
in computer science are 0, 1, and N", it seems to me that we could make
an effort to classify RelOptInfos as provably empty, provably at most one
row, and others.  (This would be a property of relations, not individual
paths, so it needn't bloat struct Path.)  We already understand about
provably-empty rels, so this is just generalizing that idea a little.
Once we know about that, at least for the really-common cases like unique
keys, I'd be okay with a hard rule that we don't consider unparameterized
nestloop joins with an outer side that falls in the "N" category.
Unless there's no alternative, of course.
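
Concretely, I'm only imagining a three-way classification attached to the
rel, something like (names purely illustrative, not existing planner code):

typedef enum RelRowsBound
{
    REL_ROWS_PROVABLY_EMPTY,    /* provably returns no rows */
    REL_ROWS_AT_MOST_ONE,       /* provably returns at most one row */
    REL_ROWS_MANY               /* "N": anything we can't bound */
} RelRowsBound;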

Another thought that occurs to me here is that maybe we could get rid of
the enable_material knob in favor of forcing (or at least encouraging)
materialization when the outer side isn't provably one row.

regards, tom lane




Re: Discarding DISCARD ALL

2021-06-21 Thread Simon Riggs
On Wed, Jan 20, 2021 at 3:53 PM James Coleman  wrote:
>
> On Wed, Jan 20, 2021 at 9:58 AM Simon Riggs  wrote:
> >
> > On Wed, 20 Jan 2021 at 14:21, James Coleman  wrote:
> >
> > > An alternative approach that occurred to me while typing this reply: a
> > > setting in Postgres that would disallow setting session level GUCs
> > > (i.e., enforce `SET LOCAL` transaction level usage instead) would
> > > remove a large chunk of our need to set server_reset_query_always=1
> > > (and more interestingly it'd highlight when broken code gets pushed).
> > > But even with that, I see some value in the proposed setting since
> > > there is additional session state beyond GUCs.
> >
> > With transaction_cleanup=on we could force all SETs to be SET LOCAL.
> >
> > The point is that if we declare ahead of time that the transaction
> > will be reset then we can act differently and more easily for various
> > circumstances, for SETs, for Temp tables and others.
>
> Right, I agree it's independently useful. My "alternative" is a subset
> of that functionality and doesn't cover as many cases.

So if we go for that option, what would we call it?

session_state = 'session' (default) | 'local_set'

If you use 'local' then you find that all state is transaction only
* SET defaults to meaning SET LOCAL
* SET SESSION returns ERROR

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add version macro to libpq-fe.h

2021-06-21 Thread Andrew Dunstan


On 6/21/21 12:34 PM, Tom Lane wrote:
> Daniel Gustafsson  writes:
>> On 21 Jun 2021, at 17:27, Robert Haas  wrote:
>>> What will prevent us from forgetting to do something about this again,
>>> a year from now?
>> An entry in a release checklist could perhaps be an idea?
> Yeah, I was wondering if adding an entry to RELEASE_CHANGES would be
> helpful.  Again, I'm not sure that this coding rule is much more
> likely to be violated than any other.  On the other hand, the fact
> that it's not critical until we approach release does suggest that
> maybe it'd be useful to treat it as a checklist item.
>
>   


Maybe for release note preparation, since that's focused on new
features, but this doesn't sound like a release prep function to me.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tomas Vondra




On 6/21/21 5:55 PM, Tom Lane wrote:
> Peter Geoghegan  writes:
>> The heuristic that has the optimizer flat out avoids an
>> unparameterized nested loop join is justified by the belief that
>> that's fundamentally reckless. Even though we all agree on that much,
>> I don't know when it stops being reckless and starts being "too risky
>> for me, but not fundamentally reckless". I think that that's worth
>> living with, but it isn't very satisfying.
>
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.
>
> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join.  So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Agreed.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I agree having such a measure would be helpful, but do you have an idea
how it could be implemented?


I've been thinking about this a bit recently and searching for papers 
talking about this, but it's not clear to me how to calculate the 
risk (and propagate it through the plan) without making the whole cost 
evaluation way more complicated / expensive :-(


The "traditional approach" to quantifying risk would be confidence 
intervals, i.e. for each cardinality estimate "e" we'd determine a range 
[a,b] so that P(a <= e <= b) = p. So we could pick "p" depending on how 
"robust" the plan choice should be (say 0.9 for "risky" and 0.999 for 
"safe" plans) and calculate a,b. Or maybe we could calculate where the 
plan changes, and then we'd see if those "break points" are within the 
confidence interval. If not, great - we can assume the plan is stable, 
otherwise we'd need to consider the other plans too, somehow.


But what I'm not sure about is:

1) Now we're dealing with three cardinality estimates (the original "e" 
and the boundaries "a, "b"). So which one do we use to calculate cost 
and pass to upper parts of the plan?


2) The outer relation may be a complex join, so we'd need to combine the 
confidence intervals for the two input relations, somehow.


3) We'd need to know how to calculate the confidence intervals for most 
plan nodes, which I'm not sure we know how to do. So it's not clear to 
me how to do this, which seems rather problematic because we need to 
propagate and combine those confidence intervals through the plan.



But maybe you have thought about some much simpler approach, addressing 
this sufficiently well?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add version macro to libpq-fe.h

2021-06-21 Thread Tom Lane
Daniel Gustafsson  writes:
> On 21 Jun 2021, at 17:27, Robert Haas  wrote:
>> What will prevent us from forgetting to do something about this again,
>> a year from now?

> An entry in a release checklist could perhaps be an idea?

Yeah, I was wondering if adding an entry to RELEASE_CHANGES would be
helpful.  Again, I'm not sure that this coding rule is much more
likely to be violated than any other.  On the other hand, the fact
that it's not critical until we approach release does suggest that
maybe it'd be useful to treat it as a checklist item.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 8:55 AM Tom Lane  wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

That sounds like it might be useful in general.

> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join.  So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Let's assume for the sake of argument that we really have to have that
additional infrastructure to move forward with the idea. (I'm not sure
if it's possible in principle to use infrastructure like that for some
of the cases that Robert has in mind, but for now I'll assume that it
is both possible and a practical necessity.)

Even when I make this working assumption I don't see what it changes
at a fundamental level. You've merely come up with a slightly more
specific definition of the class of plans that are "reckless". You've
only refined the original provisional definition of "reckless" to
exclude specific "clearly not reckless" cases (I think). But the
definition of "reckless" is no less squishy than what we started out
with.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I'm not so sure that it is. The point isn't the risk, even if it could
be calculated. The point is the downsides of being wrong are huge and
pretty much unbounded, whereas the benefits of being right are tiny
and bounded. It almost doesn't matter what the underlying
probabilities are.

To be clear I'm not arguing against modelling risk. I'm just not sure
that it's useful to think of this problem as truly a problem of risk.

-- 
Peter Geoghegan




Re: Detecting File Damage & Inconsistencies

2021-06-21 Thread Simon Riggs
On Thu, Mar 18, 2021 at 6:20 AM Craig Ringer
 wrote:
>
> On Mon, 15 Mar 2021 at 21:01, David Steele  wrote:
>>
>> On 11/18/20 5:23 AM, Simon Riggs wrote:
>> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer
>> >  wrote:
>> >>
>> >> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs  wrote:
>> >>>
>> >>>
>> >>> What I'm proposing is an option to add 16 bytes onto each COMMIT
>> >>> record
>> >>
>> >>
>> >> Would it make sense to write this at the time we write a topxid 
>> >> assignment to WAL instead?
>> >>
>> >> Otherwise it won't be accessible to streaming-mode logical decoding.
>> >
>> > Do you mean extend the xl_xact_assignment record? My understanding is
>> > that is not sent in all cases, so not sure what you mean by "instead".
>>
>> Craig, can you clarify?
>
>
> Right. Or write a separate WAL record when the feature is enabled. But it's 
> probably sufficient to write it as an optional chunk on xl_xact_assignment 
> records. We often defer writing them so we can optimise away xacts that never 
> actually wrote anything, but IIRC we still write one before we write any WAL 
> that references the xid. That'd be fine, since we don't need the info any 
> sooner than that during decoding. I'd have to double check that we write it 
> in all cases and won't get to that too soon, but I'm pretty sure we do...

The commit record is optimized away if no xid is assigned, though is
still present if we didn't write any WAL records.

But if a commit record exists in the WAL stream, we want to know where
it came from.

A later patch will add PITR capability based on this information so
attaching it directly to the commit record is fairly important, IMHO.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add version macro to libpq-fe.h

2021-06-21 Thread Daniel Gustafsson
> On 21 Jun 2021, at 17:27, Robert Haas  wrote:
> 
> On Sat, Jun 19, 2021 at 11:45 AM Tom Lane  wrote:
>> Hearing no further comments, done that way.
> 
> What will prevent us from forgetting to do something about this again,
> a year from now?

An entry in a release checklist could perhaps be an idea?

--
Daniel Gustafsson   https://vmware.com/





Re: disfavoring unparameterized nested loops

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> The heuristic that has the optimizer flat out avoid an
> unparameterized nested loop join is justified by the belief that
> that's fundamentally reckless. Even though we all agree on that much,
> I don't know when it stops being reckless and starts being "too risky
> for me, but not fundamentally reckless". I think that that's worth
> living with, but it isn't very satisfying.

There are certainly cases where the optimizer can prove (in principle;
it doesn't do so today) that a plan node will produce at most one row.
They're hardly uncommon either: an equality comparison on a unique
key, or a subquery with a simple aggregate function, come to mind.
 
In such cases, not only is this choice not reckless, but it's provably
superior to a hash join.  So in the end this gets back to the planning
risk factor that we keep circling around but nobody quite wants to
tackle.

I'd be a lot happier if this proposal were couched around some sort
of estimate of the risk of the outer side producing more than the
expected number of rows.  The arguments so far seem like fairly lame
rationalizations for not putting forth the effort to do that.

regards, tom lane




Re: Use simplehash.h instead of dynahash in SMgr

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 21, 2021 at 10:15 AM David Rowley  wrote:
>> I've come up with a new hash table implementation that I've called
>> generichash.

> At the risk of kibitzing the least-important detail of this proposal,
> I'm not very happy with the names of our hash implementations.

I kind of wonder if we really need four different hash table
implementations (this being the third "generic" one, plus hash join
has its own, and I may have forgotten others).  Should we instead
think about revising simplehash to gain the benefits of this patch?

regards, tom lane




Re: Add version macro to libpq-fe.h

2021-06-21 Thread Tom Lane
Robert Haas  writes:
> What will prevent us from forgetting to do something about this again,
> a year from now?

As long as we notice it before 15.0, we can fix it retroactively,
as we just did for 14.  For that matter, fixing before 15.1 or
so would likely be Good Enough.

But realistically, how is this any worse of a problem than a hundred
other easily-forgotten coding rules we have?  We manage to uphold
most of them most of the time.

regards, tom lane




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Peter Geoghegan
On Mon, Jun 21, 2021 at 7:45 AM Robert Haas  wrote:
> On Mon, Jun 21, 2021 at 6:41 AM David Rowley  wrote:
> > For example, in an unparameterized Nested Loop that estimates the
> > outer Path to have 1 row will cost an entire additional inner cost if
> > there are 2 rows.  With Hash Join the cost is just an additional
> > hashtable lookup, which is dead cheap.   I don't know exactly how
> > add_path() would weigh all that up, but it seems to me that I wouldn't
> > take the risk unless I was 100% certain that the Nested Loop's outer
> > Path would only return 1 row exactly, if there was any chance at all
> > it could return more, I'd be picking some other join method.
>
> It seems like everyone agrees that it would be good to do something
> about this problem, but the question is whether it's best to do
> something that tries to be general, or whether we should instead do
> something about this specific case. I favor the latter approach.

I agree with your conclusion, but FWIW I am sympathetic to David's
view too. I certainly understand why he'd naturally want to define the
class of problems that are like this one, to understand what the
boundaries are.

The heuristic that has the optimizer flat out avoid an
unparameterized nested loop join is justified by the belief that
that's fundamentally reckless. Even though we all agree on that much,
I don't know when it stops being reckless and starts being "too risky
for me, but not fundamentally reckless". I think that that's worth
living with, but it isn't very satisfying.

> Risk
> and uncertainty exist all over the place, but dealing with that in a
> general way seems difficult, and maybe unnecessary. Addressing the
> case of unparameterized nest loops specifically seems simpler, because
> it's easier to reason about what the alternatives are. Your last
> sentence here seems right on point to me.

Right. Part of why this is a good idea is that the user is exposed to
so many individual risks and uncertainties. We cannot see any one risk
as existing in a vacuum. It is not the only risk the user will ever
take in the planner -- if it was then it might actually be okay to
allow unparameterized nested loop joins.

A bad unparameterized nested loop join plan has, in a sense, unknown
and unbounded cost/downside. But it is only very slightly faster than
a hash join, by a fixed well understood amount. With enough "trials"
and on a long enough timeline, it will inevitably blow up and cause
the application to grind to a halt. It seems like no amount of fixed,
bounded benefit from "fast unparameterized nested loop joins" could
possibly make up for that. The life of Postgres users would be a lot
better if bad plans were at least "survivable events".

-- 
Peter Geoghegan




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-06-21 Thread Tom Lane
Peter Geoghegan  writes:
> What do you think of the attached? I prefer the ISO date style myself,
> so I went with that.

I grepped the git logs for "indent" and found a bunch more commits
that look like they should be included:

db6e2b4c5
d84213909
1e9c85809
f04d4ac91
9ef2dbefc
651902deb
ce5548103
b5bce6c1e
de94e2af1
d0cd7bda9
befa3e648
7584649a1
84288a86a
d74714027
b6b71b85b
46785776c
089003fb4
ea08e6cd5
59f6a57e5

It's possible that some of these touch few enough lines that they
are not worth listing; I did not check the commit delta sizes.

> Note that I have included "Modify BufferGetPage() to prepare for
> "snapshot too old" feature", as well as "Revert no-op changes to
> BufferGetPage()". I've noticed that those two particular commits cause
> unhelpful noise when I run "git blame" on access method code.

Meh.  I can get on board with the idea of adding commit+revert pairs
to this list, but I'd like a more principled selection filter than
"which ones bother Peter".  Maybe the size of the reverted patch
should factor into it.

Do we have an idea of how much adding entries to this list slows
down "git blame"?  If we include commit+revert pairs more than
very sparingly, I'm afraid we'll soon have an enormous list, and
I wonder what that will cost us.

regards, tom lane




Re: Add version macro to libpq-fe.h

2021-06-21 Thread Robert Haas
On Sat, Jun 19, 2021 at 11:45 AM Tom Lane  wrote:
> Hearing no further comments, done that way.

What will prevent us from forgetting to do something about this again,
a year from now?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 12:56 AM Amit Kapila  wrote:
> I thought if we scan a system catalog using DirtySnapshot, then we
> should be able to find such a relation. But, if the system catalog is
> updated after our scan then surely we won't be able to see it and in
> that case, we won't be able to send invalidation. Now, say if the rel
> is not visible to us because of the snapshot we used or due to some
> race condition then we won't be able to send the invalidation but why
> we want to consider it worse than the case where we miss such
> invalidations (invalidations due to change of parallel-safe property)
> when the insertion into relation is in-progress.

A concurrent change is something quite different than a change that
happened some time in the past. We all know that DROP TABLE blocks if
it is run while the table is in use, and everybody considers that
acceptable, but if DROP TABLE were to block because the table was in
use at some previous time, everybody would complain, and rightly so.
The same principle applies here. It's not possible to react to a
change that happens in the middle of the query. Somebody could argue
that we ought to lock all the functions we're using against concurrent
changes so that attempts to change their properties block on a lock
rather than succeeding. But given that that's not how it works, we can
hardly go back in time and switch to a non-parallel plan after we've
already decided on a parallel one. On the other hand, we should be
able to notice a change that has *already completed* at the time we do
planning. I don't see how we can blame failure to do that on anything
other than bad coding.

> Yeah, the session in which we are doing Alter Function won't be able
> to lock it but it will wait for the AccessExclusiveLock on the rel to
> be released because it will also try to acquire it before sending
> invalidation.

I think users would not be very happy with such behavior. Users accept
that if they try to access a relation, they might end up needing to
wait for a lock on it, but what you are proposing here might make a
session block waiting for a lock on a relation which it never
attempted to access.

I think this whole line of attack is a complete dead-end. We can
invent new types of invalidations if we want, but they need to be sent
based on which objects actually got changed, not based on what we
think might be affected indirectly as a result of those changes. It's
reasonable to regard something like a trigger or constraint as a
property of the table because it is really a dependent object. It is
associated with precisely one table when it is created and the
association can never be changed. On the other hand, functions clearly
have their own existence. They can be created and dropped
independently of any table and the tables with which they are
associated can change at any time. In that kind of situation,
invalidating the table based on changes to the function is riddled
with problems which I am pretty convinced we're never going to be able
to solve. I'm not 100% sure what we ought to do here, but I'm pretty
sure that looking up the tables that happen to be associated with the
function in the session that is modifying the function is not it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use simplehash.h instead of dynahash in SMgr

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 10:15 AM David Rowley  wrote:
> I've come up with a new hash table implementation that I've called
> generichash.

At the risk of kibitzing the least-important detail of this proposal,
I'm not very happy with the names of our hash implementations.
simplehash is not especially simple, and dynahash is not particularly
dynamic, especially now that the main place we use it is for
shared-memory hash tables that can't be resized. Likewise, generichash
doesn't really give any kind of clue about how this hash table is
different from any of the others. I don't know how possible it is to
do better here; naming things is one of the two hard problems in
computer science. In a perfect world, though, our hash table
implementations would be named in such a way that somebody might be
able to look at the names and guess on that basis which one is
best-suited to a given task.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: disfavoring unparameterized nested loops

2021-06-21 Thread Robert Haas
On Mon, Jun 21, 2021 at 6:41 AM David Rowley  wrote:
> For example, in an unparameterized Nested Loop that estimates the
> outer Path to have 1 row will cost an entire additional inner cost if
> there are 2 rows.  With Hash Join the cost is just an additional
> hashtable lookup, which is dead cheap.   I don't know exactly how
> add_path() would weigh all that up, but it seems to me that I wouldn't
> take the risk unless I was 100% certain that the Nested Loop's outer
> Path would only return 1 row exactly, if there was any chance at all
> it could return more, I'd be picking some other join method.

It seems like everyone agrees that it would be good to do something
about this problem, but the question is whether it's best to do
something that tries to be general, or whether we should instead do
something about this specific case. I favor the latter approach. Risk
and uncertainty exist all over the place, but dealing with that in a
general way seems difficult, and maybe unnecessary. Addressing the
case of unparameterized nest loops specifically seems simpler, because
it's easier to reason about what the alternatives are. Your last
sentence here seems right on point to me.

Basically, what you argue for there is disabling unparameterized
nested loops entirely except when we can prove that the inner side
will never generate more than one row. But, that's almost never going
to be something that we can prove. If the inner side is coming from a
table or sub-join, it can turn out to be big. As far as I can see, the
only way that this doesn't happen is if it's something like a subquery
that aggregates everything down to one row, or has LIMIT 1, but those
are such special cases that I don't even think we should be worrying
about them.

So my counter-proposal is: how about if we split
match_unsorted_outer() into two functions, one of which generates
nested loops only based on parameterized paths and the other of which
generates nested loops based only on unparameterized paths, and then
rejigger add_paths_to_joinrel() so that we do the latter between the
steps that are currently number 5 and 6 and only if we haven't got any
other paths yet? If somebody later comes up with logic for proving
that the inner side can never have more than 1 row, we can let this be
run in those cases as well. In the meantime, if somebody does
something like a JOIN b ON a.x < b.x, we will still generate these
paths because there's no other approach, or similarly if it's a.x =
b.x but for some strange type that doesn't have a hash-joinable or
merge-joinable equality operator. But otherwise we omit those paths
entirely.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use extended statistics to estimate (Var op Var) clauses

2021-06-21 Thread Dean Rasheed
On Sun, 13 Jun 2021 at 21:28, Tomas Vondra
 wrote:
>
> Here is a slightly updated version of the patch
>
> The main issue I ran into
> is the special case clauselist_selectivity, which does
>
>  if (list_length(clauses) == 1)
>  return clause_selectivity_ext(...);
>
> which applies to cases like "WHERE a < b" which can now be handled by
> extended statistics, thanks to this patch. But clause_selectivity_ext
> only used to call restriction_selectivity for these clauses, which does
> not use extended statistics, of course.
>
> I considered either getting rid of the special case, passing everything
> through extended stats, including cases with a single clause. But that
> ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a
> bit, which seems like a better approach.

Yeah, I looked at this a few months ago, while looking at the other
extended stats stuff, and I came to the same conclusion that solving
this issue by tweaking clause_selectivity_ext() is the best approach.

I haven't looked at the patch in much detail yet, but I think the
basic approach looks OK.

There are a few comments that need updating, e.g.:
 - In statext_is_compatible_clause_internal(), before the "if
(is_opclause(clause))" test.
 - The description of the arguments for examine_opclause_args().

I wonder if "expronleftp" for examine_opclause_args() should be
"constonrightp", or some such -- as it stands it's being set to false
for an Expr Op Expr clause, which doesn't seem right because there
*is* an expression on the left.

Regards,
Dean




Re: PG 14 release notes, first draft

2021-06-21 Thread Bruce Momjian
On Mon, Jun 21, 2021 at 02:47:16PM +0900, Tatsuo Ishii wrote:
> > I got the parse error after applying the patch:
> > 
> > release-14.sgml:3562: parser error : Input is not proper UTF-8,
> > indicate encoding !
> > Bytes: 0xE9 0x20 0x53 0x61
> > (Juan Jos Santamara Flecha)
> >  ^
> > 
> > Is that a problem with my environment?
> 
> Me too. I think the problem is, Bruce's patch is encoded in
> ISO-8859-1, not UTF-8. As far as I know PostgreSQL never encodes
> *.sgml files in ISO-8859-1. Anyway, attached is the Bruce's patch
> encoded in UTF-8. This works for me.
> 
> My guess is, when Bruce attached the file, his MUA automatically
> changed the file encoding from UTF-8 to ISO-8859-1 (it could happen in
> many MUA). Also that's the reason why he does not see the problem
> while compiling the sgml files. In his environment release-14.sgml is
> encoded in UTF-8, I guess. To prevent the problem next time, it's
> better to change the mime type of the attached file to
> Application/Octet-Stream.

Oh, people were testing by building from the attached patch, not from
the git tree.  Yes, I see now the email was switched to a single-byte
encoding, and the attachment header confirms it:

Content-Type: text/x-diff; charset=iso-8859-1
   --
Content-Disposition: attachment; filename="master.diff"
Content-Transfer-Encoding: 8bit

I guess my email program, mutt, is trying to be helpful by using a
single-byte encoding when UTF is not necessary, which I guess makes
sense.  I will try to remember this can cause problems with SGML
attachments.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Use simplehash.h instead of dynahash in SMgr

2021-06-21 Thread David Rowley
I'd been thinking of this patch again.  When testing with simplehash,
I found that the width of the hash bucket type was fairly critical for
getting good performance from simplehash.h.  With simplehash.h I
didn't manage to narrow this any more than 16 bytes. I needed to store
the 32-bit hash value and a pointer to the data. On a 64-bit machine,
with padding, that's 16-bytes.  I've been thinking about a way to
narrow this down further to just 8 bytes and also solve the stable
pointer problem at the same time...

I've come up with a new hash table implementation that I've called
generichash.   It works similarly to simplehash in regards to the
linear probing, only instead of storing the data in the hash bucket,
we just store a uint32 index that indexes off into an array.  To keep
the pointers in that array stable, we cannot resize the array as the
table grows.  Instead, I just allocate another array of the same size.
Since these arrays are always sized as powers of 2, it's very fast to
index into them using the uint32 index that's stored in the bucket.
Unused buckets just store the special index of 0xFFFFFFFF.
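
To make that concrete, here is a minimal sketch of what such a bucket and
the index-to-pointer translation could look like (the names GHBucket,
GHTable, gh_element and GH_ITEMS_PER_SEGMENT are illustrative assumptions,
not taken from the attached patch; PostgreSQL's c.h types uint32 and Size
are assumed):

#define GH_UNUSED_INDEX 0xFFFFFFFF
#define GH_ITEMS_PER_SEGMENT 256    /* always a power of two */

typedef struct GHBucket
{
    uint32      hashvalue;      /* cached hash of the element's key */
    uint32      index;          /* element index, or GH_UNUSED_INDEX */
} GHBucket;

typedef struct GHTable
{
    GHBucket   *buckets;        /* linear-probing bucket array */
    char      **segments;       /* fixed-size element arrays, never resized */
    Size        elementsize;    /* size of one stored element */
} GHTable;

/* Translate a stored index into a stable element pointer. */
static inline void *
gh_element(GHTable *tb, uint32 index)
{
    uint32      segno = index / GH_ITEMS_PER_SEGMENT;
    uint32      segoff = index % GH_ITEMS_PER_SEGMENT;

    return tb->segments[segno] + (Size) segoff * tb->elementsize;
}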

I've also proposed to use this hash table implementation over in [1]
to speed up LockReleaseAll(). The 0001 patch here is just the same as
the patch from [1].

The 0002 patch includes using a generichash hash table for SMgr.

The performance using generichash.h is about the same as the
simplehash.h version of the patch. Although, the test was not done on
the same version of master.

Master (97b713418)
drowley@amd3990x:~$ tail -f pg.log | grep "redo done"
CPU: user: 124.85 s, system: 6.83 s, elapsed: 131.74 s
CPU: user: 115.01 s, system: 4.76 s, elapsed: 119.83 s
CPU: user: 122.13 s, system: 6.41 s, elapsed: 128.60 s
CPU: user: 113.85 s, system: 6.11 s, elapsed: 120.02 s
CPU: user: 121.40 s, system: 6.28 s, elapsed: 127.74 s
CPU: user: 113.71 s, system: 5.80 s, elapsed: 119.57 s
CPU: user: 113.96 s, system: 5.90 s, elapsed: 119.92 s
CPU: user: 122.74 s, system: 6.21 s, elapsed: 129.01 s
CPU: user: 122.00 s, system: 6.38 s, elapsed: 128.44 s
CPU: user: 113.06 s, system: 6.14 s, elapsed: 119.25 s
CPU: user: 114.42 s, system: 4.35 s, elapsed: 118.82 s

Median: 120.02 s

master + v1 + v2

drowley@amd3990x:~$ tail -n 0 -f pg.log | grep "redo done"
CPU: user: 107.75 s, system: 4.61 s, elapsed: 112.41 s
CPU: user: 108.07 s, system: 4.49 s, elapsed: 112.61 s
CPU: user: 106.89 s, system: 5.55 s, elapsed: 112.49 s
CPU: user: 107.42 s, system: 5.64 s, elapsed: 113.12 s
CPU: user: 106.85 s, system: 4.42 s, elapsed: 111.31 s
CPU: user: 107.36 s, system: 4.76 s, elapsed: 112.16 s
CPU: user: 107.20 s, system: 4.47 s, elapsed: 111.72 s
CPU: user: 106.94 s, system: 5.89 s, elapsed: 112.88 s
CPU: user: 115.32 s, system: 6.12 s, elapsed: 121.49 s
CPU: user: 108.02 s, system: 4.48 s, elapsed: 112.54 s
CPU: user: 106.93 s, system: 4.54 s, elapsed: 111.51 s

Median: 112.49 s

So about a 6.69% speedup

David

[1] 
https://www.postgresql.org/message-id/caaphdvokqwrxw5nnupz8+majkhpopxygoy1gqdh0wes4+bi...@mail.gmail.com


v1-0001-Add-a-new-hash-table-type-which-has-stable-pointe.patch
Description: Binary data


v1-0002-Use-generichash.h-hashtables-in-SMgr.patch
Description: Binary data


Re: disfavoring unparameterized nested loops

2021-06-21 Thread Kenneth Marshall
> >
> > Most of the time when I see that happen it's down to either the
> > selectivity of some correlated base-quals being multiplied down to a
> > number low enough that we clamp the estimate to be 1 row.   The other
> > case is similar, but with join quals.
> 
> If an estimate is lower than 1, that should be a red flag that Something Is
> Wrong. This is kind of a crazy idea, but what if we threw it back the other
> way by computing 1 / est , and clamping that result to 2 <= res < 10 (or
> 100 or something)? The theory is, the more impossibly low it is, the more
> wrong it is. I'm attracted to the idea of dealing with it as an estimation
> problem and not needing to know about join types. Might have unintended
> consequences, though.
> 
> Long term, it would be great to calculate something about the distribution
> of cardinality estimates, so we can model risk in the estimates.
> 

Hi,

Laurenz suggested clamping to 2 in this thread in 2017:

https://www.postgresql.org/message-id/1509611428.3268.5.camel%40cybertec.at

Having been the victim of this problem in the past, I like the risk
based approach to this. If the cost of being wrong in the estimate is
high, use a merge join instead. In every case that I have encountered,
that heuristic would have given the correct performant plan.

Regards,
Ken




Fix pkg-config file for static linking

2021-06-21 Thread Filip Gospodinov
Since https://github.com/postgres/postgres/commit/ea53100d5 (or Postgres 
12.0) the shipped pkg-config file is broken for statically linking libpq 
because libpgcommon and libpgport are missing. This patch adds those two 
missing private dependencies.
diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index 0c4e55b6ad..fe21335d2d 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -94,7 +94,7 @@ SHLIB_PREREQS = submake-libpgport
 
 SHLIB_EXPORTS = exports.txt
 
-PKG_CONFIG_REQUIRES_PRIVATE = libssl libcrypto
+PKG_CONFIG_REQUIRES_PRIVATE = libpgcommon libpgport libssl libcrypto
 
 all: all-lib
 


Re: Character expansion with ICU collations

2021-06-21 Thread Finnerty, Jim
I have a proposal for how to support tailoring rules in ICU collations: The 
ucol_openRules() function is an alternative to the ucol_open() function that 
PostgreSQL calls today, but it takes the collation strength as one if its 
parameters so the locale string would need to be parsed before creating the 
collator.  After the collator is created using either ucol_openRules or 
ucol_open, the ucol_setAttribute() function may be used to set individual 
attributes from keyword=value pairs in the locale string as it does now, except 
that the strength probably can't be changed after opening the collator with 
ucol_openRules.  So the logic in pg_locale.c would need to be reorganized a 
little bit, but that sounds straightforward.
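
As a rough sketch of that reorganization (an editor's illustration only; the
helper name and the example attribute are assumptions, not from any actual
patch), the flow could look like:

#include <unicode/ucol.h>

static UCollator *
open_collator_with_rules(const UChar *rules, int32_t rules_len,
                         UCollationStrength strength)
{
    UParseError parse_error;
    UErrorCode  status = U_ZERO_ERROR;
    UCollator  *collator;

    /* the strength must be known before the collator is created */
    collator = ucol_openRules(rules, rules_len, UCOL_DEFAULT, strength,
                              &parse_error, &status);
    if (U_FAILURE(status))
        return NULL;            /* caller reports the error */

    /* remaining keyword=value attributes can still be applied afterwards */
    ucol_setAttribute(collator, UCOL_CASE_LEVEL, UCOL_OFF, &status);

    return collator;
}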

One simple solution would be to have the tailoring rules be specified as a new 
keyword=value pair, such as colTailoringRules=<TailoringRules>.  Since the 
<TailoringRules> may contain single quote characters or PostgreSQL escape 
characters, any single quote characters or escapes would need to be escaped 
using PostgreSQL escape rules.  If colTailoringRules is present, colStrength 
would also be known prior to opening the collator, or would default to 
tertiary, and we would keep a local flag indicating that we should not process 
the colStrength keyword again, if specified. 

Representing the TailoringRules as just another keyword=value in the locale 
string means that we don't need any change to the catalog to store it.  It's 
just part of the locale specification.  I think we wouldn't even need to bump 
the catversion.

Are there any tailoring rules, such as expansions and contractions, that we 
should disallow?  I realize that we don't handle nondeterministic collations in 
LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' on a 
database with a UTF-8 encoding and arbitrary tailoring rules that include 
expansions and contractions, is it still guaranteed that expr must sort BETWEEN 
'a' AND ('a' || E'/u') ?



Doc chapter for Hash Indexes

2021-06-21 Thread Simon Riggs
New chapter for Hash Indexes, designed to help users understand how
they work and when to use them.

Mostly newly written, but a few paras lifted from README when they were helpful.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


doc_hash_index.v1.patch
Description: Binary data


Using indexUnchanged with nbtree

2021-06-21 Thread Simon Riggs
Seems like we can skip the uniqueness check if indexUnchanged, which
will speed up non-HOT UPDATEs on tables with B-Trees.

Passes tests.

Comments?

-- 
Simon Riggs    http://www.EnterpriseDB.com/


skip_nonHOT_btree_unique_check.v1.patch
Description: Binary data


Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

2021-06-21 Thread Andres Freund
Hi,

On 2021-06-16 12:12:23 -0700, Andres Freund wrote:
> Could you share your testcase? I've been working on a series of patches
> to address this (I'll share in a bit), and I've run quite a few tests,
> and didn't hit any infinite loops.

Sorry for not yet doing that. Unfortunately I have an ongoing family
health issue (& associated travel) claiming time and energy :(.

I've pushed the minimal fix due to beta 2.

Beyond beta 2 I am thinking of the below to unify the horizon
determination:

static inline GlobalVisHorizonKind
GlobalVisHorizonKindForRel(Relation rel)
{
    if (!rel)
        return VISHORIZON_SHARED;

    /*
     * Other relkinds currently don't contain xids, nor always the necessary
     * logical decoding markers.
     */
    Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
           rel->rd_rel->relkind == RELKIND_MATVIEW ||
           rel->rd_rel->relkind == RELKIND_TOASTVALUE);

    if (rel->rd_rel->relisshared || RecoveryInProgress())
        return VISHORIZON_SHARED;
    else if (IsCatalogRelation(rel) ||
             RelationIsAccessibleInLogicalDecoding(rel))
        return VISHORIZON_CATALOG;
    else if (!RELATION_IS_LOCAL(rel))
        return VISHORIZON_DATA;
    else
        return VISHORIZON_TEMP;
}

That's then used in GetOldestNonRemovableTransactionId(),
GlobalVisTestFor(). Makes sense?
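
For context, a sketch of how such a helper could be consumed by
GlobalVisTestFor(); the GlobalVis*Rels state names below are assumptions
based on existing procarray.c naming rather than part of the quoted
proposal:

GlobalVisState *
GlobalVisTestFor(Relation rel)
{
    GlobalVisState *state = NULL;

    switch (GlobalVisHorizonKindForRel(rel))
    {
        case VISHORIZON_SHARED:
            state = &GlobalVisSharedRels;
            break;
        case VISHORIZON_CATALOG:
            state = &GlobalVisCatalogRels;
            break;
        case VISHORIZON_DATA:
            state = &GlobalVisDataRels;
            break;
        case VISHORIZON_TEMP:
            state = &GlobalVisTempRels;
            break;
    }

    Assert(state != NULL);
    return state;
}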

Regards,

Andres




Re: Add index OID macro argument to DECLARE_INDEX

2021-06-21 Thread John Naylor
On Mon, Jun 21, 2021 at 3:23 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
>
> This patch changes places like this
>
> DECLARE_UNIQUE_INDEX_PKEY(pg_aggregate_fnoid_index, 2650, on
> pg_aggregate using btree(aggfnoid oid_ops));
> #define AggregateFnoidIndexId  2650
>
> to this
>
> DECLARE_UNIQUE_INDEX_PKEY(pg_aggregate_fnoid_index, 2650,
> AggregateFnoidIndexId, on pg_aggregate using btree(aggfnoid oid_ops));

+1, and the patch looks good to me.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values

2021-06-21 Thread Greg Nancarrow
On Mon, Jun 21, 2021 at 8:32 PM David Rowley  wrote:
>
> It might be worth putting in a comment to mention that the check is
> not needed.  Just in case someone looks again one day and thinks the
> checks are missing.
>
> Probably best to put this in the July commitfest so it does not get missed.

Updated the patch, and will add it to the Commitfest, thanks.

Regards,
Greg Nancarrow
Fujitsu Australia


v2-0001-Remove-useless-int64-range-checks-on-BIGINT-sequence.patch
Description: Binary data


Re: disfavoring unparameterized nested loops

2021-06-21 Thread John Naylor
On Tue, Jun 15, 2021 at 8:00 PM David Rowley  wrote:

> In my experience, the most common reason that the planner chooses
> non-parameterized nested loops wrongly is when there's row
> underestimation that says there's just going to be 1 row returned by
> some set of joins.  The problem often comes when some subsequent join
> is planned and the planner sees the given join rel only produces one
> row.  The cheapest join method we have to join 1 row is Nested Loop.
> So the planner just sticks the 1-row join rel on the outer side
> thinking the executor will only need to scan the inner side of the
> join once.  When the outer row count blows up, then we end up scanning
> that inner side many more times. The problem is compounded when you
> nest it a few joins deep
>
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row.   The other
> case is similar, but with join quals.

If an estimate is lower than 1, that should be a red flag that Something Is
Wrong. This is kind of a crazy idea, but what if we threw it back the other
way by computing 1 / est , and clamping that result to 2 <= res < 10 (or
100 or something)? The theory is, the more impossibly low it is, the more
wrong it is. I'm attracted to the idea of dealing with it as an estimation
problem and not needing to know about join types. Might have unintended
consequences, though.
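
In code form, the clamping idea might amount to no more than something like
this (a sketch only; the function name and the bounds are placeholders):

static double
clamp_suspicious_rowcount(double est)
{
    double      res;

    if (est >= 1.0)
        return est;             /* nothing suspicious, keep the estimate */

    /* the lower the (impossibly low) estimate, the larger the correction */
    res = 1.0 / est;
    if (res < 2.0)
        res = 2.0;
    if (res > 10.0)
        res = 10.0;             /* or 100, per the open question above */
    return res;
}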

Long term, it would be great to calculate something about the distribution
of cardinality estimates, so we can model risk in the estimates.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: SSL/TLS instead of SSL in docs

2021-06-21 Thread Daniel Gustafsson
> On 18 Jun 2021, at 07:37, Michael Paquier  wrote:
> 
> On Tue, Jun 15, 2021 at 03:59:18PM +0200, Daniel Gustafsson wrote:
>> While in there I added IMO missing items to the glossary and acronyms 
>> sections
>> as well as fixed up markup around OpenSSL.
>> 
>> This only deals with docs, but if this is deemed interesting then userfacing
>> messages in the code should use SSL/TLS as well of course.
> 
> +SNI
> +
> + 
> +  Server Name Indication
> + 
> +
> It looks inconsistent to me to point to the libpq documentation to get
> the details about SNI.  Wouldn't is be better to have an item in the
> glossary that refers to the bits of RFC 6066, and remove the reference
> of the RPC from the libpq page?

I opted for a version with SNI in the glossary but without a link to the RFC
since no definitions in the glossary have ulinks.

> -   to present a valid (trusted) SSL certificate, while
> +   to present a valid (trusted) 
> SSL/TLS certificate, while
> This style with two acronyms for what we want to be one thing is
> heavy.  Could it be better to just have one single acronym called
> SSL/TLS that references both parts?

Maybe, I don't know.  I certainly don't prefer the way which is in the patch
but I also think it's the most "correct" way so I opted for that to start the
discussion.  If we're fine with one acronym tag for the combination then I'm
happy to change to that.

A slightly more invasive idea would be to not have acronyms at all and instead
move the ones that do benefit from clarification to the glossary?  ISTM that
having a brief description of terms and not just the expansion is beneficial to
the users.  That would however be for another thread, but perhaps that's worth
discussing?

> Patch 0003, for the  markups with OpenSSL, included one
> SSL/TLS entry.

Ugh, sorry, that must've been a git add -p fat-finger.

--
Daniel Gustafsson   https://vmware.com/



v2-0002-docs-Replace-usage-of-SSL-with-SSL-TLS.patch
Description: Binary data


v2-0001-docs-SSL-TLS-related-acronyms-and-glossary.patch
Description: Binary data


Is the testing a bit too light on GROUP BY DISTINCT?

2021-06-21 Thread David Rowley
In [1], Yaoguang reported an Assert failure in expand_grouping_sets.
Since beta2 deadline is looming, I pushed a quick fix for that.

As mentioned over on bugs, only one test triggers that code. Because
the code just above sorts the top-level List by the number of elements
in each of the contained IntLists, that List always had an empty list
as its first element, so the NIL was always at the start of the
top-level List.

It wasn't too hard to modify the test to change that.

I wonder if the testing for the feature is just a bit too light.

Would it maybe be worth adding a GROUP BY DISTINCT with GROUPING SETS test?

Any thoughts?

David

[1] https://www.postgresql.org/message-id/17067-665d50fa321f7...@postgresql.org




Re: Doc patch for Logical Replication Message Formats (PG14)

2021-06-21 Thread Brar Piening

Amit Kapila wrote:

On Mon, Jun 21, 2021 at 4:13 PM Brar Piening  wrote:

Amit Kapila wrote:
After looking at the docs once again I have another minor amendment (new
patch attached).


+The value of the column, eiter in binary or in text format.

Typo. /eiter/either


Fixed - thanks!

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index bc2a2feb0b..7d29308abc 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -6498,6 +6498,18 @@ Message
 
 
 
+
+
+
+Int32
+
+
+
+Length of the content.
+
+
+
+
 
 
 Byten
@@ -7430,6 +7442,19 @@ TupleData
 
 
 
+
+Or
+
+
+
+Byte1('b')
+
+
+
+Identifies the data as binary value.
+
+
+
 
 
 Int32
@@ -7446,8 +7471,8 @@ TupleData
 
 
 
-The value of the column, in text format.  (A future release
-might support additional formats.)
+The value of the column, either in binary or in text format.
+(As specified in the preceding format byte).
 n is the above length.
 
 


RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-06-21 Thread houzj.f...@fujitsu.com
On Wednesday, June 16, 2021 11:27 AM houzj.f...@fujitsu.com wrote:
> On Tuesday, June 15, 2021 10:01 PM Robert Haas  wrote:
> > Just to be clear here, I don't think it really matters what we *want*
> > to do. I don't think it's reasonably *possible* to check all the
> > partitions, because we don't hold locks on them. When we're assessing
> > a bunch of stuff related to an individual relation, we have a lock on
> > it. I think - though we should double-check tablecmds.c - that this is
> > enough to prevent all of the dependent objects - triggers,
> > constraints, etc. - from changing. So the stuff we care about is
> > stable. But the situation with a partitioned table is different. In
> > that case, we can't even examine that stuff without locking all the
> > partitions. And even if we do lock all the partitions, the stuff could 
> > change
> immediately afterward and we wouldn't know. So I think it would be difficult 
> to
> >make it correct.
> > Now, maybe it could be done, and I think that's worth a little more
> > thought. For example, perhaps whenever we invalidate a relation, we
> > could also somehow send some new, special kind of invalidation for its
> > parent saying, essentially, "hey, one of your children has changed in
> > a way you might care about." But that's not good enough, because it
> > only goes up one level. The grandparent would still be unaware that a
> > change it potentially cares about has occurred someplace down in the
> > partitioning hierarchy. That seems hard to patch up, again because of
> > the locking rules. The child can know the OID of its parent without
> > locking the parent, but it can't know the OID of its grandparent
> > without locking its parent. Walking up the whole partitioning
> > hierarchy might be an issue for a number of reasons, including
> > possible deadlocks, and possible race conditions where we don't emit
> > all of the right invalidations in the face of concurrent changes. So I
> > don't quite see a way around this part of the problem, but I may well be
> missing something. In fact I hope I am missing something, because solving this
> problem would be really nice.
> 
> I think the check of partition could be even more complicated if we need to
> check the parallel safety of partition key expression. If user directly 
> insert into a
> partition, then we need invoke ExecPartitionCheck which will execute all it's
> parent's and grandparent's partition key expressions. It means if we change a
> parent table's partition key expression(by 1) change function in expr or 2)
> attach the parent table as partition of another parent table), then we need to
> invalidate all its child's relcache.
> 
> BTW, currently, If user attach a partitioned table 'A' to be partition of 
> another
> partitioned table 'B', the child of 'A' will not be invalidated.

To be honest, I didn't find a cheap way to invalidate a partitioned table's
parallel safety automatically. For one thing, we would need to recurse higher
in the partition tree to invalidate all the parent tables' relcaches (and
perhaps all their children's relcaches), not only when altering a function's
parallel safety, but also for DDLs that invalidate the partition's relcache,
such as CREATE/DROP INDEX/TRIGGER/CONSTRAINT. That seems too expensive. For
another, even if we could invalidate the partitioned table's parallel safety
automatically, we would still need to lock all the partitions when checking
the table's parallel safety, because a partition's parallel safety could
change after the check.

So, IMO, at least for partitioned tables, an explicit flag looks more
acceptable. For regular tables, it seems we can work it out automatically,
although I am not sure whether anyone would find that a bit inconsistent.

Best regards,
houzj


Re: Doc patch for Logical Replication Message Formats (PG14)

2021-06-21 Thread Amit Kapila
On Mon, Jun 21, 2021 at 4:13 PM Brar Piening  wrote:
>
> Amit Kapila wrote:
> >
> After looking at the docs once again I have another minor amendment (new
> patch attached).
>

+The value of the column, eiter in binary or in text format.

Typo. /eiter/either

-- 
With Regards,
Amit Kapila.




Re: Fix for segfault in logical replication on master

2021-06-21 Thread Amit Kapila
On Mon, Jun 21, 2021 at 4:09 PM Japin Li  wrote:
>
> On Mon, 21 Jun 2021 at 17:54, Amit Kapila  wrote:
> > On Mon, Jun 21, 2021 at 2:06 PM Japin Li  wrote:
> >>
> >> On Mon, 21 Jun 2021 at 16:22, Amit Kapila  wrote:
> >> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li  wrote:
> >> >>
> >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila  
> >> >> wrote:
> >> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila  
> >> >> > wrote:
> >> >>
> >> >> Or we can free the memory owned by indexoidlist after check whether it 
> >> >> is NIL,
> >> >> because we do not use it in the later.
> >> >>
> >> >
> >> > Valid point. But I am thinking do we really need to fetch and check
> >> > indexoidlist here?
> >>
> >> IMO, we should not fetch and check the indexoidlist here, since we do not
> >> use it.  However, we should use RelationGetIndexList() to update the
> >> relation->rd_replidindex, so we should fetch the indexoidlist, maybe we
> >> can use the following code:
> >>
> >> indexoidlist = RelationGetIndexList(relation);
> >> list_free(indexoidlist);
> >>
> >> Or is there any function that only updates the relation->rd_replidindex
> >> or related fields, but does not fetch the indexoidlist?
> >>
> >
> > How about RelationGetReplicaIndex? It fetches the indexlist only when
> > required and frees it immediately. But otherwise, currently, there
> > shouldn't be any memory leak because we allocate this in "logical
> > replication output context" which is reset after processing each
> > change message, see pgoutput_change.
>
> Thanks for your explanation.  It might not be a memory leak; however, it's
> a little confusing when we free the memory of the indexoidlist in one place
> but not in another.
>
> I attached a patch to fix this.  Any thoughts?
>

Your patch appears to be on the lines we discussed but I would prefer
to get it done after Beta2 as this is just a minor code improvement.
Can you please send the change as a patch file instead of copy-pasting
the diff at the end of the email?

-- 
With Regards,
Amit Kapila.




Re: Optionally automatically disable logical replication subscriptions on error

2021-06-21 Thread Amit Kapila
On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila  wrote:
>
> On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
>  wrote:
>
> >  I don't mind if you want to store more information, and maybe that needs 
> > to be stored somewhere else.  Do you believe pg_subscription_rel is a 
> > suitable location?
> >
> It won't be sufficient to store information in either
> pg_subscription_rel or pg_susbscription. I think if we want to store
> the required information in a catalog then we need to define a new
> catalog (pg_subscription_conflicts or something like that) with
> information corresponding to each rel in subscription (srsubid oid
> (Reference to subscription), srrelid oid (Reference to relation),
> ). OTOH, we can choose to send the error
> information to stats collector which will then be available via stat
> view and update system catalog to disable the subscription but there
> will be a risk that we might send info of failed transaction to stats
> collector but then fail to update system catalog to disable the
> subscription.
>

I think we should store the input from the user (like disable_on_error
flag or xid to skip) in the system catalog pg_subscription and send
the error information (subscription_id, rel_id, xid of failed xact,
error_code, error_message, etc.) to the stats collector which can be
used to display such information via a stat view.

The disable_on_error flag handling could be that on error it sends the
required error info to stats collector and then updates the subenabled
in pg_subscription. In rare conditions, where we are able to send the
message but couldn't update the subenabled info in pg_subscription
either due to some error or server restart, the apply worker would
again try to apply the same change and would hit the same error again
which I think should be fine because it will ultimately succeed.

The skip xid handling will also be somewhat similar where on an error,
we will send the error information to stats collector which will be
displayed via stats view. Then the user is expected to ask for skip
xid (Alter Subscription ... SKIP <xid>) based on information
displayed via stat view. Now, the apply worker can skip changes from
such a transaction, and then during processing of commit record of the
skipped transaction, it should update xid to invalid value, so that
next time that shouldn't be used. I think it is important to update
xid to an invalid value as part of the skipped transaction because
otherwise, after the restart, we won't be able to decide whether we
still want to skip the xid stored for a subscription.

-- 
With Regards,
Amit Kapila.




RE: Refactor ECPGconnect and allow IPv6 connection

2021-06-21 Thread kuroda.hay...@fujitsu.com
Dear Michael,

Thank you for replying!

> it does not strike me as a great idea to have a duplicate
> logic doing the parsing of URIs, even if libpq accepts multiple
> hosts/ports as an option.

Yeah, I agree with your argument that the duplicated parse function should be removed.

ECPG parses the connection string because it uses PQconnectdbParams()
even if the target is specified in the new style, hence my elementary
idea is that the parsing can be skipped when PQconnectdb() is called.

I will try to remake the patches based on that idea.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Doc patch for Logical Replication Message Formats (PG14)

2021-06-21 Thread Brar Piening

Amit Kapila wrote:

On Mon, Jun 21, 2021 at 12:26 PM Brar Piening  wrote:

Hello Hackers,
while amending Npgsql to account for the Logical Streaming Replication
Protocol changes in PostgreSQL 14 I stumbled upon two documentation
inaccuracies in the Logical Replication Message Formats documentation
(https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html)
that have been introduced (or rather omitted) with the recent changes to
allow pgoutput to send logical decoding messages
(https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec)
and to allow logical replication to transfer data in binary format
(https://github.com/postgres/postgres/commit/9de77b5453130242654ff0b30a551c9c862ed661).


  1. The content of the logical decoding message in the 'Message' message
 is prefixed with a length field (Int32) which isn't documented yet.
 See
 
https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L388
  2. The TupleData may now contain the byte 'b' as indicator for binary
 data which isn't documented yet. See
 
https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/include/replication/logicalproto.h#L83
 and
 
https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L558.

The attached documentation patch fixes both.


Yeah, I think these should be fixed and your patch looks good to me in
that regard.


After looking at the docs once again I have another minor amendment (new
patch attached).

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index bc2a2feb0b..7d29308abc 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -6498,6 +6498,18 @@ Message
 
 
 
+
+
+
+Int32
+
+
+
+Length of the content.
+
+
+
+
 
 
 Byten
@@ -7430,6 +7442,19 @@ TupleData
 
 
 
+
+Or
+
+
+
+Byte1('b')
+
+
+
+Identifies the data as binary value.
+
+
+
 
 
 Int32
@@ -7446,8 +7471,8 @@ TupleData
 
 
 
-The value of the column, in text format.  (A future release
-might support additional formats.)
+The value of the column, eiter in binary or in text format.
+(As specified in the preceding format byte).
 n is the above length.
 
 


Re: disfavoring unparameterized nested loops

2021-06-21 Thread David Rowley
On Wed, 16 Jun 2021 at 15:08, Peter Geoghegan  wrote:
> It seems important to distinguish between risk and uncertainty --
> they're rather different things. The short version is that you can
> model risk but you cannot model uncertainty. This seems like a problem
> of uncertainty to me.

You might be right there.  "Uncertainty" or "Certainty" seems more
like a value that clauselist_selectivity() would be able to calculate
itself. It would just be up to the planner to determine what to do
with it.

One thing I thought about is that if the costing modal was able to
separate out a cost of additional (unexpected) rows then it would be
easier for add_path() to take into account how bad things might go if
we underestimate.

For example, in an unparameterized Nested Loop that estimates the
outer Path to have 1 row will cost an entire additional inner cost if
there are 2 rows.  With Hash Join the cost is just an additional
hashtable lookup, which is dead cheap.   I don't know exactly how
add_path() would weigh all that up, but it seems to me that I wouldn't
take the risk unless I was 100% certain that the Nested Loop's outer
Path would only return 1 row exactly, if there was any chance at all
it could return more, I'd be picking some other join method.

David




Re: Fix for segfault in logical replication on master

2021-06-21 Thread Japin Li


On Mon, 21 Jun 2021 at 17:54, Amit Kapila  wrote:
> On Mon, Jun 21, 2021 at 2:06 PM Japin Li  wrote:
>>
>> On Mon, 21 Jun 2021 at 16:22, Amit Kapila  wrote:
>> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li  wrote:
>> >>
>> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila  wrote:
>> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila  
>> >> > wrote:
>> >>
>> >> Or we can free the memory owned by indexoidlist after check whether it is 
>> >> NIL,
>> >> because we do not use it in the later.
>> >>
>> >
>> > Valid point. But I am thinking do we really need to fetch and check
>> > indexoidlist here?
>>
>> IMO, we should not fetch and check the indexoidlist here, since we do not
>> use it.  However, we should use RelationGetIndexList() to update the
>> relation->rd_replidindex, so we should fetch the indexoidlist, maybe we
>> can use the following code:
>>
>> indexoidlist = RelationGetIndexList(relation);
>> list_free(indexoidlist);
>>
>> Or is there any function that only updates the relation->rd_replidindex
>> or related fields, but does not fetch the indexoidlist?
>>
>
> How about RelationGetReplicaIndex? It fetches the indexlist only when
> required and frees it immediately. But otherwise, currently, there
> shouldn't be any memory leak because we allocate this in "logical
> replication output context" which is reset after processing each
> change message, see pgoutput_change.

Thanks for your explanation.  It might not be a memory leak; however, it's
a little confusing when we free the memory of the indexoidlist in one place
but not in another.

I attached a patch to fix this.  Any thoughts?

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.


diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index d55ae016d0..94fbf1aa19 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5244,9 +5244,9 @@ Bitmapset *
 RelationGetIdentityKeyBitmap(Relation relation)
 {
Bitmapset  *idindexattrs = NULL;/* columns in the replica 
identity */
-   List   *indexoidlist;
RelationindexDesc;
int i;
+   Oid replidindex;
MemoryContext oldcxt;

/* Quick exit if we already computed the result */
@@ -5260,18 +5260,14 @@ RelationGetIdentityKeyBitmap(Relation relation)
/* Historic snapshot must be set. */
Assert(HistoricSnapshotActive());

-   indexoidlist = RelationGetIndexList(relation);
-
-   /* Fall out if no indexes (but relhasindex was set) */
-   if (indexoidlist == NIL)
-   return NULL;
+   replidindex = RelationGetReplicaIndex(relation);

/* Fall out if there is no replica identity index */
-   if (!OidIsValid(relation->rd_replidindex))
+   if (!OidIsValid(replidindex))
return NULL;

/* Look up the description for the replica identity index */
-   indexDesc = RelationIdGetRelation(relation->rd_replidindex);
+   indexDesc = RelationIdGetRelation(replidindex);

if (!RelationIsValid(indexDesc))
elog(ERROR, "could not open relation with OID %u",
@@ -5295,7 +5291,6 @@ RelationGetIdentityKeyBitmap(Relation relation)
}

RelationClose(indexDesc);
-   list_free(indexoidlist);

/* Don't leak the old values of these bitmaps, if any */
bms_free(relation->rd_idattr);






Re: Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values

2021-06-21 Thread David Rowley
On Mon, 21 Jun 2021 at 22:10, Greg Nancarrow  wrote:
> Sequence MINVALUE/MAXVALUE values are read into "int64" variables and
> then range-checked according to the sequence data-type.
> However, for a BIGINT sequence, checking whether these are <
> PG_INT64_MIN or > PG_INT64_MAX always evaluates to false, as an int64
> can't hold such values.

It might be worth putting in a comment to mention that the check is
not needed.  Just in case someone looks again one day and thinks the
checks are missing.

Probably best to put this in the July commitfest so it does not get missed.

David




Re: seawasp failing, maybe in glibc allocator

2021-06-21 Thread Andres Freund
Hi,

On 2021-06-20 19:56:56 -0400, Tom Lane wrote:
> Thomas Munro  writes:
> > Looking at their release schedule on https://llvm.org/, I see we have
> > a gamble to make.  They currently plan to cut RC1 at the end of July,
> > and to release in late September (every second LLVM major release
> > coincides approximately with a PG major release).  Option 1: wait
> > until we branch for 14, and then push this to master so that at least
> > seawasp can get back to looking for new problems, and then back-patch
> > only after they release (presumably in time for our November
> > releases).  If their API change sticks, PostgreSQL crashes and gives
> > weird results with the initial release of LLVM 13 until our fix comes
> > out.  Option 2: get ahead of their release and get this into 14 +
> > August back branch releases based on their current/RC behaviour.  If
> > they decide to revert the change before the final release, we'll leak
> > symbol names because we hold an extra reference, until we can fix
> > that.

I think I'd vote for 2 or 2+ (backpatch immediately).


> If that's an accurate characterization of the tradeoff, I have little
> difficulty in voting for #2.  A crash is strictly worse than a memory
> leak.  Besides which, I've heard little indication that they might
> revert.

We might be able to get them to revert and put in a different API, but I
don't think it'd clearly be an improvement at this point.

Greetings,

Andres Freund




Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values

2021-06-21 Thread Greg Nancarrow
Hi,

Sequence MINVALUE/MAXVALUE values are read into "int64" variables and
then range-checked according to the sequence data-type.
However, for a BIGINT sequence, checking whether these are <
PG_INT64_MIN or > PG_INT64_MAX always evaluates to false, as an int64
can't hold such values.
I've attached a patch to remove those useless checks.
The MINVALUE/MAXVALUE values are anyway int64 range-checked prior to
this, as part of conversion to int64.

Regards,
Greg Nancarrow
Fujitsu Australia


v1-0001-Remove-useless-int64-range-checks-on-BIGINT-sequence.patch
Description: Binary data


Re: Fix for segfault in logical replication on master

2021-06-21 Thread Amit Kapila
On Mon, Jun 21, 2021 at 2:06 PM Japin Li  wrote:
>
> On Mon, 21 Jun 2021 at 16:22, Amit Kapila  wrote:
> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li  wrote:
> >>
> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila  wrote:
> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila  
> >> > wrote:
> >>
> >> Or we can free the memory owned by indexoidlist after check whether it is 
> >> NIL,
> >> because we do not use it in the later.
> >>
> >
> > Valid point. But I am thinking do we really need to fetch and check
> > indexoidlist here?
>
> IMO, we should not fetch and check the indexoidlist here, since we do not
> use it.  However, we should use RelationGetIndexList() to update the
> relation->rd_replidindex, so we should fetch the indexoidlist, maybe we
> can use the following code:
>
> indexoidlist = RelationGetIndexList(relation);
> list_free(indexoidlist);
>
> Or is there any function that only updates the relation->rd_replidindex
> or related fields, but does not fetch the indexoidlist?
>

How about RelationGetReplicaIndex? It fetches the indexlist only when
required and frees it immediately. But otherwise, currently, there
shouldn't be any memory leak because we allocate this in "logical
replication output context" which is reset after processing each
change message, see pgoutput_change.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Rename PQtraceSetFlags to PQsetTraceFlags for bookindex.html

2021-06-21 Thread Noah Misch
On Mon, Jun 21, 2021 at 02:36:19AM +, zhangj...@fujitsu.com wrote:
> Here is a patch.

Pushed.




Re: Doc patch for Logical Replication Message Formats (PG14)

2021-06-21 Thread Amit Kapila
On Mon, Jun 21, 2021 at 12:26 PM Brar Piening  wrote:
>
> Hello Hackers,
> while amending Npgsql to account for the Logical Streaming Replication
> Protocol changes in PostgreSQL 14 I stumbled upon two documentation
> inaccuracies in the Logical Replication Message Formats documentation
> (https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html)
> that have been introduced (or rather omitted) with the recent changes to
> allow pgoutput to send logical decoding messages
> (https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec)
> and to allow logical replication to transfer data in binary format
> (https://github.com/postgres/postgres/commit/9de77b5453130242654ff0b30a551c9c862ed661).
>
>
>  1. The content of the logical decoding message in the 'Message' message
> is prefixed with a length field (Int32), which isn't documented yet.
> See
> 
> https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L388
>  2. The TupleData may now contain the byte 'b' as an indicator for binary
> data, which isn't documented yet. See
> 
> https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/include/replication/logicalproto.h#L83
> and
> 
> https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L558.
>
> The attached documentation patch fixes both.
>

Yeah, I think these should be fixed and your patch looks good to me in
that regard.
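
For anyone writing a client decoder, a rough self-contained sketch of the
per-column TupleData handling these changes imply (the read_* helpers are
hypothetical stand-ins, not libpq or Npgsql API):

#include <stdint.h>

/* Hypothetical big-endian readers over a received message buffer. */
static uint8_t
read_uint8(const uint8_t **p)
{
    return *(*p)++;
}

static int32_t
read_int32(const uint8_t **p)
{
    uint32_t v = ((uint32_t) (*p)[0] << 24) | ((uint32_t) (*p)[1] << 16) |
                 ((uint32_t) (*p)[2] << 8) | (uint32_t) (*p)[3];

    *p += 4;
    return (int32_t) v;
}

/*
 * Decode one TupleData column; returns the number of value bytes skipped,
 * or -1 for an unknown column kind.  The same Int32-length-then-bytes
 * layout is used for the content of the 'Message' ('M') message.
 */
static int32_t
decode_column(const uint8_t **p)
{
    switch (read_uint8(p))
    {
        case 'n':               /* SQL NULL */
        case 'u':               /* unchanged TOASTed value, no data follows */
            return 0;
        case 't':               /* text-format value */
        case 'b':               /* binary-format value, new in PG 14 */
        {
            int32_t len = read_int32(p);

            *p += len;
            return len;
        }
        default:
            return -1;
    }
}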

-- 
With Regards,
Amit Kapila.




Re: Pipeline mode and PQpipelineSync()

2021-06-21 Thread Boris Kolpackov
Alvaro Herrera  writes:

> I think I should rephrase this to say that PQpipelineSync() is needed
> where the user needs the server to start executing commands; and
> separately indicate that it is possible (but not promised) that the
> server would start executing commands ahead of time because $reasons.

I think always requiring PQpipelineSync() is fine since it also serves
as an error recovery boundary. But the fact that the server waits until
the sync message to start executing the pipeline is surprising. To me
this seems to run contrary to the idea of a "pipeline".

In fact, I see the following ways the server could behave:

1. The server starts executing queries and sending their results before
   receiving the sync message.

2. The server starts executing queries before receiving the sync message
   but buffers the results until it receives the sync message.

3. The server buffers the queries and only starts executing them and
   sending the results after receiving the sync message.

My observations suggest that the server behaves as (3) but it could
also be (2).

While it can be tempting to say that this is an implementation detail,
this affects the way one writes a client. For example, I currently have
the following comment in my code:

  // Send queries until we get blocked. This feels like a better
  // overall strategy to keep the server busy compared to sending one
  // query at a time and then re-checking if there is anything to read
  // because the results of INSERT/UPDATE/DELETE are presumably small
  // and quite a few of them can get buffered before the server gets
  // blocked.

This would be a good strategy for behavior (1) but not (3) (where it
would make more sense to queue the queries on the client side). So I
think it would be useful to clarify the server behavior and specify
it in the documentation.
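
For reference, the "send until blocked" strategy above looks roughly like
this against the PG 14 libpq API (a sketch only: error handling and the
result-draining side are omitted, and the function itself is made up):

#include <libpq-fe.h>

/*
 * Sketch: queue one-parameter statements into the pipeline until libpq's
 * output buffer would block, then emit a sync point.  Assumes the caller
 * has already called PQsetnonblocking(conn, 1) and PQenterPipelineMode(conn).
 */
static int
send_until_blocked(PGconn *conn, const char *sql,
                   const char *const *params, int nrows)
{
    int sent = 0;

    while (sent < nrows)
    {
        if (PQsendQueryParams(conn, sql, 1, NULL, &params[sent],
                              NULL, NULL, 0) != 1)
            break;              /* error: see PQerrorMessage(conn) */
        sent++;

        /*
         * PQflush() returns 1 when it could not push everything out without
         * blocking; at that point stop queueing and go consume results.
         */
        if (PQflush(conn) == 1)
            break;
    }

    /* execution / error-recovery boundary for what has been queued so far */
    PQpipelineSync(conn);
    return sent;
}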


> Do I have it right that other than this documentation problem, you've
> been able to use pipeline mode successfully?

So far I've only tried it in a simple prototype (single INSERT statement).
But I am busy plugging it into ODB's bulk operation support (that we
already have for Oracle and MSSQL) and once that's done I should be
able to exercise things in more meaningful ways.




Re: Fix for segfault in logical replication on master

2021-06-21 Thread Japin Li


On Mon, 21 Jun 2021 at 16:22, Amit Kapila  wrote:
> On Mon, Jun 21, 2021 at 1:30 PM Japin Li  wrote:
>>
>> On Sat, 19 Jun 2021 at 17:18, Amit Kapila  wrote:
>> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila  
>> > wrote:
>>
>> Or we can free the memory owned by indexoidlist after checking whether it
>> is NIL, because we do not use it later.
>>
>
> Valid point. But I am thinking do we really need to fetch and check
> indexoidlist here?

IMO, we should not fetch and check the indexoidlist here, since we do not
use it.  However, we do need RelationGetIndexList() to update
relation->rd_replidindex, so we have to fetch the indexoidlist; maybe we
can use the following code:

indexoidlist = RelationGetIndexList(relation);
list_free(indexoidlist);

Or is there any function that only updates relation->rd_replidindex
or related fields, but does not fetch the indexoidlist?

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.



