Re: Handing off SLRU fsyncs to the checkpointer

2020-09-18 Thread Thomas Munro
On Mon, Aug 31, 2020 at 8:50 PM Jakub Wartak  wrote:
> - IO_URING - gives a lot of promise here I think, is it even planned to be 
> shown for PgSQL14 cycle ? Or it's more like PgSQL15?

I can't answer that, but I've played around with the prototype quite a
bit, and thought quite a lot about how to port it to systems without
IO_URING, and I'm just as keen to see this happen as you are.

In the meantime, from the low-hanging-fruit department, here's a new
version of the SLRU-fsync-offload patch.  The only changes are a
tweaked commit message, and adoption of C99 designated initialisers
for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
relying on humans to make the array order match the enum values.  If
there are no comments or objections, I'm planning to commit this quite
soon.
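
For anyone unfamiliar with the idiom, a generic sketch of a function table
using designated initialisers keyed by enum values looks like this (an
illustration of the technique only, not the actual table in sync.c; the
typedef and table name here are invented):

/*
 * With designated initialisers each entry is tied to its enum value
 * explicitly, so the table stays correct even if the enum order changes.
 */
typedef int (*SyncFileTagCB) (const FileTag *ftag, char *path);

static const SyncFileTagCB sync_handlers[] = {
    [SYNC_HANDLER_CLOG] = clogsyncfiletag,
    [SYNC_HANDLER_COMMIT_TS] = committssyncfiletag,
    /* ... one entry per SLRU that registers a sync handler ... */
};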
From 0eba29484684a0642c4424741e256b0787051a19 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 19 Sep 2020 16:22:06 +1200
Subject: [PATCH v3] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages due to cache pressure, leading to an I/O stall in user
backends and recovery.  Collapse requests for the same file into a
single system call as part of the next checkpoint, as we already do for
relation files.  This causes a significant improvement to recovery
times.

Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c  | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c|  30 +++-
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  12 ++-
 src/include/storage/sync.h |   7 +-
 13 files changed, 175 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..304612c159 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
   MultiXactOffsetSLRULock, "pg_multixact/offsets",
-  LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SimpleLruInit(MultiXactMemberCtl,
   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
   MultiXactMemberSLRULock, "pg_multixact/members",
-  LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+  SYNC_HANDLER_MULTIXACT_MEMBER);
 
 	/* Initialize our shared state struct */
 	MultiXactState = 

Re: Proposal of new PostgreSQL Extension - PGSpiderExt

2020-09-18 Thread Pavel Stehule
On Sat, Sep 19, 2020 at 0:42, Bruce Momjian  wrote:

> On Wed, Sep 16, 2020 at 05:40:30PM +0200, Pavel Stehule wrote:
> > May be I am wrong, but it seems to me that not so much people know
> about
> > pgxn.org
> > Before writing this mail I have tried to locate such resource in
> Google and
> > didn't succeed.
> >
> > yes, It is not strongly joined with the Postgres community, and it
> doesn't
> > deploy binary (compiled) code if I know it. So for some people it is not
> > usable.
> >
> > But anytime this and similar repositories will have problems, because the
> > extensions there are not reviewed, nobody did security check, nobody did
> QA.
> >
> > This is useful for Postgres developers, for very advanced users, or for
> very
> > fearless users :).
>
> I think if PGXN had more must-have extensions, its popularity would
> increase.
>

There is nothing else.  But it could be much more useful if somebody did
reviews, if it allowed compilation for the target platform on the server
side, or if it had integrated tests, ...

Regards

Pavel



> --
>   Bruce Momjian  https://momjian.us
>   EnterpriseDB https://enterprisedb.com
>
>   The usefulness of a cup is in its emptiness, Bruce Lee
>
>


Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-18 Thread Thomas Munro
On Sat, Sep 19, 2020 at 1:30 PM Tom Lane  wrote:
> I wrote:
> > ISTM that getting rid of the division obviates the concern that the
> > nentries condition is too expensive,

True, that comment needed to go.

> Also, we could make it slightly cheaper yet, by changing the condition
> to
>
> hctl->freeList[0].nentries > (long) (hctl->max_bucket)
>
> ie drop the +1.  I'd argue that this is actually a more faithful
> rendition of the previous condition, since what we had before was
> basically
>
> hctl->freeList[0].nentries >= (long) (hctl->max_bucket + 1)

Agreed, and done.  Thanks!




Re: pg_logging_init() can return ENOTTY with TAP tests

2020-09-18 Thread Michael Paquier
On Fri, Sep 18, 2020 at 09:54:42PM -0400, Tom Lane wrote:
> No, absolutely not.  That way leads to madness, because you will be trying
> to enforce a system-wide property for the benefit of a few places.  There
> is *no code anywhere* that promises to leave errno zero, but what you are
> suggesting will soon lead to a situation where we have to require that of
> everything.  It's not sane, it's not maintainable, and it's going to be
> inefficient as heck, because it will require adding a whole lot more
> "errno = 0" statements than the other way.

Well, then what about cleaning up the state for the binaries where it
matters?  FWIW, this example currently fails with an incorrect error,
but it should succeed:
--- a/contrib/vacuumlo/t/001_basic.pl
+++ b/contrib/vacuumlo/t/001_basic.pl
@@ -2,8 +2,11 @@ use strict;
 use warnings;

 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 9;

 program_help_ok('vacuumlo');
 program_version_ok('vacuumlo');
 program_options_handling_ok('vacuumlo');
+
+command_ok([ 'vacuumlo', '-l', '1', 'postgres' ],
+   'vacuumlo: error: transaction limit must not be negative (0 disables)');

This makes me wonder whether it could be a real problem in some
environments.  vacuumlo is not used that much; pg_resetwal, though...
--
Michael


signature.asc
Description: PGP signature


Re: Online checksums patch - once again

2020-09-18 Thread Justin Pryzby
+ * changed to "inprogress-off", the barrier for mvoving to "off" can be
moving

+ * When disabling checksums, data_checksums will be set of "inprogress-off"
set *to*

+   get_namespace_name(RelationGetNamespace(reln)), 
RelationGetRelationName(reln),

I think this palloc()s a new copy of the namespace every 100 blocks.
Better do it outside the loop.

+   {"inprogress-on", DATA_CHECKSUMS_INPROGRESS_ON, true},
+   {"inprogress-off", DATA_CHECKSUMS_INPROGRESS_OFF, true},

enabling / disabling ?

+typedef enum ChecksumType
+{
+   DATA_CHECKSUMS_OFF = 0,
+   DATA_CHECKSUMS_ON,
+   DATA_CHECKSUMS_INPROGRESS_ON,
+   DATA_CHECKSUMS_INPROGRESS_OFF
+}  ChecksumType;

Should this be a bitmask, maybe:
DATA_CHECKSUMS_WRITE = 1,
DATA_CHECKSUMS_VERIFY = 2,
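
A sketch of what that might look like (purely illustrative, not from the
patch; how the in-progress states would map onto the flags is an assumption):

typedef enum ChecksumFlags
{
    DATA_CHECKSUMS_WRITE = 1 << 0,  /* checksums are computed when pages are written */
    DATA_CHECKSUMS_VERIFY = 1 << 1  /* checksums are verified when pages are read */
} ChecksumFlags;

/* e.g. "on" = WRITE | VERIFY, and "inprogress-on" = WRITE only */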

It occurred to me that you could rephrase this patch as "Make checksum state
per-relation rather than cluster granularity".  That's currently an
implementation detail, but could be exposed as a feature.  That could be a
preliminary 0001 patch.  Half the existing patch would be 0002 "Allow
online enabling of checksums for a given relation/database/cluster".  You might
save some of the existing effort of synchronizing the cluster-wide checksum
state, since it doesn't need to be synchronized.  The "data_checksums" GUC
might be removed, or changed to an enum: on/off/per_relation.  Moving from
"per_relation" to "on" would be an optional metadata-only change, allowed only
when all rels in all dbs are checksummed.  I'm not sure if you'd even care about
temp tables, since "relhaschecksum" would be authoritative, rather than a
secondary bit only used during processing.

XLogHintBitIsNeeded() and DataChecksumsEnabled() would need to check
relhaschecksum, which tentatively seems possible.

I'm not sure if it's possible, but maybe pg_checksums would be able to skip
rels which had already been checksummed "online" (with an option to force
reprocessing).

Maybe some people would want (no) checksums on specific tables, and that could
eventually be implemented as 0003: "ALTER TABLE SET checksums=".  I'm thinking
of that being used immediately after a CREATE, but I suppose ON would cause
the backend to rewrite the table with checksums synchronously (not in the BGW),
either with AEL or by calling ProcessSingleRelationByOid().

-- 
Justin




Re: OpenSSL 3.0.0 compatibility

2020-09-18 Thread Michael Paquier
On Fri, Sep 18, 2020 at 04:11:13PM +0200, Daniel Gustafsson wrote:
> Since we support ciphers that are now deprecated, we have no other choice than
> to load the legacy provider.

Ah, thanks.  I actually tried something similar to that when I had my
mind on it by loading the legacy providers, but missed the padding
part.  Nice find.

> The other problem was that the cipher context
> padding must be explicitly set, whereas in previous versions relying on the
> default worked fine.  EVP_CIPHER_CTX_set_padding always returns 1, so that's why
> it isn't checking the return value like the other nearby initialization calls.

It seems to me that it would still be a good idea to check the
return value of EVP_CIPHER_CTX_set_padding() and just return
PXE_CIPHER_INIT on failure.  Looking at the upstream code, it is true
that it always returns true for <= 1.1.1, but that's not the case for
3.0.0.  Some upstream code paths also check it.
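
Something along these lines, perhaps (a rough sketch only; in practice the
legacy provider would presumably be loaded once at initialization rather
than per cipher setup, and the function name here is invented):

#include <openssl/evp.h>
#include <openssl/provider.h>

static int
init_legacy_cipher(EVP_CIPHER_CTX *ctx, const EVP_CIPHER *cipher,
                   const unsigned char *key, const unsigned char *iv)
{
    /* OpenSSL 3.0.0 moves the deprecated ciphers into the "legacy" provider */
    if (OSSL_PROVIDER_load(NULL, "legacy") == NULL)
        return PXE_CIPHER_INIT;
    if (!EVP_EncryptInit_ex(ctx, cipher, NULL, key, iv))
        return PXE_CIPHER_INIT;
    /* can fail in 3.0.0, so check it like the other initialization calls */
    if (!EVP_CIPHER_CTX_set_padding(ctx, 0))
        return PXE_CIPHER_INIT;
    return 0;
}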

Also, what's the impact of disabling the padding for <= 1.1.1?  This
part of the upstream code is still a bit obscure to me...

> To avoid problems with the OPENSSL_VERSION_NUMBER macro, which LibreSSL overloads
> (and which is also deprecated in 3.0.0), I used the new macro which is only set in
> 3.0.0. Not sure if that's considered acceptable or if we should invent our own
> version macro in autoconf.

OSSL_PROVIDER_load() is new as of 3.0.0, so using a configure switch
similarly as what we do for the other functions should be more
consistent and enough, no?

> For the main SSL tests, the incorrect password test has a new errormessage
> which is added in 0002.

Hmm.  I am linking to a build of alpha6 here, but I still see the
error being reported as a bad decrypt for this test.  Interesting. 
--
Michael


signature.asc
Description: PGP signature


Re: pg_logging_init() can return ENOTTY with TAP tests

2020-09-18 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Sep 18, 2020 at 09:56:14AM -0400, Tom Lane wrote:
>> No, this is a bad idea.  The right place to fix this is whatever code
>> segment is relying on errno to be initially zero; that is NEVER a sane
>> assumption.  That is, a valid coding pattern is something like

> So for now, what about looking at all those code paths and enforce
> errno to 0?

No, absolutely not.  That way leads to madness, because you will be trying
to enforce a system-wide property for the benefit of a few places.  There
is *no code anywhere* that promises to leave errno zero, but what you are
suggesting will soon lead to a situation where we have to require that of
everything.  It's not sane, it's not maintainable, and it's going to be
inefficient as heck, because it will require adding a whole lot more
"errno = 0" statements than the other way.

regards, tom lane



> --
> Michael




Re: pg_logging_init() can return ENOTTY with TAP tests

2020-09-18 Thread Michael Paquier
On Fri, Sep 18, 2020 at 09:56:14AM -0400, Tom Lane wrote:
> No, this is a bad idea.  The right place to fix this is whatever code
> segment is relying on errno to be initially zero; that is NEVER a sane
> assumption.  That is, a valid coding pattern is something like

It seems to me that it could be a good idea to enforce a rule here
then.  There are problems in pg_resetwal and vacuumlo, which use
strto[u]l() for option parsing without caring whether errno is already
set or not, and we could have an ENOTTY while parsing options.  There
is more in the backend code, like assign_recovery_target_timeline() in
guc.c, bootstrap.c, to name just a few.  We don't need to go as far as
having our own wrapper for strtol(), though.  There were discussions
lately about entirely removing our dependency on strtol(), because we
mostly rely on base 10 for the parsing and that's the part impacting
performance the most (it is possible to see a clear difference for
some workloads of COPY once you remove the base part).

So for now, what about looking at all those code paths and enforce
errno to 0?
--
Michael


signature.asc
Description: PGP signature


Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-18 Thread Tom Lane
I wrote:
> ISTM that getting rid of the division obviates the concern that the
> nentries condition is too expensive,

Also, we could make it slightly cheaper yet, by changing the condition
to

hctl->freeList[0].nentries > (long) (hctl->max_bucket)

ie drop the +1.  I'd argue that this is actually a more faithful
rendition of the previous condition, since what we had before was
basically

hctl->freeList[0].nentries >= (long) (hctl->max_bucket + 1)

regards, tom lane




Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-18 Thread Tom Lane
Thomas Munro  writes:
> Pushed.  Thanks Jakub, everyone.

BTW, looking over this patch, I wonder about

/*
 * Can't split if running in partitioned mode, nor if frozen, nor if
 * table is the subject of any active hash_seq_search scans.  Strange
 * order of these tests is to try to check cheaper conditions first.
 */
if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
hctl->freeList[0].nentries > (long) (hctl->max_bucket + 1) &&
!has_seq_scans(hashp))
(void) expand_table(hashp);

ISTM that getting rid of the division obviates the concern that the
nentries condition is too expensive, and therefore we should revert
to checking it first, on the grounds that that condition is most
likely to fail.

regards, tom lane




Re: fixing old_snapshot_threshold's time->xid mapping

2020-09-18 Thread Robert Haas
On Thu, Sep 17, 2020 at 10:40 AM Robert Haas  wrote:
> Yeah, I plan to push forward with 0001 through 0003 soon, but 0001
> needs to be revised with a PGDLLIMPORT marking, I think, and 0002
> needs documentation.

So here's an updated version of those three, with proposed commit
messages, a PGDLLIMPORT for 0001, and docs for 0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v9-0003-Fix-two-bugs-in-MaintainOldSnapshotTimeMapping.patch
Description: Binary data


v9-0002-Add-new-old_snapshot-contrib-module.patch
Description: Binary data


v9-0001-Expose-oldSnapshotControl-definition-via-new-head.patch
Description: Binary data


Re: factorial function/phase out postfix operators?

2020-09-18 Thread Robert Haas
On Fri, Sep 18, 2020 at 4:48 PM Tom Lane  wrote:
> Pushed with the discussed terminological changes and some other
> fooling about, including fixing the documentation.

Awesome. Thanks!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-18 Thread Thomas Munro
On Tue, Sep 15, 2020 at 9:56 AM Thomas Munro  wrote:
> > I looked over the patch and the only thing I saw was that we might
> > also want to remove the following line:
> >
> > #define DEF_FFACTOR1 /* default fill factor */
>
> Right, thanks.  Fixed in the attached.

Pushed.  Thanks Jakub, everyone.

Separately, we really should tidy up the int/long/uint32/size_t
confusion in this module.  I know we have K&R C-era long-itude in
numerous other modules, but this one is a little more egregious in its
data type mixing.




Re: pg_service.conf file with unknown parameters

2020-09-18 Thread Bruce Momjian
On Fri, Sep 11, 2020 at 02:39:51PM +0200, Magnus Hagander wrote:
> Right now, pg_service.conf returns "syntax error" if it encounters a parameter
> it doesn't know about.
> 
> This seems user-unfriendly, both in the error message (because it really isn't
> a syntax error) and in the behaviour itself (because it doesn't work when
> sometimes it should).
> 
> For example, if I have a service file with gssencmode=disable set, that 
> service
> file cannot be used by a psql client linked against libpq from version 10. 
> Even
> if the behavior would be identical (since v10 doesn't have gssencmode).
> 
> Is there a particular reason we (1) refuse unknown parameters in the file, and
> (2) call it a "syntax error"?
> 
> The documentation just says it's "INI format" file -- though in my experience
> most other INI file parsers just ignore extra parameters included..

My guess is that because the file can contain passwords, we want to
report as little as possible on error.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Proposal of new PostgreSQL Extension - PGSpiderExt

2020-09-18 Thread Bruce Momjian
On Wed, Sep 16, 2020 at 05:40:30PM +0200, Pavel Stehule wrote:
> May be I am wrong, but it seems to me that not so much people know about
> pgxn.org
> Before writing this mail I have tried to locate such resource in Google 
> and
> didn't succeed.
> 
> yes, It is not strongly joined with the Postgres community, and it doesn't
> deploy binary (compiled) code if I know it. So for some people it is not
> usable.
> 
> But anytime this and similar repositories will have problems, because the
> extensions there are not reviewed, nobody did security check, nobody did QA.
> 
> This is useful for Postgres developers, for very advanced users, or for very
> fearless users :).

I think if PGXN had more must-have extensions, its popularity would
increase.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Proposals for making it easier to write correct bgworkers

2020-09-18 Thread Bruce Momjian
On Thu, Sep 10, 2020 at 11:02:07AM +0800, Craig Ringer wrote:
> Hi all
> 
> As I've gained experience working on background workers, it's become
> increasingly clear that they're a bit too different to normal backends for 
> many
> nontrivial uses.
> 
> I thought I'd take a moment to note some of it here, along with some proposals
> for things we could potentially do to make it much easier to use bgworkers
> correctly especially when using them to run queries.
> 
> This is NOT A PATCH SET. It's a set of discussion proposals and it's also
> intended as a bit of a helper for people just getting started on bgworkers.
> There are a lot of subtle differences in the runtime environment a basic
> bgworker provides vs the runtime environment extension authors will be used to
> when writing fmgr-callable C functions.
> 
> (It looks like pg12 and pg13 have some improvements, so some of the issues I
> was going to mention with error cleanup paths and locking aren't relevant
> anymore.)
> 
> DIFFERENCES WHEN CODING FOR BGWORKERS

Can we put this information somewhere in our docs or source code as a
README?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-18 Thread Tom Lane
Robert Haas  writes:
> Cool, thanks to both of you for looking. Committed.

Hmph, according to whelk this is worse not better:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk=2020-09-18%2017%3A42%3A11

I'm at a loss to understand what's going on there.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-09-18 Thread Tomas Vondra

On Fri, Sep 18, 2020 at 05:06:49PM -0400, John Naylor wrote:

> On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
>  wrote:
>
>
> On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
>I wrote:
>
>> Hmm, I came across that paper while doing background reading. Okay,
>> now I get that "% (filter->nbits - 1)" is the second hash function in
>> that scheme. But now I wonder if that second function should actually
>> act on the passed "value" (the original hash), so that they are
>> actually independent, as required. In the language of that paper, the
>> patch seems to have
>>
>> g(x) = h1(x) + i*h2(h1(x)) + f(i)
>>
>> instead of
>>
>> g(x) = h1(x) + i*h2(x) + f(i)
>>
>> Concretely, I'm wondering if it should be:
>>
>>  big_h = DatumGetUint32(hash_uint32(value));
>>  h = big_h % filter->nbits;
>> -d = big_h % (filter->nbits - 1);
>> +d = value % (filter->nbits - 1);
>>
>> But I could be wrong.
>
>I'm wrong -- if we use different operands to the moduli, we throw away
>the assumption of co-primeness. But I'm still left wondering why we
>have to re-hash the hash for this to work. In any case, there should
>be some more documentation around the core algorithm, so that future
>readers are not left scratching their heads.
>

>> Hmm, good question. I think we don't really need to hash it twice. It
>> does not really achieve anything - it won't reduce the number of collisions
>> or anything like that.


> Yeah, looking back at the discussion you linked previously, I think
> it's a holdover from when the uint32 was rehashed with k different
> seeds. Anyway, after thinking about it some more, I still have doubts
> about the mapping algorithm. There are two stages to a hash mapping --
> hashing and modulus. I don't think a single hash function (whether
> rehashed or not) can be turned into two independent functions via a
> choice of second modulus. At least, that's not what the Kirsch &
> Mitzenmacher paper is claiming. Since we're not actually applying two
> independent hash functions on the scan key, we're kind of shooting in
> the dark.



OK. I admit the modulo by nbits and (nbits - 1) is a bit suspicious, so
you may be right that this is not quite a correct construction.

The current scheme was meant to reduce the number of expensive hashing
calls (especially for low fpr values we may require quite a few of
those, easily 10 or more).

But maybe we could still use this scheme by actually computing

   h1 = hash_uint32_extended(value, seed1);
   h2 = hash_uint32_extended(value, seed2);

and then use this as the independent hash functions. I think that would
meet the requirements of the paper.
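
For illustration, a rough sketch of that idea (not from the patch; the
seeds and names are invented), building the k probe positions from two
independently seeded hashes as g_i(x) = h1(x) + i * h2(x):

#include "common/hashfn.h"

static void
bloom_bit_positions(uint32 value, int k, int nbits, int *pos)
{
    uint64      h1 = DatumGetUInt64(hash_uint32_extended(value, 0));
    uint64      h2 = DatumGetUInt64(hash_uint32_extended(value, 1));

    for (int i = 0; i < k; i++)
        pos[i] = (h1 + (uint64) i * h2) % nbits;
}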


> It turns out there is something called a one-hash bloom filter, and
> the paper in [1] has a straightforward algorithm. Since we can
> implement it exactly as stated in the paper, that gives me more
> confidence in the real-world false positive rate. It goes like this:
>
> Partition the filter bitmap into k partitions of similar but unequal
> length, corresponding to consecutive prime numbers. Use the primes for
> moduli of the uint32 value and map it to the bit of the corresponding
> partition. For a simple example, let's use 7, 11, 13 for partitions in
> a filter of size 31. The three bits are:
>
> value % 7
> 7 + (value % 11)
> 7 + 11 + (value % 13)
>
> We could store a const array of the first 256 primes. The largest such
> prime is 1613, so with k=7 we can support up to ~11k bits, which is
> more than we'd like to store anyway. Then we store the array index of
> the largest prime in the 8bits of padding we currently have in
> BloomFilter struct.



Why would 11k bits be more than we'd like to store? Assuming we could
use the whole 8kB page for the bloom filter, that'd be about 64k bits.
In practice there'd be a bit of overhead (page header ...) but it's
still much more than 11k bits. But I guess we can simply make the table
of primes a bit larger, right?

FWIW I don't think we need to be that careful about the space to store
stuff in padding etc. If we can - great, but compared to the size of the
filter it's negligible and I'd prioritize simplicity over a byte or two.


> One wrinkle is that the sum of k primes is not going to match m
> exactly. If the sum is too small, we can trim a few bits off of the
> filter bitmap. If the sum is too large, the last partition can spill
> into the front of the first one. This shouldn't matter much in the
> common case since we need to round m to the nearest byte anyway.



AFAIK the paper simply says that as long as the sum of partitions is
close to the requested nbits, it's good enough. So I guess we could just
roll with that, no need to trim/wrap or something like that.


> This should be pretty straightforward to turn into code and I can take
> a stab at it. Thoughts?



Sure, go ahead. I'm happy someone is actually looking at those patches
and proposing alternative solutions, and this might easily be a better
hashing scheme.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-18 Thread Bruce Momjian
On Wed, Sep  9, 2020 at 03:27:42PM -0300, Alvaro Herrera wrote:
> On 2020-Sep-09, Fujii Masao wrote:
> 
> > On 2020/09/09 14:15, Etsuro Fujita wrote:
> > > Hi,
> > > 
> > > Attached is a patch to standardize Japanese names as given-name-first
> > > in the v13 contributors list as before.
> > 
> > Using given-name-first order is our consensus? I was thinking we have not
> > reached that yet and our "vague" consensus was to use the name that each
> > contributor prefers, for example the name that used in the email signature, 
> > etc.
> 
> That's indeed our historical practice.  See previous thread where we've
> discussed this at length,
> https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2
> 
> The Economist piece Peter G cited is also relevant.
> 
> The commit Peter E cited seems more anecdotical than precedence-setting,
> since there was no actual discussion, and whatever little there was was
> confined to pgsql-committers.

This thread from 2015 is the most comprehensive discussion I remember of
Japanese name ordering, including a suggestion to use small caps:


https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2

I have been following this guidance ever since.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: XversionUpgrade tests broken by postfix operator removal

2020-09-18 Thread Tom Lane
I wrote:
> Andrew Dunstan  writes:
>> Yeah, probably worth doing. It's a small enough change and it's only in
>> the test suite.

> OK, I'll go take care of that in a bit.

Done, you should be able to remove @#@ (NONE, bigint) from the
kill list.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-09-18 Thread John Naylor
On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
 wrote:
>
> On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
> >I wrote:
> >
> >> Hmm, I came across that paper while doing background reading. Okay,
> >> now I get that "% (filter->nbits - 1)" is the second hash function in
> >> that scheme. But now I wonder if that second function should actually
> >> act on the passed "value" (the original hash), so that they are
> >> actually independent, as required. In the language of that paper, the
> >> patch seems to have
> >>
> >> g(x) = h1(x) + i*h2(h1(x)) + f(i)
> >>
> >> instead of
> >>
> >> g(x) = h1(x) + i*h2(x) + f(i)
> >>
> >> Concretely, I'm wondering if it should be:
> >>
> >>  big_h = DatumGetUint32(hash_uint32(value));
> >>  h = big_h % filter->nbits;
> >> -d = big_h % (filter->nbits - 1);
> >> +d = value % (filter->nbits - 1);
> >>
> >> But I could be wrong.
> >
> >I'm wrong -- if we use different operands to the moduli, we throw away
> >the assumption of co-primeness. But I'm still left wondering why we
> >have to re-hash the hash for this to work. In any case, there should
> >be some more documentation around the core algorithm, so that future
> >readers are not left scratching their heads.
> >
>
> Hmm, good question. I think we don't really need to hash it twice. It
> does not really achieve anything - it won't reduce the number of collisions
> or anything like that.

Yeah, looking back at the discussion you linked previously, I think
it's a holdover from when the uint32 was rehashed with k different
seeds. Anyway, after thinking about it some more, I still have doubts
about the mapping algorithm. There are two stages to a hash mapping --
hashing and modulus. I don't think a single hash function (whether
rehashed or not) can be turned into two independent functions via a
choice of second modulus. At least, that's not what the Kirsch &
Mitzenmacher paper is claiming. Since we're not actually applying two
independent hash functions on the scan key, we're kind of shooting in
the dark.

It turns out there is something called a one-hash bloom filter, and
the paper in [1] has a straightforward algorithm. Since we can
implement it exactly as stated in the paper, that gives me more
confidence in the real-world false positive rate. It goes like this:

Partition the filter bitmap into k partitions of similar but unequal
length, corresponding to consecutive prime numbers. Use the primes for
moduli of the uint32 value and map it to the bit of the corresponding
partition. For a simple example, let's use 7, 11, 13 for partitions in
a filter of size 31. The three bits are:

value % 7
7 + (value % 11)
7 + 11 + (value % 13)

We could store a const array of the first 256 primes. The largest such
prime is 1613, so with k=7 we can support up to ~11k bits, which is
more than we'd like to store anyway. Then we store the array index of
the largest prime in the 8bits of padding we currently have in
BloomFilter struct.

One wrinkle is that the sum of k primes is not going to match m
exactly. If the sum is too small, we can trim a few bits off of the
filter bitmap. If the sum is too large, the last partition can spill
into the front of the first one. This shouldn't matter much in the
common case since we need to round m to the nearest byte anyway.

This should be pretty straightforward to turn into code and I can take
a stab at it. Thoughts?

[1] https://www.researchgate.net/publication/284283336_One-Hashing_Bloom_Filter
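
A minimal sketch of the partition mapping described above (the names and
the prime table are illustrative only, not from any posted patch):

static const int bloom_primes[] = {7, 11, 13, 17, 19, 23, 29};

/* Map a 32-bit hash value to k bit positions, one per prime-sized partition. */
static void
one_hash_positions(uint32 value, int k, int *pos)
{
    int         offset = 0;

    for (int i = 0; i < k; i++)
    {
        pos[i] = offset + (int) (value % bloom_primes[i]);
        offset += bloom_primes[i];      /* next partition starts after this one */
    }
}

With k = 3 and primes 7, 11, 13 this yields exactly the three bits shown
above: value % 7, 7 + (value % 11), and 18 + (value % 13).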

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: factorial function/phase out postfix operators?

2020-09-18 Thread Tom Lane
Pushed with the discussed terminological changes and some other
fooling about, including fixing the documentation.

regards, tom lane




Re: XversionUpgrade tests broken by postfix operator removal

2020-09-18 Thread Tom Lane
Andrew Dunstan  writes:
> On 9/18/20 4:25 PM, Tom Lane wrote:
>> Hmm, that's not a postfix operator ... oh, it's because it depends on the
>> numeric_fac function alias which we also removed.  We could eliminate
>> the need to drop it if we changed the definition to use "factorial"
>> instead of "numeric_fac" in all the back branches.  Not sure if that's
>> a better solution or not.  Might be worth doing, because in the older
>> branches that's the only user-defined prefix operator, so we're missing
>> some pg_upgrade test coverage if we just drop it.

> Yeah, probably worth doing. It's a small enough change and it's only in
> the test suite.

OK, I'll go take care of that in a bit.

regards, tom lane




Re: XversionUpgrade tests broken by postfix operator removal

2020-09-18 Thread Andrew Dunstan


On 9/18/20 4:25 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 9/18/20 10:39 AM, Tom Lane wrote:
>>> I intentionally let that happen, figuring that it'd be good to get some
>>> buildfarm cycles on the new code in pg_upgrade that does this check.
>>> But now we have to think about updating the test.  I think the best
>>> bet is just to add some DROP OPERATOR commands to the existing
>>> cleanup logic in TestUpgradeXversion.pm.  It looks like this would
>>> do the trick:
>>>
>>> drop operator #@# (bigint,NONE);
>>> drop operator #%# (bigint,NONE);
>>> drop operator !=- (bigint,NONE);
>>> drop operator #@%# (bigint,NONE);
>> Almost. I had to remove one more operator.
> Hmm, that's not a postfix operator ... oh, it's because it depends on the
> numeric_fac function alias which we also removed.  We could eliminate
> the need to drop it if we changed the definition to use "factorial"
> instead of "numeric_fac" in all the back branches.  Not sure if that's
> a better solution or not.  Might be worth doing, because in the older
> branches that's the only user-defined prefix operator, so we're missing
> some pg_upgrade test coverage if we just drop it.
>
>   


Yeah, probably worth doing. It's a small enough change and it's only in
the test suite.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: XversionUpgrade tests broken by postfix operator removal

2020-09-18 Thread Tom Lane
Andrew Dunstan  writes:
> On 9/18/20 10:39 AM, Tom Lane wrote:
>> I intentionally let that happen, figuring that it'd be good to get some
>> buildfarm cycles on the new code in pg_upgrade that does this check.
>> But now we have to think about updating the test.  I think the best
>> bet is just to add some DROP OPERATOR commands to the existing
>> cleanup logic in TestUpgradeXversion.pm.  It looks like this would
>> do the trick:
>> 
>> drop operator #@# (bigint,NONE);
>> drop operator #%# (bigint,NONE);
>> drop operator !=- (bigint,NONE);
>> drop operator #@%# (bigint,NONE);

> Almost. I had to remove one more operator.

Hmm, that's not a postfix operator ... oh, it's because it depends on the
numeric_fac function alias which we also removed.  We could eliminate
the need to drop it if we changed the definition to use "factorial"
instead of "numeric_fac" in all the back branches.  Not sure if that's
a better solution or not.  Might be worth doing, because in the older
branches that's the only user-defined prefix operator, so we're missing
some pg_upgrade test coverage if we just drop it.

regards, tom lane




Re: XversionUpgrade tests broken by postfix operator removal

2020-09-18 Thread Andrew Dunstan


On 9/18/20 10:39 AM, Tom Lane wrote:
> Unsurprisingly, commit 1ed6b8956 has led to the buildfarm's
> cross-version upgrade tests no longer working, eg
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2020-09-18%2013%3A06%3A52
>
> Checking for reg* data types in user tables ok
> Checking for contrib/isn with bigint-passing mismatch   ok
> Checking for user-defined postfix operators fatal
> Your installation contains user-defined postfix operators, which are not
> supported anymore.  Consider dropping the postfix operators and replacing
> them with prefix operators or function calls.
> A list of user-defined postfix operators is in the file:
> postfix_ops.txt
>
> I intentionally let that happen, figuring that it'd be good to get some
> buildfarm cycles on the new code in pg_upgrade that does this check.
> But now we have to think about updating the test.  I think the best
> bet is just to add some DROP OPERATOR commands to the existing
> cleanup logic in TestUpgradeXversion.pm.  It looks like this would
> do the trick:
>
> drop operator #@# (bigint,NONE);
> drop operator #%# (bigint,NONE);
> drop operator !=- (bigint,NONE);
> drop operator #@%# (bigint,NONE);
>
> We could shorten this list a bit by changing create_operator.sql
> in the back branches --- most of these have no particular reason
> to be postfix rather than prefix operators.  I would not want to
> remove them all, though, since that'd result in loss of test
> coverage for postfix operators in the back branches.
>
>   


Almost. I had to remove one more operator.

Here's the hot fix:


I have been working in recent days towards a release - this will hurry
it up.


cheers


andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Memory allocation abstraction in pgcrypto

2020-09-18 Thread Daniel Gustafsson
pgcrypto has an abstraction for palloc/pfree via PX_OWN_ALLOC with the
intention to make it easy to swap out for another allocator.  There are however
a number of palloc calls that have snuck in over the years, so the abstraction
is leaking a fair bit, making it less useful.  Since there have been no
complaints that I can see (searching the archives for PX_OWN_ALLOC yields zero
threads, and px_alloc none about this topic), maybe it's time to remove it
and simplify the code?

The attached removes it in favor of using palloc() et al. directly.  Is there
any reason to keep it around still?
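
For context, the abstraction in question is (roughly -- quoted from memory,
so treat as an approximation of px.h) just a thin macro layer over palloc
when PX_OWN_ALLOC is not defined:

#ifndef PX_OWN_ALLOC
#define px_alloc(s)      palloc(s)
#define px_realloc(p, s) repalloc((p), (s))
#define px_free(p)       pfree(p)
#endif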

cheers ./daniel



0001-Remove-memory-allocation-abstraction-in-pgcrypto.patch
Description: Binary data


Re: factorial function/phase out postfix operators?

2020-09-18 Thread Tom Lane
Robert Haas  writes:
> On Fri, Sep 18, 2020 at 2:11 PM Tom Lane  wrote:
>> What I now propose is to add two output columns:
>> 
>> barelabel bool  (t or f, obviously)
>> baredesc text   ("can be bare label" or "requires AS", possibly localized)

> That might be over-engineered in a vacuum, but it seems like it may be
> cleaner to stick with the existing precedent than to diverge from it.

Yeah, my recollection of the pg_get_keywords design is that we couldn't
agree on whether to emit a machine-friendly description or a
human-friendly one, so we compromised by doing both :-(.  But the same
factors exist with this addition --- you can make an argument for
preferring either boolean or text output.

regards, tom lane




Re: factorial function/phase out postfix operators?

2020-09-18 Thread Robert Haas
On Fri, Sep 18, 2020 at 2:11 PM Tom Lane  wrote:
> After re-reading the description of pg_get_keywords, I was reminded that
> what it outputs now is intended to provide both a machine-friendly
> description of the keyword category ("catcode") and a human-friendly
> description ("catdesc").  So we really should do likewise for the
> label property.  What I now propose is to add two output columns:
>
> barelabel bool  (t or f, obviously)
> baredesc text   ("can be bare label" or "requires AS", possibly localized)

That might be over-engineered in a vacuum, but it seems like it may be
cleaner to stick with the existing precedent than to diverge from it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: factorial function/phase out postfix operators?

2020-09-18 Thread Tom Lane
Robert Haas  writes:
> On Fri, Sep 18, 2020 at 11:29 AM Tom Lane  wrote:
>> As for what to expose in pg_get_keywords, I think something like
>> "label_requires_as bool" would be immediately understandable.
>> If you really want it to be an enum sort of thing, maybe the output
>> column title could be "collabel" with values "bare" or "requires_AS".

> It's sort of possible to be confused by "label requires as" since "as"
> is being used as a noun but isn't really one generally speaking, but
> we can't very well quote it so I don't know how to make it more clear.

After re-reading the description of pg_get_keywords, I was reminded that
what it outputs now is intended to provide both a machine-friendly
description of the keyword category ("catcode") and a human-friendly
description ("catdesc").  So we really should do likewise for the
label property.  What I now propose is to add two output columns:

barelabel bool  (t or f, obviously)
baredesc text   ("can be bare label" or "requires AS", possibly localized)

Feel free to bikeshed on those details.

regards, tom lane




Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2020-09-18 Thread Fujii Masao




On 2020/09/18 9:30, Thomas Munro wrote:

> On Thu, Sep 17, 2020 at 10:47 PM Heikki Linnakangas  wrote:

>> Hmm, so for speedy response to postmaster death, you're relying on the
>> loops to have other postmaster-death checks besides
>> HandleStartupProcInterrupts(), in the form of WL_EXIT_ON_PM_DEATH. That
>> seems a bit fragile, at the very least it needs a comment in
>> HandleStartupProcInterrupts() to call it out.


> Surely that's the direction we want to go in, though, no?  Commit
> cfdf4dc4 was intended to standardise the way we react to postmaster
> death where waiting is involved.  I updated the comment in
> HandleStartupProcInterrupts() to highlight that the
> PostmasterIsAlive() check in there is only for the benefit of
> CPU-bound loops.


>> Note that there's one more loop in ShutdownWalRcv() that uses pg_usleep().


> Updating that one required me to invent a new wait_event for
> pg_stat_activity, which seems like progress.
>
> Unfortunately, while I was doing that I realised that WaitLatch()
> without WL_SET_LATCH was broken by commit 3347c982bab (in master
> only), in a way not previously reached by the tests.  So 0001 is a
> patch to fix that.


-   pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
-   pg_usleep(100L);/* 1000 ms */
-   pgstat_report_wait_end();
+   WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 1000,
+ WAIT_EVENT_RECOVERY_PAUSE);

This change may cause up to a one-second delay in responding to a standby
promotion request during WAL replay pause? It's only one second,
but I'd basically like to avoid this (unnecessary) wait, to shorten the
failover time as much as possible. So what about using WL_SET_LATCH here?

But when using WL_SET_LATCH, one concern is that walreceiver can
wake up the startup process too frequently even during WAL replay pause.
Is this also problematic? I'm OK with it, but if others are not, using
pg_usleep() as the original code does might be better.
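
For reference, a sketch of the latch-based variant being discussed (the
flag is spelled WL_LATCH_SET in latch.h; which latch the walreceiver or
a promotion request would actually set is exactly the open question above):

/* sketch only: wake up on latch set, timeout, or postmaster death */
(void) WaitLatch(MyLatch,
                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 1000L,         /* keep the 1 s cap from the hunk above */
                 WAIT_EVENT_RECOVERY_PAUSE);
ResetLatch(MyLatch);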

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: factorial function/phase out postfix operators?

2020-09-18 Thread Robert Haas
On Fri, Sep 18, 2020 at 11:29 AM Tom Lane  wrote:
> I confess to not having paid very close attention to this thread
> lately, but the last I'd noticed the terminology proposed for
> internal use was "bare column label", which I think is much better.

I agree.

> As for what to expose in pg_get_keywords, I think something like
> "label_requires_as bool" would be immediately understandable.
> If you really want it to be an enum sort of thing, maybe the output
> column title could be "collabel" with values "bare" or "requires_AS".

It's sort of possible to be confused by "label requires as" since "as"
is being used as a noun but isn't really one generally speaking, but
we can't very well quote it so I don't know how to make it more clear.

> So I'm thinking about making these changes in gram.y:
>
> ImplicitAlias -> BareColLabel
> implicit_alias_keyword -> bare_label_keyword
>
> and corresponding terminology changes elsewhere.

+1.

Thanks for picking this up; I am pretty excited about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-18 Thread Robert Haas
On Wed, Sep 16, 2020 at 10:27 PM Ashutosh Sharma  wrote:
> > This is OK by me.
>
> Looks good to me too.

Cool, thanks to both of you for looking. Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Generic type subscripting

2020-09-18 Thread Pavel Stehule
On Fri, Sep 18, 2020 at 1:01 PM, Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> > On Thu, Sep 17, 2020 at 05:19:19PM +0200, Pavel Stehule wrote:
> >
> > ok, then I think we can design some workable behaviour
> >
> > My first rule - there should not be any implicit action that shifts
> > positions in the array. It can be explicit, but not implicit. It is true
> > for positive indexes, and it should be true for negative indexes too.
> >
> > then I think something like this can work
> >
> > if (idx < 0)
> > {
> >   if (abs(idx) > length of array)
> >     exception("index is out of range");
> >   array[length of array + idx] := value;
> > }
> > else
> > {
> >    /* known behaviour for positive indexes */
> > }
>
> In this way (returning an error on negative indices bigger than the
> number of elements) the functionality for assigning via subscripting will
> already differ significantly from the original one via jsonb_set. Which
> in turn could cause a new wave of something similar to "why assigning an
> SQL NULL as a value returns NULL instead of jsonb?". Taking into account
> that this is not an absolutely new interface, but rather a convenient
> shortcut for the existing one, it probably makes sense to try to find a
> balance between both consistency with regular array and similarity with
> already existing jsonb modification functions.
>
> Having said that, my impression is that this balance should not be fully
> shifted towards consistency with the regular array type, as jsonb array
> and regular array are fundamentally different in terms of
> implementation. If any differences are of concern, they should be
> addressed at a different level. At the same time I've already sort of given
> up on this patch in the form I wanted to see it anyway, so anything goes
> if it helps bring it to the finish point. In case if there would be no
> more arguments from other involved sides, I can post the next version
> with your suggestion included.
>

This is a relatively new interface, and at this moment we can decide whether
it will be consistent or not.  I don't have a problem with having different
functions with different behaviors, but I don't like one interface with
slightly different behaviors for different types. I understand your
argument about implementing a lighter interface to some existing API, but I
think consistency, to the maximum possible extent (where it makes sense),
should be more important.

For me "jsonb" can be a very fundamental type in PLpgSQL development - it
can bring a lot of dynamic to this environment (it can work perfectly like
PL/SQL collection or like Perl dictionary), but for this purpose the
behaviour should be well consistent without surprising elements.

Regards

Pavel


Re: factorial function/phase out postfix operators?

2020-09-18 Thread Mark Dilger



> On Sep 18, 2020, at 8:29 AM, Tom Lane  wrote:
> 
> So I've finished up applying 0001 and started to look at 0002
> ... and I find the terminology you've chosen to be just really
> opaque and confusing.  "aliastype" being "implicit" or "explicit"
> is not going to make any sense to anyone until they read the
> manual, and it probably still won't make sense after that.
> 
> In the first place, the terminology we use for these things
> is usually "column label", not "alias"; see e.g.
> https://www.postgresql.org/docs/devel/queries-select-lists.html#QUERIES-COLUMN-LABELS
> Likewise, gram.y itself refers to the construct as a ColLabel.
> Aliases are things that appear in the FROM clause.
> 
> In the second place, "implicit" vs "explicit" just doesn't make
> any sense to me.  You could maybe say that the AS is implicit
> when you omit it, but the column label is surely not implicit;
> it's right there where you wrote it.
> 
> I confess to not having paid very close attention to this thread
> lately, but the last I'd noticed the terminology proposed for
> internal use was "bare column label", which I think is much better.
> As for what to expose in pg_get_keywords, I think something like
> "label_requires_as bool" would be immediately understandable.
> If you really want it to be an enum sort of thing, maybe the output
> column title could be "collabel" with values "bare" or "requires_AS".
> 
> So I'm thinking about making these changes in gram.y:
> 
> ImplicitAlias -> BareColLabel
> implicit_alias_keyword -> bare_label_keyword
> 
> and corresponding terminology changes elsewhere.

That sounds ok to me.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Batching page logging during B-tree build

2020-09-18 Thread Andrey M. Borodin
Hi!

There's a thread about GiST build [0]. We've found out there that logging a
newly created index in batches of 32 pages is slightly faster. Heikki
implemented a new logging routine, log_newpages(), for the GiST build, and it
makes the code for this batching nice and clean.
Here is a PoC porting that same routine to B-tree. It allows building
B-trees ~10% faster on my machine.
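
For readers following along, a rough sketch of the batching idea (the
buffering structure here is invented for illustration, and the
log_newpages() signature is assumed from the GiST build thread):

#define XLOG_PAGE_BATCH 32

typedef struct PageBatch
{
    int         npages;
    BlockNumber blknos[XLOG_PAGE_BATCH];
    Page        pages[XLOG_PAGE_BATCH];
} PageBatch;

/* Queue a freshly built page; emit one WAL record per 32 pages.  The caller
 * must flush any remainder once the build is finished. */
static void
batch_log_page(PageBatch *batch, RelFileNode *rnode, BlockNumber blkno, Page page)
{
    batch->blknos[batch->npages] = blkno;
    batch->pages[batch->npages] = page;
    if (++batch->npages == XLOG_PAGE_BATCH)
    {
        log_newpages(rnode, MAIN_FORKNUM, batch->npages,
                     batch->blknos, batch->pages, true);
        batch->npages = 0;
    }
}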

Thanks!

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/12CC4B83-9A5D-4A59-85A1-215CF9D1AB4B%40yandex-team.ru#6b93ee95fcc51ba710715d2ff6017b7f


v3-0001-Use-new-log_newpages-routine-in-B-tree-build.patch
Description: Binary data


Re: factorial function/phase out postfix operators?

2020-09-18 Thread Tom Lane
So I've finished up applying 0001 and started to look at 0002
... and I find the terminology you've chosen to be just really
opaque and confusing.  "aliastype" being "implicit" or "explicit"
is not going to make any sense to anyone until they read the
manual, and it probably still won't make sense after that.

In the first place, the terminology we use for these things
is usually "column label", not "alias"; see e.g.
https://www.postgresql.org/docs/devel/queries-select-lists.html#QUERIES-COLUMN-LABELS
Likewise, gram.y itself refers to the construct as a ColLabel.
Aliases are things that appear in the FROM clause.

In the second place, "implicit" vs "explicit" just doesn't make
any sense to me.  You could maybe say that the AS is implicit
when you omit it, but the column label is surely not implicit;
it's right there where you wrote it.

I confess to not having paid very close attention to this thread
lately, but the last I'd noticed the terminology proposed for
internal use was "bare column label", which I think is much better.
As for what to expose in pg_get_keywords, I think something like
"label_requires_as bool" would be immediately understandable.
If you really want it to be an enum sort of thing, maybe the output
column title could be "collabel" with values "bare" or "requires_AS".

So I'm thinking about making these changes in gram.y:

ImplicitAlias -> BareColLabel
implicit_alias_keyword -> bare_label_keyword

and corresponding terminology changes elsewhere.

Thoughts?

regards, tom lane




Re: Feature improvement for pg_stat_statements

2020-09-18 Thread Adam Brusselback
That'd be useful in my book. My scripts just have a hard-coded timestamp I
replace when I call reset so those calculations work, but it would be much
preferred to have that data available via a built-in function.

-Adam


XversionUpgrade tests broken by postfix operator removal

2020-09-18 Thread Tom Lane
Unsurprisingly, commit 1ed6b8956 has led to the buildfarm's
cross-version upgrade tests no longer working, eg

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2020-09-18%2013%3A06%3A52

Checking for reg* data types in user tables ok
Checking for contrib/isn with bigint-passing mismatch   ok
Checking for user-defined postfix operators fatal
Your installation contains user-defined postfix operators, which are not
supported anymore.  Consider dropping the postfix operators and replacing
them with prefix operators or function calls.
A list of user-defined postfix operators is in the file:
postfix_ops.txt

I intentionally let that happen, figuring that it'd be good to get some
buildfarm cycles on the new code in pg_upgrade that does this check.
But now we have to think about updating the test.  I think the best
bet is just to add some DROP OPERATOR commands to the existing
cleanup logic in TestUpgradeXversion.pm.  It looks like this would
do the trick:

drop operator #@# (bigint,NONE);
drop operator #%# (bigint,NONE);
drop operator !=- (bigint,NONE);
drop operator #@%# (bigint,NONE);

We could shorten this list a bit by changing create_operator.sql
in the back branches --- most of these have no particular reason
to be postfix rather than prefix operators.  I would not want to
remove them all, though, since that'd result in loss of test
coverage for postfix operators in the back branches.

regards, tom lane




Re: OpenSSL 3.0.0 compatibility

2020-09-18 Thread Daniel Gustafsson
> On 17 Aug 2020, at 06:12, Michael Paquier  wrote:

> Leaving the problems with pgcrypto aside for now

Returning to this subject, I decided to take a stab at fixing pgcrypto since it
wasn't working.

Since we support ciphers that are now deprecated, we have no other choice than
to load the legacy provider.  The other problem was that the cipher context
padding must be explicitly set, whereas in previous versions relying on the
default worked fine.  EVP_CIPHER_CTX_set_padding always returns 1, so that's why
it isn't checking the return value like the other nearby initialization calls.
To avoid problems with the OPENSSL_VERSION_NUMBER macro, which LibreSSL overloads
(and which is also deprecated in 3.0.0), I used the new macro which is only set in
3.0.0.  Not sure if that's considered acceptable or if we should invent our own
version macro in autoconf.

For the main SSL tests, the incorrect password test has a new errormessage
which is added in 0002.

With these two all SSL tests pass for me in 1.0.1 through 3.0.0-alpha6 (tested
on a mix of Debian and macOS).

Thoughts on these?

cheers ./daniel



0002-Fix-ssl-tests-to-support-OpenSSL-3.0.0.patch
Description: Binary data


0001-Fix-pgcrypto-to-support-OpenSSL-3.0.0.patch
Description: Binary data




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-18 Thread Masahiko Sawada
On Wed, 16 Sep 2020 at 13:20, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > > If so, can't we stipulate that the FDW implementor should ensure that the
> > commit function always returns control to the caller?
> >
> > How can the FDW implementor ensure that? Since even palloc could call
> > ereport(ERROR) I guess it's hard to require that to all FDW
> > implementors.
>
> I think the what FDW commit routine will do is to just call xa_commit(), or 
> PQexec("COMMIT PREPARED") in postgres_fdw.

Yes, but it still seems hard to me to require all FDW
implementations to commit/rollback prepared transactions without the
possibility of ERROR.

>
>
> > It's still a rough idea but I think we can use TMASYNC flag and
> > xa_complete explained in the XA specification. The core transaction
> > manager call prepare, commit, rollback APIs with the flag, requiring
> > to execute the operation asynchronously and to return a handler (e.g.,
> > a socket taken by PQsocket in postgres_fdw case) to the transaction
> > manager. Then the transaction manager continues polling the handler
> > until it becomes readable and testing the completion using by
> > xa_complete() with no wait, until all foreign servers return OK on
> > xa_complete check.
>
> Unfortunately, even Oracle and Db2 don't support XA asynchronous execution 
> for years.  Our DBMS Symfoware doesn't, either.  I don't expect other DBMSs 
> support it.
>
> Hmm, I'm afraid this may be one of the FDW's intractable walls for a serious 
> scale-out DBMS.  If we define asynchronous FDW routines for 2PC, postgres_fdw 
> would be able to implement them by using libpq asynchronous functions.  But 
> other DBMSs can't ...

I don't think all FDW implementations necessarily need to be
able to support xa_complete(). We can support both synchronous and
asynchronous execution of prepare/commit/rollback.
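As a rough illustration of the asynchronous side only (the struct and
function names below are invented for this sketch, not part of any patch),
postgres_fdw could drive COMMIT PREPARED with the existing libpq
non-blocking primitives:

#include <libpq-fe.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct FdwCommitState
{
	PGconn	   *conn;
	bool		done;
} FdwCommitState;

/* Kick off COMMIT PREPARED without waiting for the result. */
static bool
fdw_commit_begin(FdwCommitState *st, const char *gid)
{
	char		sql[128];

	/* proper quoting of gid omitted in this sketch */
	snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
	return PQsendQuery(st->conn, sql) == 1;
}

/*
 * Poll one server; returns true once its commit has completed.  The
 * transaction manager would select()/poll() on PQsocket(st->conn) and
 * call this again when the socket becomes readable.
 */
static bool
fdw_commit_poll(FdwCommitState *st)
{
	PGresult   *res;

	if (st->done)
		return true;
	if (!PQconsumeInput(st->conn))
		return false;			/* connection error; handling omitted */
	if (PQisBusy(st->conn))
		return false;			/* still waiting */

	/* Drain results; a real implementation would check their status. */
	while ((res = PQgetResult(st->conn)) != NULL)
		PQclear(res);

	st->done = true;
	return true;
}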

>
>
> > > Maybe we can consider VOLATILE functions update data.  That may be
> > overreaction, though.
> >
> > Sorry I don't understand that. The volatile functions are not pushed
> > down to the foreign servers in the first place, no?
>
> Ah, you're right.  Then, the choices are twofold: (1) trust users in that 
> their functions don't update data or the user's claim (specification) about 
> it, and (2) get notification through FE/BE protocol that the remote 
> transaction may have updated data.
>

I'm confused about which UDF case you're concerned about.
If the concern is that executing a UDF with something like 'SELECT
myfunc();' updates data on a foreign server, then since the UDF should know
which foreign server it modifies data on, it should be able to register
that foreign server and mark it as modified. Or is the concern that a
UDF in a WHERE condition is pushed down and updates data (e.g.,
 ‘SELECT … FROM foreign_tbl WHERE id = myfunc()’)?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_logging_init() can return ENOTTY with TAP tests

2020-09-18 Thread Tom Lane
Michael Paquier  writes:
> Some system calls may not reset errno even if they succeed, like
> strtol() on Linux for example, so in this case it can cause option
> handling to fail because of the error set by logging initialization.
> Shouldn't we reset errno to 0 once logging initialization is done?  It
> seems to me that this code path should finish with a clean sheet, and
> attached is a proposal of patch to address this issue.

No, this is a bad idea.  The right place to fix this is whatever code
segment is relying on errno to be initially zero; that is NEVER a sane
assumption.  That is, a valid coding pattern is something like

 errno = 0;
 strtol(...);
 if (errno) ...
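For illustration, a slightly fuller version of that pattern, also rejecting
trailing garbage and out-of-range values (a generic sketch, not code from
any particular patch):

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parse an int option value; returns false on any parse failure. */
static bool
parse_int_option(const char *str, int *result)
{
	char	   *endptr;
	long		val;

	errno = 0;
	val = strtol(str, &endptr, 10);
	if (endptr == str || *endptr != '\0')
		return false;			/* no digits, or trailing garbage */
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return false;			/* out of range */
	*result = (int) val;
	return true;
}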

regards, tom lane




Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-18 Thread Tomas Vondra

On Fri, Sep 18, 2020 at 09:28:10PM +0800, Andy Fan wrote:

On Thu, Nov 28, 2019 at 12:48 AM Tomas Vondra 
wrote:


On Tue, Nov 26, 2019 at 08:59:22AM +0800, Andy Fan wrote:
>The optimizer cost model usually needs 2 inputs: one is used to represent
>data distribution and the other one is used to represent the capacity of
>the hardware, like cpu/io; let's call this one system stats.
>
>In the Oracle database, the system stats can be gathered with
>dbms_stats.gather_system_stats [1] on the running hardware.  In
>PostgreSQL, the value is set based on experience (the user can change the
>value as well, but it should be hard to decide which values they should
>use).  The pg way is not perfect in theory (in practice, it may be good
>enough or not).  For example, HDD & SSD have different capacity with
>regard to seq_scan_cost/random_page_cost, and cpu cost may also differ on
>different hardware as well.
>
>I ran into a paper [2] which did some research on dynamically gathering
>the values for xxx_cost, and it looks interesting.  However it doesn't
>provide the code for others to do more research.  Before I dive into this,
>it would be great to hear some suggestions from experts.
>
>So what do you think about this method, and have we had some discussion
>about this before, and what was the result?
>

IMHO it would be great to have a tool that helps with tuning those
parameters, particularly random_page_cost. I'm not sure how feasible it
is, though, but if you're willing to do some initial experiments and
research, I think it's worth looking into.

It's going to be challenging, though, because even random_page_cost=4
mismatches the "raw" characteristics on any existing hardware. On old
drives the sequential/random difference is way worse, on SSDs it's about
right. But then again, we know random_page_cost=1.5 or so works mostly
fine on SSDs, and that's much lower than just raw numbers.

So it's clearly one thing to measure HW capabilities, and it's another
thing to conclude what the parameters should be ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



I recently tried something in this direction and the result looks
promising based on my limited test.

Since the unit of every xxx_cost is "seq_page_cost", how to detect
seq_page_cost is important.  In the cost model, the IO cost of a seqscan is
rel->pages * seq_page_cost; it doesn't consider any cache (file system cache
or shared buffer cache).  However, it assumes the OS will prefetch the IO.
So to detect seq_page_cost, I enabled prefetching but avoided the file
system cache.  I did this by 1) dropping the cache on the file system, and
2) opening the test file without O_DIRECT so that the prefetch can work.

To detect the random page read cost, I read pages with pread() at random
offsets.  Since the random offsets may repeat during the test, dropping the
file system cache at the beginning isn't enough, so I open the file with the
O_DIRECT option.
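A rough, self-contained sketch of that random-read measurement
(Linux-specific; the file name, block size, alignment and iteration count
are arbitrary illustration choices, not the actual main.c):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	int			fd = open("testfile", O_RDONLY | O_DIRECT);
	off_t		fsize;
	long		nblocks;
	void	   *buf;
	struct timespec start, end;
	int			iters = 10000;

	if (fd < 0)
		return 1;
	fsize = lseek(fd, 0, SEEK_END);
	nblocks = fsize / BLCKSZ;
	if (nblocks <= 0)
		return 1;

	/* O_DIRECT requires an aligned buffer (and aligned offsets) */
	if (posix_memalign(&buf, 4096, BLCKSZ) != 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iters; i++)
	{
		off_t		off = (off_t) (random() % nblocks) * BLCKSZ;

		if (pread(fd, buf, BLCKSZ, off) != BLCKSZ)
			return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("random_page_lat = %.3f us\n",
		   ((end.tv_sec - start.tv_sec) * 1e9 +
			(end.tv_nsec - start.tv_nsec)) / 1000.0 / iters);
	return 0;
}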

I also measured the cost of reading a page from the file system cache;
during my test it was about 10% of a sequential scan read.

After getting the basic numbers about the hardware capability, I let the
user provide a cache hit ratio (this is a place where we can improve further
if this is the right direction).

Here is the test result on my hardware.

fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat =
73.987732us

cache hit ratio: 1.00 random_page_cost 1.00
cache hit ratio: 0.90 random_page_cost 5.073692
cache hit ratio: 0.50 random_page_cost 7.957589
cache hit ratio: 0.10 random_page_cost 8.551591
cache hit ratio: 0.00 random_page_cost 8.633049
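For what it's worth, the numbers above are consistent with simply blending
the measured latencies by the assumed cache hit ratio and taking the ratio
of effective random to effective sequential page cost.  A sketch of that
reconstruction (my reading of the numbers, not necessarily exactly what
main.c does):

/*
 * E.g. with the latencies above and hit_ratio = 0.5 this returns ~7.96,
 * and with hit_ratio = 1.0 it returns exactly 1.0.
 */
static double
estimate_random_page_cost(double fs_cache_lat, double seq_read_lat,
						  double random_page_lat, double hit_ratio)
{
	double		eff_seq = hit_ratio * fs_cache_lat +
						  (1.0 - hit_ratio) * seq_read_lat;
	double		eff_rand = hit_ratio * fs_cache_lat +
						   (1.0 - hit_ratio) * random_page_lat;

	return eff_rand / eff_seq;
}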


Then I tested the suggested value with the 10GB TPCH workload.  I compared
the plans with 2 different settings of random_page_cost: 1) 4, the default
value, and 2) 8.6, the cache hit ratio = 0 one.  11 out of the 22 queries
generated a different plan.  Finally, I dropped the cache (both the file
system cache and shared_buffers) before running each query and ran the 11
queries under the 2 different settings.  The execution times are below.


| Query | random_page_cost=4 | random_page_cost=8.6 |
|-------+--------------------+----------------------|
| Q1    |           1425.964 |             1121.928 |
| Q2    |           2553.072 |             2567.450 |
| Q5    |           4397.514 |             1475.343 |
| Q6    |          12576.985 |             4622.503 |
| Q7    |           3459.777 |             2987.241 |
| Q8    |           8360.995 |             8415.311 |
| Q9    |           4661.842 |             2930.370 |
| Q11   |           4885.289 |             2348.541 |
| Q13   |           2610.937 |             1497.776 |
| Q20   |          13218.122 |            10985.738 |
| Q21   |            264.639 |              262.350 |


The attached main.c is the program I used to detect the
random_page_cost. result.tar.gz is the test result, you can run a git log
first
to see the 

Re: Planner, check if can use consider HASH for groupings (src/backend/optimizer/plan/planner.c)

2020-09-18 Thread Tomas Vondra

On Thu, Sep 17, 2020 at 06:31:12PM -0300, Ranier Vilela wrote:

Em qui., 17 de set. de 2020 às 17:09, Tomas Vondra <
tomas.von...@2ndquadrant.com> escreveu:


On Thu, Sep 17, 2020 at 02:12:12PM -0300, Ranier Vilela wrote:
>Hi,
>
>In case gd->any_hashable is FALSE, grouping_is_hashable is never called.
>In this case, the planner could use HASH for groupings, but will never
know.
>

The condition is pretty simple - if the query has grouping sets, look at
grouping sets, otherwise look at groupClause. For this to be an issue
the query would need to have both grouping sets and (independent) group
clause - which AFAIK is not possible.


Uh?
(parse->groupClause != NIL) If different from NIL we have ((independent)
group clause), grouping_is_hashable should check?
(gd ? gd->any_hashable : grouping_is_hashable(parse->groupClause
If gd is not NIL and gd->any_hashtable is FALSE?



Sorry, I'm not quite sure I understand what this is meant to say :-(

Anyway, (groupClause != NIL) does not mean the groupClause is somehow
independent (of what?). Add some debugging to create_grouping_paths and
you'll see that e.g. this query ends up with groupClause with 3 items:

select 1 from t group by grouping sets ((a), (b), (c));

and so does this one:

select 1 from t group by grouping sets ((a,c), (a,b));

I'm no expert in grouping sets, but I'd bet this means we transform the
grouping sets into a groupClause by extracting the keys. I haven't
investigated why exactly we do this, but I'm sure there's a reason (e.g.
it gives us SortGroupClause).

You seem to believe a query can have both grouping sets and regular
grouping at the same level - but what would such a query look like? Surely
you can't have two GROUP BY clauses. You can do

select 1 from t group by a, grouping sets ((b), (c));

which is however mostly equivalent to (AFAICS)

select 1 from t group by grouping sets ((a,b), (a,c))

so it's not like we have an independent groupClause either I think.

The whole point of the original condition is - if there are grouping
sets, check if at least one can be executed using hashing (i.e. all keys
are hashable). Otherwise (without grouping sets) look at the grouping as
a whole.

I don't see how your change improves this - if there are grouping sets,
it's futile to look at the whole groupClause if at least one grouping
set can't be hashed.

But there's a simple way to disprove this - show us a case (table and a
query) where your code correctly allows hashing while the current one
does not.




For hashing to be worth considering, at least one grouping set has to be
hashable - otherwise it's pointless.

Granted, you could have something like

 GROUP BY GROUPING SETS ((a), (b)), c

which I think essentially says "add c to every grouping set" and that
will be covered by the any_hashable check.


I am not going into the merit of whether or not it is worth hashing.
grouping_is_hashable as a last resort, must decide.



I don't know what this is supposed to say either. The whole point of
this check is to simply skip construction of hash-based paths in cases
when it's obviously futile (i.e. when some of the keys don't support
hashing). We do this as early as possible, because the whole point is to
save precious CPU cycles during query planning.




>Apparently gd pointer, will never be NULL there, verified with
>Assert(gd != NULL).
>

Um, what? If you add the assert right before the if condition, you won't
even be able to do initdb. It's pretty clear it'll crash for any query
without grouping sets.


Here not:
Assert(gd != NULL);
create_ordinary_grouping_paths(root, input_rel, grouped_rel,
                               agg_costs, gd, &extra,
                               &partially_grouped_rel);



I have no idea where you're adding this assert. But simply adding it to
create_grouping_paths (right before the if condition changed by your
patch) will trigger a failure during initdb. Simply because for queries
without grouping sets (and with regular grouping) we pass gd=NULL.

Try applying the attached patch and do "pg_ctl -D ... init" - you'll get
a failure proving that gd=NULL.


Otherwise Coverity would be right.
CID 1412604 (#1 of 1): Dereference after null check (FORWARD_NULL)
13. var_deref_model: Passing null pointer gd to create_ordinary_grouping_paths,
which dereferences it.


Which cannot be true, gd is never NULL here.



Or maybe coverity simply does not realize the NULL-dereference can't
happen, because it's guarded by other conditions, or something like
that ... As the assert failure demonstrates, we do indeed get NULLs
here, and it does not crash so coverity is missing something.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/optimizer/plan/planner.c 
b/src/backend/optimizer/plan/planner.c
index 

Re: [HACKERS] logical decoding of two-phase transactions

2020-09-18 Thread Ajin Cherian
On Thu, Sep 17, 2020 at 10:35 PM Amit Kapila  wrote:

> Yeah, I think that would be better. How about if name the new variable
> as cleanup_prepared?

I haven't added a new flag to indicate that the prepare was cleaned
up, as that wasn't really necessary. Instead I used a new function to
do partial cleanup of whatever was not done in the truncate. If you
think using a flag and doing special handling in
ReorderBufferCleanupTXN is a better idea, let me know.

regards,
Ajin Cherian
Fujitsu Australia




Re: Concurrency issue in pg_rewind

2020-09-18 Thread Andrey M. Borodin



> 18 сент. 2020 г., в 11:59, Michael Paquier  написал(а):
> 
> On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:
>> This is whole point of having prefetch. restore_command just links
>> file from the same partition.
> 
> If this stuff is willing to do so, you may have your reasons, but even
> if you wish to locate both pg_wal/ and the prefetch path in the same
> partition, I don't get why it is necessary to have the prefetch path
> included directly in pg_wal?  You could just use different paths for
> both.  Say, with a base partition at /my/path/, you can just have
> /my/path/pg_wal/ that the Postgres backend links to, and
> /my/path/wal-g/prefetch/ for the secondary path.

This complexity doesn't seem necessary to me. What do we gain? Prefetched WAL is 
WAL per se. It makes sense to keep it in the pg_wal tree by default.

I will implement the possibility to move the cache out of pg_wal (similar functionality 
is implemented in pgBackRest). But it seems useless to me: the user can configure 
WAL prefetch to be less performant, without any benefit.

Best regards, Andrey Borodin.



Re: [HACKERS] logical decoding of two-phase transactions

2020-09-18 Thread Ajin Cherian
On Tue, Sep 15, 2020 at 10:43 PM Amit Kapila  wrote:

> I don't think it is complete yet.
> *
> * This error can only occur when we are sending the data in
>   * streaming mode and the streaming is not finished yet.
>   */
> - Assert(streaming);
> - Assert(stream_started);
> + Assert(streaming || rbtxn_prepared(txn));
> + Assert(stream_started  || rbtxn_prepared(txn));
>
> Here, you have updated the code but comments are still not updated.
>

Updated the comments.


> I don't think we need to perform abort here. Later we will anyway
> encounter the WAL for Rollback Prepared for which we will call
> abort_prepared_cb. As we have set the 'concurrent_abort' flag, it will
> allow us to skip all the intermediate records. Here, we need only
> enough state in ReorderBufferTxn that it can be later used for
> ReorderBufferFinishPrepared(). Basically, you need functionality
> similar to ReorderBufferTruncateTXN where except for invalidations you
> can free memory for everything else. You can either write a new
> function ReorderBufferTruncatePreparedTxn or pass another bool
> parameter in ReorderBufferTruncateTXN to indicate it is prepared_xact
> and then clean up additional things that are not required for prepared
> xact.

Added a new parameter to ReorderBufferTruncatePreparedTxn for
prepared transactions and did cleanup of tuplecids as well; I have
left snapshots and transactions.
As a result of this, I also had to create a new function
ReorderBufferCleanupPreparedTXN which will clean up the rest as part
of FinishPrepared handling as we can't call
ReorderBufferCleanupTXN again after this.

>
> *
> Similarly, I don't understand why we need below code:
> ReorderBufferProcessTXN()
> {
> ..
> + if (rbtxn_rollback(txn))
> + rb->abort(rb, txn, commit_lsn);
> ..
> }
>
> There is nowhere we are setting the RBTXN_ROLLBACK flag, so how will
> this check be true? If we decide to remove this code then don't forget
> to update the comments.
>

Removed.

> *
> If my previous two comments are correct then I don't think we need the
> below interface.
> +
> + Transaction Abort Callback
> +
> + 
> +  The required abort_cb callback is called whenever
> +  a transaction abort has to be initiated. This can happen if we are
> +  decoding a transaction that has been prepared for two-phase commit and
> +  a concurrent rollback happens while we are decoding it.
> +
> +typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
> +   ReorderBufferTXN *txn,
> +   XLogRecPtr abort_lsn);
>
>
>

Removed.

> >>
> >>
> >> I don't know why the patch has used this way to implement an option to
> >> enable two-phase. Can't we use how we implement 'stream-changes'
> >> option in commit 7259736a6e? Just refer how we set ctx->streaming and
> >> you can use a similar way to set this parameter.
> >
> >
> > Done, I've moved the checks for callbacks to inside the corresponding 
> > wrappers.
> >
>
> This is not what I suggested. Please study the commit 7259736a6e and
> see how streaming option is implemented. I want later subscribers can
> specify whether they want transactions to be decoded at prepare time
> similar to what we have done for streaming. Also, search for
> ctx->streaming in the code and see how it is set to get the idea.
>

Changed it similar to ctx->streaming logic.

> Note: Please use version number while sending patches, you can use
> something like git format-patch -N -v n to do that. It makes easier
> for the reviewer to compare it with the previous version.

Done.

> Few other comments:
> ===
> 1.
> ReorderBufferProcessTXN()
> {
> ..
> if (streaming)
> {
> ReorderBufferTruncateTXN(rb, txn);
>
> /* Reset the CheckXidAlive */
> CheckXidAlive = InvalidTransactionId;
> }
> else
> ReorderBufferCleanupTXN(rb, txn);
> ..
> }
>
> I don't think we can perform ReorderBufferCleanupTXN for the prepared
> transactions because if we have removed the ReorderBufferTxn before
> commit, the later code might not consider such a transaction in the
> system and compute the wrong value of restart_lsn for a slot.
> Basically, in SnapBuildProcessRunningXacts() when we call
> ReorderBufferGetOldestTXN(), it should show the ReorderBufferTxn of
> the prepared transaction which is not yet committed but because we
> have removed it after prepare, it won't get that TXN and then that
> leads to wrong computation of restart_lsn. Once we start from a wrong
> point in WAL, the snapshot built was incorrect which will lead to the
> wrong result. This is the same reason why the patch is not doing
> ReorderBufferForget in DecodePrepare when we decide to skip the
> transaction. Also, here, we need to set CheckXidAlive =
> InvalidTransactionId; for prepared xact as well.

Updated as suggested above.

> 2. Have you thought about the interaction of streaming with prepared
> transactions? You can try writing some tests using pg_logical* APIs
> 

Re: WIP: BRIN multi-range indexes

2020-09-18 Thread Tomas Vondra

On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:

I wrote:


Hmm, I came across that paper while doing background reading. Okay,
now I get that "% (filter->nbits - 1)" is the second hash function in
that scheme. But now I wonder if that second function should actually
act on the passed "value" (the original hash), so that they are
actually independent, as required. In the language of that paper, the
patch seems to have

g(x) = h1(x) + i*h2(h1(x)) + f(i)

instead of

g(x) = h1(x) + i*h2(x) + f(i)

Concretely, I'm wondering if it should be:

 big_h = DatumGetUint32(hash_uint32(value));
 h = big_h % filter->nbits;
-d = big_h % (filter->nbits - 1);
+d = value % (filter->nbits - 1);

But I could be wrong.


I'm wrong -- if we use different operands to the moduli, we throw away
the assumption of co-primeness. But I'm still left wondering why we
have to re-hash the hash for this to work. In any case, there should
be some more documentation around the core algorithm, so that future
readers are not left scratching their heads.



Hmm, good question. I think we don't really need to hash it twice. It
does not really achieve anything - it won't reduce the number of collisions
or anything like that.
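For future readers scratching their heads, here is a minimal sketch of the
scheme as I understand it (the f(i) term is effectively folded into the
"d++" increment); this is a simplified illustration, not the code from the
patch:

#include <stdint.h>

/* Set the k bit positions for one value in a bitmap of nbits bits. */
static void
bloom_add_hash(unsigned char *bitmap, int nbits, int k, uint32_t big_h)
{
	uint32_t	h = big_h % nbits;			/* first hash */
	uint32_t	d = big_h % (nbits - 1);	/* second hash, used as step */

	for (int i = 0; i < k; i++)
	{
		bitmap[h / 8] |= (1 << (h % 8));	/* set bit h */
		h = (h + d++) % nbits;				/* g_{i+1} = g_i + h2 + i */
	}
}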


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-18 Thread Tomas Vondra

On Thu, Sep 17, 2020 at 05:42:59PM -0400, John Naylor wrote:

On Thu, Sep 17, 2020 at 12:34 PM Tomas Vondra
 wrote:


On Thu, Sep 17, 2020 at 10:33:06AM -0400, John Naylor wrote:
>On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
> wrote:
>> <20200913 patch set>
But those are opclass parameters, so the parameters are not specified in
WITH clause, but right after the opclass name:

CREATE INDEX idx ON table USING brin (
bigint_col int8_minmax_multi_ops(values_per_range = 15)
);


D'oh!


>+ * The index only supports equality operator, similarly to hash indexes.
>
>s/operator/operators/
>

Hmmm, are there really multiple equality operators?


Ah, I see what you meant -- then "_the_ equality operator" is what we want.


The number of items the comment refers to is this:

 /* how many uint32 hashes can we fit into the bitmap */
 int maxvalues = filter->nbits / (8 * sizeof(uint32));

where nbits is the size of the bloom filter. So I think the "same" is
quite right here.


Ok, I get it now.


>+ /*
>+ * The 1% value is mostly arbitrary, it just looks nice.
>+ */
>+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
>
>I think we'd want better stated justification for this default, even
>if just precedence in other implementations. Maybe I can test some
>other values here?
>

Well, I don't know how to pick a better default :-( Ultimately it's a
trade-off between a larger index (lower false positive rate) and scanning a
larger fraction of the table (higher false positive rate). Then there's the
restriction that the whole index tuple needs to fit into a single 8kB page.


Well, it might be a perfectly good default, and it seems common in
articles on the topic, but the comment is referring to aesthetics. :-)
I still intend to test some cases.



I think we may formulate this as a question of how much I/O we need to
do for a random query, and pick the false positive rate minimizing that.
For a single BRIN range an approximation might look like this:

  bloom_size(fpr, ...) + (fpr * range_size) + (selectivity * range_size)

The "selectivity" shows the true selectivity of ranges, and it might be
estimated from a per-row selectivity I guess. But it does not matter much
because this is constant and independent of the false-positive rate, so
we can ignore it. Which leaves us with

  bloom_size(fpr, ...) + (fpr * range_size)

We might solve this for fixed parameters (range_size, ndistinct, ...),
either analytically or by brute force, giving us the "optimal" fpr.

The trouble is the bloom_size is restricted, and we don't really know
the limit - the whole index tuple needs to fit on a single 8kB page, and
there may be other BRIN summaries etc. So I've opted to use a somewhat
defensive default for the false positive rate.
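To illustrate the shape of that optimization, here is a toy brute-force
search along those lines.  The parameters are made up, and the Bloom sizing
formula bits = -n * ln(fpr) / (ln 2)^2 is the textbook one for an optimally
sized filter, which may differ from what the patch actually allocates:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		ndistinct = 10000.0;		/* distinct values per range */
	double		range_size = 128 * 8192.0;	/* 128 pages of 8kB, in bytes */
	double		ln2 = log(2.0);
	double		best_fpr = 0.0;
	double		best_cost = INFINITY;

	for (double fpr = 0.0005; fpr <= 0.25; fpr += 0.0005)
	{
		double		bloom_bytes = -ndistinct * log(fpr) / (ln2 * ln2) / 8.0;
		double		cost = bloom_bytes + fpr * range_size;

		if (cost < best_cost)
		{
			best_cost = cost;
			best_fpr = fpr;
		}
	}

	printf("optimal fpr ~ %.4f\n", best_fpr);
	return 0;
}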


>Also, is there a reference for the algorithm for hash values that
>follows? I didn't see anything like it in my cursory reading of the
>topic. Might be good to include it in the comments.
>

This was suggested by Yura Sokolov [1] in 2017. The post refers to a
paper [2] but I don't recall which part describes "our" algorithm.

[1] 
https://www.postgresql.org/message-id/94c173b54a0aef6ae9b18157ef52f...@postgrespro.ru
[2] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf


Hmm, I came across that paper while doing background reading. Okay,
now I get that "% (filter->nbits - 1)" is the second hash function in
that scheme. But now I wonder if that second function should actually
act on the passed "value" (the original hash), so that they are
actually independent, as required. In the language of that paper, the
patch seems to have

g(x) = h1(x) + i*h2(h1(x)) + f(i)

instead of

g(x) = h1(x) + i*h2(x) + f(i)

Concretely, I'm wondering if it should be:

big_h = DatumGetUint32(hash_uint32(value));
h = big_h % filter->nbits;
-d = big_h % (filter->nbits - 1);
+d = value % (filter->nbits - 1);

But I could be wrong.

Also, I take it that f(i) = 1 in the patch, hence the increment here:

+ h += d++;

But it's a little hidden. That might be worth commenting, if I haven't
completely missed something.



OK


>+ * Tweak the ndistinct value based on the pagesPerRange value. First,
>
>Nitpick: "Tweak" to my mind means to adjust an existing value. The
>above is only true if ndistinct is negative, but we're really not
>tweaking, but using it as a scale factor. Otherwise it's not adjusted,
>only clamped.
>

OK. Perhaps 'adjust' would be a better term?


I felt like rewriting the whole thing, but your original gets the
point across ok, really.

"If the passed ndistinct value is positive, we can just use that, but
we also clamp the value to prevent over-sizing the bloom filter
unnecessarily. If it's negative, it represents a multiplier to apply
to the maximum number of tuples in the range (assuming each page gets
MaxHeapTuplesPerPage tuples, which is likely a significant
over-estimate), similar to the corresponding value in table
statistics."


>+ /* TODO include the sorted/unsorted values */
>

This was simplemented as 

Re: [HACKERS] [PATCH] Generic type subscripting

2020-09-18 Thread Dmitry Dolgov
> On Thu, Sep 17, 2020 at 05:19:19PM +0200, Pavel Stehule wrote:
>
> ok, then I think we can design some workable behaviour
>
> My first rule - there should not be any implicit action that shifts
> positions in the array. It can be explicit, but not implicit. It is true
> for positive indexes, and it should be true for negative indexes too.
>
> then I think so some like this can work
>
> if (idx < 0)
> {
>   if (abs(idx) > length of array)
> exception("index is out of range");
>   array[length of array + idx] := value;
> }
> else
> {
>    /* known behaviour for positive index */
> }

In this way (returning an error on negative indices whose absolute value is
bigger than the number of elements) the functionality for assigning via
subscripting will already differ significantly from the original one via
jsonb_set. Which in turn could cause a new wave of something similar to "why
does assigning an SQL NULL as a value return NULL instead of jsonb?". Taking
into account that this is not an absolutely new interface, but rather a
convenient shortcut for the existing one, it probably makes sense to try to
find a balance between consistency with regular arrays and similarity with
the already existing jsonb modification functions.

Having said that, my impression is that this balance should not be fully
shifted towards consistency with the regular array type, as a jsonb array
and a regular array are fundamentally different in terms of
implementation. If any differences are of concern, they should be
addressed at a different level. At the same time I've already sort of given
up on this patch in the form I wanted to see it anyway, so anything goes
if it helps bring it to the finish point. If there are no
more arguments from the other involved sides, I can post the next version
with your suggestion included.




Re: BUG #15858: could not stat file - over 4GB

2020-09-18 Thread Juan José Santamaría Flecha
On Thu, Sep 17, 2020 at 8:47 PM Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

> On Thu, Sep 17, 2020 at 6:04 PM Tom Lane  wrote:
>
>>
>> One thing I noticed, which is a pre-existing problem but maybe now
>> is a good time to consider it, is that we're mapping lstat() to be
>> exactly stat() on Windows.  That made sense years ago when (we
>> believed that) Windows didn't have symlinks, but surely it no longer
>> makes sense.
>>
>
> I will have to take a better look at it, but from a quick look at it, all
> lstat() calls seem to test just whether the file exists, and that can be done
> with a cheap call to GetFileAttributes(). Would a limited (but fast)
> lstat(), where only st_mode is filled in, be acceptable?
>

After thinking more about this, that approach would be problematic for
DELETE_PENDING files. The proposed patch logic is meant to maintain current
behaviour, which is not broken for WIN32 symlinks AFAICT.

Regards,

Juan José Santamaría Flecha


Re: Redundant tuple copy in tqueueReceiveSlot()

2020-09-18 Thread Amit Khandekar
On Thu, 17 Sep 2020 at 08:55, Andres Freund  wrote:
>
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept...  It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we can pass the "length" separately, but since the
receiving end already assumes that the received data is a
minimal tuple, I thought why not skip passing this redundant
component. But anyway, if you and Andres are suggesting that being
able to skip the copy is important for virtual tuples as well, then I
think the approach you suggested (supplying allocated memory to the
tuple API for conversion) would be one of the better options available,
if not the only good one. Maybe I will look into the shm_mq
internals to see if we can come up with a good solution.

>
> There really is no justification for having MinimalTuples, as we have
> them today at least, anymore. We used to rely on being able to construct
> pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> I think there's none of those left since PG 12.

Ah ok.

>
> I think it'd make a bit more sense to do some steps towards having a
> more suitable "minimal" tuple representation, rather than doing this
> local, pretty ugly, hacks. A good way would be to just starting to
> remove the padding, unnecessary fields etc from MinimalTuple.

So there are two things we wish to do:
1. Prevent an extra tuple-forming step before sending minimal tuple
data. Possibly devise a shm_mq API to get memory to write a tuple of a
given length into, and devise something like
FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) which will
write the minimal tuple data.
2. Shrink the MinimalTupleData structure because it no longer needs
the current padding etc., and substitute this new MinimalTuple
structure for the current one all over the code wherever it is
currently being used.

If we remove the unnecessary fields from the tuple data being sent to
the Gather node, then we need to form a MinimalTuple again at the
receiving end, which adds an extra tuple-forming step. So I
understand that's the reason why you are saying we should shrink the
MinimalTupleData structure itself, in which case we will continue to
use the received new MinimalTupleData as an already-formed tuple, like
we are doing now.

Now, the above two things (1. and 2.) look independent to me. Suppose
we first do 1, i.e. we come up with a good way to form an in-place
MinimalTuple at the sender's end, without any change to
MinimalTupleData. Then when we do 2, i.e. shrink MinimalTupleData,
we won't require any change in the in-place-tuple-forming API we
wrote in 1. Just the existing underlying function
heap_form_minimal_tuple() or something similar might need to be
changed. At least that's what I feel right now.

>
> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The above FormMinimalTupleDataInHere() should be
able to be used for these other places as well. Will keep that in
mind.

>
> > > Thomas, I guess you had a different approach in mind when you said
> > > about "returning either success or
> > > hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
> >
> > Yeah I tried some things like that, but I wasn't satisfied with any of
> > them; basically the extra work involved in negotiating the size was a
> > bit too high.

Hmm, ok. Let me see if there is any way around this.

>> On the other hand, because my interface was "please
> > write a MinimalTuple here!", it had the option to *form* a
> > MinimalTuple directly in place, whereas your approach can only avoid
> > creating and destroying a temporary tuple when the source is a heap
> > tuple.
True.
>
> There's a lot of cases where the source is a virtual slot (since we'll
> often project stuff below Gather). So it'd be quite advantageous to
> avoid building an unnecessary HeapTuple in those cases.

Yeah right.




Re: Report error position in partition bound check

2020-09-18 Thread Ashutosh Bapat
On Thu, 17 Sep 2020 at 13:06, Amit Langote  wrote:

> Hi Ashutosh,
>
> I had forgotten about this thread, but Michael's ping email brought it
> to my attention.
>
> Thanks Amit for addressing comments.

@@ -4256,5 +4256,8 @@ transformPartitionBoundValue(ParseState *pstate, Node
*val,
  if (!IsA(value, Const))
  elog(ERROR, "could not evaluate partition bound expression");

+ /* Preserve parser location information. */
+ ((Const *) value)->location = exprLocation(val);
+
  return (Const *) value;
 }

This caught my attention and I was wondering whether transformExpr() itself
should transfer the location from input expression to the output
expression. Some minions of transformExprRecurse() seem to be doing that.
The change here may be an indication that some of them are not doing this.
In that case maybe it's better to find and fix those rather than apply a
whitewash fix here. In which case did we find that the location was not set by
transformExpr? Sorry for not catching this earlier.

/* New lower bound is certainly >= bound at offet. */
offet/offset? But this comment is implied by the comment just two lines
above. So I am not sure it's really needed.

/* Fetch the problem bound from lower datums list. */
This is fetching the problematic key value rather than the whole problematic
bound. I think the comment would be useful if it explained why the
(cmpval - 1)-th key is problematic, but that's evident from the prologue
of partition_rbound_cmp(), so I am not sure if this comment is really
required. For example, we aren't adding a comment here
+ overlap_location = ((PartitionRangeDatum *)
+ list_nth(spec->upperdatums, -cmpval - 1))->location;

-- 
Best Wishes,
Ashutosh


pg_logging_init() can return ENOTTY with TAP tests

2020-09-18 Thread Michael Paquier
Hi,

While hacking my way around a different patch (better option handling
for pg_test_fsync), I was surprised by the fact that if we run the TAP
tests, logging initialization could return with errno set to ENOTTY
because we call isatty() to check for output coloring.  I found that
this can happen with Debian sid, which uses a recent version of
IPC::Run.

Some system calls may not reset errno even if they succeed, like
strtol() on Linux for example, so in this case it can cause option
handling to fail because of the error set by logging initialization.
Shouldn't we reset errno to 0 once logging initialization is done?  It
seems to me that this code path should finish with a clean sheet, and
attached is a proposal of patch to address this issue.

Thoughts?
--
Michael
diff --git a/src/common/logging.c b/src/common/logging.c
index d9632fffc8..13913a71ac 100644
--- a/src/common/logging.c
+++ b/src/common/logging.c
@@ -149,6 +149,12 @@ pg_logging_init(const char *argv0)
 			sgr_locus = SGR_LOCUS_DEFAULT;
 		}
 	}
+
+	/*
+	 * isatty() may fail, setting errno to ENOTTY if running for example
+	 * TAP tests that depend on IPC::Run, so reset properly.
+	 */
+	errno = 0;
 }
 
 void


signature.asc
Description: PGP signature


Re: [PATCH] Add features to pg_stat_statements

2020-09-18 Thread Julien Rouhaud
Hi,

On Fri, Sep 18, 2020 at 10:54 AM Katsuragi Yuta
 wrote:
>
> This is a proposal to add some features to pg_stat_statements.
> Attached is the patch of this.
>
> pg_stat_statements uses a hash table to hold statistics,
> and the maximum number of its entries can be configured through
> pg_stat_statements.max.
> When the number of entries exceeds pg_stat_statements.max,
> pg_stat_statements deallocates existing entries.
>
> Currently, it is impossible to know how many times/when this
> deallocation happened.
> But, with this information, more detailed performance analysis would be
> possible.
> So, this patch provides access to this information.
>
> This patch provides a view (pg_stat_statements_ctl) and
> a function (pg_stat_statements_ctl).
> The pg_stat_statements_ctl view is defined
> in terms of a function also named pg_stat_statements_ctl.
>
> The entry of pg_stat_statements_ctl view is removed
> when pg_stat_statements_reset(0,0,0) is called.
> Maybe, it is convenient to provide a function named
> pg_stat_statements_ctl_reset
> that removes only the entry of pg_stat_statements_ctl.
>
> The following is an example of this feature.
> Here, a deallocation is shown with the INFO message.
> pg_stat_statements_ctl tracks the number of deallocations
> and timestamp of last deallocation.
>
>
> testdb=# select * from pg_stat_statements_ctl;
>   dealloc | last_dealloc
> -+---
> 2 | 2020-09-18 16:55:31.128256+09
> (1 row)
>
> testdb=# select count(*) from test1;
> 2020-09-18 16:59:20.745 JST [3920] INFO:  deallocate
> 2020-09-18 16:59:20.745 JST [3920] STATEMENT:  select count(*) from
> test1;
> INFO:  deallocate
>   count
> ---
>  90
> (1 row)
>
> testdb=# select * from pg_stat_statements_ctl;
>   dealloc | last_dealloc
> -+---
> 3 | 2020-09-18 16:59:20.745652+09
> (1 row)

I like it, this is especially important since deallocation can lead to quite
a lot of overhead.  Did you consider also adding the cumulative number of
evicted entries?  This could be useful to know how to configure
pg_stat_statements.max.




Re: pgbench calculates summary numbers a wrong way.

2020-09-18 Thread Kyotaro Horiguchi
At Fri, 18 Sep 2020 17:28:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > I have submitted a patch which reworks how things are computed so that
> > performance figures make some/more sense, among other things.
> > 
> > Maybe you could have a look at it and tell whether it is an
> > improvement wrt your issue?
> 
> Thanks for the pointer, I'm going to look it.

I remember the topic. It started from an issue of thread/client
connection synchronization.  Currently it seems an overhaul of the
time-counting code has been added (I think they are in separate patches).
However, the patch doesn't fix my issues.

If I re-explain my issues in a few words:

The first issue is: in printResults, conn_total_delay is assumed to be
the sum of the connection delays of all clients. But actually it is the sum
of the connection delays of all *threads*. As a result the connection
delay is underestimated when nclients is larger than nthreads, and then
tps (excluding connection establishment) is too small under that condition.


The second issue is: the estimated latency average (without -L) is
calculated as time_include * nclients / <# of completed
txns>. Considering that the latency is the time from connection
establishment until transaction end, the "time_include" should be
time_exclude.  As a result the estimated (non -L case) average
latency is overestimated compared to the measured latency (with -L case) if
connections take a long time.

The patch doesn't affect the first issue, but alleviates the error in
the second issue. But it still doesn't handle the connection delay
correctly when calculating the value in non -C mode, and doesn't fix
the error in -C (conn per tx) mode at all.

So the two patches are almost orthogonal (but they heavily conflict with
each other^^;).  Anyway, it's a good opportunity, so I'll take a look at
your patch.

Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Feature improvement for pg_stat_statements

2020-09-18 Thread btnakamichin

Hello.

I’d like to improve pg_stat_statements so that it can report the 
timestamp when the stats are reset. Currently it’s difficult to check 
that reset timestamp. But this information is useful, for example, when 
calculating the SQL executions per second by SELECT calls / (now() - 
reset_timestamp) FROM pg_stat_statements.


I’m thinking of adding adding a function called 
pg_stat_statements_reset_time() that returns the last timestamp when 
executed pg_stat_statements_reset(). pg_stat_statements can reset each 
SQL statement. We can record each sql reset timing timestamp but I don’t 
feel the need. This time, I’m thinking of updating the reset timestamp 
only when all stats were reset.


what do you think ?
Regards.

Naoki Nakamichi




[PATCH] Add features to pg_stat_statements

2020-09-18 Thread Katsuragi Yuta

This is a proposal to add some features to pg_stat_statements.
Attached is the patch of this.

pg_stat_statements uses a hash table to hold statistics,
and the maximum number of its entries can be configured through 
pg_stat_statements.max.

When the number of entries exceeds pg_stat_statements.max,
pg_stat_statements deallocates existing entries.

Currently, it is impossible to know how many times/when this 
deallocation happened.
But, with this information, more detailed performance analysis would be 
possible.

So, this patch provides access to this information.

This patch provides a view (pg_stat_statements_ctl) and
a function (pg_stat_statements_ctl).
The pg_stat_statements_ctl view is defined
in terms of a function also named pg_stat_statements_ctl.

The entry of pg_stat_statements_ctl view is removed
when pg_stat_statements_reset(0,0,0) is called.
Maybe, it is convenient to provide a function named 
pg_stat_statements_ctl_reset

that removes only the entry of pg_stat_statements_ctl.

The following is an example of this feature.
Here, a deallocation is shown with the INFO message.
pg_stat_statements_ctl tracks the number of deallocations
and timestamp of last deallocation.


testdb=# select * from pg_stat_statements_ctl;
 dealloc |         last_dealloc
---------+-------------------------------
       2 | 2020-09-18 16:55:31.128256+09
(1 row)

testdb=# select count(*) from test1;
2020-09-18 16:59:20.745 JST [3920] INFO:  deallocate
2020-09-18 16:59:20.745 JST [3920] STATEMENT:  select count(*) from 
test1;

INFO:  deallocate
 count
-------
    90
(1 row)

testdb=# select * from pg_stat_statements_ctl;
 dealloc |         last_dealloc
---------+-------------------------------
       3 | 2020-09-18 16:59:20.745652+09
(1 row)


Regards,
Katsuragi Yuta
diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 081f997d70..63fd4874b1 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,7 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql\
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
new file mode 100644
index 00..43c7966946
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.9'" to load this file. \quit
+
+--- Define pg_stat_statements_ctl
+CREATE FUNCTION pg_stat_statements_ctl (
+	OUT dealloc integer,
+	OUT last_dealloc TIMESTAMP WITH TIME ZONE
+)
+RETURNS record
+AS 'MODULE_PATHNAME'
+LANGUAGE  C STRICT VOLATILE PARALLEL SAFE;
+
+CREATE VIEW pg_stat_statements_ctl AS
+	SELECT * FROM pg_stat_statements_ctl();
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..aa116db498 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -81,6 +81,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/timestamp.h"
 
 PG_MODULE_MAGIC;
 
@@ -193,6 +194,17 @@ typedef struct Counters
 	uint64		wal_bytes;		/* total amount of WAL bytes generated */
 } Counters;
 
+/*
+ * Counter for pg_stat_statements_ctl
+ */
+typedef struct pgssCtlCounter
+{
+	int64		dealloc;	/* # of deallocation */
+	TimestampTz	last_dealloc;	/* timestamp of last deallocation */
+	bool		ts_isnull;	/* if true last_dealloc is null */
+	slock_t		mutex;
+} pgssCtlCounter;
+
 /*
  * Statistics per statement
  *
@@ -279,6 +291,7 @@ static ProcessUtility_hook_type prev_ProcessUtility = NULL;
 /* Links to shared memory state */
 static pgssSharedState *pgss = NULL;
 static HTAB *pgss_hash = NULL;
+static pgssCtlCounter *pgss_ctl = NULL;
 
 /* GUC variables */
 
@@ -327,6 +340,7 @@ PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_8);
 PG_FUNCTION_INFO_V1(pg_stat_statements);
+PG_FUNCTION_INFO_V1(pg_stat_statements_ctl);
 
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
@@ -380,6 +394,8 @@ static char *generate_normalized_query(pgssJumbleState *jstate, const char *quer
 static void fill_in_constant_lengths(pgssJumbleState *jstate, const char *query,
 	 int query_loc);
 static int	comp_location(const void *a, const void *b);
+static 

Re: Concurrency issue in pg_rewind

2020-09-18 Thread Oleksandr Shulgin
On Fri, Sep 18, 2020 at 8:10 AM Michael Paquier  wrote:

> On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
> > Ouch.  I think pg_rewind shouldn't try to remove any random files in
> pg_wal
> > that it doesn't know about.
> > What if the administrator made a backup of some WAL segments there?
>
> IMO, this would be a rather bad strategy anyway, so just don't do
> that, because that could also mean that this is on the same partition
> as pg_wal/ which would crash the server if the partition has the idea
> to get full even if max_wal_size is set correctly.


To clarify my point, I don't mean to back up WAL segments in the background
when the server is running, but precisely when the server is down and you
need to intervene, such as when running pg_rewind.  You might want to "stash"
some of the latest segments in case you need to start over (name it
pg_wal/00840A76001E.backup, or
pg_wal/backup/00840A76001E).  It is surprising that pg_rewind
might decide to remove those.

--
Alex


Re: pgbench calculates summary numbers a wrong way.

2020-09-18 Thread Kyotaro Horiguchi
At Fri, 18 Sep 2020 09:55:30 +0200 (CEST), Fabien COELHO  
wrote in 
> 
> Hello,
> 
> >> Sorry, I sent a wrong version of the patch, contains some spelling
> >> errors. This is the right one.
> >
> > Sigh.. I fixed some more wordings.
> 
> I have submitted a patch which reworks how things are computed so that
> performance figures make some/more sense, among other things.
> 
> Maybe you could have a look at it and tell whether it is an
> improvement wrt your issue?

Thanks for the pointer, I'm going to look it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pgbench calculates summary numbers a wrong way.

2020-09-18 Thread Kyotaro Horiguchi
At Thu, 17 Sep 2020 17:59:45 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Sigh.. I fixed some more wordings.

The condition "all clients took the same time to establish a
connection" is not needed for the transformation, and an intermediate
expression was wrong. Fixed them.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 802c06d95aebdb8a4ed4e3adc4a6c74d9675c625 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 17 Sep 2020 17:20:10 +0900
Subject: [PATCH v4] Fix latency and tps calculation of pgbench

Fix the calculation for "latency average" and the "tps excluding
connections establishing" not to wrongly affected by connections
establishing time.
---
 src/bin/pgbench/pgbench.c | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 663d7d292a..4164a546e1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -5194,16 +5194,34 @@ printResults(StatsData *total, instr_time total_time,
 			 instr_time conn_total_time, int64 latency_late)
 {
 	double		time_include,
+time_exclude,
 tps_include,
 tps_exclude;
 	int64		ntx = total->cnt - total->skipped;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 
+	/*
+	 * conn_total_time is the sum of the time each client took to establish a
+	 * connection. In the multi-threaded case, all clients run on a thread wait
+	 * for all the clients to establish a connection. So the actual total
+	 * connection time of a thread is thread->conn_time * thread->nstate. Thus
+	 * the total time took for connection establishment is:
+	 *
+	 *   sum(thread->conn_time * thread->nstate) / nclients
+	 *
+	 * Assuming clients are distributed equally to threads, the expression is
+	 * approximated as:
+	 *
+	 *   sum(thread->conn_time) * (nclients/nthreads) / nclients
+	 * = conn_total_time / nthreads
+	 */
+	time_exclude = (time_include -
+	(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
+
 	/* tps is about actually executed transactions */
 	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
+	tps_exclude = ntx / time_exclude;
 
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
@@ -5249,7 +5267,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   1000.0 * time_exclude * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
-- 
2.18.4



Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-09-18 Thread Michael Paquier
On Tue, Sep 15, 2020 at 02:39:08PM +0200, Peter Eisentraut wrote:
> I didn't mean use strtol() to be able to process larger values, but for the
> error checking.  atoi() cannot detect any errors other than ERANGE. So if
> you are spending effort on making the option value parsing more robust,
> relying on atoi() will result in an incomplete solution.

Okay, after looking at that, here is v3.  This includes range checks
as well as errno checks based on strtol().  What do you think?
--
Michael
diff --git a/src/bin/pg_test_fsync/.gitignore b/src/bin/pg_test_fsync/.gitignore
index f3b5932498..5eb5085f45 100644
--- a/src/bin/pg_test_fsync/.gitignore
+++ b/src/bin/pg_test_fsync/.gitignore
@@ -1 +1,3 @@
 /pg_test_fsync
+
+/tmp_check/
diff --git a/src/bin/pg_test_fsync/Makefile b/src/bin/pg_test_fsync/Makefile
index 7632c94eb7..c4f9ae0664 100644
--- a/src/bin/pg_test_fsync/Makefile
+++ b/src/bin/pg_test_fsync/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_fsync$(X)'
 
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 6e47293123..3ab78ac0d5 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -5,6 +5,7 @@
 
 #include "postgres_fe.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -62,7 +63,7 @@ do { \
 
 static const char *progname;
 
-static int	secs_per_test = 5;
+static int32 secs_per_test = 5;
 static int	needs_unlink = 0;
 static char full_buf[DEFAULT_XLOG_SEG_SIZE],
 		   *buf,
@@ -148,6 +149,8 @@ handle_args(int argc, char *argv[])
 
 	int			option;			/* Command line option */
 	int			optindex = 0;	/* used by getopt_long */
+	long		optval;			/* used for option parsing */
+	char	   *endptr;
 
 	if (argc > 1)
 	{
@@ -173,7 +176,23 @@ handle_args(int argc, char *argv[])
 break;
 
 			case 's':
-secs_per_test = atoi(optarg);
+optval = strtol(optarg, &endptr, 10);
+
+if (endptr == optarg || *endptr != '\0' ||
+	errno != 0 || optval != (int32) optval)
+{
+	pg_log_error("invalid argument for option %s", "--secs-per-test");
+	fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+	exit(1);
+}
+
+secs_per_test = (int32) optval;
+if (secs_per_test <= 0)
+{
+	pg_log_error("%s must be in range %d..%d",
+ "--secs-per-test", 1, INT_MAX);
+	exit(1);
+}
 break;
 
 			default:
diff --git a/src/bin/pg_test_fsync/t/001_basic.pl b/src/bin/pg_test_fsync/t/001_basic.pl
new file mode 100644
index 00..d4b67092cb
--- /dev/null
+++ b/src/bin/pg_test_fsync/t/001_basic.pl
@@ -0,0 +1,25 @@
+use strict;
+use warnings;
+
+use Config;
+use TestLib;
+use Test::More tests => 12;
+
+#
+# Basic checks
+
+program_help_ok('pg_test_fsync');
+program_version_ok('pg_test_fsync');
+program_options_handling_ok('pg_test_fsync');
+
+#
+# Test invalid option combinations
+
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', 'a' ],
+	qr/\Qpg_test_fsync: error: invalid argument for option --secs-per-test\E/,
+	'pg_test_fsync: invalid argument for option --secs-per-test');
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', '0' ],
+	qr/\Qpg_test_fsync: error: --secs-per-test must be in range 1..2147483647\E/,
+'pg_test_fsync: --secs-per-test must be in range');
diff --git a/src/bin/pg_test_timing/.gitignore b/src/bin/pg_test_timing/.gitignore
index f6c664c765..e5aac2ab12 100644
--- a/src/bin/pg_test_timing/.gitignore
+++ b/src/bin/pg_test_timing/.gitignore
@@ -1 +1,3 @@
 /pg_test_timing
+
+/tmp_check/
diff --git a/src/bin/pg_test_timing/Makefile b/src/bin/pg_test_timing/Makefile
index 334d6ff5c0..52994b4103 100644
--- a/src/bin/pg_test_timing/Makefile
+++ b/src/bin/pg_test_timing/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_timing$(X)'
 
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e14802372b..2a19ac6368 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -6,6 +6,8 @@
 
 #include "postgres_fe.h"
 
+#include 
+
 #include "getopt_long.h"
 #include "portability/instr_time.h"
 
@@ -14,7 +16,7 @@ static const char *progname;
 static int32 test_duration = 3;
 
 static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(int32);
+static uint64 test_timing(int32 duration);
 static void output(uint64 loop_count);
 
 /* record duration in powers of 2 microseconds */
@@ -47,6 +49,8 @@ handle_args(int argc, char *argv[])
 
 	int			option;			/* Command line option */
 	int			optindex = 0;	/* used by getopt_long */

Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-18 Thread Masahiko Sawada
On Thu, 17 Sep 2020 at 14:25, Michael Paquier  wrote:
>
> On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote:
> > Thank you for letting me know. I've attached the latest version patch set.
>
> A rebase is needed again as the CF bot is complaining.

Thank you for letting me know. I'm updating the patch and splitting
into small pieces as Fujii-san suggested. I'll submit the latest patch
set early next week.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgbench calculates summary numbers a wrong way.

2020-09-18 Thread Fabien COELHO



Hello,


Sorry, I sent a wrong version of the patch, contains some spelling
errors. This is the right one.


Sigh.. I fixed some more wordings.


I have submitted a patch which reworks how things are computed so that 
performance figures make some/more sense, among other things.


Maybe you could have a look at it and tell whether it is an improvement 
wrt your issue?


https://commitfest.postgresql.org/29/2557/

--
Fabien.




Re: Refactor pg_rewind code and make it work against a standby

2020-09-18 Thread Kyotaro Horiguchi
Hello.

It needed rebasing. (Attached)

At Tue, 25 Aug 2020 16:32:02 +0300, Heikki Linnakangas  wrote 
in 
> On 20/08/2020 11:32, Kyotaro Horiguchi wrote:
> > 0002: Rewording that old->target and new->source makes the meaning far
> 
> Good idea! I changed the patch that way.

Looks Good.

> > 0003: Thomas is propsing sort template. It could be used if committed.


+* If the block is beyond the EOF in the source system, or the file 
doesn't
+* doesn'exist in the source at all, we're going to truncate/remove it 
away

"the file doesn't doesn'exist"

I don't think filemap_finalize needs to iterate over filemap twice.

hash_string_pointer is a copy of the one in pg_verifybackup.c. Is it
worth having it in hashfn.h or .c?

> --- a/src/bin/pg_rewind/pg_rewind.c
> +++ b/src/bin/pg_rewind/pg_rewind.c
> ...
> + filemap_t  *filemap;
> ..
> + filemap_init();
> ...
> + filemap = filemap_finalize();

I'm a bit confused by this, and realized that what filemap_init
initializes is *not* the filemap, but the filehash. So, for example,
shouldn't the functions be named something like this?

filehash_init()
filemap = filehash_finalize()/create_filemap()
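
In other words, the calling sequence in pg_rewind.c would read roughly
like this sketch; the names are only the suggestion above, not a settled
API:

/* hypothetical renamed sequence, per the suggestion above */
filehash_init();			/* sets up the hash of files, not the map itself */
/* ... decide the fate of each source/target file, filling the hash ... */
filemap = filemap_create();	/* or filehash_finalize(); builds the final filemap */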


> > 0004:
> >   The names of many of the functions get an additional word "local",
> >   but I don't get the meaning clearly; it's a matter of linguistic
> >   sense, though, and I'm not the best judge of that..
> >   -rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
> >   +local_fetch_file_range(rewind_source *source, const char *path, uint64 off,
> >   The function actually copies the source range to the target file, so
> >   "fetch" might give a somewhat different meaning, but that's about
> >   linguistics (omitted..).
> 
> Hmm. It is "fetching" the range from the source server, and writing it
> to the target. The term makes more sense with a libpq source. Perhaps
> this function should be called "local_copy_range" or something, but
> it'd also be nice to have "fetch" in the name because the function
> pointer it's assigned to is called "queue_fetch_range".

Thanks. Yeah, libpq_fetch_file makes sense; I agree with the name.
The refactoring looks good to me.

> > I'm going to continue reviewing this later.
> 
> Thanks! Attached is a new set of patches. The only meaningful change
> is in the 2nd patch, which I modified per your suggestion. Also, I
> moved the logic to decide each file's fate into a new subroutine
> called decide_file_action().

That looks good.

0005:

+	/*
+	 * We don't intend do any updates.  Put the connection in read-only mode
+	 * to keep us honest.
+	 */
	run_simple_command(conn, "SET default_transaction_read_only = off");

The comment has been wrong since it was added by 0004, but that's not a
problem since it was to be fixed by 0005. However, we need the variable
turned on, not off, in order to be really honest :p
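
That is, once 0005 is in, the snippet presumably needs to end up like the
sketch below for the comment to be true; this only restates the review
point, it is not the committed hunk:

	/*
	 * We don't intend to do any updates.  Put the connection in read-only
	 * mode to keep us honest.
	 */
	run_simple_command(conn, "SET default_transaction_read_only = on");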

> /*
>  * Also check that full_page_writes is enabled.  We can get torn pages if
>  * a page is modified while we read it with pg_read_binary_file(), and we
>  * rely on full page images to fix them.
>  */
> str = run_simple_query(conn, "SHOW full_page_writes");
> if (strcmp(str, "on") != 0)
>   pg_fatal("full_page_writes must be enabled in the source server");
> pg_free(str);

This part is not changed by this patch set. But if we allow connecting
to a standby, this check can be tricked by setting the GUC off on the
primary and "on" on the standby (FWIW, though). Some protective measure
might be necessary. (Maybe the standby should be restricted to having
the same value as the primary.)


+		thislen = Min(len, CHUNK_SIZE - prev->length);
+		src->request_queue[src->num_requests - 1].length += thislen;

prev == &src->request_queue[src->num_requests - 1] here.
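
That is, since prev already points at that queue entry, the increment
could presumably be written through it, something like:

	thislen = Min(len, CHUNK_SIZE - prev->length);
	prev->length += thislen;	/* same entry as request_queue[num_requests - 1] */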


+	if (chunksize > rq->length)
+	{
+		pg_fatal("received more than requested for file \"%s\"",
+				 rq->path);
+		/* receiving less is OK, though */

Don't we need to truncate the target file, though?


+	 * Source is a local data directory. It should've shut down cleanly,
+	 * and we must to the latest shutdown checkpoint.

"and we must to the" => "and we must replay to the" ?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 6cfbb5a686e5b87159aed9769b1e52859bae02b4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 19 Aug 2020 15:34:35 +0300
Subject: [PATCH v3 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c

For consistency. All the other low-level functions that operate on the
target directory are in file_ops.c.
---
 src/bin/pg_rewind/file_ops.c  | 19 +++
 src/bin/pg_rewind/file_ops.h  |  1 +
 src/bin/pg_rewind/pg_rewind.c | 22 +-
 src/bin/pg_rewind/pg_rewind.h |  1 +
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c 

Re: Concurrency issue in pg_rewind

2020-09-18 Thread Alexander Kukushkin
Hi,

On Fri, 18 Sep 2020 at 08:59, Michael Paquier  wrote:

> If this stuff is willing to do so, you may have your reasons, but even
> if you wish to locate both pg_wal/ and the prefetch path in the same
> partition, I don't get why it is necessary to have the prefetch path
> included directly in pg_wal?  You could just use different paths for
> both.  Say, with a base partition at /my/path/, you can just have
> /my/path/pg_wal/ that the Postgres backend links to, and
> /my/path/wal-g/prefetch/ for the secondary path.

Well, I agree and disagree at the same time.
Yes, it would be nice not to write anything unexpected to PGDATA and
pg_wal, but this is also a usability issue.
Creating a separate directory and configuring wal-e/wal-g to use it
(right now it is not even configurable) requires some effort from the
administrator, while using something inside pg_wal just works.

At the same time, such a "fatal" error makes pg_rewind leave PGDATA in
an inconsistent state with an empty pg_control file; this is totally bad
and easily fixable. We want the specific file to be absent and it is
already absent, so why should it be a fatal error and not a warning?
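
A sketch of the behaviour being asked for here, using a plain unlink()
stand-in for pg_rewind's actual removal path and the usual frontend
logging helpers:

	/*
	 * Sketch only: if the file the rewind plan wants gone is already
	 * gone, warn (or stay silent) instead of bailing out fatally.
	 */
	if (unlink(path) != 0)
	{
		if (errno == ENOENT)
			pg_log_warning("file \"%s\" was concurrently removed", path);
		else
			pg_fatal("could not remove file \"%s\": %m", path);
	}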

Regards,
--
Alexander Kukushkin




Re: Concurrency issue in pg_rewind

2020-09-18 Thread Michael Paquier
On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:
> This is whole point of having prefetch. restore_command just links
> file from the same partition.

If this stuff is willing to do so, you may have your reasons, but even
if you wish to locate both pg_wal/ and the prefetch path in the same
partition, I don't get why it is necessary to have the prefetch path
included directly in pg_wal?  You could just use different paths for
both.  Say, with a base partition at /my/path/, you can just have
/my/path/pg_wal/ that the Postgres backend links to, and
/my/path/wal-g/prefetch/ for the secondary path.
--
Michael


signature.asc
Description: PGP signature


Re: Concurrency issue in pg_rewind

2020-09-18 Thread Andrey M. Borodin



> On 18 Sept 2020, at 11:10, Michael Paquier  wrote:
> 
> On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
>> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
>> that it doesn't know about.
>> What if the administrator made a backup of some WAL segments there?
> 
> IMO, this would be a rather bad strategy anyway, so just don't do
> that, because that could also mean that this is on the same partition
> as pg_wal/ 

This is the whole point of having prefetch: restore_command just links
the file from the same partition.
In WAL-G you strictly control the number of cached WALs, so if you have
configured max_wal_size, you can configure WALG_DOWNLOAD_CONCURRENCY too.

Best regards, Andrey Borodin.



Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-09-18 Thread Amit Kapila
On Wed, Sep 9, 2020 at 10:20 AM Amit Kapila  wrote:
>
> On Thu, Jul 30, 2020 at 6:42 PM Amit Kapila  wrote:
> >
> > On Thu, Jul 30, 2020 at 12:02 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Jul 29, 2020 at 7:18 PM Robert Haas  wrote:
> > > >
> > > > I still don't agree with this as proposed.
> > > >
> > > > + * For now, we don't allow parallel inserts of any form not even where the
> > > > + * leader can perform the insert.  This restriction can be uplifted once
> > > > + * we allow the planner to generate parallel plans for inserts.  We can
> > > >
> > > > If I'm understanding this correctly, this logic is completely
> > > > backwards. We don't prohibit inserts here because we know the planner
> > > > can't generate them. We prohibit inserts here because, if the planner
> > > > somehow did generate them, it wouldn't be safe. You're saying that
> > > > it's not allowed because we don't try to do it yet, but actually it's
> > > > not allowed because we want to make sure that we don't accidentally
> > > > try to do it. That's very different.
> > > >
> > >
> > > Right, so how about something like: "To allow parallel inserts, we
> > > need to ensure that they are safe to be performed in workers.  We have
> > > the infrastructure to allow parallel inserts in general except for the
> > > case where inserts generate a new commandid (eg. inserts into a table
> > > having a foreign key column)."
>
> Robert, Dilip, do you see any problem if we change the comment on the
> above lines? Feel free to suggest if you have something better in
> mind.
>

Hearing no further comments, I have pushed the changes as discussed above.
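
For readers arriving late in the thread, the restriction the reworded
comment describes is enforced along these lines; this is a paraphrase of
the executor-side guard with approximate message wording, not the pushed
hunk:

	/*
	 * Sketch of the guard: a parallel worker must never insert tuples;
	 * the planner separately refrains from generating parallel plans for
	 * inserts (e.g. ones needing a new command ID, such as inserts into
	 * a table with a foreign key column).
	 */
	if (IsParallelWorker())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot insert tuples in a parallel worker")));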

-- 
With Regards,
Amit Kapila.




Re: logical/relation.c header description

2020-09-18 Thread Amit Kapila
On Fri, Sep 18, 2020 at 9:07 AM Amit Langote  wrote:
>
> On Fri, Sep 18, 2020 at 12:30 PM Amit Kapila  wrote:
> > I am not very excited to add information about structures used in the
> > file especially when we have explained 'LogicalRepPartMapEntry' a few
> > lines after. I think we can leave explaining structures but otherwise,
> > it looks good. See attached.
>
> Works for me, thanks.
>

Pushed, thanks!

-- 
With Regards,
Amit Kapila.




Re: Concurrency issue in pg_rewind

2020-09-18 Thread Michael Paquier
On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
> that it doesn't know about.
> What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that, because that could also mean that this is on the same partition
as pg_wal/, which would crash the server if the partition happens to
fill up even when max_wal_size is set correctly.  If you think about
it, this is rather similar to putting tablespaces in your root data
folder: it may look fancy, but you just make the backend's task more
complicated.  Note that this practice is also unwise when Postgres
applies actions that loop across the entire tree, to name two of them:
the full fsync of PGDATA at the beginning of crash recovery, or a base
backup.
--
Michael


signature.asc
Description: PGP signature