Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-07-01 Thread Paul Guo
On Tue, Jul 2, 2019 at 12:35 AM Alvaro Herrera 
wrote:

> On 2019-Apr-19, Paul Guo wrote:
>
> > The below patch runs single mode Postgres if needed to make sure the
> target
> > is cleanly shutdown. A new option is added (off by default).
> > v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch
>
> Why do we need an option for this?  Is there a reason not to do this
> unconditionally?
>

There is concern about this (see previous emails in this thread). On
Greenplum (an MPP DB based on Postgres), we unconditionally do this. I'm
not sure how Postgres users usually handle an unclean shutdown, but
providing an option seems safer, to avoid breaking existing scripts or
services. If many people think the option is unnecessary, I'm fine with
removing it and keeping just the code logic.


Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-07-01 Thread Paul Guo
Rebased, aligned with recent changes in pg_rewind/pg_basebackup and then
retested. Thanks.

On Mon, Jul 1, 2019 at 7:35 PM Thomas Munro  wrote:

> On Fri, Apr 19, 2019 at 3:41 PM Paul Guo  wrote:
> > I updated the patches as attached following previous discussions.
>
> Hi Paul,
>
> Could we please have a fresh rebase now that the CF is here?
>
> Thanks,
>
> --
> Thomas Munro
>
> https://enterprisedb.com
>


v3-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
Description: Binary data


v3-0002-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data


v3-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch
Description: Binary data


Re: Refactoring base64 encoding and decoding into a safer interface

2019-07-01 Thread Michael Paquier
On Mon, Jul 01, 2019 at 11:11:43PM +0200, Daniel Gustafsson wrote:
> I very much agree that functions operating on a buffer like this should have
> the size of the buffer in order to safeguard against overflow, so +1 on the
> general concept.

Thanks for the review!

> A few small comments:
> 
> In src/common/scram-common.c there are a few instances like this.
> Shouldn’t we also free the result buffer in these cases?
> 
> +#ifdef FRONTEND
> +   return NULL;
> +#else

Fixed.

> In the below passage, we leave the input buffer with a non-complete
> encoded string.  Should we memset the buffer to zero to avoid the
> risk that code which fails to check the return value believes it has
> an encoded string?

Hmm.  Good point.  I have not thought of that, and your suggestion
makes sense.

Another question is if we'd want to actually use explicit_bzero()
here, but that could be a discussion on this other thread, except if
the patch discussed there is merged first:
https://www.postgresql.org/message-id/42d26bde-5d5b-c90d-87ae-6cab875f7...@2ndquadrant.com

Attached is an updated patch.
--
Michael
diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index 6b60abe1dd..3a31afc7b7 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -510,9 +510,11 @@ scram_verify_plain_password(const char *username, const char *password,
 		return false;
 	}
 
-	salt = palloc(pg_b64_dec_len(strlen(encoded_salt)));
-	saltlen = pg_b64_decode(encoded_salt, strlen(encoded_salt), salt);
-	if (saltlen == -1)
+	saltlen = pg_b64_dec_len(strlen(encoded_salt));
+	salt = palloc(saltlen);
+	saltlen = pg_b64_decode(encoded_salt, strlen(encoded_salt), salt,
+			saltlen);
+	if (saltlen < 0)
 	{
 		ereport(LOG,
 (errmsg("invalid SCRAM verifier for user \"%s\"", username)));
@@ -596,9 +598,10 @@ parse_scram_verifier(const char *verifier, int *iterations, char **salt,
 	 * Verify that the salt is in Base64-encoded format, by decoding it,
 	 * although we return the encoded version to the caller.
 	 */
-	decoded_salt_buf = palloc(pg_b64_dec_len(strlen(salt_str)));
+	decoded_len = pg_b64_dec_len(strlen(salt_str));
+	decoded_salt_buf = palloc(decoded_len);
 	decoded_len = pg_b64_decode(salt_str, strlen(salt_str),
-decoded_salt_buf);
+decoded_salt_buf, decoded_len);
 	if (decoded_len < 0)
 		goto invalid_verifier;
 	*salt = pstrdup(salt_str);
@@ -606,16 +609,18 @@ parse_scram_verifier(const char *verifier, int *iterations, char **salt,
 	/*
 	 * Decode StoredKey and ServerKey.
 	 */
-	decoded_stored_buf = palloc(pg_b64_dec_len(strlen(storedkey_str)));
+	decoded_len = pg_b64_dec_len(strlen(storedkey_str));
+	decoded_stored_buf = palloc(decoded_len);
 	decoded_len = pg_b64_decode(storedkey_str, strlen(storedkey_str),
-decoded_stored_buf);
+decoded_stored_buf, decoded_len);
 	if (decoded_len != SCRAM_KEY_LEN)
 		goto invalid_verifier;
 	memcpy(stored_key, decoded_stored_buf, SCRAM_KEY_LEN);
 
-	decoded_server_buf = palloc(pg_b64_dec_len(strlen(serverkey_str)));
+	decoded_len = pg_b64_dec_len(strlen(serverkey_str));
+	decoded_server_buf = palloc(decoded_len);
 	decoded_len = pg_b64_decode(serverkey_str, strlen(serverkey_str),
-decoded_server_buf);
+decoded_server_buf, decoded_len);
 	if (decoded_len != SCRAM_KEY_LEN)
 		goto invalid_verifier;
 	memcpy(server_key, decoded_server_buf, SCRAM_KEY_LEN);
@@ -649,8 +654,18 @@ mock_scram_verifier(const char *username, int *iterations, char **salt,
 	/* Generate deterministic salt */
 	raw_salt = scram_mock_salt(username);
 
-	encoded_salt = (char *) palloc(pg_b64_enc_len(SCRAM_DEFAULT_SALT_LEN) + 1);
-	encoded_len = pg_b64_encode(raw_salt, SCRAM_DEFAULT_SALT_LEN, encoded_salt);
+	encoded_len = pg_b64_enc_len(SCRAM_DEFAULT_SALT_LEN);
+	/* don't forget the zero-terminator */
+	encoded_salt = (char *) palloc(encoded_len + 1);
+	encoded_len = pg_b64_encode(raw_salt, SCRAM_DEFAULT_SALT_LEN, encoded_salt,
+encoded_len);
+
+	/*
+	 * Note that we cannot reveal any information to an attacker here so the
+	 * error message needs to remain generic.
+	 */
+	if (encoded_len < 0)
+		elog(ERROR, "could not encode salt");
 	encoded_salt[encoded_len] = '\0';
 
 	*salt = encoded_salt;
@@ -1144,8 +1159,15 @@ build_server_first_message(scram_state *state)
 (errcode(ERRCODE_INTERNAL_ERROR),
  errmsg("could not generate random nonce")));
 
-	state->server_nonce = palloc(pg_b64_enc_len(SCRAM_RAW_NONCE_LEN) + 1);
-	encoded_len = pg_b64_encode(raw_nonce, SCRAM_RAW_NONCE_LEN, state->server_nonce);
+	encoded_len = pg_b64_enc_len(SCRAM_RAW_NONCE_LEN);
+	/* don't forget the zero-terminator */
+	state->server_nonce = palloc(encoded_len + 1);
+	encoded_len = pg_b64_encode(raw_nonce, SCRAM_RAW_NONCE_LEN,
+state->server_nonce, encoded_len);
+	if (encoded_len < 0)
+		ereport(ERROR,
+(errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("could not encode random nonce")));
 	state->server_nonce[encoded_len] = 

Re: cleanup & refactoring on reindexdb.c

2019-07-01 Thread Julien Rouhaud
On Tue, Jul 2, 2019 at 4:44 AM Michael Paquier  wrote:
>
> On Sat, Jun 29, 2019 at 11:24:49AM +0900, Michael Paquier wrote:
> > Thanks for the rebase (and the reminder..).  I'll look at that once
> > v13 opens for business.
>
> And applied.

Thanks!




Re: MSVC Build support with visual studio 2019

2019-07-01 Thread Michael Paquier
On Mon, Jul 01, 2019 at 07:56:29PM +1000, Haribabu Kommi wrote:
> We stopped the support of building with all the visual studio versions less
> than 2013.  I updated the SDK versions accordingly.

I have spent some time looking around, and wikipedia-sensei has proved
to be helpful to grasp the release references:
https://en.wikipedia.org/wiki/Microsoft_Windows_SDK

So the suggestions from the patch are fine.  This one was actually
forgotten:
src/tools/msvc/README:from www.microsoft.com (v6.0 or greater).

> The similar change is required for the CreateProject also.

I have changed both messages so that the version of VS being attempted is
reported in the error message directly.

> + # The major visual studio that is supported has nmake
> + # version >= 14.30, so stick with it as the latest version
> 
> The major visual studio version that is supported has nmake version
> <=14.30

Damn.  Thanks for pointing that out.

> Except for the above two changes, overall the patch is in good shape.

OK, committed to HEAD for now after perltidy'ing the patch.  Let's see
what the buildfarm has to say about it first.  Once we are sure that
the thing is stable, I'll try to backpatch it.  This works on my own
dev machines with VS 2015 and 2019, but who knows what hides in the
shadows... 
--
Michael


signature.asc
Description: PGP signature


Re: Add missing operator <->(box, point)

2019-07-01 Thread Tom Lane
[ warning, drive-by comment ahead ]

Fabien COELHO  writes:
> I notice that other distance tests do not test for commutativity. Are they 
> also not implemented, or just not tested? If not implemented, I'd suggest 
> to add them in the same batch.

Yeah ... just looking at operators named <->, I see

regression=# select oid, oid::regoperator, oprcom, oprcode from pg_operator 
where oprname = '<->';
 oid  |         oid          | oprcom |          oprcode
------+----------------------+--------+---------------------------
  517 | <->(point,point)     |    517 | point_distance
  613 | <->(point,line)      |      0 | dist_pl
  614 | <->(point,lseg)      |      0 | dist_ps
  615 | <->(point,box)       |      0 | dist_pb
  616 | <->(lseg,line)       |      0 | dist_sl
  617 | <->(lseg,box)        |      0 | dist_sb
  618 | <->(point,path)      |      0 | dist_ppath
  706 | <->(box,box)         |    706 | box_distance
  707 | <->(path,path)       |    707 | path_distance
  708 | <->(line,line)       |    708 | line_distance
  709 | <->(lseg,lseg)       |    709 | lseg_distance
  712 | <->(polygon,polygon) |    712 | poly_distance
 1520 | <->(circle,circle)   |   1520 | circle_distance
 1522 | <->(point,circle)    |   3291 | dist_pc
 3291 | <->(circle,point)    |   1522 | dist_cpoint
 3276 | <->(point,polygon)   |   3289 | dist_ppoly
 3289 | <->(polygon,point)   |   3276 | dist_polyp
 1523 | <->(circle,polygon)  |      0 | dist_cpoly
 1524 | <->(line,box)        |      0 | dist_lb
 5005 | <->(tsquery,tsquery) |      0 | pg_catalog.tsquery_phrase
(20 rows)

It's not clear to me why to be particularly more excited about
<->(box, point) than about the other missing cases here.

regards, tom lane




Re: Code comment change

2019-07-01 Thread Peter Geoghegan
On Mon, Jul 1, 2019 at 7:28 PM Tom Lane  wrote:
> Even more interesting, the same para also exists verbatim in
> v4r2's src/backend/access/nobtree/nobtpage.c, which is dated 1991-10-29
> in the same tarball.  (If you're wondering, "nobtree" seems to stand
> for "no-overwrite btree"; so I suppose it went the way of all flesh
> when Stonebraker lost interest in write-once mass storage.)  So presumably
> this comment dates back to some common ancestor of the mainline btree code
> and the no-overwrite code, which must have been even older than the 1991
> date.

"no-overwrite btree" is described here, if you're interested:

https://pdfs.semanticscholar.org/a0de/438d5efd96e8af51bc7595aa1c30d0497a57.pdf

This is a link to the B-Tree focused paper "An Index Implementation
Supporting Fast Recovery for the POSTGRES Storage System". I found
that the paper provided me with some interesting historic context. I
am pretty sure that the authors were involved in early work on the
Postgres B-Tree code. It references Lanin and Shasha, even though the
nbtree code that is influenced by L first appears in the same 2003
commit of yours that I mentioned.

> > I think that the whole sentence about "the standard class of race
> > conditions" should go. There is no more dance. Nothing in
> > _bt_getroot() is surprising to me. The other comments explain things
> > comprehensively.
>
> +1

I'll take care of it soon.

-- 
Peter Geoghegan




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Michael Paquier
On Mon, Jul 01, 2019 at 06:14:20PM +0200, Julien Rouhaud wrote:
> On Mon, Jul 1, 2019 at 3:51 PM Alvaro Herrera  
> wrote:
> >
> > Please don't reuse a file name as generic as "parallel.c" -- it's
> > annoying when navigating source.  Maybe conn_parallel.c multiconn.c
> > connscripts.c admconnection.c ...?
> 
> I could use scripts_parallel.[ch] as I've already used it in the
> #define part?

multiconn.c sounds rather good, but I have a poor ear for any kind of
naming..

>> If your server crashes or is stopped midway during the reindex, you
>> would have to start again from scratch, and it's tedious (if it's
>> possible at all) to determine which indexes were missed.  I think it
>> would be useful to have a two-phase mode: in the initial phase reindexdb
>> computes the list of indexes to be reindexed and saves them into a work
>> table somewhere.  In the second phase, it reads indexes from that table
>> and processes them, marking them as done in the work table.  If the
>> second phase crashes or is stopped, it can be restarted and consults the
>> work table.  I would keep the work table, as it provides a bit of an
>> audit trail.  It may be important to be able to run even if unable to
>> create such a work table (because of the numerous users that
>> DROP DATABASE postgres).
> 
> Or we could create a table locally in each database, that would fix
> this problem and probably make the code simpler?
> 
> It also raises some additional concerns about data expiration.  I
> guess that someone could launch the tool by mistake, kill reindexdb,
> and run it again 2 months later while a lot of new objects have been
> added for instance.

These look like fancy additions, still that's not the core of the
problem, no?  If you begin to play in this area you would need more
control options: basically a "continue" mode to be able to restart a
previously failed attempt, a "reinit" mode able to restart the
operation completely from scratch, and perhaps even a "reset" mode
which cleans up any data already present.  Not overly complex,
but this has to be maintained at the database level.

>>  The --glibc-dependent
>> switch seems too ad-hoc.  Maybe "--exclude-rule=glibc"?  That way we can
>> add other rules later.  (Not "--exclude=foo" because we'll want to add
>> the possibility to ignore specific indexes by name.)
> 
> That's a good point, I like the --exclude-rule switch.

Sounds kind of nice.
--
Michael


signature.asc
Description: PGP signature


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Michael Paquier
On Mon, Jul 01, 2019 at 06:28:13PM +0200, Julien Rouhaud wrote:
> On Mon, Jul 1, 2019 at 4:10 PM Tom Lane  wrote:
>> Couldn't we make this enormously simpler and less bug-prone by just
>> dictating that --jobs applies only to reindex-table operations?

I had the same argument about the first patch sets actually, but... :)

> That would also mean that we'll have to fallback on doing reindex at
> table-level, even if we only want to reindex indexes that depends on
> glibc.  I'm afraid that this will often add a huge penalty.

Yes, I would expect that most of the time glibc-sensitive indexes are
also mixed with other ones which we don't care about here.  One
advantage of the argument from Tom though is that it is possible to
introduce --jobs with minimal steps:
1) Refactor the code for connection slots, without the cell addition
2) Introduce --jobs without INDEX support.

In short, the conflict business between indexes is something which
could be tackled afterwards and with a separate patch.  Parallel
reindexing at table level has value in itself, particularly with
CONCURRENTLY coming into the picture.
--
Michael


signature.asc
Description: PGP signature


Re: cleanup & refactoring on reindexdb.c

2019-07-01 Thread Michael Paquier
On Sat, Jun 29, 2019 at 11:24:49AM +0900, Michael Paquier wrote:
> Thanks for the rebase (and the reminder..).  I'll look at that once
> v13 opens for business.

And applied.
--
Michael


signature.asc
Description: PGP signature


Re: Code comment change

2019-07-01 Thread Tom Lane
Peter Geoghegan  writes:
> On Sun, Jun 23, 2019 at 3:36 AM Thomas Munro  wrote:
>> Pushed.  Thanks!

> I wonder what the comment is supposed to mean.
> I think that it's addressing the situation prior to commit 70508ba7aed
> in 2003, which was the point when the "fast" root concept was
> introduced.

Yeah.  I did some research into the provenance of that comment when
Thomas pushed the change.  It's *old*.  The whole para exists verbatim
in Postgres v4r2, src/backend/access/nbtree/nbtpage.c dated 1993-12-10
(in my copy of that tarball).  The only change since then has been to
change the whitespace for 4-space tabs.

Even more interesting, the same para also exists verbatim in
v4r2's src/backend/access/nobtree/nobtpage.c, which is dated 1991-10-29
in the same tarball.  (If you're wondering, "nobtree" seems to stand
for "no-overwrite btree"; so I suppose it went the way of all flesh
when Stonebraker lost interest in write-once mass storage.)  So presumably
this comment dates back to some common ancestor of the mainline btree code
and the no-overwrite code, which must have been even older than the 1991
date.

This is only marginally relevant to what we should do about it today,
but I think it's reasonable to conclude that the current locking
considerations are nearly unrelated to what they were when the comment
was written.

> I think that the whole sentence about "the standard class of race
> conditions" should go. There is no more dance. Nothing in
> _bt_getroot() is surprising to me. The other comments explain things
> comprehensively.

+1

regards, tom lane




Re: Code comment change

2019-07-01 Thread Peter Geoghegan
On Sun, Jun 23, 2019 at 3:36 AM Thomas Munro  wrote:
> Pushed.  Thanks!

I wonder what the comment is supposed to mean.

I think that it's addressing the situation prior to commit 70508ba7aed
in 2003, which was the point when the "fast" root concept was
introduced. Prior to that commit, there was only what we would now
call a true root, and _bt_getroot() had to loop to make sure that it
reliably found it without deadlocking, while dealing with concurrent
splits. This was necessary because the old design also involved
maintaining a pointer to each page's parent in each page, which sounds
like a seriously bad approach to me.

I think that the whole sentence about "the standard class of race
conditions" should go. There is no more dance. Nothing in
_bt_getroot() is surprising to me. The other comments explain things
comprehensively.

-- 
Peter Geoghegan




Re: Rearranging ALTER TABLE to avoid multi-operations bugs

2019-07-01 Thread Tom Lane
movead li  writes:
> I applied the 'alter-table-with-recursive-process-utility-calls-wip.patch'
> on master (e788e849addd56007a0e75f3b5514f294a0f3bca). When I test the
> cases, I find it works well for the 'alter table t1 add column f2 int not
> null, alter column f2 add generated always as identity' case, but it
> doesn't work for #14827, #15180, #15670, #15710.

This review seems not very on-point, because I made no claim to have fixed
any of those bugs.  The issue at the moment is how to structure the code
to allow ALTER TABLE to call other utility statements --- or, if we aren't
going to do that as Robert seems not to want to, what exactly we're going
to do instead.

The patch at hand does fix some ALTER TABLE ... IDENTITY bugs, because
fixing those doesn't require any conditional execution of utility
statements.  But we'll need infrastructure for such conditional execution
to fix the original bugs.  I don't see much point in working on that part
until we have some agreement about how to handle what this patch is
already doing.

regards, tom lane




Re: Hash join explain is broken

2019-07-01 Thread Tom Lane
Andres Freund  writes:
> Tom, any comments? Otherwise I'll go ahead, and commit after a round or
> two of polishing.

Sorry for not getting to this sooner --- I'll try to look tomorrow.

regards, tom lane




Re: Hash join explain is broken

2019-07-01 Thread Andres Freund
Hi,

On 2019-06-18 00:00:28 -0700, Andres Freund wrote:
> On 2019-06-13 16:23:34 -0700, Andres Freund wrote:
> > On June 13, 2019 3:38:47 PM PDT, Tom Lane  wrote:
> > >Andres Freund  writes:
> > >> I am too tired to look further into this. I suspect the only reason we
> > >> didn't previously run into trouble with the executor stashing hashkeys
> > >> manually at a different tree level with:
> > >> ((HashState *) innerPlanState(hjstate))->hashkeys
> > >> is that hashkeys itself isn't printed...
> > >
> > >TBH, I think 5f32b29c is just wrong and should be reverted for now.
> > >If there's a need to handle those expressions differently, it will
> > >require some cooperation from the planner not merely a two-line hack
> > >in executor startup.  That commit didn't include any test case or
> > >other demonstration that it was solving a live problem, so I think
> > >we can leave it for v13 to address the issue.
> > 
> > I'm pretty sure you'd get an assertion failure if you reverted it
> > (that's why it was added). So it's a bit more complicated than that.
> > Unfortunately I'll not get back to work until Monday, but I'll spend
> > time on this then.
> 
> Indeed, there are assertion failures when initializing the expression
> with HashJoinState as parent - that's because when computing the
> hashvalue for nodeHash input, we expect the slot from the node below to
> be of the type that HashState returns (as that's what INNER_VAR for an
> expression at the HashJoin level refers to), rather than the type of the
> input to HashState.  We could work around that by marking the slots from
> underlying nodes as being of an unknown type, but that'd slow down
> execution.
> 
> I briefly played with the dirty hack of set_deparse_planstate()
> setting dpns->inner_planstate = ps for IsA(ps, HashState), but that
> seems just too ugly.
> 
> I think the most straight-forward fix might just be to just properly
> split the expression at plan time. Adding workarounds for things as
> dirty as building an expression for a subsidiary node in the parent, and
> then modifying the subsidiary node from the parent, doesn't seem like a
> better way forward.
> 
> The attached *prototype* does so.
> 
> If we go that way, we probably need to:
> - Add a test for the failure case at hand
> - check a few of the comments around inner/outer in nodeHash.c
> - consider moving the setrefs.c code into its own function?
> - probably clean up the naming scheme in createplan.c
> 
> I think there's a few more things we could do, although it's not clear
> that that needs to happen in v12:
> - Consider not extracting hj_OuterHashKeys, hj_HashOperators,
>   hj_Collations out of HashJoin->hashclauses, and instead just directly
>   handing them individually in the planner.  create_mergejoin_plan()
>   already partially does that.

Tom, any comments? Otherwise I'll go ahead, and commit after a round or
two of polishing.

- Andres




Re: Usage of epoch in txid_current

2019-07-01 Thread Thomas Munro
On Sun, Jun 30, 2019 at 9:07 AM Robert Haas  wrote:
> On Mon, Jun 24, 2019 at 1:43 PM Alvaro Herrera  
> wrote:
> > I think enlarging multixacts to 64 bits is a terrible idea.  I would
> > rather get rid of multixacts completely; zheap is proposing not to use
> > multixacts at all, for example.
>
> But zedstore, at least as of the Saturday after PGCon, is proposing to
> keep using them, after first widening them to 64 bits:
>
> https://www.postgresql.org/message-id/CA+TgmoYeTeQSmALox0PmSm5Gh03oe=unjhml+k+btofy_u2...@mail.gmail.com
>
> I think we all have a visceral reaction against MultiXacts at this
> point; they've just caused us too much pain, and the SLRU
> infrastructure is a bunch of crap.[1] However, the case where N
> sessions all do "SELECT * FROM sometab FOR SHARE" is pretty wretched
> without multixacts.  You'll just have to keep reducing the tuple
> density per page to fit all the locks, whereas the current heap is
> totally fine and neither the heap nor the multixact space bloats all
> that terribly much.
>
> I currently think it's likely that zheap will use something
> multixact-like rather than actually using multixacts per se.  I am
> having trouble seeing how we can have some sort of fixed-bit-width ID
> number that represents a set of XIDs for purposes of lock-tracking
> without creating some nasty degenerate cases that don't exist
> today.[2]

The new thing described over here is intended to support something a
bit like multixacts:

https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com

For example, here is some throw-away demo code that puts arbitrary
data into an undo log, in this case a list of xids given with SELECT
test_multixact(ARRAY[1234, 2345, 3456]), and provides a callback to
say when the data can be discarded, in this case when all of those
xids are no longer running.  You can see the record with SELECT * FROM
undoinspect('shared'), until it gets eaten by a background worker.
The reason it has to be an in-core demo is just because we don't have
a way to do extensions that have an RMGR ID and callback functions
yet.  Although this demo throws the return value away, the function
PrepareUndoInsert() returns a 64 bit UndoRecPtr which could be stored
in any page and can be used to retrieve the records (via the buffer
pool) including the binary payload which can be whatever struct you
like (though there is a size limit so you might need more than one
record for a huge list of multi-lockers).  When you try to load the
record, you might be told that it's gone, which means that the lockers
are gone, which means that your callback must have said that was OK.
This probably makes most sense for a system that is already planning
to use UndoRecPtr for other things, like MVCC, since it probably
already has per page or per tuple space to store UndoRecPtr, and
updates and explicit locks are so closely related.

https://github.com/EnterpriseDB/zheap/commit/c92fdfd1f1178cbb44557a7c630b366d1539c332

-- 
Thomas Munro
https://enterprisedb.com




Re: Usage of epoch in txid_current

2019-07-01 Thread Thomas Munro
On Sat, Jun 22, 2019 at 11:01 AM Alexander Korotkov
 wrote:
> 3. Change SLRU on-disk format to handle 64-bit xids and multixacts.
> In particular make SLRU page numbers 64-bit too.  Add SLRU upgrade
> procedure to pg_upgrade.

+1 for doing this for the xid-indexed SLRUs so the segment file names
are never recycled.  Having yet another level of wraparound code to
feed and water and debug is not nice:

https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

-- 
Thomas Munro
https://enterprisedb.com




Re: Increasing default value for effective_io_concurrency?

2019-07-01 Thread Andres Freund
Hi,

On 2019-06-29 22:15:19 +0200, Tomas Vondra wrote:
> I think we should consider changing the effective_io_concurrency default
> value, i.e. the guc that determines how many pages we try to prefetch in
> a couple of places (the most important being Bitmap Heap Scan).

Maybe we need to improve the way it's used / implemented instead - it seems
just too hard to determine the correct setting as currently implemented.
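
(For context, the prefetching that this GUC drives is, at least on builds
with USE_PREFETCH, essentially a stream of posix_fadvise(POSIX_FADV_WILLNEED)
hints, one per heap block the bitmap scan expects to read soon.  A minimal
stand-alone sketch of that primitive; the file name and length below are
arbitrary, nothing PostgreSQL actually uses:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int     fd = open("/etc/hosts", O_RDONLY);
        int     rc;

        if (fd < 0)
            return 1;
        /* hint the kernel to start reading the first 8 kB in the background */
        rc = posix_fadvise(fd, 0, 8192, POSIX_FADV_WILLNEED);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
        /* ... a later read() of that range should then hit the page cache */
        close(fd);
        return 0;
    }
)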


> In some cases it helps a bit, but a bit higher value (4 or 8) performs
> significantly better. Consider for example this "sequential" data set
> from the 6xSSD RAID system (x-axis shows e_i_c values, pct means what
> fraction of pages matches the query):

I assume that the y axis is the time of the query?

How much data is this compared to memory available for the kernel to do
caching?


>    pct        0        1        4       16       64      128
>    ----------------------------------------------------------
>      1    25990    18624     3269     2219     2189     2171
>      5    88116    60242    14002     8663     8560     8726
>     10   120556    99364    29856    17117    16590    17383
>     25   101080   184327    79212    47884    46846    46855
>     50   130709   309857   163614   103001    94267    94809
>     75   126516   435653   248281   156586   139500   140087
> 
> compared to the e_i_c=0 case, it looks like this:
> 
>    pct        1        4       16       64      128
>    -------------------------------------------------
>      1      72%      13%       9%       8%       8%
>      5      68%      16%      10%      10%      10%
>     10      82%      25%      14%      14%      14%
>     25     182%      78%      47%      46%      46%
>     50     237%     125%      79%      72%      73%
>     75     344%     196%     124%     110%     111%
> 
> So for 1% of the table the e_i_c=1 is faster by about ~30%, but with
> e_i_c=4 (or more) it's ~10x faster. This is a fairly common pattern, not
> just on this storage system.
> 
> The e_i_c=1 can perform pretty poorly, especially when the query matches
> large fraction of the table - for example in this example it's 2-3x
> slower compared to no prefetching, and higher e_i_c values limit the
> damage quite a bit.

I'm surprised the slowdown for small e_i_c values is that big - it's not
obvious to me why that is.  Which os / os version / filesystem / io
scheduler / io scheduler settings were used?

Greetings,

Andres Freund




Re: POC: converting Lists into arrays

2019-07-01 Thread Tom Lane
I spent some time experimenting with the idea mentioned upthread of
adding a macro to support deletion of a foreach loop's current element
(adjusting the loop's state behind the scenes).  This turns out to work
really well: it reduces the complexity of fixing existing loops around
element deletions quite a bit.  Whereas in existing code you have to not
use foreach() at all, and you have to track both the next list element and
the previous undeleted element, now you can use foreach() and you don't
have to mess with extra variables at all.

A good example appears in the trgm_regexp.c changes below.  Typically
we've coded such loops with a handmade expansion of foreach, like

prev = NULL;
cell = list_head(state->enterKeys);
while (cell)
{
    TrgmStateKey *existingKey = (TrgmStateKey *) lfirst(cell);

    next = lnext(cell);
    if (need to delete)
        state->enterKeys = list_delete_cell(state->enterKeys,
                                            cell, prev);
    else
        prev = cell;
    cell = next;
}

My previous patch would have had you replace this with a loop using
an integer list-position index.  You can still do that if you like,
but it's less change to convert the loop to a foreach(), drop the
prev/next variables, and replace the list_delete_cell call with
foreach_delete_current:

foreach(cell, state->enterKeys)
{
    TrgmStateKey *existingKey = (TrgmStateKey *) lfirst(cell);

    if (need to delete)
        state->enterKeys = foreach_delete_current(state->enterKeys,
                                                  cell);
}

So I think this is a win, and attached is v7.

regards, tom lane



reimplement-List-as-array-7.patch.gz
Description: reimplement-List-as-array-7.patch.gz


Re: Are there still non-ELF BSD systems?

2019-07-01 Thread Peter Eisentraut
On 2019-06-12 16:06, Tom Lane wrote:
> Peter Eisentraut  writes:
> I checked around a bit ... all of the *BSD systems in the buildfarm
> report ELF_SYS='true', so it's safe to say that the code you want to
> remove is untested.
> 
> Excavation in the various BSDens' change logs suggests that the last
> one to fully drop a.out was OpenBSD, which didn't do so until 5.5
> (released 1 May 2015).  That's more recent than I'd have hoped for,
> though it looks like the holdout architectures were ones we don't
> support anyway (e.g., m68k, vax).
> 
> If we're considering this change for v13, it's hard to believe
> there'd be any objections in practice.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: range_agg

2019-07-01 Thread Jeff Davis
On Fri, 2019-05-03 at 15:56 -0700, Paul Jungwirth wrote:
> Hello,
> 
> I wrote an extension to add a range_agg function with similar
> behavior 
> to existing *_agg functions, and I'm wondering if folks would like
> to 
> have it in core? Here is the repo: 
> https://github.com/pjungwir/range_agg

This seems like a very useful extension, thank you.

For getting into core though, it should be a more complete set of
related operations. The patch is implicitly introducing the concept of
a "multirange" (in this case, an array of ranges), but it's not making
the concept whole.

What else should return a multirange? This patch implements the union-
agg of ranges, but we also might want an intersection-agg of ranges
(that is, the set of ranges that are subranges of every input). Given
that there are other options here, the name "range_agg" is too generic
to mean union-agg specifically.

What can we do with a multirange? A lot of range operators still make
sense, like "contains" or "overlaps"; but "adjacent" doesn't quite
work. What about new operations like inverting a multirange to get the
gaps?

Do we want to continue with the array-of-ranges implementation of a
multirange, or do we want a first-class multirange concept that might
eliminate the boilerplate around defining all of these operations?

If we have a more complete set of operators here, the flags for
handling overlapping ranges and gaps will be unnecessary.

Regards,
Jeff Davis






Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Alvaro Herrera
On 2019-Jul-02, Thomas Munro wrote:

> On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud  wrote:
> > Even if that's just me being delusional, I'd still prefer Alvaro's
> > approach to have distinct switches for each collation system.
> 
> Hi Julien,
> 
> Makes sense.  But why use the name "glibc" in the code and user
> interface?  The name of the collation provider in PostgreSQL is "libc"
> (for example in the CREATE COLLATION command), and the problem applies
> no matter who makes your libc.

Makes sense.  "If your libc is glibc and you go across an upgrade over
version X, please use --include-rule=libc-collation"

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Thomas Munro
On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud  wrote:
> Even if that's just me being delusional, I'd still prefer Alvaro's
> approach to have distinct switches for each collation system.

Hi Julien,

Makes sense.  But why use the name "glibc" in the code and user
interface?  The name of the collation provider in PostgreSQL is "libc"
(for example in the CREATE COLLATION command), and the problem applies
no matter who makes your libc.

-- 
Thomas Munro
https://enterprisedb.com




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Peter Geoghegan
On Mon, Jul 1, 2019 at 1:34 PM Julien Rouhaud  wrote:
> I have a vague recollection that ICU was providing some backward
> compatibility so that even if you upgrade your lib you can still get
> the sort order that was active when you built your indexes, though
> maybe for a limited number of versions.

That isn't built in. Another database system that uses ICU handles
this by linking to multiple versions of ICU, each with its own UCA
version and associated collations. I don't think that we want to go
there, so it makes sense to make an upgrade that crosses ICU or glibc
versions as painless as possible.

Note that ICU does at least provide a standard way to use multiple
versions at once; the symbol names have the ICU version baked in.
You're actually calling the functions using the versioned symbol names
without realizing it, because there is macro trickery involved.

-- 
Peter Geoghegan




Re: Refactoring base64 encoding and decoding into a safer interface

2019-07-01 Thread Daniel Gustafsson
> On 23 Jun 2019, at 15:25, Michael Paquier  wrote:

> Attached is a refactoring patch for those interfaces, which introduces
> a set of overflow checks so as we cannot repeat errors of the past.

I’ve done a review of this submission.  The patch applies cleanly, and passes
make check, ssl/scram tests etc. There is enough documentation 

I very much agree that functions operating on a buffer like this should have
the size of the buffer in order to safeguard against overflow, so +1 on the
general concept.

> Any thoughts?

A few small comments:

In src/common/scram-common.c there are a few instances like this.  Shouldn’t we
also free the result buffer in these cases?

+#ifdef FRONTEND
+   return NULL;
+#else

In the below passage, we leave the input buffer with a non-complete encoded
string.  Should we memset the buffer to zero to avoid the risk that code which
fails to check the return value believes it has an encoded string?

+		/*
+		 * Leave if there is an overflow in the area allocated for
+		 * the encoded string.
+		 */
+		if ((p - dst + 4) > dstlen)
+			return -1;
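
As a stand-alone illustration of that suggestion (not the pg_base64 code; a
trivial hex encoder is used here only to keep the example short): on
overflow, wipe whatever partial output was written before failing, so a
caller that ignores the return value cannot mistake a half-written buffer
for a valid encoded string.

    #include <stdio.h>
    #include <string.h>

    static int
    encode_hex(const char *src, int srclen, char *dst, int dstlen)
    {
        static const char hexdig[] = "0123456789abcdef";
        char       *p = dst;
        int         i;

        for (i = 0; i < srclen; i++)
        {
            /* bail out, clearing partial output, if dst would overflow */
            if ((p - dst) + 2 > dstlen)
            {
                memset(dst, 0, dstlen);
                return -1;
            }
            *p++ = hexdig[(unsigned char) src[i] >> 4];
            *p++ = hexdig[(unsigned char) src[i] & 0x0f];
        }
        return (int) (p - dst);
    }

    int
    main(void)
    {
        char        buf[8];

        if (encode_hex("hello", 5, buf, (int) sizeof(buf)) < 0)
            printf("overflow: buffer was cleared, buf[0] = %d\n", buf[0]);
        return 0;
    }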

cheers ./daniel



Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Julien Rouhaud
On Mon, Jul 1, 2019 at 10:13 PM Daniel Verite  wrote:
>
> > > I think you'd be better off to define and document this as "reindex
> > > only collation-sensitive indexes", without any particular reference
> > > to a reason why somebody might want to do that.
> >
> > We should still document that indexes based on ICU would be excluded?
>
> But why exclude them?
> As a data point, in the last 5 years, the en_US collation in ICU
> had 7 different versions (across 11 major versions of ICU):
>
> ICU     Unicode   en_US
>
> 54.1    7.0       137.56
> 55.1    7.0       153.56
> 56.1    8.0       153.64
> 57.1    8.0       153.64
> 58.2    9.0       153.72
> 59.1    9.0       153.72
> 60.2    10.0      153.80
> 61.1    10.0      153.80
> 62.1    11.0      153.88
> 63.2    11.0      153.88
> 64.2    12.1      153.97
>
> The rightmost column corresponds to pg_collation.collversion
> in Postgres.
> Each time there's a new Unicode version, it seems
> all collation versions are bumped. And there's a new Unicode
> version pretty much every year these days.
> Based on this, most ICU upgrades in practice would require reindexing
> affected indexes.

I have a vague recollection that ICU was providing some backward
compatibility so that even if you upgrade your lib you can still get
the sort order that was active when you built your indexes, though
maybe for a limited number of versions.

Even if that's just me being delusional, I'd still prefer Alvaro's
approach to have distinct switches for each collation system.




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Alvaro Herrera
On 2019-Jul-01, Daniel Verite wrote:

> But why exclude them?
> As a data point, in the last 5 years, the en_US collation in ICU
> had 7 different versions (across 11 major versions of ICU):

So we need a switch --include-rule=icu-collations?

(I mentioned "--exclude-rule=glibc" elsewhere in the thread, but I think
it should be --include-rule=glibc-collations instead, no?)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Daniel Verite
Julien Rouhaud wrote:

> > I think you'd be better off to define and document this as "reindex
> > only collation-sensitive indexes", without any particular reference
> > to a reason why somebody might want to do that.
> 
> We should still document that indexes based on ICU would be excluded?

But why exclude them?
As a data point, in the last 5 years, the en_US collation in ICU
had 7 different versions (across 11 major versions of ICU):

ICU     Unicode   en_US

54.1    7.0       137.56
55.1    7.0       153.56
56.1    8.0       153.64
57.1    8.0       153.64
58.2    9.0       153.72
59.1    9.0       153.72
60.2    10.0      153.80
61.1    10.0      153.80
62.1    11.0      153.88
63.2    11.0      153.88
64.2    12.1      153.97

The rightmost column corresponds to pg_collation.collversion
in Postgres.
Each time there's a new Unicode version, it seems
all collation versions are bumped. And there's a new Unicode
version pretty much every year these days.
Based on this, most ICU upgrades in practice would require reindexing
affected indexes.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: Comment typo in tableam.h

2019-07-01 Thread Ashwin Agrawal
On Fri, Jun 28, 2019 at 1:47 PM Amit Kapila  wrote:

> On Mon, Jun 24, 2019 at 11:26 PM Ashwin Agrawal 
> wrote:
> >
> > On Mon, Jun 3, 2019 at 5:24 PM Ashwin Agrawal 
> wrote:
> >>
> >> There were few more minor typos I had collected for table am, passing
> them along here.
> >>
> >> Some of the required callback functions are missing Assert checking
> (minor thing), adding them in separate patch.
> >
> >
> > Curious to know if need to register such small typo fixing and assertion
> adding patchs to commit-fest as well ?
> >
>
> Normally, such things are handled out of CF, but to avoid forgetting,
> we can register it.  However, this particular item suits more to 'Open
> Items'[1].  You can remove the objectionable part of the comment,
> other things in your patch look good to me.  If nobody else picks it
> up, I will take care of it.
>

Thank you. I thought the committer would remove the objectionable part of
the comment change and commit the patch if it seemed fine. I don't mind
changing it; it just feels like an extra back-and-forth cycle.

Please find attached v2 of patch 1 without the objectionable comment change.
v1 of patch 2 is attached here just for convenience, with no modifications.
From 5c75807a56101a07685ed1f435eabcc43cd22fb7 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 24 May 2019 16:30:38 -0700
Subject: [PATCH v2 1/2] Fix typos in few tableam comments.

---
 src/include/access/tableam.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fd07773b74f..1e45908c78a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -434,8 +434,8 @@ typedef struct TableAmRoutine
 	 *
 	 * Note that only the subset of the relcache filled by
 	 * RelationBuildLocalRelation() can be relied upon and that the relation's
-	 * catalog entries either will either not yet exist (new relation), or
-	 * will still reference the old relfilenode.
+	 * catalog entries will either not yet exist (new relation), or will still
+	 * reference the old relfilenode.
 	 *
 	 * As output *freezeXid, *minmulti must be set to the values appropriate
 	 * for pg_class.{relfrozenxid, relminmxid}. For AMs that don't need those
@@ -591,7 +591,7 @@ typedef struct TableAmRoutine
 	 * See table_relation_estimate_size().
 	 *
 	 * While block oriented, it shouldn't be too hard for an AM that doesn't
-	 * doesn't internally use blocks to convert into a usable representation.
+	 * internally use blocks to convert into a usable representation.
 	 *
 	 * This differs from the relation_size callback by returning size
 	 * estimates (both relation size and tuple count) for planning purposes,
@@ -967,7 +967,7 @@ table_index_fetch_end(struct IndexFetchTableData *scan)
  *
  * *all_dead, if all_dead is not NULL, will be set to true by
  * table_index_fetch_tuple() iff it is guaranteed that no backend needs to see
- * that tuple. Index AMs can use that do avoid returning that tid in future
+ * that tuple. Index AMs can use that to avoid returning that tid in future
  * searches.
  *
  * The difference between this function and table_fetch_row_version is that
@@ -1014,8 +1014,8 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  * true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
- * these functions is. This function is the correct to use outside of
- * index entry->table tuple lookups.
+ * these functions is. This function is correct to use outside of index
+ * entry->table tuple lookups.
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-- 
2.19.1

From 4ed947590f2f61182356a7fa4bc429c679e7619f Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Mon, 3 Jun 2019 17:07:05 -0700
Subject: [PATCH v2 2/2] Add assertions for required table am callbacks.

---
 src/backend/access/table/tableamapi.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index b2f58768107..98b8ea559d8 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -51,6 +51,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->scan_begin != NULL);
 	Assert(routine->scan_end != NULL);
 	Assert(routine->scan_rescan != NULL);
+	Assert(routine->scan_getnextslot != NULL);
 
 	Assert(routine->parallelscan_estimate != NULL);
 	Assert(routine->parallelscan_initialize != NULL);
@@ -62,8 +63,12 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->index_fetch_tuple != NULL);
 
 	Assert(routine->tuple_fetch_row_version != NULL);
+	Assert(routine->tuple_tid_valid != NULL);
+	Assert(routine->tuple_get_latest_tid != NULL);
 	Assert(routine->tuple_satisfies_snapshot != NULL);
 
+	Assert(routine->compute_xid_horizon_for_tuples != NULL);
+
 	Assert(routine->tuple_insert != NULL);
 
 	/*
@@ -89,6 +94,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->index_validate_scan != NULL);
 
 	

Re: POC: converting Lists into arrays

2019-07-01 Thread Jesper Pedersen

Hi,

On 7/1/19 2:44 PM, Tom Lane wrote:

> Yup, here's a rebase against HEAD (and I also find that check-world shows
> no problems).

Thanks - no further comments.

> This is pretty much of a pain to maintain, since it changes
> the API for lnext() which is, um, a bit invasive.  I'd like to make a
> decision pretty quickly on whether we're going to do this, and either
> commit this patch or abandon it.

IMHO it is an improvement over the existing API.

>> There is some unneeded MemoryContext stuff in async.c's
>> pg_listening_channels() which should be cleaned up.
>
> Yeah, there's a fair amount of follow-on cleanup that could be undertaken
> afterwards, but I've wanted to keep the patch's footprint as small as
> possible for the moment.  Assuming we pull the trigger, I'd then go look
> at removing the planner's duplicative lists+arrays for RTEs and such as
> the first cleanup step.  But thanks for the pointer to async.c, I'll
> check that too.



Yeah, I only called out the async.c change, as that function isn't 
likely to change in any of the follow up patches.


Best regards,
 Jesper




Re: Zedstore - compressed in-core columnar storage

2019-07-01 Thread Ashwin Agrawal
On Thu, May 30, 2019 at 8:07 AM DEV_OPS  wrote:

>
> it's really cool and very good progress,
>
> I'm interested in whether SIMD/JIT will be supported
>

That's something outside of the Zedstore work directly, at least for now.
The intent is to work with the current executor code, or enhance it only
where needed; if the current executor code supports something, that will
work for Zedstore. Any other enhancements to the executor will be a
separate undertaking.


Memory-Bounded Hash Aggregation

2019-07-01 Thread Jeff Davis
This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.

Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read among the runs
randomly during the merge phase. A hash is a convenient small
representation of the data that is cheaper to operate on; sort uses
abbreviated keys for the same reason.

Hash offers:

* Data is aggregated on-the-fly, effectively "compressing" the amount
  of data that needs to go to disk. This is particularly important
  when the data contains skewed groups (see below).

* Can output some groups after the first pass of the input data even
  if other groups spilled.

* Some data types only support hashing; not sorting.

Sort+Group offers:

* Only one group is accumulating at once, so if the transition state
  grows (like with ARRAY_AGG), it minimizes the memory needed.

* The input may already happen to be sorted.

* Some data types only support sorting; not hashing.

Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in
memory. Unfortunately, if the optimizer is wrong (often the case if the
input is not a base table), the hash table will
keep growing beyond work_mem, potentially bringing the entire system
to OOM. This patch fixes that problem by extending the Hash
Aggregation strategy to spill to disk when needed.

Previous discussions:


https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop

https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop

https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi

A lot was discussed, which I will try to summarize and address here.

Digression: Skewed Groups:

Imagine the input tuples have the following grouping keys:

  0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N

Group 0 is a skew group because it consists of 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.

High-level approaches:

1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality.

2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.

3. Use some kind of hybrid[1][2] of hashing and sorting.

Evaluation of approaches:

Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.
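
To make that concrete, here is a tiny stand-alone toy of Approach 1's
control flow (emphatically not the patch, and not executor code; the
two-group "memory limit" and every name below are invented for the
example). Groups already in the table keep aggregating in place, tuples
for new groups that arrive after the table is full are spilled raw, and a
later pass processes the spilled tuples the same way with a smaller group
cardinality. On the skewed input from the digression above, group 0 is
admitted before the table fills and is never spilled:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_GROUPS 2            /* pretend work_mem only fits two groups */

    typedef struct { int key; long count; } Group;

    static int
    find_or_add(Group *table, int *ngroups, int key)
    {
        int     i;

        for (i = 0; i < *ngroups; i++)
            if (table[i].key == key)
                return i;
        if (*ngroups < MAX_GROUPS)
        {
            table[*ngroups].key = key;
            table[*ngroups].count = 0;
            return (*ngroups)++;
        }
        return -1;                  /* table full: caller must spill */
    }

    static void
    hashagg(const int *input, int ninput)
    {
        Group   table[MAX_GROUPS];
        int     ngroups = 0;
        int    *spill = malloc(sizeof(int) * ninput);
        int     nspill = 0;
        int     i, g;

        for (i = 0; i < ninput; i++)
        {
            g = find_or_add(table, &ngroups, input[i]);
            if (g >= 0)
                table[g].count++;           /* aggregate in memory */
            else
                spill[nspill++] = input[i]; /* raw tuple goes to "disk" */
        }

        for (i = 0; i < ngroups; i++)
            printf("group %d: count %ld\n", table[i].key, table[i].count);

        if (nspill > 0)                     /* next pass over spilled tuples */
            hashagg(spill, nspill);
        free(spill);
    }

    int
    main(void)
    {
        int     input[] = {0, 1, 0, 2, 0, 3, 0, 4};     /* group 0 is skewed */

        hashagg(input, (int) (sizeof(input) / sizeof(input[0])));
        return 0;
    }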

Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:

* It doesn't handle skew values well because it's likely to evict
  them.

* If we leave part of the hash table in memory, it's difficult to
  ensure that we will be able to actually use the space freed by
  eviction, because the freed memory may be fragmented. That could
  force us to evict the entire in-memory hash table as soon as we
  partition, reducing a lot of the benefit of hashing.

* It requires eviction for the algorithm to work. That may be
  necessary for handling cases like ARRAY_AGG (see below) anyway, but
  this approach constrains the specifics of eviction.

Approach 3 is interesting because it unifies the two approaches and
can get some of the benfits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:

* It requires that all data 

Re: Zedstore - compressed in-core columnar storage

2019-07-01 Thread Ashwin Agrawal
On Sun, Jun 30, 2019 at 7:59 PM Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Ashwin Agrawal [mailto:aagra...@pivotal.io]
> > The objective is to gather feedback on design and approach to the same.
> > The implementation has core basic pieces working but not close to
> complete.
>
> Thank you for proposing a very interesting topic.  Are you thinking of
> including this in PostgreSQL 13 if possible?
>
>
> > * All Indexes supported
> ...
> > work. Btree indexes can be created. Btree and bitmap index scans work.
>
> Does Zedstore allow to create indexes of existing types on the table
> (btree, GIN, BRIN, etc.) and perform index scans (point query, range query,
> etc.)?
>

Yes, all indexes types work for zedstore and allow point or range queries.


> > * Hybrid row-column store, where some columns are stored together, and
> >   others separately. Provide flexibility of granularity on how to
> >   divide the columns. Columns accessed together can be stored
> >   together.
> ...
> > This way of laying out the data also easily allows for hybrid row-column
> > store, where some columns are stored together, and others have a
> dedicated
> > B-tree. Need to have user facing syntax to allow specifying how to group
> > the columns.
> ...
> > Zedstore Table can be
> > created using command:
> >
> > CREATE TABLE  (column listing) USING zedstore;
>
> Are you aiming to enable Zedstore to be used for HTAP, i.e. the same table
> can be accessed simultaneously for both OLTP and analytics with the minimal
> performance impact on OLTP?  (I got that impression from the word "hybrid".)
>

Well "hybrid" is more to convey compressed row and column store can be
supported with same design. It really wasn't referring to HTAP. In general
the goal we are moving towards is column store to be extremely efficient at
analytics but still should be able to support all the OLTP operations (with
minimal performance or storage size impact) Like when making trade-offs
between different design choices and if both can't be meet, preference if
towards analytics.

If yes, is the assumption that only a limited number of  columns are to be
> stored in columnar format (for efficient scanning), and many other columns
> are to be stored in row format for efficient tuple access?
>

Yes; if it's known that certain columns are always accessed together, it's
better to store them together and avoid the tuple-formation cost. Though
it's still to be seen whether compression plays a role here, and whether
storing and compressing each column individually can still be a winner
compared to compressing different columns together as a blob, i.e., whether
the saving on IO cost offsets the tuple-formation cost or not.

Are those row-formatted columns stored in the same file as the
> column-formatted columns, or in a separate file?
>

Currently, we are focused on just getting the pure column store working and
hence haven't coded anything for the hybrid layout yet. But at least right
now the thought is it would be in the same file.

Regarding the column grouping, can I imagine HBase and Cassandra?
> How could the current CREATE TABLE syntax support column grouping?  (I
> guess CREATE TABLE needs a syntax for columnar store, and Zedstore need to
> be incorporated in core, not as an extension...)
>

When column grouping comes up, yes, we will need to modify the CREATE TABLE
syntax; we have yet to reach that point in development.


> > A column store uses the same structure but we have *multiple* B-trees,
> one
> > for each column, all indexed by TID. The B-trees for all columns are
> stored
> > in the same physical file.
>
> Did you think that it's not a good idea to have a different file for each
> group of columns?  Is that because we can't expect physical adjacency of
> data blocks on disk even if we separate a column in a separate file?
>
> I thought a separate file for each group of columns would be easier and
> less error-prone to implement and debug.  Adding and dropping the column
> group would also be very easy and fast.
>

Currently, each group is a single column (until we have column families),
and having a file for each column definitely seems like a bad idea, as it
just explodes the number of files. A separate file may have an advantage
from a pre-fetching point of view, but yes, we can't expect physical
adjacency of data blocks, plus the access pattern will anyway involve
reading multiple files (if each column is stored in a separate file).

I doubt storing each group in its own file makes it any easier to implement
or debug; I feel it's actually the reverse. Storing everything in a single
file, but in separate blocks, keeps the logic contained inside the AM
layer, and we don't have to write special code, for example for DROP TABLE
to delete the files for all the groups, or for moving the table to a
different tablespace, and all such complications.

Adding and dropping a column group can regardless be made easy and fast,
with the blocks for that group added, or marked for reuse, within the same
file.

Thank you for the questions.


Re: POC: converting Lists into arrays

2019-07-01 Thread Tom Lane
Jesper Pedersen  writes:
> This needs a rebase. After that check-world passes w/ and w/o 
> -DDEBUG_LIST_MEMORY_USAGE.

Yup, here's a rebase against HEAD (and I also find that check-world shows
no problems).  This is pretty much of a pain to maintain, since it changes
the API for lnext() which is, um, a bit invasive.  I'd like to make a
decision pretty quickly on whether we're going to do this, and either
commit this patch or abandon it.

> There is some unneeded MemoryContext stuff in async.c's 
> pg_listening_channels() which should be cleaned up.

Yeah, there's a fair amount of follow-on cleanup that could be undertaken
afterwards, but I've wanted to keep the patch's footprint as small as
possible for the moment.  Assuming we pull the trigger, I'd then go look
at removing the planner's duplicative lists+arrays for RTEs and such as
the first cleanup step.  But thanks for the pointer to async.c, I'll
check that too.

regards, tom lane



reimplement-List-as-array-6.patch.gz
Description: reimplement-List-as-array-6.patch.gz


Re: [PATCH][PROPOSAL] Add enum releation option type

2019-07-01 Thread Alvaro Herrera
It strikes me that the way to avoid sentence construction is to have
each enum reloption declare a string that it uses to list the values it
accepts.  So for example we would have 

+#define GIST_OPTION_BUFFERING_ENUM_DEF {   \
+   { "on", GIST_OPTION_BUFFERING_ON }, \
+   { "off",GIST_OPTION_BUFFERING_OFF },\
+   { "auto",   GIST_OPTION_BUFFERING_AUTO },   \
+   { (const char *) NULL, 0 }  \
+}
+
+ GistBufferingValidMsg = gettext_noop("Valid values are \"on\", \"off\", and 
\"auto\".");

I think that's the most contentious point on this patch at this point
(though I may be misremembering).
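
Just to illustrate the mechanics (a rough sketch only; the struct and function
names below are hypothetical, not the actual API of the patch), the per-option
message can then be passed straight through to the error report, so no
sentence ever has to be assembled from the value table:

/* Assumes backend context (postgres.h), which provides ereport and pg_strcasecmp. */
typedef struct
{
    const char *name;
    int         value;
} enum_option_value;

static int
parse_enum_option(const enum_option_value *tab, const char *value,
                  const char *valid_values_msg)
{
    for (; tab->name != NULL; tab++)
    {
        if (pg_strcasecmp(tab->name, value) == 0)
            return tab->value;
    }

    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("invalid value for enum option: \"%s\"", value),
             errdetail("%s", valid_values_msg)));
    return 0;                   /* keep compiler quiet */
}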

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-07-01 Thread Alvaro Herrera
It seems strange to have relation_reloptions which abstracts away the
need to know which function to call for each relkind, and separately
have a bunch of places that call the specific relkind.  Why not just
call the specific function directly?  It doesn't seem that we're gaining
any abstraction ... maybe it'd be better to just remove
relation_reloptions entirely.  Or maybe we need to make it do a better
job so that other places don't have to call the specific functions at
all.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Change atoi to strtol in same place

2019-07-01 Thread Surafel Temesgen
Hello,

We use atoi for user argument processing in some places, and atoi returns zero
for both invalid input and an input value of zero. In most cases this is okay,
because we error out with an appropriate error message for zero, but in the
places where zero is a valid input it causes a problem: invalid input is
silently treated as the value zero. The attached patch changes those places to
strtol, which can detect invalid input.
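
For reference, the validation idiom the patch applies in each place is the
standard strtol pattern, shown below as a stand-alone helper (the helper name
and message wording are only illustrative, not taken from the patch):

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns true and stores the value iff optval is a valid integer in 0..INT_MAX. */
static bool
parse_nonnegative_int(const char *optval, const char *optname, int *result)
{
    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(optval, &endptr, 10);

    if (endptr == optval || *endptr != '\0' ||
        errno == ERANGE || val < 0 || val > INT_MAX)
    {
        fprintf(stderr, "%s must be in range 0..%d\n", optname, INT_MAX);
        return false;
    }

    *result = (int) val;        /* zero is accepted as a genuine value */
    return true;
}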

regards

Surafel
diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index 23f706b21d..2bcacbfbb5 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -638,6 +639,14 @@ sigquit_handler(int sig)
 int
 main(int argc, char **argv)
 {
+	char	   *keepfilesendptr;
+	char	   *maxretriesendptr;
+	char	   *sleeptimeendptr;
+	char	   *maxwaittimeendptr;
+	long		numkeepfiles;
+	long		nummaxretries;
+	long		numsleeptime;
+	long		nummaxwaittime;
 	int			c;
 
 	progname = get_progname(argv[0]);
@@ -688,12 +697,17 @@ main(int argc, char **argv)
 debug = true;
 break;
 			case 'k':			/* keepfiles */
-keepfiles = atoi(optarg);
-if (keepfiles < 0)
+errno = 0;
+numkeepfiles = strtol(optarg, , 10);
+
+if (keepfilesendptr == optarg || *keepfilesendptr != '\0' ||
+	numkeepfiles < 0 || numkeepfiles > INT_MAX ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -k keepfiles must be >= 0\n", progname);
+	fprintf(stderr, "%s: -k keepfiles must be in range %d..%d\n", progname, 0, INT_MAX);
 	exit(2);
 }
+keepfiles = (int) numkeepfiles;
 break;
 			case 'l':			/* Use link */
 
@@ -707,31 +721,46 @@ main(int argc, char **argv)
 #endif
 break;
 			case 'r':			/* Retries */
-maxretries = atoi(optarg);
-if (maxretries < 0)
+errno = 0;
+nummaxretries = strtol(optarg, , 10);
+
+if (maxretriesendptr == optarg || *maxretriesendptr != '\0' ||
+	nummaxretries < 0 || nummaxretries > INT_MAX ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -r maxretries must be >= 0\n", progname);
+	fprintf(stderr, "%s: -r maxretries must be in range %d..%d\n", progname, 0, INT_MAX);
 	exit(2);
 }
+maxretries = (int) nummaxretries;
 break;
 			case 's':			/* Sleep time */
-sleeptime = atoi(optarg);
-if (sleeptime <= 0 || sleeptime > 60)
+errno = 0;
+numsleeptime = strtol(optarg, , 10);
+
+if (sleeptimeendptr == optarg || *sleeptimeendptr != '\0' ||
+	numsleeptime <= 0 || numsleeptime > 60 ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -s sleeptime incorrectly set\n", progname);
+	fprintf(stderr, "%s: -s sleeptime must be in range %d..%d\n", progname, 1, 59);
 	exit(2);
 }
+sleeptime = (int) numsleeptime;
 break;
 			case 't':			/* Trigger file */
 triggerPath = pg_strdup(optarg);
 break;
 			case 'w':			/* Max wait time */
-maxwaittime = atoi(optarg);
-if (maxwaittime < 0)
+errno = 0;
+nummaxwaittime = strtol(optarg, , 10);
+
+if (maxwaittimeendptr == optarg || *maxwaittimeendptr != '\0' ||
+	nummaxwaittime < 0 || nummaxwaittime > INT_MAX ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -w maxwaittime incorrectly set\n", progname);
+	fprintf(stderr, "%s: -w maxwaittime must be in range %d..%d\n", progname, 0, INT_MAX);
 	exit(2);
 }
+maxwaittime = (int) nummaxwaittime;
 break;
 			default:
 fprintf(stderr, "Try \"%s --help\" for more information.\n", progname);
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 63f554307c..16d44c8617 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2193,6 +2193,10 @@ main(int argc, char **argv)
 		{"no-verify-checksums", no_argument, NULL, 3},
 		{NULL, 0, NULL, 0}
 	};
+	char	   *compressEndptr;
+	char	   *timeoutEndptr;
+	long		compressNumber;
+	long		timeoutNumber;
 	int			c;
 
 	int			option_index;
@@ -2305,12 +2309,18 @@ main(int argc, char **argv)
 #endif
 break;
 			case 'Z':
-compresslevel = atoi(optarg);
-if (compresslevel < 0 || compresslevel > 9)
+errno = 0;
+compressNumber = strtol(optarg, , 10);
+
+if (compressEndptr == optarg || *compressEndptr != '\0' ||
+	compressNumber < 0 || compressNumber > 9 ||
+	errno == ERANGE)
 {
-	pg_log_error("invalid compression level \"%s\"\n", optarg);
+	pg_log_error("compression level must be in range %d..%d \"%s\"\n",
+ 0, 9, optarg);
 	exit(1);
 }
+compresslevel = (int) compressNumber;
 break;
 			case 'c':
 if (pg_strcasecmp(optarg, "fast") == 0)
@@ -2343,12 +2353,18 @@ main(int argc, char **argv)
 dbgetpassword = 1;
 break;
 			case 's':
-standby_message_timeout = atoi(optarg) * 1000;
-if (standby_message_timeout < 0)
+errno = 0;
+timeoutNumber = strtol(optarg, , 10);
+
+if (timeoutEndptr == 

Re: [PATCH] kNN for btree

2019-07-01 Thread Nikita Glukhov

On 01.07.2019 13:41, Thomas Munro wrote:


On Tue, Mar 26, 2019 at 4:30 AM Nikita Glukhov  wrote:

Fixed two bugs in patches 3 and 10 (see below).
Patch 3 was extracted from the main patch 5 (patch 4 in v9).

This patch no longer applies so marking Waiting on Author.

Attached 11th version of the patches rebased onto current master.

Hi Nikita,

Since a new Commitfest is here and this doesn't apply, could we please
have a fresh rebase?


Attached is the 12th version of the patches, rebased onto the current master.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Fix-get_index_column_opclass-v12.patch.gz
Description: application/gzip


0002-Introduce-ammatchorderby-function-v12.patch.gz
Description: application/gzip


0003-Enable-ORDER-BY-operator-scans-on-ordered-indexes-v12.patch.gz
Description: application/gzip


0004-Extract-structure-BTScanState-v12.patch.gz
Description: application/gzip


0005-Add-kNN-support-to-btree-v12.patch.gz
Description: application/gzip


0006-Add-btree-distance-operators-v12.patch.gz
Description: application/gzip


0007-Remove-distance-operators-from-btree_gist-v12.patch.gz
Description: application/gzip


0008-Add-regression-tests-for-kNN-btree-v12.patch.gz
Description: application/gzip


0009-Allow-ammatchorderby-to-return-pathkey-sublists-v12.patch.gz
Description: application/gzip


0010-Add-support-of-array-ops-to-btree-kNN-v12.patch.gz
Description: application/gzip


Re: FETCH FIRST clause WITH TIES option

2019-07-01 Thread Surafel Temesgen
Hi Thomas
On Mon, Jul 1, 2019 at 1:31 PM Thomas Munro  wrote:

> On Mon, Apr 8, 2019 at 8:26 PM Surafel Temesgen 
> wrote:
> > agree
>
> Hi Surafel,
>
> A new Commitfest is starting.  Can you please post a fresh rebase of this
> patch?
>
>
Thank you for letting me know. Attached is a rebased patch against current master.
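
For reviewers who have not followed the thread, here is a quick illustration of
the syntax documented by the patch (the table and data are made up):

-- the 3rd and 4th rows tie on score
CREATE TABLE scores (name text, score int);
INSERT INTO scores VALUES ('a', 90), ('b', 85), ('c', 80), ('d', 80), ('e', 70);

-- ONLY stops at exactly three rows
SELECT * FROM scores ORDER BY score DESC FETCH FIRST 3 ROWS ONLY;

-- WITH TIES also returns 'd', which ties with 'c' for the last place;
-- an ORDER BY clause is required for this form
SELECT * FROM scores ORDER BY score DESC FETCH FIRST 3 ROWS WITH TIES;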

regards
Surafel
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 06d611b64c..e83d309c5b 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
+[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
@@ -1430,7 +1430,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY
+FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }
 
 In this syntax, the start
 or count value is required by
@@ -1440,7 +1440,10 @@ FETCH { FIRST | NEXT } [ count ] {
 ambiguity.
 If count is
 omitted in a FETCH clause, it defaults to 1.
-ROW
+ROW .
+WITH TIES option is used to return two or more rows
+that tie for the last place in the result set according to ORDER BY
+clause (ORDER BY clause must be specified in this case).
 and ROWS as well as FIRST
 and NEXT are noise words that don't influence
 the effects of these clauses.
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index baa669abe8..60660e710f 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -41,6 +41,7 @@ static TupleTableSlot *			/* return: a tuple or NULL */
 ExecLimit(PlanState *pstate)
 {
 	LimitState *node = castNode(LimitState, pstate);
+	ExprContext *econtext = node->ps.ps_ExprContext;
 	ScanDirection direction;
 	TupleTableSlot *slot;
 	PlanState  *outerPlan;
@@ -126,12 +127,16 @@ ExecLimit(PlanState *pstate)
 			{
 /*
  * Forwards scan, so check for stepping off end of window. If
- * we are at the end of the window, return NULL without
- * advancing the subplan or the position variable; but change
- * the state machine state to record having done so.
+ * we are at the end of the window, the behavior depends whether
+ * ONLY or WITH TIES was specified.  In case of ONLY, we return
+ * NULL without advancing the subplan or the position variable;
+ * but change the state machine state to record having done so.
+ * In the WITH TIES mode, we need to advance the subplan until
+ * we find the first row with different ORDER BY pathkeys.
  */
 if (!node->noCount &&
-	node->position - node->offset >= node->count)
+	node->position - node->offset >= node->count &&
+	node->limitOption == EXACT_NUMBER)
 {
 	node->lstate = LIMIT_WINDOWEND;
 
@@ -144,18 +149,69 @@ ExecLimit(PlanState *pstate)
 
 	return NULL;
 }
+else if (!node->noCount &&
+		 node->position - node->offset >= node->count &&
+		 node->limitOption == WITH_TIES)
+{
+	/*
+	 * Get next tuple from subplan, if any.
+	 */
+	slot = ExecProcNode(outerPlan);
+	if (TupIsNull(slot))
+	{
+		node->lstate = LIMIT_SUBPLANEOF;
+		return NULL;
+	}
+	/*
+	 * Test if the new tuple and the last tuple match.
+	 * If so we return the tuple.
+	 */
+	econtext->ecxt_innertuple = slot;
+	econtext->ecxt_outertuple = node->last_slot;
+	if (ExecQualAndReset(node->eqfunction, econtext))
+	{
+		node->subSlot = slot;
+		node->position++;
+	}
+	else
+	{
+		node->lstate = LIMIT_WINDOWEND;
+
+		/*
+		 * If we know we won't need to back up, we can release
+		 * resources at this point.
+		 */
+		if (!(node->ps.state->es_top_eflags & EXEC_FLAG_BACKWARD))
+			(void) ExecShutdownNode(outerPlan);
+
+		return NULL;
+	}
 
-/*
- * Get next tuple from subplan, if any.
- */
-slot = ExecProcNode(outerPlan);
-if (TupIsNull(slot))
+}
+else
 {
-	node->lstate = LIMIT_SUBPLANEOF;
-	return NULL;
+	/*
+	 * Get next tuple from subplan, if any.
+	 */
+	slot = ExecProcNode(outerPlan);
+	if (TupIsNull(slot))
+	{
+		node->lstate = LIMIT_SUBPLANEOF;
+		return NULL;
+	}
+
+	/*
+	 * Tuple at limit is needed for comparation in subsequent execution
+	 * to detect ties.
+	 */
+	if (node->limitOption == WITH_TIES &&
+		node->position - node->offset == node->count - 1)
+	{
+		ExecCopySlot(node->last_slot, slot);
+	}
+	

Re: POC: converting Lists into arrays

2019-07-01 Thread Jesper Pedersen

Hi,

On 5/25/19 11:48 AM, Tom Lane wrote:

The cfbot noticed a set-but-not-used variable that my compiler hadn't
whined about.  Here's a v5 to pacify it.  No other changes.



This needs a rebase. After that check-world passes w/ and w/o 
-DDEBUG_LIST_MEMORY_USAGE.


There is some unneeded MemoryContext stuff in async.c's 
pg_listening_channels() which should be cleaned up.


Thanks for working on this, as the API is more explicit now about what 
is going on.


Best regards,
 Jesper




Re: Error message inconsistency

2019-07-01 Thread Alvaro Herrera
Do we have an actual patch here?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-07-01 Thread Alvaro Herrera
On 2019-Apr-19, Paul Guo wrote:

> The below patch runs single mode Postgres if needed to make sure the target
> is cleanly shutdown. A new option is added (off by default).
> v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch

Why do we need an option for this?  Is there a reason not to do this
unconditionally?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Julien Rouhaud
On Mon, Jul 1, 2019 at 4:10 PM Tom Lane  wrote:
>
> Michael Paquier  writes:
> > - 0003 begins to be the actual fancy thing with the addition of a
> > --jobs option into reindexdb.  The main issue here which should be
> > discussed is that when it comes to reindex of tables, you basically
> > are not going to have any conflicts between the objects manipulated.
> > However if you wish to do a reindex on a set of indexes then things
> > get more tricky as it is necessary to list items per-table so as
> > multiple connections do not conflict with each other if attempting to
> > work on multiple indexes of the same table.  What this patch does is
> > to select the set of indexes which need to be worked on (see the
> > addition of cell in ParallelSlot), and then does a kind of
> > pre-planning of each item into the connection slots so as each
> > connection knows from the beginning which items it needs to process.
> > This is quite different from vacuumdb where a new item is distributed
> > only on a free connection from a unique list.  I'd personally prefer
> > if we keep the facility in parallel.c so as it is only
> > execution-dependent and that we have no pre-planning.  This would
> > require keeping within reindexdb.c an array of lists, with one list
> > corresponding to one connection instead which feels more natural.
>
> Couldn't we make this enormously simpler and less bug-prone by just
> dictating that --jobs applies only to reindex-table operations?

That would also mean that we'd have to fall back on doing the reindex at
table level, even if we only want to reindex indexes that depend on
glibc.  I'm afraid that this will often add a huge penalty.

> > - 0004 is the part where the concurrent additions really matter as
> > this consists in applying an extra filter to the indexes selected so
> > as only the glibc-sensitive indexes are chosen for the processing.
>
> I think you'd be better off to define and document this as "reindex
> only collation-sensitive indexes", without any particular reference
> to a reason why somebody might want to do that.

We should still document that indexes based on ICU would be excluded?
I also realize that I totally forgot to update reindexdb.sgml.  Sorry
about that, I'll fix it in the next version.




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Julien Rouhaud
On Mon, Jul 1, 2019 at 3:51 PM Alvaro Herrera  wrote:
>
> Please don't reuse a file name as generic as "parallel.c" -- it's
> annoying when navigating source.  Maybe conn_parallel.c multiconn.c
> connscripts.c admconnection.c ...?

I could use scripts_parallel.[ch] as I've already used it in the #define part?

> If your server crashes or is stopped midway during the reindex, you
> would have to start again from scratch, and it's tedious (if it's
> possible at all) to determine which indexes were missed.  I think it
> would be useful to have a two-phase mode: in the initial phase reindexdb
> computes the list of indexes to be reindexed and saves them into a work
> table somewhere.  In the second phase, it reads indexes from that table
> and processes them, marking them as done in the work table.  If the
> second phase crashes or is stopped, it can be restarted and consults the
> work table.  I would keep the work table, as it provides a bit of an
> audit trail.  It may be important to be able to run even if unable to
> create such a work table (because of the numerous users that
> DROP DATABASE postgres).

Or we could create a table locally in each database, that would fix
this problem and probably make the code simpler?
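
As a rough illustration (the names and layout here are purely made up, not
taken from any patch), such a work table could be as simple as:

CREATE TABLE reindex_worklist (
    indexrelid        oid PRIMARY KEY,
    index_name        text NOT NULL,
    reindex_requested boolean NOT NULL DEFAULT true,
    reindex_done      boolean NOT NULL DEFAULT false
);

-- the second phase picks pending entries and marks them once rebuilt
UPDATE reindex_worklist
   SET reindex_done = true
 WHERE indexrelid = 'some_index'::regclass
   AND reindex_requested
   AND NOT reindex_done;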

It also raises some additional concerns about data expiration.  I
guess that someone could launch the tool by mistake, kill reindexdb,
and run it again 2 months later while a lot of new objects have been
added for instance.

> The "glibc filter" thing (which I take to mean "indexes that depend on
> collations") would apply to the first phase: it just skips adding other
> indexes to the work table.  I suppose ICU collations are not affected,
> so the filter would be for glibc collations only?

Indeed, ICU shouldn't need such a filter.  xxx_pattern_ops based
indexes are also excluded.

>  The --glibc-dependent
> switch seems too ad-hoc.  Maybe "--exclude-rule=glibc"?  That way we can
> add other rules later.  (Not "--exclude=foo" because we'll want to add
> the possibility to ignore specific indexes by name.)

That's a good point, I like the --exclude-rule switch.




Re: Cleanup/remove/update references to OID column

2019-07-01 Thread Tom Lane
Justin Pryzby  writes:
> I'm resending this patch, which still seems to be needed.

Yeah, clearly one copy of that text got missed out.  Pushed that.

> Also, should this be removed?  Or at least remove the parenthesized text,
> since non-system tables no longer have OIDs: "(use to avoid output on
> system tables)"

No, I think that's still fine as-is.  Tables still have OIDs, they
just don't *contain* magic OID columns.

> And maybe this (?)
> trace_lock_table (integer)

Hm, the description of that isn't English, at least:

gettext_noop("Sets the OID of the table with 
unconditionally lock tracing."),

I'm not entirely sure about the point of tracing locks on just one
table, which seems to be what this is for.

regards, tom lane




Re: Cleanup/remove/update references to OID column

2019-07-01 Thread Justin Pryzby
I'm resending this patch, which still seems to be needed.

Also, should this be removed?  Or at least remove the parenthesized text, since
non-system tables no longer have OIDs: "(use to avoid output on system tables)"

https://www.postgresql.org/docs/devel/runtime-config-developer.html
trace_lock_oidmin (integer)

And maybe this (?)
trace_lock_table (integer)

On Wed, May 08, 2019 at 02:05:57PM -0500, Justin Pryzby wrote:
> I found what appears to be a dangling reference to old "hidden" OID behavior.
> 
> Justin

> From 1c6712c0ade949648dbc415dfd7ea80312360ef7 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Wed, 8 May 2019 13:57:12 -0500
> Subject: [PATCH v1] Cleanup/remove/update references to OID column...
> 
> ..in wake of 578b229718e8f.
> 
> See also
> 93507e67c9ca54026019ebec3026de35d30370f9
> 1464755fc490a9911214817fe83077a3689250ab
> f6b39171f3d65155b9390c2c69bc5b3469f923a8
> 
> Author: Justin Pryzby 
> ---
>  doc/src/sgml/catalogs.sgml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> index d544e60..0f9c6f2 100644
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -610,7 +610,7 @@
>oid
>oid
>
> -  Row identifier (hidden attribute; must be explicitly 
> selected)
> +  Row identifier
>   
>  
>   
> -- 
> 2.7.4
> 


-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581




Re: Avoid full GIN index scan when possible

2019-07-01 Thread Marc Cousin
On 29/06/2019 00:23, Julien Rouhaud wrote:
> On Fri, Jun 28, 2019 at 10:16 PM Tom Lane  wrote:
>>
>> Tomas Vondra  writes:
>>> On Fri, Jun 28, 2019 at 03:03:19PM -0400, Tom Lane wrote:
 I not only don't want that function in indxpath.c, I don't even want
 it to be known/called from there.  If we need the ability for the index
 AM to editorialize on the list of indexable quals (which I'm not very
 convinced of yet), let's make an AM interface function to do it.
>>
>>> Wouldn't it be better to have a function that inspects a single qual and
>>> says whether it's "optimizable" or not? That could be part of the AM
>>> implementation, and we'd call it and it'd be us messing with the list.
>>
>> Uh ... we already determined that the qual is indexable (ie is a member
>> of the index's opclass), or allowed the index AM to derive an indexable
>> clause from it, so I'm not sure what you envision would happen
>> additionally there.  If I understand what Julien is concerned about
>> --- and I may not --- it's that the set of indexable clauses *as a whole*
>> may have or lack properties of interest.  So I'm thinking the answer
>> involves some callback that can do something to the whole list, not
>> qual-at-a-time.  We've already got facilities for the latter case.
> 
> Yes, the root issue here is that with gin it's entirely possible that
> "WHERE sometable.col op value1" is way more efficient than "WHERE
> sometable.col op value AND sometable.col op value2", where both qual
> are determined indexable by the opclass.  The only way to avoid that
> is indeed to inspect the whole list, as done in this poor POC.
> 
> This is a problem actually hit in production, and as far as I know
> there's no easy way from the application POV to prevent unexpected
> slowdown.  Maybe Marc will have more details about the actual problem
> and how expensive such a case was compared to the normal ones.

Sorry for the delay...

Yes, quite easily; here is what we had (it's just a bit simplified, we have
other criteria, but I think it shows the problem):

rh2=> explain analyze select * from account_employee where typeahead like 
'%albert%';
   QUERY 
PLAN   

 Bitmap Heap Scan on account_employee  (cost=53.69..136.27 rows=734 width=666) 
(actual time=15.562..35.044 rows=8957 loops=1)
   Recheck Cond: (typeahead ~~ '%albert%'::text)
   Rows Removed by Index Recheck: 46
   Heap Blocks: exact=8919
   ->  Bitmap Index Scan on account_employee_site_typeahead_gin_idx  
(cost=0.00..53.51 rows=734 width=0) (actual time=14.135..14.135 rows=9011 
loops=1)
 Index Cond: (typeahead ~~ '%albert%'::text)
 Planning time: 0.224 ms
 Execution time: 35.389 ms
(8 rows)

rh2=> explain analyze select * from account_employee where typeahead like 
'%albert%' and typeahead like '%lo%';
   
QUERY PLAN  
 

 Bitmap Heap Scan on account_employee  (cost=28358.38..28366.09 rows=67 
width=666) (actual time=18210.109..18227.134 rows=1172 loops=1)
   Recheck Cond: ((typeahead ~~ '%albert%'::text) AND (typeahead ~~ 
'%lo%'::text))
   Rows Removed by Index Recheck: 7831
   Heap Blocks: exact=8919
   ->  Bitmap Index Scan on account_employee_site_typeahead_gin_idx  
(cost=0.00..28358.37 rows=67 width=0) (actual time=18204.756..18204.756 
rows=9011 loops=1)
 Index Cond: ((typeahead ~~ '%albert%'::text) AND (typeahead ~~ 
'%lo%'::text))
 Planning time: 0.288 ms
 Execution time: 18230.182 ms
(8 rows)


We noticed this because the application timed out for users searching for
someone whose name was 2 characters long (it happens :) ).

We reject such filters when it's the only criterion, as we know it's going to 
be slow, but ignoring it as a supplementary filter would be a bit weird.

Of course there is the possibility of filtering in two stages with a CTE, but
that's not as great as having PostgreSQL do it itself.
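
For reference, the two-stage CTE workaround looks roughly like this
(illustrative only; on v12 and later the CTE would need MATERIALIZED to keep
acting as an optimization fence):

WITH candidates AS (
    -- stage 1: let the GIN index handle only the selective pattern
    SELECT * FROM account_employee WHERE typeahead LIKE '%albert%'
)
-- stage 2: cheap recheck of the short pattern on the small intermediate result
SELECT * FROM candidates WHERE typeahead LIKE '%lo%';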


By the way, while preparing this, I noticed that during this kind of index
scan the interrupt signal seems to be masked for a very long time: Control-C
takes a very long time to cancel the query. But that's an entirely different
problem :)

Regards



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Cached plans and statement generalization

2019-07-01 Thread Konstantin Knizhnik



On 01.07.2019 12:51, Thomas Munro wrote:

On Wed, Apr 10, 2019 at 12:52 AM Konstantin Knizhnik
 wrote:

New version of the patching disabling autoprepare for rules and handling
planner error.

Hi Konstantin,

This doesn't apply. Could we please have a fresh rebase for the new Commitfest?

Thanks,


Attached please find the rebased version of the patch.
This version can also be found in the autoprepare branch of this repository
on GitHub: https://github.com/postgrespro/postgresql.builtin_pool.git
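
For anyone trying it out, the intended workflow (GUC and view names are taken
from the documentation in the attached patch) is roughly:

-- generalize statements after they have been executed this many times
SET autoprepare_threshold = 5;

-- optionally cap the per-backend cache; LRU eviction applies beyond this
SET autoprepare_limit = 100;

-- after running the same literal query repeatedly, inspect the cache
SELECT statement, parameter_types, exec_count
  FROM pg_autoprepared_statements;
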
diff --git a/doc/src/sgml/autoprepare.sgml b/doc/src/sgml/autoprepare.sgml
new file mode 100644
index 000..b3309bd
--- /dev/null
+++ b/doc/src/sgml/autoprepare.sgml
@@ -0,0 +1,62 @@
+
+
+ 
+  Autoprepared statements
+
+  
+   autoprepared statements
+  
+
+  
+PostgreSQL makes it possible prepare
+frequently used statements to eliminate cost of their compilation
+and optimization on each execution of the query. On simple queries
+(like ones in pgbench -S) using prepared statements
+increase performance more than two times.
+  
+
+  
+Unfortunately not all database applications are using prepared statements
+and, moreover, it is not always possible. For example, in case of using
+pgbouncer or any other session pooler,
+there is no session state (transactions of one client may be executed at different
+backends) and so prepared statements can not be used.
+  
+
+  
+Autoprepare mode allows to overcome this limitation.
+In this mode Postgres tries to generalize executed statements
+and build parameterized plan for them. Speed of execution of
+autoprepared statements is almost the same as of explicitly
+prepared statements.
+  
+
+  
+By default autoprepare mode is switched off. To enable it, assign non-zero
+value to GUC variable autoprepare_tthreshold.
+This variable specified minimal number of times the statement should be
+executed before it is autoprepared. Please notice that, despite to the
+value of this parameter, Postgres makes a decision about using
+generalized plan vs. customized execution plans based on the results
+of comparison of average time of five customized plans with
+time of generalized plan.
+  
+
+  
+If number of different statements issued by application is large enough,
+then autopreparing all of them can cause memory overflow
+(especially if there are many active clients, because prepared statements cache
+is local to the backend). To prevent growth of backend's memory because of
+autoprepared cache, it is possible to limit number of autoprepared statements
+by setting autoprepare_limit GUC variable. LRU strategy will be used
+to keep in memory most frequently used queries.
+  
+
+  
+It is possible to inspect autoprepared queries in the backend using
+pg_autoprepared_statements view. It shows original text of the
+query, types of the extracted parameters (replacing literals) and
+query execution counter.
+  
+
+ 
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f2b9d40..cb703f2 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8314,6 +8314,11 @@ SCRAM-SHA-256$iteration count:
  
 
  
+  pg_autoprepared_statements
+  autoprepared statements
+ 
+
+ 
   pg_prepared_xacts
   prepared transactions
  
@@ -9630,6 +9635,68 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
  
 
+
+ 
+  pg_autoprepared_statements
+
+  
+   pg_autoprepared_statements
+  
+
+  
+   The pg_autoprepared_statements view displays
+   all the autoprepared statements that are available in the current
+   session. See  for more information about autoprepared
+   statements.
+  
+
+  
+   pg_autoprepared_statements Columns
+
+   
+
+ 
+  Name
+  Type
+  Description
+ 
+
+
+ 
+  statement
+  text
+  
+The query string submitted by the client from which this prepared statement
+was created. Please notice that literals in this statement are not
+replaced with prepared statement placeholders.
+  
+ 
+ 
+  parameter_types
+  regtype[]
+  
+   The expected parameter types for the autoprepared statement in the
+   form of an array of regtype. The OID corresponding
+   to an element of this array can be obtained by casting the
+   regtype value to oid.
+  
+ 
+ 
+  exec_count
+  int8
+  
+Number of times this statement was executed.
+  
+ 
+
+   
+  
+
+  
+   The pg_autoprepared_statements view is read only.
+  
+ 
+
  
   pg_prepared_xacts
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a3..fcbb68b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5327,9 +5327,57 @@ SELECT * FROM parent WHERE key = 2400;
  
 
  
+
+ 
+  autoprepare_threshold (integer)
+  
+   

Re: [HACKERS] Custom compression methods

2019-07-01 Thread Alexander Korotkov
On Mon, Jul 1, 2019 at 5:51 PM Alvaro Herrera  wrote:
> On 2019-Jul-01, Alexander Korotkov wrote:
>
> > As I get we're currently need to make high-level decision of whether
> > we need this [1].  I was going to bring this topic up at last PGCon,
> > but I didn't manage to attend.  Does it worth bothering Ildus with
> > continuous rebasing assuming we don't have this high-level decision
> > yet?
>
> I agree that having to constantly rebase a patch that doesn't get acted
> upon is a bit pointless.  I see a bit of a process problem here: if the
> patch doesn't apply, it gets punted out of commitfest and reviewers
> don't look at it.  This means the discussion goes unseen and no
> decisions are made.  My immediate suggestion is to rebase even if other
> changes are needed.

OK, let's do this, assuming Ildus hasn't given up yet :)

> Longer-term I think it'd be useful to have patches
> marked as needing "high-level decisions" that may lag behind current
> master; maybe we have them provide a git commit-ID on top of which the
> patch applies cleanly.

+1, sounds like a good approach to me.

> I recently found git-imerge which can make rebasing of large patch
> series easier, by letting you deal with smaller conflicts one step at a
> time rather than one giant conflict; it may prove useful.

Thank you for the pointer, I will try it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Fix two issues after moving to unified logging system for command-line utils

2019-07-01 Thread Alexey Kondratov

Hi hackers,

I have found two minor issues with the unified logging system for
command-line programs (committed by Peter as cc8d415117) while I was rebasing
my pg_rewind patch:


1) A forgotten newline character in a pg_fatal call inside pg_rewind, which
will cause the following Assert in common/logging.c to fire:


Assert(fmt[strlen(fmt) - 1] != '\n');

It does not seem to be a problem for a production Postgres installation
built without asserts, but the newline should be removed for sanity.


2) Swapped progname <-> full_path arguments in initdb.c's setup_bin_paths
call [1], while the logging message remained the same. So the output will be
rather misleading, since pg_ctl and pg_dumpall use the previous argument
order.


Attached is a small patch that fixes these issues.

[1] 
https://github.com/postgres/postgres/commit/cc8d41511721d25d557fc02a46c053c0a602fed0#diff-c4414062a0071ec15df504d39a6df705R2500




Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

>From 2ea4a17ecc8f9bd57bb676f684fb729279339534 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 1 Jul 2019 18:11:25 +0300
Subject: [PATCH v1] Fix usage of unified logging pg_log_* in pg_rewind and
 initdb

---
 src/bin/initdb/initdb.c   | 2 +-
 src/bin/pg_rewind/pg_rewind.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2ef179165b..70273be783 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2497,7 +2497,7 @@ setup_bin_paths(const char *argv0)
 			pg_log_error("The program \"postgres\" is needed by %s but was not found in the\n"
 		 "same directory as \"%s\".\n"
 		 "Check your installation.",
-		 full_path, progname);
+		 progname, full_path);
 		else
 			pg_log_error("The program \"postgres\" was found by \"%s\"\n"
 		 "but was not the same version as %s.\n"
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 6e77201be6..d378053de4 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -555,7 +555,7 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		else if (controlFile == _target)
 			histfile = slurpFile(datadir_target, path, NULL);
 		else
-			pg_fatal("invalid control file\n");
+			pg_fatal("invalid control file");
 
 		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
 		pg_free(histfile);

base-commit: 95bbe5d82e428db342fa3ec60b95f1b9873741e5
-- 
2.17.1



Re: Built-in connection pooler

2019-07-01 Thread Konstantin Knizhnik



On 01.07.2019 12:57, Thomas Munro wrote:

On Thu, Mar 21, 2019 at 4:33 AM Konstantin Knizhnik
 wrote:

New version of the patch (rebased + bug fixes) is attached to this mail.

Hi again Konstantin,

Interesting work.  No longer applies -- please rebase.


A rebased version of the patch is attached.
This version of the built-in proxy can also be found in the conn_proxy
branch of https://github.com/postgrespro/postgresql.builtin_pool.git
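
For anyone who wants to try it, an illustrative postgresql.conf fragment using
the GUCs documented in the patch could look like this (the values are
arbitrary examples, not recommendations):

# postgresql.conf -- illustrative values only
connection_proxies = 2          # 0 (the default) disables session pooling
session_pool_size  = 10         # backends per database/user combination
proxy_port         = 6543       # clients connecting here are pooled
max_sessions       = 1000       # client sessions per proxy
session_schedule   = 'round-robin'   # or 'random', 'load-balancing'
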
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..9398e561e8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -719,6 +719,123 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_sessions (integer)
+  
+   max_sessions configuration parameter
+  
+  
+  
+
+  The maximum number of client sessions that can be handled by
+  one connection proxy when session pooling is switched on.
+  This parameter does not add any memory or CPU overhead, so
+  specifying a large max_sessions value
+  does not affect performance.
+  If the max_sessions limit is reached new connection are not accepted.
+
+
+  The default value is 1000. This parameter can only be set at server start.
+   
+  
+ 
+
+ 
+  session_pool_size (integer)
+  
+   session_pool_size configuration parameter
+  
+  
+  
+
+  Enables session pooling and defines the maximum number of
+  backends that can be used by client sessions for each database/user combination.
+  Launched non-tainted backends are never terminated even if there are no active sessions.
+  Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements.
+  Tainted backend can server only one client.
+
+
+  The default value is 10, so up to 10 backends will server each database,
+
+  
+ 
+
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+
+  Sets the TCP port for the connection pooler.
+  Clients connected to main "port" will be assigned dedicated backends,
+  while client connected to proxy port will be connected to backends through proxy which
+  performs transaction level scheduling. 
+   
+
+  The default value is 6543.
+
+  
+ 
+
+ 
+  connection_proxies (integer)
+  
+   connection_proxies configuration parameter
+  
+  
+  
+
+  Sets number of connection proxies.
+  Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing)."
+		 "Each proxy launches its own subset of backends. So maximal number of non-tainted backends is "
+		  "session_pool_size*connection_proxies*databases*roles.
+   
+
+  The default value is 0, so session pooling is disabled.
+
+  
+ 
+
+ 
+  session_schedule (enum)
+  
+   session_schedule configuration parameter
+  
+  
+  
+
+  Specifies scheduling policy for assigning session to proxies in case of
+  connection pooling. Default policy is round-robin.
+
+
+  With round-robin policy postmaster cyclicly scatter sessions between proxies.
+
+
+  With random policy postmaster randomly choose proxy for new session.
+
+
+  With load-balancing policy postmaster choose proxy with lowest load average.
+  Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections.
+   
+  
+ 
+
+ 
+  restart_pooler_on_reload (string)
+  
+   restart_pooler_on_reload configuration parameter
+  
+  
+  
+
+  Restart session pool workers once pg_reload_conf() is called.
+  The default value is false.
+   
+  
+ 
+
  
   unix_socket_directories (string)
   
diff --git a/doc/src/sgml/connpool.sgml b/doc/src/sgml/connpool.sgml
new file mode 100644
index 00..07f4202f75
--- /dev/null
+++ b/doc/src/sgml/connpool.sgml
@@ -0,0 +1,174 @@
+
+
+ 
+  Connection pooling
+
+  
+   built-in connection pool proxy
+  
+
+  
+PostgreSQL spawns a separate process (backend) for each client.
+For large number of clients such model can cause consumption of large number of system
+resources and lead to significant performance degradation, especially at computers with large
+number of CPU cores. The reason is high contention between backends for postgres resources.
+Also size of many Postgres internal data structures are proportional to the number of
+active backends as well as complexity of algorithms 

Re: [HACKERS] Custom compression methods

2019-07-01 Thread Alvaro Herrera
On 2019-Jul-01, Alexander Korotkov wrote:

> As I get we're currently need to make high-level decision of whether
> we need this [1].  I was going to bring this topic up at last PGCon,
> but I didn't manage to attend.  Does it worth bothering Ildus with
> continuous rebasing assuming we don't have this high-level decision
> yet?

I agree that having to constantly rebase a patch that doesn't get acted
upon is a bit pointless.  I see a bit of a process problem here: if the
patch doesn't apply, it gets punted out of commitfest and reviewers
don't look at it.  This means the discussion goes unseen and no
decisions are made.  My immediate suggestion is to rebase even if other
changes are needed.  Longer-term I think it'd be useful to have patches
marked as needing "high-level decisions" that may lag behind current
master; maybe we have them provide a git commit-ID on top of which the
patch applies cleanly.

I recently found git-imerge which can make rebasing of large patch
series easier, by letting you deal with smaller conflicts one step at a
time rather than one giant conflict; it may prove useful.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-07-01 Thread Alexey Kondratov

Hi Thomas,

On 01.07.2019 15:02, Thomas Munro wrote:


Hi Alexey,

This no longer applies.  Since the Commitfest is starting now, could
you please rebase it?


Thank you for the reminder. A rebased version of the patch is attached. I've
also modified my logging code to follow the new unified logging system for
command-line programs committed by Peter (cc8d415117).



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

>From f5f359274322020c2338b5b494f6327eaa61c0e1 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 19 Feb 2019 19:14:53 +0300
Subject: [PATCH v7] pg_rewind: options to use restore_command from command
 line or cluster config

Previously, when pg_rewind could not find required WAL files in the
target data directory the rewind process would fail. One had to
manually figure out which of required WAL files have already moved to
the archival storage and copy them back.

This patch adds possibility to specify restore_command via command
line option or use one specified inside postgresql.conf. Specified
restore_command will be used for automatic retrieval of missing WAL
files from archival storage.
---
 doc/src/sgml/ref/pg_rewind.sgml   |  30 -
 src/bin/pg_rewind/parsexlog.c | 164 +-
 src/bin/pg_rewind/pg_rewind.c |  92 ++-
 src/bin/pg_rewind/pg_rewind.h |   6 +-
 src/bin/pg_rewind/t/001_basic.pl  |   4 +-
 src/bin/pg_rewind/t/002_databases.pl  |   4 +-
 src/bin/pg_rewind/t/003_extrafiles.pl |   4 +-
 src/bin/pg_rewind/t/RewindTest.pm |  84 -
 8 files changed, 371 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 4d91eeb0ff..746c07e4df 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -67,8 +67,10 @@ PostgreSQL documentation
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
target cluster ran for a long time after the divergence, the old WAL
-   files might no longer be present. In that case, they can be manually
-   copied from the WAL archive to the pg_wal directory, or
+   files might no longer be present. In that case, they can be automatically
+   copied by pg_rewind from the WAL archive to the 
+   pg_wal directory if either -r or
+   -R option is specified, or
fetched on startup by configuring  or
.  The use of
pg_rewind is not limited to failover, e.g.  a standby
@@ -202,6 +204,30 @@ PostgreSQL documentation
   
  
 
+ 
+  -r
+  --use-postgresql-conf
+  
+   
+Use restore_command in the postgresql.conf to
+retrieve missing in the target pg_wal directory
+WAL files from the WAL archive.
+   
+  
+ 
+
+ 
+  -R restore_command
+  --restore-command=restore_command
+  
+   
+Specifies the restore_command to use for retrieval of the missing
+in the target pg_wal directory WAL files from
+the WAL archive.
+   
+  
+ 
+
  
   --debug
   
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 287af60c4e..d1de08320c 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -12,6 +12,7 @@
 #include "postgres_fe.h"
 
 #include 
+#include 
 
 #include "pg_rewind.h"
 #include "filemap.h"
@@ -44,6 +45,7 @@ static char xlogfpath[MAXPGPATH];
 typedef struct XLogPageReadPrivate
 {
 	const char *datadir;
+	const char *restoreCommand;
 	int			tliIndex;
 } XLogPageReadPrivate;
 
@@ -52,6 +54,9 @@ static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 			   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
 			   TimeLineID *pageTLI);
 
+static int RestoreArchivedWAL(const char *path, const char *xlogfname,
+   off_t expectedSize, const char *restoreCommand);
+
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
  * index 'tliIndex' in target timeline history, until 'endpoint'. Make note of
@@ -59,7 +64,7 @@ static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
  */
 void
 extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
-			   XLogRecPtr endpoint)
+			   XLogRecPtr endpoint, const char *restore_command)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
@@ -68,6 +73,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 
 	private.datadir = datadir;
 	private.tliIndex = tliIndex;
+	private.restoreCommand = restore_command;
 	xlogreader = XLogReaderAllocate(WalSegSz, ,
 	);
 	if (xlogreader == NULL)
@@ -155,7 +161,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex)
 void
 findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
    XLogRecPtr *lastchkptrec, TimeLineID *lastchkpttli,
-   XLogRecPtr *lastchkptredo)
+		

Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Tom Lane
Michael Paquier  writes:
> - 0003 begins to be the actual fancy thing with the addition of a
> --jobs option into reindexdb.  The main issue here which should be
> discussed is that when it comes to reindex of tables, you basically
> are not going to have any conflicts between the objects manipulated.
> However if you wish to do a reindex on a set of indexes then things
> get more tricky as it is necessary to list items per-table so as
> multiple connections do not conflict with each other if attempting to
> work on multiple indexes of the same table.  What this patch does is
> to select the set of indexes which need to be worked on (see the
> addition of cell in ParallelSlot), and then does a kind of
> pre-planning of each item into the connection slots so as each
> connection knows from the beginning which items it needs to process.
> This is quite different from vacuumdb where a new item is distributed
> only on a free connection from a unique list.  I'd personally prefer
> if we keep the facility in parallel.c so as it is only
> execution-dependent and that we have no pre-planning.  This would
> require keeping within reindexdb.c an array of lists, with one list 
> corresponding to one connection instead which feels more natural.

Couldn't we make this enormously simpler and less bug-prone by just
dictating that --jobs applies only to reindex-table operations?

> - 0004 is the part where the concurrent additions really matter as
> this consists in applying an extra filter to the indexes selected so
> as only the glibc-sensitive indexes are chosen for the processing.

I think you'd be better off to define and document this as "reindex
only collation-sensitive indexes", without any particular reference
to a reason why somebody might want to do that.

regards, tom lane




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Alvaro Herrera
Now that we have REINDEX CONCURRENTLY, I think reindexdb is going to
gain more popularity.

Please don't reuse a file name as generic as "parallel.c" -- it's
annoying when navigating source.  Maybe conn_parallel.c multiconn.c
connscripts.c admconnection.c ...?

If your server crashes or is stopped midway during the reindex, you
would have to start again from scratch, and it's tedious (if it's
possible at all) to determine which indexes were missed.  I think it
would be useful to have a two-phase mode: in the initial phase reindexdb
computes the list of indexes to be reindexed and saves them into a work
table somewhere.  In the second phase, it reads indexes from that table
and processes them, marking them as done in the work table.  If the
second phase crashes or is stopped, it can be restarted and consults the
work table.  I would keep the work table, as it provides a bit of an
audit trail.  It may be important to be able to run even if unable to
create such a work table (because of the numerous users that
DROP DATABASE postgres).

Maybe we'd have two flags in the work table for each index:
"reindex requested", "reindex done".

The "glibc filter" thing (which I take to mean "indexes that depend on
collations") would apply to the first phase: it just skips adding other
indexes to the work table.  I suppose ICU collations are not affected,
so the filter would be for glibc collations only?  The --glibc-dependent
switch seems too ad-hoc.  Maybe "--exclude-rule=glibc"?  That way we can
add other rules later.  (Not "--exclude=foo" because we'll want to add
the possibility to ignore specific indexes by name.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimize partial TOAST decompression

2019-07-01 Thread Binguo Bao
Hi!

> Andrey Borodin  wrote on Saturday, June 29, 2019 at 9:48 PM:

> Hi!
> Please, do not use top-posting, i.e. reply style where you quote whole
> message under your response. It makes reading of archives terse.
>
> > 24 июня 2019 г., в 7:53, Binguo Bao  написал(а):
> >
> >> This is not correct: L bytes of compressed data cannot always be
> decoded into at least L bytes of data. At worst we have one control byte
> per 8 bytes of literal bytes. This means at most we need (L*9 + 8) / 8
> bytes with the current pglz format.
> >
> > Good catch! I've corrected the related code in the patch.
> > ...
> > <0001-Optimize-partial-TOAST-decompression-2.patch>
>
> I've taken a look at the code.
> I think we should extract a function for the computation of max_compressed_size
> and put it somewhere along with the pglz code, in case someone changes
> something about pglz later, so that the compression-algorithm assumption
> is not forgotten.
>
> Also, I suggest just using 64-bit computation to avoid overflows. And I
> think it's worth checking whether max_compressed_size exceeds the whole datum
> and using the minimum of (max_compressed_size, uncompressed_data_size).
>
> Also, you declared needsize and max_compressed_size too far from where they
> are used. But this will be solved by the function extraction anyway.
>
> Thanks!
>
> Best regards, Andrey Borodin.


Thanks for the suggestion.
I've extracted a function that computes max_compressed_size and put it into
pg_lzcompress.c.
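
The bound discussed above (at worst one control byte per 8 literal bytes) boils
down to the following small helper; this is only an illustrative sketch, the
function actually added by the patch is pglz_maximum_compressed_size() in
pg_lzcompress.c:

/*
 * With the current pglz format, decoding rawsize bytes never requires reading
 * more than (rawsize * 9 + 8) / 8 compressed bytes.  Use 64-bit arithmetic for
 * the intermediate product and never exceed the whole compressed size.
 */
static int32
max_compressed_size(int32 rawsize, int32 total_compressed_size)
{
    int64       bound = ((int64) rawsize * 9 + 8) / 8;

    return (int32) Min(bound, (int64) total_compressed_size);
}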

Best regards, Binguo Bao.
From 79a1b4c292a0629df9d7ba3dc04e879aadca7a61 Mon Sep 17 00:00:00 2001
From: BBG 
Date: Sun, 2 Jun 2019 19:18:46 +0800
Subject: [PATCH] Optimize partial TOAST decompression

---
 src/backend/access/heap/tuptoaster.c | 24 +---
 src/common/pg_lzcompress.c   | 22 ++
 src/include/common/pg_lzcompress.h   |  1 +
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..684f1b2 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -266,6 +266,7 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		struct varatt_external toast_pointer;
+		int32 max_size;
 
 		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
@@ -273,8 +274,13 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 		if (!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
 			return toast_fetch_datum_slice(attr, sliceoffset, slicelength);
 
-		/* fetch it back (compressed marker will get set automatically) */
-		preslice = toast_fetch_datum(attr);
+		max_size = pglz_maximum_compressed_size(sliceoffset + slicelength,
+toast_pointer.va_rawsize);
+		/*
+		 * Be sure to get enough compressed slice
+		 * and compressed marker will get set automatically
+		 */
+		preslice = toast_fetch_datum_slice(attr, 0, max_size);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -2031,7 +2037,8 @@ toast_fetch_datum(struct varlena *attr)
  *	Reconstruct a segment of a Datum from the chunks saved
  *	in the toast relation
  *
- *	Note that this function only supports non-compressed external datums.
+ *	Note that this function supports non-compressed external datums
+ *	and compressed external datum slices at the start of the object.
  * --
  */
 static struct varlena *
@@ -2072,10 +2079,9 @@ toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset, int32 length)
 	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
 	/*
-	 * It's nonsense to fetch slices of a compressed datum -- this isn't lo_*
-	 * we can't return a compressed datum which is meaningful to toast later
+	 * It's meaningful to fetch slices at the start of a compressed datum.
 	 */
-	Assert(!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer));
+	Assert(!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer) || 0 == sliceoffset);
 
 	attrsize = toast_pointer.va_extsize;
 	totalchunks = ((attrsize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
@@ -2091,7 +2097,11 @@ toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset, int32 length)
 
 	result = (struct varlena *) palloc(length + VARHDRSZ);
 
-	SET_VARSIZE(result, length + VARHDRSZ);
+	if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer)) {
+		SET_VARSIZE_COMPRESSED(result, length + VARHDRSZ);
+	} else {
+		SET_VARSIZE(result, length + VARHDRSZ);
+	}
 
 	if (length == 0)
 		return result;			/* Can save a lot of work at this point! */
diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 988b398..2b5f112 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -771,3 +771,25 @@ pglz_decompress(const char *source, int32 slen, char *dest,
 	 */
 	return (char *) dp - dest;
 }
+
+
+
+/* --
+ * pglz_max_compressed_size -
+ *
+ * 		Calculate the maximum size of the compressed slice corresponding to the
+ * 		raw slice. Return the maximum size, or raw size if maximum size is greater
+ * 		than raw size.
+ * --
+ */
+int32

Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-07-01 Thread David Rowley
On Sun, 30 Jun 2019 at 17:54, David Rowley  wrote:

> Any further thoughts on this Michael?
>
> Or Andres? Do you have a preference to which of the approaches
> (mentioned upthread) I use for the fix?
>
> If I don't hear anything I'll probably just push the first fix. The
> inefficiency does not affect heap, so likely the people with the most
> interest in improving that will be authors of other table AMs that
> actually do something during table_finish_bulk_insert() for
> partitions. We could revisit this in PG13 if someone comes up with a
> need to improve things here.

I've pushed the original patch plus a small change to only call
table_finish_bulk_insert() for the target of the copy when we're using
bulk inserts.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Commitfest 2019-07, the first of five* for PostgreSQL 13

2019-07-01 Thread Thomas Munro
Hello hackers,

It's now July everywhere on Earth, so I marked CF 2019-07 as
in-progress, and 2019-09 as open for bumping patches into.  I pinged
most of the "Needs Review" threads that don't apply and will do a few
more tomorrow, and then I'll try to chase patches that fail on CI, and
then see what I can do to highlight some entries that really need
review/discussion.  I'll do end-of-week status reports.

-- 
Thomas Munro
https://enterprisedb.com




Re: Change ereport level for QueuePartitionConstraintValidation

2019-07-01 Thread Sergei Kornilov
Hello

This change is being discussed as an open item for pg12. It seems we have
neither objections nor agreement. I have attached an updated version due to a
merge conflict.

> Per discussion started here: 
> https://www.postgresql.org/message-id/CA%2BTgmoZWSLUjVcc9KBSVvbn%3DU5QRgW1O-MgUX0y5CnLZOA2qyQ%40mail.gmail.com

regards, Sergeidiff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fd67d2a841..bc837b499e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15357,11 +15357,11 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
 	if (PartConstraintImpliedByRelConstraint(scanrel, partConstraint))
 	{
 		if (!validate_default)
-			ereport(INFO,
+			ereport(DEBUG1,
 	(errmsg("partition constraint for table \"%s\" is implied by existing constraints",
 			RelationGetRelationName(scanrel;
 		else
-			ereport(INFO,
+			ereport(DEBUG1,
 	(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 			RelationGetRelationName(scanrel;
 		return;
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index f3c9236ad5..3344120190 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1252,7 +1252,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
 	 */
 	if (PartConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
 	{
-		ereport(INFO,
+		ereport(DEBUG1,
 (errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 		RelationGetRelationName(default_rel;
 		return;
@@ -1303,7 +1303,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
 			if (PartConstraintImpliedByRelConstraint(part_rel,
 	 def_part_constraints))
 			{
-ereport(INFO,
+ereport(DEBUG1,
 		(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 RelationGetRelationName(part_rel;
 
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index e687150511..4c28029ee0 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3633,11 +3633,9 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
 ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
 ALTER TABLE part_3_4 ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
-INFO:  partition constraint for table "part_3_4" is implied by existing constraints
 -- check if default partition scan skipped
 ALTER TABLE list_parted2_def ADD CONSTRAINT check_a CHECK (a IN (5, 6));
 CREATE TABLE part_55_66 PARTITION OF list_parted2 FOR VALUES IN (55, 66);
-INFO:  updated partition constraint for default partition "list_parted2_def" is implied by existing constraints
 -- check validation when attaching range partitions
 CREATE TABLE range_parted (
 	a int,
@@ -3662,7 +3660,6 @@ CREATE TABLE part2 (
 	b int NOT NULL CHECK (b >= 10 AND b < 18)
 );
 ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
-INFO:  partition constraint for table "part2" is implied by existing constraints
 -- Create default partition
 CREATE TABLE partr_def1 PARTITION OF range_parted DEFAULT;
 -- Only one default partition is allowed, hence, following should give error
@@ -3690,13 +3687,11 @@ ERROR:  partition constraint is violated by some row
 DELETE FROM part_5_a WHERE a NOT IN (3);
 ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IS NOT NULL AND a = 5);
 ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
-INFO:  partition constraint for table "part_5" is implied by existing constraints
 ALTER TABLE list_parted2 DETACH PARTITION part_5;
 ALTER TABLE part_5 DROP CONSTRAINT check_a;
 -- scan should again be skipped, even though NOT NULL is now a column property
 ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
-INFO:  partition constraint for table "part_5" is implied by existing constraints
 -- Check the case where attnos of the partitioning columns in the table being
 -- attached differs from the parent.  It should not affect the constraint-
 -- checking logic that allows to skip the scan.
@@ -3707,7 +3702,6 @@ CREATE TABLE part_6 (
 );
 ALTER TABLE part_6 DROP c;
 ALTER TABLE list_parted2 ATTACH PARTITION part_6 FOR VALUES IN (6);
-INFO:  partition constraint for table "part_6" is implied by existing constraints
 -- Similar to above, but the table being attached is a partitioned table
 -- whose partition has still different attnos for the root partitioning
 -- columns.
@@ -3725,10 +3719,7 @@ CREATE TABLE part_7_a_null (
 );
 ALTER TABLE part_7_a_null DROP c, DROP d, DROP e;
 ALTER TABLE part_7 ATTACH PARTITION part_7_a_null FOR 

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-07-01 Thread Thomas Munro
On Thu, Mar 28, 2019 at 6:49 AM Alexey Kondratov
 wrote:
> Updated version of patch is attached.

Hi Alexey,

This no longer applies.  Since the Commitfest is starting now, could
you please rebase it?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: extension patch of CREATE OR REPLACE TRIGGER

2019-07-01 Thread Thomas Munro
On Tue, Mar 5, 2019 at 10:19 PM David Steele  wrote:
> On 2/28/19 10:43 AM, Osumi, Takamichi wrote:
> > One past thread about introducing CREATE OR REPLACE TRIGGER into the syntax
> >
> > had stopped without complete discussion in terms of LOCK level.
> >
> > The past thread is this. I'd like to inherit this one.
>
> Since this patch landed at the last moment in the last commitfest for
> PG12 I have marked it as targeting PG13.

Hello Osumi-san,

The July Commitfest is now beginning.  To give the patch the best
chance of attracting reviewers, could you please post a rebased
version?  The last version doesn't apply.

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-07-01 Thread Thomas Munro
On Mon, Apr 1, 2019 at 4:36 AM Nikolay Shaplov  wrote:
> В письме от понедельник, 18 марта 2019 г. 3:03:04 MSK пользователь Iwata, Aya
> написал:
> > This patch does not apply.
> Oh... Sorry... here goes new version

Hi Nikolay,

Could we please have a new rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com


Re: [PROPOSAL] Temporal query processing with range types

2019-07-01 Thread Thomas Munro
On Wed, Apr 3, 2019 at 2:12 AM Ibrar Ahmed  wrote:
> I started looking at the patch; there are a couple of problems with it.
> The first one is an OID conflict, which I fixed on my machine. The second
> problem is an assertion failure. I think you have not compiled the PostgreSQL
> code with assertions enabled.

Hi Peter,

Looks like there was some good feedback for this WIP project last time
around.  It's currently in "Needs Review" status in the July
Commitfest.  To encourage more review and see some automated compile
and test results, could we please have a fresh rebase?  The earlier
patches no longer apply.

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-07-01 Thread Thomas Munro
On Fri, Apr 19, 2019 at 3:41 PM Paul Guo  wrote:
> I updated the patches as attached following previous discussions.

Hi Paul,

Could we please have a fresh rebase now that the CF is here?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: allow online change primary_conninfo

2019-07-01 Thread Sergei Kornilov
Hello

Updated version attached. The merge conflict was about the test count in
001_stream_rep.pl; nothing else was changed. My approach may still be
incorrect, so any redesign ideas are welcome. Thanks in advance!
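
For illustration, with the patch the intended workflow on a standby is just the
following (the connection string and slot name are placeholders; editing
postgresql.conf and reloading works the same way).  Per the docs hunk below,
the startup process restarts the WAL receiver when it sees the changed values:

  ALTER SYSTEM SET primary_conninfo = 'host=new-primary user=replicator';
  ALTER SYSTEM SET primary_slot_name = 'standby_1_slot';
  SELECT pg_reload_conf();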

regards, Sergei
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..054be17e08 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3916,9 +3916,14 @@ ANY num_sync ( 
@@ -3933,9 +3938,14 @@ ANY num_sync ( ).
-  This parameter can only be set at server start.
+  This parameter can only be set in the postgresql.conf
+  file or on the server command line.
   This setting has no effect if primary_conninfo is not
-  set.
+  set or the server is not in standby mode.
+ 
+ 
+  WAL receiver will be restarted after primary_slot_name
+  was changed.
  
 

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3e2c4e3e5b..964989432c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12125,6 +12125,42 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	return false;/* not reached */
 }
 
+void
+ProcessStartupSigHup(void)
+{
+	char	   *conninfo = pstrdup(PrimaryConnInfo);
+	char	   *slotname = pstrdup(PrimarySlotName);
+	bool		conninfoChanged;
+	bool		slotnameChanged;
+
+	ProcessConfigFile(PGC_SIGHUP);
+
+	/*
+	 * We need restart XLOG_FROM_STREAM source if replication settings was
+	 * changed
+	 */
+	conninfoChanged = (strcmp(conninfo, PrimaryConnInfo) != 0);
+	slotnameChanged = (strcmp(slotname, PrimarySlotName) != 0);
+
+	if ((conninfoChanged || slotnameChanged) &&
+		currentSource == XLOG_FROM_STREAM
+		&& WalRcvRunning())
+	{
+		if (conninfoChanged && slotnameChanged)
+			ereport(LOG,
+	(errmsg("The WAL receiver is going to be restarted due to change of primary_conninfo and primary_slot_name")));
+		else
+			ereport(LOG,
+	(errmsg("The WAL receiver is going to be restarted due to change of %s",
+			conninfoChanged ? "primary_conninfo" : "primary_slot_name")));
+
+		pendingRestartSource = true;
+	}
+
+	pfree(conninfo);
+	pfree(slotname);
+}
+
 /*
  * Determine what log level should be used to report a corrupt WAL record
  * in the current WAL page, previously read by XLogPageRead().
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 5048a2c2aa..9bf5c792fe 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -147,7 +147,7 @@ HandleStartupProcInterrupts(void)
 	if (got_SIGHUP)
 	{
 		got_SIGHUP = false;
-		ProcessConfigFile(PGC_SIGHUP);
+		ProcessStartupSigHup();
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 92c4fee8f8..e54d8e7172 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3571,7 +3571,7 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"primary_conninfo", PGC_POSTMASTER, REPLICATION_STANDBY,
+		{"primary_conninfo", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the connection string to be used to connect to the sending server."),
 			NULL,
 			GUC_SUPERUSER_ONLY
@@ -3582,7 +3582,7 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"primary_slot_name", PGC_POSTMASTER, REPLICATION_STANDBY,
+		{"primary_slot_name", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the name of the replication slot to use on the sending server."),
 			NULL
 		},
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..9e49020b19 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -320,6 +320,7 @@ extern void SetWalWriterSleeping(bool sleeping);
 
 extern void XLogRequestWalReceiverReply(void);
 
+extern void ProcessStartupSigHup(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 3c743d7d7c..ae80f4df3a 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 32;
+use Test::More tests => 33;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -208,7 +208,9 @@ $node_standby_2->append_conf('postgresql.conf',
 	"primary_slot_name = $slotname_2");
 $node_standby_2->append_conf('postgresql.conf',
 	"wal_receiver_status_interval = 1");
-$node_standby_2->restart;
+# should be able change primary_slot_name without restart
+# will wait effect in get_slot_xmins above
+$node_standby_2->reload;
 
 # Fetch xmin columns from slot's pg_replication_slots row, after waiting for
 # given boolean condition to be true to ensure we've reached a quiescent state
@@ -344,3 +346,21 @@ is($catalog_xmin, 

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

2019-07-01 Thread Thomas Munro
On Wed, Apr 17, 2019 at 10:23 PM Masahiko Sawada  wrote:
> Sorry for the very late. Attached updated version patches.

Hello Sawada-san,

Can we please have a fresh rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: Planning counters in pg_stat_statements (using pgss_store)

2019-07-01 Thread Julien Rouhaud
On Tue, Apr 2, 2019 at 9:22 AM legrand legrand
 wrote:
>
> >> case  avg_tps   pct_diff
> >> 0     89 278    --
> >> 1     88 745    0,6%
> >> 2     88 282    1,1%
> >> 3     86 660    2,9%
> >>
> >> This means that even in this extreme test case, the worst degradation is
> >> less than 3%
> >> (this overhead can be removed using the pg_stat_statements.track_planning GUC)
>
> > Is the difference between 2 and 3 the extraneous pgss_store call to
> > always store the query text if planner hook doesn't have access to the
> > query text?
>
> Yes it is,
> but I agree it seems a big gap (1,8%) compared to the difference between 1 
> and 2 (0,5%).
> Maybe this is just measurement "noise" ...

Rebased patches attached.
From c7a407143d1283a6b86c3244efccf0da7990e330 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Wed, 27 Mar 2019 11:24:35 +0100
Subject: [PATCH 2/2] Add planning counters to pg_stat_statements

---
 contrib/pg_stat_statements/Makefile   |   3 +-
 .../expected/pg_stat_statements.out   |  64 
 .../pg_stat_statements--1.7--1.8.sql  |  52 
 .../pg_stat_statements/pg_stat_statements.c   | 293 +-
 .../pg_stat_statements.control|   2 +-
 .../sql/pg_stat_statements.sql|  16 +
 doc/src/sgml/pgstatstatements.sgml|  49 ++-
 7 files changed, 401 insertions(+), 78 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 051ce46f0c..db33d9ffe1 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,7 +4,8 @@ MODULE_big = pg_stat_statements
 OBJS = pg_stat_statements.o $(WIN32RES)
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.6--1.7.sql \
+DATA = pg_stat_statements--1.4.sql \
+pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
 	pg_stat_statements--1.1--1.2.sql pg_stat_statements--1.0--1.1.sql \
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index c759763be2..8517b124e4 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -593,4 +593,68 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
 --
 DROP ROLE regress_stats_user1;
 DROP ROLE regress_stats_user2;
+--
+-- [re]plan counting
+--
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+CREATE TABLE test ();
+PREPARE prep1 AS SELECT COUNT(*) FROM test;;
+EXECUTE prep1;
+ count 
+---
+ 0
+(1 row)
+
+EXECUTE prep1;
+ count 
+---
+ 0
+(1 row)
+
+EXECUTE prep1;
+ count 
+---
+ 0
+(1 row)
+
+ALTER TABLE test ADD COLUMN x int;
+EXECUTE prep1;
+ count 
+---
+ 0
+(1 row)
+
+SELECT 42;
+ ?column? 
+--
+   42
+(1 row)
+
+SELECT 42;
+ ?column? 
+--
+   42
+(1 row)
+
+SELECT 42;
+ ?column? 
+--
+   42
+(1 row)
+
+SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query| plans | calls | rows 
++---+---+--
+ ALTER TABLE test ADD COLUMN x int  | 0 | 1 |0
+ CREATE TABLE test ()   | 0 | 1 |0
+ PREPARE prep1 AS SELECT COUNT(*) FROM test | 2 | 4 |4
+ SELECT $1  | 3 | 3 |3
+ SELECT pg_stat_statements_reset()  | 0 | 1 |1
+(5 rows)
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
new file mode 100644
index 00..de75643928
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
@@ -0,0 +1,52 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.8'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(boolean);
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements(boolean);
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+OUT userid oid,
+OUT dbid oid,
+OUT queryid bigint,
+OUT query text,
+OUT plans int8,
+OUT total_plan_time float8,
+OUT calls int8,
+OUT total_exec_time float8,
+

Re: Tid scan improvements

2019-07-01 Thread Thomas Munro
On Tue, Mar 26, 2019 at 7:25 PM David Steele  wrote:
> On 3/26/19 8:11 AM, Edmund Horner wrote:
> > Should I set update CF app to a) set the target version to 13, and/or
> > move it to next commitfest?
>
> If you plan to continue working on it in this CF then you can just
> change the target to PG13.  If you plan to take a break and pick up the
> work later then go ahead and push it to the next CF.

Hi Edmund,

The new CF is here.  I'm going through poking threads for submissions
that don't apply, but it sounds like this needs more than a rebase?
Perhaps this belongs in the next CF?

--
Thomas Munro
https://enterprisedb.com




Re: Converting NOT IN to anti-joins during planning

2019-07-01 Thread David Rowley
On Mon, 27 May 2019 at 20:43, Antonin Houska  wrote:
> I've spent some time looking into this.

Thank you for having a look at this.

> One problem I see is that SubLink can be in the JOIN/ON clause and thus it's
> not necessarily at the top of the join tree. Consider this example:
>
> CREATE TABLE a(i int);
> CREATE TABLE b(j int);
> CREATE TABLE c(k int NOT NULL);
> CREATE TABLE d(l int);
>
>   SELECT *
> FROM
> a
> JOIN b ON b.j NOT IN
> ( SELECT
> c.k
> FROM
> c)
> JOIN d ON b.j = d.l;

hmm yeah. Since the proofs that are being used in
expressions_are_not_nullable assume the join has already taken place,
we'll either need to not use the join conditions as proofs in
that case, or just disable the optimisation instead.  I think it's
fine to just disable the optimisation since it seems rather unlikely
that someone would write a join condition like that.  Plus it seems
quite a bit more complex to validate that the optimisation would even
be correct if NULLs were not possible.

I've attached a patch which restricts the pullups to FromExpr quals.
Anything below a JoinExpr disables the optimisation now.
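
For clarity, here is a case that should still qualify, reusing the tables from
the quoted example (only an illustrative sketch of the intended behaviour):

  -- Both proofs come from places the patch still accepts: c.k is NOT NULL
  -- per the catalog, and the IS NOT NULL qual on a.i sits in the WHERE
  -- clause (FromExpr quals) rather than in a JOIN ... ON clause, so the
  -- NOT IN can still be converted to an anti-join.
  SELECT *
  FROM a
  WHERE a.i IS NOT NULL
    AND a.i NOT IN (SELECT c.k FROM c);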

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


not_in_anti_join_v2.patch
Description: Binary data


Re: shared-memory based stats collector

2019-07-01 Thread Thomas Munro
On Fri, May 17, 2019 at 6:48 PM Kyotaro HORIGUCHI
 wrote:
> Fixed. Version 20.

Hello Horiguchi-san,

A new Commitfest is here.  This doesn't apply (maybe just because of
the new improved pgindent).  Could we please have a fresh rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-07-01 Thread Etsuro Fujita
On Mon, Jul 1, 2019 at 6:50 PM Thomas Munro  wrote:
> On Sat, May 18, 2019 at 12:20 AM Robert Haas  wrote:
> > On Tue, May 14, 2019 at 12:24 AM Amit Langote
> >  wrote:
> > > He did mention that cases where the nullable side is provably empty can be
> > > handled by simply returning the path of the non-nullable side with
> > > suitable projection path added on top to emit NULLs for the columns of the
> > > nullable-side.  If we teach populate_joinrel_with_paths() and underlings
> > > about that, then we can allow partitionwise join even in the case where
> > > the nullable side has some partitions missing.
> >
> > Yes, I think that would be a good approach to pursue.
>
> Hi Ashutosh, Amul, Fujita-san,
>
> Could we please have a fresh rebase for the new Commitfest?

Will do unless Ashutosh, Amul, or anyone wants to.

Thanks!

Best regards,
Etsuro Fujita




logical replication slot and publication alter

2019-07-01 Thread Jinhua Luo
Hi All,

There is an interesting issue: I created one replication slot,
specifying pgouput plugin and one publication "foobar". The
publication "foobar" was set to tables "foo1, foo2".

The slot was left unread while the tables foo1 and foo2 got changed.
Then I altered the publication "foobar" to remove table "foo2" and
started logical replication.

The changes from foo2 are still sent from the slot!

I also tried dropping the publication "foobar" entirely, but the slot still
finds the original publication definition and sends the changes without
a problem.

I checked the source code, and I think it's due to the snapshot: when
pgoutput loads the publication, it uses the catalog tuples from the
historic snapshot instead of the current version, so even if the
publication gets altered or dropped, the original definition is still
visible through the snapshot.

Is this expected, or is it a bug? Either way, it is unexpected that altering
a publication does not affect the replication stream.
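
For reference, a minimal reproduction sketch of what I described (the table
definitions and slot name are just illustrative, and wal_level must be set to
'logical'):

  CREATE TABLE foo1 (i int);
  CREATE TABLE foo2 (i int);
  CREATE PUBLICATION foobar FOR TABLE foo1, foo2;
  SELECT pg_create_logical_replication_slot('myslot', 'pgoutput');
  INSERT INTO foo2 VALUES (1);     -- change made while the slot is unread
  ALTER PUBLICATION foobar DROP TABLE foo2;
  -- The foo2 change is still streamed, because pgoutput reads the
  -- publication definition through the historic snapshot taken at the
  -- time of the change, not the current catalog contents:
  SELECT * FROM pg_logical_slot_peek_binary_changes('myslot', NULL, NULL,
      'proto_version', '1', 'publication_names', 'foobar');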

Regards,
Jinhua Luo




Re: Psql patch to show access methods info

2019-07-01 Thread Thomas Munro
On Sun, Mar 31, 2019 at 2:13 PM  wrote:
> Thanks for review.

Hi Sergey,

A new Commitfest is here and this doesn't apply -- could you please
post a rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: How to estimate the shared memory size required for parallel scan?

2019-07-01 Thread Thomas Munro
On Fri, May 3, 2019 at 6:06 AM Thomas Munro  wrote:
> Added to commitfest.

Rebased.

-- 
Thomas Munro
https://enterprisedb.com


0001-Make-file_fdw-parallel-aware-v3.patch
Description: Binary data


Re: Choosing values for multivariate MCV lists

2019-07-01 Thread Dean Rasheed
On Sat, 29 Jun 2019 at 14:01, Tomas Vondra  wrote:
>
> >However, it looks like the problem is with mcv_list_items()'s use
> >of %f to convert to text, which is pretty ugly.
> 
> >>>There's one issue with the signature, though - currently the function
> >>>returns null flags as bool array, but values are returned as simple
> >>>text value (formatted in array-like way, but still just a text).
> >>>
> >>IMO fixing this to return a text array is worth doing, even though it
> >>means a catversion bump.
> >>
> Attached is a cleaned-up version of that patch. The main difference is
> that instead of using construct_md_array() this uses ArrayBuildState to
> construct the arrays, which is much easier. The docs don't need any
> update because those were already using text[] for the parameter, the
> code was inconsistent with it.
>

Cool, this looks a lot neater and fixes the issues discussed with both
floating point values no longer being converted to text, and proper
text arrays for values.

One minor nitpick -- it doesn't seem to be necessary to build the
arrays *outfuncs and *fmgrinfo. You may as well just fetch the
individual column output function information at the point where it's
used (in the "if (!item->isnull[i])" block) rather than building those
arrays.


> This does require catversion bump, but as annoying as it is, I think
> it's worth it (and there's also the thread discussing the serialization
> issues). Barring objections, I'll get it committed later next week, once
> I get back from PostgresLondon.
>
> As I mentioned before, if we don't want any additional catversion bumps,
> it's possible to pass the arrays through output functions - that would
> allow us keeping the text output (but correct, unlike what we have now).
>

I think this is a clear bug fix, so I'd vote for fixing it properly
now, with a catversion bump.

Regards,
Dean




RE: [PATCH] Speedup truncates of relation forks

2019-07-01 Thread Jamison, Kirk
On Wednesday, June 26, 2019 6:10 PM(GMT+9), Adrien Nayrat wrote:
> As far as I remember, you should see "relation" wait events (type lock) on
> standby server. This is due to the startup process acquiring AccessExclusiveLock
> for the truncation and other backends waiting to acquire a lock to read the
> table.

Hi Adrien, thank you for taking time to reply.

I understand that RelationTruncate() can block read-only queries on
standby during redo. However, it's difficult for me to reproduce the 
test case where I need to catch that wait for relation lock, because
one has to execute SELECT within the few milliseconds of redoing the
truncation of one table.
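
(If one did want to catch it, repeatedly running something like the query
below on the standby while the truncation record is being replayed should show
the blocked backend; this is just a sketch of the observation method, not part
of the measurement.)

  SELECT pid, wait_event_type, wait_event, query
  FROM pg_stat_activity
  WHERE wait_event_type = 'Lock' AND wait_event = 'relation';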

Instead, I just measured the whole recovery time, which mainly exercises
smgr_redo(), to show the recovery improvement compared to head. Please refer
below.

[Recovery Test]
I used the same stored functions and configurations in the previous email
& created "test" db.

$ createdb test
$ psql -d test

1. [Primary] Create 10,000 relations.
test=# SELECT create_tables(1);

2. [P] Insert one row in each table.
test=# SELECT insert_tables(1);

3. [P] Delete row of each table.
test=# SELECT delfrom_tables(1);

4. [Standby] WAL application is stopped at Standby server.
test=# SELECT pg_wal_replay_pause();

5. [P] VACUUM is executed at Primary side, and measure its execution time.  

test=# \timing on
test=# VACUUM;

Alternatively, you may use:
$ time psql -d test -c 'VACUUM;'
(Note: WAL has not replayed on standby because it's been paused.)

6. [P] Wait until VACUUM has finished execution. Then, stop primary server. 
$ pg_ctl stop -w

7. [S] Resume WAL replay, then promote standby (failover).
I used a shell script to resume recovery and promote the standby server,
because it's otherwise difficult to measure the recovery time precisely.
Please refer to the script below.
- "SELECT pg_wal_replay_resume();" is executed and the WAL application is 
resumed.
- "pg_ctl promote" to promote standby.
- The time difference of "select pg_is_in_recovery();" from "t" to "f" is 
measured.

shell script:

PGDT=/path_to_storage_directory/

if [ "$1" = "resume" ]; then
psql -c "SELECT pg_wal_replay_resume();" test
date +%Y/%m/%d_%H:%M:%S.%3N
pg_ctl promote -D ${PGDT}
set +x
date +%Y/%m/%d_%H:%M:%S.%3N
while [ 1 ]
do
RS=`psql -Atc "select pg_is_in_recovery();" test`   
if [ ${RS} = "f" ]; then
break
fi
done
date +%Y/%m/%d_%H:%M:%S.%3N
set -x
exit 0
fi


[Test Results]
shared_buffers = 24GB

1. HEAD
(wal replay resumed)
2019/07/01_08:48:50.326
server promoted
2019/07/01_08:49:50.482
2019/07/01_09:02:41.051

 Recovery Time:
 13 min 50.725 s -> Time difference from WAL replay to complete recovery
 12 min 50.569 s -> Time difference of "select pg_is_in_recovery();" from "t" 
to "f"

2. PATCH
(wal replay resumed)
2019/07/01_07:34:26.766
server promoted
2019/07/01_07:34:57.790
2019/07/01_07:34:57.809

 Recovery Time: 
 31.043 s -> Time difference from WAL replay to complete recovery
 00.019 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
 
[Conclusion]
The recovery time significantly improved compared to head
from 13 minutes to 30 seconds.

Any thoughts?
I'd really appreciate your comments/feedback about the patch and/or test.


Regards,
Kirk Jamison


Re: Optimization of some jsonb functions

2019-07-01 Thread Thomas Munro
> >> On 2/22/19 2:05 AM, Nikita Glukhov wrote:
> >>> Attached set of patches with some jsonb optimizations that were made
> >>> during
> >>> comparison of performance of ordinal jsonb operators and jsonpath
> >>> operators.

Hi Nikita,

This doesn't apply -- to attract reviewers, could we please have a rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: [HACKERS] Custom compression methods

2019-07-01 Thread Alexander Korotkov
Hi, Thomas!

On Mon, Jul 1, 2019 at 1:22 PM Thomas Munro  wrote:
>
> On Sat, Mar 16, 2019 at 12:52 AM Ildus Kurbangaliev
>  wrote:
> > in my opinion this patch is usually skipped not because it is not
> > needed, but because of its size. It is not hard to maintain it until
> > committers have time for it or I get an actual response that
> > nobody is going to commit it.
>
> To maximise the chances of more review in the new Commitfest that is
> about to begin, could you please send a fresh rebase?  This doesn't
> apply anymore.


As I understand it, we currently need to make a high-level decision on
whether we need this [1].  I was going to bring this topic up at the last
PGCon, but I didn't manage to attend.  Is it worth bothering Ildus with
continuous rebasing while we don't have this high-level decision yet?

Links
1. 
https://www.postgresql.org/message-id/20190216054526.zss2cufdxfeudr4i%40alap3.anarazel.de

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: make installcheck-world in a clean environment

2019-07-01 Thread Thomas Munro
On Mon, Mar 25, 2019 at 8:30 PM Alexander Lakhin  wrote:
> 25.03.2019 10:25, David Steele wrote:
> > If it doesn't attract more review or a committer during this CF, I
> > think we should mark it as rejected.

> Please, delay rejecting it till the end of the commitfest, I'll try to
> find some enthusiasm amongst my colleagues.

Hi Alexander,

A new CF is here and this is in "Needs Review".  Would you like to
provide a rebased patch, or should it really be withdrawn?

Thanks,

--
Thomas Munro
https://enterprisedb.com




Re: [PATCH] kNN for btree

2019-07-01 Thread Thomas Munro
On Tue, Mar 26, 2019 at 4:30 AM Nikita Glukhov  wrote:
> >> Fixed two bugs in patches 3 and 10 (see below).
> >> Patch 3 was extracted from the main patch 5 (patch 4 in v9).
> >
> > This patch no longer applies so marking Waiting on Author.
> >
> Attached 11th version of the patches rebased onto current master.

Hi Nikita,

Since a new Commitfest is here and this doesn't apply, could we please
have a fresh rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: FETCH FIRST clause WITH TIES option

2019-07-01 Thread Thomas Munro
On Mon, Apr 8, 2019 at 8:26 PM Surafel Temesgen  wrote:
> agree

Hi Surafel,

A new Commitfest is starting.  Can you please post a fresh rebase of this patch?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Julien Rouhaud
On Mon, Jul 1, 2019 at 10:55 AM Michael Paquier  wrote:
>
> On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
> > With glibc 2.28 coming, all users will have to reindex almost
> > every index after a glibc upgrade to guarantee the lack of
> > corruption.  Unfortunately, reindexdb is not ideal for that as it's
> > processing everything using a single connection and isn't able to
> > discard indexes that don't depend on a glibc collation.
>
> We have seen that with a database of up to 100GB we finish by cutting
> the reindex time from 30 minutes to a couple of minutes with a schema
> we work on.  Julien, what were the actual numbers?

I did my benchmarking using a quite ideal database, having a large
number of tables and various sets of indexes, for a 75 GB total size.
This was done on my laptop, which has 6 multithreaded cores (and crappy
IO), keeping the default max_parallel_maintenance_workers = 2.

A naive reindexdb took approximately 33 minutes.  Filtering the list
of indexes took that down to slightly less than 15 min, but of course
each database will have a different ratio there.
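
(For anyone wanting to eyeball which indexes such a filter is about, a rough
catalog query sketch is below.  This is not the patch's actual logic, just one
possible heuristic: it looks for index columns using a libc-provided collation
other than "C"/"POSIX", and it ignores indexes that only rely on the database
default collation.)

  SELECT c.relname AS index_name
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
  WHERE EXISTS (SELECT 1
                FROM unnest(i.indcollation::oid[]) AS u(colloid)
                JOIN pg_collation col ON col.oid = u.colloid
                WHERE col.collprovider = 'c'        -- libc-provided
                  AND col.collname NOT IN ('C', 'POSIX'));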

Then, keeping the --glibc-dependent and using different level of parallelism:

-j1: ~ 14:50
-j3: ~ 7:30
-j6: ~ 5:23
-j8: ~ 4:45

That's pretty much the kind of results I was expecting given the
hardware I used.

> > PFA a patchset to add parallelism to reindexdb (reusing the
> > infrastructure in vacuumdb with some additions) and an option to
> > discard indexes that doesn't depend on glibc (without any specific
> > collation filtering or glibc version detection), with updated
> > regression tests.  Note that this should be applied on top of the
> > existing reindexdb cleanup & refactoring patch
> > (https://commitfest.postgresql.org/23/2115/).
>
> Please note that patch 0003 does not seem to apply correctly on HEAD
> as of c74d49d4.

Yes, this is because this patchset has to be applied on top of the
reindexdb refactoring patch mentioned.  It's sad that we don't have a
good way to deal with that kind of dependency, as it's also breaking
Thomas' cfbot :(

> - 0003 begins to be the actual fancy thing with the addition of a
> --jobs option into reindexdb.  The main issue here which should be
> discussed is that when it comes to reindex of tables, you basically
> are not going to have any conflicts between the objects manipulated.
> However if you wish to do a reindex on a set of indexes then things
> get more tricky as it is necessary to list items per-table so as
> multiple connections do not conflict with each other if attempting to
> work on multiple indexes of the same table.  What this patch does is
> to select the set of indexes which need to be worked on (see the
> addition of cell in ParallelSlot), and then does a kind of
> pre-planning of each item into the connection slots so as each
> connection knows from the beginning which items it needs to process.
> This is quite different from vacuumdb where a new item is distributed
> only on a free connection from a unique list.  I'd personally prefer
> if we keep the facility in parallel.c so as it is only
> execution-dependent and that we have no pre-planning.  This would
> require keeping within reindexdb.c an array of lists, with one list
> corresponding to one connection instead which feels more natural.

My fear here is that this approach would add some extra complexity,
especially requiring us to deal with free connection handling both in
GetIdleSlot() and the main reindexdb loop.  Also, the pre-planning
allows us to start processing the biggest tables first, which
optimises the overall runtime.




Re: [HACKERS] Custom compression methods

2019-07-01 Thread Thomas Munro
On Sat, Mar 16, 2019 at 12:52 AM Ildus Kurbangaliev
 wrote:
> in my opinion this patch is usually skipped not because it is not
> needed, but because of its size. It is not hard to maintain it until
> committers have time for it or I get an actual response that
> nobody is going to commit it.

Hi Ildus,

To maximise the chances of more review in the new Commitfest that is
about to begin, could you please send a fresh rebase?  This doesn't
apply anymore.

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: Libpq support to connect to standby server as priority

2019-07-01 Thread Haribabu Kommi
On Mon, 3 Jun 2019 at 16:32, Haribabu Kommi 
wrote:

>
>
> On Thu, Apr 11, 2019 at 9:13 AM Haribabu Kommi 
> wrote:
>
>> I fixed all the comments that you have raised above and attached the
>> updated
>> patches.
>>
>
> Rebased patches are attached.
>

Rebased patches are attached.

Regards,
Haribabu Kommi


0001-Restructure-the-code-to-remove-duplicate-code.patch
Description: Binary data


0002-New-TargetSessionAttrsType-enum.patch
Description: Binary data


0003-Make-transaction_read_only-as-GUC_REPORT-variable.patch
Description: Binary data


0004-New-prefer-read-target_session_attrs-type.patch
Description: Binary data


0005-New-read-only-target_session_attrs-type.patch
Description: Binary data


0006-Primary-prefer-standby-and-standby-options.patch
Description: Binary data


0007-New-function-to-rejecting-the-checked-write-connecti.patch
Description: Binary data


0008-Server-recovery-mode-handling.patch
Description: Binary data


Re: Built-in connection pooler

2019-07-01 Thread Thomas Munro
On Thu, Mar 21, 2019 at 4:33 AM Konstantin Knizhnik
 wrote:
> New version of the patch (rebased + bug fixes) is attached to this mail.

Hi again Konstantin,

Interesting work.  No longer applies -- please rebase.

-- 
Thomas Munro
https://enterprisedb.com




Re: MSVC Build support with visual studio 2019

2019-07-01 Thread Haribabu Kommi
On Thu, 27 Jun 2019 at 17:28, Michael Paquier  wrote:

> On Wed, Jun 26, 2019 at 10:29:05PM +1000, Haribabu Kommi wrote:
> > Thanks for the review. Yes, that patch applies till 9.5, it is my mistake
> > in naming the patch.
>
> I have been able to finally set up an environment with VS 2019 (as
> usual this stuff needs time, anyway..), and I can confirm that the
> patch is able to compile properly.
>

Thanks for the review.


> -  Visual Studio 2017 (including Express
> editions),
> -  as well as standalone Windows SDK releases 6.0 to 8.1.
> +  Visual Studio 2019 (including Express
> editions),
> +  as well as standalone Windows SDK releases 8.1a to 10.
> I would like to understand why this range of requirements is updated.
> Is there any reason to do so.  If we change these docs, what does it
> mean in terms of versions of Visual Studio supported?
>

We stopped supporting builds with all Visual Studio versions older
than 2013.
I updated the SDK versions accordingly.



> -or a VS2015Solution or a VS2017Solution, all in Solution.pm, depending on
> -the user's build environment) and adding objects implementing the
> corresponding
> -Project interface (VC2013Project or VC2015Project or VC2017Project from
> -MSBuildProject.pm) to it.
> +or a VS2015Solution or a VS2017Solution or a VS2019Solution, all in
> Solution.pm,
> +depending on the user's build environment) and adding objects implementing
> +the corresponding Project interface (VC2013Project or VC2015Project or
> VC2017Project
> +or VC2019Project from MSBuildProject.pm) to it.
> This formulation is weird the more we accumulate new objects, let's
> put that in a proper list of elements separated with commas except
> for the two last ones which should use "or".
>
> s/greather/greater/.
>
> The patch still has typos, and the format is not satisfying yet, so I
> have done a set of fixes as per the attached.
>

The change in the patch is good.


>
> -   elsif ($major < 6)
> +   elsif ($major < 12)
> {
> croak
> - "Unable to determine Visual Studio version:
> Visual Studio versions before 6.0 aren't supported.";
> + "Unable to determine Visual Studio version:
> Visual Studio versions before 12.0 aren't supported.";
> Well, this is a separate bug fix, still I don't mind fixing that in
> the same patch as we meddle with those code paths now.  Good catch.
>
> -   croak $visualStudioVersion;
> +   carp $visualStudioVersion;
> Same here.  Just wouldn't it be better to print the version found in
> the same message?
>

Yes, that is a good change. I thought of doing the same, but I don't know
how to do it.

A similar change is required for CreateProject as well.



> +   # The major visual studio that is supported has nmake version >=
>  14.20 and < 15.
> if ($major > 14)
> Comment line is too long.  It seems to me that the condition here
> should be ($major >= 14 && $minor >= 30).  That's not completely
> correct either as we have a version higher than 14.20 for VS 2019 but
> that's better than just using 14.29 or a fake number I guess.
>

The change is good, but the comment is wrong.

+ # The major visual studio that is supported has nmake
+ # version >= 14.30, so stick with it as the latest version

The major visual studio version that is supported has nmake version <=14.30

Except for the above two changes, overall the patch is in good shape.

Regards,
Haribabu Kommi


Re: Attempt to consolidate reading of XLOG page

2019-07-01 Thread Thomas Munro
On Tue, May 21, 2019 at 9:12 PM Antonin Houska  wrote:
> Robert Haas  wrote:
> > It seems to me that it's better to unwind the stack i.e. have the
> > function return the error information to the caller and let the caller
> > do as it likes.
>
> Thanks for a hint. The next version tries to do that.

Hi Antonin,

Could you please send a fresh rebase for the new Commitfest?

Thanks,


--
Thomas Munro
https://enterprisedb.com




Re: [HACKERS] Cached plans and statement generalization

2019-07-01 Thread Thomas Munro
On Wed, Apr 10, 2019 at 12:52 AM Konstantin Knizhnik
 wrote:
> New version of the patching disabling autoprepare for rules and handling
> planner error.

Hi Konstantin,

This doesn't apply. Could we please have a fresh rebase for the new Commitfest?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-07-01 Thread Thomas Munro
On Sat, May 18, 2019 at 12:20 AM Robert Haas  wrote:
> On Tue, May 14, 2019 at 12:24 AM Amit Langote
>  wrote:
> > He did mention that cases where the nullable side is provably empty can be
> > handled by simply returning the path of the non-nullable side with
> > suitable projection path added on top to emit NULLs for the columns of the
> > nullable-side.  If we teach populate_joinrel_with_paths() and underlings
> > about that, then we can allow partitionwise join even in the case where
> > the nullable side has some partitions missing.
>
> Yes, I think that would be a good approach to pursue.

Hi Ashutosh, Amul, Fujita-san,

Could we please have a fresh rebase for the new Commitfest?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: Option to dump foreign data in pg_dump

2019-07-01 Thread Daniel Gustafsson
> On 28 Jun 2019, at 17:30, Tom Lane  wrote:


> > Yeah, I think the feature as-proposed is a shotgun that's much more likely
> > to cause problems than solve them.  Almost certainly, what people would
> > really need is the ability to dump individual foreign tables' data not
> > everything.  (I also note that the main reason for "dump everything",
> > namely to get a guaranteed-consistent snapshot, isn't really valid for
> > foreign tables anyhow.)


I think this is sort of key here, the consistency guarantees are wildly
different.  A note about this should perhaps be added to the docs for the
option discussed here?

> On 28 Jun 2019, at 19:55, Luis Carril  wrote:


> What about providing a list of FDW servers instead of an all-or-nothing
> option? In that way the user really has to make a conscious decision to dump
> the content of the foreign tables for a specific server, and this would allow
> distinguishing between them if multiple FDWs are being used in the same DB.

I think this is a good option, the normal exclusion rules can then still apply
in case not everything from a specific server is of interest.

cheers ./daniel



Re: Re: [PATCH][PROPOSAL] Add enum releation option type

2019-07-01 Thread Thomas Munro
On Mon, Mar 25, 2019 at 9:39 PM David Steele  wrote:
> It's not clear to me that all of Michael's and Álvaro's issues have been
> addressed, so I've marked this Waiting on Author.

Hi Nikolay,

To help reviewers for the Commitfest that is starting, could you
please rebase this patch?

-- 
Thomas Munro
https://enterprisedb.com




Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT

2019-07-01 Thread Thomas Munro
On Tue, Mar 19, 2019 at 1:04 PM Matheus de Oliveira
 wrote:
> Sorry for the long delay. I've rebased the patch to current master (at 
> f2004f19ed now), attached as postgresql-alter-constraint.v4.patch. All tests 
> passed cleanly.

Hi Matheus,

As the commitfest is starting, could you please send a rebased patch?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-01 Thread Michael Paquier
On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
> With glibc 2.28 coming, all users will have to reindex almost
> every index after a glibc upgrade to guarantee the lack of
> corruption.  Unfortunately, reindexdb is not ideal for that as it's
> processing everything using a single connection and isn't able to
> discard indexes that don't depend on a glibc collation.

We have seen that with a database of up to 100GB we finish by cutting
the reindex time from 30 minutes to a couple of minutes with a schema
we work on.  Julien, what were the actual numbers?

> PFA a patchset to add parallelism to reindexdb (reusing the
> infrastructure in vacuumdb with some additions) and an option to
> discard indexes that doesn't depend on glibc (without any specific
> collation filtering or glibc version detection), with updated
> regression tests.  Note that this should be applied on top of the
> existing reindexdb cleanup & refactoring patch
> (https://commitfest.postgresql.org/23/2115/).

Please note that patch 0003 does not seem to apply correctly on HEAD
as of c74d49d4.  Here is also a small description of each patch:
- 0001 refactors the connection slot facility from vacuumdb.c into a
new, separate file called parallel.c in src/bin/scripts/.  This is not
really fancy as some code only moves around.
- 0002 adds an extra option for simple lists to be able to use
pointers, with an interface to append elements in it.
- 0003 begins to be the actual fancy thing with the addition of a
--jobs option into reindexdb.  The main issue here which should be
discussed is that when it comes to reindex of tables, you basically
are not going to have any conflicts between the objects manipulated.
However if you wish to do a reindex on a set of indexes then things
get more tricky as it is necessary to list items per-table so as
multiple connections do not conflict with each other if attempting to
work on multiple indexes of the same table.  What this patch does is
to select the set of indexes which need to be worked on (see the
addition of cell in ParallelSlot), and then does a kind of
pre-planning of each item into the connection slots so as each
connection knows from the beginning which items it needs to process.
This is quite different from vacuumdb where a new item is distributed
only on a free connection from a unique list.  I'd personally prefer
if we keep the facility in parallel.c so as it is only
execution-dependent and that we have no pre-planning.  This would
require keeping within reindexdb.c an array of lists, with one list 
corresponding to one connection instead which feels more natural.
- 0004 is the part where the concurrent additions really matter as
this consists in applying an extra filter to the indexes selected so
as only the glibc-sensitive indexes are chosen for the processing.
--
Michael


signature.asc
Description: PGP signature


Re: POC: Cleaning up orphaned files using undo logs

2019-07-01 Thread Thomas Munro
On Mon, Jul 1, 2019 at 7:53 PM Thomas Munro  wrote:
> 3.  Recognise UNDO_SHARED record set boundaries differently.  Whereas
> undolog.c recognises transaction boundaries automatically for the
> other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for
> UNDO_SHARED the

... set of records inserted in between the BeginUndoRecordInsert() and
FinishUndoRecordInsert() calls is eventually discarded as a unit, and
the rm_undo_status() callback for the calling AM decides when that is
allowed.  In contrast, for the other categories there may be records
from any number of undo-aware AMs that are entirely unaware of each
other, and they must all be discarded together if the transaction
commits and becomes all-visible, so undolog.c automatically manages the
boundaries to make that work when inserting.

-- 
Thomas Munro
https://enterprisedb.com




Re: Superfluous libpq-be.h include in GSSAPI code

2019-07-01 Thread Daniel Gustafsson
> On 29 Jun 2019, at 04:23, Michael Paquier  wrote:
> 
> On Fri, Jun 28, 2019 at 08:47:33PM +0200, Julien Rouhaud wrote:
>> On Fri, Jun 28, 2019 at 4:37 PM Daniel Gustafsson  wrote:
>>> backend/libpq/be-secure-gssapi.c is including both libpq-be.h and libpq.h,
>>> which makes libpq-be.h superfluous as it gets included via libpq.h.  The
>>> attached patch removes the inclusion of libpq-be.h to make 
>>> be-secure-gssapi.c
>>> behave like other files which need both headers.
>> 
>> LGTM.
> 
> Thanks, committed.  I looked at the area in case but did not notice
> anything else strange.

Thanks!

> (We have in hba.h a kludge with hbaPort to avoid including libpq-be.h,
> I got to wonder if we could do something about that..)

I looked at that one too at the time, but didn’t come up with anything less
kludgy.

cheers ./daniel



Re: Protect syscache from bloating with negative cache entries

2019-07-01 Thread Kyotaro Horiguchi
Hello, 

my_gripe> But, still fluctuates by around 5%..
my_gripe> 
my_gripe> If this level of the degradation is still not acceptable, that
my_gripe> means that nothing can be inserted in the code path and the new
my_gripe> code path should be isolated from existing code by using indirect
my_gripe> call.

Finally, after some struggling, I think I have managed to measure
the impact on performance precisely and reliably. Starting every
build from "make distclean", then removing everything in $TARGET
before installation, makes things stable enough. (I don't think
that's good, but I didn't investigate the cause..)

I measured time/call by directly calling SearchSysCache3() many
times. It showed that the patch causes around 0.1 microsec
degradation per call. (The function overall took about 6.9
microsec on average.)

Next, I counted how many times SearchSysCache is called while
planning, as an example, a query on a partitioned table having
3000 columns and 1000 partitions.

  explain analyze select sum(c) from test.p;
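
(A sketch of generating this kind of schema is below; the column and
partition counts are illustrative and follow the setup notes at the end of
this mail, since a single relation cannot have more than 1600 columns, and
catcachebench() itself comes from my test module, which is not included
here.)

  CREATE SCHEMA IF NOT EXISTS test;
  DO $$
  DECLARE
    cols text;
  BEGIN
    -- first column "c" is the partition key, plus many filler columns
    SELECT 'c int, ' || string_agg(format('c%s int', g), ', ' ORDER BY g)
      INTO cols FROM generate_series(2, 1000) AS g;
    EXECUTE format('CREATE TABLE test.p (%s) PARTITION BY RANGE (c)', cols);
    FOR i IN 1..3000 LOOP
      EXECUTE format(
        'CREATE TABLE test.p%s PARTITION OF test.p FOR VALUES FROM (%s) TO (%s)',
        i, i, i + 1);
    END LOOP;
  END
  $$;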

The planner made 6020608 syscache calls while planning, and the
overall planning time was 8641 ms. (Execution time was 48 ms.)
6020608 times 0.1 us is about 602 ms of degradation, so roughly 7%
estimated degradation in planning time. The degradation was caused by
really only the two successive instructions "ADD/conditional
MOVE(CMOVE)". This leads to the conclusion that the existing
code path as-is doesn't have room for any additional code.


So I looked for room for at least one branch and found it (on
gcc 7.3.1/CentOS 7/x64). Interestingly, de-inlining
SearchCatCacheInternal gave me a performance gain of about
3%. Further inlining of CatalogCacheComputeHashValue() gave
another gain of about 3%. I could add a branch in
SearchCatCacheInternal within that gain.

I also tried indirect calls, but the degradation overwhelmed the
gain, so I chose branching rather than indirect calls. I didn't
investigate why that happens.


The following is the result. The binaries are built with the same
configuration using -O2.

binary means
  master  : master HEAD.
  patched_off : patched, but pruning disabled (catalog_cache_prune_min_age=-1).
  patched_on  : patched with pruning enabled.
("300s" for 1, "1s" for2, "0" for 3)

bench:
  1: corresponds to catcachebench(1); fetching STATRELATTINH 3000
 * 1000 times, generating new cache entries. (Massive cache
   creation)
 Pruning doesn't happen while running this.

  2: catcachebench(2); 6 times cache access on 1000
 STATRELATTINH entries. (Frequent cache reference)
 Pruning doesn't happen while running this.

  3: catcachebench(3); fetching 1000(tbls) * 3000(cols)
 STATRELATTINH entries. The catcache clock advances at an
 interval of 100(tbls) * 3000(cols) accesses, and pruning
 happens.

 While running catcachebench(3) once, pruning happens 28
 times; most of the time 202202 entries are removed, and
 the total number of entries is limited to 524289. (The
 systable has 3000 * 1001 = 3003000 tuples.)

iter: Number of iterations. Time (ms) and stddev are calculated over
 the iterations.


binar| bench | iter  |  time ms | stddev
-+---+---+--+
 master  | 1 |10 |  8150.30 |  12.96
 master  | 2 |10 |  4002.88 |  16.18
 master  | 3 |10 |  9065.06 |  11.46
-+---+---+--+
 patched_off | 1 |10 |  8090.95 |   9.95
 patched_off | 2 |10 |  3984.67 |  12.33
 patched_off | 3 |10 |  9050.46 |   4.64
-+---+---+--+
 patched_on  | 1 |10 |  8158.95 |   6.29
 patched_on  | 2 |10 |  4023.72 |  10.41
 patched_on  | 3 |10 | 16532.66 |  18.39

patched_off is slightly faster than master. patched_on is
generally a bit slower. Even though patched_on/3 seems to take too
long, the extra time comes from increased catalog table
access in exchange for the memory saving. (That is, it is expected
behavior.) I ran it several times and most of them showed the same
tendency.

As a side effect, once the branch is added, the shared syscache from the
neighbouring thread will be able to be hooked in as well without further
impact on the existing code path.


===
The benchmark script is used as the follows:

- create many tables (3000, for example) in the "test" schema. I
  created a partitioned table with 3000 children.

- The tables have many columns, 1000 in my case.

- Run the following commands.

  =# select catcachebench(0);  -- warm up systables.
  =# set catalog_cache_prune_min_age = any; -- as required
  =# select catcachebench(n);  -- 3 >= n >= 1, the number of "bench" above.


  The above result is taken with the following query.

  =# select 'patched_on', '3' , count(a), avg(a)::numeric(10,2), 
stddev(a)::numeric(10,2) from (select catcachebench(3) from generate_series(1, 
10)) as a(a);


The attached patches are: