Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-03-21 Thread Masahiko Sawada
On Thu, Mar 21, 2019 at 11:27 PM Pavan Deolasee
 wrote:
>
>
>
> On Thu, Mar 14, 2019 at 3:54 PM Masahiko Sawada  wrote:
>>
>>
>> >
>> >
>> > Ok. I will run some tests. But please note that this patch is a bug fix to 
>> > address the performance issue that is caused by having to rewrite the 
>> > entire table when all-visible bit is set on the page during first vacuum. 
>> > So while we may do some more work during COPY FREEZE, we're saving a lot 
>> > of page writes during next vacuum. Also, since the scan that we are doing 
>> > in this patch is done on a page that should be in the buffer cache, we 
>> > will pay a bit in terms of CPU cost, but not anything in terms of IO cost.
>>
>> Agreed. I had been misunderstanding this patch. The page scan during
>> COPY FREEZE is necessary and it's much cheaper than doing it in the first
>> vacuum.
>
>
> Thanks for agreeing to the need of this bug fix. I ran some simple tests 
> anyways and here are the results.
>
> The test consists of a simple table with three columns, two integers and one 
> char(100). I then ran COPY (FREEZE), loading 7M rows, followed by a VACUUM. 
> The total size of the raw data is about 800MB and the table size in Postgres 
> is just under 1GB. The results for 3 runs in milliseconds are:
>
> Master:
>  COPY FREEZE: 40243.725   40309.675   40783.836
>  VACUUM: 2685.871   2517.445   2508.452
>
> Patched:
>  COPY FREEZE: 40942.410  40495.303   40638.075
>  VACUUM: 25.067  35.793   25.390
>
> So there is a slight increase in the time to run COPY FREEZE, but a 
> significant reduction in time to VACUUM the table. The benefits will only go 
> up if the table is vacuumed much later when most of the pages are already 
> written to the disk and removed from shared buffers and/or kernel cache.
>
> I hope this satisfies your doubts regarding performance implications of the 
> patch.

Thank you for the performance testing, that's a great improvement!

I've looked at the patch and have comments and questions.

+/*
+ * While we are holding the lock on the page, check if all tuples
+ * in the page are marked frozen at insertion. We can safely mark
+ * such page all-visible and set visibility map bits too.
+ */
+if (CheckPageIsAllFrozen(relation, buffer))
+PageSetAllVisible(page);
+
+MarkBufferDirty(buffer);

Maybe we don't need to mark the buffer dirty if the page is not set all-visible.
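In other words, something like this (just a sketch of the rearrangement I
mean, reusing the functions from the patch):

    /* Sketch: dirty the buffer only when the page is actually modified */
    if (CheckPageIsAllFrozen(relation, buffer))
    {
        PageSetAllVisible(page);
        MarkBufferDirty(buffer);
    }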

-
+if (PageIsAllVisible(page))
+{
+uint8   vm_status = visibilitymap_get_status(relation,
+targetBlock, &myvmbuffer);
+uint8   flags = 0;
+
+/* Set the VM all-frozen bit to flag, if needed */
+if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
+flags |= VISIBILITYMAP_ALL_VISIBLE;
+if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0)
+flags |= VISIBILITYMAP_ALL_FROZEN;
+
+Assert(BufferIsValid(myvmbuffer));
+if (flags != 0)
+visibilitymap_set(relation, targetBlock, buffer, InvalidXLogRecPtr,
+myvmbuffer, InvalidTransactionId, flags);

Since CheckPageIsAllFrozen() is used only when inserting frozen tuples,
CheckAndSetPageAllVisible() seems to be implemented for the same
situation. If CheckAndSetPageAllVisible() exists only for this
situation, shouldn't we rather check that the VM status of the page
is 0 and then set both flags on the page? The 'flags' will always be
(VISIBILITYMAP_ALL_FROZEN | VISIBILITYMAP_ALL_VISIBLE) in the COPY
FREEZE case. I'm confused that this function mixes code that assumes
this special situation with code that could be used in generic
situations.
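If the function is really meant only for this path, I would expect
something closer to the following (a sketch only, reusing the identifiers
from the patch):

    /* Sketch: in the COPY FREEZE path both VM bits are always set together */
    if (PageIsAllVisible(page) &&
        visibilitymap_get_status(relation, targetBlock, &myvmbuffer) == 0)
    {
        Assert(BufferIsValid(myvmbuffer));
        visibilitymap_set(relation, targetBlock, buffer, InvalidXLogRecPtr,
                          myvmbuffer, InvalidTransactionId,
                          VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
    }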

-
Perhaps we can add some tests for this feature to pg_visibility module.


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Problem with default partition pruning

2019-03-21 Thread Amit Langote
Hosoya-san,

On 2019/03/22 15:02, Yuzuko Hosoya wrote:
> I understood Amit's proposal.  But I think the issue Thibaut reported would 
> occur regardless of whether the clauses contain OR clauses, as follows.
> I tested a query which should output "One-Time Filter: false".
> 
> # explain select * from test2_0_20 where id = 25;
>   QUERY PLAN   
> ---
>  Append  (cost=0.00..25.91 rows=6 width=36)
>->  Seq Scan on test2_10_20_def  (cost=0.00..25.88 rows=6 width=36)
>  Filter: (id = 25)
> 

Good catch, thanks.

> As Amit described in the previous email, id = 25 contradicts test2_0_20's
> partition constraint, so I think this clause should be ignored and we can
> also handle this case in a similar way to Amit's proposal.
> 
> I attached v1-delta-2.patch which fixes the above issue.
> 
> What do you think about it?

It looks fine to me.  You put the code block to check whether a given
clause contradicts the partition constraint in the perfect place. :)

Maybe we should have two patches as we seem to be improving two things:

1. Patch to fix problems with default partition pruning originally
reported by Hosoya-san

2. Patch to determine if a given clause contradicts a sub-partitioned
table's partition constraint, fixing problems unearthed by Thibaut's tests

About the patch that Horiguchi-san proposed upthread, I think it has merit
that it will make partprune.c code easier to reason about, but I think we
should pursue it separately.

Thanks,
Amit




RE: Problem with default partition pruning

2019-03-21 Thread Yuzuko Hosoya
Hi,

Thanks a lot for additional tests and the new patch.


> Le 20/03/2019 à 10:06, Amit Langote a écrit :
> > Hi Thibaut,
> >
> > On 2019/03/19 23:58, Thibaut Madelaine wrote:
> >> I kept on testing with sub-partitioning.
> > Thanks.
> >
> >> I found a case, using 2 default partitions, where a default partition
> >> is not pruned:
> >>
> >> --
> >>
> >> create table test2(id int, val text) partition by range (id); create
> >> table test2_20_plus_def partition of test2 default; create table
> >> test2_0_20 partition of test2 for values from (0) to (20)
> >>   partition by range (id);
> >> create table test2_0_10 partition of test2_0_20 for values from (0)
> >> to (10); create table test2_10_20_def partition of test2_0_20
> >> default;
> >>
> >> # explain (costs off) select * from test2 where id=5 or id=25;
> >>QUERY PLAN
> >> -
> >>  Append
> >>->  Seq Scan on test2_0_10
> >>  Filter: ((id = 5) OR (id = 25))
> >>->  Seq Scan on test2_10_20_def
> >>  Filter: ((id = 5) OR (id = 25))
> >>->  Seq Scan on test2_20_plus_def
> >>  Filter: ((id = 5) OR (id = 25))
> >> (7 rows)
> >>
> >> --
> >>
> >> I have the same output using Amit's v1-delta.patch or Hosoya's
> >> v2_default_partition_pruning.patch.
> > I think I've figured what may be wrong.
> >
> > Partition pruning step generation code should ignore any arguments of
> > an OR clause that won't be true for a sub-partitioned partition, given
> > its partition constraint.
> >
> > In this case, id = 25 contradicts test2_0_20's partition constraint
> > (which is, a IS NOT NULL AND a >= 0 AND a < 20), so the OR clause
> > should really be simplified to id = 5, ignoring the id = 25 argument.
> > Note that we remove id = 25 only for the considerations of pruning and
> > not from the actual clause that's passed to the final plan, although
> > it wouldn't be a bad idea to try to do that.
> >
> > Attached revised delta patch, which includes the fix described above.
> >
> > Thanks,
> > Amit
> Amit, I tested many cases with nested range sub-partitions... and I did not 
> find any problem with your
> last patch  :-)
> 
> I tried mixing with hash partitions with no problems.
> 
> From the patch, there seem to be fewer checks than before. I cannot think of
> a case that could have a performance impact.
> 
> Hosoya-san, if you agree with Amit's proposal, do you think you can send a 
> patch unifying your
> default_partition_pruning.patch and Amit's second v1-delta.patch?
>

I understood Amit's proposal.  But I think the issue Thibaut reported would
occur regardless of whether the clauses contain OR clauses, as follows.
I tested a query which should output "One-Time Filter: false".

# explain select * from test2_0_20 where id = 25;
  QUERY PLAN   
---
 Append  (cost=0.00..25.91 rows=6 width=36)
   ->  Seq Scan on test2_10_20_def  (cost=0.00..25.88 rows=6 width=36)
 Filter: (id = 25)


As Amit described in the previous email, id = 25 contradicts test2_0_20's
partition constraint, so I think this clause should be ignored and we can
also handle this case in a similar way to Amit's proposal.
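To illustrate the idea (a rough sketch only, not the actual contents of the
attached patch; partqual stands for the relation's partition constraint,
e.g. rel->partition_qual, and clauses for the clauses being considered):

    /*
     * Sketch: for pruning purposes, skip any clause that the partition
     * constraint already refutes, so that "id = 25" against test2_0_20
     * is treated as a contradiction.
     */
    foreach(lc, clauses)
    {
        Expr       *clause = (Expr *) lfirst(lc);

        if (partqual != NIL &&
            predicate_refuted_by(partqual, list_make1(clause), false))
            continue;       /* contradicts the partition constraint; ignore */

        /* ... generate pruning steps for 'clause' as before ... */
    }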

I attached v1-delta-2.patch which fixes the above issue.

What do you think about it?


Best regards,
Yuzuko Hosoya


v1-delta-2.patch
Description: Binary data


Re: BUG #15668: Server crash in transformPartitionRangeBounds

2019-03-21 Thread Amit Langote
Hi,

Thanks for splitting.  It makes sense because, as you know, the bug that
causes the crash is a separate problem from the unintuitive error messages
that result from the way in which we parse expressions.

On 2019/03/22 14:09, Michael Paquier wrote:
> On Wed, Mar 20, 2019 at 06:17:27PM +0900, Michael Paquier wrote:
>> The thing is that in order to keep the tests for the crash, we finish
>> with the unintuitive RTE-related errors, so it is also inconsistent to
>> not group things..
> 
> As I have seen no feedback from others regarding the changes in error
> messages depending on the parsing context, I have been looking at
> splitting the fix for the crash and the changes to the error messages, and
> attached is the result of the split (minus the commit messages).  The
> first patch fixes the crash, and includes test cases to cover the
> crash as well as extra cases for list and range strategies with
> partition bounds.  Some of the error messages are confusing, but that
> fixes the issue.  This is not the most elegant thing without the
> second patch, but well that could be worse.

A comment on this one:

+   if (cname == NULL)
+   {
+   /*
+* No field names have been found, meaning that there
+* is not much to do with special value handling.  Instead
+* let the expression transformation handle any errors and
+* limitations.
+*/

This comment sounds a bit misleading.  The code above this "did" find
field names, but there were too many.  What this comment should mention is
that parsing didn't return a single field name, which is the format that
the code below this can do something useful with.  I had proposed that
upthread, but maybe that feedback got lost in the discussion about other
related issues.

I had proposed this:

+  /*
+   * There should be a single field named "minvalue" or "maxvalue".
+   */
  if (list_length(cref->fields) == 1 &&
  IsA(linitial(cref->fields), String))
  cname = strVal(linitial(cref->fields));

- Assert(cname != NULL);
- if (strcmp("minvalue", cname) == 0)
+ if (cname == NULL)
+ {
+ /*
+  * ColumnRef is not in the desired single-field-name form; for
+  * consistency, let transformExpr() report the error rather
+  * than doing it ourselves.
+  */
+ }

Maybe that could use a few more tweaks, but I hope I've made my point.

+CREATE TABLE part_bogus_expr_fail PARTITION OF range_parted
+  FOR VALUES FROM (sum(a)) TO ('2019-01-01');
+ERROR:  function sum(date) does not exist
+LINE 2:   FOR VALUES FROM (sum(a)) TO ('2019-01-01');

Maybe we should add to this patch only the tests relevant to the cases
that would lead to a crash without this patch.

Tests regarding error-message fine-tuning can be added in the other patch.

> The second patch adds better error context for the different error
> messages, and includes tests for default expressions, which we could
> discuss in a separate thread.  So I am not proposing to commit that
> without more feedback.

A separate thread will definitely attract more attention, at least in due
time. :)

Thanks,
Amit




Re: Fwd: Add tablespace tap test to pg_rewind

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 11:41:01PM +0800, Shaoqi Bai wrote:
> Have updated the patch doing as you suggested

+   RewindTest::setup_cluster($test_mode, ['-g']);
+   RewindTest::start_master();

There is no need to test for group permissions here, 002_databases.pl
already looks after that.

+   # Check for symlink -- needed only on source dir, only allow symlink
+   # when under pg_tblspc
+   # (note: this will fall through quietly if file is already gone)
+   if (-l $srcpath)

So you need that, but in RecursiveCopy.pm, because of init_from_backup
when creating the standby, which makes sense when it comes to
pg_tblspc.  I am wondering about any side effects though, and whether it
would make sense to just remove the restriction for symlinks in
_copypath_recurse().
--
Michael


signature.asc
Description: PGP signature


Re: BUG #15668: Server crash in transformPartitionRangeBounds

2019-03-21 Thread Michael Paquier
On Wed, Mar 20, 2019 at 06:17:27PM +0900, Michael Paquier wrote:
> The thing is that in order to keep the tests for the crash, we finish
> with the unintuitive RTE-related errors, so it is also inconsistent to
> not group things..

As I have seen no feedback from others regarding the changes in error
messages depending on the parsing context, I have been looking at
splitting the fix for the crash and the changes to the error messages, and
attached is the result of the split (minus the commit messages).  The
first patch fixes the crash, and includes test cases to cover the
crash as well as extra cases for list and range strategies with
partition bounds.  Some of the error messages are confusing, but that
fixes the issue.  This is not the most elegant thing without the
second patch, but well that could be worse.

The second patch adds better error context for the different error
messages, and includes tests for default expressions, which we could
discuss in a separate thread.  So I am not proposing to commit that
without more feedback.
--
Michael
From 61e77542fe720edcb7b80689ca0191ea195071ec Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 22 Mar 2019 13:49:41 +0900
Subject: [PATCH 1/2] Fix crash in partition bounds

---
 src/backend/parser/parse_utilcmd.c | 12 +++-
 src/test/regress/expected/create_table.out | 65 +-
 src/test/regress/sql/create_table.sql  | 26 -
 3 files changed, 99 insertions(+), 4 deletions(-)

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index a37d1f18be..ee129ba18f 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3750,8 +3750,16 @@ transformPartitionRangeBounds(ParseState *pstate, List *blist,
 IsA(linitial(cref->fields), String))
 cname = strVal(linitial(cref->fields));
 
-			Assert(cname != NULL);
-			if (strcmp("minvalue", cname) == 0)
+			if (cname == NULL)
+			{
+/*
+ * No field names have been found, meaning that there
+ * is not much to do with special value handling.  Instead
+ * let the expression transformation handle any errors and
+ * limitations.
+ */
+			}
+			else if (strcmp("minvalue", cname) == 0)
 			{
 prd = makeNode(PartitionRangeDatum);
 prd->kind = PARTITION_RANGE_DATUM_MINVALUE;
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index d51e547278..130dccb241 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -489,19 +489,35 @@ Partitions: part_null FOR VALUES IN (NULL),
 part_p2 FOR VALUES IN (2),
 part_p3 FOR VALUES IN (3)
 
--- forbidden expressions for partition bound
+-- forbidden expressions for partition bound with list partitioned table
 CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (somename);
 ERROR:  column "somename" does not exist
 LINE 1: ...expr_fail PARTITION OF list_parted FOR VALUES IN (somename);
  ^
+CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (somename.somename);
+ERROR:  missing FROM-clause entry for table "somename"
+LINE 1: ...expr_fail PARTITION OF list_parted FOR VALUES IN (somename.s...
+ ^
 CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (a);
 ERROR:  cannot use column references in partition bound expression
 LINE 1: ..._bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (a);
 ^
+CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (a.a);
+ERROR:  missing FROM-clause entry for table "a"
+LINE 1: ...ogus_expr_fail PARTITION OF list_parted FOR VALUES IN (a.a);
+  ^
 CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (sum(a));
 ERROR:  aggregate functions are not allowed in partition bound
 LINE 1: ...s_expr_fail PARTITION OF list_parted FOR VALUES IN (sum(a));
^
+CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (sum(somename));
+ERROR:  column "somename" does not exist
+LINE 1: ..._fail PARTITION OF list_parted FOR VALUES IN (sum(somename))...
+ ^
+CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN (sum(1));
+ERROR:  aggregate functions are not allowed in partition bound
+LINE 1: ...s_expr_fail PARTITION OF list_parted FOR VALUES IN (sum(1));
+   ^
 CREATE TABLE part_bogus_expr_fail PARTITION OF list_parted FOR VALUES IN ((select 1));
 ERROR:  cannot use subquery in partition bound
 LINE 1: ...expr_fail PARTITION OF list_parted FOR VALUES IN ((se

Re: [HACKERS] Block level parallel vacuum

2019-03-21 Thread Masahiko Sawada
On Fri, Mar 22, 2019 at 4:53 AM Robert Haas  wrote:
>
> On Tue, Mar 19, 2019 at 7:26 AM Masahiko Sawada  wrote:
> > In parsing the vacuum command, since only the PARALLEL option can have an
> > argument, I've added a check in ExecVacuum to error out when other
> > options have an argument. But it might be good to make the other vacuum
> > options (perhaps except for the DISABLE_PAGE_SKIPPING option) accept an
> > argument just like the EXPLAIN command.
>
> I think all of the existing options, including DISABLE_PAGE_SKIPPING,
> should permit an argument that is passed to defGetBoolean().
>

Agreed. The attached 0001 patch makes that change.
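For clarity, the direction of the option handling is roughly the following
(a simplified sketch and not the 0001 patch itself; stmt_options and the
option names shown are only examples):

    /* Simplified sketch of boolean-argument handling in ExecVacuum() */
    int         options = 0;
    ListCell   *lc;

    foreach(lc, stmt_options)       /* stmt_options: the list of VACUUM options */
    {
        DefElem    *opt = (DefElem *) lfirst(lc);

        if (strcmp(opt->defname, "verbose") == 0 && defGetBoolean(opt))
            options |= VACOPT_VERBOSE;
        else if (strcmp(opt->defname, "freeze") == 0 && defGetBoolean(opt))
            options |= VACOPT_FREEZE;
        else if (strcmp(opt->defname, "disable_page_skipping") == 0 &&
                 defGetBoolean(opt))
            options |= VACOPT_DISABLE_PAGE_SKIPPING;
    }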

On Thu, Mar 21, 2019 at 8:05 PM Sergei Kornilov  wrote:
>
> Hello
>

Thank you for reviewing the patch!

> > * in_parallel is true if we're performing parallel lazy vacuum. Since any
> > * updates are not allowed during parallel mode we don't update statistics
> > * but set the index bulk-deletion result to *stats. Otherwise we update it
> > * and set NULL.
>
> lazy_cleanup_index has an in_parallel argument only for this purpose, but the
> caller still has to check in_parallel after the lazy_cleanup_index call and do
> something else with stats for parallel execution.
> Would it be better to always return stats and update the statistics in the
> caller? It would be possible to update all index stats in
> lazy_vacuum_all_indexes, for example. This routine is always the parallel
> leader and has the comment /* Do post-vacuum cleanup and statistics update for
> each index */ on the for_cleanup=true call.

Agreed. I've changed the patch so that we update index statistics in
lazy_vacuum_all_indexes().

>
> I think we need a note in the documentation that the parallel leader is not
> counted in the PARALLEL N option, so with the PARALLEL 2 option we would use 3
> processes. Or should we even change the behavior? Default with PARALLEL 1 -
> only the current backend runs in a single process; with PARALLEL 2 - leader +
> one parallel worker, two processes working in parallel.
>

Hmm, the documentation says "Perform vacuum index and cleanup index
phases of VACUUM in parallel using N background workers". Doesn't it
already explain that?

Attached is the updated version of the patch set. The 0001 patch allows all
existing vacuum options to take a boolean argument. The 0002 patch introduces
parallel lazy vacuum. The 0003 patch adds a -P (--parallel) option to the
vacuumdb command.



Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


v19-0001-All-VACUUM-command-options-allow-an-argument.patch
Description: Binary data


v19-0003-Add-paralell-P-option-to-vacuumdb-command.patch
Description: Binary data


v19-0002-Add-parallel-option-to-VACUUM-command.patch
Description: Binary data


Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-21, Robert Haas wrote:

> I don't have a strong position on what the "right" thing
> to do here is, but I think if you want to know how many client
> transactions are being executed, you should count them on the client
> side, as pgbench does.

I think counting on the client side is untenable in practical terms.
They may not even know what clients there are, or may not have the
source code to add such a feature even if they know how.  Plus they
would have to aggregate data coming from dozens of different systems?
I don't think this argument has any wings.

OTOH the reason the server offers stats is so that the user can know
what the server activity is, not to display useless internal state.
If a user disables an internal option (such as parallel query) and their
monitoring system suddenly starts showing half as many transactions as
before, they're legitimately going to think that the server is broken.
Such stats are pointless.

The use case for those stats seems fairly clear to me: display numbers
in a monitoring system.  You seem to be saying that we're just not going
to help developers of monitoring systems, and that users have to feed
them on their own.

> I agree that it's a little funny to count the parallel worker commit
> as a separate transaction, since in a certain sense it is part of the
> same transaction.

Right.  So you don't count it.  This seems crystal clear.  Doing the
other thing is outright wrong, there's no doubt about that.

> But if you do that, then as already noted you have to next decide what
> to do about other transactions that parallel workers use internally.

You don't count those ones either.

> And then you have to decide what to do about other background
> transactions.

Not count them if they're implementation details; otherwise count them.
For example, IMO autovacuum transactions should definitely be counted
(as one transaction, even if they run parallel vacuum).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15572: Misleading message reported by "Drop function operation" on DB with functions having same name

2019-03-21 Thread David Rowley
On Fri, 22 Mar 2019 at 05:04, Tom Lane  wrote:
> Pushed with mostly-cosmetic adjustments.

Thanks for pushing this.

> I noticed a couple of loose ends that are somewhat outside the scope
> of the bug report, but maybe are worth considering now:
>
> 1. There's some inconsistency in the wording of the error messages in
> these routines, eg
>
>  errmsg("%s is not a function",
> vs
>  errmsg("%s is not a procedure",
> vs
>  errmsg("function %s is not an aggregate",
>
> Also there's
>  errmsg("function name \"%s\" is not unique",
> where elsewhere in parse_func.c, we find
>  errmsg("function %s is not unique",
>
> You didn't touch this and I didn't either, but maybe we should try to
> make these consistent?

I think aligning those is a good idea.   I had just been wondering to
myself last night about how much binary space is taken up by needless
additional string constants that could be normalised a bit.
Translators might be happy if we did that.

> 2. Consider
>
> regression=# CREATE FUNCTION ambig(int) returns int as $$ select $1; $$ 
> language sql;
> CREATE FUNCTION
> regression=# CREATE PROCEDURE ambig() as $$ begin end; $$ language plpgsql;
> CREATE PROCEDURE
> regression=# DROP PROCEDURE ambig;
> ERROR:  procedure name "ambig" is not unique
> HINT:  Specify the argument list to select the procedure unambiguously.
>
> Arguably, because I said "drop procedure", there's no ambiguity here;
> but we don't account for objtype while doing the lookup.

Yeah. I went with reporting the objtype that was specified in a
command.  I stayed well clear of allowing overlapping names between
procedures and functions.  It would be hard to put that back if we
ever discovered a reason we shouldn't have done it.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Special role for subscriptions

2019-03-21 Thread Michael Paquier
On Fri, Mar 22, 2019 at 10:15:59AM +0800, Andrey Borodin wrote:
> It seems to me that we have consensus that:
> 1. We need special role to create subscription
> 2. This role can create subscription with some security checks
> 3. We have complete list of possible security checks

These are basically the truncate, insert, delete and update
rights for the role creating the subscription.  Why would we actually
need that?

> 4. We have code that implements most of these checks (I believe
> pg_subscription_role_v2.patch is enough, but we can tighten checks a
> little more)

If a unique system role is the conclusion on the matter, it looks so.

> If not, it is RFC, it should not be returned.

The patch still needs some work before being RFC.  From what I can
read, pg_dump still ignores roles which are members of the system role
pg_subscription_users and these should be able to dump subscriptions,
so you have at least one problem.
--
Michael


signature.asc
Description: PGP signature


Re: current_logfiles not following group access and instead follows log_file_mode permissions

2019-03-21 Thread Michael Paquier
On Fri, Mar 22, 2019 at 02:35:41PM +1100, Haribabu Kommi wrote:
> Thanks for the correction.  Yes, that is correct and it works fine.

Thanks for double-checking.  Are there any objections with this patch?
--
Michael


signature.asc
Description: PGP signature


Re: pg_basebackup ignores the existing data directory permissions

2019-03-21 Thread Michael Paquier
On Fri, Mar 22, 2019 at 02:45:24PM +1100, Haribabu Kommi wrote:
> How about letting pg_basebackup decide the group permissions of the
> standby directory irrespective of the primary directory permissions?
> 
> Default - permissions are the same as on the primary
> --allow-group-access - standby directory has group access permissions
> --no-group-access - standby directory doesn't have group permissions
> 
> The last two options behave irrespective of the primary directory
> permissions.

Yes, I'd imagine that we would want to be able to define three
different behaviors, by either having a set of options, or a single
option with a switch, say --group-access:
- "inherit" causes the permissions to be inherited from the source
node, and that's the default.
- "none" enforces the default 0700/0600.
- "group" enforces group read access.
--
Michael


signature.asc
Description: PGP signature


Re: jsonpath

2019-03-21 Thread Oleg Bartunov
On Fri, 22 Mar 2019, 03:14 Nikita Glukhov,  wrote:

> Hi.
>
> Attached patch enables throwing of errors in jsonb_path_match() in its
> non-silent mode when the jsonpath expression failed to return a singleton
> boolean.  Previously, NULL was always returned, and it seemed to be
> inconsistent with the behavior of other functions, in which structural and
> other errors were not suppressed in non-silent mode.
>
>
> We also think that jsonb_path_match() needs to be renamed, because its
> current name is confusing to many users.  "Match" name is more suitable for
> jsonpath-based pattern matching like that (maybe we'll implement it later):
>
>   jsonb_path_match(
> '{ "a": 1, "b": 2,"c": "str" }',
> '{ "a": 1, "b": @ > 1, * : @.type == "string" }'
>   )
>
> Would be very useful for constraints.

>

> Below are some possible name variants:
>
>   jsonb_path_predicate() (original name)
>
> Too long

  jsonb_path_pred()
>   jsonb_path_test()
>   jsonb_path_check()
>
>
Check is good to me

>
>   jsonb_path_bool()
>
> --
> Nikita Glukhov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: pg_basebackup ignores the existing data directory permissions

2019-03-21 Thread Haribabu Kommi
On Fri, Mar 22, 2019 at 11:42 AM Michael Paquier 
wrote:

> On Thu, Mar 21, 2019 at 02:56:24PM -0400, Robert Haas wrote:
> > On Tue, Mar 19, 2019 at 2:29 AM Michael Paquier 
> wrote:
> >> Hm.  We have been assuming that the contents of a base backup inherit
> >> the permission of the source when using pg_basebackup because this
> >> allows users to keep nodes in a consistent state without deciding
> >> which option to use.  Do you mean that you would like to enforce the
> >> permissions of only the root directory if it exists?  Or the root
> >> directory with all its contents?  The former may be fine.  The latter
> >> is definitely not.
> >
> > Why not?
>
> > Because we have released v11 so that we respect the permissions set on
> > the source instance from which the backup is taken for all the folder's
> content.  If we begin to enforce it we may break some cases.  If a new
> option is introduced, it seems to me that the default should remain
> what has been released with v11, but that it is additionally possible
> to enforce group permissions or non-group permissions at will on the
> backup taken for all the contents in the data folder, including the
> root folder, created manually or not before running the pg_basebackup
> command.
>

How about letting pg_basebackup decide the group permissions of the
standby directory irrespective of the primary directory permissions?

Default - permissions are the same as on the primary
--allow-group-access - standby directory has group access permissions
--no-group-access - standby directory doesn't have group permissions

The last two options behave irrespective of the primary directory
permissions.

opinions?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: current_logfiles not following group access and instead follows log_file_mode permissions

2019-03-21 Thread Haribabu Kommi
On Fri, Mar 22, 2019 at 12:24 PM Michael Paquier 
wrote:

> On Thu, Mar 21, 2019 at 12:52:14PM +1100, Haribabu Kommi wrote:
> > Earlier attached patch is wrong.
>
> -   oumask = umask(pg_file_create_mode);
> +   oumask = umask(pg_mode_mask);
> Indeed that was wrong.
>
> > Correct patch attached. Sorry for the inconvenience.
>
> This looks better for the umask setting, still it could be more
> simple.
>
>  #include 
> -
> +#include "common/file_perm.h"
>  #include "lib/stringinfo.h"
> Nit: it is better for readability to keep an empty line between the
> system includes and the Postgres ones.
>
> A second thing, more important, is that you can reset umask just after
> opening the file, as attached.  This way there is no need to reset the
> umask in all the code paths leaving update_metainfo_datafile().  Does
> that look fine to you?
>

Thanks for the correction.  Yes, that is correct and it works fine.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: jsonpath

2019-03-21 Thread John Naylor
On Thu, Mar 21, 2019 at 9:59 PM Alexander Korotkov
 wrote:
> Attaches patches improving jsonpath parser.  First one introduces
> cosmetic changes, while second gets rid of backtracking.  I'm also
> planning to add high-level comment for both grammar and lexer.

The cosmetic changes look good to me. I just noticed a couple things
about the comments.

0001:

+/* Check if current scanstring value constitutes a keyword */

'is a keyword' is better. 'Constitutes' implies parts of a whole.

+ * Resize scanstring for appending of given length.  Reinitilize if required.

s/Reinitilize/Reinitialize/

The first sentence is not entirely clear to me.


0002:

These two rules are not strictly necessary:

{unicode}+\\ {
/* throw back the \\, and treat as unicode */
yyless(yyleng - 1);
parseUnicode(yytext, yyleng);
}

{hex_char}+\\ {
/* throw back the \\, and treat as hex */
yyless(yyleng - 1);
parseHexChars(yytext, yyleng);
}

...and only seem to be there because of how these are written:

{unicode}+ { parseUnicode(yytext, yyleng); }
{hex_char}+ { parseHexChars(yytext, yyleng); }
{unicode}*{unicodefail} { yyerror(NULL, "Unicode
sequence is invalid"); }
{hex_char}*{hex_fail} { yyerror(NULL, "Hex character
sequence is invalid"); }

I don't understand the reasoning here -- is it a micro-optimization?
The following is simpler, allows the rules I mentioned to be removed,
and make check still passes. I would prefer it unless there is a
performance penalty, in which case a comment to describe the
additional complexity would be helpful.

{unicode} { parseUnicode(yytext, yyleng); }
{hex_char} { parseHexChars(yytext, yyleng); }
{unicodefail} { yyerror(NULL, "Unicode sequence is invalid"); }
{hex_fail} { yyerror(NULL, "Hex character sequence is
invalid"); }

-- 
John Naylor                    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: speeding up planning with partitions

2019-03-21 Thread Amit Langote
Jesper, Imai-san,

Thanks for testing and reporting your findings.

On 2019/03/21 23:10, Imai Yoshikazu wrote:
> On 2019/03/20 23:25, Jesper Pedersen wrote:
>  > Hi,
>  > My tests - using hash partitions - shows that the extra time is spent in
>  > make_partition_pruneinfo() for the relid_subplan_map variable.
>  >
>  >   64 partitions: make_partition_pruneinfo() 2.52%
>  > 8192 partitions: make_partition_pruneinfo() 5.43%
>  >
>  > TPS goes down ~8% between the two. The test setup is similar to the above.
>  >
>  > Given that Tom is planning to change the List implementation [1] in 13 I
>  > think using the palloc0 structure is ok for 12 before trying other
>  > implementation options.

Hmm, relid_subplan_map's size should be constant (number of partitions
scanned) even as the actual partition count grows.

However, looking into make_partitionedrel_pruneinfo(), it seems that it's
unconditionally allocating 3 arrays that all have nparts elements:

subplan_map = (int *) palloc0(nparts * sizeof(int));
subpart_map = (int *) palloc0(nparts * sizeof(int));
relid_map = (Oid *) palloc0(nparts * sizeof(int));

So, that part has got to cost more as the partition count grows.

This is the code for runtime pruning, which is not exercised in our tests,
so it might seem odd that we're spending any time here at all.  I've been
told that we have to perform at least *some* work here if only to conclude
that runtime pruning is not needed, and it seems that the above allocations
occur before making that conclusion.  Maybe that's salvageable by
rearranging this code a bit.  David may be in a better position to help
with that.
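Just to illustrate the kind of rearrangement I mean (an untested sketch;
needs_runtime_pruning stands in for whatever condition the real code uses
to decide that runtime pruning is required):

    /* Sketch: bail out before the nparts-sized allocations when possible */
    if (!needs_runtime_pruning)
        return NIL;

    subplan_map = (int *) palloc0(nparts * sizeof(int));
    subpart_map = (int *) palloc0(nparts * sizeof(int));
    relid_map = (Oid *) palloc0(nparts * sizeof(Oid));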

> Thanks for testing.
> Yeah, after looking at the code, I think the bottleneck seems to be lurking in
> another place that this patch doesn't change. I also think the patch is ok as 
> it is for 12, and this problem will be fixed in 13.

If this drop in performance can be attributed to the fact that having too
many partitions starts stressing other parts of the backend like stats
collector, lock manager, etc. as time passes, then I agree that we'll have
to tackle them later.

Thanks,
Amit




Re: Special role for subscriptions

2019-03-21 Thread Andrey Borodin



> On 22 Mar 2019, at 9:28, Michael Paquier wrote:
> 
> On Thu, Mar 21, 2019 at 10:06:03AM -0300, Euler Taveira wrote:
>> It will be really strange but I can live with that. Another idea is
>> CREATE bit to create subscriptions (without replicate) and SUBSCRIBE
>> bit to replicate tables. It is not just a privilege to create a
>> subscription but also to modify tables that a role doesn't have
>> explicit permission. Let's allocate another AclItem?
> 
> By the way, as the commit fest is coming to its end in a couple of
> days, and that we are still discussing how the thing should be shaped,
> I would recommend to mark the patch as returned with feedback.  Any
> objections with that?

It seems to me that we have consensus that:
1. We need special role to create subscription
2. This role can create subscription with some security checks
3. We have complete list of possible security checks
4. We have code that implements most of these checks (I believe 
pg_subscription_role_v2.patch is enough, but we can tighten checks a little 
more)

Do we have any objection on these points?

If not, it is RFC, it should not be returned.

Best regards, Andrey Borodin.


Re: PostgreSQL pollutes the file system

2019-03-21 Thread Mark Kirkwood
On 22/03/19 3:05 PM, Tom Lane wrote:
> Michael Paquier  writes:
>> I would be curious to hear the reason why such tool names have been
>> chosen from the start.  The tools have been switched to C in 9e0ab71
>> from 2003, have been introduced by Peter Eisentraut as of 240e4c9 from
>> 1999, and I cannot spot the thread from the time where this was
>> discussed.
> createuser, at least, dates back to Berkeley days: my copy of the
> PG v4r2 tarball contains a "src/bin/createuser/createuser.sh" file
> dated 1994-03-19.  (The 1999 commit you mention just moved the
> functionality around; it was there before.)  So I imagine the answer
> is that nobody at the time thought of fitting these scripts into a
> larger ecosystem.


FWIW the whole set is there in version 6.4.2:

markir@vedavec:/download/postgres/src/postgresql-6.4.2/src/bin$ ls -l
total 72
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 cleardbdir
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 createdb
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 createuser
drwxr-sr-x 2 markir adm 4096 Dec 31  1998 CVS
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 destroydb
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 destroyuser
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 initdb
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 initlocation
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 ipcclean
-rw-r--r-- 1 markir adm  795 Dec 19  1998 Makefile
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pgaccess
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_dump
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_encoding
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_id
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_passwd
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pgtclsh
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 pg_version
drwxr-sr-x 3 markir adm 4096 Dec 31  1998 psql

--

Mark





Re: PostgreSQL pollutes the file system

2019-03-21 Thread Tom Lane
Michael Paquier  writes:
> I would be curious to hear the reason why such tool names have been
> chosen from the start.  The tools have been switched to C in 9e0ab71
> from 2003, have been introduced by Peter Eisentraut as of 240e4c9 from
> 1999, and I cannot spot the thread from the time where this was
> discussed.

createuser, at least, dates back to Berkeley days: my copy of the
PG v4r2 tarball contains a "src/bin/createuser/createuser.sh" file
dated 1994-03-19.  (The 1999 commit you mention just moved the
functionality around; it was there before.)  So I imagine the answer
is that nobody at the time thought of fitting these scripts into a
larger ecosystem.

regards, tom lane



Re: Special role for subscriptions

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 10:06:03AM -0300, Euler Taveira wrote:
> It will be really strange but I can live with that. Another idea is
> CREATE bit to create subscriptions (without replicate) and SUBSCRIBE
> bit to replicate tables. It is not just a privilege to create a
> subscription but also to modify tables that a role doesn't have
> explicit permission. Let's allocate another AclItem?

By the way, as the commit fest is coming to its end in a couple of
days, and that we are still discussing how the thing should be shaped,
I would recommend to mark the patch as returned with feedback.  Any
objections with that?
--
Michael


signature.asc
Description: PGP signature


Re: current_logfiles not following group access and instead follows log_file_mode permissions

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 12:52:14PM +1100, Haribabu Kommi wrote:
> Earlier attached patch is wrong.

-   oumask = umask(pg_file_create_mode);
+   oumask = umask(pg_mode_mask);
Indeed that was wrong.

> Correct patch attached. Sorry for the inconvenience.

This looks better for the umask setting, still it could be more
simple.

 #include 
-
+#include "common/file_perm.h"
 #include "lib/stringinfo.h"
Nit: it is better for readability to keep an empty line between the
system includes and the Postgres ones.

A second thing, more important, is that you can reset umask just after
opening the file, as attached.  This way there is no need to reset the
umask in all the code paths leaving update_metainfo_datafile().  Does
that look fine to you?
--
Michael
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index d1ea46deb8..d1f71c0f7e 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 
+#include "common/file_perm.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -1453,6 +1454,7 @@ static void
 update_metainfo_datafile(void)
 {
 	FILE	   *fh;
+	mode_t		oumask;
 
 	if (!(Log_destination & LOG_DESTINATION_STDERR) &&
 		!(Log_destination & LOG_DESTINATION_CSVLOG))
@@ -1465,7 +1467,12 @@ update_metainfo_datafile(void)
 		return;
 	}
 
-	if ((fh = logfile_open(LOG_METAINFO_DATAFILE_TMP, "w", true)) == NULL)
+	/* use the same permissions as the data directory for the new file */
+	oumask = umask(pg_mode_mask);
+	fh = fopen(LOG_METAINFO_DATAFILE_TMP, "w");
+	umask(oumask);
+
+	if (fh == NULL)
 	{
 		ereport(LOG,
 (errcode_for_file_access(),


signature.asc
Description: PGP signature


Re: Proposal to suppress errors thrown by to_reg*()

2019-03-21 Thread Takuma Hoshiai
Hi Nagata-san,

Sorry for the late reply.
Thank you for your comments; I have attached a patch that reflects them.

On Tue, 19 Mar 2019 15:13:04 +0900
Yugo Nagata  wrote:

> I reviewed the patch. Here are some comments:
> 
>  /*
> + * RangeVarCheckNamespaceAccessNoError
> + * Returns true if given relation's namespace can be accessable by the
> + * current user. If no namespace is given in the relation, just returns
> + * true.
> + */
> +bool
> +RangeVarCheckNamespaceAccessNoError(RangeVar *relation)
> 
> Although it might be trivial, the new function's name 'RangeVar...' seems a 
> bit
> weird to me because this is used for not only to_regclass but also to_regproc,
> to_regtype, and so on, that is, the argument "relation" is not always a 
> relation. 
> 
> This function is used always with makeRangeVarFromNameList() as
> 
>   if (!RangeVarCheckNamespaceAccessNoError(makeRangeVarFromNameList(names)))
> 
> , so how about merging the two function as below, for example:
> 
>  /*
>   * CheckNamespaceAccessNoError
>   * Returns true if the namespace in given qualified-name can be accessable
>   * by the current user. If no namespace is given in the names, just returns
>   * true.
>   */
>  bool
>  CheckNamespaceAccessNoError(List *names);
> 
> 
> BTW, this patch also suppresses the "Cross-database references is not allowed"
> error in addition to the namespace ACL error.  Is this an intentional change?
> If this error can be allowed, you can use DeconstructQualifiedName() instead of
> makeRangeVarFromNameList().

I think it is enough to suppress the namespace ACL error only, so I will use
that function.

> In the regression test, you are using \gset and \connect psql meta-commands
> to test the user privilege to a namespace, but I think we can make this
> simpler by using SET SESSION AUTHORIZATION and RESET AUTHORIZATION.

I forgot about this SQL; I have fixed the patch to use it.

Best regards,

-- 
Takuma Hoshiai 


fix_to_reg_v2.patch
Description: Binary data


Re: pg_basebackup ignores the existing data directory permissions

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 02:56:24PM -0400, Robert Haas wrote:
> On Tue, Mar 19, 2019 at 2:29 AM Michael Paquier  wrote:
>> Hm.  We have been assuming that the contents of a base backup inherit
>> the permission of the source when using pg_basebackup because this
>> allows users to keep nodes in a consistent state without deciding
>> which option to use.  Do you mean that you would like to enforce the
>> permissions of only the root directory if it exists?  Or the root
>> directory with all its contents?  The former may be fine.  The latter
>> is definitely not.
> 
> Why not?

Because we have released v11 so that we respect the permissions set on
the source instance from which the backup is taken for all the folder's
content.  If we begin to enforce it we may break some cases.  If a new
option is introduced, it seems to me that the default should remain
what has been released with v11, but that it is additionally possible
to enforce group permissions or non-group permissions at will on the
backup taken for all the contents in the data folder, including the
root folder, created manually or not before running the pg_basebackup
command.
--
Michael


signature.asc
Description: PGP signature


Re: PostgreSQL pollutes the file system

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 10:02:40AM -0400, Tom Lane wrote:
> So there seems like a real risk that taking away createuser would
> result in security holes, not just annoying-but-trivial script update
> work.  That puts me more in the camp of "if we're going to do anything,
> rename it with a pg_ prefix" than "if we're going to do anything,
> remove it".

Removing it would be a bad idea as it is very easy to mess up with
things in such cases.  As you mentioned, renaming the tools now would
create more pain than actually solving things, so that's a bad idea
anyway.

I would be curious to hear the reason why such tool names have been
chosen from the start.  The tools have been switched to C in 9e0ab71
from 2003, have been introduced by Peter Eisentraut as of 240e4c9 from
1999, and I cannot spot the thread from the time where this was
discussed.
--
Michael


signature.asc
Description: PGP signature


Re: Offline enabling/disabling of data checksums

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 08:17:32AM +0100, Fabien COELHO wrote:
> I can try, but I must admit that I'm fuzzy about the actual issue. Is there
> a problem on a streaming replication with inconsistent checksum settings, or
> not?
> 
> You seem to suggest that the issue is more about how some commands or backup
> tools operate on a cluster.

Yes.  That's what I am writing about.  Imagine for example this case
with pg_rewind:
- primary has checksums enabled.
- standby has checksums disabled.
- a hard crash of the primary happens, there is a failover to the
standby which gets promoted.
- The primary's host is restarted, it is started and stopped once
cleanly to have a clear point in its past timeline where WAL forked
thanks to the generation of at least the shutdown checkpoint generated
by the clean stop.
- pg_rewind is run, copying some pages from the promoted standby,
which don't have checksums, to the primary with checksums enabled, and
causing some pages to have an incorrect checksum.

There is another tool I know of which is called pg_rman, which is a
backup tool able to take incremental backups in the shape of a delta
of relation blocks.  Then imagine the following:
- One instance of Postgres runs, has checksums disabled.
- pg_rman takes a full backup of it.
- Checksums are enabled on this instance.
- An incremental backup from the previous full backup point is taken.
If I recall correctly pg_rman takes a copy of the new control file as
well, which tracks checksums as being enabled.
- A crash happens, the data folder is dead.
- Rollback to the previous backup is done, and we restore up to a
point after the incremental backup.
- And you finish with a cluster which has checksums enabled, but as
the initial full backup had checksums disabled, not all the pages may
be in a correct state.

So I think that it makes sense to tell people to be careful within the
documentation, but being too careful in the tool also discards
many possibilities (see the example of the clean failover where it is
possible to enable checksums with no actual downtime).  And this part
has a lot of value.

> ISTM that this would not work: The control file update can only be done
> *after* the fsync to describe the cluster's actual status, otherwise it is
> just a question of luck whether the cluster is corrupt on a crash while
> fsyncing. The enforced order of operation, with a barrier in between, is the
> important thing here.

Done the switch for this case.  For pg_rewind, actually, I think that
this is an area where its logic could be improved a bit: first
the data folder is synced, and then the control file is updated.  It
took less time to change the code than to write this paragraph,
including compiling the code and confirming one run of the TAP
tests.
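To be explicit about the ordering enforced (a sketch only; the two helper
names below are hypothetical stand-ins for the real routines):

    /* 1. Flush all the blocks rewritten by --enable to disk first. */
    sync_data_directory(DataDir);

    /*
     * 2. Only then switch data_checksum_version in the control file and
     * flush it, so a crash in between still leaves the cluster marked as
     * having no checksums.
     */
    ControlFile->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    write_control_file(ControlFile);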

I have added in the docs a warning about a host crash while doing the 
operation, with a recommendation to check the state of the checksums
on the data folder should it happen, and the previous portion of the
docs about clusters.  Your suggestion sounds suitable.  I would be
tempted to add a bigger warning in pg_rewind or pg_basebackup about
that, but that's a different story for another time.

Does that look fine to you?
--
Michael
From 7ef4a14421bd148fa24f3107e8ef8be8b348 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 22 Mar 2019 09:21:18 +0900
Subject: [PATCH v9 1/2] Add options to enable and disable checksums in
 pg_checksums

An offline cluster can now work with more modes in pg_checksums:
- --enable can enable checksums in a cluster, updating all blocks with a
correct checksum, and updating the control file at the end.
- --disable can disable checksums in a cluster, updating the control
file.
- --check is an extra option able to verify checksums for a cluster.

When running --enable or --disable, the data folder gets fsync'd for
durability.  If no mode is specified in the options, then --check is
used for compatibility with older versions of pg_verify_checksums (now
renamed to pg_checksums in v12).

Author: Michael Banck
Reviewed-by: Fabien Coelho, Michael Paquier
Discussion: https://postgr.es/m/20181221201616.gd4...@nighthawk.caipicrew.dd-dns.de
---
 doc/src/sgml/ref/pg_checksums.sgml|  72 ++-
 src/bin/pg_checksums/pg_checksums.c   | 175 ++
 src/bin/pg_checksums/t/002_actions.pl |  76 ---
 src/tools/pgindent/typedefs.list  |   1 +
 4 files changed, 278 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..fda85e7ea0 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -16,7 +16,7 @@ PostgreSQL documentation
 
  
   pg_checksums
-  verify data checksums in a PostgreSQL database cluster
+  enable, disable or check data checksums in a PostgreSQL database cluster
  
 
  
@@ -36,10 +36,19 @@ PostgreSQL documentation
  
   Description
   
-   pg_checksums verifies data checksums in a
-   PostgreSQL c

Re: ToDo: show size of partitioned table

2019-03-21 Thread Amit Langote
On 2019/03/22 2:23, David Steele wrote:
> On 3/14/19 6:19 AM, Amit Langote wrote:
>> On 2019/03/14 2:11, Pavel Stehule wrote:
>>>
 I've attached v11 of the patch, which merges most of Justin's changes and
 some of my own on top -- documentation and partition size column names.
>>
>> Maybe, we should set this ready for committer then?
> 
> There don't appear to be any objections.  Perhaps you should do that?

OK, done.

Thanks,
Amit





Re: jsonpath

2019-03-21 Thread Nikita Glukhov

Hi.

Attached patch enables throwing of errors in jsonb_path_match() in its
non-silent mode when the jsonpath expression failed to return a singleton
boolean.  Previously, NULL was always returned, and it seemed to be
inconsistent with the behavior of other functions, in which structural and
other errors were not suppressed in non-silent mode.


We also think that jsonb_path_match() needs to be renamed, because its
current name is confusing to many users.  "Match" name is more suitable for
jsonpath-based pattern matching like that (maybe we'll implement it later):

  jsonb_path_match(
'{ "a": 1, "b": 2,"c": "str" }',
'{ "a": 1, "b": @ > 1, * : @.type == "string" }'
  )

Below are some possible name variants:

  jsonb_path_predicate() (original name)
  jsonb_path_pred()
  jsonb_path_test()
  jsonb_path_check()
  jsonb_path_bool()

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From f298aaa1258c6a4b4a487fb44980654b3a8e2e36 Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Fri, 22 Mar 2019 02:34:24 +0300
Subject: [PATCH] Throw error in jsonb_path_match() when silent is false

---
 src/backend/utils/adt/jsonpath_exec.c| 26 +-
 src/test/regress/expected/jsonb_jsonpath.out | 51 
 src/test/regress/sql/jsonb_jsonpath.sql  | 12 +++
 3 files changed, 80 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index c072257..074cea2 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -320,7 +320,6 @@ jsonb_path_match(PG_FUNCTION_ARGS)
 {
 	Jsonb	   *jb = PG_GETARG_JSONB_P(0);
 	JsonPath   *jp = PG_GETARG_JSONPATH_P(1);
-	JsonbValue *jbv;
 	JsonValueList found = {0};
 	Jsonb	   *vars = NULL;
 	bool		silent = true;
@@ -333,18 +332,27 @@ jsonb_path_match(PG_FUNCTION_ARGS)
 
 	(void) executeJsonPath(jp, vars, jb, !silent, &found);
 
-	if (JsonValueListLength(&found) < 1)
-		PG_RETURN_NULL();
-
-	jbv = JsonValueListHead(&found);
-
 	PG_FREE_IF_COPY(jb, 0);
 	PG_FREE_IF_COPY(jp, 1);
 
-	if (jbv->type != jbvBool)
-		PG_RETURN_NULL();
+	if (JsonValueListLength(&found) == 1)
+	{
+		JsonbValue *jbv = JsonValueListHead(&found);
+
+		if (jbv->type == jbvBool)
+			PG_RETURN_BOOL(jbv->val.boolean);
+
+		if (jbv->type == jbvNull)
+			PG_RETURN_NULL();
+	}
+
+	if (!silent)
+		ereport(ERROR,
+(errcode(ERRCODE_SINGLETON_JSON_ITEM_REQUIRED),
+ errmsg(ERRMSG_SINGLETON_JSON_ITEM_REQUIRED),
+ errdetail("expression should return a singleton boolean")));
 
-	PG_RETURN_BOOL(jbv->val.boolean);
+	PG_RETURN_NULL();
 }
 
 /*
diff --git a/src/test/regress/expected/jsonb_jsonpath.out b/src/test/regress/expected/jsonb_jsonpath.out
index e604bae..66f0ffd 100644
--- a/src/test/regress/expected/jsonb_jsonpath.out
+++ b/src/test/regress/expected/jsonb_jsonpath.out
@@ -1769,6 +1769,57 @@ SELECT jsonb_path_exists('[{"a": 1}, {"a": 2}, {"a": 3}, {"a": 5}]', '$[*] ? (@.
  f
 (1 row)
 
+SELECT jsonb_path_match('true', '$', silent => false);
+ jsonb_path_match 
+--
+ t
+(1 row)
+
+SELECT jsonb_path_match('false', '$', silent => false);
+ jsonb_path_match 
+--
+ f
+(1 row)
+
+SELECT jsonb_path_match('null', '$', silent => false);
+ jsonb_path_match 
+--
+ 
+(1 row)
+
+SELECT jsonb_path_match('1', '$', silent => true);
+ jsonb_path_match 
+--
+ 
+(1 row)
+
+SELECT jsonb_path_match('1', '$', silent => false);
+ERROR:  singleton SQL/JSON item required
+DETAIL:  expression should return a singleton boolean
+SELECT jsonb_path_match('"a"', '$', silent => false);
+ERROR:  singleton SQL/JSON item required
+DETAIL:  expression should return a singleton boolean
+SELECT jsonb_path_match('{}', '$', silent => false);
+ERROR:  singleton SQL/JSON item required
+DETAIL:  expression should return a singleton boolean
+SELECT jsonb_path_match('[true]', '$', silent => false);
+ERROR:  singleton SQL/JSON item required
+DETAIL:  expression should return a singleton boolean
+SELECT jsonb_path_match('{}', 'lax $.a', silent => false);
+ERROR:  singleton SQL/JSON item required
+DETAIL:  expression should return a singleton boolean
+SELECT jsonb_path_match('{}', 'strict $.a', silent => false);
+ERROR:  SQL/JSON member not found
+DETAIL:  JSON object does not contain key "a"
+SELECT jsonb_path_match('{}', 'strict $.a', silent => true);
+ jsonb_path_match 
+--
+ 
+(1 row)
+
+SELECT jsonb_path_match('[true, true]', '$[*]', silent => false);
+ERROR:  singleton SQL/JSON item required
+DETAIL:  expression should return a singleton boolean
 SELECT jsonb '[{"a": 1}, {"a": 2}]' @@ '$[*].a > 1';
  ?column? 
 --
diff --git a/src/test/regress/sql/jsonb_jsonpath.sql b/src/test/regress/sql/jsonb_jsonpath.sql
index 41b346b..f8ef39c 100644
--- a/src/test/regress/sql/jsonb_jsonpath.sql
+++ b/src/test/regress/sql/jsonb_jsonpath.sql
@@ -366,6 +366,18 @@ SELECT jsonb_path_exis

warning to publication created and wal_level is not set to logical

2019-03-21 Thread Lucas Viecelli
Hi everyone,

A very common question among new users is how wal_level works and what its
levels mean. I have heard about situations like this: a user creates a new
publication in the master database and simply does not change wal_level to
logical; sometimes the person has missed the maintenance window, or the chance
to restart the postgres service (usually on a production database), and only
discovers that wal_level is not set correctly when creating the subscription.
To help new (and even experienced) users of logical replication, I am sending
a patch that prints a WARNING in the console/log when a PUBLICATION is created
and wal_level is different from logical:

-> WARNING: PUBLICATION created but wal_level is not set to logical, you need
to change it before creating any SUBSCRIPTION

Initiatives like this can make a good user experience with PostgreSQL and
its own logical replication.
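
For illustration, the same check could also be phrased with ereport() and an
errhint, which is the usual style for user-facing messages (a sketch only;
the attached patch uses elog(WARNING, ...)):

	/* sketch: end of CreatePublication(), after the post-create hook */
	if (wal_level != WAL_LEVEL_LOGICAL)
		ereport(WARNING,
				(errmsg("publication created, but wal_level is not set to \"logical\""),
				 errhint("Set wal_level to logical and restart the server before creating any subscription.")));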

Thanks

--

*Lucas Viecelli*


diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 4d48be0b92..991065d0a2 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -232,6 +232,12 @@ CreatePublication(CreatePublicationStmt *stmt)
 
 	InvokeObjectPostCreateHook(PublicationRelationId, puboid, 0);
 
+	if (wal_level != WAL_LEVEL_LOGICAL)
+	{
+		elog(WARNING, "PUBLICATION created but wal_level is not set to logical, you need to "
+  "change it before creating any SUBSCRIPTION");
+	}
+
 	return myself;
 }
 
diff --git a/src/test/regress/expected/object_address.out b/src/test/regress/expected/object_address.out
index 02070fd8af..dbc3ffa63d 100644
--- a/src/test/regress/expected/object_address.out
+++ b/src/test/regress/expected/object_address.out
@@ -43,6 +43,7 @@ CREATE TRANSFORM FOR int LANGUAGE SQL (
 	FROM SQL WITH FUNCTION prsd_lextype(internal),
 	TO SQL WITH FUNCTION int4recv(internal));
 CREATE PUBLICATION addr_pub FOR TABLE addr_nsp.gentable;
+WARNING:  PUBLICATION created but wal_level is not set to logical, you need to change it before creating any SUBSCRIPTION
 CREATE SUBSCRIPTION addr_sub CONNECTION '' PUBLICATION bar WITH (connect = false, slot_name = NONE);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 CREATE STATISTICS addr_nsp.gentable_stat ON a, b FROM addr_nsp.gentable;
diff --git a/src/test/regress/expected/publication.out b/src/test/regress/expected/publication.out
index afbbdd543d..2b05babf7a 100644
--- a/src/test/regress/expected/publication.out
+++ b/src/test/regress/expected/publication.out
@@ -5,7 +5,14 @@ CREATE ROLE regress_publication_user LOGIN SUPERUSER;
 CREATE ROLE regress_publication_user2;
 CREATE ROLE regress_publication_user_dummy LOGIN NOSUPERUSER;
 SET SESSION AUTHORIZATION 'regress_publication_user';
+SHOW wal_level;
+ wal_level 
+---
+ replica
+(1 row)
+
 CREATE PUBLICATION testpub_default;
+WARNING:  PUBLICATION created but wal_level is not set to logical, you need to change it before creating any SUBSCRIPTION
 COMMENT ON PUBLICATION testpub_default IS 'test publication';
 SELECT obj_description(p.oid, 'pg_publication') FROM pg_publication p;
  obj_description  
@@ -14,6 +21,7 @@ SELECT obj_description(p.oid, 'pg_publication') FROM pg_publication p;
 (1 row)
 
 CREATE PUBLICATION testpib_ins_trunct WITH (publish = insert);
+WARNING:  PUBLICATION created but wal_level is not set to logical, you need to change it before creating any SUBSCRIPTION
 ALTER PUBLICATION testpub_default SET (publish = update);
 -- error cases
 CREATE PUBLICATION testpub_xxx WITH (foo);
@@ -44,6 +52,7 @@ CREATE TABLE pub_test.testpub_nopk (foo int, bar int);
 CREATE VIEW testpub_view AS SELECT 1;
 CREATE TABLE testpub_parted (a int) PARTITION BY LIST (a);
 CREATE PUBLICATION testpub_foralltables FOR ALL TABLES WITH (publish = 'insert');
+WARNING:  PUBLICATION created but wal_level is not set to logical, you need to change it before creating any SUBSCRIPTION
 ALTER PUBLICATION testpub_foralltables SET (publish = 'insert, update');
 CREATE TABLE testpub_tbl2 (id serial primary key, data text);
 -- fail - can't add to for all tables publication
@@ -87,7 +96,9 @@ DROP PUBLICATION testpub_foralltables;
 CREATE TABLE testpub_tbl3 (a int);
 CREATE TABLE testpub_tbl3a (b text) INHERITS (testpub_tbl3);
 CREATE PUBLICATION testpub3 FOR TABLE testpub_tbl3;
+WARNING:  PUBLICATION created but wal_level is not set to logical, you need to change it before creating any SUBSCRIPTION
 CREATE PUBLICATION testpub4 FOR TABLE ONLY testpub_tbl3;
+WARNING:  PUBLICATION created but wal_level is not set to logical, you need to change it before creating any SUBSCRIPTION
 \dRp+ testpub3
   Publication testpub3
   Owner   | All tables | Inserts | Updates | Deletes | Truncates 
@@ -112,6 +123,7 @@ CREATE PUBLICATION testpub_fortbl FOR TABLE testpub_view;
 ERROR:  "

Re: partitioned tables referenced by FKs

2019-03-21 Thread Alvaro Herrera
Hi Jesper

On 2019-Mar-21, Jesper Pedersen wrote:

> running
> 
> pgbench -M prepared -f select.sql 
> 
> I'm seeing 82.64% spent in GetCachedPlan(). plan_cache_mode is auto.

Hmm, I can't reproduce this at all ...  I don't even see GetCachedPlan
in the profile.  Do you maybe have some additional patch in your tree?

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: partitioned tables referenced by FKs

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-18, Alvaro Herrera wrote:

> A pretty silly bug remains here.  Watch:
> 
> create table pk (a int primary key) partition by list (a);
> create table pk1 partition of pk for values in (1);
> create table fk (a int references pk);
> insert into pk values (1);
> insert into fk values (1);
> drop table pk cascade;
> 
> Note that the DROP of the main PK table is pretty aggressive: since it
> cascades, you want it to drop the constraint on the FK side.  This would
> work without a problem if 'pk' was a non-partitioned table, but in this
> case it fails:
> 
> alvherre=# drop table pk cascade;
> NOTICE:  drop cascades to constraint fk_a_fkey on table fk
> ERROR:  removing partition "pk1" violates foreign key constraint "fk_a_fkey1"
> DETALLE:  Key (a)=(1) still referenced from table "fk".

Here's v7; known problems have been addressed except the one above.
Jesper's GetCachedPlan() slowness has not been addressed either.

As I said before, I'm thinking of getting rid of the whole business of
checking partitions on the referenced side of an FK at DROP time, and
instead just forbid the DROP completely if any FKs reference an ancestor
of that partition.  We can allow DETACH for that case, which can scan the
table for referenced tuples and remove the appropriate dependencies.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 1d5d2a755bfacedc1846b8515ede6465aaf411f5 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 28 Nov 2018 11:52:00 -0300
Subject: [PATCH v7 1/2] Rework deleteObjectsInList to allow objtype-specific
 checks

This doesn't change any functionality yet.
---
 src/backend/catalog/dependency.c | 41 +++-
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index f7acb4103eb..d7cc6040cde 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -230,29 +230,38 @@ deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel,
 	int			i;
 
 	/*
-	 * Keep track of objects for event triggers, if necessary.
+	 * Invoke pre-deletion callbacks and keep track of objects for event
+	 * triggers, if necessary.
 	 */
-	if (trackDroppedObjectsNeeded() && !(flags & PERFORM_DELETION_INTERNAL))
+	for (i = 0; i < targetObjects->numrefs; i++)
 	{
-		for (i = 0; i < targetObjects->numrefs; i++)
+		const ObjectAddress *thisobj = &targetObjects->refs[i];
+		Oid			objectClass = getObjectClass(thisobj);
+
+		if (trackDroppedObjectsNeeded() && !(flags & PERFORM_DELETION_INTERNAL))
 		{
-			const ObjectAddress *thisobj = &targetObjects->refs[i];
-			const ObjectAddressExtra *extra = &targetObjects->extras[i];
-			bool		original = false;
-			bool		normal = false;
-
-			if (extra->flags & DEPFLAG_ORIGINAL)
-original = true;
-			if (extra->flags & DEPFLAG_NORMAL)
-normal = true;
-			if (extra->flags & DEPFLAG_REVERSE)
-normal = true;
-
-			if (EventTriggerSupportsObjectClass(getObjectClass(thisobj)))
+			if (EventTriggerSupportsObjectClass(objectClass))
 			{
+bool		original = false;
+bool		normal = false;
+const ObjectAddressExtra *extra = &targetObjects->extras[i];
+
+if (extra->flags & DEPFLAG_ORIGINAL)
+	original = true;
+if (extra->flags & DEPFLAG_NORMAL ||
+	extra->flags & DEPFLAG_REVERSE)
+	normal = true;
+
 EventTriggerSQLDropAddObject(thisobj, original, normal);
 			}
 		}
+
+		/* Invoke class-specific pre-deletion checks */
+		switch (objectClass)
+		{
+			default:
+break;
+		}
 	}
 
 	/*
-- 
2.17.1

>From 8b7708b6639ec8f8cd417fcb6a28989c0ffdc95f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 20 Feb 2019 15:08:20 -0300
Subject: [PATCH v7 2/2] support FKs referencing partitioned tables

---
 doc/src/sgml/ref/create_table.sgml|7 +-
 src/backend/catalog/dependency.c  |3 +
 src/backend/catalog/heap.c|   24 +
 src/backend/commands/tablecmds.c  | 1376 +++--
 src/backend/utils/adt/ri_triggers.c   |  244 +++-
 src/backend/utils/adt/ruleutils.c |   18 +
 src/include/catalog/heap.h|2 +
 src/include/commands/tablecmds.h  |8 +-
 src/include/commands/trigger.h|1 +
 src/include/utils/ruleutils.h |1 +
 src/test/regress/expected/foreign_key.out |  166 ++-
 src/test/regress/sql/foreign_key.sql  |  114 +-
 12 files changed, 1529 insertions(+), 435 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e94fe2c3b67..37ae0f00fda 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -378,9 +378,6 @@ WITH ( MODULUS numeric_literal, REM
  
   Partitioned tables do not support EXCLUDE constraints;
   however, you can define these constraints on individual partitions.
-  Also, while it's p

Re: Performance issue in foreign-key-aware join estimation

2019-03-21 Thread Tom Lane
David Rowley  writes:
> [ eclass_indexes_v4.patch ]

I still don't like this approach too much.  I think we can fairly easily
construct the eclass_indexes at a higher level instead of trying to
manage them in add_eq_member, and after some hacking I have the attached.
Some notes:

* To be sure of knowing whether the indexes have been made yet, I added
a new flag variable PlannerInfo.ec_merging_done.  This seems like a
cleaner fix for the dependency between canonical-pathkey construction
and EC construction, too.  I did need to add code to set the flag in
a couple of short-circuit code paths where we never call
generate_base_implied_equalities(), though.

* I kind of want to rename generate_base_implied_equalities() to
something else, considering that it does a good bit more than just
that now.  But I haven't thought of a good name.  I considered
finalize_equivalence_classes(), but that seems like an overstatement
when it's still possible for new sort-key eclasses to get made later.

* I did not include your changes to use the indexes in
generate_join_implied_equalities[_for_ecs], because they were wrong.
generate_join_implied_equalities_for_ecs is supposed to consider only
ECs listed in the input list.  (It's troublesome that apparently we
don't have any regression tests exposing that need; we should try to
make a test case that does show it.)  There's probably some way to
still make use of the indexes there.  If nothing else, we could refactor
so that generate_join_implied_equalities isn't just a wrapper around
generate_join_implied_equalities_for_ecs but has its own looping, and
we only use the indexes in generate_join_implied_equalities not the
other entry point.  I haven't tried though.

* You mentioned postgres_fdw.c's get_useful_ecs_for_relation as
potentially affected by this change.  It looks to me like we could
nuke that function altogether in favor of having its caller scan the
foreign table's eclass_indexes.  I haven't tried that either.


I'm unsure how hard we should push to get something like this into v12.
I'm concerned that its dependency on list_nth might result in performance
regressions in some cases; it's a lot easier to believe that this will
be mostly-a-win with the better List infrastructure we're hoping to get
into v13.  If you want to keep playing with it, OK, but I'm kind of
tempted to shelve it for now.
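
(For readers skimming the diff below: the two helpers declared in
equivclass.c are expected to do roughly the following; this is a sketch
inferred from their names and the new RelOptInfo.eclass_indexes field,
not the exact patch code.)

static Bitmapset *
get_eclass_indexes_for_relids(PlannerInfo *root, Relids relids)
{
	Bitmapset  *ec_indexes = NULL;
	int			i = -1;

	/* union the per-relation eclass_indexes sets */
	while ((i = bms_next_member(relids, i)) > 0)
	{
		RelOptInfo *rel = root->simple_rel_array[i];

		if (rel != NULL)
			ec_indexes = bms_add_members(ec_indexes, rel->eclass_indexes);
	}
	return ec_indexes;
}

static Bitmapset *
get_common_eclass_indexes(PlannerInfo *root, Relids relids1, Relids relids2)
{
	/* ECs relevant to a join are those mentioning rels on both sides */
	return bms_intersect(get_eclass_indexes_for_relids(root, relids1),
						 get_eclass_indexes_for_relids(root, relids2));
}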

regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69179a0..4e8400d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2190,6 +2190,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(cte_plan_ids);
 	WRITE_NODE_FIELD(multiexpr_params);
 	WRITE_NODE_FIELD(eq_classes);
+	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
@@ -2256,6 +2257,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_UINT_FIELD(pages);
 	WRITE_FLOAT_FIELD(tuples, "%.0f");
 	WRITE_FLOAT_FIELD(allvisfrac, "%.6f");
+	WRITE_BITMAPSET_FIELD(eclass_indexes);
 	WRITE_NODE_FIELD(subroot);
 	WRITE_NODE_FIELD(subplan_params);
 	WRITE_INT_FIELD(rel_parallel_workers);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 61b5b11..804b211 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -64,6 +64,10 @@ static bool reconsider_outer_join_clause(PlannerInfo *root,
 			 bool outer_on_left);
 static bool reconsider_full_join_clause(PlannerInfo *root,
 			RestrictInfo *rinfo);
+static Bitmapset *get_eclass_indexes_for_relids(PlannerInfo *root,
+			  Relids relids);
+static Bitmapset *get_common_eclass_indexes(PlannerInfo *root, Relids relids1,
+		  Relids relids2);
 
 
 /*
@@ -341,10 +345,11 @@ process_equivalence(PlannerInfo *root,
 
 		/*
 		 * Case 2: need to merge ec1 and ec2.  This should never happen after
-		 * we've built any canonical pathkeys; if it did, those pathkeys might
-		 * be rendered non-canonical by the merge.
+		 * the ECs have reached canonical state; otherwise, pathkeys could be
+		 * rendered non-canonical by the merge, and relation eclass indexes
+		 * would get broken by removal of an eq_classes list entry.
 		 */
-		if (root->canon_pathkeys != NIL)
+		if (root->ec_merging_done)
 			elog(ERROR, "too late to merge equivalence classes");
 
 		/*
@@ -743,6 +748,26 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 
 	root->eq_classes = lappend(root->eq_classes, newec);
 
+	/*
+	 * If EC merging is already complete, we have to mop up by adding the new
+	 * EC to the eclass_indexes of the relation(s) mentioned in it.
+	 */
+	if (root->ec_merging_done)
+	{
+		int			ec_index = list_length(root->eq_classes) - 1;
+		int			i = -1;
+
+		while ((i = bms_next_member(newec->ec_relids, i)) > 0)
+		{
+			RelOptInfo *rel = root->simple_rel_array[i];
+
+			Assert(rel->r

Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-21, Alexander Korotkov wrote:

> However, I think this still can be backpatched correctly.  We can
> determine whether xlog record data contains deleteXid by its size.
> See the attached patch.  I haven't test this yet.  I'm going to test
> it.  If OK, then push.

Wow, this is so magical that I think it merits a long comment explaining
what is going on.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: propagating replica identity to partitions

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-21, Robert Haas wrote:

> On Wed, Mar 20, 2019 at 4:42 PM Alvaro Herrera  
> wrote:
> > Unless there are any objections to fixing the REPLICA IDENTITY bug, I
> > intend to push that tomorrow.
> 
> I still think that this is an ill-considered, piecemeal approach to a
> problem that deserves a better solution.

I already argued that TABLESPACE and OWNER TO are documented to work
that way, and have been for a long time, whereas REPLICA IDENTITY has
never been.  If you want to change long-standing behavior, be my guest,
but that's not my patch.  On the other hand, there's no consensus that
those should be changed, whereas there no opposition specifically
against changing this one, and in fact it was reported as a bug to me by
actual users.

FWIW I refrained from pushing today because after posting I realized
that I've failed to investigate Dolgov's reported problem.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: partitioned tables referenced by FKs

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-21, Alvaro Herrera wrote:

> I figured this out.  It's actually a bug in pg11, whereby we're setting
> a dependency wrongly.  If you try to run pg_describe_object() on the
> pg_depend entries for tables set up the way the regression test does it,
> it'll fail with a "cache lookup failed for constraint XYZ".  In other
> words, pg_depend contains bogus data :-(

Oh, actually, it's not a problem in PG11 ... or at least, it's not THIS
problem in pg11.  The bug was introduced by 1d92a0c9f7dd which ended the
DEPENDENCY_INTERNAL_AUTO saga, so the bug is only in master, and that
explains why I hadn't seen the problem before with my posted patch
series.

I think the attached is a better solution, which I'll go push shortly.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c339a2bb779..cb2c0010171 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1247,7 +1247,7 @@ index_constraint_create(Relation heapRelation,
 {
 	Oid			namespaceId = RelationGetNamespace(heapRelation);
 	ObjectAddress myself,
-referenced;
+idxaddr;
 	Oid			conOid;
 	bool		deferrable;
 	bool		initdeferred;
@@ -1341,15 +1341,9 @@ index_constraint_create(Relation heapRelation,
 	 * Note that the constraint has a dependency on the table, so we don't
 	 * need (or want) any direct dependency from the index to the table.
 	 */
-	myself.classId = RelationRelationId;
-	myself.objectId = indexRelationId;
-	myself.objectSubId = 0;
-
-	referenced.classId = ConstraintRelationId;
-	referenced.objectId = conOid;
-	referenced.objectSubId = 0;
-
-	recordDependencyOn(&myself, &referenced, DEPENDENCY_INTERNAL);
+	ObjectAddressSet(myself, ConstraintRelationId, conOid);
+	ObjectAddressSet(idxaddr, RelationRelationId, indexRelationId);
+	recordDependencyOn(&idxaddr, &myself, DEPENDENCY_INTERNAL);
 
 	/*
 	 * Also, if this is a constraint on a partition, give it partition-type
@@ -1357,7 +1351,8 @@ index_constraint_create(Relation heapRelation,
 	 */
 	if (OidIsValid(parentConstraintId))
 	{
-		ObjectAddressSet(myself, ConstraintRelationId, conOid);
+		ObjectAddress	referenced;
+
 		ObjectAddressSet(referenced, ConstraintRelationId, parentConstraintId);
 		recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_PRI);
 		ObjectAddressSet(referenced, RelationRelationId,
@@ -1444,7 +1439,7 @@ index_constraint_create(Relation heapRelation,
 		table_close(pg_index, RowExclusiveLock);
 	}
 
-	return referenced;
+	return myself;
 }
 
 /*


Re: Libpq support to connect to standby server as priority

2019-03-21 Thread Haribabu Kommi
On Fri, Mar 22, 2019 at 6:57 AM Robert Haas  wrote:

> On Thu, Mar 21, 2019 at 2:26 AM Haribabu Kommi 
> wrote:
> > Based on the above new options that can be added to target_session_attrs,
> >
> > primary - it is just an alias to the read-write option.
> > standby, prefer-standby - These options should check whether server is
> running in recovery mode or not
> > instead of checking whether server accepts read-only connections or not?
>
> I think it will be best to have one set of attributes that check
> default_transaction_read_only and a differently-named set that check
> pg_is_in_recovery().  For each, there should be one value that looks
> for a 'true' return and one value that looks for a 'false' return and
> perhaps values that accept either but prefer one or the other.
>
> IOW, there's no reason to make primary an alias for read-write.  If
> you want read-write, you can just say read-write.  But we can make
> 'primary' or 'master' look for a server that's not in recovery and
> 'standby' look for one that is.
>

OK, I agree with opinion. I will produce a patch for those new options.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: pg_basebackup ignores the existing data directory permissions

2019-03-21 Thread Haribabu Kommi
On Thu, Mar 21, 2019 at 3:02 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-03-19 08:34, Haribabu Kommi wrote:
> > How about the following change?
> >
> > pg_basebackup  --> copies the contents of the src directory (with group
> > access)
> > and even the root directory permissions.
> >
> > pg_basebackup --no-group-access   --> copies the contents of the src
> > directory
> > (with no group access) even for the root directory.
>
> I'm OK with that.  Perhaps a positive option --allow-group-access would
> also be useful.
>

Do you want both --allow-group-access and --no-group-access options to
be added to pg_basebackup?

Without --allow-group-access, is the default automatically --no-group-access?

Or do you want pg_basebackup to decide group access independently,
irrespective of the source directory?

That is, even if the source directory has no group access, pg_basebackup
--allow-group-access would make the standby directory group-accessible,
and if the source directory has group access, pg_basebackup
--no-group-access would make the standby directory non-group-accessible.

Is the default behavior of pg_basebackup just to copy the same permissions
as the source directory?


> Let's make sure the behavior of these options is aligned with initdb.
> And write tests for each variant.
>

OK.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: partitioned tables referenced by FKs

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-18, Alvaro Herrera wrote:

> > I'm getting a failure in the pg_upgrade test:
> > 
> >  --
> > +-- Name: pk5 pk5_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner:
> > jpedersen
> > +--
> > +
> > +ALTER TABLE ONLY regress_fk.pk5
> > +ADD CONSTRAINT pk5_pkey PRIMARY KEY (a);
> > +
> > +
> > +--
> > 
> > when running check-world.
> 
> So I tested this case just now (after fixing a couple of bugs) and see a
> slightly different problem now: those excess lines are actually just
> that pg_dump chose to print the constraints in different order, as there
> are identical lines with "-" in the diff I get.  The databases actually
> *are* identical as far as I can tell.  I haven't yet figured out how to
> fix this; of course, the easiest solution is to just drop the regress_fk
> schema at the end of the foreign_key.sql regression test, but I would
> prefer a different solution.

I figured this out.  It's actually a bug in pg11, whereby we're setting
a dependency wrongly.  If you try to run pg_describe_object() on the
pg_depend entries for tables set up the way the regression test does it,
it'll fail with a "cache lookup failed for constraint XYZ".  In other
words, pg_depend contains bogus data :-(

Here's one possible fix:

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c339a2bb779..8f62b454f75 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1357,6 +1357,8 @@ index_constraint_create(Relation heapRelation,
 */
if (OidIsValid(parentConstraintId))
{
+   ObjectAddress   referenced; /* XXX shadow outer variable */
+
ObjectAddressSet(myself, ConstraintRelationId, conOid);
ObjectAddressSet(referenced, ConstraintRelationId, 
parentConstraintId);
recordDependencyOn(&myself, &referenced, 
DEPENDENCY_PARTITION_PRI);

I can tell the raised eyebrows from a mile away :-(

The problem (and the reason shadowing the variable solves it) is that
the outer 'referenced' is the function's return value.  This block was
clobbering the value previously set, which is what we really wanted to
return.

/me dons paper bag

I'll go figure out what a better solution is.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Simon Riggs
On Thu, 21 Mar 2019 at 15:18, Alexander Korotkov 
wrote:

> On Thu, Mar 21, 2019 at 9:26 PM Simon Riggs  wrote:
>
>

> > It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
> > introduced a WAL incompatibility that has not been flagged.
> >
> > In ginRedoDeletePage() we use the struct directly to read the WAL
> record, so if a WAL record was written prior to
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have
> problems, since deleteXid will not be set correctly.
> >
> > It seems this should not have been backpatched.
> >
> > Please give your assessment.
>
> Oh, right.  This is my fault.
>
> However, I think this still can be backpatched correctly.  We can
> determine whether xlog record data contains deleteXid by its size.
> See the attached patch.  I haven't test this yet.  I'm going to test
> it.  If OK, then push.
>

Patch looks like it will work.

I'd prefer to see a tighter test, since the length should be either the old
or the new length, nothing else.
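
Something along these lines, perhaps (a sketch only; it assumes the pre-fix
record is exactly offsetof(ginxlogDeletePage, deleteXid) bytes and the
post-fix record sizeof(ginxlogDeletePage) bytes, which is not the committed
code):

	/* sketch inside ginRedoDeletePage() */
	ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
	Size		len = XLogRecGetDataLen(record);
	TransactionId deleteXid;

	if (len == sizeof(ginxlogDeletePage))
		deleteXid = data->deleteXid;	/* record written with the new format */
	else if (len == offsetof(ginxlogDeletePage, deleteXid))
		deleteXid = InvalidTransactionId;	/* old-format record, field absent */
	else
		elog(ERROR, "unexpected ginxlogDeletePage record length: %zu", len);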

Thanks

-- 
Simon Riggs  http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Pluggable Storage - Andres's take

2019-03-21 Thread Haribabu Kommi
On Fri, Mar 22, 2019 at 5:16 AM Andres Freund  wrote:

> Hi,
>
> Attached is a version of just the first patch. I'm still updating it,
> but it's getting closer to commit:
>
> - There were no tests testing EPQ interactions with DELETE, and only an
>   accidental test for EPQ in UPDATE with a concurrent DELETE. I've added
>   tests. Plan to commit that ahead of the big change.
>
> - I was pretty unhappy about how the EPQ integration looked before, I've
>   changed that now.
>
>   I still wonder if we should restore EvalPlanQualFetch and move the
>   table_lock_tuple() calls in ExecDelete/Update into it. But it seems
>   like it'd not gain that much, because there's custom surrounding code,
>   and it's not that much code.
>
> - I changed heapam_tuple_tuple to return *WouldBlock rather than just
>   the last result. I think that's one of the reason Haribabu had
>   neutered a few asserts.
>
> - I moved comments from heapam.h to tableam.h where appropriate
>
> - I updated the name of HeapUpdateFailureData to TM_FailureData,
>   HTSU_Result to TM_Result, TM_Results members now properly distinguish
>   between update vs modifications (delete & update).
>
> - I separated the HEAP_INSERT_ flags into TABLE_INSERT_* and
>   HEAP_INSERT_ with the latter being a copy of TABLE_INSERT_ with the
>   sole addition of _SPECULATIVE. table_insert_speculative callers now
>   don't specify that anymore.
>
>
> Pending work:
> - Wondering if table_insert/delete/update should rather be
>   table_tuple_insert etc. Would be a bit more consistent with the
>   callback names, but a bigger departure from existing code.
>
> - I'm not yet happy with TableTupleDeleted computation in heapam.c, I
>   want to revise that further
>
> - formatting
>
> - commit message
>
> - a few comments need a bit of polishing (ExecCheckTIDVisible,
> heapam_tuple_lock)
>
> - Rename TableTupleMayBeModified to TableTupleOk, but also probably a
> s/TableTuple/TableMod/
>
> - I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h
>
> - two more passes through the patch
>

Thanks for the corrections.



> On 2019-03-21 15:07:04 +1100, Haribabu Kommi wrote:
> > As you are modifying the 0003 patch for modify API's, I went and reviewed
> > the
> > existing patch and found couple corrections that are needed, in case if
> you
> > are not
> > taken care of them already.
>
> Some I have...
>
>
>
> > + /* Update the tuple with table oid */
> > + slot->tts_tableOid = RelationGetRelid(relation);
> > + if (slot->tts_tableOid != InvalidOid)
> > + tuple->t_tableOid = slot->tts_tableOid;
> >
> > The setting of slot->tts_tableOid is not required in this function,
> > After set the check is happening, the above code is present in both
> > heapam_heap_insert and heapam_heap_insert_speculative.
>
> I'm not following? Those functions are independent?
>

In those functions, slot->tts_tableOid is set and then the very next
statement checks whether it is invalid or not. Callers of table_insert
should have already set it, so is setting the value and then checking it
on the next line really required? The value cannot be InvalidOid.


>
> > + slot->tts_tableOid = RelationGetRelid(relation);
> >
> > In heapam_heap_update, i don't think there is a need to update
> > slot->tts_tableOid.
>
> Why?
>

slot->tts_tableOid should have been updated before the call to
heap_update; is setting it again after heap_update required?

I also see slot->tts_tableOid being set after the table_insert_XXX calls
in the ExecInsert function.

Is this to make sure that the AM hasn't modified that value?


> > + default:
> > + elog(ERROR, "unrecognized heap_update status: %u", result);
> >
> > heap_update --> table_update?
> >
> >
> > + default:
> > + elog(ERROR, "unrecognized heap_delete status: %u", result);
> >
> > same as above?
>
> I've fixed that in a number of places.
>
>
> > + /*hari FIXME*/
> > + /*Assert(result != HeapTupleUpdated && hufd.traversed);*/
> >
> > Removing the commented codes in both ExecDelete and ExecUpdate functions.
>
> I don't think that's the right fix. I've refactored that code
> significantly now, and restored the assert in a, imo, correct version.
>
>
OK.


> > + /**/
> > + if (epqreturnslot)
> > + {
> > + *epqreturnslot = epqslot;
> > + return NULL;
> > + }
> >
> > comment update missed?
>
> Well, you'd deleted a comment around there ;). I've added something back
> now...
>

This is not the only problem I could have introduced; all the comments
listed were introduced by me ;).

Regards,
Haribabu Kommi
Fujitsu Australia


Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3

2019-03-21 Thread Robert Haas
On Thu, Mar 21, 2019 at 7:46 AM Antonin Houska  wrote:
> Nevertheless, with the current version of our patch, PG should be resistant
> against such a partial write anyway because we chose to align XLOG records to
> 16 bytes (as long as the encryption is enabled) for the following reasons:
>
> If one XLOG record ends and the following one starts in the same encryption
> block, both records can get corrupted during streaming replication. The
> scenario looks like: 1) the first record is written on master (the unused part
> of the block contains zeroes), 2) the block is encrypted and its initial part
> (i.e. the number of bytes occupied by the first record in the plain text) is
> streamed to slave, 3) the second record is written on master, 4) the
> containing encryption block is encrypted again and the trailing part (i.e. the
> number of bytes occupied by the second record) is streamed, 5) decryption of
> the block on slave will produce garbage and thus corrupt both records. This is
> because the trailing part of the block was filled with zeroes during
> encryption, but it contains different data at decryption time.

Wouldn't Tom's proposal to use a stream cipher fix all this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-21 Thread Robert Haas
On Thu, Mar 21, 2019 at 2:18 AM Haribabu Kommi  wrote:
> With inclusion of parallel worker transactions, the TPS will be a higher
> number, so the user may read it as better throughput. This is the case with
> one of our tools.
> The tool changes the configuration randomly to find the best configuration
> for the server based on a set of workloads; during our test, with some
> configurations the TPS looks good, but the overall performance is slow
> because the system has too few resources to keep up with that configuration.
>
> Opinions?

Well, I think that might be a sign that the data isn't being used
correctly.  I don't have a strong position on what the "right" thing
to do here is, but I think if you want to know how many client
transactions are being executed, you should count them on the client
side, as pgbench does.  I agree that it's a little funny to count the
parallel worker commit as a separate transaction, since in a certain
sense it is part of the same transaction.  But if you do that, then as
already noted you have to next decide what to do about other
transactions that parallel workers use internally.  And then you have
to decide what to do about other background transactions.  And there's
not really one "right" answer to any of these questions, I don't
think.  You might want to count different things depending on how the
information is going to be used.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2019-03-21 Thread Tomas Vondra


On 3/19/19 4:44 PM, Chris Travers wrote:
> 
> 
> On Tue, Mar 19, 2019 at 12:19 PM Tomas Vondra  wrote:
> 
> 
> On 3/19/19 10:59 AM, Chris Travers wrote:
> >
> >
> > Not discussing whether any particular committer should pick this
> up but
> > I want to discuss an important use case we have at Adjust for this
> sort
> > of patch.
> >
> > The PostgreSQL compression strategy is something we find
> inadequate for
> > at least one of our large deployments (a large debug log spanning
> > 10PB+).  Our current solution is to set storage so that it does not
> > compress and then run on ZFS to get compression speedups on
> spinning disks.
> >
> > But running PostgreSQL on ZFS has some annoying costs because we have
> > copy-on-write on copy-on-write, and when you add file
> fragmentation... I
> > would really like to be able to get away from having to do ZFS as an
> > underlying filesystem.  While we have good write throughput, read
> > throughput is not as good as I would like.
> >
> > An approach that would give us better row-level compression  would
> allow
> > us to ditch the COW filesystem under PostgreSQL approach.
> >
> > So I think the benefits are actually quite high particularly for those
> > dealing with volume/variety problems where things like JSONB might
> be a
> > go-to solution.  Similarly I could totally see having systems which
> > handle large amounts of specialized text having extensions for dealing
> > with these.
> >
> 
> Sure, I don't disagree - the proposed compression approach may be a big
> win for some deployments further down the road, no doubt about it. But
> as I said, it's unclear when we get there (or if the interesting stuff
> will be in some sort of extension, which I don't oppose in principle).
> 
> 
> I would assume that if extensions are particularly stable and useful
> they could be moved into core.
> 
> But I would also assume that at first, this area would be sufficiently
> experimental that folks (like us) would write our own extensions for it.
>  
> 
> 
> >
> >     But hey, I think there are committers working for postgrespro,
> who might
> >     have the motivation to get this over the line. Of course,
> assuming that
> >     there are no serious objections to having this functionality
> or how it's
> >     implemented ... But I don't think that was the case.
> >
> >
> > While I am not currently able to speak for questions of how it is
> > implemented, I can say with very little doubt that we would almost
> > certainly use this functionality if it were there and I could see
> plenty
> > of other cases where this would be a very appropriate direction
> for some
> > other projects as well.
> >
> Well, I guess the best thing you can do to move this patch forward is to
> actually try that on your real-world use case, and report your results
> and possibly do a review of the patch.
> 
> 
> Yeah, I expect to do this within the next month or two.
>  
> 
> 
> IIRC there was an extension [1] leveraging this custom compression
> interface for better jsonb compression, so perhaps that would work for
> you (not sure if it's up to date with the current patch, though).
> 
> [1]
> 
> https://www.postgresql.org/message-id/20171130182009.1b492eb2%40wp.localdomain
> 
> Yeah I will be looking at a couple different approaches here and
> reporting back. I don't expect it will be a full production workload but
> I do expect to be able to report on benchmarks in both storage and
> performance.
>  

FWIW I was a bit curious how that jsonb compression would affect the
data set I'm using for testing the jsonpath patches, so I spent a bit of
time getting it to work with master. The attached patch gets it to
compile, but unfortunately it then fails like this:

ERROR:  jsonbd: worker has detached

It seems there's some bug in how the shm_mq is used, but I don't have time
to investigate that further.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/jsonbd.c b/jsonbd.c
index 511cfcf..9b86e1f 100644
--- a/jsonbd.c
+++ b/jsonbd.c
@@ -766,11 +766,6 @@ jsonbd_cminitstate(Oid acoid, List *options)
 	return NULL;
 }
 
-static void
-jsonbd_cmdrop(Oid acoid)
-{
-	/* TODO: if there is no compression options, remove the dictionary */
-}
 
 static struct varlena *
 jsonbd_cmdecompress(CompressionAmOptions *cmoptions, const struct varlena *data)
@@ -863,7 +858,6 @@ jsonbd_compression_handler(PG_FUNCTION_ARGS)
 	CompressionAmRoutine *routine = makeNode(CompressionAmRoutine);
 
 	routine->cmcheck = jsonbd_cmcheck;
-	routine->cmdrop = jsonbd_cmdrop;		/* no drop behavior */
 	routine->cminitstate = jsonbd_cminitstate;
 	routine->cmcom

Re: Libpq support to connect to standby server as priority

2019-03-21 Thread Robert Haas
On Thu, Mar 21, 2019 at 2:26 AM Haribabu Kommi  wrote:
> Based on the above new options that can be added to target_session_attrs,
>
> primary - it is just an alias to the read-write option.
> standby, prefer-standby - These options should check whether server is 
> running in recovery mode or not
> instead of checking whether server accepts read-only connections or not?

I think it will be best to have one set of attributes that check
default_transaction_read_only and a differently-named set that check
pg_is_in_recovery().  For each, there should be one value that looks
for a 'true' return and one value that looks for a 'false' return and
perhaps values that accept either but prefer one or the other.

IOW, there's no reason to make primary an alias for read-write.  If
you want read-write, you can just say read-write.  But we can make
'primary' or 'master' look for a server that's not in recovery and
'standby' look for one that is.
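
For concreteness, a client asking for that behaviour might look like the
sketch below; only "read-write" and "any" exist in released libpq, so
"primary" here is one of the names being proposed, not an existing value.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn;

	/* try both hosts and keep the first one that is not in recovery */
	conn = PQconnectdb("host=node1,node2 dbname=app "
					   "target_session_attrs=primary");

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
	else
		printf("connected to host %s\n", PQhost(conn));

	PQfinish(conn);
	return 0;
}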

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Block level parallel vacuum

2019-03-21 Thread Robert Haas
On Tue, Mar 19, 2019 at 7:26 AM Masahiko Sawada  wrote:
> In parsing vacuum command, since only PARALLEL option can have an
> argument I've added the check in ExecVacuum to erroring out when other
> options have an argument. But it might be good to make other vacuum
> options (perhaps except for DISABLE_PAGE_SKIPPING option) accept an
> argument just like EXPLAIN command.

I think all of the existing options, including DISABLE_PAGE_SKIPPING,
should permit an argument that is passed to defGetBoolean().
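
The shape I have in mind is roughly the EXPLAIN one, where every boolean
option is funneled through defGetBoolean() whether or not an argument was
written.  A rough sketch (illustrative only, not the actual ExecVacuum
code):

#include "postgres.h"

#include "commands/defrem.h"
#include "nodes/parsenodes.h"

/* sketch: accepts "VACUUM (VERBOSE)", "VACUUM (VERBOSE off)", etc. */
static void
parse_vacuum_options(List *options, bool *verbose, bool *disable_page_skipping)
{
	ListCell   *lc;

	foreach(lc, options)
	{
		DefElem    *opt = (DefElem *) lfirst(lc);

		if (strcmp(opt->defname, "verbose") == 0)
			*verbose = defGetBoolean(opt);
		else if (strcmp(opt->defname, "disable_page_skipping") == 0)
			*disable_page_skipping = defGetBoolean(opt);
		else
			ereport(ERROR,
					(errcode(ERRCODE_SYNTAX_ERROR),
					 errmsg("unrecognized VACUUM option \"%s\"", opt->defname)));
	}
}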

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Block level parallel vacuum

2019-03-21 Thread Robert Haas
On Tue, Mar 19, 2019 at 3:59 AM Kyotaro HORIGUCHI
 wrote:
> The leader doesn't continue heap-scan while index vacuuming is
> running. And the index-page-scan seems eat up CPU easily. If
> index vacuum can run simultaneously with the next heap scan
> phase, we can make index scan finishes almost the same time with
> the next round of heap scan. It would reduce the (possible) CPU
> contention. But this requires twice as much shared memory as the
> current implementation.

I think you're approaching this from the wrong point of view.  If we
have a certain amount of memory available, is it better to (a) fill
the entire thing with dead tuples once, or (b) better to fill half of
it with dead tuples, start index vacuuming, and then fill the other
half of it with dead tuples for the next index-vacuum cycle while the
current one is running?  I think the answer is that (a) is clearly
better, because it results in half as many index vacuum cycles.

We can't really ask the user how much memory it's OK to use and then
use twice as much.  But if we could, what you're proposing here is
probably still not the right way to use it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-03-21 Thread Joel Jacobson
On Wed, Mar 20, 2019 at 9:24 PM Tom Lane  wrote:

> Joel Jacobson  writes:
> > I've seen a performance trick in other hash functions [1]
> > to instead read multiple bytes in each iteration,
> > and then handle the remaining bytes after the loop.
> > [1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29
>
> I can't get very excited about this, seeing that we're only going to
> be hashing short strings.  I don't really believe your 30% number
> for short strings; and even if I did, there's no evidence that the
> hash functions are worth any further optimization in terms of our
> overall performance.
>

I went ahead and tested this approach anyway, since I need this algorithm
in a completely different project.

The benchmark below shows stats for three different keywords per length,
compiled with -O2:

$ c++ -O2 -std=c++14 -o bench_perfect_hash bench_perfect_hash.cc
$ ./bench_perfect_hash

keyword            length  char-a-time (ns)  word-a-time (ns)  diff (%)
as                      2              3.30              2.62     -0.21
at                      2              3.54              2.66     -0.25
by                      2              3.30              2.59     -0.22
add                     3              4.01              3.15     -0.21
all                     3              4.04              3.11     -0.23
and                     3              3.84              3.11     -0.19
also                    4              4.50              3.17     -0.30
both                    4              4.49              3.06     -0.32
call                    4              4.95              3.42     -0.31
abort                   5              6.09              4.02     -0.34
admin                   5              5.26              3.65     -0.31
after                   5              5.18              3.76     -0.27
access                  6              5.97              3.91     -0.34
action                  6              5.86              3.89     -0.34
always                  6              6.10              3.77     -0.38
analyse                 7              6.67              4.64     -0.30
analyze                 7              7.09              4.87     -0.31
between                 7              7.02              4.66     -0.34
absolute                8              7.49              3.82     -0.49
backward                8              7.13              3.88     -0.46
cascaded                8              7.23              4.17     -0.42
aggregate               9              8.04              4.49     -0.44
assertion               9              7.98              4.52     -0.43
attribute               9              8.03              4.44     -0.45
assignment             10              8.58              4.67     -0.46
asymmetric             10              9.07              4.57     -0.50
checkpoint             10              9.15              4.53     -0.51
constraints            11              9.58              5.14     -0.46
insensitive            11              9.62              5.30     -0.45
publication            11             10.30              5.60     -0.46
concurrently           12             10.36              4.81     -0.54
current_date           12             11.17              5.48     -0.51
current_role           12             11.15              5.10     -0.54
authorization          13             11.87              5.50     -0.54
configuration          13             11.50              5.51     -0.52
xmlattributes          13             11.72              5.66     -0.52
current_schema         14             12.17              5.58     -0.54
localtimestamp         14             11.78              5.46     -0.54
characteristics        15             12.77              5.97     -0.53
current_catalog        15             12.65              5.87     -0.54
current_timestamp      17             14.19              6.12     -0.57


> Also, as best I can tell, the approach you propose would result
> in an endianness dependence, meaning we'd have to have separate
> lookup tables for BE and LE machines.  That's not a dealbreaker
> perhaps, but it is certainly another point on the "it's not worth it"
> side of the argument.
>

I can see how the same problem has been worked around in e.g. pg_crc32c.h:

#ifdef WORDS_BIGENDIAN
#define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
#else
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
#endif

So I used the same trick in PerfectHash.pm:

$f .= sprintf "#ifdef WORDS_BIGENDIAN\n";
$f .= sprintf "\t\tc4 = pg_bswap32(c4);\n";
$f .= sprintf "#endif\n";

I've also tried to measure the overall effect by hacking postgres.c:

+   struct timespec start, stop;
+   clock_gettime( CLOCK_REALTIME, &start);
+   for (int i=0; i<10; i++)
+  {
+   List   *parsetree_list2;
+   MemoryContext oldcontext2;
+
+   oldcontext2 = MemoryContextSwitchTo(MessageContext);
+   parsetree_list2 = pg_parse_query(query_string);
+   MemoryContextSwitchTo(oldcontext2);
+// MemoryContextReset(MessageContext);
+   CHECK_F

Re: propagating replica identity to partitions

2019-03-21 Thread Robert Haas
On Wed, Mar 20, 2019 at 4:42 PM Alvaro Herrera  wrote:
> Unless there are any objections to fixing the REPLICA IDENTITY bug, I
> intend to push that tomorrow.

I still think that this is an ill-considered, piecemeal approach to a
problem that deserves a better solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Feature: triggers on materialized views

2019-03-21 Thread Robert Haas
On Fri, Jan 4, 2019 at 6:23 AM Peter Eisentraut
 wrote:
> What bothers me about this patch is that it subtly changes what a
> trigger means.  It currently means, say, INSERT was executed on this
> table.  You are expanding that to mean, a row was inserted into this
> table -- somehow.

Yeah.  The fact that a concurrent refresh currently does DELETE+INSERT
rather than UPDATE is currently an implementation detail.  If you
allow users to hook up triggers to the inserts, then suddenly it's no
longer an implementation detail: it is a user-visible behavior that
can't be changed in the future without breaking backward
compatibility.

> Triggers should generally refer to user-facing commands.  Could you not
> make a trigger on REFRESH itself?

I'm not sure that would help with the use case... but that seems like
something to think about, especially if it could use the transition
table machinery somehow.

> > Triggers are not fired if you call REFRESH without CONCURRENTLY. This
> > is based on some discussion on the mailing list because implementing
> > it for without CONCURRENTLY would require us to add logic for firing
> > triggers where there was none before (and is just an efficient heap
> > swap).
>
> This is also a problem, because it would allow bypassing the trigger
> accidentally.
>
> Moreover, consider that there could be updatable materialized views,
> just like there are updatable normal views.  And there could be triggers
> on those updatable materialized views.  Those would look similar but
> work quite differently from what you are proposing here.

Yeah.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Feature: triggers on materialized views

2019-03-21 Thread Robert Haas
On Tue, Dec 25, 2018 at 10:05 PM Alvaro Herrera
 wrote:
> Well, REFRESH CONCURRENTLY runs completely different than non-concurrent
> REFRESH.  The former updates the existing data by observing the
> differences with the previous data; the latter simply re-runs the query
> and overwrites everything.  So if you simply enabled triggers on
> non-concurrent refresh, you'd just see a bunch of inserts into a
> throwaway data area (a transient relfilenode, we call it), then a swap
> of the MV's relfilenode with the throwaway one.  I doubt it'd be useful.
> But then I'm not clear *why* you would like to do a non-concurrent
> refresh.  Maybe your situation would be best served by forbidding non-
> concurrent refresh when the MV contains any triggers.
>
> Alternatively, maybe reimplement non-concurrent refresh so that it works
> identically to concurrent refresh (except with a stronger lock).  Not
> sure if this implies any performance penalties.

Sorry to jump in late, but all of this sounds very strange to me.
It's possible for either concurrent or non-concurrent refresh to be
faster, depending on the circumstances; for example, if a concurrent
refresh would end up deleting all the rows and inserting them again, I
think that could be slower than just blowing all the data away and
starting over.  So disabling non-concurrent refresh sounds like a bad
idea.  For the same reason, reimplementing it to work like a
concurrent refresh also sounds like a bad idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Alexander Korotkov
On Thu, Mar 21, 2019 at 9:26 PM Simon Riggs  wrote:
> On Thu, 13 Dec 2018 at 14:48, Alexander Korotkov  wrote:
>> On Thu, Dec 13, 2018 at 10:46 PM Andres Freund  wrote:
>> > On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
>> > > It doesn't mater, because we release all locks on every buffer at one
>> > > time.  The unlock order can have effect on what waiter will acquire
>> > > the lock next after ginRedoDeletePage().  However, I don't see why one
>> > > unlock order is better than another.  Thus, I just used the rule of
>> > > thumb to not change code when it's not necessary for bug fix.
>> >
>> > I think it's right to not change unlock order at the same time as a
>> > bugfix here.  More generally I think it can often be useful to default
>> > to release locks in the inverse order they've been acquired - if there's
>> > any likelihood that somebody will acquire them in the same order, that
>> > ensures that such a party would only need to wait for a lock once,
>> > instead of being woken up for one lock, and then immediately having to
>> > wait for the next one.
>>
>> Good point, thank you!
>
>
> It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
> introduced a WAL incompatibility that has not been flagged.
>
> In ginRedoDeletePage() we use the struct directly to read the WAL record, so 
> if a WAL record was written prior to 
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at 
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have problems, 
> since deleteXid will not be set correctly.
>
> It seems this should not have been backpatched.
>
> Please give your assessment.

Oh, right.  This is my fault.

However, I think this still can be backpatched correctly.  We can
determine whether xlog record data contains deleteXid by its size.
See the attached patch.  I haven't test this yet.  I'm going to test
it.  If OK, then push.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin-redo-delete-page-backpatch-fix.patch
Description: Binary data


Re: Automated way to find actual COMMIT LSN of subxact LSN

2019-03-21 Thread Jeremy Finzel
>
> I find this to be unactionably vague.  What does it mean to claim "an
> LSN is visible"?  An LSN might not even point to a WAL record, or it
> might point to one that has nontransactional effects.  Moreover, any
> behavior of this sort would destroy what I regard as a bedrock property
> of recover-to-LSN, which is that there's a well defined, guaranteed-finite
> stopping point.  (A property that recover-to-XID lacks, since the
> specified XID might've crashed without recording either commit or abort.)
>

I mentioned that my specific use case is that I am picking out an LSN or
XID within the context of logical decoding.  So I don't think that's vague,
but let me clarify.  When using the peek_changes or get_changes functions,
they only show a particular update to a particular row, with a
corresponding LSN and transaction ID.  That's what I see using both
test_decoding and pglogical_output, both of which I have used in this way.
In that context at least, all LSNs and XIDs point to committed WAL data
only.


> I think what you ought to be doing is digging the xmin out of the inserted
> tuple you're concerned with and then doing recover-to-XID.  There's
> definitely room for us to make it easier if the XID is a subxact rather
> than a main xact.  But I think identifying the target XID is your job
> not the job of the recovery-stop-point mechanism.
>

I'm open to that, but how will it help if I can't guarantee that the tuple
wasn't updated again after the original insert/update of interest?

Thank you,
Jeremy


Re: pg_basebackup ignores the existing data directory permissions

2019-03-21 Thread Robert Haas
On Tue, Mar 19, 2019 at 2:29 AM Michael Paquier  wrote:
> Hm.  We have been assuming that the contents of a base backup inherit
> the permission of the source when using pg_basebackup because this
> allows users to keep a nodes in a consistent state without deciding
> which option to use.  Do you mean that you would like to enforce the
> permissions of only the root directory if it exists?  Or the root
> directory with all its contents?  The former may be fine.  The latter
> is definitely not.

Why not?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [proposal] Add an option for returning SQLSTATE in psql error message

2019-03-21 Thread Daniel Gustafsson
On Thursday, March 21, 2019 7:47 PM, Tom Lane  wrote:

> David Steele da...@pgmasters.net writes:
>
> > > > > > Why are you not including a test for \set VERBOSITY verbose?
>
> > What do you think, Peter? Is the extra test valuable or will it cause
> > unpredictable outputs as Tom and Michael predict?
>
> I'm not really sure why this is open for discussion.
>
> regression=# \set VERBOSITY verbose
> regression=# select 1/0;
> ERROR: 22012: division by zero
> LOCATION: int4div, int.c:824
>
> It's not going to be tolerable to have to adjust such a test anytime
> somebody adds or removes lines in whichever backend file throws the
> tested-for error (never mind more-substantial refactoring such as
> moving the ereport call to a different function or file). I also
> believe that the reported line number will vary across compilers
> even without that: IME you might get either the starting or ending
> line number of the ereport() construct.

In GPDB we have to handle variable output in tests similar to this (but
for very different reasons), and unless there is a big gain elsewhere
from having variable test output I would advise against it. It adds a
fair bit of complexity and another moving part which can mask subtle
test failures.

cheers ./daniel



Re: [proposal] Add an option for returning SQLSTATE in psql error message

2019-03-21 Thread Tom Lane
David Steele  writes:
> Why are you not including a test for \set VERBOSITY verbose?

> What do you think, Peter?  Is the extra test valuable or will it cause 
> unpredictable outputs as Tom and Michael predict?

I'm not really sure why this is open for discussion.

regression=# \set VERBOSITY verbose
regression=# select 1/0;
ERROR:  22012: division by zero
LOCATION:  int4div, int.c:824

It's not going to be tolerable to have to adjust such a test anytime
somebody adds or removes lines in whichever backend file throws the
tested-for error (never mind more-substantial refactoring such as
moving the ereport call to a different function or file).  I also
believe that the reported line number will vary across compilers
even without that: IME you might get either the starting or ending
line number of the ereport() construct.

There's also the question of whether the function name is reliably
available.  Maybe it is now that we require C99 compatibility,
but the code hasn't been taught to assume that.

regards, tom lane



Re: Best way to keep track of a sliced TOAST

2019-03-21 Thread Robert Haas
On Wed, Mar 20, 2019 at 9:20 PM Bruno Hass  wrote:
> I would like to optimize the jsonb key access operations. I could not find 
> the discussion you've mentioned, but I am giving some thought to the idea.
>
> Instead of storing lengths, could we dedicate the first chunk of the TOASTed 
> jsonb to store where each key is located? Would it be a good idea?

I don't see how that would work.  To know which key is which, you'd
have to store all the keys.  They might not fit in the first chunk.
That's the whole reason this has to be TOASTed to begin with.

> You've mentioned that the current jsonb format is byte-oriented. Does that 
> imply that a single jsonb key value might be split between multiple chunks?

Yes.

You're going to need to look at the code yourself to get anywhere
here... I don't have unlimited time to answer questions about it, and
even if I did, you're not really going to understand how it works
without studying it yourself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Simon Riggs
On Thu, 13 Dec 2018 at 14:48, Alexander Korotkov 
wrote:

> On Thu, Dec 13, 2018 at 10:46 PM Andres Freund  wrote:
> > On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
> > > It doesn't mater, because we release all locks on every buffer at one
> > > time.  The unlock order can have effect on what waiter will acquire
> > > the lock next after ginRedoDeletePage().  However, I don't see why one
> > > unlock order is better than another.  Thus, I just used the rule of
> > > thumb to not change code when it's not necessary for bug fix.
> >
> > I think it's right to not change unlock order at the same time as a
> > bugfix here.  More generally I think it can often be useful to default
> > to release locks in the inverse order they've been acquired - if there's
> > any likelihood that somebody will acquire them in the same order, that
> > ensures that such a party would only need to wait for a lock once,
> > instead of being woken up for one lock, and then immediately having to
> > wait for the next one.
>
> Good point, thank you!
>

It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
introduced a WAL incompatibility that has not been flagged.

In ginRedoDeletePage() we use the struct directly to read the WAL record,
so if a WAL record was written prior to
52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at
52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have
problems, since deleteXid will not be set correctly.

It seems this should not have been backpatched.

Please give your assessment.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Re: INSTALL file

2019-03-21 Thread David Steele

Hi Andreas,

On 2/15/19 11:59 AM, Peter Eisentraut wrote:

On 14/02/2019 20:13, Andres Freund wrote:

On 2019-02-04 11:02:44 +0900, Michael Paquier wrote:

+1.  I have just looked at the patch, and my take is that listing all
the ways to build Postgres directly in the README is just a
duplication of what we already have in the documentation.  So I would
just rip out all that and keep a simple link to the documentation.


Well, the documentation cannot be built without the dependencies, and
not everyone has convenient internet access.  I'm not sure what the
solution to that is, but somehow consolidating that information in the,
by now pretty standardized, location of INSTALL seems pretty reasonable
to me.


(I suppose you meant README here.  The information is already in INSTALL.)

But the proposed additions don't actually mention the required
dependencies to build the documentation, so it wouldn't even achieve
that goal.

And if you don't have internet access, how did you get the git checkout?

The use case here seems pretty narrow.


Any thoughts on Peter and Andres' comments?

Regards,
--
-David
da...@pgmasters.net



Re: Re: [proposal] Add an option for returning SQLSTATE in psql error message

2019-03-21 Thread David Steele

On 2/4/19 5:54 AM, Michael Paquier wrote:

On Mon, Jan 07, 2019 at 10:44:24PM +0100, didier wrote:

On Sat, Jan 5, 2019 at 6:30 PM Tom Lane  wrote:

Peter Eisentraut  writes:

Why are you not including a test for \set VERBOSITY verbose?


Stability of the output would be a problem ...

Yes it could moreover the functionality wasn't tested before.

Should I add one ?


Unpredictable outputs mean more maintenance and more alternate
outputs.  I have moved the patch to the next CF, still ready for
committer.


What do you think, Peter?  Is the extra test valuable or will it cause 
unpredictable outputs as Tom and Michael predict?


Regards,
--
-David
da...@pgmasters.net



Re: Re: libpq host/hostaddr/conninfo inconsistencies

2019-03-21 Thread David Steele

On 2/22/19 9:44 PM, Fabien COELHO wrote:


Hmmm, I do not buy the typing argument: "host" is actually for 
everything, including directories. I do not think that adding "hostdir" 
would be desirable.



In any case, the existing doco never comes out and states either
rule set in so many words.  Maybe it should.


Personally I like the second and third edit from Robert's patch, but not 
the first one.  I'm having a hard time seeing why you would specify host 
*and* hostaddr as this seems to imply.


I also agree with Fabien's comment that host can be a path -- it's 
really a path to a socket file, but it's certainly not a host name or IP 
address.  Perhaps that should be called out explicitly.


Regards,
--
-David
da...@pgmasters.net



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-21 Thread Peter Geoghegan
On Tue, Mar 19, 2019 at 4:15 PM Peter Geoghegan  wrote:
> Heikki and I discussed this issue privately, over IM, and reached
> final agreement on remaining loose ends. I'm going to use his code for
> _bt_findsplitloc(). Plan to push a final version of the first four
> patches tomorrow morning PST.

I've committed the first 4 patches. Many thanks to Heikki for his very
valuable help! Thanks also to the other reviewers.

I'll likely push the remaining two patches on Sunday or Monday.

-- 
Peter Geoghegan



Re: Re: pgbench - add pseudo-random permutation function

2019-03-21 Thread David Steele

Hi Hironobu,

On 3/3/19 12:55 PM, Fabien COELHO wrote:



Indeed, the patch needs a rebase & conflict resolution. I'll do it. Later.


Here is an update:

  - take advantage of pg_bitutils (although I noted that the "slow"
    popcount there could be sped up and shortened with a bitwise operator
    implementation that I just removed from pgbench).

  - add comments about the bijective transformations in the code.

As already stated, this function makes sense for people who want to test 
performance with pgbench using non-uniform random values. If you do not want 
to do that, you will probably find the function pretty useless. I can't 
help it.


Non-uniform random values are also a way to test PostgreSQL lock contention behavior.


You have signed up as a reviewer for this patch.  Do you know when 
you'll have time to do the review?


Regards,
--
-David
da...@pgmasters.net



Re: Re: ToDo: show size of partitioned table

2019-03-21 Thread David Steele

On 3/14/19 6:19 AM, Amit Langote wrote:

On 2019/03/14 2:11, Pavel Stehule wrote:



I've attached v11 of the patch, which merges most of Justin's changes and
some of my own on top -- documentation and partition size column names.


Maybe, we should set this ready for committer then?


There don't appear to be any objections.  Perhaps you should do that?

Regards,
--
-David
da...@pgmasters.net



Re: GiST VACUUM

2019-03-21 Thread Heikki Linnakangas

On 21/03/2019 18:06, Andrey Borodin wrote:

On 21 March 2019, at 2:30, Heikki Linnakangas
wrote: one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record, when a
deleted page is recycled, to prevent the deletion from being
replayed too early in the hot standby. See _bt_getbuf() and
btree_xlog_reuse_page(). I think we need to do something similar in
GiST.

I'll try fixing that tomorrow, unless you beat me to it. Making
the changes is pretty straightforward, but it's a bit cumbersome to
test.


I've tried to deal with it and stuck...


So, I came up with the attached. I basically copy-pasted the page-reuse 
WAL-logging stuff from nbtree.


When I started testing this, I quickly noticed that empty pages were not 
being deleted nearly as much as I expected. I tracked it to this check 
in gistdeletepage:



+   if (GistFollowRight(leafPage)
+   || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
+   {
+   /* Don't mess with a concurrent page split. */
+   return false;
+   }


That NSN test was bogus. It prevented the leaf page from being reused, 
if the parent page was *ever* split after the leaf page was created. I 
don't see any reason to check the NSN here. The NSN is normally used to 
detect if a (leaf) page has been concurrently split, when you descend 
the tree. We don't need to care about that here; as long as the 
FOLLOW_RIGHT flag is not set, the page has a downlink, and if we can 
find the downlink and the page is empty, we can delete it.


After removing that bogus NSN check, page reuse became much more 
effective. I've been testing this by running this test script repeatedly:


--
/*
create sequence gist_test_seq;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
*/

insert into gist_point_tbl (id, p)
   select nextval('gist_test_seq'), point(nextval('gist_test_seq'), 
1000 + g) from generate_series(1, 1) g;


delete from gist_point_tbl where id < currval('gist_test_seq') - 2;
vacuum gist_point_tbl;

select pg_table_size('gist_point_tbl'), pg_indexes_size('gist_point_tbl');
--

It inserts a bunch of rows, deletes a bunch of older rows, and vacuums. 
The interesting thing here is that the key values keep "moving", so that 
new tuples are added to different places than where old ones are 
removed. That's the case where page reuse is needed.


Before this patch, the index bloated very quickly. With the patch, it 
still bloats, because we still don't delete internal pages. But it's a 
small fraction of the bloat you got before.


Attached is the latest patch version, to be applied on top of the 
IntegerSet patch.



I think we should make B-tree WAL record for this shared, remove
BlockNumber and other unused stuff, leaving just xid and db oid. And
reuse this record for B-tree, GiST and GIN (yeah, it is not checking
for that conflict).
Good point. For now, I didn't try to generalize this, but perhaps we 
should.



Though, I'm not sure it is important for GIN. Scariest thing that can
happen: it will return same tid twice. But it is doing bitmap scan,
you cannot return same bit twice...


Hmm. Could it return a completely unrelated tuple? We don't always 
recheck the original index quals in a bitmap index scan, IIRC. Also, a 
search might get confused if it's descending down a posting tree, and 
lands on a different kind of a page, altogether.


Alexander, you added the mechanism to GIN recently, to prevent pages 
from being reused too early (commit 52ac6cd2d0). Do we need something 
like B-tree's REUSE_PAGE records in GIN, too, to prevent the same bug 
from happening in hot standby?



PS. for Gist, we could almost use the LSN / NSN mechanism to detect the 
case that a deleted page is reused: Add a new field to the GiST page 
header, to store a new "deleteNSN" field. When a page is deleted, the 
deleted page's deleteNSN is set to the LSN of the deletion record. When 
the page is reused, the deleteNSN field is kept unchanged. When you 
follow a downlink during search, if you see that the page's deleteNSN > 
parent's LSN, you know that it was concurrently deleted and recycled, 
and should be ignored. That would allow reusing deleted pages 
immediately. Unfortunately that would require adding a new field to the 
gist page header/footer, which requires upgrade work :-(. Maybe one day, 
we'll bite the bullet. Something to keep in mind, if we have to change 
the page format anyway, for some reason.


- Heikki
From d7a77ad483251b62514778d2235516f6f9237ad7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2019 20:24:44 +0200
Subject: [PATCH 2/2] Delete empty pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan GiST two times. At first pass we make note of empty
pages and internal pages. At second pass we scan through internal pages
looking for references

Re: GiST VACUUM

2019-03-21 Thread Andrey Borodin



> On 21 March 2019, at 2:30, Heikki Linnakangas wrote:
> one remaining issue that needs to be fixed:
> 
> During Hot Standby, the B-tree code writes a WAL record, when a deleted 
> page is recycled, to prevent the deletion from being replayed too early 
> in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
> think we need to do something similar in GiST.
> 
> I'll try fixing that tomorrow, unless you beat me to it. Making the 
> changes is pretty straightforward, but it's a bit cumbersome to test.

I've tried to deal with it and stuck... I think we should make B-tree WAL 
record for this shared, remove BlockNumber and other unused stuff, leaving just 
xid and db oid.
And reuse this record for B-tree, GiST and GIN (yeah, it is not checking for 
that conflict).

Though, I'm not sure it is important for GIN. Scariest thing that can happen: 
it will return same tid twice. But it is doing bitmap scan, you cannot return 
same bit twice...

Eventually, hash, spgist and others will have the same thing too.

Best regards, Andrey Borodin.


Re: BUG #15572: Misleading message reported by "Drop function operation" on DB with functions having same name

2019-03-21 Thread Tom Lane
David Rowley  writes:
> [ drop_func_if_not_exists_fix_v9.patch ]

Pushed with mostly-cosmetic adjustments.

I noticed a couple of loose ends that are somewhat outside the scope
of the bug report, but maybe are worth considering now:

1. There's some inconsistency in the wording of the error messages in
these routines, eg

 errmsg("%s is not a function",
vs
 errmsg("%s is not a procedure",
vs
 errmsg("function %s is not an aggregate",

Also there's
 errmsg("function name \"%s\" is not unique",
where elsewhere in parse_func.c, we find
 errmsg("function %s is not unique",

You didn't touch this and I didn't either, but maybe we should try to
make these consistent?

2. Consider

regression=# CREATE FUNCTION ambig(int) returns int as $$ select $1; $$ 
language sql;
CREATE FUNCTION
regression=# CREATE PROCEDURE ambig() as $$ begin end; $$ language plpgsql;
CREATE PROCEDURE
regression=# DROP PROCEDURE ambig;
ERROR:  procedure name "ambig" is not unique
HINT:  Specify the argument list to select the procedure unambiguously.

Arguably, because I said "drop procedure", there's no ambiguity here;
but we don't account for objtype while doing the lookup.

I'm inclined to leave point 2 alone, because we haven't had complaints
about it, and because I'm not sure we could make it behave in a clean
way given the historical ambiguity about what OBJECT_FUNCTION should
match.  But perhaps it's worth discussing.

regards, tom lane



Re: Fwd: Add tablespace tap test to pg_rewind

2019-03-21 Thread Shaoqi Bai
On Tue, Mar 12, 2019 at 10:27 AM Michael Paquier 
wrote:

> It could be an idea to split the patch in two pieces:
> - One patch which refactors the code for the new option in
> PostgresNode.pm
> - Second patch for the new test with integration in RewindTest.pm.
> This should touch different parts of the code, so combining both would
> be fine as well for me :)
> --
> Michael
>

Have updated the patch doing as you suggested


On Wed, Mar 20, 2019 at 7:45 AM Michael Paquier  wrote:

> I would have avoided
> extra routines in the patch, like what you have done with
> create_standby_tbl_mapping(), but instead do something like
> init_from_backup() which is able to take extra parameters in a way
> similar to has_streaming and has_restoring.  However, the trick with
> tablespace mapping is that the caller of backup() should be able to
> pass down multiple tablespace mapping references to make that a
> maximum portable.
> --
> Michael


Also updated the patch to achieve your suggestion.


0001-Refactors-the-code-for-the-new-option-in-PostgresNod.patch
Description: Binary data


0002-Add-new-test-with-integration-in-RewindTest.pm.patch
Description: Binary data


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-21 Thread David Rowley
On Mon, 18 Mar 2019 at 02:18, Tomas Vondra  wrote:
> Yes, it was using the toasted value directly. The attached patch
> detoasts the value explicitly, similarly to the per-column stats, and it
> also removes the 1MB limit.

I just made a pass over 0001 and 0002.

0002 is starting to look pretty good, but I did note down a few things
while looking.  Some things below might just me being unclear on how
something works. Perhaps that means more comments are needed, but it
might also mean I need a brain upgrade. I'm hoping it's the former.

0001:

1. Could you write a full commit message for this patch. Without
reading the backlog on this ticket it's not all that obvious what the
patch aims to fix. (I have read the backlog, so I know, but the next
person might not have)

2. Should all the relpages variables be BlockNumber rather than double?

0002:

3. I'm not sure what the following is trying to say:

* Estimate selectivity on any clauses applicable by stats tracking
* actual values first, then apply functional dependencies on the
* remaining clauses.

can you reword it?

4. This seems out of date:

* clauses that we've already estimated for.  Each selectivity
* function will set the appropriate bit in the bitmapset to mark that
* no further estimation is required for that list item.

We're only passing estimatedclauses to 1 function before
clauselist_selectivity_simple is called for the remainder.

5. In build_attnums_array there's
Assert(AttrNumberIsForUserDefinedAttr(j)); I just wanted to point out
that this could only possibly trigger of the bitmapset had a 0 member.
It cannot have negative members.  Maybe it would be worth adding a
comment to acknowledge that as it looks a bit misguided otherwise.

6. In build_attnums_array(), what's the reason to return int *, rather
than an AttrNumber * ? Likewise in the code that calls that function.

7. Not properly indented. Should be two tabs.

 * build sorted array of SortItem with values from rows

Should also be "a sorted array"

8. This comment seems to duplicate what is just mentioned in the
header comment for the function.

/*
* We won't allocate the arrays for each item independenly, but in one
* large chunk and then just set the pointers. This allows the caller to
* simply pfree the return value to release all the memory.
*/

Also, typo "independenly" -> "independently"

9. Not properly indented:

/*
 * statext_is_compatible_clause_internal
 * Does the heavy lifting of actually inspecting the clauses for
 * statext_is_compatible_clause. It needs to be split like this because
 * of recursion.  The attnums bitmap is an input/output parameter collecting
 * attribute numbers from all compatible clauses (recursively).
 */

10. Header comment for get_mincount_for_mcv_list() ends with
*-- but does not start with that.

11. In get_mincount_for_mcv_list() it's probably better to have the
numerical literals of 0.0 instead of just 0.

12. I think it would be better if you modified build_attnums_array()
to add an output argument that sets the size of the array. It seems
that most places you call this function you perform bms_num_members()
to determine the array size.

13. This comment seems to be having a fight with itself:

* Preallocate Datum/isnull arrays (not as a single chunk, as we will
* pass the result outside and thus it needs to be easy to pfree().
*
* XXX On second thought, we're the only ones dealing with MCV lists,
* so we might allocate everything as a single chunk to reduce palloc
* overhead (chunk headers, etc.) without significant risk. Not sure
* it's worth it, though, as we're not re-building stats very often.

14. The following might be easier to read if you used a local variable
instead of counts[dim].

for (i = 0; i < mcvlist->nitems; i++)
{
/* skip NULL values - we don't need to deduplicate those */
if (mcvlist->items[i]->isnull[dim])
continue;

values[dim][counts[dim]] = mcvlist->items[i]->values[dim];
counts[dim] += 1;
}

Then just assign the value of the local variable to counts[dim] at the end.

15. Why does this not use stats[dim]->attrcollid ?

ssup[dim].ssup_collation = DEFAULT_COLLATION_OID;

16. The following:

else if (info[dim].typlen == -2) /* cstring */
{
info[dim].nbytes = 0;
for (i = 0; i < info[dim].nvalues; i++)
{
values[dim][i] = PointerGetDatum(PG_DETOAST_DATUM(values[dim][i]));
info[dim].nbytes += strlen(DatumGetCString(values[dim][i]));
}
}

seems to conflict with:

else if (info[dim].typlen == -2) /* cstring */
{
memcpy(data, DatumGetCString(v), strlen(DatumGetCString(v)) + 1);
data += strlen(DatumGetCString(v)) + 1; /* terminator */
}

It looks like you'll reserve 1 byte too few for each cstring value.

(Might also be nicer to assign the strlen to a local variable rather
than leave it up to the compiler to optimize out the 2nd strlen call
in the latter of the two code fragments above.)

17. I wonder if some compilers will warn about this:

ITEM_INDEXES(item)[dim] = (value - values[dim]);

Probably a cast to uint16 mi

Re: partitioned tables referenced by FKs

2019-03-21 Thread Jesper Pedersen

Hi Alvaro,

On 3/18/19 6:02 PM, Alvaro Herrera wrote:

I spent a few hours studying this and my conclusion is the opposite of
yours: we should make addFkRecurseReferencing the recursive one, and
CloneFkReferencing a non-recursive caller of that.  So we end up with
both addFkRecurseReferenced and addFkRecurseReferencing as recursive
routines, and CloneFkReferenced and CloneFkReferencing being
non-recursive callers of those.  With this structure change there is one
more call to CreateConstraintEntry than before, and now there are two
calls of tryAttachPartitionForeignKey instead of one; I think with this
new structure things are much simpler.  I also changed
CloneForeignKeyConstraints's API: instead of returning a list of cloned
constraint giving its caller the responsibility of adding FK checks to
phase 3, we now give CloneForeignKeyConstraints the 'wqueue' list, so
that it can add the FK checks itself.  It seems much cleaner this way.



Using

-- ddl.sql --
CREATE TABLE t1 (i1 INT PRIMARY KEY, i2 INT NOT NULL) PARTITION BY HASH 
(i1);
CREATE TABLE t2 (i1 INT PRIMARY KEY, i2 INT NOT NULL) PARTITION BY HASH 
(i1);


\o /dev/null
SELECT 'CREATE TABLE t1_p' || x::text || ' PARTITION OF t1
FOR VALUES WITH (MODULUS 64, REMAINDER ' || x::text || ');'
from generate_series(0,63) x;
\gexec
\o

\o /dev/null
SELECT 'CREATE TABLE t2_p' || x::text || ' PARTITION OF t2
FOR VALUES WITH (MODULUS 64, REMAINDER ' || x::text || ');'
from generate_series(0,63) x;
\gexec
\o

ALTER TABLE t1 ADD CONSTRAINT fk_t1_i2_t2_i1 FOREIGN KEY (i2) REFERENCES 
t2(i1);


ANALYZE;


with

-- select.sql --
\set a random(1, 10)
SELECT t1.i1 AS t1i1, t1.i2 AS t1i2, t2.i1 AS t2i1, t2.i2 AS t2i2 FROM 
t1, t2 WHERE t1.i1 = :a;


running

pgbench -M prepared -f select.sql 

I'm seeing 82.64% spent in GetCachedPlan(). plan_cache_mode is auto.

Best regards,
 Jesper



Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-03-21 Thread Darafei Praliaskouski
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

This patch is particularly helpful for processing OpenStreetMap data in PostGIS.
OpenStreetMap is imported as a stream of 300-900 gigabytes (depending on 
settings), which needs a VACUUM after a COPY FREEZE.
With this patch, the first and usually the last transforming query runs 
much faster after the initial load.
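
For readers unfamiliar with the workflow, here is a minimal sketch of the load
pattern described above; the table definition and file path are hypothetical,
and COPY ... FREEZE requires the table to be created or truncated in the same
transaction:

BEGIN;
-- Hypothetical staging table for OpenStreetMap points.
CREATE TABLE osm_point (
    osm_id bigint,
    tags   jsonb,
    x      double precision,
    y      double precision
);
COPY osm_point FROM '/data/osm_point.copy' WITH (FREEZE);
COMMIT;

-- With the patch, the heap pages are already marked all-visible/all-frozen
-- during the COPY, so this first VACUUM no longer has to dirty and rewrite
-- every page of the table.
VACUUM osm_point;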

I have read this patch and have no outstanding comments on it.
Pavan Deolasee demonstrates the expected speed improvement.

The new status of this patch is: Ready for Committer


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-03-21 Thread Pavan Deolasee
On Thu, Mar 14, 2019 at 3:54 PM Masahiko Sawada 
wrote:

>
> >
> >
> > Ok. I will run some tests. But please note that this patch is a bug fix
> to address the performance issue that is caused by having to rewrite the
> entire table when all-visible bit is set on the page during first vacuum.
> So while we may do some more work during COPY FREEZE, we're saving a lot of
> page writes during next vacuum. Also, since the scan that we are doing in
> this patch is done on a page that should be in the buffer cache, we will
> pay a bit in terms of CPU cost, but not anything in terms of IO cost.
>
> Agreed. I had been misunderstanding this patch. The page scan during
> COPY FREEZE is necessary and it's very cheaper than doing in the first
> vacuum.
>

Thanks for agreeing to the need of this bug fix. I ran some simple tests
anyways and here are the results.

The test consists of a simple table with three columns, two integers and
one char(100). I then ran COPY (FREEZE), loading 7M rows, followed by a
VACUUM. The total size of the raw data is about 800MB and the table size in
Postgres is just under 1GB. The results for 3 runs in milliseconds are:

Master:
 COPY FREEZE: 40243.725   40309.675   40783.836
 VACUUM: 2685.871   2517.445   2508.452

Patched:
 COPY FREEZE: 40942.410  40495.303   40638.075
 VACUUM: 25.067  35.793   25.390

So there is a slight increase in the time to run COPY FREEZE, but a
significant reduction in time to VACUUM the table. The benefits will only
go up if the table is vacuumed much  later when most of the pages are
already written to the disk and removed from shared buffers and/or kernel
cache.

I hope this satisfies your doubts regarding performance implications of the
patch.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Automated way to find actual COMMIT LSN of subxact LSN

2019-03-21 Thread Tom Lane
Jeremy Finzel  writes:
>> I'd be in favor of that for recovery_target_xid, but I'm not at all
>> convinced about changing the behavior for a target LSN.  The fact that
>> the target is a subcommit seems irrelevant when you specify by LSN.

> For this use case, my goal is simply to be able to recover to the point
> immediately after a particular decoded log line is visible, without
> necessarily having to find out the final parent transaction id.

> Given this, I am open to different implementations but I would like to
> either be able to specify an LSN or transaction ID, and have a feature that
> allows the recovery target to roll forward just until it is visible, even
> if the LSN or transaction ID is not the actual commit of the parent
> transaction.

I find this to be unactionably vague.  What does it mean to claim "an
LSN is visible"?  An LSN might not even point to a WAL record, or it
might point to one that has nontransactional effects.  Moreover, any
behavior of this sort would destroy what I regard as a bedrock property
of recover-to-LSN, which is that there's a well defined, guaranteed-finite
stopping point.  (A property that recover-to-XID lacks, since the
specified XID might've crashed without recording either commit or abort.)

I think what you ought to be doing is digging the xmin out of the inserted
tuple you're concerned with and then doing recover-to-XID.  There's
definitely room for us to make it easier if the XID is a subxact rather
than a main xact.  But I think identifying the target XID is your job
not the job of the recovery-stop-point mechanism.
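
A minimal sketch of that approach, with hypothetical table, column, and XID
values:

-- Look up the transaction that inserted the row of interest
-- (table name and predicate are hypothetical).
SELECT xmin FROM orders WHERE order_id = 12345;

-- Then recover to that transaction, e.g. in the recovery configuration:
--   recovery_target_xid = '564231'    -- the value returned above
-- Note that if the insert happened in a subtransaction, this xmin is the
-- subxact's XID, which is exactly the case the thread proposes to handle.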

regards, tom lane



Re: speeding up planning with partitions

2019-03-21 Thread Imai Yoshikazu
Hi Jesper,

On 2019/03/20 23:25, Jesper Pedersen wrote:> Hi,
 >
 > On 3/19/19 11:15 PM, Imai, Yoshikazu wrote:
 >> Here the details.
 >>
 >> [creating partitioned tables (with 1024 partitions)]
 >> drop table if exists rt;
 >> create table rt (a int, b int, c int) partition by range (a);
 >> \o /dev/null
 >> select 'create table rt' || x::text || ' partition of rt for values
 >> from (' ||
 >>   (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1,
 >> 1024) x;
 >> \gexec
 >> \o
 >>
 >> [select1024.sql]
 >> \set a random (1, 1024)
 >> select * from rt where a = :a;
 >>
 >> [pgbench]
 >> pgbench -n -f select1024.sql -T 60
 >>
 >>
 >
 > My tests - using hash partitions - shows that the extra time is spent in
 > make_partition_pruneinfo() for the relid_subplan_map variable.
 >
 >64 partitions: make_partition_pruneinfo() 2.52%
 > 8192 partitions: make_partition_pruneinfo() 5.43%
 >
 > TPS goes down ~8% between the two. The test setup is similar to the 
above.
 >
 > Given that Tom is planning to change the List implementation [1] in 13 I
 > think using the palloc0 structure is ok for 12 before trying other
 > implementation options.
 >
 > perf sent off-list.
 >
 > [1] 
https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us
 >
 > Best regards,
 >   Jesper
 >
 >

Thanks for testing.
Yeah, after looking at the code, I think the bottleneck seems to be lurking in 
another place that this patch doesn't change. I also think the patch is OK as 
it is for 12, and this problem will be fixed in 13.

--
Yoshikazu Imai


Re: PostgreSQL pollutes the file system

2019-03-21 Thread Tom Lane
Andreas Karlsson  writes:
> On 3/21/19 7:07 AM, Chris Travers wrote:
>> 1.  createuser/dropuser are things that I don't consider good ways of 
>> creating users anyway.

> Those binaries are pretty convenient to use in scripts since they handle 
> SQL escaping for you, but probably not convenient enough that we would 
> have added createuser today.

> Compare
> createuser "$USER"
> vs
> echo 'CREATE ROLE :"user" LOGIN' | psql postgres -v "user=$USER"

Hmm.  That example is actually quite scary, because while nearly
anybody who's ever done any shell scripting would get the first
one right, the second one requires a fair deal of specialized
knowledge and creativity.  I fear that 99% of people would have
coded it like

echo "CREATE USER $USER" | psql

or some variant on that, and now they have a SQL-injection
hazard that they didn't have before.
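
To make the hazard concrete, a hypothetical value of $USER is enough to turn
the naive version into two statements:

-- If USER happened to contain the (hypothetical) string
--     bob SUPERUSER; DROP DATABASE production; --
-- then  echo "CREATE USER $USER" | psql  would send the server:
CREATE USER bob SUPERUSER; DROP DATABASE production; --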

So there seems like a real risk that taking away createuser would
result in security holes, not just annoying-but-trivial script update
work.  That puts me more in the camp of "if we're going to do anything,
rename it with a pg_ prefix" than "if we're going to do anything,
remove it".

regards, tom lane



Re: Contribution to Perldoc for TestLib module in Postgres

2019-03-21 Thread Prajwal A V
Sure, please go ahead.

Regards,
Prajwal.

On Thu, 21 Mar 2019, 19:11 Ramanarayana,  wrote:

> Hi,
> Can I take this up?
>
> Regards,
> Ram
>


Re: jsonpath

2019-03-21 Thread Alexander Korotkov
On Tue, Mar 19, 2019 at 8:10 PM Alexander Korotkov
 wrote:
> On Sun, Mar 17, 2019 at 6:03 PM Tom Lane  wrote:
> > Andrew Dunstan  writes:
> > > Why are we installing the jsonpath_gram.h file? It's not going to be
> > > used by anything else, is it? TBH, I'm not sure I see why we're
> > > installing the scanner.h file either.
> >
> > As near as I can see, jsonpath_gram.h and jsonpath_scanner.h exist only
> > for communication between jsonpath_gram.y and jsonpath_scan.l.  Maybe
> > we'd be better off handling that need by compiling the .l file as part
> > of the .y file, as we used to do with the core lexer and still do with
> > several others (cf comments for commit 72b1e3a21); then we wouldn't
> > even have to generate these files much less install them.
> >
> > A quick look at jsonpath_scan.c shows that it's pretty innocent of
> > the tricks we've learned over the years for flex/bison portability;
> > in particular I see that it's #including  before postgres.h,
> > which is a no-no.  So that whole area needs more review anyway.
>
> Attached patch is getting rid of jsonpath_gram.h.  Would like to see
> results of http://commitfest.cputube.org/ before committing, because
> I'm not available to test this on a Windows machine.  There would be
> further patches rearranging jsonpath_gram.y and jsonpath_scan.l.

Attached are patches improving the jsonpath parser.  The first one introduces
cosmetic changes, while the second gets rid of backtracking.  I'm also
planning to add a high-level comment for both the grammar and the lexer.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-cosmetic-changes-jsonpath-parser.patch
Description: Binary data


0002-get-rid-of-backtracking.patch
Description: Binary data


Re: Contribution to Perldoc for TestLib module in Postgres

2019-03-21 Thread Ramanarayana
Hi,
Can I take this up?

Regards,
Ram


Re: Re: A separate table level option to control compression

2019-03-21 Thread Pavan Deolasee
Hi,

On Thu, Mar 21, 2019 at 3:10 AM Shaun Thomas 
wrote:

>
> I can't really speak for the discussion related to `storage.sgml`, but
> I based my investigation on the existing patch to `create_table.sgml`.
> About the only thing I would suggest there is to possibly tweak the
> wording.
>
> * "The compress_tuple_target ... " for example should probably read
>   "The toast_tuple_target parameter ...".
> * "the (blocksize - header)" can drop "the".
> * "If the value is set to a value" redundant wording should be rephrased;
>   "If the specified value is greater than toast_tuple_target, then
>   we will substitute the current setting of toast_tuple_target instead."
>   would work.
>

Thanks Shaun. Attached patch makes these adjustments.


> * I'd recommend a short discussion on what negative consequences can be
>   expected by playing with this value. As an example in my tests, setting
>   it very high may result in extremely sparse pages that could have an
>   adverse impact on HOT updates.
>

Setting compress_tuple_target to a higher value won't be negative, because
toast_tuple_target is used for compression anyway when
compress_tuple_target is higher than toast_tuple_target. Maybe some
discussion could be added to the paragraph about toast_tuple_target to
explain the negative impact of a high value.
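
As a usage sketch only (the storage parameter here comes from the patch under
discussion, not from released PostgreSQL), lowering the compression threshold
for a single table might look like this:

-- Hypothetical usage of the proposed reloption: start compressing wide
-- attributes once a row exceeds 256 bytes, without changing toast_tuple_target.
CREATE TABLE event_log (
    id   bigint,
    body text
) WITH (compress_tuple_target = 256);

-- Or on an existing table:
ALTER TABLE event_log SET (compress_tuple_target = 256);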

I added a small discussion about negative effects of setting
compress_tuple_target lower though, per your suggestion.

Also added some details in storage.sgml as recommended by Sawada-san. Hope
this helps.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Add-a-table-level-option-to-control-compression.patch
Description: Binary data


Re: PostgreSQL pollutes the file system

2019-03-21 Thread Fred .Flintstone
Then someone who doesn't want the symlinks could delete them.
Or the symlinks could ship in an optional postgresql-legacy-symlinks package.

On Wed, Mar 20, 2019 at 6:32 PM Alvaro Herrera  wrote:
>
> On 2019-Mar-20, Fred .Flintstone wrote:
>
> > Even just creating symlinks would be a welcome change.
> > So the real binary is pg_foo and foo is a symbolic link that points to 
> > pg_foo.
> > Then at least I can type pg_ and use tab auto-completion to find
> > everything related to PostgreSQL.
>
> There is merit to this argument; if the starting point is an unknown
> file /usr/bin/foo, then having it be a symlink to /usr/bin/pg_foo makes
> it clear which package it belongs to.  We don't *have to* get rid of the
> symlinks any time soon, but installing as symlinks now will allow Skynet
> to get rid of them some decades from now.
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL pollutes the file system

2019-03-21 Thread Fred .Flintstone
The binaries:
* clusterdb
* createdb
* createuser
* dropdb
* dropuser
* reindexdb
* vacuumdb

On Wed, Mar 20, 2019 at 8:13 PM Jonathan S. Katz  wrote:
>
> On 3/20/19 2:08 PM, Alvaro Herrera wrote:
> > On 2019-Mar-20, Euler Taveira wrote:
> >
> >>> On Wed, 20 Mar 2019 at 14:57, Tom Lane wrote:
> >>>
> >>> We managed to get rid of createlang and droplang in v10, and there
> >>> hasn't been that much push-back about it.  So maybe there could be
> >>> a move to remove createuser/dropuser?  Or at least rename them to
> >>> pg_createuser and pg_dropuser.  But I think this was discussed
> >>> (again) during the v10 cycle, and we couldn't agree to do more than
> >>> get rid of createlang/droplang.
> >
> > Previous discussion:
> > https://postgr.es/m/CABUevExPrfPH5K5qM=zst7tvfyace+i5qja6bfwckkyrh8m...@mail.gmail.com
> >
> >> Votes? +1 to remove createuser/dropuser (and also createdb/dropdb as I
> >> said in the other email). However, if we don't have sufficient votes,
> >> let's at least consider a 'pg_' prefix.
> >
> > I vote to keep these rename these utilities to have a pg_ prefix and to
> > simultaneously install symlinks for their current names, so that nothing
> > breaks.
>
> This sounds like a reasonable plan, pending which binaries we feel to do
> that with.
>
> Pardon this naive question as I have not used such systems in a while,
> but would this work on systems that do not support symlinks?
>
> Jonathan
>



Re: PostgreSQL pollutes the file system

2019-03-21 Thread Fred .Flintstone
I would be fine with that.
We can make an exception for psql.

As long as we get rid of:
* clusterdb
* createdb
* createuser
* dropdb
* dropuser
* reindexdb
* vacuumdb

On Wed, Mar 20, 2019 at 7:11 PM Tom Lane  wrote:
>
> "Fred .Flintstone"  writes:
> > Even just creating symlinks would be a welcome change.
> > So the real binary is pg_foo and foo is a symbolic link that points to 
> > pg_foo.
> > Then at least I can type pg_ and use tab auto-completion to find
> > everything related to PostgreSQL.
>
> You'd miss psql.  I think the odds of renaming psql are not
> distinguishable from zero: whatever arguments you might want to make
> about, say, renaming initdb perhaps not affecting too many scripts
> are surely not going to fly for psql.  So that line of argument
> isn't too convincing.
>
> regards, tom lane



Re: Problem with default partition pruning

2019-03-21 Thread Thibaut


On 20/03/2019 at 10:06, Amit Langote wrote:
> Hi Thibaut,
>
> On 2019/03/19 23:58, Thibaut Madelaine wrote:
>> I kept on testing with sub-partitioning.
> Thanks.
>
>> I found a case, using 2 default partitions, where a default partition is
>> not pruned:
>>
>> --
>>
>> create table test2(id int, val text) partition by range (id);
>> create table test2_20_plus_def partition of test2 default;
>> create table test2_0_20 partition of test2 for values from (0) to (20)
>>   partition by range (id);
>> create table test2_0_10 partition of test2_0_20 for values from (0) to (10);
>> create table test2_10_20_def partition of test2_0_20 default;
>>
>> # explain (costs off) select * from test2 where id=5 or id=25;
>>    QUERY PLAN   
>> -------------------------------------
>>  Append
>>    ->  Seq Scan on test2_0_10
>>  Filter: ((id = 5) OR (id = 25))
>>    ->  Seq Scan on test2_10_20_def
>>  Filter: ((id = 5) OR (id = 25))
>>    ->  Seq Scan on test2_20_plus_def
>>  Filter: ((id = 5) OR (id = 25))
>> (7 rows)
>>
>> --
>>
>> I have the same output using Amit's v1-delta.patch or Hosoya's
>> v2_default_partition_pruning.patch.
> I think I've figured what may be wrong.
>
> Partition pruning step generation code should ignore any arguments of an
> OR clause that won't be true for a sub-partitioned partition, given its
> partition constraint.
>
> In this case, id = 25 contradicts test2_0_20's partition constraint (which
> is, a IS NOT NULL AND a >= 0 AND a < 20), so the OR clause should really
> be simplified to id = 5, ignoring the id = 25 argument.  Note that we
> remove id = 25 only for the considerations of pruning and not from the
> actual clause that's passed to the final plan, although it wouldn't be a
> bad idea to try to do that.
>
> Attached revised delta patch, which includes the fix described above.
>
> Thanks,
> Amit
Amit, I tested many cases with nested range sub-partitions... and I did
not find any problem with your last patch  :-)

I tried mixing with hash partitions with no problems.

From the patch, there seem to be fewer checks than before. I cannot
think of a case where this could have a performance impact.

Hosoya-san, if you agree with Amit's proposal, do you think you can send
a patch unifying your default_partition_pruning.patch and Amit's second
v1-delta.patch?

Regards,

Thibaut







Re: Automated way to find actual COMMIT LSN of subxact LSN

2019-03-21 Thread Jeremy Finzel
>
> It would seem like what you're asking for is to continue until the commit
> of the parent transaction, not just the next commit after the subcommit.
> Otherwise (if that's an unrelated xact) the subxact would still not be
> committed, so that you might as well have stopped short of it.
>

Right, the parent transaction is what I meant.


> I'd be in favor of that for recovery_target_xid, but I'm not at all
> convinced about changing the behavior for a target LSN.  The fact that
> the target is a subcommit seems irrelevant when you specify by LSN.
>

Perhaps some context will help.  There have been 2 cases in which I have
tried to do this, both of them based on logical decoding, and finding
either a transaction id or an LSN to recover to.  Actually, the only reason
I have ever used transaction id instead of LSN is on <= 9.6 because the
latter isn't supported until pg10.

For this use case, my goal is simply to be able to recover to the point
immediately after a particular decoded log line is visible, without
necessarily having to find out the final parent transaction id.

Given this, I am open to different implementations but I would like to
either be able to specify an LSN or transaction ID, and have a feature that
allows the recovery target to roll forward just until it is visible, even
if the LSN or transaction ID is not the actual commit of the parent
transaction.


> I don't recall this for sure, but doesn't a parent xact's commit record
> include all subxact XIDs?  If so, the implementation would just require
> searching the subxacts as well as the main XID for a match to
> recovery_target_xid.
>

Yes, I believe so.

Thanks,
Jeremy


Re: Special role for subscriptions

2019-03-21 Thread Euler Taveira
On Wed, 20 Mar 2019 at 21:57, Michael Paquier
wrote:
>
> Perhaps we would want something at database level different from GRANT
> CREATE ON DATABASE, but only for subscriptions?  This way, it is
> possible to have per-database groups having the right to create
> subscriptions, and I'd like to think that we should not include
> subcription creation into the existing CREATE rights.  It would be
> kind of funny to not have CREATE include the creation of this specific
> object though :)
>
It will be really strange but I can live with that. Another idea is a
CREATE bit to create subscriptions (without replicating) and a SUBSCRIBE
bit to replicate tables. It is not just a privilege to create a
subscription but also to modify tables on which a role doesn't have
explicit permission. Let's allocate another AclItem?


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: speeding up planning with partitions

2019-03-21 Thread Jesper Pedersen

Hi Amit,

On 3/19/19 8:41 PM, Amit Langote wrote:

I have fixed all.  Attached updated patches.



The comments in the thread have been addressed, so I have put the CF 
entry into Ready for Committer.


Best regards,
 Jesper



Re: PostgreSQL pollutes the file system

2019-03-21 Thread Andreas Karlsson

On 3/21/19 7:07 AM, Chris Travers wrote:
1.  createuser/dropuser are things that I don't consider good ways of 
creating users anyway.  I think we should just consider removing these 
binaries.  The SQL queries are better, more functional, and can be 
rolled back as a part of a larger transaction.


Those binaries are pretty convenient to use in scripts since they handle 
SQL escaping for you, but probably not convenient enough that we would 
have added createuser today.


Compare

createuser "$USER"

vs

echo 'CREATE ROLE :"user" LOGIN' | psql postgres -v "user=$USER"

Andreas



Re: PostgreSQL pollutes the file system

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-20, Tomas Vondra wrote:

> So to me this seems like a fairly invasive change (potentially breaking
> quite a few scripts/tools) just to address a minor inconvenience.

I don't think anything would break, actually.  What are you thinking
would break?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: Reporting script runtimes in pg_regress

2019-03-21 Thread Christoph Berg
Re: David Steele 2019-03-20 <8a85bece-b18f-0433-acf3-d106b31f0...@pgmasters.net>
> > > Oh, right. So the way to go would be to use _("FAILED   "), and
> > > ask translators to use the same length.
> > 
> > Note there's no translation for pg_regress.  All these _() markers are
> > currently dead code.  It seems hard to become motivated to translate
> > that kind of program.  I don't think it has much value, myself.
> 
> This patch has been "Waiting on Author" since March 8th.  Do you know when
> you'll have a new version ready?

Here is a new revision that blank-pads "ok" to the length of "FAILED".

Christoph
>From cd2a5927b52473ca9c56e2e5142c257853429224 Mon Sep 17 00:00:00 2001
From: Christoph Berg 
Date: Thu, 21 Feb 2019 10:35:19 +0100
Subject: [PATCH] Align timestamps in pg_regress output

---
 src/test/regress/pg_regress.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index a18a6f6c45..8111d95b1e 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -1799,7 +1799,7 @@ run_schedule(const char *schedule, test_function tfunc)
 			}
 			else
 			{
-status(_("ok"));
+status(_("ok    ")); /* align with FAILED */
 success_count++;
 			}
 
@@ -1879,7 +1879,7 @@ run_single_test(const char *test, test_function tfunc)
 	}
 	else
 	{
-		status(_("ok"));
+		status(_("ok    ")); /* align with FAILED */
 		success_count++;
 	}
 
-- 
2.20.1



Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3

2019-03-21 Thread Antonin Houska
Tom Lane  wrote:

> Robert Haas  writes:
> > If the WAL *is* encrypted, the state at this point is that the block
> > is unreadable, because the first 4kB of the block is the first half of
> > the bits that resulted from encrypting 8kB of data that includes the
> > new record, and the second 4kB of the block is the second half of the
> > bits that resulted from encrypting 8kB of data that did not include
> > the new record, and since the new record perturbed every bit on the
> > page with probability ~50%, what that means is you now have garbage.
> > That means that not only did we lose the new record, but we also lost
> > the 3.5kB of good data which the page previously contained.  That's
> > NOT ok.  Some of the changes associated with those WAL records may
> > have been flushed to disk already, or there may be commits in there
> > that were acknowledged to the client, and we can't just lose them.
> 
> ISTM that this is only a problem if you choose the wrong encryption
> method.  One not-wrong encryption method is to use a stream cipher
> --- maybe that's not the exact right technical term, but anyway
> I'm talking about a method which notionally XOR's the cleartext
> data with a random bit stream generated from the encryption key
> (probably along with other knowable inputs such as the block number).
> In such a method, corruption of individual on-disk bytes doesn't
> prevent you from getting the correct decryption of on-disk bytes
> that aren't corrupted.

We actually use a block cipher (with a block size of 16 bytes), as opposed to a
stream cipher. It's true that a partial write is a problem, because if a single
bit of the cipher text changes, decryption will produce 16 bytes of
garbage. However, I'm not sure whether a partial write can affect as small a
unit as 16 bytes.

Nevertheless, with the current version of our patch, PG should be resistant
against such a partial write anyway because we chose to align XLOG records to
16 bytes (as long as the encryption is enabled) for the following reasons:

If one XLOG record ends and the following one starts in the same encryption
block, both records can get corrupted during streaming replication. The
scenario looks like: 1) the first record is written on master (the unused part
of the block contains zeroes), 2) the block is encrypted and its initial part
(i.e. the number of bytes occupied by the first record in the plain text) is
streamed to slave, 3) the second record is written on master, 4) the
containing encryption block is encrypted again and the trailing part (i.e. the
number of bytes occupied by the second record) is streamed, 5) decryption of
the block on slave will produce garbage and thus corrupt both records. This is
because the trailing part of the block was filled with zeroes during
encryption, but it contains different data at decryption time.

An alternative approach to this replication problem is that walsender decrypts
the stream and walreceiver encrypts it again. While this provides the advantage
of having master and slave encrypted with different keys, it brings some
additional complexity. For example, pg_basebackup would need to deal with
encryption.

This design decision can be changed, but there's one more thing to consider:
if the XLOG stream is decrypted, the decryption cannot be disabled unless the
XLOG records are aligned to 16 bytes (and in turn, the XLOG alignment cannot
be enabled w/o initdb).

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: [HACKERS] Block level parallel vacuum

2019-03-21 Thread Sergei Kornilov
Hello

> * in_parallel is true if we're performing parallel lazy vacuum. Since any
> * updates are not allowed during parallel mode we don't update statistics
> * but set the index bulk-deletion result to *stats. Otherwise we update it
> * and set NULL.

lazy_cleanup_index has the in_parallel argument only for this purpose, but the 
caller still has to check in_parallel after the lazy_cleanup_index call and do 
something else with the stats for parallel execution.
Would it be better to always return the stats and update the statistics in the 
caller? Could all index stats be updated in lazy_vacuum_all_indexes, for example? 
That routine is always the parallel leader and has the comment /* Do post-vacuum 
cleanup and statistics update for each index */ on the for_cleanup=true call.

I think we need a note in the documentation that the parallel leader is not 
counted in the PARALLEL N option, so with PARALLEL 2 we will use 3 processes. 
Or should we even change the behavior? Default with PARALLEL 1 - only the 
current backend runs in a single process; with PARALLEL 2 - the leader plus one 
parallel worker, i.e. two processes working in parallel.
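
To make the counting concrete, a sketch using the syntax proposed by the patch
(not in released PostgreSQL; the table name is hypothetical):

-- Proposed syntax from the patch under discussion.
-- As described above, the leader is not counted in PARALLEL N, so this runs
-- with 1 leader + 2 parallel workers, i.e. 3 processes.
VACUUM (PARALLEL 2, VERBOSE) bloated_table;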

regards, Sergei



Re: Special role for subscriptions

2019-03-21 Thread Evgeniy Efimkin
Hi!

> Perhaps we would want something at database level different from GRANT
> CREATE ON DATABASE, but only for subscriptions?
How about these 4 checks for creating a subscription as a non-superuser
(sketched below)?
1. A special role for CREATE SUBSCRIPTION
2. The CREATE ON DATABASE privilege
3. INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES privileges on the target table
4. The target table is not in information_schema or pg_catalog
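
A purely hypothetical sketch of what those checks could look like from the
DBA's side; the role, database, and table names are made up, and the special
role in item 1 does not exist today:

-- (1) a hypothetical special role granting the right to create subscriptions
GRANT pg_create_subscription TO repl_admin;
-- (2) the database-level privilege
GRANT CREATE ON DATABASE target_db TO repl_admin;
-- (3) privileges on each replicated target table
GRANT INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES ON public.orders TO repl_admin;
-- (4) would be a server-side check: target tables must not live in
--     pg_catalog or information_schema.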

 
Efimkin Evgeny




Re: Special role for subscriptions

2019-03-21 Thread Evgeniy Efimkin
Hi!
> - If the user's permissions are later revoked, the subscription is unaffected.
It works the same way now: if we revoke superuser, the subscription is unaffected 
and replication still works.

Not checking grants in the target database is very dangerous; I created a 
publication with system tables (it's not difficult):

select * from pg_publication_tables ;
 pubname | schemaname | tablename
-++
 pub | pg_catalog | pg_authid
(1 row)

After that I created a subscription, and in the log I see:
2019-03-21 11:19:50.863 MSK [58599] LOG:  logical replication table 
synchronization worker for subscription "sub_nosuper", table "pg_authid" has 
started
2019-03-21 11:19:51.039 MSK [58599] ERROR:  null value in column "oid" violates 
not-null constraint
2019-03-21 11:19:51.039 MSK [58599] DETAIL:  Failing row contains (null, 
pg_monitor, f, t, f, f, f, f, f, -1, null, null).
2019-03-21 11:19:51.039 MSK [58599] CONTEXT:  COPY pg_authid, line 1: 
"pg_monitor   f   t   f   f   f   f   f   -1  \N  \N"

I think it would be no problem to use this to attack the target server after 
some compromise on the publication side.

 
Efimkin Evgeny




Re: Offline enabling/disabling of data checksums

2019-03-21 Thread Fabien COELHO



Anyway, as this stuff is very useful for upgrade scenarios 
a-la-pg_upgrade, for backup validation and as it does not produce false 
positives, I would really like to get something committed for v12 in its 
simplest form...


Fine with me, the detailed doc is not a showstopper and can be improved 
later one.


Are there any recommendations that people would like to add to the 
documentation?


For me, I just want at least a clear warning on potential risks.

--
Fabien.



Re: Offline enabling/disabling of data checksums

2019-03-21 Thread Fabien COELHO


Hello Michaël,


On Wed, Mar 20, 2019 at 05:46:32PM +0100, Fabien COELHO wrote:

I think that the motivation/risks should appear before the solution. "As xyz
..., ...", or there at least the logical link should be outlined.

It is not clear for me whether the following sentences, which seems specific
to "pg_rewind", are linked to the previous advice, which seems rather to
refer to streaming replication?


Do you have a better idea of formulation?


I can try, but I must admit that I'm fuzzy about the actual issue. Is 
there a problem with streaming replication and inconsistent checksum 
settings, or not?


You seem to suggest that the issue is more about how some commands or 
backup tools operate on a cluster.


I'll reread the thread carefully and will make a proposal.


Imagine for example a primary-standby with checksums disabled: [...]


Yep, that's cool.


Shouldn't disabling in reverse order be safe? The checksums are not checked
afterwards?


I don't quite understand your comment about the ordering.  If all the 
standbys are destroyed first, then enabling/disabling checksums happens 
at a single place.


Sure. I was suggesting that disabling on replicated clusters is possibly 
safer, but I do not know the details of replication & checksumming with 
enough precision to be sure about it.



After the reboot, some data files are not fully updated with their
checksums, although the control file says that they are. It should then
fail after a restart when a no-checksum page is loaded?

What am I missing?


Please note that we do that in other tools as well and we live fine
with that as pg_basebackup, pg_rewind just to name two.


The fact that other commands are exposed to the same potential risk is not 
a very good argument not to fix it.


I am not saying that it is not a problem in some cases, but I am saying 
that this is not a problem that this patch should solve.


As solving the issue involves exchanging two lines and turning one boolean 
parameter to true, I do not see why it should not be done. Fixing the 
issue takes much less time than writing about it...


And if other commands can be improved fine with me.

If we were to do something about that, it could make sense to make 
fsync_pgdata() smarter so as the control file is flushed last there, or 
define flush strategies there.


ISTM that this would not work: the control file update can only be done 
*after* the fsync, to describe the cluster's actual status; otherwise it is 
just a question of luck whether the cluster is corrupt on a crash while 
fsyncing. The enforced order of operations, with a barrier in between, is 
the important thing here.


--
Fabien.

Re: MSVC Build support with visual studio 2019

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 12:45:57PM +1100, Haribabu Kommi wrote:
> The commit f2ab389 was later back-patched to all versions down to 9.3 in
> commit 19acfd65.  I guess that building the Windows installer for all the
> versions using the same Visual Studio may be the reason behind
> that back-patch.

I did not remember this one, thanks for pointing it out.  So my
memories on that were incorrect.  If it is possible to get the code to
build with MSVC 2019 on all the supported branches, we could do so.
--
Michael


signature.asc
Description: PGP signature


Re: Offline enabling/disabling of data checksums

2019-03-21 Thread Michael Paquier
On Thu, Mar 21, 2019 at 07:59:24AM +0900, Michael Paquier wrote:
> Please note that we do that in other tools as well and we live fine
> with that as pg_basebackup, pg_rewind just to name two.  I am not
> saying that it is not a problem in some cases, but I am saying that
> this is not a problem that this patch should solve.  If we were to do
> something about that, it could make sense to make fsync_pgdata()
> smarter so as the control file is flushed last there, or define flush
> strategies there.

Anyway, as this stuff is very useful for upgrade scenarios
a-la-pg_upgrade, for backup validation and as it does not produce
false positives, I would really like to get something committed for
v12 in its simplest form...  Are there any recommendations that people
would like to add to the documentation?
--
Michael


signature.asc
Description: PGP signature