RE: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> > David has submitted multiple patches for PG 12, one of which speeds up
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)
> What challenges are there for future versions, and which of them are being
> addressed by patches in progress for PG 12, and which issues are untouched?
> 
> I've not submitted that for PG12 yet. I had other ideas about just
> getting rid of the inheritance planner altogether, but so far don't
> have a patch for that. Still uncertain if there are any huge blockers
> to that either.

Sorry, I seem to have misunderstood something.

By the way, what do you think is the "ideal and should-be-feasible" goal and 
the "realistic" goal we can reach in the near future (e.g. PG 12)?  Say,

* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1), where n is the number of partitions

Regards
Takayuki Tsunakawa




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Amit Langote
On 2018/07/13 14:49, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
>> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
>> is pretty significant and gets worse as the number of partitions grows.
>> I
>> had intended to fix that in PG 11, but we could only manage to get part
>> of
>> that work into PG 11, with significant help from David and others.  So,
>> while PG 11's overhead of partitioning during planning is less than PG
>> 10's due to improved pruning algorithm, it's still pretty far from ideal,
>> because it isn't just the pruning algorithm that had overheads.  In fact,
>> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
>> carry the overhead that was in PG 10.
> 
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)

I don't think he has posted a new patch for improving the speed for
UPDATE/DELETE planning yet, although I remember he had posted a PoC patch
back in February or March.

> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

The immediate one I think is to refactor the planner such that the new
pruning code, that we were able to utilize for SELECT in PG 11, can also
be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we
can move on to refactoring to avoid locking and opening of all partitions.
 As long as we're relying on constraint exclusion for partition pruning,
which we still do for UPDATE/DELETE, we cannot do that because constraint
exclusion has to look at each partition individually.
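As a rough illustration of that asymmetry (hypothetical table names, not taken from the benchmark):

    CREATE TABLE t (id int, val text) PARTITION BY RANGE (id);
    CREATE TABLE t_1 PARTITION OF t FOR VALUES FROM (1) TO (1000);
    CREATE TABLE t_2 PARTITION OF t FOR VALUES FROM (1000) TO (2000);

    -- On PG 11 this is planned with the new pruning code:
    EXPLAIN SELECT * FROM t WHERE id = 42;

    -- This still goes through the inheritance planner and constraint
    -- exclusion, so every partition is locked and opened at plan time:
    EXPLAIN UPDATE t SET val = 'x' WHERE id = 42;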

The fact that UPDATE/DELETE planning for partitioning uses huge amounts of
memory and CPU is a pretty big issue, and refactoring the planner to avoid
that may be the hardest of all the problems to be solved here.

>> The overheads I mention stem from
>> the fact that for partitioning we still rely on the old planner code
>> that's used to perform inheritance planning, which requires locking and
>> opening *all* partitions.  We have so far been able to refactor just enough
>> to use the new code for partition pruning, but there is much refactoring
>> work left to avoid needlessly locking and opening all partitions.
> 
> I remember the issue of opening and locking all partitions was discussed 
> before.

Quite a few times I suppose.

> Does this relate only to the planning phase?

If the benchmark contains queries that will need to access just one
partition, then yes, the planning part is the biggest overhead.

Execution-time overhead is limited to having an extra, possibly needless,
Append node, but I know David has a patch for that too.

Thanks,
Amit




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread David Rowley
On 13 July 2018 at 14:58, Kato, Sho  wrote:
> Of course I'm sure table partitioning works well with up to a hundred
> partitions, as written in the PostgreSQL documentation.
>
> But my customer will use a partitioned table with 1.1k leaf partitions.
>
> So we need to improve performance.
>
> Any ideas?

It would be really good if you could review and test my partitioning
patches in the current commitfest. These are the ones authored by me
with the word "partition" in the title.

These 4 patches don't solve all the problems, but they do make some
good gains in some areas. I still have more work to do, so the sooner I
can be finished with the 4 that are there now, the sooner I can focus on
the rest.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread David Rowley
On 13 July 2018 at 17:49, Tsunakawa, Takayuki
 wrote:
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)  
> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

I've not submitted that for PG12 yet. I had other ideas about just
getting rid of the inheritance planner altogether, but so far don't
have a patch for that. Still uncertain if there are any huge blockers
to that either. Join search needs to be performed multiple times, but a good
advantage will be that we can take advantage of partition pruning to
only join search the non-pruned partitions.

The reason I changed my mind about the patch you're talking about is
that it meant that RangeTblEntries needed to keep the same relation id
in the inheritance planner as they did in the grouping planner, and
another patch I have semi-baked delays building both RelOptInfo and
RangeTblEntry records for partitions until after pruning. Without the
RangeTblEntry it was difficult to ensure the relids were in lock-step
between the two planners.  There are ways to do it, it just didn't
look pretty.

Hoping to have something later in the cycle.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Justin Pryzby
On Fri, Jul 13, 2018 at 05:49:20AM +, Tsunakawa, Takayuki wrote:
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)  
> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

Here's a known issue which I'm not sure is on anybody's list:
https://www.postgresql.org/message-id/flat/4F989CD8.8020804%40strategicdata.com.au#4f989cd8.8020...@strategicdata.com.au
=> planning of UPDATE/DELETE (but not SELECT) takes a huge amount of RAM when run
on a parent with many children.

We don't typically have UPDATEs, and those we do have are against the child tables;
but we ran into this last month with a manual query on the parent causing OOM.  I
worked around it, but keep meaning to revisit it to see if this has changed at all in
v11 (very brief testing suggests it hasn't).

Justin



Re: Cannot dump foreign key constraints on partitioned table

2018-07-12 Thread amul sul
Thanks for the prompt fix, patch [1] works for me.

1]  https://postgr.es/m/20180712184537.5vjwgxlbuiomomqd@alvherre.pgsql

Regards,
Amul



RE: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
> is pretty significant and gets worse as the number of partitions grows.
> I
> had intended to fix that in PG 11, but we could only manage to get part
> of
> that work into PG 11, with significant help from David and others.  So,
> while PG 11's overhead of partitioning during planning is less than PG
> 10's due to improved pruning algorithm, it's still pretty far from ideal,
> because it isn't just the pruning algorithm that had overheads.  In fact,
> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
> carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning 
of UPDATE/DELETE (I couldn't find it in the current CF, though.)  What 
challenges are there for future versions, and which of them are being addressed 
by patches in progress for PG 12, and which issues are untouched?

> The overheads I mention stem from
> the fact that for partitioning we still rely on the old planner code
> that's used to perform inheritance planning, which requires locking and
> opening *all* partitions.  We have so far been able to refactor just enough
> to use the new code for partition pruning, but there is much refactoring
> work left to avoid needlessly locking and opening all partitions.

I remember the issue of opening and locking all partitions was discussed 
before.  Does this relate only to the planning phase?

Kato-san, could you try pgbench -M prepared?
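For reference, a sketch of such a run, with the script and database names assumed to match the earlier benchmark post:

    pgbench -M prepared -n -c 1 -T 180 -f update.sql testdb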


Regards
Takayuki Tsunakawa






Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full

2018-07-12 Thread Thomas Munro
On Sat, Apr 14, 2018 at 3:48 AM, Julian Markwort
 wrote:
> [a patch]

Hello Julian,

Could you please post a rebased patch?

I haven't reviewed or tested any code yet, but here's some proof-reading:

+   This behaviour is similar to the cert autentication method

"behavior" (our manual is written in en_US, "cd doc/src/sgml ; git
grep behavior | wc -l" -> 895, "git grep behaviour" -> 0).

cert

"authentication"

+   chain, but it will also check whether the username or it's
+   mapping match the common name (CN) of the provided certificate.

"its"

"matches"

+   Note that certificate chain validation  is always ensured when
+   cert authentication method is used

extra space

when *the* ...

+   In this case, the CN (nommon name) provided in

"common name"

+   CN (Common Name) in the certificate matches

"common"? (why a capital letter here?)

This line isn't modified by your patch, but I saw it while in
proof-reading mode:

  *err_msg = "clientcert can not be set to 0 when using \"cert\"
authentication";

I think "can not" is usually written "cannot"?

> slightly offtopic opinion:
> While creating the test cases, I stumbled upon the problem of missing
> dependencies to run the tests...
> It's complicated enough that the binaries used by these perl tests are not
> named similarly to the packages which provide them (the 'prove' binary is
> supplied by 'Test-Harness'), so maybe in the interest of providing a lower
> entry-barrier to running these tests, we could give a more detailed error
> message in the configure script, when using --enable-tap-tests ?

Yeah.  The packages to install depend on your operating system, and in
some cases (macOS, Windows?) which bolt-on package thingamajig you
use, though.  Perhaps the READMEs could be improved with details for
systems we have reports about (like the recently added "Requirements"
section of src/test/ldap/README).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-12 Thread Thomas Munro
On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut
 wrote:
> - XFS has (optional) reflink support.  This file system is probably more
> widely used than Btrfs.
>
> - Linux and glibc have a proper function to do this now.
>
> - APFS on macOS supports file cloning.

TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
it's not in OpenZFS though I see numerous requests and discussions...
(Of course you can just clone the whole filesystem and then pg_upgrade
the clone in-place).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Cannot dump foreign key constraints on partitioned table

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 11:34:43PM -0400, Alvaro Herrera wrote:
> I'm not sure what to *do* with the partition, though :-) I don't think
> there's a nice way to verify that the FK actually exists, or that
> catalog rows are set in such-and-such way, after restoring this.
> The pg_restore tests are marked TODO in the suite.  I think that'll have
> to wait.

Ouch, right.  My eyes are betraying me.  I swear when I tested that I
saw an ALTER TABLE ADD CONSTRAINT command generated as well for each
partition on top of the parent :)

But only the parent needs to have the definition, so your test is
sufficient.  Sorry for the noise.
--
Michael


signature.asc
Description: PGP signature


Re: Checkpoint not retrying failed fsync?

2018-07-12 Thread Thomas Munro
On Tue, Jun 12, 2018 at 3:31 PM, Thomas Munro
 wrote:
> I was about to mark this patch "rejected" and forget about it, since
> Craig's patch makes it redundant.  But then I noticed that Craig's
> patch doesn't actually remove the retry behaviour completely: it
> promotes only EIO and ENOSPC to PANIC.

Rejected, since this will have to be dealt with one way or another in
that other thread.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record

2018-07-12 Thread Michael Paquier
On Fri, Jul 13, 2018 at 08:13:39AM +0500, Andrey V. Lepikhov wrote:
> The timestamp in a backup history file does not correspond to any WAL record and
> can't be bound to the time of the backup exactly.
> In my opinion, keeping a timestamp in XLOG_BACKUP_END is a more reliable, safe
> and easy way to recover a database to a specific time.

Like Andres and Fujii-san, I don't really see the point of complicating
the code for that.  If your goal is to stop WAL replay once consistency
has been reached, then just use recovery_target = 'immediate' if you
fetch the data from a WAL archive and there are still segments
after the consistency point.  Or just use a self-contained backup which
has all the WAL it needs without restore_command.

If your goal is to make sure that the timestamp set in recovery.conf is
not older than the moment the backup has ended, then the backup history
file has what you are looking for.  In short, in any case there is no
point in duplicating data which already exists.  You can as well use
recovery_target_lsn to point exactly at the time a backup has ended as
returned by pg_stop_backup, and this even saves maths with timestamps.
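For illustration, a minimal recovery.conf sketch along those lines (the paths and the LSN are placeholders; only one recovery_target* setting should be set at a time):

    restore_command = 'cp /path/to/archive/%f %p'
    # stop replay as soon as a consistent state is reached
    recovery_target = 'immediate'
    # ... or stop exactly at the LSN returned by pg_stop_backup
    # recovery_target_lsn = '0/3000060'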
--
Michael


signature.asc
Description: PGP signature


Re: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Amit Langote
Kato-san,

On 2018/07/13 11:58, Kato, Sho wrote:
> Hi,
> 
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no 
> sub-partitioned tables.

Thanks for sharing the results.

> But statement latencies on a partitioned table are much slower than on a 
> non-partitioned table.
> 
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

Yes, INSERT comes out ahead because there is no overhead of partitioning
in the planning phase.  As David Rowley reported on the nearby thread
("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still
significant overhead during its execution, so we're still a fair bit
away from the best possible performance.

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
is pretty significant and gets worse as the number of partitions grows.  I
had intended to fix that in PG 11, but we could only manage to get part of
that work into PG 11, with significant help from David and others.  So,
while PG 11's overhead of partitioning during planning is less than PG
10's due to improved pruning algorithm, it's still pretty far from ideal,
because it isn't just the pruning algorithm that had overheads.  In fact,
PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
carry the overhead that was in PG 10.  The overheads I mention stem from
the fact that for partitioning we still rely on the old planner code
that's used to perform inheritance planning, which requires locking and
opening *all* partitions.  We have so far been able to refactor just enough
to use the new code for partition pruning, but there is much refactoring
work left to avoid needlessly locking and opening all partitions.

Thanks,
Amit




Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-07-12 Thread Michael Paquier
On Tue, Mar 13, 2018 at 08:08:48AM +, Tsunakawa, Takayuki wrote:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
>> On the whole, my vote is to fix and apply step 2, and leave it at that.

Yeah, I have been thinking about idea 1 mentioned above, in short cleaning
up the temporary namespace at connection start instead of at its first use.
While that would make the cleanup more aggressive, it may also be that
having only autovacuum do the work is enough, so I am +-0 on this idea.

> Done.  It seems to work well.

I have looked at the patch proposed.

+   /* Does the backend own the temp schema? */
+   if (proc->tempNamespaceId != namespaceID)
+   return false;
I have a very hard time believing that this is safe without a lock; a
spinlock would be enough, it seems.

+   /* Is the backend connected to this database? */
+   if (proc->databaseId != MyDatabaseId)
+   return false;
Wouldn't it be more interesting to do this cleanup as well if the
backend is *not* connected to the database being autovacuumed?  This would
make the cleanup more aggressive, which is better.
--
Michael


signature.asc
Description: PGP signature


Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-12 Thread Thomas Munro
On Fri, Jul 13, 2018 at 6:52 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> I suppose the call to AcceptInvalidationMessages() could go at the end
>> of ClientAuthentication().  That'd be closer to the code that creates
>> the negative entry and immediately after the code that might modify
>> the catalogs.  Or in PeformAuthentication(), its caller.  Or in
>> IntializeSessionUserId() immediately before we try to look up the
>> name, but that's also called by background workers that don't need it.
>
> I think my preference would be to put it in InitializeSessionUserId
> so that it's clearly associated with the syscache lookups we're trying
> to protect.

Thanks.  Pushed to master that way.

With this change, I can successfully use a little PAM module that
authenticates with an LDAP server and then creates, grants and revokes
roles as necessary based on LDAP groups.  It's not as good as
Stephen's grand plan, but it's dead simple.  Without the change, it
still works, but the first login attempt for a given user fails.  I
can live with that in the back branches.

> I don't entirely buy your argument that background workers
> don't need up-to-date info for this.

Yeah, a hypothetical background worker that starts up and then waits
to be told the name of the role to use could suffer from this problem.

-- 
Thomas Munro
http://www.enterprisedb.com



RE: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Kato, Sho
>I wondered if you compared to PG10 or to inheritance-partitioning (parent with
>relkind='r' and either trigger or rule or INSERT/UPDATE directly into child)?

Thank you for your reply.

I compared to PG11beta2 with a non-partitioned table.

The non-partitioned table has 1100 records in one table.
The partitioned table has one record in each leaf partition.

Regards,
-Original Message-
From: Justin Pryzby [mailto:pry...@telsasoft.com] 
Sent: Friday, July 13, 2018 12:11 PM
To: Kato, Sho/加藤 翔 
Subject: Re: How to make partitioning scale better for larger numbers of 
partitions

Hi,

On Fri, Jul 13, 2018 at 02:58:53AM +, Kato, Sho wrote:
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no 
> sub-partitioned tables.
> But statement latencies on a partitioned table are much slower than on a 
> non-partitioned table.
> 
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

I wondered if you compared to PG10 or to inheritance-partitioning (parent with 
relkind='r' and either trigger or rule or INSERT/UPDATE directly into child)?
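For comparison, a minimal sketch of that inheritance-based setup (table and function names are made up):

    CREATE TABLE accounts (aid int, abalance int);
    CREATE TABLE accounts_1 (CHECK (aid BETWEEN 1 AND 1000)) INHERITS (accounts);

    CREATE FUNCTION accounts_insert_trigger() RETURNS trigger AS $$
    BEGIN
        INSERT INTO accounts_1 VALUES (NEW.*);
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER insert_accounts
        BEFORE INSERT ON accounts
        FOR EACH ROW EXECUTE PROCEDURE accounts_insert_trigger();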

Justin






Re: Cannot dump foreign key constraints on partitioned table

2018-07-12 Thread Alvaro Herrera
On 2018-Jul-13, Michael Paquier wrote:

> On Thu, Jul 12, 2018 at 02:45:37PM -0400, Alvaro Herrera wrote:
> > Thanks, looks good.  I propose to add following pg_dump test to ensure
> > this stays fixed.
> 
> Thanks for adding the test.  I was looking at a good way to add a test
> but could not come up with something which can be summed up with one
> query for create_sql, so what you have here is nice.  Could you add an
> extra test with a partition of dump_test.test_table_fk?  Children should
> have the FK defined as well, with relhastriggers set to true; still, when
> I tested, if the partitioned table was not scanned for its FK, then its
> child partitions also missed it.  So I think that it is important to
> check that the FK is defined for all members of the partition tree.

Hmm.  The pg_dump tests make it easy to create a partition (in fact I
had already modified the test to add one after submitting):

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 8860928df1..666760c0d8 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -635,7 +635,10 @@ my %tests = (
create_order => 4,
create_sql   => 'CREATE TABLE dump_test.test_table_fk (
col1 int references dump_test.test_table)
-   PARTITION BY RANGE (col1);',
+   PARTITION BY RANGE (col1);
+   CREATE TABLE dump_test.test_table_fk_1
+   PARTITION OF dump_test.test_table_fk
+   FOR VALUES FROM (0) TO (10);',
regexp => qr/
\QADD CONSTRAINT test_table_fk_col1_fkey FOREIGN KEY (col1) REFERENCES dump_test.test_table\E
/xm,

I'm not sure what to *do* with the partition, though :-) I don't think
there's a nice way to verify that the FK actually exists, or that
catalog rows are set in such-and-such way, after restoring this.
The pg_restore tests are marked TODO in the suite.  I think that'll have
to wait.

> I am fine to add the test myself and to push if you need help.  Of
> course feel free to do it yourself if you want.  Either way is fine for
> me.

No worries -- I'll push it tomorrow morning.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record

2018-07-12 Thread Andrey V. Lepikhov




On 10.07.2018 22:26, Fujii Masao wrote:

On Tue, Jul 10, 2018 at 6:41 PM, Andrey V. Lepikhov
 wrote:



On 10.07.2018 06:45, Andres Freund wrote:


Hi,

On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote:


This functionality is needed in practice when we have to determine a
recovery time of specific backup.



What do you mean by "recovery time of specific backup"?



Recovery time is the point in time at which the backup of the PostgreSQL
database instance was made.
When performing database recovery, we want to know what point in time the
restored database will correspond to.
This functionality is aimed at improving the usability of the pg_basebackup
and pg_probackup utilities.


Why don't you use a backup history file for that purpose?


The timestamp in a backup history file does not correspond to any WAL record and 
can't be bound to the time of the backup exactly.
In my opinion, keeping a timestamp in XLOG_BACKUP_END is a more reliable, 
safe and easy way to recover a database to a specific time.




Regards,



--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



Re: Concurrency bug in UPDATE of partition-key

2018-07-12 Thread Amit Kapila
On Thu, Jul 12, 2018 at 10:14 PM, Andres Freund  wrote:
> On 2018-07-11 09:16:33 -0400, Alvaro Herrera wrote:
>> On 2018-Jul-11, Amit Kapila wrote:
>>
>> > Attached, please find an updated patch based on comments by Alvaro.
>> > See, if this looks okay to you guys.
>>
>> LGTM as far as my previous comments are concerned.
>
> I see Amit pushed a patch here yesterday
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=40ca70ebcc9d0bc1c02937b27c44b2766617e790
> , is there a need to keep the open item open,
>

No, I have moved the item from the open issues list.  I was waiting
on the buildfarm; yesterday it showed some failure after this
commit, but it turned out to be an unrelated random failure.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Kato, Sho
Hi,

I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no 
sub-partitioned tables.
But statement latencies on a partitioned table are much slower than on a 
non-partitioned table.

UPDATE latency is 210 times slower than a non-partitioned table.
SELECT latency is 36 times slower than a non-partitioned table.
Surprisingly INSERT latency is almost same.

Of course I'm sure table partitioning works well with up to a hundred partitions, 
as written in the PostgreSQL documentation.
But my customer will use a partitioned table with 1.1k leaf partitions.
So, we need to improve performance.

Any ideas?

The results of pgbench and perf are listed below.

pgbench results
---

transaction type: update.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 648
latency average = 278.202 ms
tps = 3.594512 (including connections establishing)
tps = 3.594545 (excluding connections establishing)
statement latencies in milliseconds:
 0.011  \set aid random(1, 1100 * 1)
 0.004  \set delta random(-5000, 5000)
 0.038  BEGIN;
   277.005  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
 1.140  END;

transaction type: select.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 19415
latency average = 9.281 ms
tps = 107.745068 (including connections establishing)
tps = 107.746067 (excluding connections establishing)
statement latencies in milliseconds:
 0.800  \set aid random(1, 1100 * 1)
 0.137  \set delta random(-5000, 5000)
 1.351  BEGIN;
 4.941  SELECT abalance FROM test.accounts WHERE aid = :aid;
 2.052  END;

transaction type: insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 31895
latency average = 5.654 ms
tps = 176.865541 (including connections establishing)
tps = 176.867086 (excluding connections establishing)
statement latencies in milliseconds:
 2.083  \set aid random(1, 1100 * 1)
 0.003  \set delta random(-5000, 5000)
 0.029  BEGIN;
 3.222  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
 0.317  END;

Top 10 of perf report


UPDATE:
21.33%  postgres  postgres   [.] range_table_mutator
12.57%  postgres  postgres   [.] AllocSetAlloc
  4.97%  postgres  postgres   [.] palloc
  4.48%  postgres  postgres   [.] make_one_rel
  3.96%  postgres  postgres   [.] lappend
  2.74%  postgres  [kernel.kallsyms]  [k] get_page_from_freelist
  1.87%  postgres  postgres   [.] setup_append_rel_array
  1.68%  postgres  [kernel.kallsyms]  [k] list_del
  1.64%  postgres  [kernel.kallsyms]  [k] __alloc_pages_nodemask
  1.62%  postgres  [kernel.kallsyms]  [k] unmap_vmas

SELECT:
14.72%  postgres  postgres   [.] AllocSetAlloc
  5.14%  postgres  postgres   [.] hash_search_with_hash_value
  4.23%  postgres  postgres   [.] palloc
  4.06%  postgres  postgres   [.] MemoryContextAllocZeroAligned
  2.61%  postgres  postgres   [.] copyObjectImpl
  2.34%  postgres  postgres   [.] expression_tree_mutator
  2.13%  postgres  [kernel.kallsyms]  [k] _spin_lock
  1.91%  postgres  postgres   [.] lappend
  1.59%  postgres  [kernel.kallsyms]  [k] __link_path_walk
  1.50%  postgres  postgres   [.] set_rel_size

INSERT:
20.75%  postgres  postgres   [.] hash_search_with_hash_value
  6.03%  postgres  postgres   [.] hash_any
  4.88%  postgres  postgres   [.] AllocSetAlloc
  4.05%  postgres  postgres   [.] LWLockRelease
  4.05%  postgres  postgres   [.] LWLockAcquire
  3.27%  postgres  postgres   [.] oid_cmp
  3.06%  postgres  postgres   [.] SearchCatCache
  2.97%  postgres  postgres   [.] LockReleaseAll
  2.57%  postgres  postgres   [.] pg_qsort
  2.37%  postgres  postgres   [.] hash_seq_search


The following is information on the environment used for the benchmark.

Server spec
---

  Server has 16 cpu.
  Memory size is 264GB.
  Database directory is on SSD.

database tuning
---

  shared_buffers = 102GB
  max_locks_per_transaction = 100

postgresql version
--

  11beta2 + patch1 + patch2

  patch1: Allow direct lookups of AppendRelInfo by child relid
  commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687

  patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
  https://commitfest.postgresql.org/18/1690/


table definition


  create table test.accounts(aid serial, abalance int) partition by range(aid);
  create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone);
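A hedged sketch of how the 1.1k leaf partitions might have been generated (the exact range boundaries are not shown in the post; one aid value per leaf is assumed):

  DO $$
  BEGIN
      FOR i IN 1..1100 LOOP
          EXECUTE format(
              'CREATE TABLE test.accounts_%s PARTITION OF test.accounts
                   FOR VALUES FROM (%s) TO (%s)', i, i, i + 1);
      END LOOP;
  END $$;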

Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Andres Freund
On 2018-07-12 19:22:50 -0700, Christophe Pettus wrote:
> 
> > On Jul 12, 2018, at 17:52, Michael Paquier  wrote:
> > Wild guess: you did not issue a checkpoint on the promoted standby
> > before running pg_rewind.
> 
> I don't believe a manual checkpoint was done on the target (promoted standby, 
> new master), but it did one as usual during startup after the timeline switch:
> 
> > 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  
> > checkpoint starting: force
> 
> 
> The pg_rewind was started about 90 seconds later.

Note that that message doesn't indicate a completed checkpoint, just
that one started. Do you see a "checkpoint complete: wrote ..." message
before the rewind started?

Greetings,

Andres Freund



RE: ON CONFLICT DO NOTHING on pg_dump

2018-07-12 Thread Ideriha, Takeshi
>I noticed one more thing: pg_dumpall.c doesn't really need to prohibit
>--on-conflict-do-nothing without --insert.  Its existing validation rejects
>illegal combinations of the settings that are *not* passed on to pg_dump.
>It seems OK to just pass those on and let pg_dump complain.  For example,
>if you say "pg_dumpall --data-only --schema-only", it's pg_dump that
>complains, not pg_dumpall.  I think we should do the same thing here.

Thank you for the clarification. I didn't give thought to pg_dumpall internally 
running pg_dump.

>Pushed, with those changes.
Thanks!


Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Christophe Pettus


> On Jul 12, 2018, at 19:22, Christophe Pettus  wrote:
> 
> 
>> On Jul 12, 2018, at 17:52, Michael Paquier  wrote:
>> Wild guess: you did not issue a checkpoint on the promoted standby
>> before running pg_rewind.
> 
> I don't believe a manual checkpoint was done on the target (promoted standby, 
> new master), but it did one as usual during startup after the timeline switch:
> 
>> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  
>> checkpoint starting: force
> 
> The pg_rewind was started about 90 seconds later.

That being said, the pg_rewind output seems to indicate that the old divergence 
point was still being picked up, rather than the one on timeline 104:

> servers diverged at WAL position A58/5000 on timeline 103
> rewinding from last common checkpoint at A58/4E0689F0 on timeline 103

--
-- Christophe Pettus
   x...@thebuild.com




Re: pg_create_logical_replication_slot returns text instead of name

2018-07-12 Thread Masahiko Sawada
On Fri, Jul 13, 2018 at 11:22 AM, Michael Paquier  wrote:
> On Fri, Jul 13, 2018 at 11:14:01AM +0900, Masahiko Sawada wrote:
>> Hmm, I'm also not sure about the policy of usage of name data type for
>> columns that show an object identifier on external servers. There is a
>> similar case; we have the pubname in pg_subscription as name type
>> whereas the subpublications in pg_subscription is of type text[].
>
> Let's not bother then.  In the case of the function you pointed out the
> intention behind the code was clear anyway, so that's better now.

Yeah, I agreed with you.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Christophe Pettus


> On Jul 12, 2018, at 17:52, Michael Paquier  wrote:
> Wild guess: you did not issue a checkpoint on the promoted standby
> before running pg_rewind.

I don't believe a manual checkpoint was done on the target (promoted standby, 
new master), but it did one as usual during startup after the timeline switch:

> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  checkpoint 
> starting: force


The pg_rewind was started about 90 seconds later.

--
-- Christophe Pettus
   x...@thebuild.com




Re: pg_create_logical_replication_slot returns text instead of name

2018-07-12 Thread Michael Paquier
On Fri, Jul 13, 2018 at 11:14:01AM +0900, Masahiko Sawada wrote:
> Hmm, I'm also not sure about the policy of usage of name data type for
> columns that show an object identifier on external servers. There is a
> similar case; we have the pubname in pg_subscription as name type
> whereas the subpublications in pg_subscription is of type text[].

Let's not bother then.  In the case of the function you pointed out the
intention behind the code was clear anyway, so that's better now.
--
Michael


signature.asc
Description: PGP signature


RE: Locking B-tree leafs immediately in exclusive mode

2018-07-12 Thread Imai, Yoshikazu
On Mon, July 9, 2018 at 5:25 PM, Simon Riggs wrote:
> Please can you check insertion with the index on 2 keys
> 1st key has 10,000 values
> 2nd key has monotonically increasing value from last 1st key value
> 
> So each session picks one 1st key value
> Then each new INSERTion is a higher value of 2nd key
> so 1,1, then 1,2 then 1,3 etc
> 
> Probably easier to do this with a table like this
> 
> CREATE UNLOGGED TABLE ordered2 (id integer, logdate timestamp default
> now(), value text not null, primary key (id, logdate));
> 
> # script_ordered2.sql
> \set i random(1, 10000)
> INSERT INTO ordered2 (id, value) VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
> 
> Thanks
I tried to do this, but I might have mistaken your intention, so please tell me 
if I am wrong.

While script_ordered.sql assumes that there is one contention point on the 
rightmost leaf node, script_ordered2.sql assumes that there are several 
contention points on some leaf nodes; is that right?
I experimented with key1 having 10,000 values, but there was no difference in 
the results compared to the unordered.sql one, so I experimented with key1 having 
1, 2, 3, 5, 10, and 100 values.
Also, if I created a primary key, "ERROR:  duplicate key value violates unique 
constraint "ordered2_pkey"" happened, so I created a non-unique key.

#DDL
CREATE UNLOGGED TABLE ordered2 (id integer, logdate timestamp default now(), 
value text not null);
CREATE INDEX ordered2_key ON ordered2 (id, logdate);

# script_ordered2.sql
\set i random(1, 100)  #second value is 1, 2, 3, 5, 10, or 100
INSERT INTO ordered2 (id, value) VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');

# ordered2 results, key1 having 1, 2, 3, 5, 10, and 100 values
master,  key1 with 1 values:  236428
master,  key1 with 2 values:  292248
master,  key1 with 3 values:  340980
master,  key1 with 5 values:  362808
master,  key1 with 10 values: 379525
master,  key1 with 100 values: 405265

patched, key1 with 1 values:  295862
patched, key1 with 2 values:  339538
patched, key1 with 3 values:  355793
patched, key1 with 5 values:  371332
patched, key1 with 10 values: 387731
patched, key1 with 100 values: 405115


From the attached graph ("some_contention_points_on_leaf_nodes.png"), as 
contention points are dispersed, we can see that TPS increases and the TPS 
difference between master and the patched version becomes smaller.


Yoshikazu Imai


Re: pg_create_logical_replication_slot returns text instead of name

2018-07-12 Thread Masahiko Sawada
On Fri, Jul 13, 2018 at 9:48 AM, Michael Paquier  wrote:
> On Thu, Jul 12, 2018 at 10:18:53PM +0900, Michael Paquier wrote:
>> That's a small thing, but I agree with you.  As far as I can see slot
>> names are always mapped with the name type.  I'll push that tomorrow if
>> there are no objections.
>
> Pushed, with a catalog version bump.
>

Thank you!

> While double-checking things, I have noticed that pg_stat_wal_receiver
> also uses text for slot names.  I am not sure if this one is worth
> bothering as the docs point out the correct type, just mentioning.

Hmm, I'm also not sure about the policy of usage of name data type for
columns that show an object identifier on external servers. There is a
similar case; we have the pubname in pg_subscription as name type
whereas the subpublications in pg_subscription is of type text[].

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: ON CONFLICT DO NOTHING on pg_dump

2018-07-12 Thread Thomas Munro
On Fri, Jul 13, 2018 at 12:33 PM, Ideriha, Takeshi
 wrote:
>>+Add ON CONFLICT DO NOTHING clause in the INSERT commands.
>>
>>I think this would be better as: Add ON CONFLICT DO NOTHING to
>>INSERT commands.
>
> Agreed.
>
>>+printf(_("  --on-conflict-do-nothing dump data as INSERT
>>commands with ON CONFLICT DO NOTHING \n"));
>>
>>That's slightly misleading... let's just use the same wording again, eg "add ON
>>CONFLICT DO NOTHING to INSERT commands".
>
> Agreed. But you forgot fixing it at pg_dump.c.
> So could you please fix this and commit it?

Right, thanks.

I noticed one more thing: pg_dumpall.c doesn't really need to prohibit
--on-conflict-do-nothing without --insert.  Its existing validation
rejects illegal combinations of the settings that are *not* passed on
to pg_dump.  It seems OK to just pass those on and let pg_dump
complain.  For example, if you say "pg_dumpall --data-only
--schema-only", it's pg_dump that complains, not pg_dumpall.  I think
we should do the same thing here.
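For reference, a usage sketch of the option as committed (the database name is made up); pg_dumpall simply forwards the flags and lets pg_dump validate them:

    pg_dump --inserts --on-conflict-do-nothing mydb > mydb.sql
    pg_dumpall --inserts --on-conflict-do-nothing > all.sql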

Pushed, with those changes.

Thanks for the patch and the reviews!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] WAL logging problem in 9.4.3?

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 05:12:21PM +0300, Heikki Linnakangas wrote:
> Doesn't have to be a trigger, could be a CHECK constraint, datatype input
> function, etc. Admittedly, having a datatype input function that inserts to
> the table is worth a "huh?", but I'm feeling very confident that we can
> catch all such cases, and some of them might even be sensible.

Sure, but do we want to be that invasive?  Triggers are easy enough to
block because those are available directly within cstate so you would
know if those are triggered.  CHECK constraints can also be checked easily
by looking at the Relation information, and actually, as DEFAULT values
could have an expression, we'd want to block them too, no?  The input
datatype is, well, trickier to deal with, as there is no actual way to
know if the INSERT is happening within the context of a COPY, and this
could be just C code.  One way to tackle that would be to force the
optimization not to be used if a non-system data type is used when doing
COPY...

Entirely disabling the optimization for any relation which has a CHECK
constraint or DEFAULT expression basically applies to a hell of a lot of
them, which makes the optimization, at least it seems to me, useless,
because it is never going to apply to most real-world cases.
--
Michael


signature.asc
Description: PGP signature


Re: patch to allow disable of WAL recycling

2018-07-12 Thread Thomas Munro
On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
 wrote:
> I don't follow Alvaro's reasoning, TBH. There's a couple of things that
> confuse me ...
>
> I don't quite see how reusing WAL segments actually protects against full
> filesystem? On "traditional" filesystems I would not expect any difference
> between "unlink+create" and reusing an existing file. On CoW filesystems
> (like ZFS or btrfs) the space management works very differently and reusing
> an existing file is unlikely to save anything.

Yeah, I had the same thoughts.

> But even if it reduces the likelihood of ENOSPC, it does not eliminate it
> entirely. max_wal_size is not a hard limit, and the disk may be filled by
> something else (when WAL is not on a separate device, when there is thin
> provisioning, etc.). So it's not a protection against data corruption we
> could rely on. (And as was discussed in the recent fsync thread, ENOSPC is a
> likely source of past data corruption issues on NFS and possibly other
> filesystems.)

Right.  That ENOSPC discussion was about checkpointing though, not
WAL.  IIUC the hypothesis was that there may be stacks (possibly
involving NFS or thin provisioning, or perhaps historical versions of
certain local filesystems that had reservation accounting bugs, on a
certain kernel) that could let you write() a buffer, and then later
when the checkpointer calls fsync() the filesystem says ENOSPC, the
kernel reports that and throws away the dirty page, and then at next
checkpoint fsync() succeeds but the checkpoint is a lie and the data
is smoke.

We already PANIC on any errno except EINTR in XLogWriteLog(), as seen
in Jerry's nearby stack trace, so that failure mode seems to be
covered already for WAL, no?

> AFAICS the original reason for reusing WAL segments was the belief that
> overwriting an existing file is faster than writing a new file. That might
> have been true in the past, but the question is if it's still true on
> current filesystems. The results posted here suggest it's not true on ZFS,
> at least.

Yeah.

The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
comments about the documentation; we should make sure that the 'off'
setting isn't accidentally recommended to the wrong audience) and I
vote we take it.

Just by the way, if I'm not mistaken ZFS does avoid faulting when
overwriting whole blocks, just like other filesystems:

https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034

So then where are those faults coming from?  Perhaps the tree page
that holds the block pointer, of which there must be many when the
recordsize is small?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Cannot dump foreign key constraints on partitioned table

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 02:45:37PM -0400, Alvaro Herrera wrote:
> Thanks, looks good.  I propose to add following pg_dump test to ensure
> this stays fixed.

Thanks for adding the test.  I was looking at a good way to add a test
but could not come up with something which can be summed up with one
query for create_sql, so what you have here is nice.  Could you add an
extra test with a partition of dump_test.test_table_fk?  Children should
have the FK defined as well, with relhastriggers set to true; still, when
I tested, if the partitioned table was not scanned for its FK, then its
child partitions also missed it.  So I think that it is important to
check that the FK is defined for all members of the partition tree.

I am fine to add the test myself and to push if you need help.  Of
course feel free to do it yourself if you want.  Either way is fine for
me.
--
Michael


signature.asc
Description: PGP signature


Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 02:26:17PM -0700, Christophe Pettus wrote:
> What surprises me about the error is that while the recovery point
> seems reasonable, it shouldn't be on timeline 103, but on timeline
> 105.

Wild guess: you did not issue a checkpoint on the promoted standby
before running pg_rewind.
--
Michael


signature.asc
Description: PGP signature


Re: pg_create_logical_replication_slot returns text instead of name

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 10:18:53PM +0900, Michael Paquier wrote:
> That's a small thing, but I agree with you.  As far as I can see slot
> names are always mapped with the name type.  I'll push that tomorrow if
> there are no objections.

Pushed, with a catalog version bump.

While double-checking things, I have noticed that pg_stat_wal_receiver
also uses text for slot names.  I am not sure if this one is worth
bothering as the docs point out the correct type, just mentioning.
--
Michael


signature.asc
Description: PGP signature


RE: ON CONFLICT DO NOTHING on pg_dump

2018-07-12 Thread Ideriha, Takeshi
Hi, thanks for the revision.

>
>+Add ON CONFLICT DO NOTHING clause in the INSERT commands.
>
>I think this would be better as: Add ON CONFLICT DO NOTHING to
>INSERT commands.

Agreed.

>+printf(_("  --on-conflict-do-nothing dump data as INSERT
>commands with ON CONFLICT DO NOTHING \n"));
>
>That's slightly misleading... let's just use the same wording again, eg "add ON
>CONFLICT DO NOTHING to INSERT commands".

Agreed. But you forgot fixing it at pg_dump.c.
So could you please fix this and commit it?

Regards,
Takeshi Ideriha



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Andrew Dunstan




On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:


I fully understand. I think this needs to go back to "Waiting on Author".

Why?  Heikki's patch applies fine and passes the regression tests.




Well, I understood Claudio was going to do some more work (see 
upthread). If we're going to go with Heikki's patch then do we need to 
change the author, or add him as an author?


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Alvaro Herrera
On 2018-Jul-12, Andrew Dunstan wrote:

> I fully understand. I think this needs to go back to "Waiting on Author".

Why?  Heikki's patch applies fine and passes the regression tests.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Shared buffer access rule violations?

2018-07-12 Thread Tom Lane
Asim R P  writes:
> On Tue, Jul 10, 2018 at 8:33 PM, Tom Lane  wrote:
>> Asim R P  writes:
>>> One can find several PageInit() calls with no content lock held.  See,
>>> for example:
>>> fill_seq_with_data()

>> That would be for a relation that no one else can even see yet, no?

> Yes, when the sequence is being created.  No, when the sequence is
> being reset, in ResetSequence().

ResetSequence creates a new relfilenode, which no one else will be able
to see until it commits, so the case is effectively the same as for
creation.

>>> vm_readbuf()
>>> fsm_readbuf()

>> In these cases I'd imagine that the I/O completion interlock is what
>> is preventing other backends from accessing the buffer.

> What is I/O completion interlock?

Oh ... the RBM_ZERO_ON_ERROR action should be done under the I/O lock,
but the ReadBuffer caller isn't holding that lock anymore, so I see your
point here.  Probably, nobody's noticed because it's a corner case that
shouldn't happen under normal use, but it's not safe.  I think what we
want is more like

    if (PageIsNew(BufferGetPage(buf)))
    {
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        if (PageIsNew(BufferGetPage(buf)))
            PageInit(BufferGetPage(buf), BLCKSZ, 0);
        UnlockReleaseBuffer(buf);
    }

to ensure that the page is initialized once and only once, even if
several backends do this concurrently.

regards, tom lane



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Andrew Dunstan




On 07/12/2018 12:38 PM, Claudio Freire wrote:

On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan
 wrote:



On 04/06/2018 08:00 PM, Claudio Freire wrote:

On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  wrote:

On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  wrote:

On 06/04/18 01:59, Claudio Freire wrote:

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction.

Can you elaborate? It does return the items one block at a time. Is that
what you mean by being specific for vacuumlazy? I guess that's a bit
special, but if you imagine some other users for this abstraction, it's
probably not that unusual. For example, if we started using it in bitmap
heap scans, a bitmap heap scan would also want to get the TIDs one block
number at a time.

But you're also tying the caller to the format of the buffer holding
those TIDs, for instance. Why would you, when you can have an
interface that just iterates TIDs and let the caller store them
if/however they want?

I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate
the code, I don't think we'll get this in the next 24hs.

So I guess we'll have ample time to ponder on both issues during the
next commit fest.




There doesn't seem to have been much pondering done since then, at least
publicly. Can we make some progress on this? It's been around for a long
time now.

Yeah, life has kept me busy and I haven't had much time to make
progress here, but I was planning on doing the refactoring as we were
discussing soon. Can't give a time frame for that, but "soonish".


I fully understand. I think this needs to go back to "Waiting on Author".

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2018-07-12 Thread Fabien COELHO



Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonable small value.


Ok. I don't think that's really worthwhile. If we add some code that only 
runs in testing, then we're not really testing the real thing. I wouldn't 
trust the test to tell much. Let's just leave out that magic environment 
variable thing, and try to get the rest of the patch finished.


If you remove the environment, then some checks need to be removed, because 
the 2 second run may be randomly shortened when there is nothing to do. If not, 
the test will fail nondeterministically, which is not acceptable. Hence the 
hack. I agree that it is not beautiful.


A more reasonable alternative could be to always run for 2 seconds under -T 2, 
even if the execution could be shortened because there is nothing to do at all, 
i.e. remove the environment-based condition but keep the sleep.


Yet another option would be to replace the env variable by an option, eg
"--strict-time", that would be used probably only by the TAP test, but 
would be an available feature.


--
Fabien.

Re: Temporary WAL segments files not cleaned up after an instance crash

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 03:40:43PM +0300, Heikki Linnakangas wrote:
> Sure.

Thanks for the reviews, I have pushed the patch after moving the elog()
call and changing the logs to mention "WAL segments" instead of "WAL
files".
--
Michael


signature.asc
Description: PGP signature


Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Christophe Pettus


> On Jul 12, 2018, at 10:29, Andres Freund  wrote:
> 
> This needs a lot more information before somebody can reasonably act on
> it.

Happy to provide, of course!

The two relevant hosts are "Ash" and "Chi".  As mentioned, they've been flipped 
back and forth repeatedly using pg_rewind: One will be promoted, the other 
pg_rewind'd, and then brought up as a WAL shipping secondary to the first.  
This procedure has worked repeatedly, but this last instance failed.  The logs 
from the failure are attached below, along with the output of from 
pg_controldata for both hosts.  The systems are still in this configuration, so 
we can gather more as required.

What surprises me about the error is that while the recovery point seems 
reasonable, it shouldn't be on timeline 103, but on timeline 105.


Failover to Ash:
2018-07-10 19:14:22 UTC [2072]: [5153-1] user=,db=,app=,client= LOG:  received 
promote request
2018-07-10 19:14:22 UTC [2072]: [5154-1] user=,db=,app=,client= LOG:  redo done 
at A58/4F0822D0
2018-07-10 19:14:22 UTC [2072]: [5155-1] user=,db=,app=,client= LOG:  last 
completed transaction was at log time 2018-07-10 19:13:54.515121+00
2018-07-10 19:14:23 UTC [2072]: [5156-1] user=,db=,app=,client= LOG:  restored 
log file "00670A58004F" from archive
2018-07-10 19:14:23 UTC [2072]: [5157-1] user=,db=,app=,client= LOG:  selected 
new timeline ID: 104
2018-07-10 19:14:24 UTC [2072]: [5158-1] user=,db=,app=,client= LOG:  restored 
log file "0067.history" from archive

[Chi rewound using pg_rewind against Ash, brought up]

Chi:
2018-07-10 19:26:05 UTC [3260]: [4-1] user=,db=,app=,client= LOG:  restored log 
file "0068.history" from archive
2018-07-10 19:26:06 UTC [3260]: [5-1] user=,db=,app=,client= LOG:  redo starts 
at A58/4E061EF8
2018-07-10 19:26:07 UTC [3260]: [6-1] user=,db=,app=,client= LOG:  restored log 
file "00680A580050" from archive
2018-07-10 19:26:08 UTC [3260]: [7-1] user=,db=,app=,client= LOG:  restored log 
file "00680A580051" from archive
2018-07-10 19:26:09 UTC [3260]: [8-1] user=,db=,app=,client= LOG:  restored log 
file "00680A580052" from archive
2018-07-10 19:26:10 UTC [3260]: [9-1] user=,db=,app=,client= LOG:  restored log 
file "00680A580053" from archive
2018-07-10 19:26:11 UTC [3260]: [10-1] user=,db=,app=,client= LOG:  restored 
log file "00680A580054" from archive
2018-07-10 19:26:12 UTC [3260]: [11-1] user=,db=,app=,client= LOG:  restored 
log file "00680A580055" from archive
2018-07-10 19:26:13 UTC [3260]: [12-1] user=,db=,app=,client= LOG:  restored 
log file "00680A580056" from archive
2018-07-10 19:26:13 UTC [3260]: [13-1] user=,db=,app=,client= LOG:  restored 
log file "00680A580057" from archive
2018-07-10 19:26:14 UTC [3260]: [14-1] user=,db=,app=,client= LOG:  restored 
log file "00680A580058" from archive
2018-07-10 19:26:15 UTC [3260]: [15-1] user=,db=,app=,client= LOG:  restored 
log file "00680A580059" from archive
2018-07-10 19:26:15 UTC [3260]: [16-1] user=,db=,app=,client= LOG:  restored 
log file "00680A58005A" from archive
2018-07-10 19:26:16 UTC [3260]: [17-1] user=,db=,app=,client= LOG:  restored 
log file "00680A58005B" from archive
2018-07-10 19:26:17 UTC [3260]: [18-1] user=,db=,app=,client= LOG:  restored 
log file "00680A58005C" from archive
2018-07-10 19:27:28 UTC [3260]: [19-1] user=,db=,app=,client= LOG:  restored 
log file "00680A58005D" from archive
2018-07-10 19:27:28 UTC [3260]: [20-1] user=,db=,app=,client= LOG:  consistent 
recovery state reached at A58/5D01BCA8
2018-07-10 19:27:28 UTC [2564]: [3-1] user=,db=,app=,client= LOG:  database 
system is ready to accept read only connections
2018-07-10 19:28:28 UTC [3260]: [21-1] user=,db=,app=,client= LOG:  restored 
log file "00680A58005E" from archive

Chi promoted:
2018-07-10 19:28:37 UTC [3260]: [22-1] user=,db=,app=,client= LOG:  received 
promote request
2018-07-10 19:28:37 UTC [3260]: [23-1] user=,db=,app=,client= LOG:  redo done 
at A58/5E0817D0
2018-07-10 19:28:37 UTC [3260]: [24-1] user=,db=,app=,client= LOG:  last 
completed transaction was at log time 2018-07-10 19:28:12.954559+00
2018-07-10 19:28:37 UTC [3260]: [25-1] user=,db=,app=,client= LOG:  restored 
log file "00680A58005E" from archive
2018-07-10 19:28:37 UTC [3260]: [26-1] user=,db=,app=,client= LOG:  selected 
new timeline ID: 105
2018-07-10 19:28:38 UTC [3260]: [27-1] user=,db=,app=,client= LOG:  restored 
log file "0068.history" from archive
2018-07-10 19:28:38 UTC [3260]: [28-1] user=,db=,app=,client= LOG:  archive 
recovery complete
2018-07-10 19:28:38 UTC [3260]: [29-1] user=,db=,app=,client= LOG:  MultiXact 
member wraparound protections are now enabled
2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG:  checkpoint 
starting: force
2018-07-10 19:28:38 UTC [2564]: [4-1] user=,db=,app=,client= LO

Re: When use prepared protocol, transaction will hold backend_xmin until the end of the transaction.

2018-07-12 Thread Tom Lane
chenhj  writes:
> When execute sql with prepared protocol, read committed transaction will hold 
> backend_xmin until the end of the transaction.

No, just till the active portal is dropped.

In the case you show, the issue is that libpq doesn't bother to issue
an explicit Close Portal message, but just lets the unnamed portal
get recycled implicitly by the next query (cf. PQsendQueryGuts).
So the portal stays open, and its snapshot stays alive, till some
other command is sent.  This is different from the behavior for simple
query mode, where the portal is automatically closed after execution.

I agree this isn't very desirable now that we have mechanisms to
advance the advertised xmin as soon as snapshots go away.

Perhaps portals could be taught to drop their snapshots as soon as
the query has reached completion, but it'd be a little bit ticklish
to not break valid use-patterns for cursors.

Another idea would be to fix it on the client side by including an
explicit Close command in the PQsendQuery sequence.  But if there
are similar usage patterns in other client libraries, it might take
a long time to get them all up to speed.
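
As a rough illustration of that behavior from the client side, here is a
minimal libpq sketch (the table name "t" and the closing "SELECT 1" are only
illustrative; connection parameters come from the environment). The point is
simply that, per the explanation above, the snapshot of the extended-protocol
query stays alive until some other command is sent:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");	/* parameters from environment */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* Extended protocol: goes through the unnamed statement and portal. */
	res = PQexecParams(conn, "SELECT count(*) FROM t",
					   0, NULL, NULL, NULL, NULL, 0);
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("count = %s\n", PQgetvalue(res, 0, 0));
	PQclear(res);

	/*
	 * The unnamed portal (and its snapshot) is still alive here, so the
	 * backend keeps advertising its xmin.  Any further command recycles
	 * the portal and lets the advertised xmin advance.
	 */
	PQclear(PQexec(conn, "SELECT 1"));

	PQfinish(conn);
	return 0;
}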

regards, tom lane



Re: Failed assertion due to procedure created with SECURITY DEFINER option

2018-07-12 Thread Jonathan S. Katz

> On Jul 4, 2018, at 3:43 AM, Peter Eisentraut 
>  wrote:
> 
> On 03.07.18 19:20, Andres Freund wrote:
>> On 2018-06-29 10:19:17 -0700, Andres Freund wrote:
>>> Hi,
>>> 
>>> On 2018-06-29 13:56:12 +0200, Peter Eisentraut wrote:
>>>> On 6/29/18 13:07, amul sul wrote:
>>>>> This happens because of in fmgr_security_definer() function we are
>>>>> changing  global variable SecurityRestrictionContext and in the
>>>>> StartTransaction() insisting it should be zero, which is the problem.
>>>>
>>>> Hmm, what is the reason for this insistation?
>>> 
>>> Because it's supposed to be reset by AbortTransaction(), after an error.
>> 
>> Does that make sense Peter?
>> 
>> I've added this thread to the open items list.
> 
> Proposed fix attached.

First, I reproduced the issue against HEAD successfully.

Then, I applied the patch and tested using the original test case:

testing=# CREATE PROCEDURE transaction_test1() LANGUAGE plpgsql SECURITY DEFINER
testing-# AS $$ BEGIN
testing$#COMMIT;
testing$# END $$;
CREATE PROCEDURE
testing=# CALL transaction_test1();
2018-07-12 15:45:49.846 EDT [39928] ERROR:  invalid transaction termination
2018-07-12 15:45:49.846 EDT [39928] CONTEXT:  PL/pgSQL function 
transaction_test1() line 2 at COMMIT
2018-07-12 15:45:49.846 EDT [39928] STATEMENT:  CALL transaction_test1();
ERROR:  invalid transaction termination
CONTEXT:  PL/pgSQL function transaction_test1() line 2 at COMMIT

So it handles it as expected.

Code and test cases look fine to me from what I can tell. My only suggestion
would be to add some guidance to the documentation on which languages support
transaction control statements inside procedures created with SECURITY
DEFINER.

Jonathan

Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-12 Thread Robbie Harwood
Nico Williams  writes:

> Attached is an additional patch, as well as a new, rebased patch.
>
> This includes changes responsive to Álvaro Herrera's commentary about
> the SET CONSTRAINTS manual page.

This patch looks good to me.  +1; Álvaro, please update the CF entry
when you're also satisfied.

Thanks,
--Robbie


signature.asc
Description: PGP signature


Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-07-12 Thread Peter Geoghegan
On Tue, Jul 3, 2018 at 5:17 AM, Andrey V. Lepikhov
 wrote:
> Done.
> Attachment contains an update for use v.2 of the 'Ensure nbtree leaf tuple
> keys are always unique' patch.

My v3 is still pending, but is now a lot better than v2. There were
bugs in v2 that were fixed.

One area that might be worth investigating is retail index tuple
deletion performed within the executor in the event of non-HOT
updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record",
at least in unique index tuples with no NULL values. The idea is that
MVCC index scans can skip over those if they've already found a
visible tuple with the same value. Also, when there was about to be a
page split, they could be treated a little bit like LP_DEAD items. Of
course, the ghost bit would have to be treated as a hint that could be
"wrong" (e.g. because the transaction hasn't committed yet), so you'd
have to go to the heap in the context of a page split, to double
check. Also, you'd need heuristics that let you give up on this
strategy when it didn't help.

I think that this could work well enough for OLTP workloads, and might
be more future-proof than doing it in VACUUM. Though, of course, it's
still very complicated.

-- 
Peter Geoghegan



Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-12 Thread Tom Lane
Thomas Munro  writes:
> On Thu, Jul 5, 2018 at 9:35 AM, Tom Lane  wrote:
>> That seems like a *really* ad-hoc place to put it.  Why should it be
>> there, and not (say) somewhere inside InitializeSessionUserId, or maybe
>> (also?) inside PerformAuthentication?  Why do the existing call sites for
>> AcceptInvalidationMessages --- in particular, the ones associated with
>> lock acquisition --- not solve the problem already?

> There is no lock acquisition involved here.  The sequence is:
>
> 1.  ClientAuthentication()->hba_getauthmethod()->check_hba()->get_role_oid()
> tries to look up user "fred" and doesn't find it (that lookup is used
> for group membership checks for hba line matching purposes, and not
> finding it here is not fatal, assuming you match with "all"); the
> cache now has a negative entry.
>
> 2.  The configured authentication method runs, and via a side
> connection it creates the role "fred".
>
> 3.  InitializeSessionUserId() looks up "fred", and finds the stale
> negative entry.

[ thinks about that for awhile... ]  So really this is an instance of
a generic problem, which is that the result of a syscache lookup could
be stale: it's not guaranteed to reflect changes that committed since
the last AcceptInvalidationMessages call, which in the worst case is
the start of the current transaction.  We have workarounds in place
that (mostly?) guarantee that relation-related lookups will get
sufficiently up-to-date data, but there's nothing much protecting other
catalogs such as pg_proc or pg_authid.

I can't help thinking we're going to need to fix that eventually.  But
I can't think of any easy way to do so.  Adding AcceptInvalidationMessages
to SearchSysCache itself is right out for performance reasons, I'm afraid,
unless we can somehow find a way to make the no-op path through it much
cheaper than it is now.

> I suppose the call to AcceptInvalidationMessages() could go at the end
> of ClientAuthentication().  That'd be closer to the code that creates
> the negative entry and immediately after the code that might modify
> the catalogs.  Or in PeformAuthentication(), its caller.  Or in
> IntializeSessionUserId() immediately before we try to look up the
> name, but that's also called by background workers that don't need it.

I think my preference would be to put it in InitializeSessionUserId
so that it's clearly associated with the syscache lookups we're trying
to protect.  I don't entirely buy your argument that background workers
don't need up-to-date info for this.
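
For illustration only, a sketch of roughly where such a call could sit (this
is not a committed fix; the lookup is abbreviated from the real function):

/* hypothetical excerpt from InitializeSessionUserId() in miscinit.c */
	HeapTuple	roleTup;

	/*
	 * Pick up catalog changes committed while ClientAuthentication() was
	 * running (e.g. a role created on the fly by the auth method), before
	 * consulting the syscache, which may still hold the negative entry
	 * left by check_hba()'s earlier lookup.
	 */
	AcceptInvalidationMessages();

	if (rolename != NULL)
		roleTup = SearchSysCache1(AUTHNAME, PointerGetDatum(rolename));
	else
		roleTup = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
	if (!HeapTupleIsValid(roleTup))
		ereport(FATAL,
				(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
				 errmsg("role does not exist")));
	/* ... rest of the function unchanged ... */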

regards, tom lane



Re: Cannot dump foreign key constraints on partitioned table

2018-07-12 Thread Alvaro Herrera
On 2018-Jul-12, Michael Paquier wrote:

> Changing pg_class.relhastriggers is out of scope because as far as I
> know partitioned tables have no triggers, so the current value is
> correct, and that would be a catalog change at this stage which would
> cause any existing deployments of v11 to complain about the
> inconsistency.  I think that this should be fixed client-side as the
> attached does.

Thanks, looks good.  I propose to add following pg_dump test to ensure
this stays fixed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 05b2d8912688ce6b1c613d7de8a0a3a874e21653 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 12 Jul 2018 14:36:11 -0400
Subject: [PATCH] Dump foreign keys on partitioned tables
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

My patch that ended up as commit 3de241dba86f ("Foreign keys on
partitioned tables") lacked pg_dump tests, so the pg_dump code that was
there to support it inadvertently stopped working when I hacked the
backend code to not emit pg_trigger rows for the partitioned table
itself.

Bug analysis and code fix is by Michaël.  I (Álvaro) merely added the
test.

Reported-by: amul sul 
Co-authored-by: Michaël Paquier 
Co-authored-by: Álvaro Herrera 
Discussion: 
https://postgr.es/m/CAAJ_b94n=UsNVhgs97vCaWEZAMe-tGDRVuZ73oePQH=eajk...@mail.gmail.com
---
 src/bin/pg_dump/pg_dump.c|  7 ++-
 src/bin/pg_dump/t/002_pg_dump.pl | 18 ++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 463639208d..74a1270169 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -7131,7 +7131,12 @@ getConstraints(Archive *fout, TableInfo tblinfo[], int 
numTables)
{
TableInfo  *tbinfo = &tblinfo[i];
 
-   if (!tbinfo->hastriggers ||
+   /*
+* For partitioned tables, foreign keys have no triggers so they
+* must be included anyway in case some foreign keys are 
defined.
+*/
+   if ((!tbinfo->hastriggers &&
+tbinfo->relkind != RELKIND_PARTITIONED_TABLE) ||
!(tbinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
continue;
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 7eee870259..8860928df1 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -631,6 +631,24 @@ my %tests = (
},
},
 
+   'ALTER TABLE (partitioned) ADD CONSTRAINT ... FOREIGN KEY' => {
+   create_order => 4,
+   create_sql   => 'CREATE TABLE dump_test.test_table_fk (
+   col1 int references 
dump_test.test_table)
+   PARTITION BY RANGE 
(col1);',
+   regexp => qr/
+   \QADD CONSTRAINT test_table_fk_col1_fkey FOREIGN KEY 
(col1) REFERENCES dump_test.test_table\E
+   /xm,
+   like => {
+   %full_runs,
+   %dump_test_schema_runs,
+   section_post_data  => 1,
+   },
+   unlike => {
+   exclude_dump_test_schema => 1,
+   },
+   },
+
'ALTER TABLE ONLY test_table ALTER COLUMN col1 SET STATISTICS 90' => {
create_order => 93,
create_sql =>
-- 
2.11.0



Re: assert in nested SQL procedure call in current HEAD

2018-07-12 Thread Peter Eisentraut
On 12.06.18 18:47, Andrew Gierth wrote:
> While testing this, I ran into another semi-related issue:
> shmem_exit_inprogress isn't ever being cleared in the postmaster, which
> means that if you ever have a crash-restart, any attempt to do a
> rollback in a procedure will then crash or get some other form of
> corruption again every time until you manually restart the cluster.

I have committed a fix for this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2018-07-12 Thread Fabien COELHO




I don't understand the 0.5 second rule. For the tests, we only need to 
ensure that at least one progress report is printed, right?


[...]


I still don't understand.


Let's look at the code:

   if (progress && thread->tid == 0)
   {
  ...
  if (last_report == thread_start || now - last_report >= 500000)
doProgressReport(thread, &last, &last_report, now, thread_start);

For the testing, we just need to make sure that at least one progress report 
is always printed, if -P is used. Right?


Yep. That is the first condition above: last_report is still set to 
thread_start, meaning that there has been no report yet.


So where does the 0.5 second rule come in? Can't we just do "if (no 
progress reports were printed) { print progress report; }" at the end?


The second 0.5s condition is to print a closing report if some significant 
time has passed since the last one, but we do not want to print a report 
within the same second.

  pgbench -T 5 -P 2

Would then print report at 2, 4 and 5. 0.5 ensures that we are not going 
to do "2 4.0[00] 4.0[01]" on -t whatever -P 2, which would look silly.


If you do not like it then the second condition can be removed, fine with 
me.



It also adds a small feature which is that there is always a final
progress when the run is completed, which can be useful when computing
progress statistics, otherwise some transactions could not be reported in
any progress at all.


Any transactions in the last 0.5 seconds might still not get reported in any 
progress reports.


Yep. I wanted a reasonable threshold, given that both -T and -P are in 
seconds anyway, so it probably could only happen with -t.



Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonable small value.


Ok. I don't think that's really worthwhile. If we add some code that only 
runs in testing, then we're not really testing the real thing. I wouldn't 
trust the test to tell much. Let's just leave out that magic environment 
variable thing, and try to get the rest of the patch finished.


If you remove the environment variable, then some checks need to be removed, 
because the 2 second run may be randomly shortened when there is nothing to 
do. If not, the test will fail nondeterministically, which is not 
acceptable. Hence the hack. I agree that it is not beautiful.


The more reasonable alternative could be to always run the full 2 seconds 
under -T 2, even if the execution could be shortened because there is nothing 
to do at all, i.e. remove the environment-based condition but keep the sleep.


--
Fabien.

Re: requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Andres Freund
Hi,

On 2018-07-12 10:20:06 -0700, Christophe Pettus wrote:
> PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit).
> 
> We're trying to diagnose the error:
> 
>   requested timeline 105 does not contain minimum recovery point 
> A58/6B109F28 on timeline 103
> 
> The error occurs when a WAL-shipping (not streaming) secondary starts up.
> 
> These two machines have been part of a stress-test where, repeatedly, the 
> secondary is promoted, the old primary is rewound using pg_rewind, and then 
> attached to the new primary.  This has worked for multiple iterations, but 
> this error popped up.  The last cycle was particularly fast: the new primary 
> was only up for about 10 seconds (although it had completed recovery) before 
> being shut down again, and pg_rewind applied to it to reconnect it with the 
> promoted secondary.

This needs a lot more information before somebody can reasonably act on
it.

Greetings,

Andres Freund



requested timeline ... does not contain minimum recovery point ...

2018-07-12 Thread Christophe Pettus
PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit).

We're trying to diagnose the error:

requested timeline 105 does not contain minimum recovery point 
A58/6B109F28 on timeline 103

The error occurs when a WAL-shipping (not streaming) secondary starts up.

These two machines have been part of a stress-test where, repeatedly, the 
secondary is promoted, the old primary is rewound using pg_rewind, and then 
attached to the new primary.  This has worked for multiple iterations, but this 
error popped up.  The last cycle was particularly fast: the new primary was 
only up for about 10 seconds (although it had completed recovery) before being 
shut down again, and pg_rewind applied to it to reconnect it with the promoted 
secondary.

--
-- Christophe Pettus
   x...@thebuild.com




Re: Cannot dump foreign key constraints on partitioned table

2018-07-12 Thread Alvaro Herrera
On 2018-Jul-12, Michael Paquier wrote:

> On Wed, Jul 11, 2018 at 03:49:59PM +0530, amul sul wrote:
> > On the master head, getConstraints() function skips FK constraints for
> > a partitioned table because of tbinfo->hastriggers is false.
> > 
> > While creating FK constraints on the partitioned table, the FK triggers are 
> > only
> > created on leaf partitions and skipped for the partitioned tables.
> 
> Oops.  Good catch.

Looking at it now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GiST VACUUM

2018-07-12 Thread Andrey Borodin



> On 12 July 2018, at 20:40, Heikki Linnakangas wrote:
> 
> On 12/07/18 19:06, Andrey Borodin wrote:
>>> On 11 July 2018, at 0:07, Heikki Linnakangas wrote:
>>> This seems misplaced. This code deals with internal pages, and as
>>> far as I can see, this patch never marks internal pages as deleted,
>>> only leaf pages. However, we should have something like this in the
>>> leaf-page branch, to deal with the case that an insertion lands on
>>> a page that was concurrently deleted. Did you have any tests, where
>>> an insertion runs concurrently with vacuum, that would exercise
>>> this?
>> That bug could manifest only in case of crash between removing
>> downlinks and marking pages deleted.
> 
> Hmm. The downlink is removed first, so I don't think you can see that 
> situation after a crash. After a crash, you might have some empty, orphaned, 
> pages that have already been unlinked from the parent, but a search/insert 
> should never encounter them.
> 
> Actually, now that I think about it more, I'm not happy with leaving orphaned 
> pages like that behind. Let's WAL-log the removal of the downlink, and 
> marking the leaf pages as deleted, in one WAL record, to avoid that.

OK, will do this. But this will complicate WAL replay seriously, and I do not 
know a proper way to test that (BTW there is GiST amcheck in progress, but I 
decided to leave it for a while).

> 
> But the situation in gistdoinsert(), where you encounter a deleted leaf page, 
> could happen during normal operation, if vacuum runs concurrently with an 
> insert. Insertion locks only one page at a time, as it descends the tree, so 
> after it has released the lock on the parent, but before it has locked the 
> child, vacuum might have deleted the page. In the latest patch, you're 
> checking for that just before swapping the shared lock for an exclusive one, 
> but I think that's wrong; you need to check for that after swapping the lock, 
> because otherwise vacuum might delete the page while you're not holding the 
> lock.
Looks like a valid concern, I'll move that code again.
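
For reference, a small sketch of the re-check pattern under discussion
(placement inside gistdoinsert() and the restart label are assumed, not taken
from the patch):

	/* upgrade the lock first, only then test for concurrent deletion */
	LockBuffer(stack->buffer, GIST_UNLOCK);
	LockBuffer(stack->buffer, GIST_EXCLUSIVE);
	stack->page = (Page) BufferGetPage(stack->buffer);

	if (GistPageIsDeleted(stack->page))
	{
		/* leaf was deleted by concurrent VACUUM; start the descent over */
		UnlockReleaseBuffer(stack->buffer);
		goto restart_descent;	/* hypothetical label */
	}
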
> 
>> I do not know how to test this
>> reliably. Internal pages are locked before leaves and locks are
>> coupled. No concurrent backend can see downlinks to pages being
>> deleted, unless crash happens.
> 
> Are you sure? At a quick glance, I don't think the locks are coupled.


Sorry for overquoting

+   /* rescan inner pages that had empty child pages */
+   foreach(cell,rescanList)
+   {
+   Buffer  buffer;
+   Pagepage;
+   OffsetNumber i,
+   maxoff;
+   IndexTuple  idxtuple;
+   ItemId  iid;
+   OffsetNumber todelete[MaxOffsetNumber];
+   Buffer  buftodelete[MaxOffsetNumber];
+   int ntodelete = 0;
+
+   buffer = ReadBufferExtended(rel, MAIN_FORKNUM, 
(BlockNumber)lfirst_int(cell),
+   
RBM_NORMAL, info->strategy);
+   LockBuffer(buffer, GIST_EXCLUSIVE);
Here's first lock
+   gistcheckpage(rel, buffer);
+   page = (Page) BufferGetPage(buffer);
+
+   Assert(!GistPageIsLeaf(page));
+
+   maxoff = PageGetMaxOffsetNumber(page);
+
+   for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = 
OffsetNumberNext(i))
+   {
+   Buffer  leafBuffer;
+   PageleafPage;
+
+   iid = PageGetItemId(page, i);
+   idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+   leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, 
ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+   RBM_NORMAL, 
info->strategy);
+   LockBuffer(leafBuffer, GIST_EXCLUSIVE);
now locks are coupled in a top-down descent


> 
> We do need some way of testing this..

Can we test replication of concurrent VACUUM and inserts in the existing test suite? 
I just do not know.
I can do this tests manually if this is enough.


Best regards, Andrey Borodin.


Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-07-12 Thread Alvaro Herrera
On 2018-Jul-12, Tom Lane wrote:

> Andres Freund  writes:
> > On 2018-06-29 18:17:08 -0400, Tom Lane wrote:
> >> I'm on vacation and won't have time to look at this until week after
> >> next.  If you don't mind putting the topic on hold that long, I'll
> >> be happy to take responsibility for it.
> 
> > Is that still the plan? Do you forsee any issues here? This has been
> > somewhat of a longstanding open item...
> 
> It's on my to-do list, but I'm still catching up vacation backlog.
> Is this item blocking anyone?

I don't think so.  The patch might conflict with other fixes, but I
suppose that's a fact of life.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: small doc fix - using expressions in plpgsql FETCH command

2018-07-12 Thread Pavel Stehule
2018-07-12 18:29 GMT+02:00 Tom Lane :

> Pavel Stehule  writes:
> > PLpgSQL FETCH documentation is has ref on SQL FETCH command. SQL FETCH
> > allows only int constants as count. PLpgSQL allows any expressions. In
> this
> > case documentation is not clear, and people can be messy - and apply SQL
> > FETCH limits on PLpgSQL FETCH.
>
> Right.  Pushed with some rewriting.
>

Thank you

Pavel


> regards, tom lane
>


Re: Fix error message when trying to alter statistics on included column

2018-07-12 Thread Andres Freund
Hi Alexander, Teodor,

On 2018-06-28 18:28:03 +0900, Yugo Nagata wrote:
> According to the error message, it is not allowed to alter statistics on
> included column because this is "non-expression column".
> 
>  postgres=# create table test (i int, d int);
>  CREATE TABLE
>  postgres=# create index idx on test(i) include (d);
>  CREATE INDEX
>  postgres=# alter index idx alter column 2 set statistics 10;
>  ERROR:  cannot alter statistics on non-expression column "d" of index "idx"
>  HINT:  Alter statistics on table column instead.

Is either of you going to take care of this one? IIRC Teodor committed
the underlying patch, and Alexander wrote parts of it?

Greetings,

Andres Freund



Re: patch to allow disable of WAL recycling

2018-07-12 Thread Jerry Jelinek
I was asked to perform two different tests:
1) A benchmarksql run with WAL recycling on and then off, for comparison
2) A test when the filesystem fills up

For #1, I did two 15 minute benchmarksql runs and here are the results.
wal_recycle=on
--
Term-00, Running Average tpmTOTAL: 299.84Current tpmTOTAL: 29412
 Memory U14:49:02,470 [Thread-1] INFO   jTPCC : Term-00,
14:49:02,470 [Thread-1] INFO   jTPCC : Term-00,
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Measured tpmC (NewOrders) =
136.49
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Measured tpmTOTAL = 299.78
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Session Start =
2018-07-12 14:34:02
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Session End   =
2018-07-12 14:49:02
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Transaction Count = 4497

wal_recycle=off
---
Term-00, Running Average tpmTOTAL: 299.85Current tpmTOTAL: 29520
 Memory U15:10:15,712 [Thread-1] INFO   jTPCC : Term-00,
15:10:15,712 [Thread-1] INFO   jTPCC : Term-00,
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Measured tpmC (NewOrders) =
135.89
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Measured tpmTOTAL = 299.79
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Session Start =
2018-07-12 14:55:15
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Session End   =
2018-07-12 15:10:15
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Transaction Count = 4497

As can be seen, disabling WAL recycling does not cause any performance
regression.

For #2, I ran the test with WAL recycling on (the current behavior as well
as the default with this patch) since the behavior of postgres is
orthogonal to WAL recycling when the filesystem fills up.

I capped the filesystem with 32MB of free space. I setup a configuration
with wal_keep_segments=50 and started a long benchmarksql run. I had 4 WAL
files already in existence when the run started.

As the filesystem fills up, the performance of postgres gets slower and
slower, as would be expected. This is due to the COW nature of the
filesystem and the fact that all writes need to find space.

When a new WAL file is created, this essentially consumes no space since it
is a zero-filled file, so no filesystem space is consumed, except for a
little metadata for the file. However, as writes occur to the WAL
file, space is being consumed. Eventually all space in the filesystem is
consumed. I could not tell if this occurred during a write to an existing
WAL file or a write to the database itself. As other people have observed,
WAL file creation in a COW filesystem is not the problematic operation when
the filesystem fills up. It is the writes to existing files that will fail.
When postgres core dumped there were 6 WAL files in the pg_wal directory
(well short of the 50 configured).

When the filesystem filled up, postgres core dumped and benchmarksql
emitted a bunch of java debug information which I could provide if anyone
is interested.

Here is some information for the postgres core dump. It looks like postgres
aborted itself, but since the filesystem is full, there is nothing in the
log file.
> ::status
debugging core file of postgres (64-bit) from
f6c22f98-38aa-eb51-80d2-811ed25bed6b
file: /zones/f6c22f98-38aa-eb51-80d2-811ed25bed6b/local/pgsql/bin/postgres
initial argv: /usr/local/pgsql/bin/postgres -D /home/postgres/data
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=76019 uid=1001 code=-1
> $C
f9dfa4b0 libc.so.1`_lwp_kill+0xa()
f9dfa4e0 libc.so.1`raise+0x20(6)
f9dfa530 libc.so.1`abort+0x98()
f9dfa560 errfinish+0x230()
f9dfa5e0 XLogWrite+0x294()
f9dfa610 XLogBackgroundFlush+0x18d()
f9dfaa50 WalWriterMain+0x1a8()
f9dfaab0 AuxiliaryProcessMain+0x3ff()
f9dfab40 0x7b5566()
f9dfab90 reaper+0x60a()
f9dfaba0 libc.so.1`__sighndlr+6()
f9dfac30 libc.so.1`call_user_handler+0x1db(12, 0, f9dfaca0)
f9dfac80 libc.so.1`sigacthandler+0x116(12, 0, f9dfaca0)
f9dfb0f0 libc.so.1`__pollsys+0xa()
f9dfb220 libc.so.1`pselect+0x26b(7, f9dfdad0, 0, 0,
f9dfb230, 0)
f9dfb270 libc.so.1`select+0x5a(7, f9dfdad0, 0, 0,
f9dfb6c0)
f9dffb00 ServerLoop+0x289()
f9dffb70 PostmasterMain+0xcfa()
f9dffba0 main+0x3cd()
f9dffbd0 _start_crt+0x83()
f9dffbe0 _start+0x18()

Let me know if there is any other information I could provide.

Thanks,
Jerry


On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek 
wrote:

> Hello All,
>
> Attached is a patch to provide an option to disable WAL recycling. We have
> found that this can help performance by eliminating read-modify-write
> behavior on old WAL files that are no longer resident in the filesystem
> cache. There is a lot more detail on the background of the motivation for
> this in the following thread.
>
> https://www.postgresql.org/message-id/flat/CACukRjO7

Re: Concurrency bug in UPDATE of partition-key

2018-07-12 Thread Andres Freund
On 2018-07-11 09:16:33 -0400, Alvaro Herrera wrote:
> On 2018-Jul-11, Amit Kapila wrote:
> 
> > Attached, please find an updated patch based on comments by Alvaro.
> > See, if this looks okay to you guys.
> 
> LGTM as far as my previous comments are concerned.

I see Amit pushed a patch here yesterday
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=40ca70ebcc9d0bc1c02937b27c44b2766617e790
, is there a need to keep the open item open, or is this the complete fix?

Greetings,

Andres Freund



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-07-12 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-29 18:17:08 -0400, Tom Lane wrote:
>> I'm on vacation and won't have time to look at this until week after
>> next.  If you don't mind putting the topic on hold that long, I'll
>> be happy to take responsibility for it.

> Is that still the plan? Do you forsee any issues here? This has been
> somewhat of a longstanding open item...

It's on my to-do list, but I'm still catching up vacation backlog.
Is this item blocking anyone?

regards, tom lane



Re: assert in nested SQL procedure call in current HEAD

2018-07-12 Thread Andres Freund
Hi,

On 2018-06-29 13:52:23 +0200, Peter Eisentraut wrote:
> From 95fc7156afe521b715fab08d44606774df875e92 Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut 
> Date: Fri, 29 Jun 2018 13:28:39 +0200
> Subject: [PATCH] Fix assert in nested SQL procedure call

Andrew, Peter, are you happy with this? It'd be good to close this open
item.

- Andres



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-07-12 Thread Andres Freund
Hi Tom,

On 2018-06-29 18:17:08 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Since Tom has been revamping this code lately, I think it's a good
> > idea to wait for his input.
> 
> I'm on vacation and won't have time to look at this until week after
> next.  If you don't mind putting the topic on hold that long, I'll
> be happy to take responsibility for it.

Is that still the plan? Do you forsee any issues here? This has been
somewhat of a longstanding open item...

Greetings,

Andres Freund



Re: GiST VACUUM

2018-07-12 Thread Heikki Linnakangas

On 12/07/18 19:06, Andrey Borodin wrote:

On 11 July 2018, at 0:07, Heikki Linnakangas wrote:

This seems misplaced. This code deals with internal pages, and as
far as I can see, this patch never marks internal pages as deleted,
only leaf pages. However, we should have something like this in the
leaf-page branch, to deal with the case that an insertion lands on
a page that was concurrently deleted. Did you have any tests, where
an insertion runs concurrently with vacuum, that would exercise
this?


That bug could manifest only in case of crash between removing
downlinks and marking pages deleted.


Hmm. The downlink is removed first, so I don't think you can see that 
situation after a crash. After a crash, you might have some empty, 
orphaned, pages that have already been unlinked from the parent, but a 
search/insert should never encounter them.


Actually, now that I think about it more, I'm not happy with leaving 
orphaned pages like that behind. Let's WAL-log the removal of the 
downlink, and marking the leaf pages as deleted, in one WAL record, to 
avoid that.


But the situation in gistdoinsert(), where you encounter a deleted leaf 
page, could happen during normal operation, if vacuum runs concurrently 
with an insert. Insertion locks only one page at a time, as it descends 
the tree, so after it has released the lock on the parent, but before it 
has locked the child, vacuum might have deleted the page. In the latest 
patch, you're checking for that just before swapping the shared lock for 
an exclusive one, but I think that's wrong; you need to check for that 
after swapping the lock, because otherwise vacuum might delete the page 
while you're not holding the lock.



I do not know how to test this
reliably. Internal pages are locked before leaves and locks are
coupled. No concurrent backend can see downlinks to pages being
deleted, unless crash happens.


Are you sure? At a quick glance, I don't think the locks are coupled.

We do need some way of testing this..

- Heikki



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Claudio Freire
On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan
 wrote:
>
>
>
> On 04/06/2018 08:00 PM, Claudio Freire wrote:
> > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  
> > wrote:
> >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  
> >> wrote:
> >>> On 06/04/18 01:59, Claudio Freire wrote:
>  The iteration interface, however, seems quite specific for the use
>  case of vacuumlazy, so it's not really a good abstraction.
> >>>
> >>> Can you elaborate? It does return the items one block at a time. Is that
> >>> what you mean by being specific for vacuumlazy? I guess that's a bit
> >>> special, but if you imagine some other users for this abstraction, it's
> >>> probably not that unusual. For example, if we started using it in bitmap
> >>> heap scans, a bitmap heap scan would also want to get the TIDs one block
> >>> number at a time.
> >> But you're also tying the caller to the format of the buffer holding
> >> those TIDs, for instance. Why would you, when you can have an
> >> interface that just iterates TIDs and let the caller store them
> >> if/however they want?
> >>
> >> I do believe a pure iterator interface is a better interface.
> > Between the b-tree or not discussion and the refactoring to separate
> > the code, I don't think we'll get this in the next 24hs.
> >
> > So I guess we'll have ample time to ponder on both issues during the
> > next commit fest.
> >
>
>
>
> There doesn't seem to have been much pondering done since then, at least
> publicly. Can we make some progress on this? It's been around for a long
> time now.

Yeah, life has kept me busy and I haven't had much time to make
progress here, but I was planning on doing the refactoring as we were
discussing soon. Can't give a time frame for that, but "soonish".



Re: small doc fix - using expressions in plpgsql FETCH command

2018-07-12 Thread Tom Lane
Pavel Stehule  writes:
> PLpgSQL FETCH documentation is has ref on SQL FETCH command. SQL FETCH
> allows only int constants as count. PLpgSQL allows any expressions. In this
> case documentation is not clear, and people can be messy - and apply SQL
> FETCH limits on PLpgSQL FETCH.

Right.  Pushed with some rewriting.

regards, tom lane



Re: TRUNCATE tables referenced by FKs on partitioned tables

2018-07-12 Thread Alvaro Herrera
On 2018-Jul-12, Michael Paquier wrote:

> On Wed, Jul 11, 2018 at 06:59:16PM -0400, Alvaro Herrera wrote:
> > Anyway, this patch seems to fix it, and adds what I think is appropriate
> > test coverage.
> 
> This looks good to me.  I am noticing that the documentation of TRUNCATE
> does not mention that when running the command on a partitioned table
> then it automatically gets to the children even if CASCADE is not used
> and each child partition is not listed.

Hmm ... well, that's not new -- I think that came in with pg10.

> What is the filler column added in truncpart used for?

Nothing.  Also column b -- I had an additional different test, but then
I discovered it wasn't testing anything new.  Removed both.

Pushed, thanks for looking!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2018-07-12 Thread Heikki Linnakangas

On 12/07/18 19:00, Fabien COELHO wrote:

Now pgbench prints a progress report if none was printed, or if the last
one was over 0.5 seconds ago, so as to have a kind of catch-up at the
end.


I don't understand the 0.5 second rule. For the tests, we only need to ensure
that at least one progress report is printed, right?


I'm not so sure;-) I do not want to trust the threadRun loop in case of
heavy load or whatever issue a run may encounter, so this approach ensures
that structurally there is always one output even if the whole loop went 
very wrong.


I still don't understand.

For the testing, we just need to make sure that at least one progress 
report is always printed, if -P is used. Right? So where does the 0.5 
second rule come in? Can't we just do "if (no progress reports were 
printed) { print progress report; }" at the end?
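
Concretely, something along these lines (a sketch only, reusing the
doProgressReport() helper from the patch quoted further down in this digest;
placement at the end of the thread-0 loop is assumed):

	/* emit exactly one closing report if none was printed during the run */
	if (progress && thread->tid == 0 && last_report == thread_start)
		doProgressReport(thread, &last, &last_report, now, thread_start);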



It also adds a small feature which is that there is always a final
progress when the run is completed, which can be useful when computing
progress statistics, otherwise some transactions could not be reported in
any progress at all.


Any transactions in the last 0.5 seconds might still not get reported in 
any progress reports.



when there is nothing to do. This ensures that the -T 2 tap test runs for
at least 2 seconds, whatever. If the host is overload it might be more,
but it cannot be less unless something was wrong.


If you want to write a test that checks that a two-second test takes at least
two seconds, can't you just not use throttling in that test?


Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonable small value.


Ok. I don't think that's really worthwhile. If we add some code that 
only runs in testing, then we're not really testing the real thing. I 
wouldn't trust the test to tell much. Let's just leave out that magic 
environment variable thing, and try to get the rest of the patch finished.


- Heikki



Re: [PATCH] Add missing type conversion functions for PL/Python

2018-07-12 Thread Heikki Linnakangas

On 12/07/18 18:06, Nikita Glukhov wrote:

I have added some cross-type test cases and now almost all new code is covered
(excluding several error cases which can be triggered only by custom numeric
type implementations).


Thanks! Some of those new tests actually fail, if you run them against 
unpatched master. For example:


 SELECT * FROM test_type_conversion_float8_int2(100::float8);
 INFO:  (100.0, )
- test_type_conversion_float8_int2
---
-  100
-(1 row)
-
+ERROR:  invalid input syntax for integer: "100.0"
+CONTEXT:  while creating return value
+PL/Python function "test_type_conversion_float8_int2"

So this patch is making some subtle changes to behavior. I don't think 
we want that.


- Heikki



Re: GiST VACUUM

2018-07-12 Thread Andrey Borodin
Hi!

PFA v5 of the patch series.

> On 11 July 2018, at 0:07, Heikki Linnakangas wrote:
> 
> This seems misplaced. This code deals with internal pages, and as far as I 
> can see, this patch never marks internal pages as deleted, only leaf pages. 
> However, we should have something like this in the leaf-page branch, to deal 
> with the case that an insertion lands on a page that was concurrently 
> deleted. Did you have any tests, where an insertion runs concurrently with 
> vacuum, that would exercise this?

That bug could manifest only in case of crash between removing downlinks and 
marking pages deleted. I do not know how to test this reliably.
Internal pages are locked before leafs and locks are coupled. No cuncurrent 
backend can see downlinks to pages being deleted, unless crash happens.

I've replaced code covering this situation into leaf code path and added 
comment.

> 
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it 
> records the parent of every empty leaf page it encounters. In the second 
> phase, it scans every leaf page of that parent, not only those leaves that 
> were seen as empty.

It is fixed in second patch of the series.

> 
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted 
> page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but 
> here it's used for a critical check. This seems to be the same mechanism we 
> use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, 
> not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, 
> not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for 
> explanation. This patch is missing any comments to explain how this works in 
> GiST.

I've replaced usage of GetCurrentTransactionId() with ReadNewTransactionId() 
and added explanation of what is going on. Also, I've added comments about that 
pd_prune_xid usage. There is no other use in GiST for this field. And there is 
no other room to place this xid on a page without pg_upgrade.
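
As a sketch of the recycling test this enables (names assumed; the shape
mirrors the B-tree check mentioned above):

/* sketch: may a deleted GiST leaf page be reused for new data yet? */
static bool
gistPageRecyclable(Page page)
{
	return PageIsNew(page) ||
		(GistPageIsDeleted(page) &&
		 TransactionIdPrecedes(((PageHeader) page)->pd_prune_xid,
							   RecentGlobalXmin));
}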

> 
> If you crash in the middle of gistbulkdelete(), after it has removed the 
> downlink from the parent, but before it has marked the leaf page as deleted, 
> the leaf page is "leaked". I think that's acceptable, but a comment at least 
> would be good.

Added explanatory comment between WAL-logging downlink removal and marking 
pages deleted.


Thank you for reviewing the patch!

Best regards, Andrey Borodin.


0002-Physical-GiST-scan-during-VACUUM-v5.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v5.patch
Description: Binary data


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-12 Thread Fabien COELHO




The point is to avoid building the message with dynamic allocation when
in the end it is not used.


Ok! About avoidance - I'm afraid there's one more piece of debugging code 
with the same problem:


Indeed. I'd like to avoid all instances, so that PQExpBufferData is not 
needed anywhere, if possible. If not possible, then too bad, but I'd 
prefer to make do with formatted prints only, for simplicity.


--
Fabien.



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2018-07-12 Thread Fabien COELHO


Hello Heikki,

Thanks for having a look at this small patch which aims at improving 
pgbench coverage.



Now pgbench prints a progress report if none was printed, or if the last
one was over 0.5 seconds ago, so as to have a kind of catch-up at the
end.


I don't understand the 0.5 second rule. For the tests, we only need to ensure 
that at least one progress report is printed, right?


I'm not so sure;-) I do not want to trust the threadRun loop in case of 
heavy load or whatever issue a run may encounter, so this approach ensures 
that structurally there is always one output even if the whole loop went 
very wrong.


It also adds a small feature which is that there is always a final 
progress when the run is completed, which can be useful when computing 
progress statistics, otherwise some transactions could not be reported in 
any progress at all.


Looking at the code as it exists, I think it already works like that, 
although it's by accident. Not sure though, and if we're going to rely on 
that, it makes sense to make it more explicit.


Yep.

Also, by default, when running under throttling for 2 seconds at 20 tps, 
there is a slight chance that no transactions are scheduled and that 
pgbench returns immediately, hence the added variable to ensure that it 
lasts something anyway, and that some minimal reporting is always done 
whatever.



when there is nothing to do. This ensures that the -T 2 tap test runs for
at least 2 seconds, whatever. If the host is overload it might be more,
but it cannot be less unless something was wrong.


If you want to write a test that checks that a two-second test takes at least 
two seconds, can't you just not use throttling in that test?


Indeed… but then throttling would not be tested:-) The point of the test 
is to exercise all time-related options, including throttling with a 
reasonable small value.


Attached is a rebase.

--
Fabien.diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f0c5149523..26ac6fb37d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -5597,6 +5597,96 @@ main(int argc, char **argv)
return 0;
 }
 
+/* display the progress report */
+static void
+doProgressReport(TState *thread, StatsData *plast, int64 *plast_report,
+int64 now, int64 thread_start)
+{
+   StatsData   cur;
+   int64   run = now - *plast_report,
+   ntx;
+   double  tps,
+   total_run,
+   latency,
+   sqlat,
+   lag,
+   stdev;
+   chartbuf[64];
+   int i;
+
+   /*
+* Add up the statistics of all threads.
+*
+* XXX: No locking. There is no guarantee that we get an
+* atomic snapshot of the transaction count and latencies, so
+* these figures can well be off by a small amount. The
+* progress report's purpose is to give a quick overview of
+* how the test is going, so that shouldn't matter too much.
+* (If a read from a 64-bit integer is not atomic, you might
+* get a "torn" read and completely bogus latencies though!)
+*/
+   initStats(&cur, 0);
+   for (i = 0; i < nthreads; i++)
+   {
+   mergeSimpleStats(&cur.latency, &thread[i].stats.latency);
+   mergeSimpleStats(&cur.lag, &thread[i].stats.lag);
+   cur.cnt += thread[i].stats.cnt;
+   cur.skipped += thread[i].stats.skipped;
+   }
+
+   /* we count only actually executed transactions */
+   ntx = (cur.cnt - cur.skipped) - (plast->cnt - plast->skipped);
+   total_run = (now - thread_start) / 1000000.0;
+   tps = 1000000.0 * ntx / run;
+   if (ntx > 0)
+   {
+   latency = 0.001 * (cur.latency.sum - plast->latency.sum) / ntx;
+   sqlat = 1.0 * (cur.latency.sum2 - plast->latency.sum2) / ntx;
+   stdev = 0.001 * sqrt(sqlat - 1000000.0 * latency * latency);
+   lag = 0.001 * (cur.lag.sum - plast->lag.sum) / ntx;
+   }
+   else
+   {
+   latency = sqlat = stdev = lag = 0;
+   }
+
+   if (progress_timestamp)
+   {
+   /*
+* On some platforms the current system timestamp is
+* available in now_time, but rather than get entangled
+* with that, we just eat the cost of an extra syscall in
+* all cases.
+*/
+   struct timeval tv;
+
+   gettimeofday(&tv, NULL);
+   snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
+(long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+   }
+   else
+   {
+   /* round seconds are expected, but the thread may be late */
+   snprintf(tbuf, sizeof(tbuf), "%.1f s", t

Re: generating bootstrap entries for array types

2018-07-12 Thread Dagfinn Ilmari Mannsåker
Hi John,

John Naylor  writes:

>> On 4/26/18, Tom Lane  wrote:
>>> if I counted correctly.  (Array entries should be ignored for this
>>> purpose; maybe we'll autogenerate them someday.)
>>
>> Hmm, that wouldn't be too hard. Add a new metadata field called
>> 'array_type_oid', then if it finds such an OID, teach genbki.pl to
>> construct a tuple for the corresponding array type by consulting
>> something like
>>
>> char    typcategory    BKI_ARRAY(A);
>> char    typstorage     BKI_ARRAY(x);
>> ...etc
>>
>> in the header file, plus copying typalign from the element type. I'll
>> whip this up sometime and add it to the next commitfest.
>
> This turned out to be slightly more complicated than that, but still
> pretty straightforward.

Doing this in genbki.pl makes DBD::Pg lose its array type information,
since it uses Catalog::ParseData() to get it
(https://github.com/bucardo/dbdpg/blob/master/types.c#L439).  May I
suggest moving gen_array_types() to Catalog.pm and calling it from
ParseData() if the input file is pg_type.dat, so that it always returns
complete data?

Thanks,

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen



Re: [PATCH] btree_gist: fix union implementation for variable length columns

2018-07-12 Thread Teodor Sigaev

It would be easier to figure this out if the btree_gist code weren't
so desperately undocumented.  Teodor, do you remember why it's like
this?


Will look.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Failure assertion in GROUPS mode of window function in current HEAD

2018-07-12 Thread Masahiko Sawada
On Fri, Jul 13, 2018 at 12:17 AM, Tom Lane  wrote:
> Masahiko Sawada  writes:
>> I think we also can update the doc about that GROUPS offset mode
>> requires ORDER BY clause. Thoughts? Attached patch updates it.
>
> Ooops, I forgot to check the docs.  This isn't quite the right fix
> though --- the problem is that there's a sentence at the end of that
> para that now says exactly the wrong thing.  I fixed that.
>

Yeah, that's not the right fix. Thank you for fixing!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: _isnan() on Windows

2018-07-12 Thread Andrew Dunstan




On 07/12/2018 10:38 AM, Tom Lane wrote:

Andrew Dunstan  writes:

On 07/12/2018 10:20 AM, Tom Lane wrote:

bowerbird and hamerkop have some gripes like this:

bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro 
redefinition (src/pl/plperl/SPI.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]

We actually undef a bunch of things in plperl.h to keep the compiler
quiet. Maybe we need to add this to the list?

Perhaps.  But how do we tell the platforms where we should do that
from the ones where we shouldn't?




In the _MSC_VER section:

#ifdef isnan
#undef isnan
#endif

By inspection the perl header is just defining it to _isnan, for every 
MSC version.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Failure assertion in GROUPS mode of window function in current HEAD

2018-07-12 Thread Tom Lane
Masahiko Sawada  writes:
> I think we also can update the doc about that GROUPS offset mode
> requires ORDER BY clause. Thoughts? Attached patch updates it.

Ooops, I forgot to check the docs.  This isn't quite the right fix
though --- the problem is that there's a sentence at the end of that
para that now says exactly the wrong thing.  I fixed that.

regards, tom lane



Re: Postgres 11 release notes

2018-07-12 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On Jun 30, 2018, at 5:52 PM, Tom Lane  wrote:
>> Mmm, yeah, I suppose it should say "all framing options" rather than
>> implying that we've implemented every other window-related frammish
>> there is in the spec.

> +1. Attached patch that does exactly that.

Pushed, thanks.

regards, tom lane



Re: [PATCH] Add missing type conversion functions for PL/Python

2018-07-12 Thread Nikita Glukhov

On 11.07.2018 21:03, Heikki Linnakangas wrote:


On 26/03/18 19:07, Nikita Glukhov wrote:

Attached fixed 3th version of the patch:


Thanks, I'm reviewing this now. Nice speedup!


Thank you for your review.



There is no test coverage for some of the added code. You can get a 
code coverage report with:


./configure --enable-coverage ...
make
make -C src/pl/plpython check
make coverage-html

That produces a code coverage report in coverage/index.html. Please 
look at the coverage of the new functions, and add tests to 
src/pl/plpython/sql/plpython_types.sql so that all the new code is 
covered.



I have added some cross-type test cases and now almost all new code is covered
(excluding several error cases which can be triggered only by custom numeric
type implementations).


In some places, where you've already checked the object type e.g. with 
PyFloat_Check(), I think you could use the less safe variants, like 
PyFloat_AS_DOUBLE() instead of PyFloat_AsDouble(). Since this patch is 
all about performance, seems worth doing.


Fixed.



Some of the conversions seem a bit questionable. For example:


/*
 * Convert a Python object to a PostgreSQL float8 datum directly.
 * If can not convert it directly, fallback to PLyObject_ToScalar
 * to convert it.
 */
static Datum
PLyObject_ToFloat8(PLyObToDatum *arg, PyObject *plrv,
   bool *isnull, bool inarray)
{
if (plrv == Py_None)
{
    *isnull = true;
    return (Datum) 0;
}

if (PyFloat_Check(plrv) || PyInt_Check(plrv) || PyLong_Check(plrv))
{
    *isnull = false;
    return Float8GetDatum((double)PyFloat_AsDouble(plrv));
}

return PLyObject_ToScalar(arg, plrv, isnull, inarray);
}


The conversion from Python int to C double is performed by 
PyFloat_AsDouble(). But there's no error checking. And wouldn't 
PyLong_AsDouble() be more appropriate in that case, since we already 
checked the python type?




Yes, this might be wrong, but PyFloat_AsDouble() internally tries first to
convert number to float.  Also, after gaining more experience in PL/Python
during the implementation of jsonb transforms, I found a lot of similar
problems in the code.  All of them are fixed in the 4th version of the patch.
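
For example, an error-checked int branch could look roughly like this
(sketch only; the error message text is made up):

	if (PyLong_Check(plrv))
	{
		double		val = PyLong_AsDouble(plrv);

		/* PyLong_AsDouble() returns -1.0 and sets a Python error on overflow */
		if (val == -1.0 && PyErr_Occurred())
			PLy_elog(ERROR, "could not convert Python int to float8");

		*isnull = false;
		return Float8GetDatum(val);
	}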

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


diff --git a/src/pl/plpython/expected/plpython_types.out b/src/pl/plpython/expected/plpython_types.out
index eda965a..dc1232b 100644
--- a/src/pl/plpython/expected/plpython_types.out
+++ b/src/pl/plpython/expected/plpython_types.out
@@ -134,6 +134,158 @@ INFO:  (None, )
   
 (1 row)
 
+CREATE FUNCTION test_type_conversion_int4_int2(x int4) RETURNS int2 AS $$
+plpy.info(x, type(x))
+return x
+$$ LANGUAGE plpythonu;
+SELECT * FROM test_type_conversion_int4_int2(100::int4);
+INFO:  (100, )
+ test_type_conversion_int4_int2 
+
+100
+(1 row)
+
+SELECT * FROM test_type_conversion_int4_int2(-100::int4);
+INFO:  (-100, )
+ test_type_conversion_int4_int2 
+
+   -100
+(1 row)
+
+SELECT * FROM test_type_conversion_int4_int2(100::int4);
+INFO:  (100, )
+ERROR:  value "100" is out of range for type smallint
+CONTEXT:  while creating return value
+PL/Python function "test_type_conversion_int4_int2"
+SELECT * FROM test_type_conversion_int4_int2(-100::int4);
+INFO:  (-100, )
+ERROR:  value "-100" is out of range for type smallint
+CONTEXT:  while creating return value
+PL/Python function "test_type_conversion_int4_int2"
+SELECT * FROM test_type_conversion_int4_int2(null);
+INFO:  (None, )
+ test_type_conversion_int4_int2 
+
+   
+(1 row)
+
+CREATE FUNCTION test_type_conversion_int8_int2(x int8) RETURNS int2 AS $$
+plpy.info(x, type(x))
+return x
+$$ LANGUAGE plpythonu;
+SELECT * FROM test_type_conversion_int8_int2(100::int8);
+INFO:  (100L, )
+ test_type_conversion_int8_int2 
+
+100
+(1 row)
+
+SELECT * FROM test_type_conversion_int8_int2(-100::int8);
+INFO:  (-100L, )
+ test_type_conversion_int8_int2 
+
+   -100
+(1 row)
+
+SELECT * FROM test_type_conversion_int8_int2(100::int8);
+INFO:  (100L, )
+ERROR:  value "100" is out of range for type smallint
+CONTEXT:  while creating return value
+PL/Python function "test_type_conversion_int8_int2"
+SELECT * FROM test_type_conversion_int8_int2(-100::int8);
+INFO:  (-100L, )
+ERROR:  value "-100" is out of range for type smallint
+CONTEXT:  while creating return value
+PL/Python function "test_type_conversion_int8_int2"
+SELECT * FROM test_type_conversion_int8_int2(null);
+INFO:  (None, )
+ test_type_conversion_int8_int2 
+
+   
+(1 row)
+
+CREATE FUNCTION test_type_c

Re: cost_sort() improvements

2018-07-12 Thread Teodor Sigaev




One more thought about estimating the worst case - I wonder if simply
multiplying the per-tuple cost by 1.5 is the right approach. It does not
seem particularly principled, and it's trivially simple to construct
counter-examples defeating it (imagine columns with 99% of the rows
having the same value, and then many values in exactly one row - that
will defeat the ndistinct-based group size estimates).


Agree. But right now that is the best we have, and the constant 1.5 could easily 
be changed to 1 or 10 - it is just an arbitrary, intuitively chosen value.  As I 
mentioned in the v7 patch description, the new estimation function gives a ~10% 
bigger estimate, and this constant should not be very large, because we could 
easily get an overestimation.




The reason why constructing such counter-examples is so simple is that
this relies entirely on the ndistinct stats, ignoring the other stats we
already have. I think this might leverage the per-column MCV lists (and
eventually multi-column MCV lists, if that ever gets committed).

We're estimating the number of tuples in group (after fixing values in
the preceding columns), because that's what determines the number of
comparisons we need to make at a given stage. The patch does this by
estimating the average group size, by estimating ndistinct of the
preceding columns G(i-1), and computing ntuples/G(i-1).

But consider that we also have MCV lists for each column - if there is a
very common value, it should be there. So let's say M(i) is a frequency
of the most common value in i-th column, determined either from the MCV
list or as 1/ndistinct for that column. Then if m(i) is minimum of M(i)
for the fist i columns, then the largest possible group of values when
comparing values in i-th column is

 N * m(i-1)

This seems like a better way to estimate the worst case. I don't know if
this should be used instead of NG(i), or if those two estimates should
be combined in some way.
Agree, we definitely need an estimate of the largest group size to use in 
cost_sort. But I don't feel up to doing that - please, could you attack the 
group size computation issue? Thank you.




I think estimating the largest group we need to sort should be helpful
for the incremental sort patch, so I'm adding Alexander to this thread.

Agree, I think so. The suggested estimation algorithm should be modified to 
support leading preordered keys, and in that form it could be used directly in 
the GROUP BY and incremental sort patches.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: cost_sort() improvements

2018-07-12 Thread Teodor Sigaev

V8 contains fixes for Tomas Vondra's complaints:
 - correct usage of comparison_cost
 - remove unneeded check of sortop


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a7e0c520..521a7d3c9a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1612,6 +1612,250 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	rterm->pathtarget->width);
 }
 
+/*
+ * is_fake_var
+ *		Workaround for generate_append_tlist(), which generates fake Vars with
+ *		varno == 0 that would cause a failure in the estimate_num_groups() call
+ */
+static bool
+is_fake_var(Expr *expr)
+{
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	return (IsA(expr, Var) && ((Var *) expr)->varno == 0);
+}
+
+/*
+ * get_width_cost_multiplier
+ *		Returns the relative complexity of comparing two values based on their width.
+ * The idea behind this is that longer values have more expensive comparisons.
+ * The return value is in cpu_operator_cost units.
+ */
+static double
+get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
+{
+	double	width = -1.0; /* fake value */
+
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	/* Try to find actual stats in the corresponding relation */
+	if (IsA(expr, Var))
+	{
+		Var		*var = (Var *) expr;
+
+		if (var->varno > 0 && var->varno < root->simple_rel_array_size)
+		{
+			RelOptInfo	*rel = root->simple_rel_array[var->varno];
+
+			if (rel != NULL &&
+				var->varattno >= rel->min_attr &&
+				var->varattno <= rel->max_attr)
+			{
+				int	ndx = var->varattno - rel->min_attr;
+
+				if (rel->attr_widths[ndx] > 0)
+					width = rel->attr_widths[ndx];
+			}
+		}
+	}
+
+	/* Didn't find any actual stats, use estimation by type */
+	if (width < 0.0)
+	{
+		Node	*node = (Node*) expr;
+
+		width = get_typavgwidth(exprType(node), exprTypmod(node));
+	}
+
+	/*
+	 * Any value in Postgres is passed as a Datum, so no operation on a value
+	 * can be cheaper than an operation on a Datum.
+	 */
+	if (width <= sizeof(Datum))
+		return 1.0;
+
+	/*
+	 * It seems the cost of a comparison is not directly proportional to the args'
+	 * width, because the compared args can have different widths (we only know the
+	 * average over the column) and the result can often be determined by looking
+	 * at the first few bytes only. So, use log16(width) as the estimate.
+	 */
+	return 1.0 + 0.125 * LOG2(width / sizeof(Datum));
+}
+
+/*
+ * compute_cpu_sort_cost
+ *		compute CPU cost of sort (i.e. in-memory)
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  In this case, it will fall back to
+ * a simple comparison cost estimate.
+ *
+ * Estimation algorithm is based on ideas from course Algorithms,
+ * Robert Sedgewick, Kevin Wayne, https://algs4.cs.princeton.edu/home/ and paper
+ * "Quicksort Is Optimal For Many Equal Keys", Sebastian Wild,
+ * arXiv:1608.04906v4 [cs.DS] 1 Nov 2017.
+ *
+ * In terms of those papers, let N be the number of tuples and Xi the number of
+ * tuples with key Ki; then the estimate is:
+ * log(N! / (X1! * X2! * ..))  ~  sum(Xi * log(N/Xi))
+ * In our case all Xi are the same, because for now we don't have an estimate of
+ * group sizes, only an estimate of the number of groups. In this case the
+ * formula becomes: N * log(NumberOfGroups). Next, to support correct estimation
+ * of multicolumn sorts we need to compute each column separately, so, letting k
+ * be a column number and Gk the number of groups defined by k columns:
+ * N * sum( Fk * log(Gk) )
+ * Fk is the sum of the comparison function costs for the first k columns.
+ */
+
+static Cost
+compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys,
+		   Cost comparison_cost,
+		   double tuples, double output_tuples, bool heapSort)
+{
+	Cost		per_tuple_cost = 0.0;
+	ListCell	*lc;
+	List		*pathkeyExprs = NIL;
+	double		tuplesPerPrevGroup = tuples;
+	bool		has_fake_var = false;
+	double		totalFuncCost = 0.0;
+
+	/* fallback if pathkeys is unknown */
+	if (list_length(pathkeys) == 0)
+	{
+		/*
+		 * If we'll use a bounded heap-sort keeping just K tuples in memory, for
+		 * a total number of tuple comparisons of N log2 K; but the constant
+		 * factor is a bit higher than for quicksort.  Tweak it so that the
+		 * cost curve is continuous at the crossover point.
+		 */
+		output_tuples = (heapSort) ? 2.0 * output_tuples : tuples;
+		per_tuple_cost += 2.0 * cpu_operator_cost * LOG2(output_tuples);
+
+		return per_tuple_cost * tuples;
+	}
+
+	totalFuncCost = 1.0;
+
+	/*
+	 * Computing total cost of sorting takes into account:
+	 * - per column comparison function cost
+	 * - we try to compute needed number of comparison per column
+	 */
+
+	foreach(lc, pathkeys)
+	{
+		PathKey	   *pathkey = (PathKey *) lfirst(lc);
+		Cost funcCost = 1.0; /* fallback value */
+		EquivalenceMember	

Re: cost_sort() improvements

2018-07-12 Thread Teodor Sigaev

OK, so Fi is pretty much whatever CREATE FUNCTION ... COST says, right?

exactly


Hmm, makes sense. But doesn't that mean it's mostly a fixed per-tuple
cost, not directly related to the comparison? For example, why should it
be multiplied by C0? That is, if I create a very expensive comparator
(say, with cost 100), why should it increase the cost for transferring
the tuple to CPU cache, unpacking it, etc.?

I'd say those costs are rather independent of the function cost, and
remain rather fixed, no matter what the function cost is.

Perhaps you haven't noticed that, because the default funcCost is 1?
Maybe, but see my email 
https://www.postgresql.org/message-id/ee14392b-d753-10ce-f5ed-7b2f7e277512%40sigaev.ru 
about the additional term proportional to N.



The number of new magic constants introduced by this patch is somewhat
annoying. 2.0, 1.5, 0.125, ... :-(
2.0 is removed in the last patch; 1.5 is left and could be removed once I understand 
your letter about group size estimation :)
0.125 should be checked, and I suppose we can't remove it entirely because it is an 
"average over the whole word" constant.





  - Final cost is cpu_operator_cost * N * sum(per column costs described above).
    Note, for a single column with width <= sizeof(datum) and F1 = 1 this formula
    gives exactly the same result as the current one.
  - for Top-N sort the empiric is close to the old one: use a 2.0 multiplier as the
    constant under log2, and use log2(Min(NGi, output_tuples)) for the second and
    following columns.



I think compute_cpu_sort_cost is somewhat confused whether
per_tuple_cost is directly a cost, or a coefficient that will be
multiplied with cpu_operator_cost to get the actual cost.

At the beginning it does this:

 per_tuple_cost = comparison_cost;

so it inherits the value passed to cost_sort(), which is supposed to be
cost. But then it does the work, which includes things like this:

 per_tuple_cost += 2.0 * funcCost * LOG2(tuples);

where funcCost is pretty much pg_proc.procost. AFAIK that's meant to be
a value in units of cpu_operator_cost. And at the end it does this

 per_tuple_cost *= cpu_operator_cost;

I.e. it gets multiplied with another cost. That doesn't seem right.


Huh, you are right, will fix in v8.



Also, why do we need this?

 if (sortop != InvalidOid)
 {
 Oid funcOid = get_opcode(sortop);

 funcCost = get_func_cost(funcOid);
 }

Safety first :). Will remove.
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: _isnan() on Windows

2018-07-12 Thread Tom Lane
Andrew Dunstan  writes:
> On 07/12/2018 10:20 AM, Tom Lane wrote:
>> bowerbird and hamerkop have some gripes like this:
>> 
>> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
>> macro redefinition (src/pl/plperl/SPI.c) 
>> [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]

> We actually undef a bunch of things in plperl.h to keep the compiler 
> quiet. Maybe we need to add this to the list?

Perhaps.  But how do we tell the platforms where we should do that
from the ones where we shouldn't?

regards, tom lane



Re: _isnan() on Windows

2018-07-12 Thread Andrew Dunstan




On 07/12/2018 10:20 AM, Tom Lane wrote:

Michael Paquier  writes:

On Wed, Jul 11, 2018 at 09:13:40AM -0400, Alvaro Herrera wrote:

I just pushed it before seeing your message.

Fine as well, thanks for picking this up.  The buildfarm shows no
failures about this patch.

I scraped all the compiler warnings from the buildfarm this morning,
and I see no new ones that could be blamed on this change, so I think
we're good.

bowerbird and hamerkop have some gripes like this:

  bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/SPI.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
  bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/Util.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
  bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/plperl.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
  bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
  bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]
  bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/SPI.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
  bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/Util.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
  bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/plperl.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
  bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
  bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]

but those were there before too.  Not sure if there's anything
we can/should try to do about that.





We actually undef a bunch of things in plperl.h to keep the compiler 
quiet. Maybe we need to add this to the list?
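
Something along these lines in plperl.h, guarded so it is a no-op where the macro is
not defined, is what I have in mind (a sketch only, not tested on the affected animals):

/* in plperl.h, next to the existing #undefs (sketch) */
#ifdef isnan
#undef isnan
#endif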


cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: cost_sort() improvements

2018-07-12 Thread Teodor Sigaev
At least [1] and [2] have run into these issues and have objections/questions 
about the correctness of the cost_sort estimation. The suggested patch tries to 
improve the current estimation and solve those issues.


Sorry for the long delay, but the issue was even more complicated than I thought. When I 
tried to add cost_sort to the GROUP BY patch I found it doesn't work as well as I had hoped 
:(. The root of the problem is that the suggested cost_sort improvement doesn't take into 
account the uniqueness of the first column, so its cost is always the same. I experimented 
with an all-identical column and an all-unique column and found that execution time can 
differ significantly (up to 3 times on 1e7 randomly generated integer values). So I went 
to read the book and papers.


So, I suggest a new algorithm based on the ideas in [3], [4]. In terms of those papers, 
let me use the designations from my previous email, with Xi the number of tuples with 
key Ki; then the estimate is:

log(N! / (X1! * X2! * ..))  ~  sum(Xi * log(N/Xi))
In our case all Xi are the same, because for now we don't have an estimate of
group sizes, only an estimate of the number of groups. In this case the
formula becomes: N * log(NumberOfGroups). Next, to support correct estimation
of multicolumn sorts we need to compute each column separately, so, letting k
be a column number and Gk the number of groups defined by k columns:
N * sum( FCk * log(Gk) )
where FCk is sum(Cj) over the first k columns. Gk+1 is defined as estimate_num_groups(NGk), 
i.e. it's recursive, which means that the comparison of the k-th column includes the 
comparisons of all columns j < k.


Next, I found that this formula gives a significant underestimate for N < 1e4, and 
using [5] (see Chapter 8.2 and Theorem 4.1) I found that we can use the formula 
N + N * log(N), which effectively adds a multiplier in the simple case; but it's unclear 
to me how to add that multiplier to the multicolumn formula, so I just added a separate 
term proportional to N.
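
(Consolidating my reading of the above in one place - this is not a formula taken from 
the patch, and whether the extra linear term is also scaled by cpu_operator_cost is my 
assumption:)

    C_{\mathrm{cpu}} \;\approx\; \mathrm{cpu\_operator\_cost}\cdot\Bigl(c\,N \;+\; N \sum_{k} FC_k \,\log_2 G_k\Bigr)

where G_k is estimate_num_groups() over the first k sort columns, FC_k is the summed 
comparison function cost of those columns, and c is a small constant for the N term.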


As a demonstration, I attach the results of some tests; the *.sh and *.plt files are 
scripts to reproduce the results. Note that all charts are normalized, because the 
computed cost is not in milliseconds. p.pl is a parser for the JSON format of EXPLAIN ANALYZE.


N test - sort of unique values with varying numbers of tuples
NN test - same as the previous one, but sorting a single value (all values the same)
U test - fixed total number of values (1e7) but varying numbers of unique values

I hope the new estimation is much closer to real sort timing. The new estimation 
function gives results close to the old estimation function on trivial examples, but 
~10% more expensive; three regression tests don't pass, and I will look at them more 
closely later. The patch doesn't include regression test changes.



[1] https://commitfest.postgresql.org/18/1124/
[2] https://commitfest.postgresql.org/18/1651/

[3] https://algs4.cs.princeton.edu/home/, course Algorithms,
Robert Sedgewick, Kevin Wayne,
[4] "Quicksort Is Optimal For Many Equal Keys", Sebastian Wild,
arXiv:1608.04906v4 [cs.DS] 1 Nov 2017
[5] "Introduction to algorithms", Thomas H. Cormen, Charles E.
 Leiserson, Ronald L. Rivest, ISBN 0-07-013143-0
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a7e0c520..3588cff802 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1612,6 +1612,252 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	rterm->pathtarget->width);
 }
 
+/*
+ * is_fake_var
+ *		Workaround for generate_append_tlist(), which generates fake Vars with
+ *		varno == 0 that would cause a failure in the estimate_num_groups() call
+ */
+static bool
+is_fake_var(Expr *expr)
+{
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	return (IsA(expr, Var) && ((Var *) expr)->varno == 0);
+}
+
+/*
+ * get_width_cost_multiplier
+ *		Returns the relative complexity of comparing two values based on their width.
+ * The idea behind this is that longer values have more expensive comparisons.
+ * The return value is in cpu_operator_cost units.
+ */
+static double
+get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
+{
+	double	width = -1.0; /* fake value */
+
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	/* Try to find actual stats in the corresponding relation */
+	if (IsA(expr, Var))
+	{
+		Var		*var = (Var *) expr;
+
+		if (var->varno > 0 && var->varno < root->simple_rel_array_size)
+		{
+			RelOptInfo	*rel = root->simple_rel_array[var->varno];
+
+			if (rel != NULL &&
+				var->varattno >= rel->min_attr &&
+				var->varattno <= rel->max_attr)
+			{
+				int	ndx = var->varattno - rel->min_attr;
+
+				if (rel->attr_widths[ndx] > 0)
+					width = rel->attr_widths[ndx];
+			}
+		}
+	}
+
+	/* Didn't find any actual stats, use estimation by type */
+	if (width < 0.0)
+	{
+		Node	*node = (Node*) expr;
+
+		width = get_typavgwidth(exprType(node), exprTypmod(node)

Re: _isnan() on Windows

2018-07-12 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Jul 11, 2018 at 09:13:40AM -0400, Alvaro Herrera wrote:
>> I just pushed it before seeing your message.

> Fine as well, thanks for picking this up.  The buildfarm shows no
> failures about this patch.

I scraped all the compiler warnings from the buildfarm this morning,
and I see no new ones that could be blamed on this change, so I think
we're good.

bowerbird and hamerkop have some gripes like this:

 bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/SPI.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
 bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/Util.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
 bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/plperl.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
 bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
 bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]
 bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/SPI.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
 bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/Util.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
 bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition (src/pl/plperl/plperl.c) 
[G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
 bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
 bowerbird |   c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : 
macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]

but those were there before too.  Not sure if there's anything
we can/should try to do about that.

regards, tom lane



Re: [HACKERS] WAL logging problem in 9.4.3?

2018-07-12 Thread Heikki Linnakangas

On 12/07/18 16:51, Andrew Dunstan wrote:



On 07/10/2018 11:32 PM, Michael Paquier wrote:

On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:

Thanks for picking this up!

(I hope this gets through the email filters this time, sending a shell
script seems to be difficult. I also trimmed the CC list, if that helps.)

On 04/07/18 07:59, Michael Paquier wrote:

Hence I propose the patch attached which disables the TRUNCATE and COPY
optimizations for two cases, which are the ones actually causing
problems.  One solution has been presented by Simon here for COPY, which
is to disable the optimization when there are no blocks on a relation
with wal_level = minimal:
https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com
For back-patching, I find that really appealing.

This fails in the case that there are any WAL-logged changes to the table
while the COPY is running. That can happen at least if the table has an
INSERT trigger, that performs operations on the same table, and the COPY
fires the trigger. That scenario is covered by the little bash script I
posted earlier in this thread
(https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached
is a new version of that script, updated to make it work with v11.

Thanks for the pointer.  My tap test has been covering two out of the
three scenarios you have in your script.  I have been able to convert
the extra as the attached, and I have added as well an extra test with
TRUNCATE triggers.  So it seems to me that we want to disable the
optimization if any type of trigger are defined on the relation copied
to as it could be possible that these triggers work on the blocks copied
as well, for any BEFORE/AFTER and STATEMENT/ROW triggers.  What do you
think?


Yeah, this seems like the only sane approach.


Doesn't have to be a trigger, could be a CHECK constraint, datatype 
input function, etc. Admittedly, having a datatype input function that 
inserts to the table is worth a "huh?", but I'm not feeling very confident 
that we can catch all such cases, and some of them might even be sensible.
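
(For reference, the kind of check being discussed would look roughly like the following 
in CopyFrom(); as argued above, a trigger test alone would not be sufficient, so treat 
this strictly as a sketch:)

	/* sketch: only keep the WAL-skip optimization when nothing else,
	 * e.g. a trigger, could write to the table during the COPY */
	if (!XLogIsNeeded() &&
		cstate->rel->trigdesc == NULL)
		hi_options |= HEAP_INSERT_SKIP_WAL;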


- Heikki



Re: [HACKERS] WAL logging problem in 9.4.3?

2018-07-12 Thread Andrew Dunstan




On 07/10/2018 11:32 PM, Michael Paquier wrote:

On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:

Thanks for picking this up!

(I hope this gets through the email filters this time, sending a shell
script seems to be difficult. I also trimmed the CC list, if that helps.)

On 04/07/18 07:59, Michael Paquier wrote:

Hence I propose the patch attached which disables the TRUNCATE and COPY
optimizations for two cases, which are the ones actually causing
problems.  One solution has been presented by Simon here for COPY, which
is to disable the optimization when there are no blocks on a relation
with wal_level = minimal:
https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com
For back-patching, I find that really appealing.

This fails in the case that there are any WAL-logged changes to the table
while the COPY is running. That can happen at least if the table has an
INSERT trigger, that performs operations on the same table, and the COPY
fires the trigger. That scenario is covered by the little bash script I
posted earlier in this thread
(https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached
is a new version of that script, updated to make it work with v11.

Thanks for the pointer.  My tap test has been covering two out of the
three scenarios you have in your script.  I have been able to convert
the extra as the attached, and I have added as well an extra test with
TRUNCATE triggers.  So it seems to me that we want to disable the
optimization if any type of trigger are defined on the relation copied
to as it could be possible that these triggers work on the blocks copied
as well, for any BEFORE/AFTER and STATEMENT/ROW triggers.  What do you
think?




Yeah, this seems like the only sane approach.

cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Andrew Dunstan




On 04/06/2018 08:00 PM, Claudio Freire wrote:

On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  wrote:

On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  wrote:

On 06/04/18 01:59, Claudio Freire wrote:

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction.


Can you elaborate? It does return the items one block at a time. Is that
what you mean by being specific for vacuumlazy? I guess that's a bit
special, but if you imagine some other users for this abstraction, it's
probably not that unusual. For example, if we started using it in bitmap
heap scans, a bitmap heap scan would also want to get the TIDs one block
number at a time.

But you're also tying the caller to the format of the buffer holding
those TIDs, for instance. Why would you, when you can have an
interface that just iterates TIDs and let the caller store them
if/however they want?

I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate
the code, I don't think we'll get this in the next 24hs.

So I guess we'll have ample time to ponder both issues during the
next commit fest.





There doesn't seem to have been much pondering done since then, at least 
publicly. Can we make some progress on this? It's been around for a long 
time now.
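
For what it's worth, the "pure iterator" shape discussed above would look something 
like this - every type and function name here is made up for illustration, none of 
it exists in the patch or in core:

/* opaque store of dead TIDs; how it is stored (array, b-tree, ...) is private */
typedef struct DeadTupleStore DeadTupleStore;
typedef struct DeadTupleIter DeadTupleIter;

extern DeadTupleIter *dead_tuples_begin_iterate(DeadTupleStore *store);
extern bool dead_tuples_next(DeadTupleIter *iter, ItemPointerData *tid);	/* false at end */
extern void dead_tuples_end_iterate(DeadTupleIter *iter);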


cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: buildfarm vs code

2018-07-12 Thread Andrew Dunstan




On 07/12/2018 05:54 AM, Heikki Linnakangas wrote:

On 02/07/18 10:57, Peter Eisentraut wrote:

On 05.06.18 18:09, Andrew Dunstan wrote:

The first should be simple and non-controversial. It allows
src/tools/msvc/build.pl to be called in such a way that it only creates
the project files and then stops. This is a one line addition to the
script and should affect nobody not using the option. A patch for it is
attached.


This seems reasonable.


There's already a separate script for this, 
src/tools/msvc/mkvcbuild.pl. Doesn't that do the same thing?







Yes, it does, more or less. For some reason I'd completely forgotten 
about it. I'll withdraw the patch.


cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Binary difference in pg_internal.init after running pg_initdb multiple times

2018-07-12 Thread Tom Lane
"samuel.coulee" <313914...@qq.com> writes:
> In the PG source code function "write_relcache_init_file()", I found that
> the whole 'Relation' structs were directly written into the file
> 'pg_internal.init'. This brings some binary differences of that file,  if we
> run pg_initdb multiple times, because the struct 'Relation' contains some
> pointer fields.

There's never been any expectation that that file is bitwise-reproducible.
The order of entries is arbitrary, there are probably alignment padding
bytes that contain garbage, etc etc.  So I'm not buying into your apparent
goal here.

> And my question is : Could we clear the pointer values in 'Relation' before
> calling write_item(...)?

No; that's live data with no backup copy.  Conceivably we could copy the
structure and zero out useless fields before writing from the copy, but
that adds code, cycles, maintenance effort, and risk of bugs.
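
(For illustration, the copy-and-zero variant would look roughly like this inside
write_relcache_init_file(); which fields are worth clearing is illustrative, and
this is a sketch rather than a proposal:)

	RelationData tmp;

	memcpy(&tmp, rel, sizeof(RelationData));
	tmp.rd_smgr = NULL;			/* transient pointer, meaningless on disk */
	tmp.pgstat_info = NULL;		/* likewise */
	write_item(&tmp, sizeof(RelationData), fp);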

regards, tom lane



Re: pg_create_logical_replication_slot returns text instead of name

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 07:00:48PM +0900, Masahiko Sawada wrote:
> The documentation[1] says that both pg_create_logical_replication_slot
> and pg_create_physical_replication_slot returns slot_name as a name
> type. But only pg_create_logical_replication_slot returns it as text
> type. I think these should be united as the name type. Attached small
> patch fixes it.
> 
> [1] 
> https://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-REPLICATION

That's a small thing, but I agree with you.  As far as I can see slot
names are always mapped with the name type.  I'll push that tomorrow if
there are no objections.
--
Michael




Re: Negotiating the SCRAM channel binding type

2018-07-12 Thread Michael Paquier
On Wed, Jul 11, 2018 at 04:00:47PM +0300, Heikki Linnakangas wrote:
> Looking at the GnuTLS docs, I believe it has everything we need.
> gnutls_certificate_get_peers() and gnutls_certificate_get_ours() can be used
> to get the certificate, and gnutls_x509_crt_get_signature_algorithm() gets
> the signatureAlgorithm.

Looking at the docs, there is gnutls_x509_crt_get_fingerprint() which
can provide the certificate hash.  So if the signature algorithm  is MD5
or SHA-1, it would be simple enough to upgrade it to SHA-256 and
calculate the hash.  They have way better docs than OpenSSL, which is
nice.
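
Roughly like this, I think (an untested sketch based only on reading the GnuTLS docs, 
with error handling omitted; cert is the gnutls_x509_crt_t obtained from
gnutls_certificate_get_ours()/gnutls_certificate_get_peers()):

	unsigned char hash[64];
	size_t		hash_len = sizeof(hash);
	gnutls_digest_algorithm_t algo;

	algo = gnutls_sign_get_hash_algorithm(
		gnutls_x509_crt_get_signature_algorithm(cert));
	if (algo == GNUTLS_DIG_MD5 || algo == GNUTLS_DIG_SHA1 ||
		algo == GNUTLS_DIG_UNKNOWN)
		algo = GNUTLS_DIG_SHA256;		/* upgrade, per RFC 5929 */
	gnutls_x509_crt_get_fingerprint(cert, algo, hash, &hash_len);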
--
Michael




Re: Negotiating the SCRAM channel binding type

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 12:34:51PM +0300, Heikki Linnakangas wrote:
> Meh. We're not going implement tls-unique, anyway, in some of the upcoming
> non-OpenSSL TLS implementations that don't support it.

True enough.  Only GnuTLS supports it:
https://www.gnutls.org/manual/html_node/Channel-Bindings.html

> Hmm. That is actually in a section called "Default Channel Binding". And the
> first paragraph says "A default channel binding type agreement process for
> all SASL application protocols that do not provide their own channel binding
> type agreement is provided as follows". Given that, it's not entirely clear
> to me if the requirement to support tls-unique is for all implementations of
> SCRAM, or only those applications that don't provide their own channel
> binding type agreement.

I am not sure, but I get that as tls-unique must be the default if
available, so if it is technically possible to have it we should have it
in priority.  If not, then a channel binding type which is supported by
both the server and the client can be chosen.
--
Michael




Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-12 Thread Ashutosh Bapat
I think we should add this to open items list so that it gets tracked.

On Thu, Jul 12, 2018 at 6:31 PM, Ashutosh Bapat
 wrote:
> On Thu, Jul 12, 2018 at 11:10 AM, Amit Langote
>  wrote:
>>>
>>> I think your fix is correct.  I slightly modified it along with updating
>>> nearby comments and added regression tests.
>>
>> I updated regression tests to reduce lines.  There is no point in
>> repeating tests like v2 patch did.
>
> + *
> + * For hash partitioning however, it is possible to combine null and non-
> + * null keys in a pruning step, so do this only if *all* partition keys
> + * are involved in IS NULL clauses.
>
> I don't think this is true. When equality conditions and IS NULL clauses cover
> all partition keys of a hash partitioned table and do not have contradictory
> clauses, we should be able to find the partition which will remain unpruned. I
> see that we already have this supported in get_matching_hash_bounds()
> /*
>  * For hash partitioning we can only perform pruning based on equality
>  * clauses to the partition key or IS NULL clauses.  We also can only
>  * prune if we got values for all keys.
>  */
> if (nvalues + bms_num_members(nullkeys) == partnatts)
> {
>
>   */
> -if (!generate_opsteps)
> +if (!bms_is_empty(nullkeys) &&
> +(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
> + bms_num_members(nullkeys) == part_scheme->partnatts))
>
> So, it looks like we don't need bms_num_members(nullkeys) ==
> part_scheme->partnatts there.
>
> Also, I think, we don't know how some new partition strategy will treat NULL
> values so above condition looks wrong to me. Instead it should explicitly 
> check
> the strategies for which we know that the NULL values go to a single 
> partition.
>
>  /*
> - * Note that for IS NOT NULL clauses, simply having step suffices;
> - * there is no need to propagate the exact details of which keys are
> - * required to be NOT NULL.  Hash partitioning expects to see actual
> - * values to perform any pruning.
> + * There are no OpExpr's, but there are IS NOT NULL clauses, which
> + * can be used to eliminate the null-partition-key-only partition.
>
> I don't understand this. When there are IS NOT NULL clauses for all the
> partition keys, it's only then that we could eliminate the partition 
> containing
> NULL values, not otherwise.
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Binary difference in pg_internal.init after running pg_initdb multiple times

2018-07-12 Thread Tomas Vondra

On 07/12/2018 10:08 AM, samuel.coulee wrote:

Hi,

In the PG source code function "write_relcache_init_file()", I found that
the whole 'Relation' structs were directly written into the file
'pg_internal.init'. This brings some binary differences of that file,  if we
run pg_initdb multiple times, because the struct 'Relation' contains some
pointer fields.

And my question is : Could we clear the pointer values in 'Relation' before
calling write_item(...)?

No benefit regarding function or performance, just for binary
compatibility...


Binary compatibility with what? Can you describe a scenario where this 
actually matters?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-12 Thread Ashutosh Bapat
On Thu, Jul 12, 2018 at 11:10 AM, Amit Langote
 wrote:
>>
>> I think your fix is correct.  I slightly modified it along with updating
>> nearby comments and added regression tests.
>
> I updated regression tests to reduce lines.  There is no point in
> repeating tests like v2 patch did.

+ *
+ * For hash partitioning however, it is possible to combine null and non-
+ * null keys in a pruning step, so do this only if *all* partition keys
+ * are involved in IS NULL clauses.

I don't think this is true. When equality conditions and IS NULL clauses cover
all partition keys of a hash partitioned table and do not have contradictory
clauses, we should be able to find the partition which will remain unpruned. I
see that we already have this supported in get_matching_hash_bounds()
/*
 * For hash partitioning we can only perform pruning based on equality
 * clauses to the partition key or IS NULL clauses.  We also can only
 * prune if we got values for all keys.
 */
if (nvalues + bms_num_members(nullkeys) == partnatts)
{

  */
-if (!generate_opsteps)
+if (!bms_is_empty(nullkeys) &&
+(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+ bms_num_members(nullkeys) == part_scheme->partnatts))

So, it looks like we don't need bms_num_members(nullkeys) ==
part_scheme->partnatts there.

Also, I think, we don't know how some new partition strategy will treat NULL
values so above condition looks wrong to me. Instead it should explicitly check
the strategies for which we know that the NULL values go to a single partition.
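(I.e. something roughly like the following instead of the partnatts test - just my
rendering of that suggestion, not a tested change:)

	if (!bms_is_empty(nullkeys) &&
		(part_scheme->strategy == PARTITION_STRATEGY_LIST ||
		 part_scheme->strategy == PARTITION_STRATEGY_RANGE))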

 /*
- * Note that for IS NOT NULL clauses, simply having step suffices;
- * there is no need to propagate the exact details of which keys are
- * required to be NOT NULL.  Hash partitioning expects to see actual
- * values to perform any pruning.
+ * There are no OpExpr's, but there are IS NOT NULL clauses, which
+ * can be used to eliminate the null-partition-key-only partition.

I don't understand this. When there are IS NOT NULL clauses for all the
partition keys, it's only then that we could eliminate the partition containing
NULL values, not otherwise.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Temporary WAL segments files not cleaned up after an instance crash

2018-07-12 Thread Heikki Linnakangas

On 12/07/18 15:38, Michael Paquier wrote:

On Thu, Jul 12, 2018 at 01:15:03PM +0300, Heikki Linnakangas wrote:

On 12/07/18 10:44, Michael Paquier wrote:

+   snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
+   elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
+   unlink(path);


The elog message says "removed", but the removal actually happens after the
elog. "removing" would be more accurate.


Or just move the elog() after the file is actually removed?  Would you
be fine with that?


Sure.

- Heikki



Re: Temporary WAL segments files not cleaned up after an instance crash

2018-07-12 Thread Michael Paquier
On Thu, Jul 12, 2018 at 01:15:03PM +0300, Heikki Linnakangas wrote:
> On 12/07/18 10:44, Michael Paquier wrote:
> > +   snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
> > +   elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
> > +   unlink(path);
> 
> The elog message says "removed", but the removal actually happens after the
> elog. "removing" would be more accurate.

Or just move the elog() after the file is actually removed?  Would you
be fine with that?
--
Michael




Re: make installcheck-world in a clean environment

2018-07-12 Thread Alexander Lakhin
Hello Tom,

11.07.2018 23:15, Tom Lane wrote:
>
>> /make clean/
>> # Also you can just install binary packages to get the same state.
>> make installcheck-world
>> # This check fails.
> I do not think that should be expected to work.  It would require that
> "make installcheck" first invoke "make all" (to rebuild the stuff you
> threw away with "make clean"), which is rather antithetical to its
> purpose.  Specifically, installcheck is supposed to be checking something
> you already built; so having it do a fresh build seems to introduce
> version-skew hazards that we don't need.
If I understand correctly, the installcheck target should not use the
stuff in the build directory (that is thrown away with 'make clean'),
but instead should use/check the installed assets.
In fact with REL_10_STABLE you can run "make clean && make installcheck"
and it works just fine (without calling 'make all'). It's sufficient to
have a working instance running. And I think this behavior is correct —
"installcheck" is supposed to check not something we built ("check" is
supposed to do that), but something we installed.
And only if I run "make installcheck-world" with REL_10_STABLE I get an
error related to ECPG, I've referred to in the first message in this thread.
In the master branch there were some changes that prevent the "make clean &&
make installcheck" scenario from running, but I think that can (and should) be
fixed too.
(When I prepared the patches, there were no differences between these
branches in this aspect.)
So for me the question is what assets should the installcheck target be
checking? Installed or built ones? For example, if psql/pg_dump/ecpg/...
is installed in /usr/local/pgsql/bin/, should it be checked by the
installcheck? And if we target the installed stuff then why do we need
to build something (or have something built) for installcheck?

Best regards,
--

Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Copy function for logical replication slots

2018-07-12 Thread Masahiko Sawada
On Mon, Jul 9, 2018 at 10:34 AM, Michael Paquier  wrote:
> On Mon, Jul 09, 2018 at 10:06:00AM +0900, Masahiko Sawada wrote:
>> I think that this patch might be splitted but I will be able to send
>> an updated patch in the next week. As you suggestion this patch needs
>> more careful thoughts. I'll move this patch to the next commit fest if
>> I will not be able to sent it. Is that okay?
>
> Fine by me.  Thanks for the update.

Attached new version of patch incorporated the all comments I got from
Michael-san.

To prevent the WAL segment file at the origin slot's restart_lsn from being
removed while creating the target slot, I've chosen to copy the new slot
while holding the origin one. One problem with implementing it this way is
that the current replication slot code doesn't allow us to do that
straightforwardly; the code assumes that the process creating a new
slot is not holding another slot. So I've changed the copy functions
so that they temporarily save MyReplicationSlot and then restore and
release it after creating the target slot. To ensure that both the
origin and target slots are released properly in failure cases I used
PG_ENSURE_ERROR_CLEANUP(). That way, we can keep the code changes to
the logical decoding at a minimum. I thought about changing the
logical decoding code so that it assumes a process can hold more than
one slot at once, but that seems like overkill to me.
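
In pseudo-form the flow looks roughly like this (copy_slot_cleanup, dst_name and the
actual copying of state are placeholders here, not the real patch code):

	ReplicationSlot *src = MyReplicationSlot;	/* origin slot, already acquired */

	PG_ENSURE_ERROR_CLEANUP(copy_slot_cleanup, PointerGetDatum(src));
	{
		MyReplicationSlot = NULL;
		ReplicationSlotCreate(dst_name, true, RS_EPHEMERAL);
		/* ... copy restart_lsn etc. from src into the new slot ... */
		ReplicationSlotRelease();		/* release the target slot */
	}
	PG_END_ENSURE_ERROR_CLEANUP(copy_slot_cleanup, PointerGetDatum(src));

	MyReplicationSlot = src;			/* restore and release the origin slot */
	ReplicationSlotRelease();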

Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


v4-0001-Copy-function-for-logical-and-physical-replicatio.patch
Description: Binary data

