date:20190313

On 2019/03/14 15:03, Kyotaro HORIGUCHI wrote:
> At Thu, 14 Mar 2019 11:30:12 +0900, Amit Langote 
>  wrote in 
> <59e5a734-9e06-1035-385b-626717581...@lab.ntt.co.jp>
>> On 2019/03/13 21:03, Peter Eisentraut wrote:
>>> If a FOR ALL TABLES publication exists, unlogged tables are ignored
>>> for publishing changes.  But CheckCmdReplicaIdentity() would still
>>> check in that case that such a table has a replica identity set before
>>> accepting updates.  That is useless, so check first whether the given
>>> table is publishable and skip the check if not.
>>>
>>> Example:
>>>
>>> CREATE PUBLICATION pub FOR ALL TABLES;
>>> CREATE UNLOGGED TABLE logical_replication_test AS SELECT 1 AS number;
>>> UPDATE logical_replication_test SET number = 2;
>>> ERROR:  cannot update table "logical_replication_test" because it does
>>> not have a replica identity and publishes updates
>>>
>>> Patch attached.
>>
>> An email on -bugs earlier this morning complains of the same problem but
>> for temporary tables.
>>
>> https://www.postgresql.org/message-id/CAHOFxGr%3DmqPZXbAuoR7Nbq-bU4HxqVWHbTTUy5%3DPKQut_F0%3DXA%40mail.gmail.com
>>
>> It seems your patch fixes their case too.
> 
> Is it the right thing that GetRelationPublicationsActions sets
> wrong rd_publicatons for the relations?

Actually, after applying Peter's patch, maybe we should add an
Assert(is_publishable_relation(relation)) at the top of
GetRelationPublicationActions(), also adding a line in the function header
comment that callers must ensure that.  There's only one caller at the
moment anyway, which Peter's patch is fixing to ensure that.

Thanks,
Amit

Improve the generic plan mechanism

2019-03-13 Thread Kato, Sho

Hi, hackers

I want to improve the generic plan mechanism and speed up the UPDATE/DELETE 
planning of a table partitioned into thousands.
However, I am not sure if this is realistic, I would like advice.

The current generic plan creates access plans for all child tables without 
using the parameters specified in EXECUTE.
Also, inheritance_planner() creates access plans for all child tables after 
copying the RTEs of the parent to subroot.

As mentioned above, I think it is wasteful to create access plans for all child 
tables including child tables that may not be accessed in EXECUTE.
Therefore, instead of creating an access plan for all child tables at one time, 
I think that it would be better to create an access plan for the child table at 
the time of EXECUTE execution and add it to the cached plans (plans of 
ModifyTable).

According to my research, I know that there are two issues.

#1 How to add a partially created plan to an existing cached plan
#2 Which child table plan to create

#1
In order to partially create a plan, it is necessary to consider which 
information is cached to create a plan in addition to the plan cache.
Also, it is necessary to consider how to manage cached plans.

#2
If a parameter of EXEXUTE is specified in boundParams of pg_queries(), a custom 
plan will be created, so it is necessary to use bounParams to narrow down the 
child tables for creating a plan.

At first I will try to work from #1.

regards,

Sho Kato

Re: WIP: Avoid creation of the free space map for small tables

2019-03-13 Thread Amit Kapila

On Thu, Mar 14, 2019 at 7:08 AM John Naylor  wrote:
>
> > [segfault problems]
>
> This now seems spurious. I ran make distclean, git pull, reapplied the
> patch (leaving out the gettimeofday() calls), and now my upgrade perf
> test works with default compiler settings. Not sure what happened, but
> hopefully we can move forward.
>

Yeah, I took another pass through the patch and change minor things as
you can see in the attached patch.

1. Added an Assert in new_cluster_needs_fsm() that old cluster version
should be >= 804 as that is where fsm support has been added.
2. Reverted the old cluster version check to <= 1100.  There was
nothing wrong in the way you have written a check, but I find most of
the other places in the code uses that way.
3. At one place changed the function header to be consistent with the
nearby code, run pgindent on the code and changed the commit message.

Let me know what you think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

v24-0001-During-pg_upgrade-conditionally-skip-transfer-of-FSM.patch
Description: Binary data

RE: Timeout parameters

2019-03-13 Thread Fabien COELHO




HI think that your patch is responsible for the added option at least.

I agree that the underlying issue that other parameters should probably
also be reused, which would be a bug fix, does not belong to this thread.


This doesn't seem to be a bug.  \connect just establishes a new connection, not 
reusing the previous settings for most connection parameters.  As the examples 
in the following manual page show, the user needs to specify necessary 
connection parameters.

https://www.postgresql.org/docs/devel/app-psql.html

=> \c service=foo
=> \c "host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable"


But I'm afraid the description of \connect may not be clear enough about 
connection parameters, and that may cause users to expect the reuse of 
all connection parameter settings.


I think that the typical use-case of \c is to connect to another database 
on the same host, at least that what I do pretty often. The natural 
expectation is that the same "other" connection parameters are used, 
otherwise it does not make much sense, and there is already a whole logic 
of reusing the previous settings in psql, at least wrt describing the 
target (host, port, user…).


Anyway, I think this would be an improvement for psql's documentation or 
new feature for psql.  What do you think?


I think that we should fix the behavior rather than document the current 
weirdness. I do not think that we should introduce more weirdness.


--
Fabien.

Re: Fix handling of unlogged tables in FOR ALL TABLES publications

At Thu, 14 Mar 2019 11:30:12 +0900, Amit Langote 
 wrote in 
<59e5a734-9e06-1035-385b-626717581...@lab.ntt.co.jp>
> On 2019/03/13 21:03, Peter Eisentraut wrote:
> > If a FOR ALL TABLES publication exists, unlogged tables are ignored
> > for publishing changes.  But CheckCmdReplicaIdentity() would still
> > check in that case that such a table has a replica identity set before
> > accepting updates.  That is useless, so check first whether the given
> > table is publishable and skip the check if not.
> > 
> > Example:
> > 
> > CREATE PUBLICATION pub FOR ALL TABLES;
> > CREATE UNLOGGED TABLE logical_replication_test AS SELECT 1 AS number;
> > UPDATE logical_replication_test SET number = 2;
> > ERROR:  cannot update table "logical_replication_test" because it does
> > not have a replica identity and publishes updates
> > 
> > Patch attached.
> 
> An email on -bugs earlier this morning complains of the same problem but
> for temporary tables.
> 
> https://www.postgresql.org/message-id/CAHOFxGr%3DmqPZXbAuoR7Nbq-bU4HxqVWHbTTUy5%3DPKQut_F0%3DXA%40mail.gmail.com
> 
> It seems your patch fixes their case too.

Is it the right thing that GetRelationPublicationsActions sets
wrong rd_publicatons for the relations?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 2704c5fbb65ebfee144a37c6b34eccdd853033a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 14 Mar 2019 15:02:20 +0900
Subject: [PATCH 1/2] step1

---
 src/backend/utils/cache/relcache.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index eba77491fd..a43fb040cb 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5089,6 +5089,8 @@ GetRelationPublicationActions(Relation relation)
 		return memcpy(pubactions, relation->rd_pubactions,
 	  sizeof(PublicationActions));
 
+	if (is_publishable_relation(relation))
+	{
 	/* Fetch the publication membership info. */
 	puboids = GetRelationPublications(RelationGetRelid(relation));
 	puboids = list_concat_unique_oid(puboids, GetAllTablesPublications());
@@ -5121,6 +5123,7 @@ GetRelationPublicationActions(Relation relation)
 			pubactions->pubdelete && pubactions->pubtruncate)
 			break;
 	}
+	}
 
 	if (relation->rd_pubactions)
 	{
-- 
2.16.3

>From b0946c5b97857cf476975306ce78355f9316e722 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 14 Mar 2019 15:02:54 +0900
Subject: [PATCH 2/2] step2: fix indentation

---
 src/backend/utils/cache/relcache.c | 50 +++---
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a43fb040cb..f5dff5572e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5091,38 +5091,38 @@ GetRelationPublicationActions(Relation relation)
 
 	if (is_publishable_relation(relation))
 	{
-	/* Fetch the publication membership info. */
-	puboids = GetRelationPublications(RelationGetRelid(relation));
-	puboids = list_concat_unique_oid(puboids, GetAllTablesPublications());
+		/* Fetch the publication membership info. */
+		puboids = GetRelationPublications(RelationGetRelid(relation));
+		puboids = list_concat_unique_oid(puboids, GetAllTablesPublications());
 
-	foreach(lc, puboids)
-	{
-		Oid			pubid = lfirst_oid(lc);
-		HeapTuple	tup;
-		Form_pg_publication pubform;
+		foreach(lc, puboids)
+		{
+			Oid			pubid = lfirst_oid(lc);
+			HeapTuple	tup;
+			Form_pg_publication pubform;
 
-		tup = SearchSysCache1(PUBLICATIONOID, ObjectIdGetDatum(pubid));
+			tup = SearchSysCache1(PUBLICATIONOID, ObjectIdGetDatum(pubid));
 
-		if (!HeapTupleIsValid(tup))
-			elog(ERROR, "cache lookup failed for publication %u", pubid);
+			if (!HeapTupleIsValid(tup))
+elog(ERROR, "cache lookup failed for publication %u", pubid);
 
-		pubform = (Form_pg_publication) GETSTRUCT(tup);
+			pubform = (Form_pg_publication) GETSTRUCT(tup);
 
-		pubactions->pubinsert |= pubform->pubinsert;
-		pubactions->pubupdate |= pubform->pubupdate;
-		pubactions->pubdelete |= pubform->pubdelete;
-		pubactions->pubtruncate |= pubform->pubtruncate;
+			pubactions->pubinsert |= pubform->pubinsert;
+			pubactions->pubupdate |= pubform->pubupdate;
+			pubactions->pubdelete |= pubform->pubdelete;
+			pubactions->pubtruncate |= pubform->pubtruncate;
 
-		ReleaseSysCache(tup);
+			ReleaseSysCache(tup);
 
-		/*
-		 * If we know everything is replicated, there is no point to check for
-		 * other publications.
-		 */
-		if (pubactions->pubinsert && pubactions->pubupdate &&
-			pubactions->pubdelete && pubactions->pubtruncate)
-			break;
-	}
+			/*
+			 * If we know everything is replicated, there is no point to check
+			 * for other publications.
+			 */
+			if (pubactions->pubinsert && pubactions->pubupdate &&
+pubactions->pubdelete && pubactions->pubtruncate)
+break;
+		}
 	}
 
 	if (relation->rd_pubactions)
-- 
2.16.3

Re: Progress reporting for pg_verify_checksums

On Thu, Mar 14, 2019 at 11:54:17AM +0900, Kyotaro HORIGUCHI wrote:
> Why this patch changes the behavior for temprary directories? It
> seems like a bug fix of pg_checksums.

Oops, that's a thinko from 5c995139, so fixed.  Note this has no
actual consequence though as PG_TEMP_FILE_PREFIX and PG_TEMP_FILES_DIR
have the same value.  However a bug is a bug.
--
Michael


signature.asc
Description: PGP signature

Re: Timeout parameters

At Thu, 14 Mar 2019 03:33:20 +, "Tsunakawa, Takayuki" 
 wrote in 
<0A3221C70F24FB45833433255569204D1FBC4191@G01JPEXMBYT05>
> From: Robert Haas [mailto:robertmh...@gmail.com]
> > But that's not what it will do.  As long as the server continues to
> > dribble out protocol messages from time to time, the timeout will
> > never fire no matter how much time passes.  I saw a system once where
> > every 8kB read took many seconds to complete; such a system could
> > dribble out sequential scan results over an arbitrarily long period of
> > time without ever tripping the timeout.
> 
> I understood hat the example is about an SELECT that returns multiple rows.  
> If so, statement_timeout should handle it, shouldn't it?

If so, in turn the socket_timeout doesn't work as expected? I
understand that what is proposed here is to disconnect after that
time of waiting for *the first tuple* of a query, regardless of
it is a long query or network failure. On of the problems raised
here is the socket_timetout patch doesn't work that way?

In order to at least to make it work as expected, it should arm a
timeout at PQsendQuery and disarm at the first response.

> >  If you really want to return
> > control to the user in any situation, what you can do is use the libpq
> > APIs in asynchronous mode which, barring certain limitations of the
> > current implementation, will actually allow you to maintain control
> > over the connection at all times.
> 
> Maybe.  But the users aren't often in a situation to modify the application 
> to use libpq asynchronous APIs.

I can't imagine that in the cases where other than applications
cannot be rebuild for some reasons. (E.g. the source code has
been lost or the source code no longer be built on modern
environment..)

If an application is designed to have such a requirement, mustn't
it have implemented that by itself by any means? For example, an
application on JDBC can be designed to kill a querying thread
that takes longer than threshold.

> > I think the use case for a timeout that has both false positives (i.e.
> > it will fire even when there's no problem, as when the connection is
> > legitimately idle) and false negatives (i.e. it will fail to trigger
> > when there is a problem, as when there are periodic notices or
> > notifies from the server connection) is extremely limited if not
> > nonexistent, and I think the potential for users to be confused is
> > really high.
> 
> My understanding is that the false positive case doesn't occur,
> because libpq doesn't wait on the socket while the client is
> idle and not communicating SQL request/response.  As for the
> false negative case, resetting the timer upon notices or
> notifies receipt is good, because they show that the database
> server is functioning.  socket_timeout is not a mechanism to
> precisely limit the duration of query request/response.  It is
> kind of a stop-gap, last resort to assure return control within
> reasonable amount of time, rather than minutes or hours.

Doesn't TCP_KEEPALIVE work that way?

statement_timeout works on a live connection, TCP_USER_TIMEOUT
works for an already failed network but if the network fails
after a backend already sent the ack for a query request, it
doesn't work. TCP_KEEPALIVE would work for the case where any
packet doesn't reach each other for a certain time. Is there any
other situations to save?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Sparse bit set data structure

2019-03-13 Thread Andrey Borodin

Hi!

> 14 марта 2019 г., в 0:18, Heikki Linnakangas  написал(а):
> <0001-Add-SparseBitset-to-hold-a-large-set-of-64-bit-ints-.patch><0002-Andrey-Borodin-s-test_blockset-tool-adapted-for-Spar.patch>

That is very interesting idea. Basically, B-tree and radix tree is a tradeoff 
between space and time.

In general, lookup into radix tree will touch less CPU cache lines.
In this terms Bitmapset is on most performant and memory-wasteful side: lookup 
into Bitmapset is always 1 cache line.
Performance of radix tree can be good in case of skewed distribution, while 
B-tree will be OK on uniform. I think that distribution of GiST inner pages is 
uniform, distribution of empty leafs is... I have no idea, let's consider 
uniform too.

I'd review this data structure ASAP. I just need to understand: do we aim to 
v12 or v13? (I did not solve concurrency issues in GiST VACUUM yet, but will 
fix them at weekend)

Also, maybe we should consider using RoaringBitmaps? [0]
As a side not I would add that while balanced trees are widely used for 
operations on external memory, there are more performant versions for main 
memory. Like AVL-tree and RB-tree.


Brest regards, Andrey Borodin.

[0] https://github.com/RoaringBitmap/CRoaring

Re: current_logfiles not following group access and instead follows log_file_mode permissions

On Tue, Mar 12, 2019 at 04:08:53PM -0400, Robert Haas wrote:
> Anybody who has permission to read the log files but not the data
> directory will presumably hit the directory-level permissions on
> $PGDATA before the issue of the permissions on current_logfiles() per
> se become relevant, except in corner cases that I don't care about.

Sane deployments normally split the log directory and the main data
folder into separate partitions, and use an absolute path for
log_directory.  So, FWIW, I can live with the original proposal as
well.
--
Michael


signature.asc
Description: PGP signature

Re: pgsql: Add support for hyperbolic functions, as well as log10().

Dean Rasheed  writes:
> I'm amazed that jacana's asinh() returned -0 for an input of +0.

Even more amusingly, it returns NaN for acosh('infinity'), cf
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-03-14%2003%3A00%3A34

Presumably that means they calculated "infinity - infinity" at some
point, but why?

So far, no other failures ...

regards, tom lane

Re: Offline enabling/disabling of data checksums

On Wed, Mar 13, 2019 at 08:56:33PM +0900, Michael Paquier wrote:
> On Wed, Mar 13, 2019 at 12:09:24PM +0100, Michael Banck wrote:
> > The attached patch should do the above, on top of Michael's last
> > patchset.
> 
> What you are doing here looks like a good defense in itself.

More thoughts on that, and here is a short summary of the thread.

+   /* Check if control file has changed */
+   if (controlfile_last_updated != ControlFile->time)
+   {
+   fprintf(stderr, _("%s: control file has changed since startup\n"), 
progname);
+   exit(1);
+   }
Actually, under the conditions discussed on this thread that Postgres
is started in parallel of pg_checksums, imagine the following
scenario:
- pg_checksums starts to enable checksums, it reads a block to
calculate its checksum, then calculates it.
- Postgres has been started in parallel, writes the same block.
- pg_checksums finishes the block calculation, writes back the block
it has just read.
- Postgres stops, some data is lost.
- At the end of pg_checksums, we complain that the control file has
been updated since the start of pg_checksums.
I think that we should be way more noisy about this error message
document properly that Postgres should not be started while checksums
are enabled.  Basically, I guess that it should mention that there is
a risk of corruption because of this parallel operation.

Hence, based on what I could read on this thread, we'd like to have
the following things added to the core patch set:
1) When enabling checksums, fsync the data folder.  Then update the
control file, and finally fsync the control file (+ flush of the
parent folder to make the whole durable).  This way a host crash never
actually means that we finish in an inconsistent state.
2) When checksums are disabled, update the control file, then fsync
it + its parent folder.
3) Add tracking of the control file data, and complain loudly before
trying to update the file if there are any inconsistencies found.
4) Document with a big-fat-red warning that postgres should not be
worked on while the tool is enabling or disabling checksums.

There is a part discussed about standbys and primaries with not the
same checksum settings, but I commented on that in [1].

There is a secondary bug fix to prevent the tool to run if the data
folder has been initialized with a block size different than what
pg_checksums has been compiled with in [2].  The patch looks good,
still the error message could be better per my lookup.

[1]: https://www.postgresql.org/message-id/20190314002342.gc3...@paquier.xyz
[2]: https://www.postgresql.org/message-id/20190313224742.ga3...@paquier.xyz

Am I missing something?
--
Michael

signature.asc
Description: PGP signature

Proposal to suppress errors thrown by to_reg*()

2019-03-13 Thread Takuma Hoshiai

Hi, hackers,

According to the document, "to_reg* functions return null rather than
throwing an error if the name is not found", but this is not the case
if the arguments to those functions are schema qualified and the
caller does not have access permission of the schema even if the table
(or other object) does exist -- we get an error.

For example, to_regclass() throws an error if its argument is
'schema_name.table_name'' (i.e. contains schema name) and caller's
role doesn't have access permission of the schema. Same thing can be
said to Other functions,

I get complain from Pgpool-II users because it uses to_regclass()
internally to confirm table's existence but in the case above it's
not useful because the error aborts user's transaction.

To be more consistent with the doc and to make those functions more
useful, I propose to change current implementation so that they do not
throw an error if the name space cannot be accessible by the caller.

Attached patch implements this. Comments and suggestions are welcome.

-- 
Takuma Hoshiai 


fix_to_reg.patch
Description: Binary data

Re: BUG #15668: Server crash in transformPartitionRangeBounds

On Wed, Mar 13, 2019 at 03:17:47PM +0900, Amit Langote wrote:
> but on HEAD, you get:
> 
> create table foo (a int default (avg(foo.a)));
> ERROR:  aggregate functions are not allowed in DEFAULT expressions

I actually think that what you propose here makes more sense than what
HEAD does because the most inner expression gets evaluated first.
This for example generates the same error as on HEAD:
=# create table foo (a int default (avg(1)));
ERROR:  42803: aggregate functions are not allowed in DEFAULT expressions
--
Michael


signature.asc
Description: PGP signature

Re: Tid scan improvements

2019-03-13 Thread David Rowley

On Mon, 4 Feb 2019 at 18:37, Edmund Horner  wrote:
> 2. v6-0002-Support-backward-scans-over-restricted-ranges-in-hea.patch
> 3. v6-0003-Support-range-quals-in-Tid-Scan.patch
> 4. v6-0004-TID-selectivity-reduce-the-density-of-the-last-page-.patch

These ones need a rebase.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Special role for subscriptions

At Wed, 13 Mar 2019 23:03:26 -0400, Stephen Frost  wrote in 
<20190314030326.gq6...@tamriel.snowman.net>
> Greetings,
> 
> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Mon, Mar 11, 2019 at 10:39 PM Michael Paquier  
> > wrote:
> > > On Mon, Mar 11, 2019 at 06:32:10PM -0700, Jeff Davis wrote:
> > > >  * Is the original idea of a special role still viable?
> > >
> > > In my opinion, that part may be valuable.  The latest patches proposed
> > > change the way tables are filtered and listed on the subscription
> > > side, lowering the permission to spawn a new thread and to connect to a
> > > publication server by just being a database owner instead of being a
> > > superuser, and that's quite a gap.
> > 
> > I agree.  I think the original idea was better than what Stephen
> > suggested, and for basically the reasons you mention.
> > 
> > However, I'm not sure that you are right when you say "just being a
> > database owner."  I think that what's being proposed is that anybody
> > who is a *table* owner could make PostgreSQL run off and try to sync
> > that table from a remote server in perpetuity.  That seems like WAY
> > too much access to give an unprivileged user.  I don't think we want
> > unprivileged users to be able to launch more or less permanent
> > background processes, nor do we want them to be able to initiate
> > outbound network traffic from the server.  Whether we want database
> > owners to be able to do those things is more debatable, but even that
> > would represent a significant expansion of their current rights, IIUC.
> > 
> > Just letting the superuser decide who gets to create subscriptions
> > seems good enough from here.
> 
> It seems I wasn't very clear earlier in the thread- I *definitely* think
> we need to have a special role which a superuser can grant to certain
> roles (possibly with the permission to grant further) to allow them to
> create subscriptions, and that's definitely distinct from "database
> owner" and shouldn't be combined with that.
> 
> I view that as the first step towards building a more granular privilege
> system for subscription creation, and that was the second half of what I
> was trying to say before- I do think there's value in having something
> more granular than just "this role can create ANY subscription".  As an
> administrator, I might be fine with subscriptions to system X, but not
> to system Y, for example.  As long as we don't block off the ability to
> build something finer grained in the future, then having the system role
> to allow a given role to do create subscription seems fine to me.

The subscription privileges is completely reasonable, but I've
heard of users who want to restrict tables on which a role can
make subscription. Subscription causes INSERT/UPDATE/DELETE to a
table, is it too-much to check the privileges addition to the
subscription privileges?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Tid scan improvements

2019-03-13 Thread David Rowley

On Mon, 4 Feb 2019 at 18:37, Edmund Horner  wrote:
> 1. v6-0001-Add-selectivity-estimate-for-CTID-system-variables.patch

I think 0001 is good to go. It's a clear improvement over what we do today.

(t1 = 1 million row table with a single int column.)

Patched:
# explain (analyze, timing off) select * from t1 where ctid < '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=315 width=4) (actual
rows=315 loops=1)

# explain (analyze, timing off) select * from t1 where ctid <= '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=316 width=4) (actual
rows=316 loops=1)

Master:
# explain (analyze, timing off) select * from t1 where ctid < '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=33 width=4) (actual
rows=315 loops=1)

# explain (analyze, timing off) select * from t1 where ctid <= '(1, 90)';
 Seq Scan on t1  (cost=0.00..16925.00 rows=33 width=4) (actual
rows=316 loops=1)

The only possible risk I can foresee is that it may be more likely we
underestimate the selectivity and that causes something like a nested
loop join due to the estimation being, say 1 row.

It could happen in a case like:

SELECT * FROM bloated_table WHERE ctid >= 

but I don't think we should keep using DEFAULT_INEQ_SEL just in case
this happens. We could probably fix 90% of those cases by returning 2
rows instead of 1.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

RE: Timeout parameters

2019-03-13 Thread Tsunakawa, Takayuki

From: Robert Haas [mailto:robertmh...@gmail.com]
> But that's not what it will do.  As long as the server continues to
> dribble out protocol messages from time to time, the timeout will
> never fire no matter how much time passes.  I saw a system once where
> every 8kB read took many seconds to complete; such a system could
> dribble out sequential scan results over an arbitrarily long period of
> time without ever tripping the timeout.

I understood hat the example is about an SELECT that returns multiple rows.  If 
so, statement_timeout should handle it, shouldn't it?

>  If you really want to return
> control to the user in any situation, what you can do is use the libpq
> APIs in asynchronous mode which, barring certain limitations of the
> current implementation, will actually allow you to maintain control
> over the connection at all times.

Maybe.  But the users aren't often in a situation to modify the application to 
use libpq asynchronous APIs.


> I think the use case for a timeout that has both false positives (i.e.
> it will fire even when there's no problem, as when the connection is
> legitimately idle) and false negatives (i.e. it will fail to trigger
> when there is a problem, as when there are periodic notices or
> notifies from the server connection) is extremely limited if not
> nonexistent, and I think the potential for users to be confused is
> really high.

My understanding is that the false positive case doesn't occur, because libpq 
doesn't wait on the socket while the client is idle and not communicating SQL 
request/response.  As for the false negative case, resetting the timer upon 
notices or notifies receipt is good, because they show that the database server 
is functioning.  socket_timeout is not a mechanism to precisely limit the 
duration of query request/response.  It is kind of a stop-gap, last resort to 
assure return control within reasonable amount of time, rather than minutes or 
hours.


Regards
Takayuki Tsunakawa

Re: pgsql: Add support for hyperbolic functions, as well as log10().

At Wed, 13 Mar 2019 23:18:27 -0400, Tom Lane  wrote in 
<2503.1552533...@sss.pgh.pa.us>
tgl> Andrew Dunstan  writes:
tgl> > Or we could possibly call the function and then turn a result of -0 into 
0?
tgl> 
tgl> But -0 is the correct output if the input is -0.  So that approach
tgl> requires distinguishing -0 from 0, which is annoyingly difficult.

I think just turning both of -0 and +0 into +0 works, and, FWIW,
it is what is done in geo_ops.c (e.g. line_construct()) as a kind
of normalization and I think it is legit for geo_ops, but I don't
think so for fundamental functions like (d)asinh().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pgsql: Add support for hyperbolic functions, as well as log10().

Andrew Dunstan  writes:
> Or we could possibly call the function and then turn a result of -0 into 0?

But -0 is the correct output if the input is -0.  So that approach
requires distinguishing -0 from 0, which is annoyingly difficult.

regards, tom lane

Re: pgsql: Add support for hyperbolic functions, as well as log10().

2019-03-13 Thread Andrew Dunstan



On 3/13/19 5:56 PM, Tom Lane wrote:
> Michael Paquier  writes:
>> On Tue, Mar 12, 2019 at 11:16:42PM -0400, Tom Lane wrote:
>>> I'm inclined to leave it as-is for a day or so and see if any
>>> other failures turn up, before deciding what to do about it.
>> Fine by me.
> Well, so far jacana is the only critter that's shown any problem.
>
> I don't find any of the possible solutions to be super attractive:
>
> 1. Put in an explicit special case, along the lines of
>
>   if (arg1 == 0.0)
>   result = arg1;/* Handle 0 and -0 explicitly */
>   else
>   result = asinh(arg1);
>
> Aside from being ugly, this'd mean that our regression tests weren't
> really exercising the library asinh function at all.


Or we could possibly call the function and then turn a result of -0 into 0?


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Special role for subscriptions

2019-03-13 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Mar 11, 2019 at 10:39 PM Michael Paquier  wrote:
> > On Mon, Mar 11, 2019 at 06:32:10PM -0700, Jeff Davis wrote:
> > >  * Is the original idea of a special role still viable?
> >
> > In my opinion, that part may be valuable.  The latest patches proposed
> > change the way tables are filtered and listed on the subscription
> > side, lowering the permission to spawn a new thread and to connect to a
> > publication server by just being a database owner instead of being a
> > superuser, and that's quite a gap.
> 
> I agree.  I think the original idea was better than what Stephen
> suggested, and for basically the reasons you mention.
> 
> However, I'm not sure that you are right when you say "just being a
> database owner."  I think that what's being proposed is that anybody
> who is a *table* owner could make PostgreSQL run off and try to sync
> that table from a remote server in perpetuity.  That seems like WAY
> too much access to give an unprivileged user.  I don't think we want
> unprivileged users to be able to launch more or less permanent
> background processes, nor do we want them to be able to initiate
> outbound network traffic from the server.  Whether we want database
> owners to be able to do those things is more debatable, but even that
> would represent a significant expansion of their current rights, IIUC.
> 
> Just letting the superuser decide who gets to create subscriptions
> seems good enough from here.

It seems I wasn't very clear earlier in the thread- I *definitely* think
we need to have a special role which a superuser can grant to certain
roles (possibly with the permission to grant further) to allow them to
create subscriptions, and that's definitely distinct from "database
owner" and shouldn't be combined with that.

I view that as the first step towards building a more granular privilege
system for subscription creation, and that was the second half of what I
was trying to say before- I do think there's value in having something
more granular than just "this role can create ANY subscription".  As an
administrator, I might be fine with subscriptions to system X, but not
to system Y, for example.  As long as we don't block off the ability to
build something finer grained in the future, then having the system role
to allow a given role to do create subscription seems fine to me.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Progress reporting for pg_verify_checksums

Hello.

At Wed, 13 Mar 2019 16:25:15 +0900, Michael Paquier  wrote 
in <20190313072515.gb2...@paquier.xyz>
> On Wed, Mar 13, 2019 at 07:22:28AM +0100, Fabien COELHO wrote:
> > Does not apply because of the renaming committed by Michaël.
> > 
> > Could you rebase?
> 
> This stuff touches pg_checksums.c, so you may want to wait one day or
> two to avoid extra work...  I think that I'll be able to finish the
> addition of the --enable and --disable switches by then.

> + /* Make sure we report at most once every tenth of a second */
> + if ((INSTR_TIME_GET_MILLISEC(now) -
> +  INSTR_TIME_GET_MILLISEC(last_progress_update) < 100) && !force)

I'm not a skilled gamer and 10Hz update is a bit hard for my
eyes:p The second hand of old clocks ticks at 4Hz. I think it is
enough frequently.

> 841/1516 MB (55%, 700 MB/s)

Differently from other prgress reporting project, estimating ETC
(estimated time to completion), which is in the most important,
is quite easy. (pgbench does that.) And I'd like to see a
progress bar there but it might be overdoing. I'm not sure let
the current size occupy a part of screen width is needed.  I
don't think total size is needed to be fixed in MB and I think at
most four or five digits are enough. (That is the same for
speed.)

If the all of aboves are involved, the line would look as the
follows.

[=== ] ( 63% of 12.53 GB, 179 MB/s, ETC 26s)

# Note that this is just an opinion.

(pg_checksum runs fast at the beginning so ETC behaves somewhat
strange in the meanwhile.)

> +#define MEGABYTES 1048576

 1024 * 1024 is preferable than that many digits.

> + /* we handle SIGUSR1 only, and toggle the value of show_progress */
> + if (signum == SIGUSR1)
> + show_progress = !show_progress;

SIGUSR1 *toggles* progress. A garbage line is left alone after
turning it off. It would be erasable. I'm not sure which is
better, though.

> 
> @@ -167,7 +255,7 @@ scan_directory(const char *basedir, const char *subdir)
>   if (strncmp(de->d_name,
>   PG_TEMP_FILES_DIR,
>   strlen(PG_TEMP_FILES_DIR)) == 0)
> - return;
> + continue;

Why this patch changes the behavior for temprary directories? It
seems like a bug fix of pg_checksums.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pgsql: Add support for hyperbolic functions, as well as log10().

Robert Haas  writes:
> On Wed, Mar 13, 2019 at 10:39 PM Tom Lane  wrote:
>> No, but that's not the hazard.  I have a very fresh-in-mind example:
>> at one point while tweaking Laetitia's patch, I'd accidentally changed
>> datanh so that it called tanh not atanh.  The previous set of tests did
>> not reveal that :-(

> Well, that was a goof, but it's not likely that such a regression will
> ever be reintroduced.

Sure, but how about this for another example: maybe a given platform
hasn't got these functions (or they're in a different library we
didn't pull in), but you don't see a failure until you actually
call them.  We try to set up our link options so that that sort
of failure is reported at build time, but I wouldn't care to bet
that we've succeeded at that everywhere.

regards, tom lane

Re: pgsql: Add support for hyperbolic functions, as well as log10().

On Wed, Mar 13, 2019 at 10:39 PM Tom Lane  wrote:
> Robert Haas  writes:
> > On Wed, Mar 13, 2019 at 8:49 PM Tom Lane  wrote:
> >> Meh.  As I said before, we're not in the business of improving on what
> >> libm does --- if someone has a beef with the results, they need to take
> >> it to their platform's libm maintainer, not us.  The point of testing
> >> this at all is just to ensure that we've wired up the SQL functions
> >> to the library functions correctly.
>
> > Pretty sure we don't even need a test for that.  asinh() isn't going
> > to call creat() by mistake.
>
> No, but that's not the hazard.  I have a very fresh-in-mind example:
> at one point while tweaking Laetitia's patch, I'd accidentally changed
> datanh so that it called tanh not atanh.  The previous set of tests did
> not reveal that :-(

Well, that was a goof, but it's not likely that such a regression will
ever be reintroduced.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Timeout parameters

On Wed, Mar 13, 2019 at 10:20 PM Tsunakawa, Takayuki
 wrote:
> I think the purpose of socket_timeout is to avoid infinite or unduely long 
> wait and return response to users, where other existing timeout parameters 
> wouldn't help.  For example, OS's process scheduling or paging/swapping 
> problems might cause long waits before postgres gets control (e.g. due to 
> Transparent HugePage (THP)).  Within postgres, the unfair lwlock can 
> unexpected long waits (I experienced dozens of seconds per wait on 
> ProcArrayLock, which was unbelievable.)  Someone may say that those should be 
> fixed or improved instead of introducing this parameter, but I think it's 
> good to have this as a stop-gap measure.  In other words, we can suggest 
> setting this parameter when the user asks "How can I return control to the 
> end user in any situation?"

But that's not what it will do.  As long as the server continues to
dribble out protocol messages from time to time, the timeout will
never fire no matter how much time passes.  I saw a system once where
every 8kB read took many seconds to complete; such a system could
dribble out sequential scan results over an arbitrarily long period of
time without ever tripping the timeout.  If you really want to return
control to the user in any situation, what you can do is use the libpq
APIs in asynchronous mode which, barring certain limitations of the
current implementation, will actually allow you to maintain control
over the connection at all times.

I think the use case for a timeout that has both false positives (i.e.
it will fire even when there's no problem, as when the connection is
legitimately idle) and false negatives (i.e. it will fail to trigger
when there is a problem, as when there are periodic notices or
notifies from the server connection) is extremely limited if not
nonexistent, and I think the potential for users to be confused is
really high.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pgsql: Add support for hyperbolic functions, as well as log10().

Robert Haas  writes:
> On Wed, Mar 13, 2019 at 8:49 PM Tom Lane  wrote:
>> Meh.  As I said before, we're not in the business of improving on what
>> libm does --- if someone has a beef with the results, they need to take
>> it to their platform's libm maintainer, not us.  The point of testing
>> this at all is just to ensure that we've wired up the SQL functions
>> to the library functions correctly.

> Pretty sure we don't even need a test for that.  asinh() isn't going
> to call creat() by mistake.

No, but that's not the hazard.  I have a very fresh-in-mind example:
at one point while tweaking Laetitia's patch, I'd accidentally changed
datanh so that it called tanh not atanh.  The previous set of tests did
not reveal that :-(

regards, tom lane

Re: Fix handling of unlogged tables in FOR ALL TABLES publications

On 2019/03/13 21:03, Peter Eisentraut wrote:
> If a FOR ALL TABLES publication exists, unlogged tables are ignored
> for publishing changes.  But CheckCmdReplicaIdentity() would still
> check in that case that such a table has a replica identity set before
> accepting updates.  That is useless, so check first whether the given
> table is publishable and skip the check if not.
> 
> Example:
> 
> CREATE PUBLICATION pub FOR ALL TABLES;
> CREATE UNLOGGED TABLE logical_replication_test AS SELECT 1 AS number;
> UPDATE logical_replication_test SET number = 2;
> ERROR:  cannot update table "logical_replication_test" because it does
> not have a replica identity and publishes updates
> 
> Patch attached.

An email on -bugs earlier this morning complains of the same problem but
for temporary tables.

https://www.postgresql.org/message-id/CAHOFxGr%3DmqPZXbAuoR7Nbq-bU4HxqVWHbTTUy5%3DPKQut_F0%3DXA%40mail.gmail.com

It seems your patch fixes their case too.

Thanks,
Amit

Re: pgsql: Add support for hyperbolic functions, as well as log10().

On Wed, Mar 13, 2019 at 8:49 PM Tom Lane  wrote:
> Meh.  As I said before, we're not in the business of improving on what
> libm does --- if someone has a beef with the results, they need to take
> it to their platform's libm maintainer, not us.  The point of testing
> this at all is just to ensure that we've wired up the SQL functions
> to the library functions correctly.

Pretty sure we don't even need a test for that.  asinh() isn't going
to call creat() by mistake.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

RE: Timeout parameters

2019-03-13 Thread Tsunakawa, Takayuki

From: Robert Haas [mailto:robertmh...@gmail.com]
> The first thing I notice about the socket_timeout patch is that the
> documentation is definitely wrong:

Agreed.  I suppose the description should be clearer about:

* the purpose and what situation this timeout will help: not for canceling a 
long-running query or recovering from a network failure.
* relationship with other timeout parameters: e.g., socket_timeout > 
connect_timeout, socket_timeout > statement_timeout, socket_timeout > TCP 
keepalive/user timeout


> Actually, I think that socket_timeout_v8.patch is a bad idea and
> should be rejected, not because it doesn't send a cancel to the server
> but just because it's not the right way of solving either of the
> problems that it purports to solve.  If you want a timeout for
> queries, you need that to be administered by the server, which knows
> when it starts and finishes executing a query; you cannot substitute a
> client-side judgement based on the last time we received a byte of
> data.  Maybe a client-side judgement would work if the timer were
> armed when we send a Query or Execute message and disarmed when we
> receive ReadyForQuery.  And, if you want a network problem detector,
> then you should use keepalives and perhaps the TCP_USER_TIMEOUT stuff,
> so that you don't kill perfectly good connections that just happen to
> be idle.

I think the purpose of socket_timeout is to avoid infinite or unduely long wait 
and return response to users, where other existing timeout parameters wouldn't 
help.  For example, OS's process scheduling or paging/swapping problems might 
cause long waits before postgres gets control (e.g. due to Transparent HugePage 
(THP)).  Within postgres, the unfair lwlock can unexpected long waits (I 
experienced dozens of seconds per wait on ProcArrayLock, which was 
unbelievable.)  Someone may say that those should be fixed or improved instead 
of introducing this parameter, but I think it's good to have this as a stop-gap 
measure.  In other words, we can suggest setting this parameter when the user 
asks "How can I return control to the end user in any situation?"


Regards
Takayuki Tsunakawa

Re: ToDo: show size of partitioned table

On 2019/03/14 2:11, Pavel Stehule wrote:
> st 13. 3. 2019 v 8:02 odesílatel Amit Langote 
> napsal:
>> Not a native speaker either, but I like Justin's changes.  Although I
>> noticed that he missed changing one sentence to look like other similar
>> sentences.
>>
>> What Justin did:
>>
>> -indexes) and associated description are also displayed.
>> +indexes) is displayed, as is the relation's description.
>>  
>>
>> What I think he meant to do:
>>
>> -indexes) and associated description are also displayed.
>> +indexes) is also displayed, along with the associated description.
>>  
>>
>>
>>> I am not sure if we use labels in form "some: detail" somewhere.
>>
>> I haven't seen column names like that either.  How about:
>>
>> -  gettext_noop("Direct partitions size"));
>> +  gettext_noop("Leaf partition size"));
>>
>> -  gettext_noop("Total partitions size"));
>> +  gettext_noop("Total size"));
>>
>> -  gettext_noop("Partitions size"));
>> +  gettext_noop("Total size"));
>>
> 
> +1
> 
> Pavel
> 
> 
>> I've attached v11 of the patch, which merges most of Justin's changes and
>> some of my own on top -- documentation and partition size column names.

Maybe, we should set this ready for committer then?

Thanks,
Amit

Re: Using the return value of strlcpy() and strlcat()

Ashwin Agrawal  writes:
> On Wed, Mar 13, 2019 at 9:51 AM Tom Lane  wrote:
>> I don't think that's a safe transformation: what strlcpy returns is
>> strlen(src), which might be different from what it was actually
>> able to fit into the destination.
>> Sure, they're equivalent if no truncation occurred; but if we were
>> 100.00% sure of no truncation, we'd likely not bother with strlcpy.

> So, if return value < length (3rd argument) we should be able to use the
> return value and avoid the strlen, else do the strlen ?

Mmm ... if there's a way to do it that's not messy and typo-prone,
maybe.  But I'm dubious that the potential gain is worth complicating
the code.  The strings involved aren't usually all that long.

regards, tom lane

RE: Timeout parameters

2019-03-13 Thread Tsunakawa, Takayuki

From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> >> If the user reconnects, eg "\c db", the setting is lost. The
> >> re-connection handling should probably take care of this parameter, and
> maybe others.
> > I think your opinion is reasonable, but it seems not in this thread.
> 
> HI think that your patch is responsible for the added option at least.
> 
> I agree that the underlying issue that other parameters should probably
> also be reused, which would be a bug fix, does not belong to this thread.

This doesn't seem to be a bug.  \connect just establishes a new connection, not 
reusing the previous settings for most connection parameters.  As the examples 
in the following manual page show, the user needs to specify necessary 
connection parameters.

https://www.postgresql.org/docs/devel/app-psql.html

=> \c service=foo
=> \c "host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable"


But I'm afraid the description of \connect may not be clear enough about 
connection parameters, and that may cause users to expect the reuse of all 
connection parameter settings.  Anyway, I think this would be an improvement 
for psql's documentation or new feature for psql.  What do you think?


Regards
Takayuki Tsunakawa

Re: Using the return value of strlcpy() and strlcat()

2019-03-13 Thread Ashwin Agrawal

On Wed, Mar 13, 2019 at 9:51 AM Tom Lane  wrote:

> ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
> > [ let's convert
> > + strlcpy(buf + buflen, name, NAMEDATALEN);
> > + buflen += strlen(buf + buflen);
> > to
> > + buflen += strlcpy(buf + buflen, name, NAMEDATALEN);
> > ]
>
> I don't think that's a safe transformation: what strlcpy returns is
> strlen(src), which might be different from what it was actually
> able to fit into the destination.
>
> Sure, they're equivalent if no truncation occurred; but if we were
> 100.00% sure of no truncation, we'd likely not bother with strlcpy.
>

So, if return value < length (3rd argument) we should be able to use the
return value and avoid the strlen, else do the strlen ?

Re: hyrax vs. RelationBuildPartitionDesc

On 2019/03/14 5:18, Tom Lane wrote:
> Robert Haas  writes:
>> On Wed, Mar 13, 2019 at 3:14 PM Tom Lane  wrote:
>>> Meanwhile, who's going to take point on cleaning up rd_partcheck?
>>> I don't really understand this code well enough to know whether that
>>> can share one of the existing partitioning-related sub-contexts.
> 
>> To your question, I think it probably can't share a context.  Briefly,
>> rd_partkey can't change ever, except that when a partitioned relation
>> is in the process of being created it is briefly NULL; once it obtains
>> a value, that value cannot be changed.  If you want to range-partition
>> a list-partitioned table or something like that, you have to drop the
>> table and create a new one.  I think that's a perfectly acceptable
>> permanent limitation and I have no urge whatever to change it.
>> rd_partdesc changes when you attach or detach a child partition.
>> rd_partcheck is the reverse: it changes when you attach or detach this
>> partition to or from a parent.
> 
> Got it.  Yeah, it seems like the clearest and least bug-prone solution
> is for those to be in three separate sub-contexts.  The only reason
> I was trying to avoid that was the angle of how to back-patch into 11.
> However, we can just add the additional context pointer field at the
> end of the Relation struct in v11, and that should be good enough to
> avoid ABI problems.

Agree that rd_partcheck needs its own context as you have complained in
the past [1].

I think we'll need to back-patch this fix to PG 10 as well.  I've attached
patches for all 3 branches.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/22236.1523374067%40sss.pgh.pa.us
diff --git a/src/backend/utils/cache/partcache.c 
b/src/backend/utils/cache/partcache.c
index 2b55f25e75..5f0b4dbc0c 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -373,7 +373,12 @@ generate_partition_qual(Relation rel)
elog(ERROR, "unexpected whole-row reference found in partition 
key");
 
/* Save a copy in the relcache */
-   oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+   rel->rd_partcheckcxt = AllocSetContextCreate(CacheMemoryContext,
+   
 "partition constraint",
+   
 ALLOCSET_DEFAULT_SIZES);
+   MemoryContextCopyAndSetIdentifier(rel->rd_partcheckcxt,
+ 
RelationGetRelationName(rel));
+   oldcxt = MemoryContextSwitchTo(rel->rd_partcheckcxt);
rel->rd_partcheck = copyObject(result);
MemoryContextSwitchTo(oldcxt);
 
diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index eba77491fd..383570340a 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1203,6 +1203,12 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
/* It's fully valid */
relation->rd_isvalid = true;
 
+   /*
+* Initialized by generate_partition_qual() if it's ever called in this
+* session.
+*/
+   relation->rd_partcheckcxt = NULL;
+
return relation;
 }
 
@@ -2313,8 +2319,8 @@ RelationDestroyRelation(Relation relation, bool 
remember_tupdesc)
MemoryContextDelete(relation->rd_partkeycxt);
if (relation->rd_pdcxt)
MemoryContextDelete(relation->rd_pdcxt);
-   if (relation->rd_partcheck)
-   pfree(relation->rd_partcheck);
+   if (relation->rd_partcheckcxt)
+   MemoryContextDelete(relation->rd_partcheckcxt);
if (relation->rd_fdwroutine)
pfree(relation->rd_fdwroutine);
pfree(relation);
@@ -5544,6 +5550,7 @@ load_relcache_init_file(bool shared)
rel->rd_partkey = NULL;
rel->rd_pdcxt = NULL;
rel->rd_partdesc = NULL;
+   rel->rd_partcheckcxt = NULL;
rel->rd_partcheck = NIL;
rel->rd_indexprs = NIL;
rel->rd_indpred = NIL;
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 54028515a7..c2eae7ef62 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,9 @@ typedef struct RelationData
 
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+   /* Memory context to hold the content of rd_partcheck. */
+   MemoryContext   rd_partcheckcxt;
 } RelationData;
 
 
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 8219d05d83..a13e1185b0 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1883,7 +1883,12 @@ generate_partition_qual(Relation rel)
elog(ERROR, "unexpected whole-row reference found i

Re: WIP: Avoid creation of the free space map for small tables

2019-03-13 Thread John Naylor

> [segfault problems]

This now seems spurious. I ran make distclean, git pull, reapplied the
patch (leaving out the gettimeofday() calls), and now my upgrade perf
test works with default compiler settings. Not sure what happened, but
hopefully we can move forward.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: pgsql: Add support for hyperbolic functions, as well as log10().

Dean Rasheed  writes:
> It's not unreasonable to expect these functions to be accurate to within
> the last 1 or 2 digits, so testing with extra_float_digits or whatever
> seems reasonable, but I think a wider variety of test inputs is required.

Meh.  As I said before, we're not in the business of improving on what
libm does --- if someone has a beef with the results, they need to take
it to their platform's libm maintainer, not us.  The point of testing
this at all is just to ensure that we've wired up the SQL functions
to the library functions correctly.

> I also wonder if we should be doing what we do for the regular trig
> functions and explicitly handle special cases like Inf and NaN to ensure
> POSIX compatibility on all platforms.

I'm not too excited about this, but perhaps it would be interesting to
throw in tests of the inf/nan cases temporarily, just to see how big
a problem there is of that sort.  If the answer comes out to be
"all modern platforms get this right", I don't think it's our job to
clean up after the stragglers.  But if the answer is not that, maybe
I could be talked into spending code on it.

regards, tom lane

Re: pgsql: Add support for hyperbolic functions, as well as log10().

2019-03-13 Thread Dean Rasheed

On Wed, 13 Mar 2019, 21:56 Tom Lane,  wrote:

>
> Of these, probably the least bad is #3, even though it might require
> a few rounds of experimentation to find the best extra_float_digits
> setting to use.  I'll go try it without any roundoff, just to see
> what the raw results look like ...
>

Yeah, that seems like a reasonable thing to try.

I'm amazed that jacana's asinh() returned -0 for an input of +0. I'm not
aware of any implementation that does that. I'd be quite interested to know
what it returned for an input like 1e-20. If that returned any variety of
zero, I'd say that it's worse than useless. Another interesting test case
would be whether or not it satisfies asinh(-x) = -asinh(x) for a variety of
different values of x, because that's something that commonly breaks down
badly with naive implementations.

It's not unreasonable to expect these functions to be accurate to within
the last 1 or 2 digits, so testing with extra_float_digits or whatever
seems reasonable, but I think a wider variety of test inputs is required.

I also wonder if we should be doing what we do for the regular trig
functions and explicitly handle special cases like Inf and NaN to ensure
POSIX compatibility on all platforms.

Regards,
Dean

>

Re: Offline enabling/disabling of data checksums

On Wed, Mar 13, 2019 at 12:24:21PM +0100, Magnus Hagander wrote:
> Enabling or disabling the checksums offline on the master quite clearly
> requires a rebuild of the standby, there is no other way (this is one of
> the reasons for the online enabling in that patch, so I still hope we can
> get that done -- but not for this version).

I am curious to understand why this would require a rebuild of the
standby.  Technically FPWs don't update the checksum of a page when it
is WAL-logged, so even if a primary and a standby don't agree on the
checksum configuration, it is the timing where pages are flushed in
the local instance which counts for checksum correctness.

> You mean if the backend and pg_checksums is built with different blocksize?
> Yeah, that sounds like something which is a cheap check and should be done.

Yes, we should check after that, checksum calculation uses BLCKSZ with
a hardcoded value, so a mismatch would cause computation failures.  It
could be possible to not have this restriction if we made the block
size an argument of the checksum calculation though.
--
Michael


signature.asc
Description: PGP signature

Re: pg_rewind : feature to rewind promoted standby is broken!

On Thu, Mar 14, 2019 at 12:15:57AM +0530, Mithun Cy wrote:
> I have not looked into the patch but quick test show it has fixed the above
> issue.

Thanks for confirming, Mythun.  I'll think about the logic of this
patch for a couple of days in the background, then I'll try to commit
it likely at the beginning of next week.
--
Michael


signature.asc
Description: PGP signature

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-13 Thread Peter Geoghegan

On Tue, Mar 12, 2019 at 11:40 AM Robert Haas wrote:
> I think it's pretty clear that we have to view that as acceptable. I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is the index bloat
disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization -- a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites the examples of several
*extremely* bloated indexes on a single-column, which is obviously low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
--
Peter Geoghegan

Re: Offline enabling/disabling of data checksums

On Wed, Mar 13, 2019 at 02:43:39PM +0300, Sergei Kornilov wrote:
> Seems good. And I think we need backpath this check to pg11. similar
> to cross-version compatibility checks 

+fprintf(stderr, _("%s: data directory block size %d is different to 
compiled-in block size %d.\n"),
+progname, ControlFile->blcksz, BLCKSZ);
The error message looks grammatically a bit weird to me.  What about
the following?  Say:
"database block size of %u is different from supported block size of
%u."
Better ideas are welcome.

Please note that hose integers are unsigned by the way.
--
Michael


signature.asc
Description: PGP signature

Re: Inadequate executor locking of indexes

2019-03-13 Thread David Rowley

Thanks for having a look at this.

On Wed, 13 Mar 2019 at 22:45, Amit Langote
 wrote:
> I have one question about the relation between idxlockmode and
> rellockmode?  From skimming the patch, it appears that they're almost
> always set to the same value.  If so, why not use rellockmode for index
> locking too?

Maybe that's possible, but it would mean giving up on locking indexes
during DELETE with AccessShareLock. That would become
RowExclusiveLock. Also FOR UPDATE would lock indexes with RowShareLock
instead of AccessShareLock.

> > #2 would not address Tom's mention of there possibly being some way to
> > have the index scan node initialise before the modify table node
> > (currently we pass NoLock for indexes belonging to result rels in the
> > index scan nodes).  I can't quite imagine at the moment how that's
> > possible, but maybe my imagination is just not good enough.  We could
> > fix that by passing RowExclusiveLock instead of NoLock. It just means
> > we'd discover the lock already exists in the local lock table...
> > unless of course there is a case where the index scan gets initialised
> > before modify table is.
>
> Tom gave an example upthread that looked like this:

[...]

> But if finalize_lockmodes() in your patch set lockmodes correctly,
> ExecInitBitmapIndexScan() called for x1 ought to lock the index with
> RowExclusiveLock, no?

Yeah, in the executor the bitmap index scan uses a RowExclusiveLock.
There's still the issue in the planner that get_relation_info() will
still take an AccessShareLock when planning the 1st CTE and then
upgrade it to RowExclusiveLock once it gets to the 2nd.  It's not
really very obvious how that can be fixed, but at least we don't start
using indexes without any locks.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: REINDEX CONCURRENTLY 2.0

2019-03-13 Thread Peter Eisentraut

On 2019-03-13 15:13, Sergei Kornilov wrote:
> Patch is marked as target version 12, but is inactive few weeks long. I think 
> many users want this feature and patch is in good shape. We have open 
> questions on this thread?
> 
> Latest patch still can be aplied cleanly; it builds and pass tests.

Here is an updated patch.  It includes the typo fix in the documentation
from you, some small bug fixes, a new reindexdb --concurrently option,
and based on that some whole-database tests, as discussed recently.

I think this addresses all open issues.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 83df99b2862f02e1b200b2db0c036d75f5df936a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 13 Mar 2019 23:01:29 +0100
Subject: [PATCH v9] REINDEX CONCURRENTLY

This adds the CONCURRENTLY option to the REINDEX command.  A REINDEX
CONCURRENTLY on a specific index behaves like

CREATE INDEX CONCURRENTLY new
DROP INDEX CONCURRENTLY old
ALTER INDEX new RENAME TO old

based on existing functionality.  The REINDEX command also has the
capability to run its other variants (TABLE, DATABASE, SYSTEM) with
the CONCURRENTLY option (although SYSTEM will end up skipping all
system tables).

The reindexdb command gets the --concurrently option.

Author: TODO
Reviewed-by: TODO
Discussion: 
https://www.postgresql.org/message-id/flat/60052986-956b-4478-45ed-8bd119e9b9cf%402ndquadrant.com#74948a1044c56c5e817a5050f554ddee
---
 doc/src/sgml/mvcc.sgml|   1 +
 doc/src/sgml/ref/reindex.sgml | 185 +++-
 doc/src/sgml/ref/reindexdb.sgml   |  10 +
 src/backend/catalog/index.c   | 544 ++-
 src/backend/catalog/pg_depend.c   | 143 +++
 src/backend/commands/indexcmds.c  | 879 +++---
 src/backend/commands/tablecmds.c  |  32 +-
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/parser/gram.y |  22 +-
 src/backend/tcop/utility.c|  10 +-
 src/bin/psql/common.c |  16 +
 src/bin/psql/tab-complete.c   |  18 +-
 src/bin/scripts/reindexdb.c   |  42 +-
 src/bin/scripts/t/090_reindexdb.pl|  30 +-
 src/include/catalog/dependency.h  |   5 +
 src/include/catalog/index.h   |  16 +
 src/include/commands/defrem.h |   6 +-
 src/include/nodes/parsenodes.h|   1 +
 .../expected/reindex-concurrently.out |  78 ++
 src/test/isolation/isolation_schedule |   1 +
 .../isolation/specs/reindex-concurrently.spec |  40 +
 src/test/regress/expected/create_index.out|  97 ++
 src/test/regress/sql/create_index.sql |  63 ++
 24 files changed, 2059 insertions(+), 182 deletions(-)
 create mode 100644 src/test/isolation/expected/reindex-concurrently.out
 create mode 100644 src/test/isolation/specs/reindex-concurrently.spec

diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index bedd9a008d..9b7ef8bf09 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -926,6 +926,7 @@ Table-level Lock Modes
 
  Acquired by VACUUM (without FULL),
  ANALYZE, CREATE INDEX 
CONCURRENTLY,
+ REINDEX CONCURRENTLY,
  CREATE STATISTICS, and certain ALTER
  INDEX and ALTER TABLE variants (for full
  details see  and 
 
-REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } 
name
+REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ 
CONCURRENTLY ] name
 
  
 
@@ -67,10 +67,7 @@ Description
  
   An index build with the CONCURRENTLY option failed, 
leaving
   an invalid index. Such indexes are useless but it can be
-  convenient to use REINDEX to rebuild them. Note that
-  REINDEX will not perform a concurrent build. To build 
the
-  index without interfering with production you should drop the index and
-  reissue the CREATE INDEX CONCURRENTLY command.
+  convenient to use REINDEX to rebuild them.
  
 
 
@@ -151,6 +148,21 @@ Parameters
 

 
+   
+CONCURRENTLY
+
+ 
+  When this option is used, PostgreSQL will 
rebuild the
+  index without taking any locks that prevent concurrent inserts,
+  updates, or deletes on the table; whereas a standard reindex build
+  locks out writes (but not reads) on the table until it's done.
+  There are several caveats to be aware of when using this option
+  — see .
+ 
+
+   
+

 VERBOSE
 
@@ -241,6 +253,161 @@ Notes
Each individual partition can be reindexed separately instead.
   
 
+  
+   Rebuilding Indexes 
Concurrently
+
+   
+index
+rebuilding concurrently
+   
+
+   
+Rebuilding an index can interfere with regular operation of a database.
+N

Re: Using condition variables to wait for checkpoints

2019-03-13 Thread Thomas Munro

On Thu, Mar 14, 2019 at 1:15 AM Robert Haas  wrote:
> On Tue, Mar 12, 2019 at 7:12 PM Andres Freund  wrote:
> > Having useful infrastructure is sure cool.
>
> Yay!

+1

I renamed the CVs because the names I had used before broke the
convention that variables named ckpt_* are protected by ckpt_lck, and
pushed.

There are some other things like this in the tree (grepping for
poll/pg_usleep loops finds examples in xlog.c, standby.c, ...).  That
might be worth looking into.

-- 
Thomas Munro
https://enterprisedb.com

Re: pgsql: Add support for hyperbolic functions, as well as log10().

Michael Paquier  writes:
> On Tue, Mar 12, 2019 at 11:16:42PM -0400, Tom Lane wrote:
>> I'm inclined to leave it as-is for a day or so and see if any
>> other failures turn up, before deciding what to do about it.

> Fine by me.

Well, so far jacana is the only critter that's shown any problem.

I don't find any of the possible solutions to be super attractive:

1. Put in an explicit special case, along the lines of

if (arg1 == 0.0)
result = arg1;/* Handle 0 and -0 explicitly */
else
result = asinh(arg1);

Aside from being ugly, this'd mean that our regression tests weren't
really exercising the library asinh function at all.

2. Drop that test case entirely, again leaving us with no coverage
of the asinh function.

3. Switch to some other test value besides 0.  This is also kinda ugly
because we almost certainly won't get identical results everywhere.
However, we could probably make the results pretty portable by using
extra_float_digits to suppress the low-order digit or two.  (If we did
that, I'd be inclined to do similarly for the other hyperbolic functions,
just so we're testing cases that actually show different results, and
thereby proving we didn't fat-finger which function we're calling.)

4. Carry an additional expected-results file.

5. Write our own asinh implementation.  Dean already did the work, of
course, but I think this'd be way overkill just because one platform
did their roundoff handling sloppily.  We're not in the business
of implementing transcendental functions better than libm does it.

Of these, probably the least bad is #3, even though it might require
a few rounds of experimentation to find the best extra_float_digits
setting to use.  I'll go try it without any roundoff, just to see
what the raw results look like ...

regards, tom lane

Re: [HACKERS] Block level parallel vacuum

On Wed, Mar 13, 2019 at 1:56 AM Masahiko Sawada  wrote:
> I don't have a strong opinion but the using a Node would be more
> suitable in the future when we add more options to vacuum. And it
> seems to me that it's unlikely to change a Node to a plain struct. So
> there is an idea of doing it now anyway if we might need to do it
> someday.

I just tried to apply 0001 again and noticed a conflict in the
autovac_table structure in postmaster.c.

That conflict got me thinking: aren't parameters and options an awful
lot alike?  Why do we need to pass around a VacuumOptions structure
*and* a VacuumParams structure to all of these functions?  Couldn't we
just have one?  That led to the attached patch, which just gets rid of
the separate options flag and folds it into VacuumParams.  If we took
this approach, the degree of parallelism would just be another thing
that would get added to VacuumParams, and VacuumOptions wouldn't end
up existing at all.

This patch does not address the question of what the *parse tree*
representation of the PARALLEL option should look like; the idea would
be that ExecVacuum() would need to extra the value for that option and
put it into VacuumParams just as it already does for various other
things in VacuumParams.  Maybe the most natural approach would be to
convert the grammar productions for the VACUUM options list so that
they just build a list of DefElems, and then have ExecVacuum() iterate
over that list and make sense of it, as for example ExplainQuery()
already does.

I kinda like the idea of doing it that way, but then I came up with
it, so maybe you or others will think it's terrible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

vacuum-options-into-params.patch
Description: Binary data

Re: hyrax vs. RelationBuildPartitionDesc

2019-03-13 Thread Andres Freund

Hi,

On 2019-03-13 17:10:55 -0400, Tom Lane wrote:
> There's already a mechanism in there to suppress child contexts after
> 100 or so, which would almost inevitably kick in on the relcache if we
> did this.  So I don't believe we'd have a problem with the context dumps
> getting too long --- more likely, the complaints would be the reverse.

Well, that's two sides of the same coin.


> Having said that, I do agree that CacheMemoryContext is too much of an
> undifferentiated blob right now, and splitting it up seems like it'd be
> good for accountability.  I'd definitely be +1 for a catcache vs. relcache
> vs. other caches split.

That'd make a lot of sense.


> You could imagine per-catcache contexts, too.
> The main limiting factor here is that the per-context overhead could get
> excessive.

Yea, per relcache entry contexts seem like they'd get really expensive
fast.  Even per-catcache seems like it might be noticable additional
overhead for a new backend.

Greetings,

Andres Freund

Re: hyrax vs. RelationBuildPartitionDesc

Andres Freund  writes:
> On 2019-03-13 16:50:53 -0400, Robert Haas wrote:
>> On Wed, Mar 13, 2019 at 4:38 PM Robert Haas  wrote:
>>> On Wed, Mar 13, 2019 at 4:18 PM Tom Lane  wrote:
 Off topic for the moment, since this clearly wouldn't be back-patch
 material, but I'm starting to wonder if we should just have a context
 for each relcache entry and get rid of most or all of the retail
 cleanup logic in RelationDestroyRelation ...

>> It just occurred to me that one advantage of this would be that you
>> could see how much memory was being used by each relcache entry using
>> MemoryContextStats(), which seems super-appealing.

> But it might also make it frustrating to look at memory context dumps -
> we'd suddenly have many many more memory context lines we'd displayed,
> right? Wouldn't that often make the dump extremely long?

There's already a mechanism in there to suppress child contexts after
100 or so, which would almost inevitably kick in on the relcache if we
did this.  So I don't believe we'd have a problem with the context dumps
getting too long --- more likely, the complaints would be the reverse.

My gut feeling is that right now relcache entries tend to be mas-o-menos
the same size, except for stuff that is *already* in sub-contexts, like
index and partition descriptors.  So I'm not that excited about this
adding useful info to context dumps.  I was just looking at it as a way
to make relcache entry cleanup simpler and less leak-prone.

Having said that, I do agree that CacheMemoryContext is too much of an
undifferentiated blob right now, and splitting it up seems like it'd be
good for accountability.  I'd definitely be +1 for a catcache vs. relcache
vs. other caches split.  You could imagine per-catcache contexts, too.
The main limiting factor here is that the per-context overhead could get
excessive.

regards, tom lane

Re: hyrax vs. RelationBuildPartitionDesc

2019-03-13 Thread Andres Freund

Hi,

On 2019-03-13 16:50:53 -0400, Robert Haas wrote:
> On Wed, Mar 13, 2019 at 4:38 PM Robert Haas  wrote:
> > On Wed, Mar 13, 2019 at 4:18 PM Tom Lane  wrote:
> > > Off topic for the moment, since this clearly wouldn't be back-patch
> > > material, but I'm starting to wonder if we should just have a context
> > > for each relcache entry and get rid of most or all of the retail
> > > cleanup logic in RelationDestroyRelation ...
> >
> > I think that idea might have a lot of merit, but I haven't studied it 
> > closely.
> 
> It just occurred to me that one advantage of this would be that you
> could see how much memory was being used by each relcache entry using
> MemoryContextStats(), which seems super-appealing.  In fact, what
> about getting rid of all allocations in CacheMemoryContext itself in
> favor of some more specific context in each case?  That would make it
> a lot clearer where to look for leaks -- or efficiencies.

But it might also make it frustrating to look at memory context dumps -
we'd suddenly have many many more memory context lines we'd displayed,
right? Wouldn't that often make the dump extremely long?

To be clear, I think the idea has merit. Just want to raise the above
point.

Greetings,

Andres Freund

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 4:38 PM Robert Haas  wrote:
> On Wed, Mar 13, 2019 at 4:18 PM Tom Lane  wrote:
> > Off topic for the moment, since this clearly wouldn't be back-patch
> > material, but I'm starting to wonder if we should just have a context
> > for each relcache entry and get rid of most or all of the retail
> > cleanup logic in RelationDestroyRelation ...
>
> I think that idea might have a lot of merit, but I haven't studied it closely.

It just occurred to me that one advantage of this would be that you
could see how much memory was being used by each relcache entry using
MemoryContextStats(), which seems super-appealing.  In fact, what
about getting rid of all allocations in CacheMemoryContext itself in
favor of some more specific context in each case?  That would make it
a lot clearer where to look for leaks -- or efficiencies.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 4:18 PM Tom Lane  wrote:
> Off topic for the moment, since this clearly wouldn't be back-patch
> material, but I'm starting to wonder if we should just have a context
> for each relcache entry and get rid of most or all of the retail
> cleanup logic in RelationDestroyRelation ...

I think that idea might have a lot of merit, but I haven't studied it closely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-13 Thread Justin Pryzby

Hi,

On Wed, Mar 13, 2019 at 01:25:11PM -0400, Robert Haas wrote:
> On Wed, Mar 13, 2019 at 1:19 PM Tom Lane  wrote:
> > Oh, and yes, I think QueuePartitionConstraintValidation's usage
> > is an unacceptable abuse of INFO level.  I'm surprised we haven't
> > gotten complaints about it yet.
> 
> Perhaps that's because users aren't as direly opposed to informational
> messages from DDL commands as you seem to believe...

On Wed, Mar 13, 2019 at 01:09:32PM -0400, Tom Lane wrote:
> Sergei Kornilov  writes:
> >> Ugh, I guess so. Or how about changing the message itself to use
> >> INFO, like we already do in QueuePartitionConstraintValidation?
> 
> > Fine for me. But year ago this was implemented in my patch and Tom voted 
> > against using INFO level for such purpose: 
> > https://www.postgresql.org/message-id/1142.1520362313%40sss.pgh.pa.us
> 
> What I thought then was that you didn't need the message at all,
> at any debug level.  I still think that.  It might have been useful
> for development purposes but it does not belong in committed code.
> INFO (making it impossible for anybody to not have the message
> in-their-face) is right out.

I'm writing to provide a datapoint for the existing message regarding "implied
by existing constraints" in QueuePartitionConstraintValidation.

On the one hand, I agree that message at INFO is more verbose than other
commands, and I'm 20% surprised by it.

On the other hand, I *like* the output (including at INFO level).  It alerted
me to 1) deficiency with some of our tables (due to column not marked NOT
NULL); and, 2) helped me to understand how to improve some of our dynamic DDL.

If the message weren't visible at LOG, I'd miss it, at least sometimes, and
look for how to re-enable it (but statement logging of DEBUG is beyond
excessive).

For context, I'm someone who configures servers with log_min_messages=info and
who sometimes SETs client_min_messages='DEBUG1' ; I have $toomany warnings in
our monitoring system, because I'd rather see twice as much stuff that may or
may not be interesting than miss 10% of things I needed to see.

Justin

Re: hyrax vs. RelationBuildPartitionDesc

Robert Haas  writes:
> On Wed, Mar 13, 2019 at 3:14 PM Tom Lane  wrote:
>> Meanwhile, who's going to take point on cleaning up rd_partcheck?
>> I don't really understand this code well enough to know whether that
>> can share one of the existing partitioning-related sub-contexts.

> To your question, I think it probably can't share a context.  Briefly,
> rd_partkey can't change ever, except that when a partitioned relation
> is in the process of being created it is briefly NULL; once it obtains
> a value, that value cannot be changed.  If you want to range-partition
> a list-partitioned table or something like that, you have to drop the
> table and create a new one.  I think that's a perfectly acceptable
> permanent limitation and I have no urge whatever to change it.
> rd_partdesc changes when you attach or detach a child partition.
> rd_partcheck is the reverse: it changes when you attach or detach this
> partition to or from a parent.

Got it.  Yeah, it seems like the clearest and least bug-prone solution
is for those to be in three separate sub-contexts.  The only reason
I was trying to avoid that was the angle of how to back-patch into 11.
However, we can just add the additional context pointer field at the
end of the Relation struct in v11, and that should be good enough to
avoid ABI problems.

Off topic for the moment, since this clearly wouldn't be back-patch
material, but I'm starting to wonder if we should just have a context
for each relcache entry and get rid of most or all of the retail
cleanup logic in RelationDestroyRelation ...

regards, tom lane

Re: Sparse bit set data structure

On Wed, Mar 13, 2019 at 3:18 PM Heikki Linnakangas  wrote:
> I started to consider rewriting the data structure into something more
> like B-tree. Then I remembered that I wrote a data structure pretty much
> like that last year already! We discussed that on the "Vacuum: allow
> usage of more than 1GB of work mem" thread [2], to replace the current
> huge array that holds the dead TIDs during vacuum.
>
> So I dusted off that patch, and made it more general, so that it can be
> used to store arbitrary 64-bit integers, rather than ItemPointers or
> BlockNumbers. I then added a rudimentary form of compression to the leaf
> pages, so that clusters of nearby values can be stored as an array of
> 32-bit integers, or as a bitmap. That would perhaps be overkill, if it
> was just to conserve some memory in GiST vacuum, but I think this will
> turn out to be a useful general-purpose facility.

Yeah, that sounds pretty cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 3:14 PM Tom Lane  wrote:
> Yeah, but usually there's some comment pointing it out.  I also wonder
> if there aren't corner-case bugs; it seems a bit bogus for example that
> rd_pdcxt is created without any thought as to whether it might be set
> already.  It's not clear whether this has been written with the
> level of paranoia that's appropriate for messing with a relcache entry,
> and some comments would make it a lot clearer (a) if that is true and
> (b) what assumptions are implicitly being shared with relcache.c.

Yeah, this is all pretty new code, and it probably has some bugs we
haven't found yet.

> Meanwhile, who's going to take point on cleaning up rd_partcheck?
> I don't really understand this code well enough to know whether that
> can share one of the existing partitioning-related sub-contexts.

Well, it would be nice if Amit Langote worked on it since he wrote the
original version of most of this code, but I'm sure he'll defer to you
if you feel the urge to work on it, or I can take a look at it (not
today).

To your question, I think it probably can't share a context.  Briefly,
rd_partkey can't change ever, except that when a partitioned relation
is in the process of being created it is briefly NULL; once it obtains
a value, that value cannot be changed.  If you want to range-partition
a list-partitioned table or something like that, you have to drop the
table and create a new one.  I think that's a perfectly acceptable
permanent limitation and I have no urge whatever to change it.
rd_partdesc changes when you attach or detach a child partition.
rd_partcheck is the reverse: it changes when you attach or detach this
partition to or from a parent.  It's probably helpful to think of the
case of a table with partitions each of which is itself partitioned --
the table at that middle level has to worry both about gaining or
losing children and about being ripped away from its parent.

As a parenthetical note, I observe that relcache.c seems to know
almost nothing about rd_partcheck.  rd_partkey and rd_partdesc both
have handling in RelationClearRelation(), but rd_partcheck does not,
and I suspect that's wrong.  So the problems are probably not confined
to the relcache-drop-time problem that you observed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sparse bit set data structure

2019-03-13 Thread Heikki Linnakangas


Hi,

I was reviewing Andrey Borodin's patch for GiST VACUUM [1], which 
includes a new "block set" data structure, to track internal and empty 
pages while vacuuming a GiST. The blockset data structure was a pretty 
simple radix tree, that can hold a set of BlockNumbers.


The memory usage of the radix tree would probably be good enough in real 
life, as we also discussed on the thread. Nevertheless, I was somewhat 
bothered by it, so I did some measurements. I added some 
MemoryContextStats() calls to Andrey's test_blockset module, to print 
out memory usage.


For storing 500 random 32-bit integers, or a density of about 1% of 
bits set, the blockset consumed about 380 MB of memory. I think that's a 
pretty realistic ratio of internal pages : leaf pages on a GiST index, 
so I would hope the blockset to be efficient in that ballpark. However, 
380 MB / 500 is about 76 bytes, so it's consuming about 76 bytes per 
stored block number. That's a lot of overhead! For comparison, a plain 
BlockNumber is just 4 bytes. With more sparse sets, it is even more 
wasteful, on a per-item basis, although the total memory usage will of 
course be smaller. (To be clear, no one is pushing around GiST indexes 
with anywhere near 2^32 blocks, or 32 TB, but the per-BlockNumber stats 
are nevertheless interesting.)


I started to consider rewriting the data structure into something more 
like B-tree. Then I remembered that I wrote a data structure pretty much 
like that last year already! We discussed that on the "Vacuum: allow 
usage of more than 1GB of work mem" thread [2], to replace the current 
huge array that holds the dead TIDs during vacuum.


So I dusted off that patch, and made it more general, so that it can be 
used to store arbitrary 64-bit integers, rather than ItemPointers or 
BlockNumbers. I then added a rudimentary form of compression to the leaf 
pages, so that clusters of nearby values can be stored as an array of 
32-bit integers, or as a bitmap. That would perhaps be overkill, if it 
was just to conserve some memory in GiST vacuum, but I think this will 
turn out to be a useful general-purpose facility.


I plugged this new "sparse bitset" implementation into the same 
test_blockset test. The memory usage for 500 values is now just over 
20 MB, or about 4.3 bytes per value. That's much more reasonable than 
the 76 bytes.


I'll do some more performance testing on this, to make sure it performs 
well enough on random lookups, to also replace VACUUM's dead item 
pointer array. Assuming that works out, I plan to polish up and commit 
this, and use it in the GiST vacuum. I'm also tempted to change VACUUM 
to use this, since that should be pretty straightforward once we have 
the data structure.


[1] 
https://www.postgresql.org/message-id/A51F64E3-850D-4249-814E-54967103A859%40yandex-team.ru


[2] 
https://www.postgresql.org/message-id/8e5cbf08-5dd8-466d-9271-562fc65f133f%40iki.fi


- Heikki
>From 678f56caf1c6eceb5195c531190b573c180e5ef2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 13 Mar 2019 20:05:04 +0200
Subject: [PATCH 1/2] Add SparseBitset, to hold a large set of 64-bit ints
 efficiently.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.
---
 src/backend/lib/Makefile   |   2 +-
 src/backend/lib/README |   2 +
 src/backend/lib/sparsebitset.c | 986 +
 src/include/lib/sparsebitset.h |  26 +
 4 files changed, 1015 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/sparsebitset.c
 create mode 100644 src/include/lib/sparsebitset.h

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..32ba388c12d 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+   ilist.o knapsack.o pairingheap.o rbtree.o sparsebitset.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f938795b54a 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -19,6 +19,8 @@ pairingheap.c - a pairing heap
 
 rbtree.c - a red-black tree
 
+sparsebitset.c - a sparse bitset of integers
+
 stringinfo.c - an extensible string type
 
 
diff --git a/src/backend/lib/sparsebitset.c b/src/backend/lib/sparsebitset.c
new file mode 100644
index 000..6df900b9ebc
--- /dev/null
+++ b/src/backend/lib/sparsebitset.c
@@ -0,0 +1,986 @@
+/*-
+ *
+ * sparsebitset.c
+ *	  Data

Re: hyrax vs. RelationBuildPartitionDesc

Robert Haas  writes:
> On Wed, Mar 13, 2019 at 2:26 PM Tom Lane  wrote:
>> I think that RelationBuildPartitionDesc could use some additional cleanup
>> or at least better commenting.  In particular, it's neither documented nor
>> obvious to the naked eye why rel->rd_partdesc mustn't get set till the
>> very end.  As the complainant, I'm willing to go fix that, but do you want
>> to push your patch first so it doesn't get broken?  Or I could include
>> your patch in the cleanup.

> It seems kinda obvious to me why rel->rd_partdesc can't get set until
> the end.  Isn't it just that you'd better not set a permanent pointer
> to a data structure until you're past any code that might ERROR, which
> is pretty much everything?  That principle applies to lots of
> PostgreSQL code, not just this.

Yeah, but usually there's some comment pointing it out.  I also wonder
if there aren't corner-case bugs; it seems a bit bogus for example that
rd_pdcxt is created without any thought as to whether it might be set
already.  It's not clear whether this has been written with the
level of paranoia that's appropriate for messing with a relcache entry,
and some comments would make it a lot clearer (a) if that is true and
(b) what assumptions are implicitly being shared with relcache.c.

Meanwhile, who's going to take point on cleaning up rd_partcheck?
I don't really understand this code well enough to know whether that
can share one of the existing partitioning-related sub-contexts.

regards, tom lane

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 2:26 PM Tom Lane  wrote:
> OK, in that case it's definitely all the temporary data that gets created
> that is the problem.  I've not examined your patch in great detail but
> it looks plausible for fixing that.

Cool.

> I think that RelationBuildPartitionDesc could use some additional cleanup
> or at least better commenting.  In particular, it's neither documented nor
> obvious to the naked eye why rel->rd_partdesc mustn't get set till the
> very end.  As the complainant, I'm willing to go fix that, but do you want
> to push your patch first so it doesn't get broken?  Or I could include
> your patch in the cleanup.

Yeah, that probably makes sense, though it might be polite to wait
another hour or two to see if anyone wants to argue with that approach
further.

It seems kinda obvious to me why rel->rd_partdesc can't get set until
the end.  Isn't it just that you'd better not set a permanent pointer
to a data structure until you're past any code that might ERROR, which
is pretty much everything?  That principle applies to lots of
PostgreSQL code, not just this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 2:21 PM Alvaro Herrera  wrote:
> On 2019-Mar-13, Robert Haas wrote:
> > On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera  
> > wrote:
> > > A bit, yes, but not overly so, and it's less fragile that not having
> > > such a protection.  Anything that allocates in CacheMemoryContext needs
> > > to be very careful anyway.
> >
> > True, but I think it's more fragile than either of the options I proposed.
>
> You do?  Unless I misunderstood, your options are:
>
> 1. (the patch you attached) create a temporary memory context that is
> used for everything, then at the end copy the good stuff to CacheMemCxt
> (or a sub-context thereof).  This still needs to copy.
>
> 2. create a temp memory context, do everything there, do retail freeing
> of everything we don't want, reparenting the context to CacheMemCxt.
> Hope we didn't forget to pfree anything.
>
> How is any of those superior to what I propose?

Well, *I* thought of it, so obviously it must be superior.  :-)

More seriously, your idea does seem better in some ways.  My #1
doesn't avoid the copy, but does kill the leaks.  My #2 avoids the
copy, but risks a different flavor of leakage.  Your idea also avoids
the copy and leaks in fewer cases than my #2.  In that sense, it is
the technically superior option.  However, it also requires more
memory context switches than either of my options, and I guess that
seems fragile to me in the sense that I think future code changes are
more likely to go wrong due to the complexity involved.  I might be
mistaken about that, though.

One other note is that, although the extra copy looks irksome, it's
probably not very significant from a performance point of view.  In a
non-CLOBBER_CACHE_ALWAYS build it doesn't happen frequently enough to
matter, and in a CLOBBER_CACHE_ALWAYS build everything is already so
slow that it still doesn't matter.  That's not the only consideration,
though.  Code which copies data structures might be buggy, or might
develop bugs in the future, and avoiding that copy would avoid
exposure to such bugs.  On the other hand, it's not really possible to
remove the copying without increasing the risk of leaking into the
long-lived context.

In some ways, I think this is all a natural outgrowth of the fact that
we rely on palloc() in so many places instead of forcing code to be
explicit about which memory context it intends to target.  Global
variables are considered harmful, and it's not that difficult to see
the connection between that general principle and present
difficulties.  However, I will not have time to write and debug a
patch reversing that choice between now and feature freeze.  :-)

I'm kinda inclined to go for the brute-force approach of slamming the
temporary context in there as in the patch I proposed upthread, which
should solve the immediate problem, and we can implement one of these
other ideas later if we want.  What do you think about that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pg_rewind : feature to rewind promoted standby is broken!

2019-03-13 Thread Mithun Cy

On Wed, Mar 13, 2019 at 1:38 PM Michael Paquier  wrote:

> On Tue, Mar 12, 2019 at 06:23:01PM +0900, Michael Paquier wrote:
> > And you are pointing out to the correct commit.  The issue is that
> > process_target_file() has added a call to check_file_excluded(), and
> > this skips all the folders which it thinks can be skipped.  One
> > problem though is that we also filter out pg_internal.init, which is
> > present in each database folder, and remains in the target directory
> > marked for deletion.  Then, when the deletion happens, the failure
> > happens as the directory is not fully empty.
>
> Okay, here is a refined patch with better comments, the addition of a
> test case (creating tables in the new databases in 002_databases.pl is
> enough to trigger the problem).
>

I have not looked into the patch but quick test show it has fixed the above
issue.

[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby
--source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres" -n
servers diverged at WAL location 0/300 on timeline 1
rewinding from last common checkpoint at 0/260 on timeline 1
Done!
[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby
--source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres"
servers diverged at WAL location 0/300 on timeline 1
rewinding from last common checkpoint at 0/260 on timeline 1
Done!

-- 
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com

Re: Compressed TOAST Slicing

2019-03-13 Thread Paul Ramsey


> On Mar 13, 2019, at 9:32 AM, Andrey Borodin  wrote:
> 
> 
> 
>> 13 марта 2019 г., в 21:05, Paul Ramsey  
>> написал(а):
>> 
>> Here is a new (final?) patch ...
>> 
>> 
> 
> This check
> 
> @@ -744,6 +748,8 @@ pglz_decompress(const char *source, int32 slen, char 
> *dest,
>   {
>   *dp = dp[-off];
>   dp++;
> + if (dp >= destend)  /* check for 
> buffer overrun */
> + break;  /* do not 
> clobber memory */
>   }
> 
> is still done for every byte. You can precompute maximum allowed length 
> before that cycle. Here's diff

Thanks! Attached change,

P

compressed-datum-slicing-20190313b.patch
Description: Binary data

Re: hyrax vs. RelationBuildPartitionDesc

Alvaro Herrera  writes:
> You do?  Unless I misunderstood, your options are:

> 1. (the patch you attached) create a temporary memory context that is
> used for everything, then at the end copy the good stuff to CacheMemCxt
> (or a sub-context thereof).  This still needs to copy.

> 2. create a temp memory context, do everything there, do retail freeing
> of everything we don't want, reparenting the context to CacheMemCxt.
> Hope we didn't forget to pfree anything.

> How is any of those superior to what I propose?

I doubt that what you're suggesting is terribly workable.  It's not
just RelationBuildPartitionDesc that's at issue.  Part of what will
be the long-lived data structure is made by partition_bounds_create,
and that invokes quite a boatload of code that both makes the final
data structure and leaks a lot of intermediate junk.  Having to be
very careful about permanent vs temporary data throughout all of that
seems like a recipe for bugs.

The existing code in RelationBuildPartitionDesc is already pretty good
about avoiding copying of data other than the output of
partition_bounds_create.  In fact, I think it's already close to what
you're suggesting other than that point.  So I think --- particularly
given that we need something we can back-patch into v11 --- that we
shouldn't try to do anything much more complicated than what Robert is
suggesting.

regards, tom lane

Re: hyrax vs. RelationBuildPartitionDesc

Robert Haas  writes:
>> Dunno if that's related to hyrax's issue, though.

> It's related in the sense that it's a leak, and any leak will tend to
> run the system out of memory more easily, but what I observed was a
> large leak into MessageContext, and that would be a leak into
> CacheMemoryContext, so I think it's probably a sideshow rather than
> the main event.

OK, in that case it's definitely all the temporary data that gets created
that is the problem.  I've not examined your patch in great detail but
it looks plausible for fixing that.

I think that RelationBuildPartitionDesc could use some additional cleanup
or at least better commenting.  In particular, it's neither documented nor
obvious to the naked eye why rel->rd_partdesc mustn't get set till the
very end.  As the complainant, I'm willing to go fix that, but do you want
to push your patch first so it doesn't get broken?  Or I could include
your patch in the cleanup.

regards, tom lane

Re: hyrax vs. RelationBuildPartitionDesc

2019-03-13 Thread Alvaro Herrera

On 2019-Mar-13, Robert Haas wrote:

> On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera  
> wrote:
> > A bit, yes, but not overly so, and it's less fragile that not having
> > such a protection.  Anything that allocates in CacheMemoryContext needs
> > to be very careful anyway.
> 
> True, but I think it's more fragile than either of the options I proposed.

You do?  Unless I misunderstood, your options are:

1. (the patch you attached) create a temporary memory context that is
used for everything, then at the end copy the good stuff to CacheMemCxt
(or a sub-context thereof).  This still needs to copy.

2. create a temp memory context, do everything there, do retail freeing
of everything we don't want, reparenting the context to CacheMemCxt.
Hope we didn't forget to pfree anything.

How is any of those superior to what I propose?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 1:30 PM Tom Lane  wrote:
> Ah, here we are: it was rel->rd_partcheck.  I'm not sure exactly how
> complicated that structure can be, but I'm pretty sure that this is
> a laughably inadequate method of cleaning it up:
>
> if (relation->rd_partcheck)
> pfree(relation->rd_partcheck);

Oh, for crying out loud.

> Dunno if that's related to hyrax's issue, though.

It's related in the sense that it's a leak, and any leak will tend to
run the system out of memory more easily, but what I observed was a
large leak into MessageContext, and that would be a leak into
CacheMemoryContext, so I think it's probably a sideshow rather than
the main event.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 1:38 PM Alvaro Herrera  wrote:
> A bit, yes, but not overly so, and it's less fragile that not having
> such a protection.  Anything that allocates in CacheMemoryContext needs
> to be very careful anyway.

True, but I think it's more fragile than either of the options I proposed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Timeout parameters

On Wed, Mar 13, 2019 at 1:25 PM Fabien COELHO  wrote:
> Hmmm... ISTM that we are not talking about the same patch...

You are correct!  I was talking about the patches that allow user
control of TCP_USER_TIMEOUT, which is apparently totally different
from the socket_timeout patch that you're talking about.

The first thing I notice about the socket_timeout patch is that the
documentation is definitely wrong:

+Number of seconds to wait for socket read/write operation before
+closing the connection, as a decimal integer. This can be
used both as a
+force global query timeout and network problems detector. A value of

This can't be used to force a global query timeout, because any kind
of protocol message - e.g. a NOTICE or WARNING - would reset the
timeout, allowing the continue past the supposedly-global query
timeout.  There may be some other ways that can happen, too, either
currently or in the future.

> My point is about the "socket_timeout" patch which timeout on not
> receiving an answer, but is not related to the network connection.

The above-quoted documentation disagrees with you about that, claiming
that this can be used as a network problem detector. However, I think
that the documentation is wrong about that, too.  The mere fact that
the connection is idle does not show that there is a network problem;
there could, for example, be TCP keepalives being sent and received
behind the scenes, and if so, this patch would cause a connection that
is known to be valid to be disconnected for no reason.

Actually, I think that socket_timeout_v8.patch is a bad idea and
should be rejected, not because it doesn't send a cancel to the server
but just because it's not the right way of solving either of the
problems that it purports to solve.  If you want a timeout for
queries, you need that to be administered by the server, which knows
when it starts and finishes executing a query; you cannot substitute a
client-side judgement based on the last time we received a byte of
data.  Maybe a client-side judgement would work if the timer were
armed when we send a Query or Execute message and disarmed when we
receive ReadyForQuery.  And, if you want a network problem detector,
then you should use keepalives and perhaps the TCP_USER_TIMEOUT stuff,
so that you don't kill perfectly good connections that just happen to
be idle.

The only argument I can really see for socket_timeout_v8.patch is that
there might be platforms where setting keepalives and/or
TCP_USER_TIMEOUT doesn't work.  However, if that is the issue, then I
think we should look for equivalent facilities that do work on those
platforms and consider supporting those facilities, or else just
decide that if certain features are not supported on obscure platforms
then users of those platforms are going to have to live without those
features.  If we add something like this, people are likely to try to
use it and then get frustrated when it doesn't work right.

It's also worth noting that it's generally a bad idea to try to make
one facilities do two different things with conflicting requirements.
It seems not unlikely that if we add this, somebody will come back and
ask us to add keepalives to the protocol itself (rather than relying
on TCP) to keep this from killing the connection in vain -- and then
somebody else will object to that because they're using it to kill off
queries that run too long (and have raised client_min_messages to
ERROR in an attempt to avoid getting any other messages meanwhile).
By munging together two different requirements into a single facility,
we make it impossible to decide which side of that argument is
correct.  It will just come down to who yells the loudest.

Note that none of this is an argument against the TCP_USER_TIMEOUT
portion of this patch set.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: hyrax vs. RelationBuildPartitionDesc

2019-03-13 Thread Alvaro Herrera

On 2019-Mar-13, Robert Haas wrote:

> On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera
>  wrote:
> > I remember going over this code's memory allocation strategy a bit to
> > avoid the copy while not incurring potential leaks CacheMemoryContext;
> > as I recall, my idea was to use two contexts, one of which is temporary
> > and used for any potentially leaky callees, and destroyed at the end of
> > the function, and the other contains the good stuff and is reparented to
> > CacheMemoryContext at the end.  So if you have any accidental leaks,
> > they don't affect a long-lived context.  You have to be mindful of not
> > calling leaky code when you're using the permanent one.
> 
> Well, that assumes that the functions which allocate the good stuff do
> not also leak, which seems a bit fragile.

A bit, yes, but not overly so, and it's less fragile that not having
such a protection.  Anything that allocates in CacheMemoryContext needs
to be very careful anyway.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

RE: Timeout parameters

2019-03-13 Thread Fabien COELHO





Hello Fabien-san.


The 2nd patch is 700 KB, I think that there is a unvoluntary file copy.


If the user reconnects, eg "\c db", the setting is lost. The
re-connection handling should probably take care of this parameter, and maybe 
others.

I think your opinion is reasonable, but it seems not in this thread.


HI think that your patch is responsible for the added option at least.

I agree that the underlying issue that other parameters should probably 
also be reused, which would be a bug fix, does not belong to this thread.



* settings Inheritance by "\c" command.
suggested by Fabien-san.
Not implemented yet.
This should be in the other thread.


See above.

--
Fabien.

Re: hyrax vs. RelationBuildPartitionDesc

I wrote:
> I recall having noticed someplace where I thought the relcache partition
> support was simply failing to make provisions for cleaning up a cached
> structure at relcache entry drop, but I didn't have time to pursue it
> right then.  Let me see if I can reconstruct what I was worried about.

Ah, here we are: it was rel->rd_partcheck.  I'm not sure exactly how
complicated that structure can be, but I'm pretty sure that this is
a laughably inadequate method of cleaning it up:

if (relation->rd_partcheck)
pfree(relation->rd_partcheck);

Having it be loose data in CacheMemoryContext, which is the current state
of affairs, is just not going to be practical to clean up.  I suggest that
maybe it could be copied into rd_partkeycxt or rd_pdcxt, so that it'd go
away as a byproduct of freeing those.  If there's a reason it has to be
independent of both, it'll have to have its own small context.

Dunno if that's related to hyrax's issue, though.

regards, tom lane

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 1:15 PM Tom Lane  wrote:
> I'm a bit confused as to why there's an issue here at all.  The usual
> plan for computed-on-demand relcache sub-structures is that we compute
> a working copy that we're going to return to the caller using the
> caller's context (which is presumably statement-duration at most)
> and then do the equivalent of copyObject to stash a long-lived copy
> into the relcache context.  Is this case being done differently, and if
> so why?  If it's being done the same, where are we leaking?

It's being done in just that way.  The caller's context is
MessageContext, which is indeed statement duration.  But if you plunk
10k into MessageContext a few thousand times per statement, then you
chew through a couple hundred meg of memory, and apparently hyrax
can't tolerate that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: using index or check in ALTER TABLE SET NOT NULL

On Wed, Mar 13, 2019 at 1:19 PM Tom Lane  wrote:
> Oh, and yes, I think QueuePartitionConstraintValidation's usage
> is an unacceptable abuse of INFO level.  I'm surprised we haven't
> gotten complaints about it yet.

Perhaps that's because users aren't as direly opposed to informational
messages from DDL commands as you seem to believe...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Timeout parameters

2019-03-13 Thread Fabien COELHO




Hello Robert,


Also, I do not see the downside of sending a cancel query before severing
the connection. If it is not processed, too bad, but if it is then it is
for the better.


If the network connection is dead, which is the situation the patch
intends to detect,


Hmmm... ISTM that we are not talking about the same patch...

My point is about the "socket_timeout" patch which timeout on not 
receiving an answer, but is not related to the network connection.


The other two patches, however, deal with tcp timeout both client & server 
side, and are indeed more related to the network connection. Sending 
request on a tcp timeout would not make much sense, but this is not the 
proposal here.


then PQcancel() isn't going to work, but it might still hang for a 
period of time or forever.  That seems like a pretty major downside.


The fact that no answer data is received may mean that it takes time to 
compute the result, so cancelling seems appropriate to me, rather than 
severing the connection and starting a new one immediately, leaving the 
server loaded with its query.


--
Fabien.

Re: using index or check in ALTER TABLE SET NOT NULL

On Wed, Mar 13, 2019 at 1:09 PM Tom Lane  wrote:
> Sergei Kornilov  writes:
> >> Ugh, I guess so. Or how about changing the message itself to use
> >> INFO, like we already do in QueuePartitionConstraintValidation?
>
> > Fine for me. But year ago this was implemented in my patch and Tom voted 
> > against using INFO level for such purpose: 
> > https://www.postgresql.org/message-id/1142.1520362313%40sss.pgh.pa.us
>
> What I thought then was that you didn't need the message at all,
> at any debug level.  I still think that.  It might have been useful
> for development purposes but it does not belong in committed code.
> INFO (making it impossible for anybody to not have the message
> in-their-face) is right out.

I find that position entirely wrong-headed.  If you think users have
no interest in a message telling them whether their gigantic table is
getting scanned or not, you're wrong.  Maybe you're right that
everyone doesn't want it, but I'm positive some do.  We've had
requests for very similar things on this very mailing list more than
one.

And if you think that it's not useful to have regression tests that
verify that the behavior is correct, well, that's not a question with
one objectively right answer, but I emphatically disagree with that
position.  This behavior is often quite subtle, especially in
combination with partitioned tables where different partitions may get
different treatment.  It would not be at all difficult to break it
without realizing.

I do understand that we seem to have no good way of doing this that
lets users have the option of seeing this message and also makes it
something that can be regression-tested.  INFO is out because there's
no option, and DEBUG1 is out because it doesn't work in the regression
tests.  However, I think giving up and saying "ok, well that's just
how it is, too bad" is, one, letting the tail wag the dog, and two, a
terribly disappointing attitude.

I've just reverted the 0002 portion of the patch, which should
hopefully fix this problem for now.  But I strongly encourage you
think of something to which you could say "yes" instead of just
shooting everything people want to do in this area down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: using index or check in ALTER TABLE SET NOT NULL

I wrote:
> Sergei Kornilov  writes:
>>> Ugh, I guess so. Or how about changing the message itself to use
>>> INFO, like we already do in QueuePartitionConstraintValidation?

>> Fine for me. But year ago this was implemented in my patch and Tom voted 
>> against using INFO level for such purpose: 
>> https://www.postgresql.org/message-id/1142.1520362313%40sss.pgh.pa.us

> What I thought then was that you didn't need the message at all,
> at any debug level.  I still think that.

Oh, and yes, I think QueuePartitionConstraintValidation's usage
is an unacceptable abuse of INFO level.  I'm surprised we haven't
gotten complaints about it yet.

regards, tom lane

Re: hyrax vs. RelationBuildPartitionDesc

Robert Haas  writes:
> On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera
>  wrote:
>> I remember going over this code's memory allocation strategy a bit to
>> avoid the copy while not incurring potential leaks CacheMemoryContext;
>> as I recall, my idea was to use two contexts, one of which is temporary
>> and used for any potentially leaky callees, and destroyed at the end of
>> the function, and the other contains the good stuff and is reparented to
>> CacheMemoryContext at the end.  So if you have any accidental leaks,
>> they don't affect a long-lived context.  You have to be mindful of not
>> calling leaky code when you're using the permanent one.

> Well, that assumes that the functions which allocate the good stuff do
> not also leak, which seems a bit fragile.

I'm a bit confused as to why there's an issue here at all.  The usual
plan for computed-on-demand relcache sub-structures is that we compute
a working copy that we're going to return to the caller using the
caller's context (which is presumably statement-duration at most)
and then do the equivalent of copyObject to stash a long-lived copy
into the relcache context.  Is this case being done differently, and if
so why?  If it's being done the same, where are we leaking?

I recall having noticed someplace where I thought the relcache partition
support was simply failing to make provisions for cleaning up a cached
structure at relcache entry drop, but I didn't have time to pursue it
right then.  Let me see if I can reconstruct what I was worried about.

regards, tom lane

Re: Timeout parameters

On Wed, Mar 13, 2019 at 2:05 AM Fabien COELHO  wrote:
> Also, I do not see the downside of sending a cancel query before severing
> the connection. If it is not processed, too bad, but if it is then it is
> for the better.

If the network connection is dead, which is the situation the patch
intends to detect, then PQcancel() isn't going to work, but it might
still hang for a period of time or forever.  That seems like a pretty
major downside.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: ToDo: show size of partitioned table

2019-03-13 Thread Pavel Stehule

st 13. 3. 2019 v 8:02 odesílatel Amit Langote 
napsal:

> On 2019/02/22 1:41, Pavel Stehule wrote:
> > čt 21. 2. 2019 v 0:56 odesílatel Justin Pryzby 
> > napsal:
> >
> >> On Sat, Feb 16, 2019 at 10:52:35PM +0100, Pavel Stehule wrote:
> >>> I like your changes. I merged all - updated patch is attached
> >>
> >> I applied and tested your v10 patch.
> >>
> >> Find attached some light modifications.
> >>
> >
> > I have not any objection - I cannot to evaluate. It is question for some
> > native speaker.
>
> Not a native speaker either, but I like Justin's changes.  Although I
> noticed that he missed changing one sentence to look like other similar
> sentences.
>
> What Justin did:
>
> -indexes) and associated description are also displayed.
> +indexes) is displayed, as is the relation's description.
>  
>
> What I think he meant to do:
>
> -indexes) and associated description are also displayed.
> +indexes) is also displayed, along with the associated description.
>  
>
>
> > I am not sure if we use labels in form "some: detail" somewhere.
>
> I haven't seen column names like that either.  How about:
>
> -  gettext_noop("Direct partitions size"));
> +  gettext_noop("Leaf partition size"));
>
> -  gettext_noop("Total partitions size"));
> +  gettext_noop("Total size"));
>
> -  gettext_noop("Partitions size"));
> +  gettext_noop("Total size"));
>

+1

Pavel


> I've attached v11 of the patch, which merges most of Justin's changes and
> some of my own on top -- documentation and partition size column names.
>
> Thanks,
> Amit
>

Re: using index or check in ALTER TABLE SET NOT NULL

Sergei Kornilov  writes:
>> Ugh, I guess so. Or how about changing the message itself to use
>> INFO, like we already do in QueuePartitionConstraintValidation?

> Fine for me. But year ago this was implemented in my patch and Tom voted 
> against using INFO level for such purpose: 
> https://www.postgresql.org/message-id/1142.1520362313%40sss.pgh.pa.us

What I thought then was that you didn't need the message at all,
at any debug level.  I still think that.  It might have been useful
for development purposes but it does not belong in committed code.
INFO (making it impossible for anybody to not have the message
in-their-face) is right out.

regards, tom lane

Re: Offline enabling/disabling of data checksums

2019-03-13 Thread Magnus Hagander

On Wed, Mar 13, 2019 at 4:51 PM Michael Banck 
wrote:

> Hi,
>
> Am Mittwoch, den 13.03.2019, 12:43 +0100 schrieb Magnus Hagander:
> > I think this is dangerous enough that it needs to be enforced and not
> > documented.
>
> Changing the cluster ID might have some other side-effects, I think
> there are several cloud-native 3rd party solutions that use the cluster
> ID as some kind of unique identifier for an instance. It might not be an
> issue in practise, but then again, it might break other stuff down the
> road.
>

Well, whatever we do they have to update, right? If we're not changing it,
then we're basically saying that it's (systemid, checksums) that is the
identifier of the cluster, not just systemid. They'd have to go around and
check each node individually for the configuration and not just use
systemid anyway, so what's the actual win?

Another possibility would be to extend the replication protocol's
> IDENTIFY_SYSTEM command to also report the checksum version so that the
> standby can check against the local control file on startup. But I am
> not sure we can easily extend IDENTIFY_SYSTEM this way nor whether we
> should for this rather corner-casey thing?
>

We could, but is it really a win in those scenarios? Vs just making the
systemid different? With systemid being different it's obvious that
something needs to be done. If it's not then at the best, if we check it in
the standby startup, the standby won't start. But people can still end up
with things like unusuable/corrupt backups for example.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: Offline enabling/disabling of data checksums

2019-03-13 Thread Magnus Hagander

On Wed, Mar 13, 2019 at 4:46 PM Michael Banck 
wrote:

> Hi,
>
> Am Mittwoch, den 13.03.2019, 12:24 +0100 schrieb Magnus Hagander:
> > On Wed, Mar 13, 2019 at 11:54 AM Sergei Kornilov  wrote:
> > > One new question from me: how about replication?
> > > Case: primary+replica, we shut down primary and enable checksum, and
> > > "started streaming WAL from primary" without any issue. I have
> > > master with checksums, but replica without.
> > > Or cluster with checksums, then disable checksums on primary, but
> > > standby think we have checksums.
> >
> > Enabling or disabling the checksums offline on the master quite
> > clearly requires a rebuild of the standby, there is no other way
>
> What about shutting down both and running pg_checksums --enable on the
> standby as well?
>

That sounds pretty fragile to me. But if we can prove that the user has
done things in the right order, sure. But how can we do that in an offline
process? what if the user just quickly restarted the primary note after the
standby had been shut down? We'll need to somehow validate it across the
nodes..
-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: Using the return value of strlcpy() and strlcat()

ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
> [ let's convert
> + strlcpy(buf + buflen, name, NAMEDATALEN);
> + buflen += strlen(buf + buflen);
> to
> + buflen += strlcpy(buf + buflen, name, NAMEDATALEN);
> ]

I don't think that's a safe transformation: what strlcpy returns is
strlen(src), which might be different from what it was actually
able to fit into the destination.

Sure, they're equivalent if no truncation occurred; but if we were
100.00% sure of no truncation, we'd likely not bother with strlcpy.

regards, tom lane

Re: hyrax vs. RelationBuildPartitionDesc

On Wed, Mar 13, 2019 at 12:42 PM Alvaro Herrera
 wrote:
> I remember going over this code's memory allocation strategy a bit to
> avoid the copy while not incurring potential leaks CacheMemoryContext;
> as I recall, my idea was to use two contexts, one of which is temporary
> and used for any potentially leaky callees, and destroyed at the end of
> the function, and the other contains the good stuff and is reparented to
> CacheMemoryContext at the end.  So if you have any accidental leaks,
> they don't affect a long-lived context.  You have to be mindful of not
> calling leaky code when you're using the permanent one.

Well, that assumes that the functions which allocate the good stuff do
not also leak, which seems a bit fragile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-13 Thread Sergei Kornilov

Hi

> Ugh, I guess so. Or how about changing the message itself to use
> INFO, like we already do in QueuePartitionConstraintValidation?

Fine for me. But year ago this was implemented in my patch and Tom voted 
against using INFO level for such purpose: 
https://www.postgresql.org/message-id/1142.1520362313%40sss.pgh.pa.us

regards, Sergei

Re: hyrax vs. RelationBuildPartitionDesc

2019-03-13 Thread Alvaro Herrera

On 2019-Mar-13, Robert Haas wrote:

> RelationBuildPartitionDesc() creates basically all of the data
> structures it needs and then copies them into rel->rd_pdcxt, which has
> always seemed a bit inefficient to me.  Another way to redesign this
> would be to have the function create a temporary context, do all of
> its work there, and then reparent the context under CacheMemoryContext
> at the end.  That means any leaks would go into a relatively
> long-lifespan context, but on the other hand you wouldn't leak into
> the same context a zillion times over, and you'd save the expense of
> copying everything.  I think that the biggest thing that is being
> copied around here is the partition bounds, so maybe the leak wouldn't
> amount to much, and we could also do things like list_free(inhoids) to
> make it a little tighter.

I remember going over this code's memory allocation strategy a bit to
avoid the copy while not incurring potential leaks CacheMemoryContext;
as I recall, my idea was to use two contexts, one of which is temporary
and used for any potentially leaky callees, and destroyed at the end of
the function, and the other contains the good stuff and is reparented to
CacheMemoryContext at the end.  So if you have any accidental leaks,
they don't affect a long-lived context.  You have to be mindful of not
calling leaky code when you're using the permanent one.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

On Wed, Mar 13, 2019 at 12:20 AM Kyotaro HORIGUCHI
 wrote:
> +Multivariate MCV (most-common values) lists are a straightforward extension 
> of
>
>   "lists are *a*" is wrong?

No, that's correct.  Not sure exactly what your concern is, but it's
probably related to the fact that the first parent of the sentences
(before "are") is plural and the second part is singular.  It does
seem a little odd that you can say "lists are an extension," mixing
singular and plural, but English lets you do stuff like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: GIN indexes on an = ANY(array) clause

Corey Huinker  writes:
> A client had an issue with a where that had a where clause something like
> this:
> WHERE 123456 = ANY(integer_array_column)
> I was surprised that this didn't use the pre-existing GIN index on
> integer_array_column, whereas recoding as
> WHERE ARRAY[123456] <@ integer_array_column
> did cause the GIN index to be used. Is this a known/expected behavior? If
> so, is there any logical reason why we couldn't have the planner pick up on
> that?
> Flo Rance (toura...@gmail.com) was nice enough to show that yes, this is
> expected behavior.

The planner doesn't know enough about the semantics of array <@ to make
such a transformation.  (As pointed out in the stackoverflow article Flo
pointed you to, the equivalence might not even hold, depending on which
version of <@ we're talking about.)

Since the GIN index type is heavily oriented towards array-related
operators, I spent some time wondering whether we could get any mileage
by making ScalarArrayOpExpr indexquals be natively supported by GIN
(right now they aren't).  But really I don't see where the GIN AM would
get the knowledge from, either.  What it knows about the array_ops
opclass is basically the list of associated operators:

regression=# select amopopr::regoperator from pg_amop where amopfamily = 2745;
amopopr
---
 &&(anyarray,anyarray)
 @>(anyarray,anyarray)
 <@(anyarray,anyarray)
 =(anyarray,anyarray)
(4 rows)

and none of those are obviously related to the =(int4,int4) operator that
is in the ScalarArrayOp.  The only way to get from point A to point B is
to know very specifically that =(anyarray,anyarray) is related to any
scalar-type btree equality operator, which is not the kind of thing the
GIN AM ought to know either.

Really the array_ops opclass itself is the widest scope where it'd be
reasonable to embed knowledge about this sort of thing --- but we lack
any API at all whereby opclass-specific code could affect planner behavior
at this level.  Even if we had one, there's no obvious reason why we
should be consulting a GIN opclass about a ScalarArrayOp that does not
contain an operator visibly related to the opclass.  That path soon
leads to consulting everybody about everything and planner performance
going into the tank.

Extensibility is a harsh mistress.  

regards, tom lane

Re: Compressed TOAST Slicing

2019-03-13 Thread Andrey Borodin




> 13 марта 2019 г., в 21:05, Paul Ramsey  написал(а):
> 
> Here is a new (final?) patch ...
> 
> 

This check

@@ -744,6 +748,8 @@ pglz_decompress(const char *source, int32 slen, char *dest,
{
*dp = dp[-off];
dp++;
+   if (dp >= destend)  /* check for 
buffer overrun */
+   break;  /* do not 
clobber memory */
}

is still done for every byte. You can precompute maximum allowed length before 
that cycle. Here's diff

diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index 6b48892a8f..05b2b3d5d1 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -744,12 +744,11 @@ pglz_decompress_checked(const char *source, int32 slen, 
char *dest,
 * memcpy() here, because the copied areas 
could overlap
 * extremely!
 */
+   len = Min(len, destend - dp);
while (len--)
{
*dp = dp[-off];
dp++;
-   if (dp >= destend)  /* check for 
buffer overrun */
-   break;  /* do not 
clobber memory */
}
}
else


Best regards, Andrey Borodin.

Re: using index or check in ALTER TABLE SET NOT NULL

On Wed, Mar 13, 2019 at 11:17 AM Sergei Kornilov  wrote:
> > The buildfarm thinks additional nitpicking is needed.
>
> hm. Patch was committed with debug1 level tests and many animals uses 
> log_statement = 'all'. Therefore they have additional line in result: LOG:  
> statement: alter table pg_class alter column relname drop not null; and 
> similar for other queries.
> I think we better would be to revert "set client_min_messages to 'debug1';"  
> part.

Ugh, I guess so.  Or how about changing the message itself to  use
INFO, like we already do in QueuePartitionConstraintValidation?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

hyrax vs. RelationBuildPartitionDesc