Re: [HACKERS] Prepared statements assume text type in PG10

2017-10-07 Thread Tom Lane
Jack Christensen  writes:
> Pre-version 10:

> jack=# prepare ps as select $1;
> ERROR:  could not determine data type of parameter $1

> But on PG10 the type defaults to text:

> jack=# prepare ps as select $1;
> PREPARE

> I looked through the git log and couldn't find any commits referencing 
> this. Is this an intended behavior change?

Yes, it is; we no longer allow any SELECT output column to remain
unresolved as unknown type.  If it would have come out like that
before, we force it to text instead.  Peter misidentified the
responsible commit though: it was 1e7c4bb00.  (I guess the commit
log message is a bit misleading, because it only talks about unknown
literals ... but undetermined parameters and undecorated NULL constants
would receive the same force-to-text-type treatment.)
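For illustration, a minimal sketch (the statement name is invented) that shows
the forced typing via the pg_prepared_statements view:

prepare ps2 as select $1, NULL;
select name, parameter_types from pg_prepared_statements where name = 'ps2';
-- parameter_types is reported as {text}, and the undecorated NULL output
-- column is likewise resolved as text rather than being left as unknown.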

regards, tom lane




Re: [HACKERS] Prepared statements assume text type in PG10

2017-10-07 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 4:27 PM, Peter Geoghegan  wrote:
> I suspect commit d8d32d9 is involved here, though I haven't verified that.

Weirdly, there is a git hash collision here, so you'll have to put in
d8d32d9a (8 characters -- the default of 7 for a short git hash isn't
cutting it).

-- 
Peter Geoghegan




Re: [HACKERS] Prepared statements assume text type in PG10

2017-10-07 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 2:56 PM, Jack Christensen
 wrote:
> The test suite for the Go PostgreSQL driver pgx
> (https://github.com/jackc/pgx) found an unexpected behavior change in PG10.
> Previously, it was impossible to prepare a statement with an unknown or
> ambiguous parameter type.
>
> Pre-version 10:
>
> jack=# prepare ps as select $1;
> ERROR:  could not determine data type of parameter $1
>
> But on PG10 the type defaults to text:
>
> jack=# prepare ps as select $1;
> PREPARE
> Time: 0.183 ms
> jack=# execute ps('Hello, there');
>?column?
> --
>  Hello, there
> (1 row)
>
> Time: 0.437 ms
>
> I looked through the git log and couldn't find any commits referencing this.
> Is this an intended behavior change?

I suspect commit d8d32d9 is involved here, though I haven't verified that.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-07 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera  
> wrote:
> >> As you must have seen, Alvaro said he has a variant of Dan's original
> >> script that demonstrates that a problem remains, at least on 9.6+,
> >> even with today's fix. I think it's the stress-test that plays with
> >> fillfactor, many clients, etc [1].
> >
> > I just execute setup.sql once and then run this shell command,
> >
> > while :; do
> > psql -e -P pager=off -f ./repro.sql
> > for i in `seq 1 5`; do
> > psql -P pager=off -e --no-psqlrc -f ./lock.sql &
> > done
> > wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
> > psql -P pager=off -e --no-psqlrc -f ./report.sql
> > echo "done"
> > done
> 
> I cannot reproduce the problem on my personal machine using this
> script/stress-test. I tried to do so on the master branch git tip.
> This reinforces the theory that there is some timing sensitivity,
> because the remaining race condition is very narrow.

Hmm, I think I added a random sleep (max. 100ms) right after the
HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that
makes the race easier to hit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-07 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera  wrote:
>> As you must have seen, Alvaro said he has a variant of Dan's original
>> script that demonstrates that a problem remains, at least on 9.6+,
>> even with today's fix. I think it's the stress-test that plays with
>> fillfactor, many clients, etc [1].
>
> I just execute setup.sql once and then run this shell command,
>
> while :; do
> psql -e -P pager=off -f ./repro.sql
> for i in `seq 1 5`; do
> psql -P pager=off -e --no-psqlrc -f ./lock.sql &
> done
> wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
> psql -P pager=off -e --no-psqlrc -f ./report.sql
> echo "done"
> done

I cannot reproduce the problem on my personal machine using this
script/stress-test. I tried to do so on the master branch git tip.
This reinforces the theory that there is some timing sensitivity,
because the remaining race condition is very narrow.


-- 
Peter Geoghegan




Re: [HACKERS] tablespaces inside $PGDATA considered harmful

2017-10-07 Thread Mark Kirkwood

On 26/09/17 20:44, Mark Kirkwood wrote:




$ pg_basebackup -D .
WARNING:  could not read symbolic link "pg_tblspc/space1": Invalid 
argument
pg_basebackup: directory "/data0/pgdata/11/pg_tblspc/space1" exists 
but is not empty

pg_basebackup: removing contents of data directory "."



Err - actually this example is wrong - sorry. In fact pg_basebackup is 
complaining because it does not want to overwrite the contents of the 
tablespace (need to use the -T option as I'm on the same host)!


A correct example of pg_basebackup failing due to tablespaces inside 
$PGDATA/pg_tblspc can be easily demonstrated by trying to set up 
streaming replication on another host:


$ pg_basebackup -h 10.0.119.100 -P -D .
WARNING:  could not read symbolic link "pg_tblspc/space1": Invalid argument
pg_basebackup: could not create directory "./pg_tblspc": File exists

Fortunately this can be worked around by changing to tar format:

$ pg_basebackup -h 10.0.119.100 -Ft -P -D .
WARNING:  could not read symbolic link "pg_tblspc/space1": Invalid argument
1560632/1560632 kB (100%), 2/2 tablespaces

...however, not that great that the plain mode is busted.

regards

Mark






[HACKERS] Prepared statements assume text type in PG10

2017-10-07 Thread Jack Christensen
The test suite for the Go PostgreSQL driver pgx 
(https://github.com/jackc/pgx) found an unexpected behavior change in 
PG10. Previously, it was impossible to prepare a statement with an 
unknown or ambiguous parameter type.


Pre-version 10:

jack=# prepare ps as select $1;
ERROR:  could not determine data type of parameter $1

But on PG10 the type defaults to text:

jack=# prepare ps as select $1;
PREPARE
Time: 0.183 ms
jack=# execute ps('Hello, there');
   ?column?   
--------------
 Hello, there
(1 row)

Time: 0.437 ms

I looked through the git log and couldn't find any commits referencing 
this. Is this an intended behavior change?


Jack







Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Petr Jelinek
On 07/10/17 19:59, Tom Lane wrote:
> Petr Jelinek  writes:
>> On 07/10/17 18:15, Tom Lane wrote:
>>> No, I'm afraid you didn't read that comment closely enough.  This will
>>> flat out fail for cases like "select ... where x=x order by x", because
>>> there will already be a single-element EC for x and so the clause will
>>> just disappear entirely.  If that doesn't happen, then instead you're
>>> creating an EC with duplicate entries, which is an idea I seriously
>>> dislike; the semantics of that don't seem clear at all to me.
> 
>> Hmm it did not disappear (and worked fine in SQL level tests).
> 
> I may not be correctly remembering what the query would need to look
> like for there to be single-element ECs in existence at this point; but
> I surely would not like this code to assume that none will exist till
> later.
> 
>> I don't
>> think I fully understand the "EC with duplicate entries" part and what's
>> the problem with it so I'll defer to your wisdom there.
> 
> Well, as one example, assume that we use your patch and consider what
> happens with
>   where x = y and x = x
> vs
>   where x = x and x = y
> 
> In the first case we end up with an EC that's just {x,y}, because the
> second process_equivalence() will find both sides in the same EC and
> conclude that the second clause is redundant.  (Which it is, if the
> equality operators have the same notion of what to do with nulls.)
> In the second case we end up with an EC containing {x,x,y}, which
> at minimum will result in emitting redundant generated quals.  I'm
> not sure if it could have any worse consequences than that, but I'm
> not sure it couldn't either.  But this is bogus in any case, because
> those WHERE clauses surely should produce identical results.
> 
> Looking further ahead, if ECs have to support being multisets rather
> than pure sets, that would put a crimp in ever improving this logic to
> use a smarter UNION-FIND algorithm.  (I've not yet seen queries where the
> number of EC members is large enough to make that a serious issue, but
> I think it could happen, particularly with inheritance/partitioning.)
> 

Okay, that makes sense, thanks for the explanation. Your patch is the way to
go then.

>>> This passes the smell test for me in the sense of not adding any
>>> significant number of planner cycles except when the weird case occurs.
>>> It leaves something on the table in that there are some situations
>>> where X=X could be converted, but we won't do it because we don't see
>>> the clause as being a potential EC (because it's not at top level),
>>> as in the second new regression test case below.  I think that's
>>> probably all right; I don't see any way to be more thorough without
>>> adding a lot of new cycles all the time, and I don't believe this is
>>> worth that.
> 
>> My code had same issue. I think going deeper would require quite a bit
>> of new code (and cycles) for something that's even less likely to happen
>> than simple X=X while the current patch is quite cheap win.
> 
> Yeah.  I'm not really convinced it's a big win, but it's so cheap we
> might as well do it.  The only case where it would expend cycles and
> fail to give an improvement is if we have X=X with a non-strict operator,
> which I think is a case that never arises in practice at present,
> because btree effectively assumes that all btree operators are strict.
> (Someday we might want to relax that, though, which is why I don't
> want to let the planner just assume it.)
> 

In real life probably not, but it seems like these small bits can be useful
marketing-wise, given articles like the one which started this thread. And it
should not hurt anything either.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Horrible CREATE DATABASE Performance in High Sierra

2017-10-07 Thread Tom Lane
I wrote:
> Current status is that I've filed a bug report with Apple and am waiting
> to see their response before deciding what to do next.  If they fix the
> issue promptly then there's little need for us to do anything.

Not having heard a peep from Apple yet, I decided to do a bit more
experimenting.  I found that indeed, issuing fewer and larger mmap/msync
requests helps enormously.  If you're willing to go as far as issuing
only one per file, the speed seems on par with non-fsync.  But that
requires being able to mmap 1GB-sized files, so it's probably not
something we want to do.

What I did instead was to adjust the logic in copy_file() so that the
unit of flush requests can be a multiple of the unit of read/write
requests.  (My original thought of just raising the buffer size seems
like not as good an idea; it's less cache-friendly for one thing.)

I find that on both Linux and macOS-with-HFS, requesting a flush only
every 1MB seems to be a win compared to flushing every 64KB as we
currently do.  Actually it seems that on macOS, every increment of
increase in the flush distance helps, but with HFS you hit diminishing
returns after 1MB or so.  With APFS you need a flush distance of 32MB
or more to have credible performance.

Accordingly I propose the attached patch.  If anyone's interested in
experimenting on other platforms, we might be able to refine/complicate
the FLUSH_DISTANCE selection further, but I think this is probably good
enough for a first cut.

regards, tom lane

diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index a5e074e..eae9f5a 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -139,10 +139,24 @@ copy_file(char *fromfile, char *tofile)
 	int			dstfd;
 	int			nbytes;
 	off_t		offset;
+	off_t		flush_offset;
 
-	/* Use palloc to ensure we get a maxaligned buffer */
+	/* Size of copy buffer (read and write requests) */
 #define COPY_BUF_SIZE (8 * BLCKSZ)
 
+	/*
+	 * Size of data flush requests.  It seems beneficial on most platforms to
+	 * do this every 1MB or so.  But macOS, at least with early releases of
+	 * APFS, is really unfriendly to small mmap/msync requests, so there do it
+	 * only every 32MB.
+	 */
+#if defined(__darwin__)
+#define FLUSH_DISTANCE (32 * 1024 * 1024)
+#else
+#define FLUSH_DISTANCE (1024 * 1024)
+#endif
+
+	/* Use palloc to ensure we get a maxaligned buffer */
 	buffer = palloc(COPY_BUF_SIZE);
 
 	/*
@@ -163,11 +177,23 @@ copy_file(char *fromfile, char *tofile)
 	/*
 	 * Do the data copying.
 	 */
+	flush_offset = 0;
 	for (offset = 0;; offset += nbytes)
 	{
 		/* If we got a cancel signal during the copy of the file, quit */
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * We fsync the files later, but during the copy, flush them every so
+		 * often to avoid spamming the cache and hopefully get the kernel to
+		 * start writing them out before the fsync comes.
+		 */
+		if (offset - flush_offset >= FLUSH_DISTANCE)
+		{
+			pg_flush_data(dstfd, flush_offset, offset - flush_offset);
+			flush_offset = offset;
+		}
+
 		pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ);
 		nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
 		pgstat_report_wait_end();
@@ -190,15 +216,11 @@ copy_file(char *fromfile, char *tofile)
 	 errmsg("could not write to file \"%s\": %m", tofile)));
 		}
 		pgstat_report_wait_end();
-
-		/*
-		 * We fsync the files later but first flush them to avoid spamming the
-		 * cache and hopefully get the kernel to start writing them out before
-		 * the fsync comes.
-		 */
-		pg_flush_data(dstfd, offset, nbytes);
 	}
 
+	if (offset > flush_offset)
+		pg_flush_data(dstfd, flush_offset, offset - flush_offset);
+
 	if (CloseTransientFile(dstfd))
 		ereport(ERROR,
 (errcode_for_file_access(),



Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Nico Williams
On Fri, Oct 06, 2017 at 10:19:54PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
> >> The article in question is here:
> >> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/
> 
> > That's interesting.
> 
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in themselves.

That's true for some of those.  But some of them might become important
when you start pushing WHERE constraints from outside into inner table
sources and subqueries, as dumb-looking constraints can simply appear
from pushing non-dumb-looking constraints.

More than the optimizations above, the following would make a big difference for me:

 - turning subqueries into joins

 - turning ORs into UNIONs

   It is easy enough to work around the lack of this optimization in
   many cases (see the sketch after this list), but it does make queries
   more verbose.

 - pushing WHERE constraints from outer queries into the table source
   queries (_including_ VIEWs)

- determining that some table in a query that had WHERE constraints
  pushed into it... now has a very well-filled out lookup key,
  therefore it's the one that should be the table source to start
  the plan with, i.e., that it should be first in the outermost loop
  of a nested loop join 

  For me these two would be huge wins.  I have to resort to
  functions with roughly the same body as views just so that I can
  have the optimizer pick the correct plan.  This causes a lot of
  code duplication in my schemas.

 - pushing WHERE constraints from outer queries into HAVING thence WHERE
   constraints on GROUP BY queries where the outer constraints are on
   columns used to GROUP BY

   I find myself making two versions of views that do aggregation: one
   that does not, and one that does.  This allows me to use the
   non-aggregating view in contexts where I need this optimization, but
   then I have to re-code the aggregation at that layer.  Again, lots of
   duplication.

These sorts of optimizations are huge.
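As a concrete sketch of the OR-to-UNION rewrite mentioned in the list above
(table and column names are invented for illustration):

-- Original form; the planner may fall back to a less selective scan:
select * from orders where customer_id = 42 or salesperson_id = 7;

-- Hand-rewritten form that can use a separate index on each column
-- (UNION also removes duplicate rows, so it is only equivalent when that
-- is acceptable):
select * from orders where customer_id = 42
union
select * from orders where salesperson_id = 7;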

> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?

If you can reduce the number of compilations / optimization passes for
statements, then spending more time in the optimizer is not a big deal.
So, when invoked via PREPARE I would say spending more cycles looking
for this sort of thing is OK, but in many other cases it's not.

Also, sometimes these cases crop up due to pushing constraints into VIEWs
and sub-queries.  In those cases, constant sub-expression elimination can
be a win.

> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)

I'd expect that column = column is not trivial to turn into TRUE, unless
those columns are NOT NULLable.
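A tiny made-up example of why a nullable column = column cannot simply be
dropped:

create table t (a int);
insert into t values (1), (NULL);
select count(*) from t;              -- 2
select count(*) from t where a = a;  -- 1: the NULL row does not pass a = a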

> > 9. Unneeded Self JOIN
> 
> > Can't remember discussions of this.
> 
> I can't get very excited about that one either.
> 
> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I can't get very excited about this one either, though I do believe it
can arise as the author says, "when you build complex views and JOIN
them to each other".  Maybe I'm not excited about it because I've not
needed it :)

Nico
-- 




Re: [HACKERS] Slow synchronous logical replication

2017-10-07 Thread Konstantin Knizhnik

On 10/07/2017 10:42 PM, Andres Freund wrote:
> Hi,
>
> On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
>> In our sharded cluster project we are trying to use logical replication for
>> providing HA (maintaining redundant shard copies).
>> Using asynchronous logical replication does not make much sense in the
>> context of HA. This is why we try to use synchronous logical replication.
>> Unfortunately it shows very bad performance. With 50 shards and level of
>> redundancy=1 (just one copy) the cluster is 20 times slower than without
>> logical replication.
>> With asynchronous replication it is "only" two times slower.
>>
>> As far as I understand, the reason for such bad performance is that the
>> synchronous replication mechanism was originally developed for streaming
>> replication, where all replicas have the same content and LSNs. When it is
>> used for logical replication, it behaves very inefficiently. Commit has to
>> wait for confirmations from all receivers mentioned in the
>> "synchronous_standby_names" list. So we are waiting not only for our own
>> single logical replication standby, but for all other standbys as well. The
>> number of synchronous standbys is equal to the number of shards divided by
>> the number of nodes. To provide uniform distribution the number of shards
>> should be >> the number of nodes; for example, for 10 nodes we usually
>> create 100 shards. As a result we get awful performance, and blocking of
>> any replication channel blocks all backends.
>>
>> So my question is whether my understanding is correct and synchronous
>> logical replication cannot be efficiently used in such a manner.
>> If so, the next question is how difficult it will be to make the
>> synchronous replication mechanism for logical replication more efficient,
>> and are there some plans to work in this direction?
>
> This seems to be a question that is a) about a commercial project we
> don't know much about b) hasn't received a lot of investigation.


Sorry if I was not clear.
The question was about the logical replication mechanism in the mainstream
version of Postgres.
I think that most people are using asynchronous logical replication, and
synchronous LR is something exotic and not well tested and investigated.
It will be great if I am wrong:)

Concerning our sharded cluster (pg_shardman) - it is not a commercial product
yet, it is in the development phase.
We are going to open its sources when it is more or less stable.
But unlike multimaster, this sharded cluster is mostly built from existing
components: pg_pathman + postgres_fdw + logical replication.
So we are just trying to combine them all into some integrated system.
But currently the most obscure point is logical replication.

And the main goal of my e-mail was to learn the opinion of authors and users
of LR on whether it is a good idea to use LR to provide fault tolerance in a
sharded cluster.
Or are some other approaches, for example sharding with redundancy or using
streaming replication, preferable?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Slow synchronous logical replication

2017-10-07 Thread Andres Freund
Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
> In our sharded cluster project we are trying to use logical replication for
> providing HA (maintaining redundant shard copies).
> Using asynchronous logical replication does not make much sense in the
> context of HA. This is why we try to use synchronous logical replication.
> Unfortunately it shows very bad performance. With 50 shards and level of
> redundancy=1 (just one copy) the cluster is 20 times slower than without
> logical replication.
> With asynchronous replication it is "only" two times slower.
> 
> As far as I understand, the reason for such bad performance is that the
> synchronous replication mechanism was originally developed for streaming
> replication, where all replicas have the same content and LSNs. When it is
> used for logical replication, it behaves very inefficiently. Commit has to
> wait for confirmations from all receivers mentioned in the
> "synchronous_standby_names" list. So we are waiting not only for our own
> single logical replication standby, but for all other standbys as well. The
> number of synchronous standbys is equal to the number of shards divided by
> the number of nodes. To provide uniform distribution the number of shards
> should be >> the number of nodes; for example, for 10 nodes we usually
> create 100 shards. As a result we get awful performance, and blocking of
> any replication channel blocks all backends.
> 
> So my question is whether my understanding is correct and synchronous
> logical replication cannot be efficiently used in such a manner.
> If so, the next question is how difficult it will be to make the synchronous
> replication mechanism for logical replication more efficient, and are there
> some plans to work in this direction?

This seems to be a question that is a) about a commercial project we
don't know much about b) hasn't received a lot of investigation.

Greetings,

Andres Freund




[HACKERS] Slow synchronous logical replication

2017-10-07 Thread konstantin knizhnik
In our sharded cluster project we are trying to use logical replication for
providing HA (maintaining redundant shard copies).
Using asynchronous logical replication does not make much sense in the context
of HA. This is why we try to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and level of
redundancy=1 (just one copy) the cluster is 20 times slower than without
logical replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason for such bad performance is that the
synchronous replication mechanism was originally developed for streaming
replication, where all replicas have the same content and LSNs. When it is
used for logical replication, it behaves very inefficiently. Commit has to
wait for confirmations from all receivers mentioned in the
"synchronous_standby_names" list. So we are waiting not only for our own
single logical replication standby, but for all other standbys as well. The
number of synchronous standbys is equal to the number of shards divided by the
number of nodes. To provide uniform distribution the number of shards should
be >> the number of nodes; for example, for 10 nodes we usually create 100
shards. As a result we get awful performance, and blocking of any replication
channel blocks all backends.

So my question is whether my understanding is correct and synchronous logical
replication cannot be efficiently used in such a manner.
If so, the next question is how difficult it will be to make the synchronous
replication mechanism for logical replication more efficient, and are there
some plans to work in this direction?
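To make the setup concrete, here is a hedged sketch (the subscriber
application names are invented) of the kind of configuration this leads to on
each node; every local commit then waits for acknowledgements from all listed
logical subscribers:

alter system set synchronous_standby_names =
    'FIRST 3 (sub_shard_01, sub_shard_02, sub_shard_03)';
select pg_reload_conf();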



[HACKERS] Help required to debug pg_repack breaking logical replication

2017-10-07 Thread Daniele Varrazzo
Hello,

we have received reports of, and I have experienced a couple of times,
pg_repack breaking logical replication.

- https://github.com/reorg/pg_repack/issues/135
- https://github.com/2ndQuadrant/pglogical/issues/113

In my experience, after the botched run, the replication slot was
"stuck", and any attempt at reading it (including
pg_logical_slot_peek_changes()) blocked until ctrl-C. I've tried to
replicate the issue, but my first attempts have failed to fail.

In the above issue #113, Petr Jelinek commented:

> From quick look at pg_repack, the way it does table rewrite is almost 
> guaranteed
> to break logical decoding unless there is zero unconsumed changes for a given 
> table
> as it does not build the necessary mappings info for logical decoding that 
> standard
> heap rewrite in postgres does.

unfortunately he didn't follow up on further requests for details.

I've started drilling down the problem, observing that the swap
function, swap_heap_or_index_files() [1] was cargoculted by the
original author from the CLUSTER command code as of PG 8.2 [2] (with a
custom addition to update the relfrozenxid which seems backwards to me
as it sets the older frozen xid on the new table [3]).

[1] https://github.com/reorg/pg_repack/blob/ver_1.4.1/lib/repack.c#L1082
[2] 
https://github.com/postgres/postgres/blob/REL8_2_STABLE/src/backend/commands/cluster.c#L783
[3] https://github.com/reorg/pg_repack/issues/152

so that code is effectively missing a good 10 years of development.
Before jumping into fast-forwarding it, I would like to ask for some
help, i.e.

- Is Petr's diagnosis right, and is the freezing of logical replication to be
blamed on the missing mapping?
- Can you suggest a test to reproduce the issue reliably?
- What are mapped relations anyway?

Thank you in advance for any help (either info about how to fix the
issue properly, or a patch by someone who happens to really understand
what we are talking about).

-- Daniele




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Tom Lane
Petr Jelinek  writes:
> On 07/10/17 18:15, Tom Lane wrote:
>> No, I'm afraid you didn't read that comment closely enough.  This will
>> flat out fail for cases like "select ... where x=x order by x", because
>> there will already be a single-element EC for x and so the clause will
>> just disappear entirely.  If that doesn't happen, then instead you're
>> creating an EC with duplicate entries, which is an idea I seriously
>> dislike; the semantics of that don't seem clear at all to me.

> Hmm it did not disappear (and worked fine in SQL level tests).

I may not be correctly remembering what the query would need to look
like for there to be single-element ECs in existence at this point; but
I surely would not like this code to assume that none will exist till
later.

> I don't
> think I fully understand the "EC with duplicate entries" part and what's
> the problem with it so I'll defer to your wisdom there.

Well, as one example, assume that we use your patch and consider what
happens with
where x = y and x = x
vs
where x = x and x = y

In the first case we end up with an EC that's just {x,y}, because the
second process_equivalence() will find both sides in the same EC and
conclude that the second clause is redundant.  (Which it is, if the
equality operators have the same notion of what to do with nulls.)
In the second case we end up with an EC containing {x,x,y}, which
at minimum will result in emitting redundant generated quals.  I'm
not sure if it could have any worse consequences than that, but I'm
not sure it couldn't either.  But this is bogus in any case, because
those WHERE clauses surely should produce identical results.
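For concreteness, a sketch of the two orderings (table and columns invented),
assuming the prototype patch is applied:

create table t (x int, y int);
-- First ordering: the second clause is found redundant, the EC is {x,y}:
explain select * from t where x = y and x = x;
-- Second ordering: the EC would come out as {x,x,y}, even though both
-- queries must return the same rows:
explain select * from t where x = x and x = y;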

Looking further ahead, if ECs have to support being multisets rather
than pure sets, that would put a crimp in ever improving this logic to
use a smarter UNION-FIND algorithm.  (I've not yet seen queries where the
number of EC members is large enough to make that a serious issue, but
I think it could happen, particularly with inheritance/partitioning.)

>> This passes the smell test for me in the sense of not adding any
>> significant number of planner cycles except when the weird case occurs.
>> It leaves something on the table in that there are some situations
>> where X=X could be converted, but we won't do it because we don't see
>> the clause as being a potential EC (because it's not at top level),
>> as in the second new regression test case below.  I think that's
>> probably all right; I don't see any way to be more thorough without
>> adding a lot of new cycles all the time, and I don't believe this is
>> worth that.

> My code had same issue. I think going deeper would require quite a bit
> of new code (and cycles) for something that's even less likely to happen
> than simple X=X while the current patch is quite cheap win.

Yeah.  I'm not really convinced it's a big win, but it's so cheap we
might as well do it.  The only case where it would expend cycles and
fail to give an improvement is if we have X=X with a non-strict operator,
which I think is a case that never arises in practice at present,
because btree effectively assumes that all btree operators are strict.
(Someday we might want to relax that, though, which is why I don't
want to let the planner just assume it.)

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Petr Jelinek
On 07/10/17 18:15, Tom Lane wrote:
> Petr Jelinek  writes:
>> On 07/10/17 04:19, Tom Lane wrote:
>>> (edit: a few minutes later, I seem to remember that equivclass.c has
>>> to do something special with the X=X case, so maybe it could do
>>> something else special instead, with little new overhead.)
> 
>> So I wrote prototype of achieving this optimization and it seems to be
>> really quite simple code-wise (see attached). I did only a limited
>> manual testing of this but I don't see any negative impact on planning time.
> 
> No, I'm afraid you didn't read that comment closely enough.  This will
> flat out fail for cases like "select ... where x=x order by x", because
> there will already be a single-element EC for x and so the clause will
> just disappear entirely.  If that doesn't happen, then instead you're
> creating an EC with duplicate entries, which is an idea I seriously
> dislike; the semantics of that don't seem clear at all to me.

Hmm, it did not disappear (and worked fine in SQL-level tests). I don't
think I fully understand the "EC with duplicate entries" part and what the
problem with it is, so I'll defer to your wisdom there.

> What I was imagining was that having detected X=X, instead of "throwing
> back" the clause as-is we could throw back an X IS NOT NULL clause,
> along the lines of the attached.

Right, I wanted to avoid messing with the incoming result info, but if
we don't want to call the code below or write tons of code for this, I
guess it's the best we can do.

> This passes the smell test for me in the sense of not adding any
> significant number of planner cycles except when the weird case occurs.
> It leaves something on the table in that there are some situations
> where X=X could be converted, but we won't do it because we don't see
> the clause as being a potential EC (because it's not at top level),
> as in the second new regression test case below.  I think that's
> probably all right; I don't see any way to be more thorough without
> adding a lot of new cycles all the time, and I don't believe this is
> worth that.
> 

My code had the same issue. I think going deeper would require quite a bit
of new code (and cycles) for something that's even less likely to happen
than simple X=X, while the current patch is quite a cheap win.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid

2017-10-07 Thread Petr Jelinek
On 07/10/17 18:23, Konstantin Knizhnik wrote:
> On 10/07/2017 04:26 PM, Petr Jelinek wrote:
>>
>> Hmm so you start transaction (you have to when running
>> CREATE_REPLICATION_SLOT with USE_SNAPSHOT parameter). And while the slot
>> is being created the config is reloaded. And since now you are in
>> transaction the tsearch hook for GUC processing tries to access catalogs
>> which sets the xmin for the transaction.
> 
> Actually this slot is implicitly created by LogicalRepSyncTableStart to
> perform initial data sync.
> 

That does not change what I said above.

>>
>> That's not good, but I can't really say I have an idea about what to do
>> with it other than to set some kind of flag saying that logical decoding
>> snapshot is being built and using that to skip catalog access which does
>> not seem very pretty.
>>
> It is not quite clear from the comment why it is not possible to
> overwrite MyPgXact->xmin or just use the existing value.
> 

We can't use the existing value because the snapshot used to read data would
no longer correspond to the one we just built as a consistent start point.

About overwriting, well, we may be able to do that if there was no active
snapshot in the transaction, i.e. we could try doing something like
SnapshotResetXmin() or possibly rather InvalidateCatalogSnapshot(), since
the xmin was set for reading a catalog snapshot before checking
MyPgXact->xmin.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] separate serial_schedule useful?

2017-10-07 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> There's no reason why pg_regress couldn't have a
>> --bail-if-group-size-exceeds=N argument, or why we couldn't have a
>> separate Perl script to validate the schedule file as part of the
>> build process.

> I'd go for the former approach; seems like less new code and fewer cycles
> used to enforce the rule.

Concretely, how about the attached?  (Obviously we'd have to fix
parallel_schedule before committing this.)

regards, tom lane

diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index b923ea1..56cd202 100644
*** a/src/test/regress/GNUmakefile
--- b/src/test/regress/GNUmakefile
*** tablespace-setup:
*** 124,130 
  ## Run tests
  ##
  
! REGRESS_OPTS = --dlpath=. $(EXTRA_REGRESS_OPTS)
  
  check: all tablespace-setup
  	$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
--- 124,130 
  ## Run tests
  ##
  
! REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 $(EXTRA_REGRESS_OPTS)
  
  check: all tablespace-setup
  	$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index abb742b..ff979b8 100644
*** a/src/test/regress/pg_regress.c
--- b/src/test/regress/pg_regress.c
*** char	   *launcher = NULL;
*** 78,83 
--- 78,84 
  static _stringlist *loadlanguage = NULL;
  static _stringlist *loadextension = NULL;
  static int	max_connections = 0;
+ static int	max_concurrent_tests = 0;
  static char *encoding = NULL;
  static _stringlist *schedulelist = NULL;
  static _stringlist *extra_tests = NULL;
*** run_schedule(const char *schedule, test_
*** 1691,1696 
--- 1692,1704 
  			wait_for_tests(pids, statuses, NULL, 1);
  			/* status line is finished below */
  		}
+ 		else if (max_concurrent_tests > 0 && max_concurrent_tests < num_tests)
+ 		{
+ 			/* can't print scbuf here, it's already been trashed */
+ 			fprintf(stderr, _("too many tests (more than %d) in schedule file \"%s\" line %d\n"),
+ 	max_concurrent_tests, schedule, line_num);
+ 			exit(2);
+ 		}
  		else if (max_connections > 0 && max_connections < num_tests)
  		{
  			int			oldest = 0;
*** help(void)
*** 1999,2004 
--- 2007,2014 
  	printf(_("tests; can appear multiple times\n"));
  	printf(_("  --max-connections=N   maximum number of concurrent connections\n"));
  	printf(_("(default is 0, meaning unlimited)\n"));
+ 	printf(_("  --max-concurrent-tests=N  maximum number of concurrent tests in schedule\n"));
+ 	printf(_("(default is 0, meaning unlimited)\n"));
  	printf(_("  --outputdir=DIR   place output files in DIR (default \".\")\n"));
  	printf(_("  --schedule=FILE   use test ordering schedule from FILE\n"));
  	printf(_("(can be used multiple times to concatenate)\n"));
*** regression_main(int argc, char *argv[], 
*** 2048,2053 
--- 2058,2064 
  		{"launcher", required_argument, NULL, 21},
  		{"load-extension", required_argument, NULL, 22},
  		{"config-auth", required_argument, NULL, 24},
+ 		{"max-concurrent-tests", required_argument, NULL, 25},
  		{NULL, 0, NULL, 0}
  	};
  
*** regression_main(int argc, char *argv[], 
*** 2161,2166 
--- 2172,2180 
  			case 24:
  config_auth_datadir = pg_strdup(optarg);
  break;
+ 			case 25:
+ max_concurrent_tests = atoi(optarg);
+ break;
  			default:
  /* getopt_long already emitted a complaint */
  fprintf(stderr, _("\nTry \"%s -h\" for more information.\n"),



Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid

2017-10-07 Thread Konstantin Knizhnik

On 10/07/2017 04:26 PM, Petr Jelinek wrote:
> Hmm so you start transaction (you have to when running
> CREATE_REPLICATION_SLOT with USE_SNAPSHOT parameter). And while the slot
> is being created the config is reloaded. And since now you are in
> transaction the tsearch hook for GUC processing tries to access catalogs
> which sets the xmin for the transaction.

Actually this slot is implicitly created by LogicalRepSyncTableStart to
perform the initial data sync.

> That's not good, but I can't really say I have an idea about what to do
> with it other than to set some kind of flag saying that logical decoding
> snapshot is being built and using that to skip catalog access which does
> not seem very pretty.

It is not quite clear from the comment why it is not possible to overwrite
MyPgXact->xmin or just use the existing value.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Tom Lane
David Rowley  writes:
> It would be much nicer if you'd at least wait for benchmarks before
> shooting.

Benchmarks of what?  We'd have to expend quite a bit of effort just
to get to a place where we'd have something to benchmark.  I do not
think it's unreasonable of me to express an opinion that that would
likely be wasted effort.  If you disagree, and are willing to expend
such effort speculatively, I'm not stopping you.

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Tom Lane
Petr Jelinek  writes:
> On 07/10/17 04:19, Tom Lane wrote:
>> (edit: a few minutes later, I seem to remember that equivclass.c has
>> to do something special with the X=X case, so maybe it could do
>> something else special instead, with little new overhead.)

> So I wrote prototype of achieving this optimization and it seems to be
> really quite simple code-wise (see attached). I did only a limited
> manual testing of this but I don't see any negative impact on planning time.

No, I'm afraid you didn't read that comment closely enough.  This will
flat out fail for cases like "select ... where x=x order by x", because
there will already be a single-element EC for x and so the clause will
just disappear entirely.  If that doesn't happen, then instead you're
creating an EC with duplicate entries, which is an idea I seriously
dislike; the semantics of that don't seem clear at all to me.

What I was imagining was that having detected X=X, instead of "throwing
back" the clause as-is we could throw back an X IS NOT NULL clause,
along the lines of the attached.

This passes the smell test for me in the sense of not adding any
significant number of planner cycles except when the weird case occurs.
It leaves something on the table in that there are some situations
where X=X could be converted, but we won't do it because we don't see
the clause as being a potential EC (because it's not at top level),
as in the second new regression test case below.  I think that's
probably all right; I don't see any way to be more thorough without
adding a lot of new cycles all the time, and I don't believe this is
worth that.

regards, tom lane

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7997f50..a225414 100644
*** a/src/backend/optimizer/path/equivclass.c
--- b/src/backend/optimizer/path/equivclass.c
***
*** 27,32 
--- 27,33 
  #include "optimizer/paths.h"
  #include "optimizer/planmain.h"
  #include "optimizer/prep.h"
+ #include "optimizer/restrictinfo.h"
  #include "optimizer/var.h"
  #include "utils/lsyscache.h"
  
*** static bool reconsider_full_join_clause(
*** 71,78 
   *	  any delay by an outer join, so its two sides can be considered equal
   *	  anywhere they are both computable; moreover that equality can be
   *	  extended transitively.  Record this knowledge in the EquivalenceClass
!  *	  data structure.  Returns TRUE if successful, FALSE if not (in which
!  *	  case caller should treat the clause as ordinary, not an equivalence).
   *
   * If below_outer_join is true, then the clause was found below the nullable
   * side of an outer join, so its sides might validly be both NULL rather than
--- 72,85 
   *	  any delay by an outer join, so its two sides can be considered equal
   *	  anywhere they are both computable; moreover that equality can be
   *	  extended transitively.  Record this knowledge in the EquivalenceClass
!  *	  data structure, if applicable.  Returns TRUE if successful, FALSE if not
!  *	  (in which case caller should treat the clause as ordinary, not an
!  *	  equivalence).
!  *
!  * In some cases, although we cannot convert a clause into EquivalenceClass
!  * knowledge, we can still modify it to a more useful form than the original.
!  * Then, *p_restrictinfo will be replaced by a new RestrictInfo, which is what
!  * the caller should use for further processing.
   *
   * If below_outer_join is true, then the clause was found below the nullable
   * side of an outer join, so its sides might validly be both NULL rather than
*** static bool reconsider_full_join_clause(
*** 104,112 
   * memory context.
   */
  bool
! process_equivalence(PlannerInfo *root, RestrictInfo *restrictinfo,
  	bool below_outer_join)
  {
  	Expr	   *clause = restrictinfo->clause;
  	Oid			opno,
  collation,
--- 111,121 
   * memory context.
   */
  bool
! process_equivalence(PlannerInfo *root,
! 	RestrictInfo **p_restrictinfo,
  	bool below_outer_join)
  {
+ 	RestrictInfo *restrictinfo = *p_restrictinfo;
  	Expr	   *clause = restrictinfo->clause;
  	Oid			opno,
  collation,
*** process_equivalence(PlannerInfo *root, R
*** 154,169 
  	   collation);
  
  	/*
! 	 * Reject clauses of the form X=X.  These are not as redundant as they
! 	 * might seem at first glance: assuming the operator is strict, this is
! 	 * really an expensive way to write X IS NOT NULL.  So we must not risk
! 	 * just losing the clause, which would be possible if there is already a
! 	 * single-element EquivalenceClass containing X.  The case is not common
! 	 * enough to be worth contorting the EC machinery for, so just reject the
! 	 * clause and let it be processed as a normal restriction clause.
  	 */
  	if (equal(item1, item2))
! 		return false;			/* X=X is not a useful equivalence */
  
  	/*
  	 * If below outer join, check for 

Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-07 Thread Maksim Milyutin

07.10.17 16:34, Robert Haas wrote:
> On Fri, Oct 6, 2017 at 12:37 PM, Alvaro Herrera  wrote:
>
> One thing I'm a bit worried about is how to name these subordinate
> indexes.  They have to have names because that's how pg_class works,
> and those names can't all be the same, again because that's how
> pg_class works.  There's no problem right away when you first create
> the partitioned index, because you can just pick names out of a hat
> using whatever name-generating algorithm seems best.  However, when
> you dump-and-restore (including but not limited to the pg_upgrade
> case) you've got to preserve those names.  If you just generate a new
> name that may or may not be the same as the old one, then it may
> collide with a user-specified name that only occurs later in the dump.
> Also, you'll have trouble if the user has applied a COMMENT or a
> SECURITY LABEL to the index because that command works by name, or if
> the user has a reference to the index name inside a function or
> whatever.
>
> These are pretty annoying corner-case bugs because they're not likely
> to come up very often.  Most people won't notice or care if the index
> name changes.  But I don't think it's acceptable to just ignore the
> problem.  An idea I had was to treat the abstract index - to use your
> term - sort of the way we treat an extension.  Normally, when you
> create an index on a partitioned table, it cascades, but for dump and
> restore purpose, we tag on some syntax that says "well, don't actually
> create the subordinate indexes, i'll tell you about those later".
> Then for each subordinate index we issue a separate CREATE INDEX
> command followed by ALTER INDEX abstract_index ATTACH PARTITION
> concrete_index or something of that sort.  That means you can't
> absolutely count on the parent index to have all of the children it's
> supposed to have but maybe that's OK.


AFAICS, the main problem with naming is generating new unique names for 
subordinate indexes at the stage of migrating the data schema (pg_dump, 
pg_upgrade, etc). And since we cannot specify these names in the 'CREATE INDEX 
partitioned_index' statement, we have to regenerate them.


In this case I propose to restore the index name hierarchy *bottom-up*, 
i.e. first create indexes for the leaf partitions and then create the ones 
for parents up to the root, explicitly specifying names. When creating an 
index on a parent table we have to check whether there exists any index on a 
child table that could serve as the child index (identical criteria). If so, 
we do not generate a new index but implicitly attach that index to the 
parent one.
If we have an incomplete index hierarchy, e.g. we dropped some indexes of 
partitions previously, then recreating the parent's index would regenerate 
(not attach) indexes for those partitions. We could drop those odd generated 
indexes after building the parent's index. This decision is not 
straightforward, but it allows us to treat the 'CREATE INDEX 
partitioned_table' statement as a cascade operation.
As a result, we can specify a name for each concrete index while recreating 
the whole hierarchy.


--
Regards,
Maksim Milyutin





Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread David Rowley
On 7 October 2017 at 15:19, Tom Lane  wrote:
>> 9. Unneeded Self JOIN
>
>> Can't remember discussions of this.
>
> I can't get very excited about that one either.
>
> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I think that final sentence lacks imagination.

I've seen plenty of views being called where some column is
unavailable, but the caller joins the very same table again on the
primary key to add the column. There was no brain-dead ORM involved,
just perhaps questionable design. This was very common in my last job
where we had some rats nest of views several layers deep, the core of
which often had some complex logic that nobody dared to try and
replicate.
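A hedged illustration of that pattern (schema invented):

create table customer (id int primary key, name text, email text);
create view customer_summary as
  select c.id, c.name from customer c;   -- view does not expose email

-- The caller joins the same table back on its primary key just to add email:
select s.id, s.name, c.email
from customer_summary s
join customer c on c.id = s.id;           -- self-join, removable in principle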

It would be fairly cheap to check if any of the rtekind==RTE_RELATION
joinlist  items have above 1 RangeTblEntry with the same relid.  The
joinlist is never that big anyway, and if it was the join search would
be slow. The more expensive part would be to build the join clauses,
check if the expressions on either side of all OpExpr matches and that
nothing else will cause a non-match, then perform the uniqueness check
on those OpExpr operands. We do have some infrastructure to do the
unique checks.  Likely the slowdown in planning would be just for
cases with a legitimately useful self-join, I doubt checking for a
duplicate RangeTblEntry->relid would cause much of a concern.

Anyway, this is giving me some feeling of déjà vu. Perhaps we need
some pg_stats view that shows us planning time and execution time so
that we can get a better idea on how much these things matter in the
average case. We tend to never fare so well in these sorts of
comparisons with commercial databases. It's not hard to imagine
someone with migration plans loading some rats nest of views into
Postgres and taking it for a spin and finding performance is not quite
what they need. It's a bit sad that often the people with the loudest
voices are always so fast to stomp on the ideas for improvements. It
would be much nicer if you'd at least wait for benchmarks before
shooting.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] separate serial_schedule useful?

2017-10-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 6, 2017 at 4:16 PM, Tom Lane  wrote:
>> The other routine mistake, which I see Robert just made again,
>> is to break the at-most-twenty-parallel-tests-at-once convention.
>> I wonder if we can get in some sort of automated check for that.

> There's no reason why pg_regress couldn't have a
> --bail-if-group-size-exceeds=N argument, or why we couldn't have a
> separate Perl script to validate the schedule file as part of the
> build process.

I'd go for the former approach; seems like less new code and fewer cycles
used to enforce the rule.

regards, tom lane




Re: [HACKERS] parallelize queries containing initplans

2017-10-07 Thread Robert Haas
On Fri, Oct 6, 2017 at 7:08 AM, Amit Kapila  wrote:
> I have fixed the other review comment related to using safe_param_list
> in the attached patch.  I think I have fixed all comments given by
> you, but let me know if I have missed anything or you have any other
> comment.

-Param   *param = (Param *) node;
+if (list_member_int(context->safe_param_ids, ((Param *)
node)->paramid))
+return false;

-if (param->paramkind != PARAM_EXEC ||
-!list_member_int(context->safe_param_ids, param->paramid))
-{
-if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
-return true;
-}
-return false;/* nothing to recurse to */
+if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
+return true;

I think your revised logic is wrong here because it drops the
paramkind test, and I don't see any justification for that.  IIUC,
param IDs are tracked separately for extern and exec parameters, so
why would we ignore that distinction here?

But I'm also wondering if we're missing a trick here, because isn't a
PARAM_EXTERN parameter always safe, given SerializeParamList?  If so,
this || ought to be &&, but if that adjustment is needed, it seems
like a separate patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-07 Thread Robert Haas
On Fri, Oct 6, 2017 at 12:37 PM, Alvaro Herrera  wrote:
> 2. create one index for each existing partition.  These would be
>identical to what would happen if you created the index directly on
>each partition, except that there is an additional dependency to the
>parent's abstract index.

One thing I'm a bit worried about is how to name these subordinate
indexes.  They have to have names because that's how pg_class works,
and those names can't all be the same, again because that's how
pg_class works.  There's no problem right away when you first create
the partitioned index, because you can just pick names out of a hat
using whatever name-generating algorithm seems best.  However, when
you dump-and-restore (including but not limited to the pg_upgrade
case) you've got to preserve those names.  If you just generate a new
name that may or may not be the same as the old one, then it may
collide with a user-specified name that only occurs later in the dump.
Also, you'll have trouble if the user has applied a COMMENT or a
SECURITY LABEL to the index because that command works by name, or if
the user has a reference to the index name inside a function or
whatever.

These are pretty annoying corner-case bugs because they're not likely
to come up very often.  Most people won't notice or care if the index
name changes.  But I don't think it's acceptable to just ignore the
problem.  An idea I had was to treat the abstract index - to use your
term - sort of the way we treat an extension.  Normally, when you
create an index on a partitioned table, it cascades, but for dump and
restore purposes, we tag on some syntax that says "well, don't actually
create the subordinate indexes, I'll tell you about those later".
Then for each subordinate index we issue a separate CREATE INDEX
command followed by ALTER INDEX abstract_index ATTACH PARTITION
concrete_index or something of that sort.  That means you can't
absolutely count on the parent index to have all of the children it's
supposed to have but maybe that's OK.
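
For illustration, a dump under that scheme might come out roughly like this
(the syntax and names are purely hypothetical; nothing like it exists yet):

-- Create only the abstract parent index, without cascading to partitions.
CREATE INDEX parent_val_idx ON ONLY parent_tbl (val);

-- Recreate each concrete index under its preserved name, then attach it to
-- the abstract index, so COMMENT/SECURITY LABEL and other by-name references
-- keep working after restore.
CREATE INDEX part1_val_idx ON part1 (val);
ALTER INDEX parent_val_idx ATTACH PARTITION part1_val_idx;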

Another thing that would let you do is CREATE INDEX CONCURRENTLY
replacement_concrete_index; ALTER INDEX abstract_index DETACH
PARTITION old_concrete_index, ATTACH PARTITION
replacement_concrete_index; DROP INDEX CONCURRENTLY
old_concrete_index, which seems like a thing someone might want to do.
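
Spelled out with the same hypothetical names and syntax as above, that
replacement dance would look something like:

-- Build a replacement index without blocking writers, swap it in under the
-- abstract index, then drop the old one.
CREATE INDEX CONCURRENTLY part1_val_idx_new ON part1 (val);
ALTER INDEX parent_val_idx DETACH PARTITION part1_val_idx;
ALTER INDEX parent_val_idx ATTACH PARTITION part1_val_idx_new;
DROP INDEX CONCURRENTLY part1_val_idx;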

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On markers of changed data

2017-10-07 Thread Stephen Frost
Alvaro, Michael,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> Michael Paquier wrote:
> > That’s actually what pg_rman is doing for what it calls incremental
> > backups (perhaps that would be differential backup in PG
> > terminology?), and the performance is bad as you can imagine. We could
> > have a dedicated LSN map to do such things with 4 bytes per page. I am
> > still not convinced that this much facility and the potential bug
> > risks are worth it though, Postgres already knows about differential
> > backups if you shape it as a delta of WAL segments. I think that, in
> > order to find a LSN map more convincing, we should find first other
> > use cases where it could become useful. Some use cases may pop up with
> > VACUUM, but I have not studied the question hard enough...
> 
> The case I've discussed with barman developers is a large database
> (couple dozen of TBs should be enough) where a large fraction (say 95%)
> is read-only but there are many changes to the active part of the data,
> so that WAL is more massive than size of active data.

Yes, we've seen environments like that also.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid

2017-10-07 Thread Petr Jelinek
On 06/10/17 16:46, Konstantin Knizhnik wrote:
> 
> 
> On 06.10.2017 15:29, Petr Jelinek wrote:
>> On 06/10/17 12:16, Konstantin Knizhnik wrote:
>>> When creating logical replication slots we quite often get the following
>>> error:
>>>
>>> ERROR:  cannot build an initial slot snapshot when MyPgXact->xmin
>>> already is valid
>>>
>>> which cause restart of WAL sender.
>>> The comment to this line doesn't clarify much:
>>>
>>>  /* so we don't overwrite the existing value */
>>>  if (TransactionIdIsValid(MyPgXact->xmin))
>>>  elog(ERROR, "cannot build an initial slot snapshot when
>>> MyPgXact->xmin already is valid");
>>> I wonder if this is a normal situation or if something is going wrong?
>>>
>> Hi,
>>
>> no, it's not a normal situation; it seems you are doing something that
>> assigns an xid before you run the CREATE_REPLICATION_SLOT command on that
>> connection.
> 
> I am not doing anything in this connection: it is the WAL sender
> executing the CREATE_REPLICATION_SLOT replication command.
> Please look at the stack in my original e-mail. It shows who is setting
> MyPgXact->xmin and when.
> It is GetSnapshotData, called because of reloading configuration settings.
> And configuration settings are reloaded because our application is
> updating "synchronous_standby_names".
> 
> 

Hmm, so you start a transaction (you have to when running
CREATE_REPLICATION_SLOT with the USE_SNAPSHOT parameter). And while the slot
is being created the config is reloaded. And since you are now in a
transaction, the tsearch hook for GUC processing tries to access catalogs,
which sets the xmin for the transaction.

That's not good, but I can't really say I have an idea of what to do
about it, other than setting some kind of flag saying that a logical decoding
snapshot is being built and using that to skip catalog access, which does
not seem very pretty.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Petr Jelinek
On 07/10/17 04:19, Tom Lane wrote:
> Andres Freund  writes:
>> On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
>>> The article in question is here:
>>> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/
> 
>> That's interesting.
> 
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in themselves.
> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?
> 
> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)
> 

What it actually does is specifically skip the processing for X=X (a
constant expression will be simplified separately by
estimate_expression_value/eval_const_expressions). There is a comment there
that specifically states that it's not worth processing this, as it's a rare
clause which is equivalent to X IS NOT NULL.

I don't actually agree with the argument in that comment, since in
practice, if the "silly" equality is not there we'll just waste an equal()
call, and if it is there the optimization seems worth it, as it will lead
to orders-of-magnitude better estimation in many cases.
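
For illustration, this is the kind of estimation difference at stake (the
table and column are made up, and exact numbers will of course vary):

-- Hypothetical table where most rows have a NULL nickname.
-- Today the first clause is kept as an ordinary restriction and estimated
-- like a generic equality, whereas the second can use the column's null
-- fraction from pg_statistic; with a strict "=" both filter the same rows.
EXPLAIN SELECT * FROM people WHERE nickname = nickname;
EXPLAIN SELECT * FROM people WHERE nickname IS NOT NULL;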

So I wrote a prototype of this optimization, and it seems to be really
quite simple code-wise (see attached). I have only done limited manual
testing of it, but I don't see any negative impact on planning time.

Thoughts?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From a7d6f3fe0a5d42a96c711cc33e6962b9a2f6511f Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 7 Oct 2017 15:10:54 +0200
Subject: [PATCH] Transform X=X expressions into X IS NOT NULL

---
 src/backend/optimizer/path/equivclass.c | 12 
 src/backend/optimizer/plan/initsplan.c  | 11 +++
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7997f50c18..d4ffdf66f2 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -154,18 +154,6 @@ process_equivalence(PlannerInfo *root, RestrictInfo *restrictinfo,
 	   collation);
 
 	/*
-	 * Reject clauses of the form X=X.  These are not as redundant as they
-	 * might seem at first glance: assuming the operator is strict, this is
-	 * really an expensive way to write X IS NOT NULL.  So we must not risk
-	 * just losing the clause, which would be possible if there is already a
-	 * single-element EquivalenceClass containing X.  The case is not common
-	 * enough to be worth contorting the EC machinery for, so just reject the
-	 * clause and let it be processed as a normal restriction clause.
-	 */
-	if (equal(item1, item2))
-		return false;			/* X=X is not a useful equivalence */
-
-	/*
 	 * If below outer join, check for strictness, else reject.
 	 */
 	if (below_outer_join)
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 9931dddba4..1c3672a978 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -2347,6 +2347,17 @@ process_implied_equality(PlannerInfo *root,
 return;
 		}
 	}
+	else if (equal(item1, item2))
+	{
+		NullTest	   *ntest;
+		ntest = makeNode(NullTest);
+		ntest->arg = item1;
+		ntest->nulltesttype = IS_NOT_NULL;
+		ntest->argisrow = false;
+		ntest->location = exprLocation((Node *) clause);
+
+		clause = (Expr *) ntest;
+	}
 
 	/*
 	 * Push the new clause into all the appropriate restrictinfo lists.
-- 
2.11.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] separate serial_schedule useful?

2017-10-07 Thread Robert Haas
On Fri, Oct 6, 2017 at 4:16 PM, Tom Lane  wrote:
> The other routine mistake, which I see Robert just made again,
> is to break the at-most-twenty-parallel-tests-at-once convention.
> I wonder if we can get in some sort of automated check for that.

Argh.  We can argue about whether that's my mistake or Ashutosh's
mistake, but I do try to catch these things.  It's just that there are
so many rules that require a committer to (a) know the rule and (b)
remember to enforce the rule that it's really easy to miss one.  And I
do know that rule, but it slipped my mind in the course of trying to
make sure that we'd covered all the bases in terms of the feature
itself.

There's no reason why pg_regress couldn't have a
--bail-if-group-size-exceeds=N argument, or why we couldn't have a
separate Perl script to validate the schedule file as part of the
build process.

I feel like the need to manually enforce so many tedious coding rules
is a real limiting factor on our ability to (a) involve new people in
the project and (b) get their work committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-10-07 Thread amul sul
On Fri, Oct 6, 2017 at 5:35 PM, Jesper Pedersen
 wrote:
> Hi Amul,
>
> Could you rebase on latest master ?
>

Sure, I will post that soon, but before that I need to test hash partitioning
with the recent partition-wise join commit (f49842d1ee), thanks.

Regards,
Amul


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Predicate Locks for writes?

2017-10-07 Thread Simon Riggs
SERIALIZABLE looks for chains of rw cases.

When we perform UPDATEs and DELETEs we search for rows and then modify
them. The current implementation views that as a read followed by a
write because we issue PredicateLockTuple() during the index fetch.
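
To make that concrete, a minimal hypothetical example (table and values are
made up) of the kind of statement in question:

-- Under SERIALIZABLE, this UPDATE locates the row through the primary-key
-- index and then modifies it.  The fetch currently calls
-- PredicateLockTuple(), so the statement counts as a read followed by a
-- write even though its only purpose is the write.
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;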

Is it correct that a statement that only changes data will add
predicate locks for the rows that it modifies?

PredicateLockTuple() specifically avoids adding an SIRead lock if the
tuple already has a write lock on it, so surely it must also be
correct to skip the SIRead lock if we are just about to update the
row?

I am suggesting that we ignore PredicateLockTuple() for cases where we
are about to update or delete the found tuple.

ISTM that a Before Row Trigger would need to add an SIRead lock since
that is clearly a read.

Thoughts?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel worker (PID ) exited with exit code 1

2017-10-07 Thread Andreas Seltenreich
Michael Paquier writes:

> On Fri, Oct 6, 2017 at 9:19 PM, tushar  wrote:
>> ERROR:  recovery is not in progress
>
> Perhaps there is a way to blacklist some functions depending on the
> server context. This question may be better asked directly where the
> project is maintained then: https://github.com/anse1/sqlsmith. I am
> adding as well Andreas in CC, he works on sqlsmith.

Blacklisting when testing with sqlsmith typically happens on the error
logging side: Logging into a database via --log-to with the schema
shipped with sqlsmith filters out boring error messages with triggers.
The ERROR reported is actually in the filter tables that ship with
sqlsmith.

I don't think adding such product-specific filtering on the generating
side is necessarily the right thing, as it would reduce code coverage
while testing.  We might miss a core dump when something becomes buggy
in that path.

I did make an exception for syntax errors: sqlsmith notices when grammar
productions consistently lead to errors. This feature was added to allow
testing other products or older postgres, so sqlsmith doesn't waste time
generating upsert statements on 9.1.  It's more apparent when testing
sqlite3: The error rate is 90% on startup, but shrinks to 2% or so after
a couple thousand statements have been generated.

regards,
Andreas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On markers of changed data

2017-10-07 Thread Alvaro Herrera
Michael Paquier wrote:

> That’s actually what pg_rman is doing for what it calls incremental
> backups (perhaps that would be differential backup in PG
> terminology?), and the performance is bad as you can imagine. We could
> have a dedicated LSN map to do such things with 4 bytes per page. I am
> still not convinced that this much facility and the potential bug
> risks are worth it though, Postgres already knows about differential
> backups if you shape it as a delta of WAL segments. I think that, in
> order to find a LSN map more convincing, we should find first other
> use cases where it could become useful. Some use cases may pop up with
> VACUUM, but I have not studied the question hard enough...

The case I've discussed with barman developers is a large database
(couple dozen of TBs should be enough) where a large fraction (say 95%)
is read-only but there are many changes to the active part of the data,
so that WAL is more massive than size of active data.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-07 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Fri, Oct 6, 2017 at 2:09 PM, Wong, Yi Wen  wrote:
> > Yesterday, I've been spending time with pg_visibility on the pages when I 
> > reproduce the issue in 9.6.
> > None of the all-frozen or all-visible bits are necessarily set in 
> > problematic pages.
> 
> Since this happened yesterday, I assume it was with an unfixed version?
> 
> As you must have seen, Alvaro said he has a variant of Dan's original
> script that demonstrates that a problem remains, at least on 9.6+,
> even with today's fix. I think it's the stress-test that plays with
> fillfactor, many clients, etc [1].

I just execute setup.sql once and then run this shell command,

while :; do
psql -e -P pager=off -f ./repro.sql
for i in `seq 1 5`; do
psql -P pager=off -e --no-psqlrc -f ./lock.sql &
done
wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql 
psql -P pager=off -e --no-psqlrc -f ./report.sql
echo "done"
done

Note that you need to use pg10's psql because of the \if lines in
lock.sql.  For some runs I change the values that random() is compared to, and
originally the commented-out section in lock.sql was not commented out,
but I'm fairly sure the failures I saw were with this version.  Also, I
sometimes change the 5 in the `seq` command to higher values (180, 250).

I didn't find the filler column to have any effect, so I took that out.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
begin;

values ('for key share') /*, ('') */ order by random() limit 1 \gset

select pg_sleep(random() * 0.03);
select id from t where id=3 :column1 ;

select random() > .0 as update \gset
\if :update
select pg_sleep(random() * 0.03);
update t set x=x+1 where id=3;
\endif

/*
select random() > .0 as update2 \gset
\if :update2
savepoint foo;
update t set x=x+1 where id=3;
\endif
select random() > .5 as rollback_to \gset
\if :rollback_to
rollback to foo;
\endif
*/

select random() > .0 as commit \gset

\if :commit
   commit;
\else
   rollback;
\endif

select pg_sleep(random() * 0.03);
vacuum freeze t;

reindex index t_pkey;


with
  pages (blkno) as (
    select generate_series::int
    from generate_series(0,
         pg_relation_size('t') / current_setting('block_size')::int - 1)),
  rawpages (blkno, pg) as (
    select blkno, get_raw_page
    from pages, get_raw_page('t', blkno)),
  heapitems as (
    select blkno, heap_page_items.*
    from rawpages, heap_page_items(pg))
select
  blkno, lp, lp_flags, lp_off, t_xmin, t_xmax, t_ctid,
  infomask(t_infomask, 1) as infomask, infomask(t_infomask2, 2) as infomask2
from heapitems
where lp_off <> 0;

drop table t;
create table t (id int primary key, name char(3), x integer)
-- with (fillfactor = 10)
;

insert into t values (1, '111', 0);
insert into t values (3, '333', 0);
create type infomask_bit_desc as (mask varbit, symbol text);
create or replace function infomask(msk int, which int) returns text
language plpgsql as $$
declare
r infomask_bit_desc;
str text = '';
append_bar bool = false;
begin
for r in select * from infomask_bits(which) loop
if (msk::bit(16) & r.mask)::int <> 0 then
if append_bar then
str = str || '|';
end if;
append_bar = true;
str = str || r.symbol;
end if;
end loop;
return str;
end;
$$ ;
create or replace function infomask_bits(which int)
returns setof infomask_bit_desc
language plpgsql as $$
begin
if which = 1 then
return query values
(x'8000'::varbit, 'MOVED_IN'),
(x'4000', 'MOVED_OFF'),
(x'2000', 'UPDATED'),
(x'1000', 'XMAX_IS_MULTI'),
(x'0800', 'XMAX_INVALID'),
(x'0400', 'XMAX_COMMITTED'),
(x'0200', 'XMIN_INVALID'),
(x'0100', 'XMIN_COMMITTED'),
(x'0080', 'XMAX_LOCK_ONLY'),
(x'0040', 'EXCL_LOCK'),
(x'0020', 'COMBOCID'),
(x'0010', 'XMAX_KEYSHR_LOCK'),
(x'0008', 'HASOID'),
(x'0004', 'HASEXTERNAL'),
(x'0002', 'HASVARWIDTH'),
(x'0001', 'HASNULL');
elsif which = 2 then
return query values
(x'2000'::varbit, 'UPDATE_KEY_REVOKED'),
(x'4000', 'HOT_UPDATED'),
(x'8000', 'HEAP_ONLY_TUPLE');
end if;
end;
$$;

create extension if not exists pageinspect;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On markers of changed data

2017-10-07 Thread Michael Paquier


> On 6 Oct 2017, at 23:44, Alvaro Herrera  wrote:
> 
> Michael Paquier wrote:
> 
>> The only sane method for Postgres is really to scan the
>> page header LSNs, and of course you already know that.
> 
> I hope the idea is not to have to scan every single page in the
> database, because that would be too slow.  It should be possible to
> build this so that a single summary LSN is kept for a largish group of
> pages, allowing large portions of the database to be skipped from even
> being read if they are known not to contain any page newer than the
> previous backup.

That’s actually what pg_rman is doing for what it calls incremental backups 
(perhaps that would be differential backup in PG terminology?), and the 
performance is bad as you can imagine. We could have a dedicated LSN map to do 
such things with 4 bytes per page. I am still not convinced that this much 
facility and the potential bug risks are worth it though, Postgres already 
knows about differential backups if you shape it as a delta of WAL segments. I 
think that, in order to find a LSN map more convincing, we should find first 
other use cases where it could become useful. Some use cases may pop up with 
VACUUM, but I have not studied the question hard enough...
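
For reference, the sort of per-page scan being described can be sketched with
pageinspect (illustrative only; the table name and reference LSN are made up):

-- List pages of table "t" whose header LSN is newer than the LSN recorded at
-- the time of the previous backup.  This has to read every page of the
-- relation, which is exactly the cost being discussed.
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT blkno,
       (page_header(get_raw_page('t', blkno::int))).lsn AS page_lsn
FROM generate_series(0,
       pg_relation_size('t') / current_setting('block_size')::int - 1) AS blkno
WHERE (page_header(get_raw_page('t', blkno::int))).lsn > '0/3000000'::pg_lsn;
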
--
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers