Re: Memory leaks in BufFileOpenShared()

2018-06-15 Thread Tatsuo Ishii
>> Thanks for finding these accidental duplications, and to Ishii-san for
>> committing the fix.  Oops.
>> 
>> +1 for this refactoring.
> 
> Thanks for confirming. I will go ahead and commit/push the patch.

Done.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: WAL prefetch

2018-06-15 Thread Amit Kapila
On Sat, Jun 16, 2018 at 10:47 AM, Konstantin Knizhnik
 wrote:
>
>
> On 16.06.2018 06:33, Amit Kapila wrote:
>>
>> On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund 
>> wrote:
>>>
>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:


 On 14.06.2018 09:52, Thomas Munro wrote:
>
> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>  wrote:
>>
>> The pg_wal_prefetch function will infinitely traverse the WAL and prefetch
>> the block references in WAL records using the posix_fadvise(WILLNEED)
>> system call.
>
> Hi Konstantin,
>
> Why stop at the page cache...  what about shared buffers?
>
 It is a good question. I thought a lot about prefetching directly to
 shared buffers.
>>>
>>> I think that's definitely how this should work.  I'm pretty strongly
>>> opposed to a prefetching implementation that doesn't read into s_b.
>>>
>> We can think of supporting two modes: (a) read into shared buffers, or
>> (b) read into the OS page cache.
>>
> Unfortunately, I'm afraid that (a) requires a different approach: unlike
> posix_fadvise, reading data into shared buffers is a blocking operation. If we
> do it with one worker, it will read at the same speed as the redo
> process. So to make prefetch really efficient, in this case we have to
> spawn multiple workers to perform the prefetch in parallel (as pg_prefaulter
> does).
>
> Another concern of mine about prefetching into shared buffers is that it may
> flush from the cache pages that are most frequently used by read-only
> queries at the hot standby replica.
>

Okay, but I am suggesting making it optional so that it can be
enabled when helpful (say, when the user has enough shared buffers to
hold the data).


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL prefetch

2018-06-15 Thread Konstantin Knizhnik




On 16.06.2018 06:33, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund  wrote:

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:


On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
 wrote:

The pg_wal_prefetch function will infinitely traverse the WAL and prefetch the
block references in WAL records using the posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache...  what about shared buffers?


It is a good question. I thought a lot about prefetching directly to shared
buffers.

I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.


We can think of supporting two modes: (a) read into shared buffers, or
(b) read into the OS page cache.

Unfortunately, I'm afraid that (a) requires a different approach: unlike
posix_fadvise, reading data into shared buffers is a blocking operation. If
we do it with one worker, it will read at the same speed as the redo
process. So to make prefetch really efficient, in this case we have to
spawn multiple workers to perform the prefetch in parallel (as pg_prefaulter
does).


Another concern of mine about prefetching into shared buffers is that it may
flush from the cache pages that are most frequently used by read-only
queries at the hot standby replica.
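
For reference, the kind of hint being discussed boils down to something like
this minimal, self-contained sketch (illustrative only, not the actual
pg_wal_prefetch code); the path and block number are made-up values:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

int
main(void)
{
	const char *path = "base/16384/16385";	/* hypothetical relation segment */
	int			fd = open(path, O_RDONLY);
	int			rc;

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* hint that block 42 will be needed soon; this call does not block */
	rc = posix_fadvise(fd, (off_t) 42 * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise failed: %d\n", rc);

	close(fd);
	return 0;
}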





Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-15 Thread Amit Kapila
On Thu, Jun 14, 2018 at 10:05 PM, Tom Lane  wrote:
> I wrote:
>> This appears to be the fault of commit ab7271677, whose authors I've cc'd:
>> the stanza starting at about allpaths.c:1672 is bullheadedly creating a
>> parallel path whether that's allowed or not.  Fixing it might be as simple
>> as adding "rel->consider_parallel" to the conditions there.
>
> Oh, and while I'm bitching: the same commit has created this exceedingly
> dangerous coding pattern in create_append_path:
>
> pathnode->subpaths = list_concat(subpaths, partial_subpaths);
>
> foreach(l, subpaths)
> {
> Path   *subpath = (Path *) lfirst(l);
>
> Because list_concat() modifies its first argument, this will usually
> have the effect that the foreach traverses the partial_subpaths as
> well as the main subpaths.  But it's very unclear to the reader whether
> that's intended or not.  Worse, if we had *only* partial subpaths so
> that subpaths was initially NIL, then the loop would fail to traverse
> the partial subpaths, resulting in inconsistent behavior.
>
> It looks to me like traversal of the partial subpaths is the right
> thing here, in which case we should do
>
> -   foreach(l, subpaths)
> +   foreach(l, pathnode->subpaths)
>
> or perhaps better
>
> -   pathnode->subpaths = list_concat(subpaths, partial_subpaths);
> +   pathnode->subpaths = subpaths = list_concat(subpaths, 
> partial_subpaths);
>
> to make the behavior clear and consistent.
>

I agree with your analysis and proposed change.  However, I think in
practice it might not lead to any bug: in the loop we are computing
parallel_safety, and partial_subpaths should be parallel_safe.
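
To spell the hazard out concretely, here is a minimal sketch (not the
create_append_path code itself), using the in-place list_concat() semantics
of this era; the function and variable names are illustrative:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/relation.h"

static void
append_example(AppendPath *pathnode, List *subpaths, List *partial_subpaths)
{
	ListCell   *l;

	pathnode->subpaths = list_concat(subpaths, partial_subpaths);

	/*
	 * If subpaths was NIL, the caller's variable is not updated, so this
	 * loop iterates zero times and silently skips the partial subpaths
	 * that were just attached above.
	 */
	foreach(l, subpaths)
	{
		Path	   *subpath = (Path *) lfirst(l);

		(void) subpath;			/* ... examine subpath ... */
	}

	/* Safer: iterate over the concatenated list itself. */
	foreach(l, pathnode->subpaths)
	{
		/* ... */
	}
}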

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-15 Thread Amit Kapila
On Thu, Jun 14, 2018 at 9:54 PM, Tom Lane  wrote:
> Rajkumar Raghuwanshi  writes:
>> I am getting a server crash for below test case.
>
>> postgres=# CREATE TABLE test( c1 int, c2 int, c3 text) partition by
>> range(c1);
>> CREATE TABLE
>> postgres=# create table test_p1 partition of test for values from
>> (minvalue) to (0);
>> CREATE TABLE
>> postgres=# create table test_p2 partition of test for values from (0) to
>> (maxvalue);
>> CREATE TABLE
>> postgres=#
>> postgres=# select (select max((select t1.c2 from test t1 where t1.c1 =
>> t2.c1))) from test t2;
>> server closed the connection unexpectedly
>
> Reproduced here.  The assert seems to be happening because
> add_paths_to_append_rel is trying to create a parallel path for
> an appendrel that is marked consider_parallel = false.
>
> This appears to be the fault of commit ab7271677, whose authors I've cc'd:
> the stanza starting at about allpaths.c:1672 is bullheadedly creating a
> parallel path whether that's allowed or not.  Fixing it might be as simple
> as adding "rel->consider_parallel" to the conditions there.
>

Yeah, or perhaps disallow creation of any partial paths in the first
place, as in the attached.  This will save us some work as well.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_pa_path_generation_v1.patch
Description: Binary data


Re: WAL prefetch

2018-06-15 Thread Konstantin Knizhnik




On 16.06.2018 06:30, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
 wrote:

On 15.06.2018 18:03, Amit Kapila wrote:

wal_prefetch is prefetching blocks referenced by WAL records. But in the case of
"full page writes" such prefetch is not needed and is even harmful.


Okay, IIUC, the basic idea is to prefetch recently modified data
pages so that they can be referenced.  If so, isn't there some
overlap with the autoprewarm functionality, which dumps recently modified
blocks and can then prefetch them on recovery?

Sorry, I do not see any intersection with the autoprewarm functionality:
WAL prefetch is performed at the replica, where the data has not yet been
modified; actually, the goal of WAL prefetch is to make this update more
efficient. WAL prefetch can also be done at a standalone server to speed up
recovery after a crash, but that seems to be a much more exotic use case.





Re: [HACKERS] Statement-level rollback

2018-06-15 Thread Robert Haas
On Fri, Jun 15, 2018 at 4:23 PM, Alvaro Herrera
 wrote:
> I've been looking at re-implementing this feature recently, using
> Tsunakawa's proposed UI of a GUC transaction_rollback_scope that can
> take values "transaction" (default, current behavior) and "statement".
> I didn't take other parts of his patch though; see below.
>
> I think the main objectionable point is that of making servers behave in
> a way that could lose data, if applications assume that transactions
> behave in the way they do today.  I propose that we solve this by
> allowing this feature to be enabled only via one of:
>
> * a PGOPTIONS connection-time option
> * ALTER USER SET (transaction_rollback_scope)
>
> but it can be *disabled* normally via SET.  In other words, changing the
> scope from transaction to statement in a running session is forbidden,
> but changing it the other way around is allowed (if app is unsure
> whether env is unsafe, it can set the scope to "transaction" to ensure
> it's safe from that point onwards).  Changing the scope in
> postgresql.conf is forbidden, so a server is never unsafe as a whole.

I'm not sure that really solves the problem, because changing the GUC
in either direction causes the system to behave differently.  I don't
see any particular reason to believe that changing the behavior from A
to B is any more or less likely to break applications than a change
from B to A.

I put this feature, which - in the interest of full disclosure - is
already implemented in Advanced Server and has been for many years,
more or less in the same category as a hypothetical GUC that changes
case-folding from lower case to upper case.  That is, it's really nice
for compatibility, but you can't get around the fact that every bit of
software that is supposed to run on all PostgreSQL instances has to be
tested in all possible modes, because otherwise you might find that it
doesn't work in all of those modes, or doesn't work as expected.  It
is a behavior-changing GUC par excellence.

Some people are going to love that, and some people are going to hate
it, but I don't think adding some funny GUC mode that only allows it
to be changed in one direction actually makes any difference.  Other
people may, of course, disagree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-15 Thread Robert Haas
On Thu, Jun 14, 2018 at 9:45 AM, Ashutosh Bapat
 wrote:
>> It sounds like you want to try to hide from users the fact that they
>> can create triggers on the individual partitions.
>
> No. I never said that in my mails (see [1], [2]). I object to the
> explicit suggestion that they can/should create BEFORE ROW triggers on
> partitions, since those are not supported on the partitioned table.
>
>>  I think that's the
>> wrong approach.  First of all, it's not going to work, because people
>> are going to find out about it and do it anyway.  Secondly, the
>> documentation's job is to tell you about the system's capabilities,
>> not conceal them from you.
>
> Neither is it the documentation's job to "suggest" something that can lead to
> problems in the future.

Well, perhaps what this boils down to is that I don't agree that it
can lead to problems in the future.  If the user creates a trigger on
each partition, whether they are all the same or some are different
from others, they can stick with that configuration in future
releases, too.  I don't think we're going to remove the ability to
have separate triggers on each partition; we're just going to add, as
an additional possibility, the ability to have a trigger on the
partitioned table that cascades to the individual partitions.  It's
true that if people want to get the benefit of that feature, they'll
have to change the configuration, but so what?  That's true of many
new features, and I don't think it's right to characterize that as a
problem.

By that logic, we should not have suggested that anyone use table
inheritance, because they would have to change their configuration
when we implemented table partitioning.  Indeed, switching from table
inheritance to table partitioning is much more intrusive than dropping
triggers on individual partitions and adding one on the partitioned
table.  The latter only requires DDL on existing objects, but the
former requires creating entirely new objects and moving all of your
data.

I think that the existing wording is fine and there's really no need
to change anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL prefetch

2018-06-15 Thread Amit Kapila
On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund  wrote:
> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>
>>
>> On 14.06.2018 09:52, Thomas Munro wrote:
>> > On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>> >  wrote:
>> > > The pg_wal_prefetch function will infinitely traverse the WAL and prefetch
>> > > the block references in WAL records using the posix_fadvise(WILLNEED)
>> > > system call.
>> > Hi Konstantin,
>> >
>> > Why stop at the page cache...  what about shared buffers?
>> >
>>
>> It is a good question. I thought a lot about prefetching directly to shared
>> buffers.
>
> I think that's definitely how this should work.  I'm pretty strongly
> opposed to a prefetching implementation that doesn't read into s_b.
>

We can think of supporting two modes: (a) read into shared buffers, or
(b) read into the OS page cache.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL prefetch

2018-06-15 Thread Amit Kapila
On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
 wrote:
>
> On 15.06.2018 18:03, Amit Kapila wrote:
>
wal_prefetch is prefetching blocks referenced by WAL records. But in the case of
"full page writes" such prefetch is not needed and is even harmful.
>

Okay, IIUC, the basic idea is to prefetch recently modified data
pages so that they can be referenced.  If so, isn't there some
overlap with the autoprewarm functionality, which dumps recently modified
blocks and can then prefetch them on recovery?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Statement-level rollback

2018-06-15 Thread MauMau
From: Alvaro Herrera
> I've been looking at re-implementing this feature recently, using
> Tsunakawa's proposed UI of a GUC transaction_rollback_scope that can
> take values "transaction" (default, current behavior) and "statement".
> I didn't take other parts of his patch though; see below.

Thank you so much for reviving this thread!


> I propose that we solve this by
> allowing this feature to be enabled only via one of:
>
> * a PGOPTIONS connection-time option
> * ALTER USER SET (transaction_rollback_scope)

Why don't we also allow ALTER DATABASE SET for a database exclusively
for data migrated from another DBMS?


> but it can be *disabled* normally via SET.  In other words, changing the
> scope from transaction to statement in a running session is forbidden,
> but changing it the other way around is allowed (if app is unsure
> whether env is unsafe, it can set the scope to "transaction" to ensure
> it's safe from that point onwards).  Changing the scope in
> postgresql.conf is forbidden, so a server is never unsafe as a whole.

Would it be dangerous to allow both enabling and disabling the
statement-level rollback only outside a transaction block?  I thought
it was considered dangerous to change the setting inside a transaction
block.


> Drivers such as JDBC can easily use this mode, for example a connection
> option such as "AUTOSAVE=SERVER" can automatically add the
> transaction_rollback_scope option.  (Naturally, if the server does not
> support transaction_rollback_scope and the user gave that option, this
> causes an exception to be raised -- NOT to fallback to the standard
> transaction behavior!)

How do the drivers know, from the server error response to connection
request, that transaction_rollback_scope is unsupported?


> Tsunakawa's implementation puts the feature in postgres.c's client loop.
> I think a better way to implement this is to change xact.c to have a new
> TBLOCK state which indicates when to start a new internal
> subtransaction; StartTransactionCommand pushes a new element into the
> transaction stack and puts it in the new state; a subsequent operation
> actually starts the new subtransaction.  (This design decision allows
> things like SAVEPOINT to work correctly by having the
> subtransaction-for-savepoint appear *before* the internal subtransaction,
> so a subsequent "SELECT 0/0" doesn't remove the user declared
> savepoint.)

That sounds interesting.

* How can PLs like PL/pgSQL utilize this to continue upon an SQL
failure?  They don't call StartTransactionCommand.

* How can psql make use of this feature for its ON_ERROR_ROLLBACK?


Regards
MauMau




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-15 Thread Peter Geoghegan
On Fri, Jun 15, 2018 at 2:36 PM, Robert Haas  wrote:
> Yes, retail index deletion is essential for the delete-marking
> approach that is proposed for zheap.

Makes sense.

I don't know that much about zheap. I'm sure that retail index tuple
deletion is really important in general, though. The Gray & Reuter
book treats unique keys as a basic assumption, as do other
authoritative reference works and papers. Other database systems
probably make unique indexes simply use the user-visible attributes as
unique values, but appending heap TID as a unique-ifier is probably a
reasonably common design for secondary indexes (it would also be nice
if we could simply not store duplicates for unique indexes, rather
than using heap TID). I generally have a very high opinion of the
nbtree code, but this seems like a problem that ought to be fixed.
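
As a minimal sketch of the tie-breaker idea (not the actual nbtree code),
with a single int4 key column standing in for the real key comparison:

#include "postgres.h"
#include "storage/itemptr.h"

static int
compare_with_heap_tid(int32 key1, ItemPointer tid1,
					  int32 key2, ItemPointer tid2)
{
	if (key1 != key2)
		return (key1 < key2) ? -1 : 1;

	/* keys are equal: the heap TID acts as an implicit trailing key column */
	return ItemPointerCompare(tid1, tid2);
}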

I've convinced myself that I basically have the right idea with this
patch, because the classic L&Y invariants have all been tested with an
enhanced amcheck run against all indexes in a regression test
database. There was other stress-testing, too. The remaining problems
are fixable, but I need some guidance.

> It could also be extremely useful in some workloads with the regular
> heap.  If the indexes are large -- say, 100GB -- and the number of
> tuples that vacuum needs to kill is small -- say, 5 -- scanning them
> all to remove the references to those tuples is really inefficient.
> If we had retail index deletion, then we could make a cost-based
> decision about which approach to use in a particular case.

I remember talking to Andres about this in a bar 3 years ago. I can
imagine variations of pruning that do some amount of this when there
are lots of duplicates. Perhaps something like InnoDB's purge threads,
which do things like in-place deletes of secondary indexes after an
updating (or deleting) xact commits. I believe that that mechanism
targets secondary indexes specifically, and that it operates quite
eagerly.

> Can you enumerate some of the technical obstacles that you see?

The #1 technical obstacle is that I simply don't know where I should
try to take this patch, given that it probably needs to be tied to
some much bigger project, such as zheap. I have an open mind, though,
and intend to help if I can. I'm not really sure what the #2 and #3
problems are, because I'd need to be able to see a few steps ahead to
be sure. Maybe #2 is that I'm doing something wonky to avoid breaking
duplicate checking for unique indexes. (The way that unique duplicate
checking has always worked [1] is perhaps questionable, though.)

> I think it would be helpful if you could talk more about these
> regressions (and the wins).

I think that the performance regressions are due to the fact that when
you have a huge number of duplicates today, it's useful to be able to
claim space to fit further duplicates from almost any of the multiple
leaf pages that contain or have contained duplicates. I'd hoped that
the increased temporal locality that the patch gets would more than
make up for that. As far as I can tell, the problem is that temporal
locality doesn't help enough. I saw that performance was somewhat
improved with extreme Zipf distribution contention, but it went the
other way with less extreme contention. The details are not that fresh
in my mind, since I shelved this patch for a while following limited
performance testing.

The code could certainly use more performance testing, and more
general polishing. I'm not strongly motivated to do that right now,
because I don't quite see a clear path to making this patch useful.
But, as I said, I have an open mind about what the next step should
be.

[1] 
https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement
-- 
Peter Geoghegan



pg_upgrade from 9.4 to 10.4

2018-06-15 Thread Vimalraj A
Hi,

After pg_upgrade, the data in Postgres is corrupted.

1. Postgres 9.4, stop db with "immediate" mode
2. Run pg_upgrade, to upgrade to Postgres 10.4
3. Start Postgres 10 and run VACUUM FULL; got a bunch of "concurrent insert
in progress" errors. Looks like the data is corrupted. (Loading the old data
back on Postgres 9.4 works just fine.)

But if I stop the 9.4 DB with "smart" mode, the data after pg_upgrade looks
fine.

pg_upgrade doesn't stop or throw warnings if the upgraded db gets into a
corrupted state.


I would like to understand why this happens.
1. What transient state corrupts the db?
2. Is it a known issue with pg_upgrade?
3. Is there a way to recover the data with pg_upgrade after an "immediate" mode
stop of the previous version?

Thank you.

Regards,
Vimal.


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-15 Thread Robert Haas
On Thu, Jun 14, 2018 at 2:44 PM, Peter Geoghegan  wrote:
> I've been thinking about using heap TID as a tie-breaker when
> comparing B-Tree index tuples for a while now [1]. I'd like to make
> all tuples at the leaf level unique, as assumed by L&Y.  This can
> enable "retail index tuple deletion", which I think we'll probably end
> up implementing in some form or another, possibly as part of the zheap
> project. It's also possible that this work will facilitate GIN-style
> deduplication based on run length encoding of TIDs, or storing
> versioned heap TIDs in an out-of-line nbtree-versioning structure
> (unique indexes only). I can see many possibilities, but we have to
> start somewhere.

Yes, retail index deletion is essential for the delete-marking
approach that is proposed for zheap.

It could also be extremely useful in some workloads with the regular
heap.  If the indexes are large -- say, 100GB -- and the number of
tuples that vacuum needs to kill is small -- say, 5 -- scanning them
all to remove the references to those tuples is really inefficient.
If we had retail index deletion, then we could make a cost-based
decision about which approach to use in a particular case.

> mind now, while it's still swapped into my head. I won't do any
> serious work on this project unless and until I see a way to implement
> retail index tuple deletion, which seems like a multi-year project
> that requires the buy-in of multiple senior community members.

Can you enumerate some of the technical obstacles that you see?

> On its
> own, my patch regresses performance unacceptably in some workloads,
> probably due to interactions with kill_prior_tuple()/LP_DEAD hint
> setting, and interactions with page space management when there are
> many "duplicates" (it can still help performance in some pgbench
> workloads with non-unique indexes, though).

I think it would be helpful if you could talk more about these
regressions (and the wins).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: SCRAM with channel binding downgrade attack

2018-06-15 Thread Robert Haas
On Thu, Jun 14, 2018 at 7:43 AM, Magnus Hagander  wrote:
> I still think that the fact that we are still discussing what is basically
> the *basic concepts* of how this would be set up after we have released
> beta1 is a clear sign that this should not go into 11.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Statement-level rollback

2018-06-15 Thread Alvaro Herrera
On 2017-Nov-06, Tsunakawa, Takayuki wrote:

> From: Simon Riggs
> > A backend-based solution is required for PL procedures and functions.
> > 
> > We could put this as an option into PL/pgSQL, but it seems like it is
> > a function of the transaction manager rather than the driver.
> 
> Exactly.  Thanks.

I've been looking at re-implementing this feature recently, using
Tsunakawa's proposed UI of a GUC transaction_rollback_scope that can
take values "transaction" (default, current behavior) and "statement".
I didn't take other parts of his patch though; see below.

I think the main objectionable point is that of making servers behave in
a way that could lose data, if applications assume that transactions
behave in the way they do today.  I propose that we solve this by
allowing this feature to be enabled only via one of:

* a PGOPTIONS connection-time option
* ALTER USER SET (transaction_rollback_scope)

but it can be *disabled* normally via SET.  In other words, changing the
scope from transaction to statement in a running session is forbidden,
but changing it the other way around is allowed (if app is unsure
whether env is unsafe, it can set the scope to "transaction" to ensure
it's safe from that point onwards).  Changing the scope in
postgresql.conf is forbidden, so a server is never unsafe as a whole.

Drivers such as JDBC can easily use this mode, for example a connection
option such as "AUTOSAVE=SERVER" can automatically add the
transaction_rollback_scope option.  (Naturally, if the server does not
support transaction_rollback_scope and the user gave that option, this
causes an exception to be raised -- NOT to fallback to the standard
transaction behavior!)


Tsunakawa's implementation puts the feature in postgres.c's client loop.
I think a better way to implement this is to change xact.c to have a new
TBLOCK state which indicates when to start a new internal
subtransaction; StartTransactionCommand pushes a new element into the
transaction stack and puts it in the new state; a subsequent operation
actually starts the new subtransaction.  (This design decision allows
things like SAVEPOINT to work correctly by having the
subtransaction-for-savepoint appear *before* the internal subtransaction,
so a subsequent "SELECT 0/0" doesn't remove the user declared
savepoint.)

I have a PoC implementation that's slightly different: it adds more code
to a few xact.c low-level routines (StartTransactionCommand creates the
internal savepoint itself). It's not as nice because SAVEPOINT has to
close the internal subtransaction, then create the savepoint, then
create the internal subtransaction again.  And it doesn't handle
RELEASE.  But as a PoC it's quite nice.  I measured the performance
overhead to be about a 2%-3% loss compared to the current mode, which seems
acceptable.

I would like to hear opinions on whether the protections I propose are
sufficient to appease the objections.  In my opinion they are, and we
should press forward with this, which seems to be one of the frequently
requested features from people porting from other DBMSs.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-15 Thread David Gershuni


> On Jun 13, 2018, at 12:53 PM, Jeff Davis  wrote:
> 
>> 
>> An adaptive hash agg node would start as a hash agg, and turn into a
>> "partial hash agg + sort + final group agg" when OOM is detected.
>> 
>> The benefit over ordinary sort+group agg is that the sort is
>> happening
>> on a potentially much smaller data set. When the buffer is large
>> enough to capture enough key repetitions, the output of the partial
>> hash agg can be orders of magnitude smaller than the scan.
>> 
>> My proposal was to use this for all group aggs, not only when the
>> planner chooses a hash agg, because it can speed up the sort and
>> reduce temp storage considerably. I guess the trick lies in
>> estimating
>> that cardinality reduction to avoid applying this when there's no
>> hope
>> of getting a payoff. The overhead of such a lazy hash table isn't
>> much, really. But yes, its cache locality is terrible, so it needs to
>> be considered.
> 
> I think this is a promising approach because it means the planner has
> one less decion to make. And the planner doesn't have great information
> to make its decision on, anyway (ndistinct estimates are hard enough on
> plain tables, and all bets are off a few levels up in the plan).
> 
> Unfortunately, it means that either all data types must support hashing
> and sorting[1], or we need to have a fallback path that can get by with
> hashing alone (which would not be tested nearly as well as more typical
> paths).

Jeff, I agree with you, but we should also take grouping sets into consideration
because they can cause the executor to create multiple hash tables in memory
simultaneously, each growing at a different rate. Not only does this make
eviction more complicated, but it actually prevents your original approach from
working because it violates the assumption that all entries left in the hash
table at the end of a phase are complete and can be flushed to output.

For example, imagine a tuple that belongs to a group G in grouping set A had
to be spilled because it also belonged to an evicted group from grouping set
B. Then group G would remain in set A’s hash table at the end of the phase,
but it wouldn’t have aggregated the values from the spilled tuple. Of course,
work-arounds are possible, but the current approach would not work as is.

Claudio’s proposal mostly solves this problem. In fact, his proposal is very
similar to the HashSort algorithm from [1] that I mentioned. We would still
need to think about how to choose which hash table to evict from. For
example, we could evict a group from the largest hash table (if we tracked
memory usage independently for each one).
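
A minimal, self-contained sketch of that eviction choice (hypothetical types,
not executor code): track per-table memory and pick the largest table as the
victim.

#include <stddef.h>

typedef struct SetHashTable
{
	int		setno;			/* which grouping set this table belongs to */
	size_t	mem_used;		/* independently tracked memory usage */
} SetHashTable;

static SetHashTable *
choose_eviction_victim(SetHashTable *tables, int ntables)
{
	SetHashTable *victim = &tables[0];

	for (int i = 1; i < ntables; i++)
	{
		if (tables[i].mem_used > victim->mem_used)
			victim = &tables[i];
	}
	return victim;
}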

But as you mentioned, this solution relies on all grouping keys being
sortable, so we would need a fallback plan. For this, we could use your
hash-based approach, but we would have to make modifications. One idea would
be to split each grouping set into its own aggregate phase, so that only one
hash table is in memory at a time. This would eliminate the performance
benefits of grouping sets when keys aren’t sortable, but at least they would
still work.

Best,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf





Re: missing toast table for pg_policy

2018-06-15 Thread Joe Conway
On 06/15/2018 02:40 PM, John Naylor wrote:
> On 2/19/18, Joe Conway  wrote:
>> The attached does exactly this. Gives all system tables toast tables
>> except pg_class, pg_attribute, and pg_index, and includes cat version
>> bump and regression test in misc_sanity.
>>
>> Any further discussion, comments, complaints?
> 
> Hi Joe,
> There's been a little bit-rot with duplicate OIDs and the regression
> test. The first attached patch fixes that (applies on top of yours).


Not surprising -- thanks for the update.


> It occurred to me that we could go further and create most toast
> tables automatically by taking advantage of the fact that the toast
> creation function is a no-op if there are no varlena attributes. The
> second patch (applies on top of the first) demonstrates a setup where
> only shared and bootstrap catalogs need to have toast declarations
> specified manually with fixed OIDs. It's possible this way is more
> fragile, though.

Hmmm, I'll have a look.


Thanks!

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: pgsql: Store 2PC GID in commit/abort WAL recs for logical decoding

2018-06-15 Thread Alvaro Herrera
By the way, why do we need to strlen() the target buffer when strlcpy
already reports the length?  (You could argue that there is a difference
if the string is truncated ... but surely we don't care about that case)
I propose the attached.
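
To spell out the parenthetical caveat with a tiny illustration (a sketch, not
part of the patch): strlcpy() reports the length of the *source* string, which
differs from strlen() of the destination when the copy is truncated.

#include "postgres.h"		/* supplies strlcpy() via src/port where needed */

static void
strlcpy_example(void)
{
	char		gid[8];
	size_t		reported = strlcpy(gid, "0123456789", sizeof(gid));

	/* reported == 10 (source length), but strlen(gid) == 7 (what fit) */
	Assert(reported == 10 && strlen(gid) == 7);
}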

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 714260ddb13b440db68f4e6db1f8aef0a0baba0a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 15 Jun 2018 11:48:53 -0400
Subject: [PATCH] no need for separate strlen

---
 src/backend/access/rmgrdesc/xactdesc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 6d5ebd475b..da3a296e99 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -104,8 +104,8 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
 
if (parsed->xinfo & XACT_XINFO_HAS_GID)
{
-		strlcpy(parsed->twophase_gid, data, sizeof(parsed->twophase_gid));
-		data += strlen(data) + 1;
+		data += strlcpy(parsed->twophase_gid, data,
+						sizeof(parsed->twophase_gid)) + 1;
 	}
}
 
@@ -188,8 +188,8 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
 
if (parsed->xinfo & XACT_XINFO_HAS_GID)
{
-		strlcpy(parsed->twophase_gid, data, sizeof(parsed->twophase_gid));
-		data += strlen(data) + 1;
+		data += strlcpy(parsed->twophase_gid, data,
+						sizeof(parsed->twophase_gid)) + 1;
 	}
 }
 
-- 
2.11.0



Re: pgsql: Store 2PC GID in commit/abort WAL recs for logical decoding

2018-06-15 Thread Alvaro Herrera
On 2018-Jun-14, Nikhil Sontakke wrote:

> There was a slight oversight in the twophase_gid length calculation in
> the XactLogCommitRecord() code path in the cf5a1890592 commit. The
> corresponding XactLogAbortRecord() code path was ok. PFA, a small
> patch to fix the oversight.

Forgot to add: maybe it would be useful to have tests in core where
these omissions become evident.  Do you have some?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Store 2PC GID in commit/abort WAL recs for logical decoding

2018-06-15 Thread Alvaro Herrera
On 2018-Jun-14, Nikhil Sontakke wrote:

> There was a slight oversight in the twophase_gid length calculation in
> the XactLogCommitRecord() code path in the cf5a1890592 commit. The
> corresponding XactLogAbortRecord() code path was ok. PFA, a small
> patch to fix the oversight.

Thanks, pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Procedure additions to ParseFuncOrColumn are inadequate

2018-06-15 Thread Tom Lane
I noticed that ParseFuncOrColumn isn't terribly thorough about rejecting
non-procedure results when proc_call is true.  Since the caller is just
assuming that it gets back what it expects, that leads to core dumps
or worse:

regression=# call int4(42);
server closed the connection unexpectedly
This probably means the server terminated abnormally

Also, with the existing ad-hoc approach to this, not all the ereports
are alike; the one for an aggregate lacks the HINT the others provide.

Also, in the cases where it does successfully reject should-be-procedure
or should-not-be-procedure, it reports ERRCODE_UNDEFINED_FUNCTION, which
seems quite inappropriate from here: I'd expect ERRCODE_WRONG_OBJECT_TYPE.

There's also a pre-existing oversight that when we get a result of
FUNCDETAIL_COERCION, we should reject that if there's any aggregate
decoration, but we fail to.

The attached cleans all this up; any objections?

regards, tom lane

diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index ea5d521..21ddd5b 100644
*** a/src/backend/parser/parse_func.c
--- b/src/backend/parser/parse_func.c
*** static Node *ParseComplexProjection(Pars
*** 68,73 
--- 68,76 
   *	last_srf should be a copy of pstate->p_last_srf from just before we
   *	started transforming fargs.  If the caller knows that fargs couldn't
   *	contain any SRF calls, last_srf can just be pstate->p_last_srf.
+  *
+  *	proc_call is true if we are considering a CALL statement, so that the
+  *	name must resolve to a procedure name, not anything else.
   */
  Node *
  ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
*** ParseFuncOrColumn(ParseState *pstate, Li
*** 204,210 
  	 * the "function call" could be a projection.  We also check that there
  	 * wasn't any aggregate or variadic decoration, nor an argument name.
  	 */
! 	if (nargs == 1 && agg_order == NIL && agg_filter == NULL && !agg_star &&
  		!agg_distinct && over == NULL && !func_variadic && argnames == NIL &&
  		list_length(funcname) == 1)
  	{
--- 207,214 
  	 * the "function call" could be a projection.  We also check that there
  	 * wasn't any aggregate or variadic decoration, nor an argument name.
  	 */
! 	if (nargs == 1 && !proc_call &&
! 		agg_order == NIL && agg_filter == NULL && !agg_star &&
  		!agg_distinct && over == NULL && !func_variadic && argnames == NIL &&
  		list_length(funcname) == 1)
  	{
*** ParseFuncOrColumn(ParseState *pstate, Li
*** 253,273 
  
  	cancel_parser_errposition_callback();
  
! 	if (fdresult == FUNCDETAIL_COERCION)
! 	{
! 		/*
! 		 * We interpreted it as a type coercion. coerce_type can handle these
! 		 * cases, so why duplicate code...
! 		 */
! 		return coerce_type(pstate, linitial(fargs),
! 		   actual_arg_types[0], rettype, -1,
! 		   COERCION_EXPLICIT, COERCE_EXPLICIT_CALL, location);
! 	}
! 	else if (fdresult == FUNCDETAIL_NORMAL || fdresult == FUNCDETAIL_PROCEDURE)
  	{
  		/*
! 		 * Normal function found; was there anything indicating it must be an
! 		 * aggregate?
  		 */
  		if (agg_star)
  			ereport(ERROR,
--- 257,298 
  
  	cancel_parser_errposition_callback();
  
! 	/*
! 	 * Check for various wrong-kind-of-routine cases.
! 	 */
! 
! 	/* If this is a CALL, reject things that aren't procedures */
! 	if (proc_call &&
! 		(fdresult == FUNCDETAIL_NORMAL ||
! 		 fdresult == FUNCDETAIL_AGGREGATE ||
! 		 fdresult == FUNCDETAIL_WINDOWFUNC ||
! 		 fdresult == FUNCDETAIL_COERCION))
! 		ereport(ERROR,
! (errcode(ERRCODE_WRONG_OBJECT_TYPE),
!  errmsg("%s is not a procedure",
! 		func_signature_string(funcname, nargs,
! 			  argnames,
! 			  actual_arg_types)),
!  errhint("To call a function, use SELECT."),
!  parser_errposition(pstate, location)));
! 	/* Conversely, if not a CALL, reject procedures */
! 	if (fdresult == FUNCDETAIL_PROCEDURE && !proc_call)
! 		ereport(ERROR,
! (errcode(ERRCODE_WRONG_OBJECT_TYPE),
!  errmsg("%s is a procedure",
! 		func_signature_string(funcname, nargs,
! 			  argnames,
! 			  actual_arg_types)),
!  errhint("To call a procedure, use CALL."),
!  parser_errposition(pstate, location)));
! 
! 	if (fdresult == FUNCDETAIL_NORMAL ||
! 		fdresult == FUNCDETAIL_PROCEDURE ||
! 		fdresult == FUNCDETAIL_COERCION)
  	{
  		/*
! 		 * In these cases, complain if there was anything indicating it must
! 		 * be an aggregate or window function.
  		 */
  		if (agg_star)
  			ereport(ERROR,
*** ParseFuncOrColumn(ParseState *pstate, Li
*** 306,331 
  	 errmsg("OVER specified, but %s is not a window function nor an aggregate function",
  			NameListToString(funcname)),
  	 parser_errposition(pstate, location)));
  
! 		if (fdresult == FUNCDETAIL_NORMAL && proc_call)
! 			ereport(ERROR,
! 	(errcode(ERRCODE_UNDEFINED_FUNCTION),
! 	 errmsg("%s is not a procedure",
! 			

Re: BUG #15237: I got "ERROR: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression"

2018-06-15 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> For example, suppose f(...) returns a single-column tuple result.
 Tom> This should be legal, if x matches the type of the single column:
 Tom> update ... set (x) = f(...)
 Tom> but if we try to do what you seem to have in mind, this would not
 Tom> be:
 Tom> update ... set (x) = (f(...))

 >> The spec indeed says that this is not valid, since it would wrap an
 >> additional ROW() around the result, and would try and assign the row
 >> value to x.

 Tom> I think that arguing that the spec requires that is fairly shaky,

I'll be more precise: the spec does not define any syntax that would
match UPDATE ... SET (x) = (f(...))  where f() is a function returning a
row type of degree 1 (or any other degree for that matter).

Specifically, the expansions of 
don't allow a row-valued function to appear alone that way, while the
alternative, , requires a .

The spec accordingly doesn't require that the construct be rejected, it
could be accepted as a (nonstandard) extension.

 Tom> I'd also point out that with the rule that the extra parens
 Tom> implicitly mean "ROW()", it's fairly unclear what should happen
 Tom> with

 Tom>   update ... set (x) = ((select ...))

The spec doesn't say that parens (even extra parens) implicitly mean
ROW(), what it does say is that some branches of the syntax implicitly
add a ROW() and some do not.

Specifically, if the thing to the right of the = is a , , or NULL/DEFAULT without parens,
then ROW() is added. ((select scalarcol from ...)) is a , ((select ROW(scalarcol) from ...)) is, however, not.

 Tom> But I repeat that this is a bad idea and we'd regret it in the
 Tom> long run; treating extra parens as meaning ROW() in some cases is
 Tom> just a fundamentally unsound idea from both syntactic and semantic
 Tom> standpoints.

But that's not actually the requirement. Parens are significant to the
spec, but that doesn't mean they have to be significant to us in the
same way as long as we end up accepting all the spec's cases (or at
least the useful ones).

Here is a real problem case (where "rowcol" is a column with a row type
and "rowval" has that same type):

UPDATE ... SET (rowcol) = (rowval)

There literally isn't any way to write the above in the spec except as

UPDATE ... SET (rowcol) = ROW(rowval)
or
UPDATE ... SET (rowcol) = (select ROW(rowval) from ...)

If we don't try and handle SET (rowcol) = (rowval), then I think we can
make all the spec's cases work by this rule: if the thing on the RHS of
the assignment is not a row type, then treat it as a row type of degree
1. That works without needing special rules for parens, and while it
accepts a lot of possibilities that the spec rejects, I don't see any
problems there.

 Tom> Given the extremely small number of complaints since v10 came out,
 Tom> I think we should just stay with what we have.

I tried looking at what combinations were supported by other major
databases:

UPDATE ... SET (cols...) = (exprs...)

Supported by DB2 and SQLite, both of which allow 1 or more expressions.
Not supported by MySQL, MSSQL, Oracle.

UPDATE ... SET (cols...) = (select exprs...)

Supported by DB2, Oracle, SQLite, for 1 or more columns. Not supported
by MySQL, MSSQL. (Note that this syntax is apparently not valid in the
spec except for the case of a single non-rowtype column!)

UPDATE ... SET (cols...) = (DEFAULT,...)

Supported by DB2, for 1 or more columns. Not supported by MySQL, MSSQL,
Oracle, SQLite.

UPDATE ... SET (cols...) = ROW(...)
UPDATE ... SET (cols...) = nonparenthesized_row_expression
UPDATE ... SET (cols...) = (select ROW(...) from ...)

Not supported by anything that I could find. (HSQLDB claims to support
the full standard syntax, but I don't believe it yet.)

-- 
Andrew (irc:RhodiumToad)



Re: missing toast table for pg_policy

2018-06-15 Thread John Naylor
On 2/19/18, Joe Conway  wrote:
> The attached does exactly this. Gives all system tables toast tables
> except pg_class, pg_attribute, and pg_index, and includes cat version
> bump and regression test in misc_sanity.
>
> Any further discussion, comments, complaints?

Hi Joe,
There's been a little bit-rot with duplicate OIDs and the regression
test. The first attached patch fixes that (applies on top of yours).

It occurred to me that we could go further and create most toast
tables automatically by taking advantage of the fact that the toast
creation function is a no-op if there are no varlena attributes. The
second patch (applies on top of the first) demonstrates a setup where
only shared and bootstrap catalogs need to have toast declarations
specified manually with fixed OIDs. It's possible this way is more
fragile, though.

I also did some non-scientific performance testing. On my machine,
initdb takes at least:
HEAD ~1040 ms
patch ~1070 ms
with my addenda ~1075 ms

A little slower, but within the noise of variation.

-John Naylor

> Joe
>
> --
> Crunchy Data - http://crunchydata.com
> PostgreSQL Support for Secure Enterprises
> Consulting, Training, & Open Source Development
>
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index ee54487..1387569 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -46,9 +46,9 @@ extern void BootstrapToastTable(char *relName,
  */
 
 /* normal catalogs */
-DECLARE_TOAST(pg_aggregate, 4139, 4140);
+DECLARE_TOAST(pg_aggregate, 4187, 4188);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
-DECLARE_TOAST(pg_collation, 4141, 4142);
+DECLARE_TOAST(pg_collation, 4189, 4190);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
 DECLARE_TOAST(pg_default_acl, 4143, 4144);
 DECLARE_TOAST(pg_description, 2834, 2835);
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index 0be74d2..ad10a7e 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -92,11 +92,12 @@ ORDER BY 1,2;
 --+---+--
  pg_attribute | attacl| aclitem[]
  pg_attribute | attfdwoptions | text[]
+ pg_attribute | attmissingval | anyarray
  pg_attribute | attoptions| text[]
  pg_class | relacl| aclitem[]
  pg_class | reloptions| text[]
  pg_class | relpartbound  | pg_node_tree
  pg_index | indexprs  | pg_node_tree
  pg_index | indpred   | pg_node_tree
-(8 rows)
+(9 rows)
 
diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
index 4c72989..9f5a3b9 100644
--- a/src/backend/bootstrap/bootparse.y
+++ b/src/backend/bootstrap/bootparse.y
@@ -27,10 +27,13 @@
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_amproc.h"
 #include "catalog/pg_attribute.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_class.h"
+#include "catalog/pg_index.h"
 #include "catalog/pg_namespace.h"
+#include "catalog/pg_opclass.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/toasting.h"
 #include "commands/defrem.h"
@@ -255,6 +258,23 @@ Boot_CreateStmt:
 	  InvalidOid,
 	  NULL);
 		elog(DEBUG4, "relation created with OID %u", id);
+
+		/*
+		 * Create a toast table for all non-shared catalogs
+		 * that have any varlena attributes, except for pg_index.
+		 * We also exclude catalogs that need to be populated
+		 * in order for toast tables to be created. These don't
+		 * currently have varlenas, so this is just future-proofing.
+		 */
+		if (!shared_relation
+			&& id != IndexRelationId
+			&& id != AccessMethodRelationId
+			&& id != OperatorClassRelationId
+			&& id != AccessMethodProcedureRelationId)
+		{
+			NewRelationCreateToastTable(id, (Datum) 0);
+		}
+		elog(DEBUG4, "creating toast table for table \"%s\"", $2);
 	}
 	do_end();
 }
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 0865240..48a871a 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -27,12 +27,13 @@ include $(top_srcdir)/src/backend/common.mk
 
 # Note: the order of this list determines the order in which the catalog
 # header files are assembled into postgres.bki.  BKI_BOOTSTRAP catalogs
-# must appear first, and there are reputedly other, undocumented ordering
-# dependencies.
+# must appear first, followed by catalogs needed for creating toast tables.
+# There are reputedly other, undocumented ordering dependencies.
 CATALOG_HEADERS := \
 	pg_proc.h pg_type.h pg_attribute.h pg_class.h \
-	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
-	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
+	pg_index.h pg_am.h pg_opclass.h pg_amproc.h \
+	pg_operator.h pg_opfamily.h pg_amop.h \
+	pg_attrdef.h pg_constraint.h pg_inherits.h \
 	pg_language.h 

Re: AtEOXact_ApplyLauncher() and subtransactions

2018-06-15 Thread Amit Khandekar
On 15 June 2018 at 09:46, Peter Eisentraut
 wrote:
> On 6/5/18 07:02, Amit Khandekar wrote:
>> I haven't written a patch for it, but I think we should have a
>> separate on_commit_stop_workers for each subtransaction. At
>> subtransaction commit, we replace the on_commit_stop_workers list of
>> the parent subtransaction with the one from the committed
>> subtransaction; and on abort, discard the list of the current
>> subtransaction. So have a stack of the lists.
>
> Is there maybe a more general mechanism we could attach this to?  Maybe
> resource owners?

You mean using something like ResourceOwnerRelease() ? We need to
merge the on_commit_stop_workers list into the parent transaction's
list. So we can't release the whole list on commit.

The way I am implementing this can be seen in the attached
apply_launcher_subtrans_WIP.patch (check the launcher.c changes). I
haven't started testing it yet, though.
on_commit_stop_workers denotes a list of subscriptions, and each
element also contains a list of relations for that subscription.
This list is built by ALTER-SUBSCRIPTION commands that have run in the
current (sub)transaction.

At sub-transaction start, the on_commit_stop_workers is pushed into a
stack. Then on_commit_stop_workers starts with empty list. This list
is then populated by ALTER-SUBCSRIPTION commands of the current
sub-transaction.

At sub-transaction commit, this list is merged into the list of parent
subtransaction, the parent transaction stack element is popped out,
and the merged list becomes the new on_commit_stop_workers.
So say the parent has sub1(tab1, tab2), sub2(tab2, tab3), where sub means
subscription, and the current on_commit_stop_workers has sub2(tab4) and
sub3(tab1).
At commit, for subscription sub2, we should replace the outer
sub2(tab2, tab3) with the newer sub2(tab4).
So the merged list will have:
sub1(tab1, tab2), sub2(tab4), sub3(tab1)

At sub-transaction abort, the on_commit_stop_workers is discarded. The
parent transaction worker list is assigned back to
on_commit_stop_workers.
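
A sketch of that stack discipline in terms of the List API (hypothetical
function names; the real list entries and the merge at subcommit are only
outlined):

#include "postgres.h"
#include "nodes/pg_list.h"

static List *on_commit_stop_workers = NIL;
static List *saved_stop_workers_stack = NIL;	/* one entry per subxact level */

static void
AtSubStart_ApplyLauncher_sketch(void)
{
	/* push the current list; this level starts with an empty one */
	saved_stop_workers_stack = lappend(saved_stop_workers_stack,
									   on_commit_stop_workers);
	on_commit_stop_workers = NIL;
}

static void
AtSubAbort_ApplyLauncher_sketch(void)
{
	/* discard this level's list and restore the parent's */
	on_commit_stop_workers = (List *) llast(saved_stop_workers_stack);
	saved_stop_workers_stack =
		list_truncate(saved_stop_workers_stack,
					  list_length(saved_stop_workers_stack) - 1);
}

/*
 * At subcommit, the parent's list would be popped the same way and this
 * level's entries merged into it, with the newest entry per subscription
 * winning; that merge (and freeing of discarded lists) is omitted here.
 */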


>
>> Subscription mysub is set to synchronise tables tx1 and tx2 :
>> CREATE SUBSCRIPTION mysub ... PUBLICATION pubx;
>>
>> Now suppose the subscription is altered to synchronise ty1 and ty2,
>> and then again altered back to synchronise tx1 and tx2 in the same
>> transaction.
>> begin;
>> ALTER SUBSCRIPTION mysub set publication puby;
>> ALTER SUBSCRIPTION mysub set publication pubx;
>> commit;
>>
>> Here, workers for tx1 and tx2 are added to on_commit_stop_workers
>> after the publication is set to puby. And then workers for ty1 and ty2
>> are further added to that list after the 2nd ALTER command where
>> publication is set to pubx, because the earlier ALTER command has
>> already changed the catalogs to denote that ty1 and ty2 are being
>> synchronised. Effectively, at commit, all the workers are targetted to
>> be stopped, when actually at commit time there is no final change in
>> the tables to be synchronised.
>
> I'm not so bothered about this scenario.  When you drop and then
> recreate a table in the same transaction, that doesn't mean you keep the
> data that was previously in the table.  If you want to *undo* a change,
> you need to do rollback, not commit further changes that try to recreate
> the previous state.

Maybe the example I gave is not practical. But a user could just as well do
something like:

CREATE SUBSCRIPTION mysub ... PUBLICATION pub1;
...
begin;
ALTER SUBSCRIPTION mysub set publication pub2;

ALTER SUBSCRIPTION mysub set publication pub3;
commit;
So pub3 is yet another publication. So here the old one is not set back again.

Or maybe there can be an ALTER SUBSCRIPTION REFRESH.

I believe on_commit_stop_workers was designed keeping in mind that all
actions are to be done only at commit; we should not immediately stop
workers. So it implies that the workers it determines at the end of
commit should also be accurate. I have anyway included the changes
for this in the same attached patch (the changes are in
subscriptioncmds.c).

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


apply_launcher_subtrans_WIP.patch
Description: Binary data


Re: Speedup of relation deletes during recovery

2018-06-15 Thread Andres Freund
Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
> > +
> > +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
> > for (i = 0; i < ndelrels; i++)
> > -   {
> > -   SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> > +   srels[i] = smgropen(delrels[i], InvalidBackendId);
> >  
> > -   smgrdounlink(srel, false);
> > -   smgrclose(srel);
> > -   }
> > +   smgrdounlinkall(srels, ndelrels, false);
> > +
> > +   for (i = 0; i < ndelrels; i++)
> > +   smgrclose(srels[i]);
> > +   pfree(srels);

Hm. This will probably cause another complexity issue. If we just
smgropen(), the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to?  That's
a bit hacky, but afaict should work?

Greetings,

Andres Freund



Removing "Included attributes in B-tree indexes" section from docs

2018-06-15 Thread Peter Geoghegan
I propose removing the "Included attributes in B-tree indexes"
top-level section of chapter 63 from the user facing documentation.
Chapter 63 concerns B-Tree operator classes themselves, in the
abstract, so the fact that an operator class isn't needed for extra
covering index columns isn't appropriate. It seems sufficient to only
mention this once, in the CREATE INDEX docs.

Attached patch shows what I have in mind -- the total removal of this
top-level section.

-- 
Peter Geoghegan
From 293f1fc93771fa6be65867f53aa507fb30137230 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 15 Jun 2018 10:45:29 -0700
Subject: [PATCH] Remove INCLUDE attributes section from docs.

It doesn't seem useful to mention covering indexes in a chapter about
the behavior of B-Tree operator classes.  The fact that INCLUDE columns
don't require a B-Tree operator class is already covered in the CREATE
INDEX documentation.
---
 doc/src/sgml/btree.sgml | 17 -
 1 file changed, 17 deletions(-)

diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 336d026ea1..8bd0badb28 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,23 +433,6 @@ returns bool
 
 
 
-
- Included attributes in B-tree indexes
-
- 
-  As of PostgreSQL 11.0 there is an optional
-  INCLUDE clause, which allows to add non-key (included) attributes to index.
-  Those included attributes allow more queries to benefit from index-only scans.
-  We never use included attributes in ScanKeys for search.  That allows us to
-  include into B-tree any datatypes, even those which don't have suitable
-  operator classes.  Included columns only stored in regular tuples on leaf
-  pages.  All pivot tuples on non-leaf pages and highkey tuples are truncated
-  to contain only key attributes.  That helps to slightly reduce the size of
-  index.
- 
-
-
-
 
  Implementation
 
-- 
2.17.1



Re: WAL prefetch

2018-06-15 Thread Andres Freund
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
> 
> 
> On 14.06.2018 09:52, Thomas Munro wrote:
> > On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
> >  wrote:
> > > The pg_wal_prefetch function will infinitely traverse the WAL and prefetch
> > > the block references in WAL records using the posix_fadvise(WILLNEED)
> > > system call.
> > Hi Konstantin,
> > 
> > Why stop at the page cache...  what about shared buffers?
> > 
> 
> It is a good question. I thought a lot about prefetching directly to shared
> buffers.

I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.


> But the current c'est la vie with Postgres is that allocating too much
> memory for shared buffers is not recommended,
> due to many different reasons: degradation of the clock replacement algorithm,
> "write storms", ...

I think a lot of that fear is overplayed. And we've fixed a number of
issues.  We don't really generate write storms in the default config
anymore in most scenarios, and if it's an issue you can turn on
backend_flush_after.


> If your system has 1Tb of memory,  almost none of Postgresql administrators
> will recommend to use all this 1Tb for shared buffers.

I've used 1TB successfully.


> Also PostgreSQL is not currently supporting dynamic changing of shared
> buffers size. Without it, the only way of using Postgres in clouds and
> another multiuser systems where system load is not fully controlled by  user
> is to choose relatively small shared buffer size and rely on OS caching.

That seems largely unrelated to the replay case, because there the data
will be read into shared buffers anyway. And it'll be dirtied therein.

Greetings,

Andres Freund



Re: Speedup of relation deletes during recovery

2018-06-15 Thread Teodor Sigaev

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot.  That means it's very
easy to get into situations where the standby starts to lag behind very 
significantly.

+1, we have faced that too
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Speedup of relation deletes during recovery

2018-06-15 Thread Andres Freund
Hi,

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot.  That means it's very
easy to get into situations where the standby starts to lag behind very 
significantly.
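
For reference, the asymmetry comes from the fact that, when many relations
are dropped at once, the primary can sort the relfilenodes once and then
binary-search them for every buffer header, whereas the standby-side path
effectively scans all of shared_buffers once per relation.  A rough sketch
of the primary-side shape, with stand-in helper names rather than the
actual bufmgr code:

/* sort the relfilenodes being dropped -- done once */
qsort(rnodes, ndrels, sizeof(RelFileNode), rnode_comparator);

for (i = 0; i < NBuffers; i++)
{
	RelFileNode rnode = buffer_rnode(i);	/* stand-in: tag of buffer i */

	if (bsearch(&rnode, rnodes, ndrels, sizeof(RelFileNode),
				rnode_comparator) != NULL)
		invalidate_buffer(i);				/* O(log ndrels) per buffer */
}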

> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -1445,6 +1445,7 @@ FinishPreparedTransaction(const char *gid, bool 
> isCommit)
>   int ndelrels;
>   SharedInvalidationMessage *invalmsgs;
>   int i;
> + SMgrRelation *srels = NULL;
>  
>   /*
>* Validate the GID, and lock the GXACT to ensure that two backends do 
> not
> @@ -1534,13 +1535,16 @@ FinishPreparedTransaction(const char *gid, bool 
> isCommit)
>   delrels = abortrels;
>   ndelrels = hdr->nabortrels;
>   }
> +
> + srels = palloc(sizeof(SMgrRelation) * ndelrels);
>   for (i = 0; i < ndelrels; i++)
> - {
> - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> + srels[i] = smgropen(delrels[i], InvalidBackendId);
>  
> - smgrdounlink(srel, false);
> - smgrclose(srel);
> - }
> + smgrdounlinkall(srels, ndelrels, false);
> +
> + for (i = 0; i < ndelrels; i++)
> + smgrclose(srels[i]);
> + pfree(srels);

This code is now duplicated three times - shouldn't we just add a
function that encapsulates dropping relations in a commit/abort record?
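
Such a helper might look roughly like this (the name is hypothetical; the
body just mirrors the hunk quoted above):

static void
DropDeletedRelations(RelFileNode *delrels, int ndelrels, bool isRedo)
{
	SMgrRelation *srels = palloc(sizeof(SMgrRelation) * ndelrels);
	int			i;

	for (i = 0; i < ndelrels; i++)
		srels[i] = smgropen(delrels[i], InvalidBackendId);

	smgrdounlinkall(srels, ndelrels, isRedo);

	for (i = 0; i < ndelrels; i++)
		smgrclose(srels[i]);
	pfree(srels);
}

The commit, abort and prepared-transaction callers would then each collapse
to a single call.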

Greetings,

Andres Freund



Re: BUG #15237: I got "ERROR: source for a multiple-column UPDATE item must be a sub-SELECT or ROW() expression"

2018-06-15 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> As far as (a) goes, it was an intentional compatibility breakage,

> So here I have a problem: the fact that it was a compatibility breakage
> seems not to have been discussed *or even so much as mentioned* on
> -hackers at any time.

I will freely admit that at the time I did the original patch, I did not
think that the change for the single-column case was of interest to
anyone.  Who'd write an extra four parentheses that they didn't need?
But it *was* discussed later, including on -hackers:

https://www.postgresql.org/message-id/flat/20170719174507.GA19616%40telsasoft.com#20170719174507.ga19...@telsasoft.com
https://www.postgresql.org/message-id/flat/CAMjNa7cDLzPcs0xnRpkvqmJ6Vb6G3EH8CYGp9ZBjXdpFfTz6dg%40mail.gmail.com

>  Tom> For example, suppose f(...) returns a single-column tuple result.
>  Tom> This should be legal, if x matches the type of the single column:
>  Tom> update ... set (x) = f(...)
>  Tom> but if we try to do what you seem to have in mind, this would not
>  Tom> be:
>  Tom> update ... set (x) = (f(...))

> The spec indeed says that this is not valid, since it would wrap an
> additional ROW() around the result, and would try and assign the row
> value to x.

I think that arguing that the spec requires that is fairly shaky, and
implementing it would be even shakier; for example, we'd have to start
worrying about whether ruleutils.c's habit of throwing in extra parens
when in doubt could ever result in unexpected change to the meaning.

I'd also point out that with the rule that the extra parens implicitly
mean "ROW()", it's fairly unclear what should happen with

update ... set (x) = ((select ...))

If somebody were to hold a gun to my head and say "you must make this
work", I'd probably do something like the attached, which abuses the
operator_precedence_warning infrastructure to detect whether there were
extra parens.  One would have to argue about exactly what it should do
with parens around ROW() or (SELECT ...), but I chose to ignore them.
(I think that in the SELECT case, the grammar would actually absorb
the extra parens elsewhere, so that we couldn't do anything differently
anyway without further kluges.)

But I repeat that this is a bad idea and we'd regret it in the long run;
treating extra parens as meaning ROW() in some cases is just a
fundamentally unsound idea from both syntactic and semantic standpoints.
Given the extremely small number of complaints since v10 came out,
I think we should just stay with what we have.

regards, tom lane

PS: I noticed while messing with this that there's a separate problem:
right now, turning on operator_precedence_warning causes unexpected
failures with extra parens around the RHS.  The code added below to
make transformMultiAssignRef look through AEXPR_PAREN nodes before
deciding anything is needed anyway to fix that.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 90dfac2..64bd254 100644
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
*** static Node *makeRecursiveViewSelect(cha
*** 470,476 
  %type <node>	def_arg columnElem where_clause where_or_current_clause
  a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
  columnref in_expr having_clause func_table xmltable array_expr
! ExclusionWhereClause operator_def_arg
  %type <list>	rowsfrom_item rowsfrom_list opt_col_def_list
  %type <boolean> opt_ordinality
  %type <list>	ExclusionConstraintList ExclusionConstraintElem
--- 470,476 
  %type <node>	def_arg columnElem where_clause where_or_current_clause
  a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
  columnref in_expr having_clause func_table xmltable array_expr
! ct_row_expr ExclusionWhereClause operator_def_arg
  %type <list>	rowsfrom_item rowsfrom_list opt_col_def_list
  %type <boolean> opt_ordinality
  %type <list>	ExclusionConstraintList ExclusionConstraintElem
*** set_clause:
*** 11070,11076 
  	$1->val = (Node *) $3;
  	$$ = list_make1($1);
  }
! 			| '(' set_target_list ')' '=' a_expr
  {
  	int ncolumns = list_length($2);
  	int i = 1;
--- 11070,11076 
  	$1->val = (Node *) $3;
  	$$ = list_make1($1);
  }
! 			| '(' set_target_list ')' '=' ct_row_expr
  {
  	int ncolumns = list_length($2);
  	int i = 1;
*** c_expr:		columnref{ $$ = $1; }
*** 13451,13457 
  		n->indirection = check_indirection($4, yyscanner);
  		$$ = (Node *)n;
  	}
! 	else if (operator_precedence_warning)
  	{
  		/*
  		 * If precedence warnings are enabled, insert
--- 13451,13458 
  		n->indirection = check_indirection($4, yyscanner);
  		$$ = (Node *)n;
  	}
! 	else if (operator_precedence_warning ||
! 			 pg_yyget_extra(yyscanner)->in_ctre)
  	{
  		/*
  		 * If precedence warnings are enabled, insert
*** c_expr:		columnref

Re: question on streaming replication

2018-06-15 Thread Andreas Kretschmer
On 14 June 2018 07:28:53 CEST, Atul Kumar  wrote:
>Hi,
>
>I have postgres edb 9.6 version, i have below query to solve it out.
>
>i have configured streaming replication having master and slave node
>on same  server just to test it.
>
>All worked fine but when i made slave service stop, and create some
>test databases in master, after then i made slave service start, slave
>didn't pick the changes.
>
>The replication was on async state.
>
>Then after doing some search on google i tried to make it sync state
>but even making changes in postgresql.conf file I am neither getting
>sync state nor getting any changes on slave server.
>
>Please suggest the needful.
>
>
>Regards,
>Atul

Sync replication isn't useful with only one standby.

I think that while the standby was stopped, the master overwrote WAL segments
that the standby still needs. You can prevent that by increasing
wal_keep_segments or by using replication slots. Please use Google; there are
tons of docs about all of that.


Regards, Andreas


-- 
2ndQuadrant - The PostgreSQL Support Company



Re: Possible bug in logical replication.

2018-06-15 Thread Arseny Sher

Alvaro Herrera  writes:

> Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that
> Michaël's commit fixes the reported bug?

I confirm that starting reading WAL since restart_lsn as implemented in
f731cfa fixes this issue, as well as the second issue tushar mentioned
at [1]. I think that the code still can be improved a bit though --
consider the attached patch:
* pg_logical_replication_slot_advance comment was not very informative
  and even a bit misleading: it said that we use confirmed_flush_lsn and
  restart_lsn, but didn't explain why.
* Excessive check in its main loop.
* Copy-paste comment fix.

[1] 
https://www.postgresql.org/message-id/5f85bf41-098e-c4e1-7332-9171fef57a0a%40enterprisedb.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From d8ed8ae3eec54b716d7dbb35379d0047a96c6c75 Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Fri, 15 Jun 2018 18:11:17 +0300
Subject: [PATCH] Cosmetic review for f731cfa.

* pg_logical_replication_slot_advance comment was not very informative and even
  a bit misleading: it said that we use confirmed_flush_lsn and restart_lsn, but
  didn't explain why.
* Excessive check in its main loop.
* Copy-paste comment fix.
---
 src/backend/replication/slotfuncs.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..597e81f4d9 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -341,24 +341,26 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 
 /*
  * Helper function for advancing logical replication slot forward.
- * The slot's restart_lsn is used as start point for reading records,
- * while confirmed_lsn is used as base point for the decoding context.
- * The LSN position to move to is checked by doing a per-record scan and
- * logical decoding which makes sure that confirmed_lsn is updated to a
- * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn allowing
+ * to recycle WAL and old catalog tuples. We do it in special fast_forward
+ * mode without actual replay.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
 {
 	LogicalDecodingContext *ctx;
 	ResourceOwner old_resowner = CurrentResourceOwner;
+	/*
+	 * Start reading WAL at restart_lsn, which certainly points to the valid
+	 * record.
+	 */
 	XLogRecPtr	startlsn = MyReplicationSlot->data.restart_lsn;
 	XLogRecPtr	retlsn = MyReplicationSlot->data.confirmed_flush;
 
 	PG_TRY();
 	{
-		/* restart at slot's confirmed_flush */
+		/* start_lsn doesn't matter here, we don't replay xacts at all */
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 	NIL,
 	true,
@@ -388,17 +390,10 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			 */
 			startlsn = InvalidXLogRecPtr;
 
-			/*
-			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
-			 * store the description into our tuplestore.
-			 */
+			/* Changes are not actually produced in fast_forward mode. */
 			if (record != NULL)
 LogicalDecodingProcessRecord(ctx, ctx->reader);
 
-			/* Stop once the moving point wanted by caller has been reached */
-			if (moveto <= ctx->reader->EndRecPtr)
-break;
-
 			CHECK_FOR_INTERRUPTS();
 		}
 
-- 
2.11.0



Re: WAL prefetch

2018-06-15 Thread Konstantin Knizhnik




On 15.06.2018 18:03, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 1:08 PM, Konstantin Knizhnik
 wrote:


On 15.06.2018 07:36, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost 
wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb
NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from
56k
TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby..  At least, that's the thinking I had when I was chatting w/
Sean.


Doing in-core certainly has some advantage such as it can easily reuse
the existing xlog code rather trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or is it just a win in some
limited cases.


I completely agree. That was my main concern: in which use cases will this
prefetch be efficient?
If "full_page_writes" is on (and that is the safe, default setting), then the first
update of a page since the last checkpoint will be written to WAL as a full page,
and applying it will not require reading any data from disk.


What exactly you mean by above?  AFAIU, it needs to read WAL to apply
full page image.  See below code:

XLogReadBufferForRedoExtended()
{
..
/* If it has a full-page image and it should be restored, do it. */
if (XLogRecBlockImageApply(record, block_id))
{
Assert(XLogRecHasBlockImage(record, block_id));
*buf = XLogReadBufferExtended(rnode, forknum, blkno,
   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
page = BufferGetPage(*buf);
if (!RestoreBlockImage(record, block_id, page))
..
}




Sorry for my confusing statement.
Definitely we need to read the page from WAL.
I mean that in the case of a "full page write" we do not need to read the
updated page from the database.

It can just be overwritten.

pg_prefaulter and my wal_prefetch are not prefetching WAL pages themselves.
There is no sense in doing that, because they have just been written by
wal_receiver and so should be present in the file system cache.
wal_prefetch prefetches the blocks referenced by WAL records. But in the
case of "full page writes" such a prefetch is not needed and is even harmful.
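
For reference, the hint issued for a referenced data block boils down to
something like the following (a minimal sketch, assuming the block reference
has already been resolved to an open file descriptor and byte offset; this is
not the patch's actual code):

#include <fcntl.h>

/*
 * Ask the kernel to start reading one BLCKSZ-sized data block (8 kB by
 * default) into the OS page cache ahead of redo.
 */
static void
prefetch_block(int fd, off_t block_offset)
{
	(void) posix_fadvise(fd, block_offset, 8192, POSIX_FADV_WILLNEED);
}

Since posix_fadvise() is only a hint, a single walker can issue it cheaply
far ahead of redo; as noted above, blocks covered by a full-page image do not
need it.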


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Make description of heap records more talkative for flags

2018-06-15 Thread Nathan Bossart
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

-   appendStringInfo(buf, "off %u", xlrec->offnum);
+   appendStringInfo(buf, "off %u flags %02X", xlrec->offnum,
+xlrec->flags);

Can we add "0x" before the flags so that it is abundantly clear it's a
hex value?
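
Concretely, that would make the line read (illustrative):

appendStringInfo(buf, "off %u flags 0x%02X", xlrec->offnum,
				 xlrec->flags);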

Nathan

Re: question on streaming replication

2018-06-15 Thread Amit Kapila
On Thu, Jun 14, 2018 at 10:58 AM, Atul Kumar  wrote:
> Hi,
>
> I have postgres edb 9.6 version, i have below query to solve it out.
>

This is not the right place to ask queries on edb versions.  You need
to check with your vendor about the right place to ask questions.
Here, you can ask the questions about PostgreSQL.

> i have configured streaming replication having master and slave node
> on same  server just to test it.
>
> All worked fine but when i made slave service stop, and create some
> test databases in master, after then i made slave service start, slave
> didn't pick the changes.
>

I think you need to ensure that replica is connected to master server
and then probably check logs to confirm what exactly happened.

> The replication was on async state.
>
> Then after doing some search on google i tried to make it sync state
> but even making changes in postgresql.conf file I am neither getting
> sync state nor getting any changes on slave server.
>

After making changes in postgresql.conf, you might need to use
pg_reload_conf  or restart the server depending on the setting you
have changed to make that setting effective.

> Please suggest the needful.
>

As mentioned above, I suggest asking only PostgreSQL related questions
on these mailing lists.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: why partition pruning doesn't work?

2018-06-15 Thread Andrew Dunstan




On 06/15/2018 10:02 AM, Tom Lane wrote:

Andrew Dunstan  writes:

OK, lousyjack is back online with this, new and improved. It currently
takes 7.5 hours for a run.  Should I also add -DCATCACHE_FORCE_RELEASE?

I did some experimentation yesterday with valgrind plus both
RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE.  I didn't find
any new bugs, but adding CATCACHE_FORCE_RELEASE makes things quite
a lot slower :-(.  Not sure it's worth it.





OK.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: why partition pruning doesn't work?

2018-06-15 Thread Tom Lane
Andrew Dunstan  writes:
> OK, lousyjack is back online with this, new and improved. It currently 
> takes 7.5 hours for a run.  Should I also add -DCATCACHE_FORCE_RELEASE?

I did some experimentation yesterday with valgrind plus both
RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE.  I didn't find
any new bugs, but adding CATCACHE_FORCE_RELEASE makes things quite
a lot slower :-(.  Not sure it's worth it.

regards, tom lane



Re: why partition pruning doesn't work?

2018-06-15 Thread Andrew Dunstan




On 06/12/2018 07:47 AM, Andrew Dunstan wrote:



On 06/11/2018 06:41 PM, Andrew Dunstan wrote:



On 06/11/2018 06:24 PM, Tom Lane wrote:


If we had any buildfarm critters running valgrind on
RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
detected use of uninitialized memory here ... but I don't think we have
any.  (The combination of valgrind and CCA would probably be too 
slow to

be practical :-(, though maybe somebody with a fast machine could do
the other thing.)





I don't have a fast machine, but I do have a slow machine already 
running valgrind and not doing much else :-) Let's see how lousyjack 
does with -DRELCACHE_FORCE_RELEASE






It added about 20% to the run time. That's tolerable, so I'll leave it 
on.






OK, lousyjack is back online with this, new and improved. It currently 
takes 7.5 hours for a run.  Should I also add -DCATCACHE_FORCE_RELEASE?


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Refactoring query_planner() callback

2018-06-15 Thread Heikki Linnakangas
While hacking around the planner, the "callback" mechanism in 
query_planner started to feel awkward. It would seem more natural to 
split query_planner() into two parts: one function to do all the 
pre-processing of the jointree, building equivalence classes etc. And a 
second function to generate the Paths.


So where the callers currently do this:

/* set up callback arguments */
qp_extra.foo = ...;
qp_extra.bar = ...;

query_planner(root, qp_callback, &qp_extra);

static void
qp_callback(PlannerInfo *root, void *qp_extra)
{
  root->sort_pathkeys = ...;
  root->query_pathkeys = ...;
}


This would feel more natural to me:

/* process the jointree, build equivalence classes */
process_jointree(root);

/* set up query_pathkeys */
root->sort_pathkeys = ...;
root->query_pathkeys = ...;

query_planner(root);


Attached is a more full-fledged patch to do that.

The callback was introduced by commit db9f0e1d9a. The commit message 
explains the choice to use a callback like this:



There are several ways in which we could implement that without making
query_planner itself deal with grouping/sorting features (which are
supposed to be the province of grouping_planner).  I chose to add a
callback function to query_planner's API; other alternatives would have
required adding more fields to PlannerInfo, which while not bad in itself
would create an ABI break for planner-related plugins in the 9.2 release
series.  This still breaks ABI for anything that calls query_planner
directly, but it seems somewhat unlikely that there are any such plugins.


In v12, we can change the ABI.

- Heikki

>From 482e86a864f7ae5f4a5ffac09e0d1d21dfd8f762 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 15 Jun 2018 15:13:41 +0300
Subject: [PATCH 1/1] Refactor query_planner() into two parts.

Commit db9f0e1d9a introduced a callback to query_planner(), which gave the
caller a chance to build query_pathkeys, in the middle of query_planner()
steps. That feels a bit awkward. Let's split query_planner() into two
functions instead, one to process the jointree, building the equivalence
classes, and another to produce the Paths. The caller can build the
query_pathkeys between the two function calls.
---
 src/backend/optimizer/plan/planagg.c |  51 ++---
 src/backend/optimizer/plan/planmain.c| 125 ++-
 src/backend/optimizer/plan/planner.c |  45 +--
 src/backend/optimizer/util/placeholder.c |   2 +-
 src/include/nodes/relation.h |   9 ++-
 src/include/optimizer/planmain.h |   7 +-
 6 files changed, 131 insertions(+), 108 deletions(-)

diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 95cbffbd69..07f0734559 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -50,7 +50,6 @@
 static bool find_minmax_aggs_walker(Node *node, List **context);
 static bool build_minmax_path(PlannerInfo *root, MinMaxAggInfo *mminfo,
   Oid eqop, Oid sortop, bool nulls_first);
-static void minmax_qp_callback(PlannerInfo *root, void *extra);
 static Oid	fetch_agg_sort_op(Oid aggfnoid);
 
 
@@ -63,9 +62,9 @@ static Oid	fetch_agg_sort_op(Oid aggfnoid);
  * the (UPPERREL_GROUP_AGG, NULL) upperrel.
  *
  * This should be called by grouping_planner() just before it's ready to call
- * query_planner(), because we generate indexscan paths by cloning the
- * planner's state and invoking query_planner() on a modified version of
- * the query parsetree.  Thus, all preprocessing needed before query_planner()
+ * process_jointree(), because we generate indexscan paths by cloning the
+ * planner's state and invoking the scan/join planner on a modified version of
+ * the query parsetree.  Thus, all preprocessing needed before process_jointree()
  * must already be done.
  *
  * Note: we are passed the preprocessed targetlist separately, because it's
@@ -435,13 +434,33 @@ build_minmax_path(PlannerInfo *root, MinMaxAggInfo *mminfo,
 		   FLOAT8PASSBYVAL);
 
 	/*
-	 * Generate the best paths for this query, telling query_planner that we
-	 * have LIMIT 1.
+	 * Build base relation and equivalence classes, knowing that we have
+	 * LIMIT 1.
 	 */
 	subroot->tuple_fraction = 1.0;
 	subroot->limit_tuples = 1.0;
 
-	final_rel = query_planner(subroot, tlist, minmax_qp_callback, NULL);
+	process_jointree(subroot, tlist);
+
+	/*
+	 * Compute query_pathkeys to represent the desired order.  There is no
+	 * GROUP BY, window functions, or DISTINCT in the generated query.
+	 */
+	subroot->group_pathkeys = NIL;
+	subroot->window_pathkeys = NIL;
+	subroot->distinct_pathkeys = NIL;
+
+	subroot->sort_pathkeys =
+		make_pathkeys_for_sortclauses(subroot,
+	  subroot->parse->sortClause,
+	  subroot->parse->targetList);
+
+	subroot->query_pathkeys = subroot->sort_pathkeys;
+
+	/*
+	 * Generate the best paths for the generated query.
+	 */
+	final_rel = 

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-06-15 Thread Etsuro Fujita

(2018/06/14 22:37), Ashutosh Bapat wrote:

On Mon, Jun 11, 2018 at 4:05 PM, Etsuro Fujita
  wrote:

(2018/06/07 19:42), Ashutosh Bapat wrote:

On Wed, Jun 6, 2018 at 5:00 PM, Etsuro Fujita
   wrote:

Yeah, but I mean we modify set_append_rel_size so that we only map a parent
wholerow Var in the parent tlist to a child wholerow Var in the child's
tlist; parent wholerow Vars in the parent's other expressions such as
conditions are transformed into CREs as-is.


What happens to a PlaceHolderVar containing whole-row reference when
that appears in a condition and/or targetlist.


Whole-row Vars contained in such PHV's expressions will be mapped into 
ConvertRowtypeExprs with adjust_appendrel_attrs.



   I don't think the
tlists for the children need to have their wholerows transformed into the
corresponding ConvertRowtypeExprs *at this point*, so what I'd like to
propose is to 1) map a parent wholerow Var simply to a child wholerow
Var,
instead (otherwise, the same as adjust_appendrel_attrs), when building
the
reltarget for a child in set_append_rel_size, 2) build the tlists for
child
joins using those children's wholerow Vars at path creation time, and 3)
adjust those wholerow Vars in the tlists for subpaths in the chosen
AppendPath so that those are transformed into the corresponding
ConvertRowtypeExprs, at plan creation time (ie, in
create_append_plan/create_merge_append_plan).  IIUC, this would not
require
any changes to pull_var_clause as proposed by the patch.  This would not
require any changes to postgres_fdw as proposed by the patch, either.  In
addition, this would make unnecessary the code added to setrefs.c by the
partitionwise-join patch.  Maybe I'm missing something though.



Not translating whole-row expressions to ConvertRowtypeExpr before
creating paths can lead to a number of anomalies. For example,

1. If there's an index on the child's whole-row expression, translating
the parent's whole-row expression to the child's whole-row expression
as is will lead to using that index, when in reality it cannot be used,
since the condition/ORDER BY clause (which originally referred to the
parent's whole-row expression) requires the child's whole-row to be
reassembled as the parent's whole-row.
2. Constraints on the child's whole-row expression will be used to prune
a child when they cannot be used, since the original condition was on
the parent's whole-row expression and not that of the child.
3. Equivalence classes will think that a child whole-row expression
(without ConvertRowtypeExpr) is equivalent to an expression which is
part of the same equivalence class as the parent's whole-row
expression.



Is that still possible when we only map a parent wholerow Var in the
parent's tlist to a child wholerow Var in the child's tlist?


Yes. 1 will affect the choice of index-only scans. We use pathkey
comparisons at a number of places so tlist going out of sync with
equivalence classes can affect a number of places.


Sorry, I'm still not sure if #1 is really a problem.  Could you show me 
an example for that?



build_tlist_to_deparse() is used to deparse the targetlist at the time of
creating paths, as well as during planning. According to your
proposal we will build different tlists during path creation and plan
creation. That doesn't look good.


Maybe my explanation was not enough, but my proposal will guarantee that 
the FDW can build the same tlist at path creation time and plan creation 
time because the RelOptInfo's reltarget from which the tlist is created 
doesn't change at all.



Apart from that your proposal of
changing a child's targetlist at the time of plan creation to use CRE
doesn't help since the original problem as described in [1] happens at
the time of creating plans as described in [1].


I think my proposal will address the original issue.  Actually, I've 
created a patch implementing that proposal.  That's still WIP, but it 
works well even for these cases (and makes code much more simple, as 
mentioned above!)  But I think that patch needs more work, so I'm 
planning to post it next week.


Best regards,
Etsuro Fujita



Re: Jsonb transform for pl/python

2018-06-15 Thread Alexander Korotkov
Hi!

On Fri, Jun 15, 2018 at 2:11 PM Nikita Glukhov  wrote:
>
> On 28.03.2018 15:43, Peter Eisentraut wrote:
>
> > On 3/21/18 18:50, Peter Eisentraut wrote:
> >> On 3/12/18 11:26, Nikita Glukhov wrote:
> >>> I have reviewed this patch.  Attached new 6th version of the patch with
> >>> v5-v6 delta-patch.
> >> Thanks for the update.  I'm working on committing this.
> > Committed.
> >
> > Everything seemed to work well.  I just did a bit of cosmetic cleanups.
> > I did a fair amount of tweaking on the naming of functions, the
> > extensions and library names, etc., to make it look like the existing
> > plpython transforms.  I also changed it so that the transform would
> > support mappings and sequences other than dict and list.  I removed the
> > non-ASCII test and the test with the huge numbers.
>
>
> I found a memory leak in PLySequence_ToJsonbValue():
> PyObject returned from PySequence_GetItem() is not released.
>
> A bug can be easily reproduced using the roundtrip() function from the regression test:
> SELECT roundtrip('[1,2,3]'::jsonb) FROM generate_series(1, 100);
>
> Similar code in PLyMapping_ToJsonbValue() seems to be correct because
> PyList_GetItem() and PyTuple_GetItem() return a borrowed reference.
>
> Patch with fix is attached.

I'm going to check and commit this if everything is OK.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-06-15 Thread David Rowley
On 15 June 2018 at 20:37, Amit Langote  wrote:
> select * from partitioned_table_a
> union all
> select * from partitioned_table_b
>
> The only thing that changes with the patch is that
> ExecLockNonLeafAppendTables is called *twice* for the two nested Appends
> corresponding to partitioned_table_a and partitioned_table_b, resp.,
> instead of just once for the top level Append corresponding to the UNION
> ALL parent.  In fact, when called for the top level Append,
> ExecLockNonLeafAppendTables is now a no-op.

If the top level Append is the UNION ALL one, then there won't be any
partitioned_rels. If that's what you mean by no-op then, yeah. There
are no duplicate locks already obtained in the parent with the child
Append node.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Jsonb transform for pl/python

2018-06-15 Thread Nikita Glukhov

On 28.03.2018 15:43, Peter Eisentraut wrote:


On 3/21/18 18:50, Peter Eisentraut wrote:

On 3/12/18 11:26, Nikita Glukhov wrote:

I have reviewed this patch.  Attached new 6th version of the patch with
v5-v6 delta-patch.

Thanks for the update.  I'm working on committing this.

Committed.

Everything seemed to work well.  I just did a bit of cosmetic cleanups.
I did a fair amount of tweaking on the naming of functions, the
extensions and library names, etc., to make it look like the existing
plpython transforms.  I also changed it so that the transform would
support mappings and sequences other than dict and list.  I removed the
non-ASCII test and the test with the huge numbers.



I found a memory leak in PLySequence_ToJsonbValue():
PyObject returned from PySequence_GetItem() is not released.

A bug can be easily reproduced using the roundtrip() function from the regression test:
SELECT roundtrip('[1,2,3]'::jsonb) FROM generate_series(1, 100);

Similar code in PLyMapping_ToJsonbValue() seems to be correct because
PyList_GetItem() and PyTuple_GetItem() return a borrowed reference.

Patch with fix is attached.
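
For readers less familiar with the CPython convention involved:
PySequence_GetItem() returns a new reference that the caller must release,
whereas PyList_GetItem() and PyTuple_GetItem() return borrowed references
that must not be released.  A minimal illustration (not part of the patch):

PyObject   *item = PySequence_GetItem(obj, i);	/* new reference */

if (item != NULL)
{
	/* ... use item ... */
	Py_DECREF(item);		/* caller owns it, so drop it */
}

PyObject   *elem = PyList_GetItem(obj, i);		/* borrowed reference */

if (elem != NULL)
{
	/* ... use elem ... */
	/* no Py_DECREF here */
}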

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 53ab2aa2f345d7c4f5924b2e327b2d94c6f2c765 Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Fri, 15 Jun 2018 02:58:40 +0300
Subject: [PATCH] Fix memory leak in contrib/jsonb_plpython

---
 contrib/jsonb_plpython/jsonb_plpython.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/contrib/jsonb_plpython/jsonb_plpython.c b/contrib/jsonb_plpython/jsonb_plpython.c
index f752d6c..d6d6eeb 100644
--- a/contrib/jsonb_plpython/jsonb_plpython.c
+++ b/contrib/jsonb_plpython/jsonb_plpython.c
@@ -308,6 +308,8 @@ PLySequence_ToJsonbValue(PyObject *obj, JsonbParseState **jsonb_state)
 		PyObject   *value = PySequence_GetItem(obj, i);
 
 		(void) PLyObject_ToJsonbValue(value, jsonb_state, true);
+
+		Py_XDECREF(value);
 	}
 
 	return pushJsonbValue(jsonb_state, WJB_END_ARRAY, NULL);
-- 
2.7.4



Re: automating perl compile time checking

2018-06-15 Thread Andrew Dunstan




On 06/15/2018 12:21 AM, Peter Eisentraut wrote:

On 6/11/18 16:06, Andrew Dunstan wrote:

This affects pretty much nothing. In fact some of the other changes I've
recently committed were arguably more dangerous. Do you want me to
revert the whole lot?

No, but this whole let's clean up the Perl code initiative seemed to
have come out of nowhere after feature freeze.  The Perl code was fine
before, so if we're going to be polishing it around it should go into
the next release.  I would actually have liked to have had more time to
discuss some of the changes, since I didn't seem to agree to some of
them at first reading.



There were changes made for perlcritic, but almost none for this check.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: why partition pruning doesn't work?

2018-06-15 Thread Amit Langote
On 2018/06/14 20:23, David Rowley wrote:
> On 14 June 2018 at 19:17, Amit Langote  wrote:
>> I had sent a patch to try to get rid of the open(NoLock) there a couple of
>> months ago [1].  The idea was to both lock and open the relation in
>> ExecNonLeafAppendTables, which is the first time all partitioned tables in
>> a given Append node are locked for execution.  Also, the patch makes it a
>> responsibility of ExecEndAppend to release the relcache pins, so the
>> recently added ExecDestroyPartitionPruneState would not be needed.
> 
> Robert and I briefly discussed something more complete a while back.
> Not much detail was talked about, but the basic idea was to store the
> Relation somewhere in the planner an executor that we could lookup by
> rel index rather than having to relation_open all the time.
> 
> I just had a very quick look around and thought maybe RangeTblEntry
> might be a good spot to store the Relation and current lock level that
> we hold on that relation. This means that we can use
> PlannerInfo->simple_rte_array in the planner and
> EState->es_range_table in the executor. The latter of those would be
> much nicer if we built an array instead of keeping the list (can
> probably build that during InitPlan()).  We could then get hold of a
> Relation object when needed without having to do relation_open(...,
> NoLock).
> 
> Code which currently looks like:
> 
> reloid = getrelid(scanrelid, estate->es_range_table);
> rel = heap_open(reloid, lockmode);
> 
> In places like ExecOpenScanRelation, could be replaced with some
> wrapper function call like: rte_open_rel(estate, scanrelid, lockmode);
> 
> This could also be used to ERROR out if rte_open_rel() was done
> initially with NoLock. Right now there are so many places that we do
> relation_open(..., NoLock) and write a comment /* Lock already taken
> by ... */, which we hope is always true.
> 
> If the Relation is already populated in the RangeTblEntry then the
> function would just return that, otherwise, we'd just look inside the
> RangeTblEntry for the relation Oid and do a heap_open on it, and store
> the lockmode that's held. We could then also consider getting of a
> bunch of fields like boundinfo and nparts from RelOptInfo.
> 
> We could also perhaps do a WARNING about lock upgrade hazards in
> there, at least maybe in debug builds.
> 
> However, I only spent about 10 mins looking into this, there may be
> some giant holes in the idea.  It would need much more research.

Will also need to consider caching of plans, that is of PlannedStmts,
which in turn contain RangeTblEntry nodes.  The idea that we can open a
relation at the beginning or even before planning and keep using the
Relation pointer all the way to the end of execution of a query seems hard
to realize.  As others pointed out, we need to think of planning and
execution as separate phases and will have to open/close relations separately.

Thanks,
Amit




Re: Possible bug in logical replication.

2018-06-15 Thread Masahiko Sawada
On Fri, Jun 15, 2018 at 5:06 AM, Alvaro Herrera
 wrote:
> Hello
>
> Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that
> Michaël's commit fixes the reported bug?
>

I haven't reviewed that commit deeply yet, but I have confirmed that the
reported bug is fixed using the following test case, which previously led
to an assertion failure. This test case is the same as the one in the
previous report [1].

postgres(1:128004)=# select pg_create_logical_replication_slot('s1',
'pgoutput');
2018-06-15 18:04:31.301 JST [128004] LOG:  logical decoding found
consistent point at 0/1645388
2018-06-15 18:04:31.301 JST [128004] DETAIL:  There are no running transactions.
2018-06-15 18:04:31.301 JST [128004] STATEMENT:  select
pg_create_logical_replication_slot('s1', 'pgoutput');
 pg_create_logical_replication_slot 
-------------------------------------
 (s1,0/16453C0)
(1 row)

postgres(1:128004)=# select pg_switch_wal();
 pg_switch_wal
---------------
 0/16453D8
(1 row)

postgres(1:128004)=# select pg_switch_wal();
 pg_switch_wal
---------------
 0/200
(1 row)

postgres(1:128004)=# select pg_replication_slot_advance('s1', '0/200');
2018-06-15 18:04:31.338 JST [128004] LOG:  starting logical decoding
for slot "s1"
2018-06-15 18:04:31.338 JST [128004] DETAIL:  Streaming transactions
committing after 0/16453C0, reading WAL from 0/1645388.
2018-06-15 18:04:31.338 JST [128004] STATEMENT:  select
pg_replication_slot_advance('s1', '0/200');
2018-06-15 18:04:31.339 JST [128004] LOG:  logical decoding found
consistent point at 0/1645388
2018-06-15 18:04:31.339 JST [128004] DETAIL:  There are no running transactions.
2018-06-15 18:04:31.339 JST [128004] STATEMENT:  select
pg_replication_slot_advance('s1', '0/200');
 pg_replication_slot_advance
------------------------------
 (s1,0/200)
(1 row)

postgres(1:128004)=# create table a (c int);
select pg_replication_slot_advance('s1', pg_current_wal_lsn());
CREATE TABLE
postgres(1:128004)=# select pg_replication_slot_advance('s1',
pg_current_wal_lsn());
2018-06-15 18:04:31.401 JST [128004] LOG:  starting logical decoding
for slot "s1"
2018-06-15 18:04:31.401 JST [128004] DETAIL:  Streaming transactions
committing after 0/200, reading WAL from 0/1645388.
2018-06-15 18:04:31.401 JST [128004] STATEMENT:  select
pg_replication_slot_advance('s1', pg_current_wal_lsn());
2018-06-15 18:04:31.402 JST [128004] LOG:  logical decoding found
consistent point at 0/1645388
2018-06-15 18:04:31.402 JST [128004] DETAIL:  There are no running transactions.
2018-06-15 18:04:31.402 JST [128004] STATEMENT:  select
pg_replication_slot_advance('s1', pg_current_wal_lsn());
 pg_replication_slot_advance
------------------------------
 (s1,0/2017828)
(1 row)

[1] 
https://www.postgresql.org/message-id/34d66f63-40a9-4c3e-c9a1-248d1e393d29%40enterprisedb.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Memory leaks in BufFileOpenShared()

2018-06-15 Thread Tatsuo Ishii
Hi Thomas,

> On Fri, Jun 15, 2018 at 8:42 PM, Antonin Houska  wrote:
>> Tatsuo Ishii  wrote:
>>
>>> The changes were made by this commit to add infrastructure for sharing
>>> temporary files between backends, according to the author (Thomas
>>> Munro).
>>>
>>> https://git.postgresql.org/pg/commitdiff/dc6c4c9dc2a111519b76b22dc86c5608223b
>>>
>>> Your proposal looks reasonable but I would like to hear from Thomas's
>>> opinion as well.
>>
>> o.k. He can actually have different feeling about details, e.g. if a new
>> function makeBufFileCommon() should be introduced or if makeBufFile() should
>> do the common settings. In the latter case, BufFileCreateTemp() would have to
>> do more work than it does now.
> 
> Thanks for finding these accidental duplications, and to Ishii-san for
> committing the fix.  Oops.
> 
> +1 for this refactoring.

Thanks for confirming. I will go ahead commit/push the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Memory leaks in BufFileOpenShared()

2018-06-15 Thread Thomas Munro
On Fri, Jun 15, 2018 at 8:42 PM, Antonin Houska  wrote:
> Tatsuo Ishii  wrote:
>
>> The changes were made by this commit to add infrastructure for sharing
>> temporary files between backends, according to the author (Thomas
>> Munro).
>>
>> https://git.postgresql.org/pg/commitdiff/dc6c4c9dc2a111519b76b22dc86c5608223b
>>
>> Your proposal looks reasonable but I would like to hear from Thomas's
>> opinion as well.
>
> o.k. He can actually have different feeling about details, e.g. if a new
> function makeBufFileCommon() should be introduced or if makeBufFile() should
> do the common settings. In the latter case, BufFileCreateTemp() would have to
> do more work than it does now.

Thanks for finding these accidental duplications, and to Ishii-san for
committing the fix.  Oops.

+1 for this refactoring.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Memory leaks in BufFileOpenShared()

2018-06-15 Thread Antonin Houska
Tatsuo Ishii  wrote:

> The changes were made by this commit to add infrastructure for sharing
> temporary files between backends, according to the author (Thomas
> Munro).
> 
> https://git.postgresql.org/pg/commitdiff/dc6c4c9dc2a111519b76b22dc86c5608223b
> 
> Your proposal looks reasonable but I would like to hear from Thomas's
> opinion as well.

o.k. He can actually have different feeling about details, e.g. if a new
function makeBufFileCommon() should be introduced or if makeBufFile() should
do the common settings. In the latter case, BufFileCreateTemp() would have to
do more work than it does now.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: why partition pruning doesn't work?

2018-06-15 Thread Amit Langote
On 2018/06/14 23:40, Tom Lane wrote:
> Robert Haas  writes:
>> On Thu, Jun 14, 2018 at 7:23 AM, David Rowley
>>  wrote:
>>> However, I only spent about 10 mins looking into this, there may be
>>> some giant holes in the idea.  It would need much more research.
> 
>> It kind of flies in the face of the idea that a RangeTblEntry is just
>> a node that can be freely copied around, serialized and deserialized,
>> etc.
> 
> And also the idea that the Plan tree is read-only to the executor,
> which is not a good property to give up.
> 
>> I think it would be better to keep the pointer in the RelOptInfo in
>> the planner and in the EState or PlanState in the executor.  Those are
>> things we don't think can be copied, serialized, etc.
> 
> Yeah, RelOptInfo seems like the natural place in the planner; we might
> need index relcache links in IndexOptInfo, too.
> 
> I'm less sure what to do in the executor.  We already do keep open
> relation pointers in PlanStates; the problem is just that it's
> node-type-specific (ss_currentRelation, iss_RelationDesc, etc).  Perhaps
> that's unavoidable and we should just add more such fields as needed.

The patch I mentioned upthread maintains an array of Relation pointers in
AppendState with as many members as there are in the partitioned_rels list
that appears in the corresponding Append plan.

I revised that patch a bit to rename ExecLockNonLeafAppendTables to
ExecOpenAppendPartitionedTables to sound consistent with
ExecOpenScanRelation et al.

Thanks,
Amit
From d55c30f8330081bc7c2eadea61d236a3b5de0f87 Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 14 Jun 2018 14:50:22 +0900
Subject: [PATCH] Open partitioned tables during Append initialization

---
 src/backend/executor/execPartition.c   |  40 ++--
 src/backend/executor/execUtils.c   | 173 ++---
 src/backend/executor/nodeAppend.c  |  33 ---
 src/backend/executor/nodeMergeAppend.c |   9 +-
 src/include/executor/execPartition.h   |   4 +-
 src/include/executor/executor.h|   4 +-
 src/include/nodes/execnodes.h  |   2 +
 7 files changed, 157 insertions(+), 108 deletions(-)

diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 7a4665cc4e..ea6b05934b 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1401,7 +1401,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap 
*map)
  * in each PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecCreatePartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecCreatePartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+ Relation 
*partitioned_rels)
 {
PartitionPruneState *prunestate;
PartitionPruningData *prunedata;
@@ -1440,6 +1441,7 @@ ExecCreatePartitionPruneState(PlanState *planstate, List 
*partitionpruneinfo)
PartitionPruneInfo *pinfo = castNode(PartitionPruneInfo, 
lfirst(lc));
	PartitionPruningData *pprune = &prunedata[i];
	PartitionPruneContext *context = &pprune->context;
+   Relationrel;
PartitionDesc partdesc;
PartitionKey partkey;
int partnatts;
@@ -1460,17 +1462,11 @@ ExecCreatePartitionPruneState(PlanState *planstate, 
List *partitionpruneinfo)
/* present_parts is also subject to later modification */
pprune->present_parts = bms_copy(pinfo->present_parts);
 
-   /*
-* We need to hold a pin on the partitioned table's relcache 
entry so
-* that we can rely on its copies of the table's partition key 
and
-* partition descriptor.  We need not get a lock though; one 
should
-* have been acquired already by InitPlan or
-* ExecLockNonLeafAppendTables.
-*/
-   context->partrel = relation_open(pinfo->reloid, NoLock);
-
-   partkey = RelationGetPartitionKey(context->partrel);
-   partdesc = RelationGetPartitionDesc(context->partrel);
+   Assert(partitioned_rels[i] != NULL);
+   rel = partitioned_rels[i];
+   Assert(RelationGetRelid(rel) == pinfo->reloid);
+   partkey = RelationGetPartitionKey(rel);
+   partdesc = RelationGetPartitionDesc(rel);
n_steps = list_length(pinfo->pruning_steps);
 
context->strategy = partkey->strategy;
@@ -1546,26 +1542,6 @@ ExecCreatePartitionPruneState(PlanState *planstate, List 
*partitionpruneinfo)
 }
 
 /*
- * ExecDestroyPartitionPruneState
- * Release resources at plan shutdown.
- *
- * We don't bother to free any memory here, since the whole executor context
- * will be going away shortly.  We do need to release our relcache pins.
- */
-void
-ExecDestroyPartitionPruneState(PartitionPruneState *prunestate)
-{
-

Re: Index maintenance function for BRIN doesn't check RecoveryInProgress()

2018-06-15 Thread Masahiko Sawada
On Fri, Jun 15, 2018 at 2:20 AM, Alvaro Herrera
 wrote:
> On 2018-Jun-14, Michael Paquier wrote:
>
>> On Thu, Jun 14, 2018 at 02:06:57AM +0900, Masahiko Sawada wrote:
>> > On Thu, Jun 14, 2018 at 12:04 AM, Simon Riggs  
>> > wrote:
>> >> On 13 June 2018 at 15:51, Alvaro Herrera  wrote:
>> >>> I guess you could go either way ... we're just changing one unhelpful
>> >>> error with a better one: there is no change in behavior.  I would
>> >>> backpatch this, myself, and avoid the code divergence.
>> >>
>> >> WAL control functions all say the same thing, so we can do that here also.
>> >
>> > +1
>>
>> +1 for a back-patch to v10.  The new error message brings clarity in my
>> opinion.
>
> Pushed, backpatched to 9.5.  Thanks!
>

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-06-15 Thread Amit Langote
On 2018/06/11 16:49, David Rowley wrote:
> On 11 June 2018 at 12:19, Tom Lane  wrote:
>> David Rowley  writes:
>>> On 10 June 2018 at 04:48, Tom Lane  wrote:
 So, IIUC, the issue is that for partitioning cases Append expects *all*
 its children to be partitions of the *same* partitioned table?  That
 is, you could also break it with

 select * from partitioned_table_a
 union all
 select * from partitioned_table_b

 ?
>>
>>> Not quite.

That would be correct, I think.  An Append may contain multiple partitioned
tables that all appear under a UNION ALL parent, as in the OP's case and
the example above.  In this case, the partitioned_rels list of the Append
consists of non-leaf tables from *all* of the partitioned tables.  Before
run-time pruning arrived, the only purpose of the partitioned_rels list was to
make sure that the executor goes through it and locks all of those
non-leaf tables (ExecLockNonLeafAppendTables).  Run-time pruning expanded
its usage by depending on it to generate the run-time pruning info.

>> I just had a thought that might lead to a nice solution to that, or
>> might be totally crazy.  What if we inverted the sense of the bitmaps
>> that track partition pruning state, so that instead of a bitmap of
>> valid partitions that need to be scanned, we had a bitmap of pruned
>> partitions that we know we don't need to scan?  (The indexes of this
>> bitmap would be subplan indexes not partition indexes.)  With this
>> representation, it doesn't matter if some of the Append's children
>> are not supposed to participate in pruning; they just don't ever get
>> added to the bitmap of what to skip.  It's also fairly clear, I think,
>> how to handle independent pruning rules for different top-level tables
>> that are being unioned together: just OR the what-to-skip bitmaps.
>> But there may be some reason why this isn't workable.
> 
> I think it would be less efficient. A common case and one that I very
> much would like to make as fast as possible is when all but a single
> partition is pruned. Doing the opposite sounds like more effort would
> need to be expended to get the subplans that we do need to scan.
> 
> I don't really see the way it works now as a huge problem to overcome
> in pruning. We'd just a list of subplans that don't belong to the
> hierarchy and tag them on to the matches found in
> ExecFindInitialMatchingSubPlans and ExecFindMatchingSubPlans. The
> bigger issue to overcome is the mixed flattened list of partition RT
> indexes in partitioned_rels.  Perhaps having a list of Lists for
> partitioned_rels could be used to resolve that. The question is more,
> how to solve for PG11. Do we need that?
> 
> I think we'll very soon be wanting to have ordered partition scans
> where something like:
> 
> create table listp(a int) partition by list(a);
> create index on listp(a);
> create table listp1 partition of listp for values in (1);
> create table listp2 partition of listp for values in (2);
> 
> and
> 
> select * from listp order by a;
> 
> would be possible with an Append and Index Scan, rather than having a
> MergeAppend or Sort. In which case we'll not want mixed partition
> hierarchies in the Append subplans. Although, perhaps that would mean
> we just wouldn't pullup AppendPaths which have PathKeys.
> 
> I have written and attached the patch to stop flattening of
> partitioned tables into UNION ALL parent's paths, meaning we can now
> get nested Append and MergeAppend paths.
> 
> I've added Robert too as I know he was the committer of partitioning
> and parallel Append. Maybe he has a view on what should be done about
> this? Is not flattening the paths a problem?

Not speaking for Robert here, just saying from what I know.

I don't think your patch breaks anything, even if it does change the shape of
the plan.  So, for:

select * from partitioned_table_a
union all
select * from partitioned_table_b

The only thing that changes with the patch is that
ExecLockNonLeafAppendTables is called *twice* for the two nested Appends
corresponding to partitioned_table_a and partitioned_table_b, resp.,
instead of just once for the top level Append corresponding to the UNION
ALL parent.  In fact, when called for the top level Append,
ExecLockNonLeafAppendTables is now a no-op.

Thanks,
Amit




Re: [GSoC] current working status

2018-06-15 Thread Aleksander Alekseeev
Hello Charles,

> The repo currently can be built on my Mac. You can check it out.
> Also I am working on the CI config to monitor each commit.

Tests pass on Arch Linux with PostgreSQL 11 and Ubuntu 16.04 with
PostgreSQL 9.6. However, there is a name conflict in the case of PostgreSQL
11: https://afiskon.ru/s/7a/fe681b17f0_paste.txt I suggest renaming
the procedure to `convert_int8_to_char`.

Also, GCC still generates a lot of warnings (see my previous message).
I would advise fixing all of these warnings; otherwise you can miss a
really important one. Ideally, the code should be compiled with the
-Wall flag.

-- 
Best regards,
Aleksander Alekseev



Re: Memory leaks in BufFileOpenShared()

2018-06-15 Thread Tatsuo Ishii
> Now I see that BufFileCreateShared() has similar problem with file->name.

Right.

> More generic problem I see is that the common initialization of BufFile is
> repeated a few times. The attached patch tries to improve that (it also fixes
> the duplicate allocation of file->name).

The changes were made by this commit to add infrastructure for sharing
temporary files between backends, according to the author (Thomas
Munro).

https://git.postgresql.org/pg/commitdiff/dc6c4c9dc2a111519b76b22dc86c5608223b

Your proposal looks reasonable but I would like to hear from Thomas's
opinion as well.

Thomas?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Memory leaks in BufFileOpenShared()

2018-06-15 Thread Antonin Houska
Now I see that BufFileCreateShared() has similar problem with file->name.

More generic problem I see is that the common initialization of BufFile is
repeated a few times. The attached patch tries to improve that (it also fixes
the duplicate allocation of file->name).

Tatsuo Ishii  wrote:

> > Memory is allocated twice for "file" and "files" variables. Possible fix:
> > 
> > diff --git a/src/backend/storage/file/buffile.c 
> > b/src/backend/storage/file/buffile.c
> > index d8a18dd3dc..00f61748b3 100644
> > --- a/src/backend/storage/file/buffile.c
> > +++ b/src/backend/storage/file/buffile.c
> > @@ -277,10 +277,10 @@ BufFileCreateShared(SharedFileSet *fileset, const 
> > char *name)
> >  BufFile *
> >  BufFileOpenShared(SharedFileSet *fileset, const char *name)
> >  {
> > -   BufFile*file = (BufFile *) palloc(sizeof(BufFile));
> > +   BufFile*file;
> > charsegment_name[MAXPGPATH];
> > Sizecapacity = 16;
> > -   File   *files = palloc(sizeof(File) * capacity);
> > +   File   *files;
> > int nfiles = 0;
> >  
> > file = (BufFile *) palloc(sizeof(BufFile));
> 
> Good catch. Thanks.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index d8a18dd3dc..e345d008b5 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -99,6 +99,7 @@ struct BufFile
 	char		buffer[BLCKSZ];
 };
 
+static BufFile *makeBufFileCommon(int nfiles);
 static BufFile *makeBufFile(File firstfile);
 static void extendBufFile(BufFile *file);
 static void BufFileLoadBuffer(BufFile *file);
@@ -106,21 +107,16 @@ static void BufFileDumpBuffer(BufFile *file);
 static int	BufFileFlush(BufFile *file);
 static File MakeNewSharedSegment(BufFile *file, int segment);
 
-
 /*
- * Create a BufFile given the first underlying physical file.
- * NOTE: caller must set isInterXact if appropriate.
+ * Create BufFile and perform the common initialization.
  */
 static BufFile *
-makeBufFile(File firstfile)
+makeBufFileCommon(int nfiles)
 {
 	BufFile*file = (BufFile *) palloc(sizeof(BufFile));
 
-	file->numFiles = 1;
-	file->files = (File *) palloc(sizeof(File));
-	file->files[0] = firstfile;
-	file->offsets = (off_t *) palloc(sizeof(off_t));
-	file->offsets[0] = 0L;
+	file->numFiles = nfiles;
+	file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
 	file->isInterXact = false;
 	file->dirty = false;
 	file->resowner = CurrentResourceOwner;
@@ -128,6 +124,21 @@ makeBufFile(File firstfile)
 	file->curOffset = 0L;
 	file->pos = 0;
 	file->nbytes = 0;
+
+	return file;
+}
+
+/*
+ * Create a BufFile given the first underlying physical file.
+ * NOTE: caller must set isInterXact if appropriate.
+ */
+static BufFile *
+makeBufFile(File firstfile)
+{
+	BufFile*file = makeBufFileCommon(1);
+
+	file->files = (File *) palloc(sizeof(File));
+	file->files[0] = firstfile;
 	file->readOnly = false;
 	file->fileset = NULL;
 	file->name = NULL;
@@ -246,23 +257,12 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 {
 	BufFile*file;
 
-	file = (BufFile *) palloc(sizeof(BufFile));
+	file = makeBufFileCommon(1);
 	file->fileset = fileset;
 	file->name = pstrdup(name);
-	file->numFiles = 1;
 	file->files = (File *) palloc(sizeof(File));
 	file->files[0] = MakeNewSharedSegment(file, 0);
-	file->offsets = (off_t *) palloc(sizeof(off_t));
-	file->offsets[0] = 0L;
-	file->isInterXact = false;
-	file->dirty = false;
-	file->resowner = CurrentResourceOwner;
-	file->curFile = 0;
-	file->curOffset = 0L;
-	file->pos = 0;
-	file->nbytes = 0;
 	file->readOnly = false;
-	file->name = pstrdup(name);
 
 	return file;
 }
@@ -286,4 +286,3 @@
-	file = (BufFile *) palloc(sizeof(BufFile));
 	files = palloc(sizeof(File) * capacity);
 
 	/*
@@ -317,16 +316,8 @@ BufFileOpenShared(SharedFileSet *fileset, const char *name)
 (errcode_for_file_access(),
  errmsg("could not open BufFile \"%s\"", name)));
 
-	file->numFiles = nfiles;
+	file = makeBufFileCommon(nfiles);
 	file->files = files;
-	file->offsets = (off_t *) palloc0(sizeof(off_t) * nfiles);
-	file->isInterXact = false;
-	file->dirty = false;
-	file->resowner = CurrentResourceOwner;	/* Unused, can't extend */
-	file->curFile = 0;
-	file->curOffset = 0L;
-	file->pos = 0;
-	file->nbytes = 0;
 	file->readOnly = true;		/* Can't write to files opened this way */
 	file->fileset = fileset;
 	file->name = pstrdup(name);


Re: Attempt to fix inheritance limitations: unique and foreign key constraints

2018-06-15 Thread Raphael Medaer

Hi Robert,


> Sounds very similar to commit 3de241dba86f3dd000434f70aebba725fb928032.


I'm reusing part of the code from this commit (mainly
"CloneForeignKeyConstraints").

But this commit concerns only partitioned tables, not inherited tables.

Kind regards,

--
Raphael Medaer
Product Development Engineer
Escaux

Escaux, Communication as easy as the web
Chaussée de Bruxelles 408, 1300 Wavre, Belgium
Direct: +3227887564
Main: +3226860900
www.escaux.com 



Re: Getting "ERROR: did not find all requested child rels in append_rel_list" when enable_partition_pruning is on

2018-06-15 Thread Amit Langote
On 2018/06/15 16:32, Rajkumar Raghuwanshi wrote:
> Hi,
> 
> I am getting "ERROR:  did not find all requested child rels in
> append_rel_list" when enable_partition_pruning is on for below test case.
> 
> CREATE TABLE test(c1 int, c2 int) PARTITION BY RANGE(c1);
> CREATE TABLE test_p1 PARTITION OF test FOR VALUES FROM (minvalue) TO (0);
> CREATE TABLE test_p2 PARTITION OF test FOR VALUES FROM  (0) TO (maxvalue);
> select * from (select * from test a union all select *  from test b) ss
> where (c1 >= c2);
> 
> postgres=# set enable_partition_pruning to off;
> SET
> postgres=# select * from (select * from test a union all select *  from
> test b) ss where (c1 >= c2);
>  c1 | c2
> ----+----
> (0 rows)
> 
> postgres=#
> postgres=# set enable_partition_pruning to on;
> SET
> postgres=# select * from (select * from test a union all select *  from
> test b) ss where (c1 >= c2);
> ERROR:  did not find all requested child rels in append_rel_list

Thanks for the report.  This appears to be the same problem as the one being
discussed on the following thread, where David Rowley seems to have already
posted a patch to fix this issue:

https://www.postgresql.org/message-id/flat/HE1PR03MB17068BB27404C90B5B788BCABA7B0%40HE1PR03MB1706.eurprd03.prod.outlook.com

Thanks,
Amit




Getting "ERROR: did not find all requested child rels in append_rel_list" when enable_partition_pruning is on

2018-06-15 Thread Rajkumar Raghuwanshi
Hi,

I am getting "ERROR:  did not find all requested child rels in
append_rel_list" when enable_partition_pruning is on for below test case.

CREATE TABLE test(c1 int, c2 int) PARTITION BY RANGE(c1);
CREATE TABLE test_p1 PARTITION OF test FOR VALUES FROM (minvalue) TO (0);
CREATE TABLE test_p2 PARTITION OF test FOR VALUES FROM  (0) TO (maxvalue);
select * from (select * from test a union all select *  from test b) ss
where (c1 >= c2);

postgres=# set enable_partition_pruning to off;
SET
postgres=# select * from (select * from test a union all select *  from
test b) ss where (c1 >= c2);
 c1 | c2
----+----
(0 rows)

postgres=#
postgres=# set enable_partition_pruning to on;
SET
postgres=# select * from (select * from test a union all select *  from
test b) ss where (c1 >= c2);
ERROR:  did not find all requested child rels in append_rel_list


Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Memory leaks in BufFileOpenShared()

2018-06-15 Thread Tatsuo Ishii
> Memory is allocated twice for "file" and "files" variables. Possible fix:
> 
> diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
> index d8a18dd3dc..00f61748b3 100644
> --- a/src/backend/storage/file/buffile.c
> +++ b/src/backend/storage/file/buffile.c
> @@ -277,10 +277,10 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
>  BufFile *
>  BufFileOpenShared(SharedFileSet *fileset, const char *name)
>  {
> -   BufFile    *file = (BufFile *) palloc(sizeof(BufFile));
> +   BufFile    *file;
>     char        segment_name[MAXPGPATH];
>     Size        capacity = 16;
> -   File       *files = palloc(sizeof(File) * capacity);
> +   File       *files;
>     int         nfiles = 0;
> 
>     file = (BufFile *) palloc(sizeof(BufFile));

Good catch. Thanks.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Memory leaks in BufFileOpenShared()

2018-06-15 Thread Antonin Houska
Memory is allocated twice for "file" and "files" variables. Possible fix:

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index d8a18dd3dc..00f61748b3 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -277,10 +277,10 @@ BufFileCreateShared(SharedFileSet *fileset, const char *name)
 BufFile *
 BufFileOpenShared(SharedFileSet *fileset, const char *name)
 {
-   BufFile    *file = (BufFile *) palloc(sizeof(BufFile));
+   BufFile    *file;
    char        segment_name[MAXPGPATH];
    Size        capacity = 16;
-   File       *files = palloc(sizeof(File) * capacity);
+   File       *files;
    int         nfiles = 0;
 
    file = (BufFile *) palloc(sizeof(BufFile));


-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-15 Thread Amit Langote
On 2018/06/14 22:45, Ashutosh Bapat wrote:
> On Thu, Jun 14, 2018 at 6:49 PM, Robert Haas  wrote:
>> It sounds like you want to try to hide from users the fact that they
>> can create triggers on the individual partitions.
> 
> No. I never said that in my mails (see [1], [2]) I object to the
> explicit suggestion that they can/should create BEFORE ROW triggers on
> partitions since those are not supported on the partitioned table.

I understand Ashutosh's worry as follows:

A workaround for the limitation that BR triggers cannot yet be defined on
partitioned tables is to define that *same* trigger on all partitions, which
has worked since the beginning (PG 10), so that gets the job done.  But some
users may fail to ensure that the trigger's definition is exactly alike on
each partition; in particular, they are likely to make each trigger call a
different function, even though each of those functions may have the same
body of code.  When, in some future release, one is able to define the
trigger on the partitioned table and does so, the trigger will end up being
created again on each partition, because the existing trigger on each
partition (after upgrading) would look different due to a different function
being called in each.
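
To make the workaround concrete, here is a minimal sketch using the
measurement example from ddl.sgml; the trigger and function names and the
logdate default are made up for illustration.  The point is that every
partition gets a trigger with an identical definition, all calling one
shared function:

CREATE FUNCTION measurement_set_logdate() RETURNS trigger AS $$
BEGIN
    -- Fill in a default partition key value; the body is irrelevant here,
    -- what matters is that all partitions share this one function.
    NEW.logdate := coalesce(NEW.logdate, current_date);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- The same trigger definition, repeated on each partition.
CREATE TRIGGER set_logdate BEFORE INSERT ON measurement_y2008m01
    FOR EACH ROW EXECUTE PROCEDURE measurement_set_logdate();
CREATE TRIGGER set_logdate BEFORE INSERT ON measurement_y2008m02
    FOR EACH ROW EXECUTE PROCEDURE measurement_set_logdate();

If each partition's trigger called its own copy of the function instead, the
definitions would no longer be exactly alike, which is the situation the
reworded text tries to warn against.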

I proposed that we reword the sentence that describes this particular
limitation to warn users about that.  See if the attached patch is any
good for that.

Thanks,
Amit
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 0258391154..781536c44b 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3349,8 +3349,10 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
     <listitem>
      <para>
-       BEFORE ROW triggers, if necessary, must be defined
-       on individual partitions, not the partitioned table.
+       Creating BEFORE ROW triggers on partitioned tables
+       is not supported.  A workaround is to create such a trigger on individual
+       partitions, but make sure that its definition is identical across all
+       partitions, including the function that's called from the trigger.
      </para>
     </listitem>
 


Re: Two round for Client Authentication

2018-06-15 Thread Yinjie Lin
Many thanks to Marko and David for your replies. They really helped.

Now I am playing with the auth_delay extension, which uses
ClientAuthentication_hook.  But I find it hard to distinguish psql's first
connection attempt from the second one made with an empty password, since
the variable 'status' is STATUS_EOF in both cases.  Maybe I should dive
deeper into the code.
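
For reference, here is a minimal sketch of how such a hook is typically
installed, modelled loosely on contrib/auth_delay.  The module and function
names are made up, and it only logs what the hook receives; it is not meant
as a solution to telling the two attempts apart:

#include "postgres.h"

#include "fmgr.h"
#include "libpq/auth.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* Remember any previously installed hook so we can chain to it. */
static ClientAuthentication_hook_type prev_client_auth_hook = NULL;

static void
log_auth_attempt(Port *port, int status)
{
	if (prev_client_auth_hook)
		prev_client_auth_hook(port, status);

	/*
	 * As noted above, psql's initial attempt (no password supplied) and
	 * the later attempt with an empty password can both arrive here with
	 * the same status value, so status alone is not enough to tell them
	 * apart.
	 */
	elog(LOG, "authentication hook: user=\"%s\" status=%d",
		 port->user_name ? port->user_name : "(unknown)", status);
}

void
_PG_init(void)
{
	prev_client_auth_hook = ClientAuthentication_hook;
	ClientAuthentication_hook = log_auth_attempt;
}

Like auth_delay, a module of this kind would have to be listed in
shared_preload_libraries so that the hook is in place before authentication
runs in each backend.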

Regards,