Re: [HACKERS] identity columns

2017-04-20 Thread Noah Misch
On Tue, Apr 18, 2017 at 11:53:36AM -0400, Peter Eisentraut wrote:
> On 4/18/17 02:07, Noah Misch wrote:
> > This PostgreSQL 10 open item is past due for your status update.  Kindly
> > send a status update within 24 hours, and include a date for your
> > subsequent status update.  Refer to the policy on open item ownership:
> > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> I am in disagreement with the submitter that this is a problem or what
> the solution should be.  The discussion is ongoing.  Note that the
> current state isn't actually broken, so it doesn't have to hold anything up.

I see.  If anyone in addition to the submitter thinks making a change in this
area qualifies as a PostgreSQL 10 open item, please speak up.  Otherwise, on
Monday, I'll reclassify this as a non-bug.




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-04-20 Thread Ashutosh Bapat
On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas  wrote:
>> You seem to be suggesting that we keep as many sets of
>> partition bounds as there are base relations participating in the join
>> and then use appropriate partition bounds based on the columns in the
>> join conditions, so that we can use the same operator as used in the
>> join condition. That doesn't seem to be a good option, since the
>> partition bounds will all have the same values, differing only in
>> their binary representation because of differences in data types.
>
> Well, actually, I think it is a good option, as I wrote in
> http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhsj539wtqa7cayj-hcecc...@mail.gmail.com

I guess you are now conflating the partition bounds of a join relation
with the partition bounds of a base relation. The above paragraph is
about the partition bounds of a join relation. I have already agreed
that we need to store partition bounds in RelOptInfo. For a base
relation this is trivial; its RelOptInfo has to store the partition
bounds as stored in the partition descriptor of the corresponding
partitioned table. What I am talking about is the partition bounds of a
join relation. See below for more explanation.

>
> In that email, my principal concern was allowing partition-wise join
> to succeed even with slightly different sets of partition boundaries
> on the two sides of the join; in particular, if we've got A with A1 ..
> A10 and B with B1 .. B10 and the DBA adds A11, I don't want
> performance to tank until the DBA gets around to adding B11.  Removing
> the partition bounds from the PartitionScheme and storing them
> per-RelOptInfo fixes that problem;

We are in agreement on this.

> the fact that it also solves this
> problem of what happens when we have different data types on the two
> sides looks to me like a second reason to go that way.

I don't see how that fixes it. For a join relation we need to come up
with one set of partition bounds by merging the partition bounds of the
joining relations, and in order to know how to interpret the datums in
those bounds, we need to associate a data type with them. The question
is which data type we should use if the relations being joined have
different data types associated with their respective partition bounds.

Or are you saying that we don't need to associate a data type with the
merged partition bounds? In that case, I don't see how we would compare
the partition bounds of two relations.

In your example, A has a partition key of type int8 with bound datums
X1 .. X10, B has a partition key of type int4 with bound datums X1 ..
X11, and C has a partition key of type int2 with bound datums X1 .. X12.
The binary representation of the X's is going to differ between A, B and
C, even though each Xk is equal across A, B and C wherever it exists. A
join between A and B will have merged bound datums X1 .. X10 (and X11,
depending upon the join type). In order to match the bounds of AB with
those of C, we need to know the data type of AB's bounds so that we can
choose the appropriate equality operator. The question is which data
type we should choose for the partition bounds of AB, int8 or int4.
This is different from applying join conditions between AB and C, where
we can choose the right opfamily operator based on the join conditions
themselves.
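
To make the setup concrete, here is a minimal SQL sketch of that
situation (table names and bounds are invented for illustration):

    -- Two partitioned tables whose bounds are logically equal but whose
    -- partition keys have different types, hence different binary datum
    -- representations in their partition descriptors.
    CREATE TABLE a (k int8) PARTITION BY RANGE (k);
    CREATE TABLE a1 PARTITION OF a FOR VALUES FROM (0) TO (100);
    CREATE TABLE a2 PARTITION OF a FOR VALUES FROM (100) TO (200);

    CREATE TABLE b (k int4) PARTITION BY RANGE (k);
    CREATE TABLE b1 PARTITION OF b FOR VALUES FROM (0) TO (100);
    CREATE TABLE b2 PARTITION OF b FOR VALUES FROM (100) TO (200);

    -- A partition-wise join must merge the two bound sets into one set
    -- for the join relation AB; the open question above is whether those
    -- merged datums should be treated as int8 or int4.
    SELECT * FROM a JOIN b ON a.k = b.k;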

>
> And there's a third reason, too, which is that the opfamily mechanism
> doesn't currently provide any mechanism for reasoning about which data
> types are "wider" or "narrower" in the way that you want.  In general,
> there's not even a reason why such a relationship has to exist;
> consider two data types t1 and t2 with opclasses t1_ops and t2_ops
> that are part of the same opfamily t_ops, and suppose that t1 can
> represent any positive integer and t2 can represent any even integer,
> or in general that each data type can represent some but not all of
> the values that can be represented by the other data type.  In such a
> case, neither would be "wider" than the other in the sense that you
> need; you essentially want to find a data type within the opfamily to
> which all values of any of the types involved in the query can be cast
> without error, but there is nothing today which requires such a data
> type to exist, and no way to identify which one it is.  In practice,
> for all of the built-in opfamilies that have more than one opclass,
> such a data type always exists but is not always unique -- in
> particular, datetime_ops contains date_ops, timestamptz_ops, and
> timestamp_ops, and either of the latter two is a plausible choice for
> the "widest" data type of the three.  But there's no way to figure
> that out from the opfamily or opclass information we have today.
>
> In theory, it would be possible to modify the opfamily machinery so
> that every opfamily designates an optional ordering of types from
> "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
> guarantee that every value of type t1 can be cast without error to a
> value of type t2.  But I think that's a bad plan.  It means that every
> opfam
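
For reference, the opclasses grouped under datetime_ops can be listed
from the catalogs, though, as noted above, nothing in this output marks
one input type as the "widest" (a query sketch):

    -- List the btree opclasses belonging to the datetime_ops opfamily.
    SELECT c.opcname, format_type(c.opcintype, NULL) AS input_type
    FROM pg_opclass c
    JOIN pg_opfamily f ON f.oid = c.opcfamily
    JOIN pg_am a ON a.oid = f.opfmethod
    WHERE f.opfname = 'datetime_ops' AND a.amname = 'btree';
    -- Expected: date_ops, timestamptz_ops and timestamp_ops.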

Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-04-20 Thread Noah Misch
On Fri, Apr 21, 2017 at 01:20:05PM +0900, Masahiko Sawada wrote:
> On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch  wrote:
> > On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:
> >> On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch  wrote:
> >> > On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:
> >> >> As I said at the start, this is not a bug. There are some proposals
> >> >> for a better design of the priority column in pg_stat_replication,
> >> >> but we've not reached consensus yet. So I think it's better to move
> >> >> this open item to the "Design Decisions to Recheck Mid-Beta" section
> >> >> so that we can hear more opinions.
> >> >
> >> > I'm reading that some people want to report NULL priority, some
> >> > people want to report a constant 1 priority, and nobody wants the
> >> > current behavior.  Is that an accurate summary?
> >>
> >> Yes, I think that's correct.
> >
> > Okay, but ...
> >
> >> FWIW, the reason for the current behavior is that it would be useful
> >> for a user who wants to switch from ANY to FIRST: they can see which
> >> standbys will become sync or potential.
> >
> > ... does this mean you personally want to keep the current behavior?
> > If not, has some other person stated a wish to keep the current
> > behavior?
> 
> No, I want to change the current behavior. IMO it's better to set
> priority 1 for all standbys in the quorum set. I guess there is no
> longer anyone who supports the current behavior.

In that case, this open item is not eligible for section "Design Decisions to
Recheck Mid-Beta".  That section is for items where we'll probably change
nothing, but we plan to recheck later just in case.  Here, we expect to change
the behavior; the open question is which replacement behavior to prefer.

Fujii, as the owner of this open item, you are responsible for moderating the
debate until there's adequate consensus to make a particular change or to keep
the current behavior after all.  Please proceed to do that.  Beta testers
deserve a UI they may like, not a UI you already plan to change later.

Thanks,
nm




Re: [HACKERS] AGG_HASHED cost estimate

2017-04-20 Thread Ashutosh Bapat
On Thu, Apr 20, 2017 at 11:35 PM, Jeff Janes  wrote:

> Regardless, it seems like something is
> getting overlooked.

I agree with this.

> The way final_cost_hashjoin charges for the actual data
> comparison is via pg_proc.procost, rather than just assuming 1.0.  I don't
> know if we would want to go to that effort in cost_agg or not;

But aren't we already doing that, as (cpu_operator_cost * numGroupCols)
* input_tuples? Maybe we should use the procost of cur_eq_funcs instead
of a flat cpu_operator_cost.

> I assume that
> there was a reason the code was put in final_cost_hashjoin rather than
> initial_cost_hashjoin.

I think this is part of final_cost_hashjoin because it might need a
pg_proc cache lookup. The lookup can be avoided if the initial cost is
already higher than the existing path's cost.


>
> The initial_cost_hashjoin also throws in an addition of cpu_tuple_cost, "to
> model the costs of inserting the row into the hashtable". Based on the gprof
> and perf output of some very simple aggregates, I would say that
> cpu_tuple_cost is if anything an underestimate, and that it applies to all
> the hash table look ups, whether they end up inserting (about numGroups) or
> finding an existing one (approximately input_tuples - numGroups).  Currently
> in AGG_HASHED that is charged only for numGroups, although I don't know if
> that charge is for inserting into the hash table, or for walking the hash
> table at the end, projecting out tuples.   That it is charged to total_cost
> rather than startup_cost suggests it is meant to apply to walking the hash
> table at the end, rather than inserting into it.

Yes. It's for final projection.

> Maybe it should be charged
> both on the way in and on the way out?

Hash lookup and insertion are costed as

startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;

Maybe that needs to change.
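
A rough sketch of what that change might look like, assuming an array
eqfuncoids[] with the OIDs of the grouping columns' equality functions
is available at that point (the helper and variable names here are
invented):

    #include "postgres.h"
    #include "optimizer/cost.h"     /* cpu_operator_cost */
    #include "utils/lsyscache.h"    /* get_func_cost() */

    /*
     * Per-input-tuple comparison cost for hashed grouping, charging each
     * equality function's procost (scaled by cpu_operator_cost, per the
     * usual convention) instead of a flat cpu_operator_cost per column.
     */
    static Cost
    hashed_grouping_cmp_cost(Oid *eqfuncoids, int numGroupCols)
    {
        Cost    cost = 0.0;
        int     i;

        for (i = 0; i < numGroupCols; i++)
            cost += get_func_cost(eqfuncoids[i]) * cpu_operator_cost;

        return cost;
    }

cost_agg() would then charge hashed_grouping_cmp_cost(...) * input_tuples
into startup_cost in place of the expression above.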


>
> Both gprof and perf agree that tuplehash_insert and ExecStoreMinimalTuple
> are quite a bit more expensive than either texteq or hash_any.  This is with
> large hash tables (25 million tuples hashed to 3 million aggregates) and I
> think a lot of the time goes to CPU cache misses, so they might not be so
> bad if the hash tables were smaller.  I don't know how to model this,
> though, if we need it to be accurate over both regimes.

I have not seen our cost model account for CPU cache behaviour; it
assumes optimal performance in that respect. But maybe we want to start
modelling it.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-04-20 Thread Kyotaro HORIGUCHI
At Fri, 21 Apr 2017 13:20:05 +0900, Masahiko Sawada  
wrote in 
> On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch  wrote:
> > On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:
> >> On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch  wrote:
> >> > On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:
> >> >> On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch  wrote:
> >> >> > On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:
> >> >> >> On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:
> >> >> >> > On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:
> >> >> >>
> >> >> >> > > > On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:
> >> >> >> > > >> (3)
> >> >> >> > > >> The priority value is assigned to each standby listed in
> >> >> >> > > >> s_s_names even in quorum commit, though those priority values
> >> >> >> > > >> are not used at all.
> >> >> >> > > >> Users can see those priority values in pg_stat_replication.
> >> >> >> > > >> Isn't this confusing? If yes, it might be better to always
> >> >> >> > > >> assign 1 as the priority, for example.
> >> >
> >> >> >> This PostgreSQL 10 open item is past due for your status update.
> >> >> >> Kindly send a status update within 24 hours, and include a date for
> >> >> >> your subsequent status update.  Refer to the policy on open item
> >> >> >> ownership:
> >> >> >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> >> >
> >> >> >> > Since you do want (3) to change, please own it like any other
> >> >> >> > open item, including the mandatory status updates.
> >> >> >>
> >> >> >> Likewise.
> >> >>
> >> >> As I said at the start, this is not a bug. There are some proposals
> >> >> for a better design of the priority column in pg_stat_replication,
> >> >> but we've not reached consensus yet. So I think it's better to move
> >> >> this open item to the "Design Decisions to Recheck Mid-Beta" section
> >> >> so that we can hear more opinions.
> >> >
> >> > I'm reading that some people want to report NULL priority, some
> >> > people want to report a constant 1 priority, and nobody wants the
> >> > current behavior.  Is that an accurate summary?
> >>
> >> Yes, I think that's correct.
> >
> > Okay, but ...
> >
> >> FWIW, the reason for the current behavior is that it would be useful
> >> for a user who wants to switch from ANY to FIRST: they can see which
> >> standbys will become sync or potential.
> >
> > ... does this mean you personally want to keep the current behavior?
> > If not, has some other person stated a wish to keep the current
> > behavior?
> 
> No, I want to change the current behavior. IMO it's better to set
> priority 1 for all standbys in the quorum set. I guess there is no
> longer anyone who supports the current behavior.

+1 for the latter. For the former, I'd like to distinguish standbys
that are in sync from those that are not, in that field or somewhere,
if we can accept the additional complexity.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] some review comments on logical rep code

2017-04-20 Thread Kyotaro HORIGUCHI
At Wed, 19 Apr 2017 10:59:00 +0200, Petr Jelinek  
wrote in <3ef9c831-0508-51a9-5ded-c2e31e958...@2ndquadrant.com>
> On 19/04/17 10:45, Kyotaro HORIGUCHI wrote:
> > At Wed, 19 Apr 2017 17:43:17 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20170419.174317.114509231.horiguchi.kyot...@lab.ntt.co.jp>
> >> At Wed, 19 Apr 2017 10:33:29 +0200, Petr Jelinek 
> >>  wrote in 
> >> 
>> Some other process may modify it and then get there. Through a kind
>> of priority inversion, that process may modify the data in memory
>> *before* we modify it. Of course this is rather unrealistic, though.
> > 
> > That was a bit terse; restated:
> > 
> > Some other process may modify it *after* we read it and then get
> > there first. Through a kind of priority inversion, that process may
> > modify the data in memory *before* we modify it. Of course this is
> > rather unrealistic, though.
> > 
> 
> Yeah, but I think that's a risk anyway in an MVCC read-committed
> database; the only way to protect against it would be to hold a lock
> over the catalog access, which is what we wanted to get rid of :)

Agreed; or rather, I'm not sure what the real risks are.

> BTW, the whole sync coordination is explicitly written so that, unless
> you mess with the catalogs manually, only the tablesync process does
> UPDATEs to that catalog (with the exception of the s->r state switch,
> which can happen in apply and has no effect on sync because both states
> mean that synchronization is done; it only makes apply stop tracking
> the table individually).

Hmm. Inhibiting retrospective updates of the in-memory data would guard
against some kinds of priority inversion, but that sort of manual
operation is similar to overwriting pg_control, and I think it is not
worth worrying about :p

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-04-20 Thread Masahiko Sawada
On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch  wrote:
> On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:
>> On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch  wrote:
>> > On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:
>> >> On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch  wrote:
>> >> > On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:
>> >> >> On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:
>> >> >> > On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:
>> >> >>
>> >> >> > > > On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:
>> >> >> > > >> (3)
>> >> >> > > >> The priority value is assigned to each standby listed in
>> >> >> > > >> s_s_names even in quorum commit, though those priority values
>> >> >> > > >> are not used at all.
>> >> >> > > >> Users can see those priority values in pg_stat_replication.
>> >> >> > > >> Isn't this confusing? If yes, it might be better to always
>> >> >> > > >> assign 1 as the priority, for example.
>> >
>> >> >> This PostgreSQL 10 open item is past due for your status update.
>> >> >> Kindly send a status update within 24 hours, and include a date for
>> >> >> your subsequent status update.  Refer to the policy on open item
>> >> >> ownership:
>> >> >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>> >
>> >> >> > Since you do want (3) to change, please own it like any other
>> >> >> > open item, including the mandatory status updates.
>> >> >>
>> >> >> Likewise.
>> >>
>> >> As I said at the start, this is not a bug. There are some proposals
>> >> for a better design of the priority column in pg_stat_replication,
>> >> but we've not reached consensus yet. So I think it's better to move
>> >> this open item to the "Design Decisions to Recheck Mid-Beta" section
>> >> so that we can hear more opinions.
>> >
>> > I'm reading that some people want to report NULL priority, some
>> > people want to report a constant 1 priority, and nobody wants the
>> > current behavior.  Is that an accurate summary?
>>
>> Yes, I think that's correct.
>
> Okay, but ...
>
>> FWIW, the reason for the current behavior is that it would be useful
>> for a user who wants to switch from ANY to FIRST: they can see which
>> standbys will become sync or potential.
>
> ... does this mean you personally want to keep the current behavior?  If not,
> has some other person stated a wish to keep the current behavior?

No, I want to change the current behavior. IMO it's better to set
priority 1 for all standbys in the quorum set. I guess there is no
longer anyone who supports the current behavior.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-20 Thread Michael Paquier
On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
 wrote:
> On 4/20/17 07:52, Petr Jelinek wrote:
>> On 20/04/17 05:57, Michael Paquier wrote:
>>> 2nd thoughts here... Ah now I see your point. True that there is no
>>> way to ensure that an unwanted command is not running when SIGUSR2 is
>>> received as the shutdown checkpoint may have already begun. Here is an
>>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
>>> the shutdown checkpoint does not run as long as all WAL senders still
>>> running do not reach such a state.
>>
>> +1 to this solution
>
> Michael, can you attempt to supply a patch?

Hmm. I have actually been looking at this solution and I have doubts
about its robustness. In short, this would need to be roughly a
two-phase process:
- In PostmasterStateMachine(), SIGUSR2 is sent to the checkpointer to
make it call ShutdownXLOG(). Before doing that, a first signal should
be sent to all the WAL senders with
SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
used.
- On receipt of this signal, all WAL senders switch to a stopping
state, refusing commands that can generate WAL.
- The checkpointer looks at the state of all WAL senders, looping with
a sleep of a couple of ms, and refuses to launch the shutdown
checkpoint until all WAL senders have switched to the stopping state.
- In reaper(), once the checkpointer is confirmed as stopped, signal
the WAL senders again and tell them to perform their last loop.
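
For reference, a minimal sketch of the state addition this implies in
src/include/replication/walsender_private.h (the existing states are as
in the current tree; WALSNDSTATE_STOPPING is the proposed one):

    typedef enum WalSndState
    {
        WALSNDSTATE_STARTUP = 0,
        WALSNDSTATE_BACKUP,
        WALSNDSTATE_CATCHUP,
        WALSNDSTATE_STREAMING,
        WALSNDSTATE_STOPPING    /* proposed: refuse WAL-generating commands */
    } WalSndState;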

After that, I had a second, simpler idea.
CheckpointerShmem->ckpt_flags holds information about checkpoints
currently running, so we could have the WAL senders look at that data
and refuse any commands that generate WAL. The checkpointer may already
have stopped by the time the WAL senders finish their loop, so we also
need to check whether the checkpointer is still running in those code
paths. Such safeguards may actually be enough for the problem of this
thread. Thoughts?
-- 
Michael




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-04-20 Thread Noah Misch
On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:
> On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch  wrote:
> > On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:
> >> On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch  wrote:
> >> > On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:
> >> >> On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:
> >> >> > On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:
> >> >>
> >> >> > > > On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:
> >> >> > > >> (3)
> >> >> > > >> The priority value is assigned to each standby listed in
> >> >> > > >> s_s_names even in quorum commit, though those priority values
> >> >> > > >> are not used at all.
> >> >> > > >> Users can see those priority values in pg_stat_replication.
> >> >> > > >> Isn't this confusing? If yes, it might be better to always
> >> >> > > >> assign 1 as the priority, for example.
> >
> >> >> This PostgreSQL 10 open item is past due for your status update.
> >> >> Kindly send a status update within 24 hours, and include a date for
> >> >> your subsequent status update.  Refer to the policy on open item
> >> >> ownership:
> >> >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> >
> >> >> > Since you do want (3) to change, please own it like any other
> >> >> > open item, including the mandatory status updates.
> >> >>
> >> >> Likewise.
> >>
> >> As I said at the start, this is not a bug. There are some proposals
> >> for a better design of the priority column in pg_stat_replication,
> >> but we've not reached consensus yet. So I think it's better to move
> >> this open item to the "Design Decisions to Recheck Mid-Beta" section
> >> so that we can hear more opinions.
> >
> > I'm reading that some people want to report NULL priority, some people
> > want to report a constant 1 priority, and nobody wants the current
> > behavior.  Is that an accurate summary?
> 
> Yes, I think that's correct.

Okay, but ...

> FWIW, the reason for the current behavior is that it would be useful
> for a user who wants to switch from ANY to FIRST: they can see which
> standbys will become sync or potential.

... does this mean you personally want to keep the current behavior?  If not,
has some other person stated a wish to keep the current behavior?




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Andres Freund
On 2017-04-20 20:10:41 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
> >> Also, if it's not there we'd fall back to using plain poll(), which is
> >> not so awful that we need to work hard to avoid it.  I'd just as soon
> >> keep the number of combinations down.
> 
> > Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
> > combinations?
> 
> True, if you just did it that way unconditionally.  But doesn't that
> require an extra kernel call per CreateWaitEventSet()?

It does - the question is whether that matters much.  FE/BE uses a
persistent wait set, but unfortunately many of the other latch users
don't. And some of them can fire fairly frequently - so I guess that'd
possibly be measurable.  OK, so I'm on board with epoll_create1.

If somebody were to change more frequent latch users to use persistent
wait sets, that'd be good too.

- Andres




Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Michael Paquier
On Fri, Apr 21, 2017 at 11:34 AM, Petr Jelinek
 wrote:
> On 20/04/17 23:30, Peter Eisentraut wrote:
>> On 4/20/17 10:19, Petr Jelinek wrote:
>> Hmm, well, since this only affects the synchronization of table
>> states/names, I guess we could simply do that before we create the
>> slot, as there is no expectation of consistency between the slot and
>> the table-list snapshot.
>>
>> I suppose that wouldn't hurt.
>>
>> Prior to the table sync patch, a missing target relation would just show
>> up as an error later on in the logs.  So having the error sooner
>> actually seems like a good change.
>>
>
> Very simple patch to make.

+1 for that.
-- 
Michael




Re: [HACKERS] some review comments on logical rep code

2017-04-20 Thread Masahiko Sawada
On Fri, Apr 21, 2017 at 4:41 AM, Peter Eisentraut
 wrote:
> On 4/20/17 12:30, Fujii Masao wrote:
>> I've pushed several patches, and there is now only one remaining patch.
>> I posted the review comment on that patch, and I'm expecting that
>> Masahiko-san will update the patch. So what about waiting for the updated
>> version of the patch by next Monday (April 24th)?
>
> Thank you for pitching in.
>
> Where is your latest patch?
>

The remaining patch is 002_Reset_on_commit_launcher_wakeup_v3.patch,
attached to this mail [1]. I'll update that patch and send it.

[1] 
https://www.postgresql.org/message-id/CAD21AoB1HiZV%2B%2Bah%3DVrQjVZoH%2Bb6Rn8Atv6Miz%2BHCp0j%2BL6GZQ%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] pg_dump emits ALTER TABLE ONLY partitioned_table

2017-04-20 Thread Amit Langote
Hi Stephen,

On 2017/04/21 8:43, Stephen Frost wrote:
> * Amit Langote (langote_amit...@lab.ntt.co.jp) wrote:
>> On 2017/04/18 1:43, Stephen Frost wrote:
>>> Please take a look at the attached and let me know your thoughts on it.
>>> I changed the code to complain again regarding TRUNCATE ONLY, since that
>>> never actually makes sense on a partitioned table, unlike ALTER TABLE
>>> ONLY, and adjusted the error messages and added hints.
>>
>> Thanks for polishing the patch.  It looks mostly good to me.
>>
>> + Once
>> +   partitions exist, we do not support using ONLY to
>> +   add or drop constraints on only the partitioned table.
>>
>> I wonder if the following sounds a bit more informative: Once partitions
>> exist, using ONLY will result in an error, because the
>> constraint being added or dropped must be added or dropped from the
>> partitions too.
> 
> My thinking here is that we may add support later on down the line for
> using the ONLY keyword, when the constraint has already been created on
> the partitions themselves, so I would rather just clarify that we don't
> (yet) support that.

OK, I wanted to word it like that, because I have been thinking that we
will never support that.

So right now, we do not support ONLY (or, the way I tend to read it: we
raise an error on specifying ONLY) if any partitions exist at all.  In
the future, we would not raise an error unless we had also seen that
some partition does not possess the constraint being added.  IOW,
specifying ONLY wouldn't cause an error even with partitions present,
provided all of them already have the constraint defined.  Maybe that's
something we will support, but I am not sure how many users will want to
do things that way instead of the other way around; I mean, if the
constraint is to be enforced on the whole partitioned table anyway, why
not just define it on the partitioned table in the first place?  But
then again, why restrict users in so many ways.
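
To illustrate the behavior as it stands (a sketch; the table names are
invented and the exact error wording may differ):

    CREATE TABLE parted (a text) PARTITION BY LIST (a);

    -- No partitions yet: ONLY is accepted.
    ALTER TABLE ONLY parted ADD CONSTRAINT a_nonempty CHECK (length(a) > 0);
    ALTER TABLE ONLY parted DROP CONSTRAINT a_nonempty;

    CREATE TABLE part_a PARTITION OF parted FOR VALUES IN ('a');

    -- With a partition present, ONLY now raises an error, because the
    -- constraint would have to be added to the partitions as well.
    ALTER TABLE ONLY parted ADD CONSTRAINT a_nonempty CHECK (length(a) > 0);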

So alright, I'm fine with the wording.  If someone complains later that
they want more clarification on that point, it could be done later.

> Your wording, at least to me, seems to imply that
> if the constraint was added to or dropped from the partitions then the
> ONLY keyword could be used, which isn't accurate today.

Yes, that's likely to be the impression.

>>>> Updated patches attached (0002 and 0003 unchanged).
>>>
>>> Regarding these, 0002 changes pg_dump to handle partitions much more
>>> like inheritance, which at least seems like a decent idea but makes me a
>>> bit concerned about making sure we're doing everything correctly.
>>
>> Are you concerned about the correctness of the code in pg_dump or the backend?
> 
> My concern is with regard to pg_dump in this case, primairly.
> 
>> The rules of inheritance enforced by flagInhAttrs() are equally
>> applicable to the partitioning case, so the actions it performs for
>> partitioning will not emit incorrect text.
> 
> I understand that this *should* be the case, but that's not how the
> code in pg_dump was written originally when it came to native
> partitions, having its own flagPartitions() function in fact, which
> makes me hesitant to assume that what we do for inheritance will work
> the same in these cases.

The reason getPartitions() is separate from getInherits(), and
flagPartitions() separate from flagInhTables(), is that I originally
thought mixing the two would be undesirable.  In the partitioning case,
getPartitions() joins pg_inherits with pg_class so that it can also get
hold of relpartbound, which I thought at the time would be a bad idea to
do for all of the inheritance tables.  If we're OK with always doing the
join, I don't see why we couldn't get rid of the separation.
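
For context, the kind of catalog query involved is roughly this (a
sketch, not the literal query pg_dump issues):

    -- Fetch inheritance links plus, for partitions, the partition bound
    -- (pg_get_expr() returns NULL for plain inheritance children).
    SELECT i.inhrelid, i.inhparent,
           pg_get_expr(c.relpartbound, i.inhrelid) AS partbound
    FROM pg_inherits i
    JOIN pg_class c ON c.oid = i.inhrelid;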

flagInhAttrs(), OTOH, seems unaffected by that concern, and I think
using it for both inheritance and partitioning is fine.  It affects NOT
NULL constraints and defaults which, as far as that goes, apply to both
inheritance and partitioning the same way.

>>> In
>>> particular, we should really have regression tests for non-inherited
>>> CHECK (and NOT NULL) constraints on partitions.  Perhaps there are tests
>>> covering those cases in the standard regression suite, but it's not
>>> obvious from the SQL.
>>
>> There is one in create_table.sql that looks like this:
>>
>> CREATE TABLE part_b PARTITION OF parted (
>> b NOT NULL DEFAULT 1 CHECK (b >= 0),
>> CONSTRAINT check_a CHECK (length(a) > 0)
>> ) FOR VALUES IN ('b');
>> -- conislocal should be false for any merged constraints
>> SELECT conislocal, coninhcount FROM pg_constraint WHERE conrelid =
>> 'part_b'::regclass AND conname = 'check_a';
>>
>> But it doesn't really look like it's testing non-inherited constraints a
>> whole lot (CHECK (b >= 0) above is a local non-inherited constraint).
>>
>> I added a few more tests right below the above test so that there is a
>> bit more coverage.  Keep the rule I mentioned above in mind when
>> looking at these.  I also added a test in the pg_dump 

Re: [HACKERS] Interval for launching the table sync worker

2017-04-20 Thread Masahiko Sawada
On Thu, Apr 20, 2017 at 8:43 PM, Petr Jelinek
 wrote:
> On 20/04/17 06:21, Masahiko Sawada wrote:
>> On Thu, Apr 20, 2017 at 12:30 AM, Petr Jelinek
>>  wrote:
>>> On 19/04/17 15:57, Masahiko Sawada wrote:
 On Wed, Apr 19, 2017 at 10:07 PM, Petr Jelinek
  wrote:
> On 19/04/17 14:42, Masahiko Sawada wrote:
>> On Wed, Apr 19, 2017 at 5:12 PM, Kyotaro HORIGUCHI
>>  wrote:
>>> At Tue, 18 Apr 2017 18:40:56 +0200, Petr Jelinek 
>>>  wrote in 
>>> 
 On 18/04/17 18:14, Peter Eisentraut wrote:
> On 4/18/17 11:59, Petr Jelinek wrote:
>> Hmm, if we create a hashtable for this, I'd say create a hashtable for
>> the whole table_states then. The reason it's a list now is that it
>> seemed unnecessary to have a hashtable when it will be empty almost
>> always, but there is no need to have both a hashtable + list IMHO.
>>>
>>> I understand that, but like Peter I also don't like the frequent
>>> palloc/pfree in a long-lasting context and the double loop.
>>>
 The difference is that we blow away the list of states when the catalog
 changes, but we keep the hash table with the start times around.  We
 need two things with different lifetimes.
>>>
>>> On the other hand, a hash seems overdone. In addition, the hash
>>> version leaks stale entries while subscriptions are modified, and
>>> vacuuming them is costly.
>>>
 Why can't we just update the hashtable based on the catalog? I mean
 once the record is not needed in the list, the table has been synced
 so there is no need for the timestamp either, since we'll not try to
 start the worker again.
>>
>> I guess the table sync worker for the same table could need to be
>> started again. For example, please imagine a case where a table
>> belonging to the publication is removed from it and the corresponding
>> subscription is refreshed, and then the table is added to it again. We
>> have a record of the table with a timestamp in the hash table from the
>> first table sync, so the table sync after the refresh could have to
>> wait for the interval.
>>
>
> But why do we want to wait in a case where the user has explicitly
> requested a refresh?
>

 Yeah, sorry, I meant that we don't want to wait but cannot launch the
 tablesync worker in such a case.

 But after more thought, the latest patch Peter proposed has the same
 problem. Perhaps we always need to scan the whole pg_subscription_rel
 and remove the entry once the corresponding table gets synced.

>>>
>>> Yes, that's what I mean by "Why can't we just update the hashtable
>>> based on the catalog". And if we do that, then I don't understand why
>>> we need both a hashtable and a linked list if we need to update both
>>> based on catalog reads anyway.
>>
>> Thanks, I've now understood correctly. Yes, I think you're right. If
>> we update the hash table based on the catalog whenever the table state
>> is invalidated, we don't need both the hash table and the list.
>>
>> BTW, in current HEAD, SUBREL_STATE_SYNCWAIT is never stored in the
>> pg_subscription_rel catalog. So the following condition seems
>> incorrect. Should we use "syncworker->relstate == SUBREL_STATE_SYNCWAIT"
>> instead?
>>
>> /*
>>  * There is a worker synchronizing the relation and waiting for
>>  * apply to do something.
>>  */
>> if (syncworker && rstate->state == SUBREL_STATE_SYNCWAIT)
>> {
>>     /*
>>      * There are three possible synchronization situations here.
>>      *
>>      * a) Apply is in front of the table sync: We tell the table
>>      *    sync to CATCHUP.
>>      *
>>      * b) Apply is behind the table sync: We tell the table sync
>>      *    to mark the table as SYNCDONE and finish.
>>      *
>>      * c) Apply and table sync are at the same position: We tell
>>      *    table sync to mark the table as READY and finish.
>>      *
>>      * In any case we'll need to wait for table sync to change
>>      * the state in catalog and only then continue ourselves.
>>      */
>>
>
> Good catch. Although it's not the comment that's wrong, it's the if:
> we should compare syncworker->relstate, not rstate->state.

I've attached a patch to fix this bug.
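
Presumably it is the one-line change sketched below (the actual
attached patch may differ):

    /* Compare the in-memory worker state; SYNCWAIT never reaches the
     * catalog, so rstate->state can never hold it. */
    if (syncworker && syncworker->relstate == SUBREL_STATE_SYNCWAIT)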

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


[Attachment: bug_fix.patch]



Re: [HACKERS] walsender & parallelism

2017-04-20 Thread Petr Jelinek
On 21/04/17 04:32, Craig Ringer wrote:
> On 21 April 2017 at 10:20, Petr Jelinek  wrote:
>> On 21/04/17 03:40, Andres Freund wrote:
>>>
>>> Since [1] walsender (not receiver as commit message says) can execute
>>> SQL queries.  While doing some testing of [2] I noticed that SQL queries
>>> in walsender get stuck if parallelism is used - I have not investigated
>>> why that is yet, but it surely is an issue.  On first blush I'd suspect
>>> that some signalling is not wired up correctly (cf. am_walsender branches
>>> in PostgresMain() and such).
>>
>> Looks like SIGUSR1 being different is problem here - it's normally used
>> to . I also noticed that we don't handle SIGINT (query cancel).
>>
>> I'll write proper patch but can you try to just use
>> procsignal_sigusr1_handler for SIGUSR1 in walsender.c to see if it fixes
>> the issue?
> 
> That's what my recovery conflicts for logical decoding on standby
> patch does, FWIW.
> 
> I haven't found any issues yet..
> 

Ah, I knew I'd seen that change somewhere. I thought it was either in
my patch or in master, which is why I assumed it was working fine
already.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Petr Jelinek
On 20/04/17 23:30, Peter Eisentraut wrote:
> On 4/20/17 10:19, Petr Jelinek wrote:
>> Hmm, well, since this only affects the synchronization of table
>> states/names, I guess we could simply do that before we create the
>> slot, as there is no expectation of consistency between the slot and
>> the table-list snapshot.
> 
> I suppose that wouldn't hurt.
> 
> Prior to the table sync patch, a missing target relation would just show
> up as an error later on in the logs.  So having the error sooner
> actually seems like a good change.
> 

Very simple patch to make.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From c86411521694fd2729a81ebaa09c98e346886b59 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Fri, 21 Apr 2017 04:31:58 +0200
Subject: [PATCH] Synchronize table list before creating slot in CREATE
 SUBSCRIPTION

This way the failure to synchronize the table list will not leave an
unused slot on publisher.
---
 src/backend/commands/subscriptioncmds.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 35dccbc..c29a57e 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -395,20 +395,6 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	PG_TRY();
 	{
 		/*
-		 * If requested, create permanent slot for the subscription.
-		 * We won't use the initial snapshot for anything, so no need
-		 * to export it.
-		 */
-		if (create_slot)
-		{
-			walrcv_create_slot(wrconn, slotname, false,
-							   CRS_NOEXPORT_SNAPSHOT, &lsn);
-			ereport(NOTICE,
-					(errmsg("created replication slot \"%s\" on publisher",
-							slotname)));
-		}
-
-		/*
 		 * Set sync state based on if we were asked to do data copy or
 		 * not.
 		 */
@@ -432,6 +418,20 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 
 		ereport(NOTICE,
 				(errmsg("synchronized table states")));
+
+		/*
+		 * If requested, create permanent slot for the subscription.
+		 * We won't use the initial snapshot for anything, so no need
+		 * to export it.
+		 */
+		if (create_slot)
+		{
+			walrcv_create_slot(wrconn, slotname, false,
+							   CRS_NOEXPORT_SNAPSHOT, &lsn);
+			ereport(NOTICE,
+					(errmsg("created replication slot \"%s\" on publisher",
+							slotname)));
+		}
 	}
 	PG_CATCH();
 	{
-- 
2.7.4




Re: [HACKERS] walsender & parallelism

2017-04-20 Thread Craig Ringer
On 21 April 2017 at 10:20, Petr Jelinek  wrote:
> On 21/04/17 03:40, Andres Freund wrote:
>>
>> Since [1] walsender (not receiver as commit message says) can execute
>> SQL queries.  While doing some testing of [2] I noticed that SQL queries
>> in walsender get stuck if parallelism is used - I have not investigated
>> why that is yet, but it surely is an issue.  On first blush I'd suspect
>> that some signalling is not wired up correctly (cf. am_walsender branches
>> in PostgresMain() and such).
>
> Looks like SIGUSR1 being handled differently is the problem here -
> it's normally used to . I also noticed that we don't handle SIGINT
> (query cancel).
>
> I'll write a proper patch, but can you try just using
> procsignal_sigusr1_handler for SIGUSR1 in walsender.c to see if it
> fixes the issue?

That's what my recovery conflicts for logical decoding on standby
patch does, FWIW.

I haven't found any issues yet..

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-20 Thread Noah Misch
On Mon, Apr 17, 2017 at 02:09:54PM -0400, Peter Eisentraut wrote:
> I think we're not really sure yet what to do about this.  Discussion is
> ongoing.  I'll report back on Wednesday.

This PostgreSQL 10 open item is past due for your status update, and this is
overall the seventh time you have allowed one of your open items to go out
of compliance.  Kindly start complying with the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

At a bare minimum, send a status update within 24 hours, and include a date
for your subsequent status update.




Re: [HACKERS] walsender & parallelism

2017-04-20 Thread Petr Jelinek
On 21/04/17 03:40, Andres Freund wrote:
> 
> Since [1] walsender (not receiver as commit message says) can execute
> SQL queries.  While doing some testing of [2] I noticed that SQL queries
> in walsender get stuck if parallelism is used - I have not investigated
> why that is yet, but it surely is an issue.  On first blush I'd suspect
> that some signalling is not wired up correctly (cf. am_walsender branches
> in PostgresMain() and such).

Looks like SIGUSR1 being handled differently is the problem here -
it's normally used to . I also noticed that we don't handle SIGINT
(query cancel).

I'll write a proper patch, but can you try just using
procsignal_sigusr1_handler for SIGUSR1 in walsender.c to see if it
fixes the issue?
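
(A sketch of that experiment in the walsender's signal setup; treat the
exact placement as an assumption:)

    /* In WalSndSignals() (src/backend/replication/walsender.c): install
     * the standard process-signal handler so that signals from parallel
     * workers and friends reach the walsender like any regular backend. */
    pqsignal(SIGUSR1, procsignal_sigusr1_handler);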

BTW, while looking at the code, I don't understand why we call
latch_sigusr1_handler after calling SetLatch(MyLatch); shouldn't the
SetLatch alone be enough (they both end up calling sendSelfPipeByte()
eventually)?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_dump emits ALTER TABLE ONLY partitioned_table

2017-04-20 Thread Stephen Frost
Greetings,

* Noah Misch (n...@leadboat.com) wrote:
> On Mon, Apr 17, 2017 at 03:41:25PM -0400, Stephen Frost wrote:
> > I've put up a new patch for review on the thread and plan to commit
> > that tomorrow, assuming there isn't anything further.  That should
> > resolve the immediate issue, but I do think there's some merit to Amit's
> > follow-on patches and will continue to discuss those and may commit one
> > or both of those later this week.
> 
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I thought I'd be able to get to this today, but other activities have
taken up the time I had planned to work on this.  I'll be on it again
tomorrow morning and will provide an update (most likely a couple of
commits) by COB tomorrow.

Thanks!

Stephen




[HACKERS] walsender & parallelism

2017-04-20 Thread Andres Freund
Hi,

Since [1] walsender (not receiver as commit message says) can execute
SQL queries.  While doing some testing of [2] I noticed that SQL queries
in walsender get stuck if parallelism is used - I have not investigated
why that is yet, but it surely is an issue.  On first blush I'd suspect
that some signalling is not wired up correctly (cf. am_walsender branches
in PostgresMain() and such).

This'd be easier to look at if the fairly separate facility of allowing
walsender to execute SQL commands had been committed separately, rather
than as part of a fairly large commit of largely unrelated things.

Greetings,

Andres Freund

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7c4f52409a8c7d85ed169bbbc1f6092274d03920
[2] 
http://archives.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com




Re: [HACKERS] Logical replication and inheritance

2017-04-20 Thread Amit Langote
On 2017/04/21 3:22, Peter Eisentraut wrote:
> On 4/18/17 01:58, Amit Langote wrote:
>> We could be more explicit here and say the following instead:
>>
>> create publication mypub for table p;
>> ERROR:  "p" is a partitioned table
>> DETAIL:  Adding partitioned tables to publications is not supported.
>>
>> Thoughts?  (a patch is attached for consideration)
> 
> Committed.  I added an errhint, moved the tests around a bit to match
> the existing structure, and added some documentation.

Thanks.

Regards,
Amit





Re: [HACKERS] WITH clause in CREATE STATISTICS

2017-04-20 Thread Tomas Vondra

On 04/21/2017 12:13 AM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Simon just pointed out that having the WITH clause appear in the middle
>> of the CREATE STATISTICS command looks odd; apparently somebody else
>> already complained on list about the same.  Other commands put the WITH
>> clause at the end, so perhaps we should do likewise in the new command.
>>
>> Here's a patch to implement that.  I verified that if I change
>> qualified_name to qualified_name_list, bison does not complain about
>> conflicts, so this new syntax should support extension to multiple
>> relations without a problem.
>
> Yeah, WITH is fully reserved, so as long as the clause looks like
> WITH ( stuff... ) you're pretty much gonna be able to drop it
> wherever you want.
>
>> Discuss.
>
> +1 for WITH at the end; the existing syntax looks weird to me too.

-1 from me

I like the current syntax more, and WHERE ... WITH seems a bit weird to
me. But more importantly, one thing Dean probably considered when
proposing the current syntax was that we may add support for partial
statistics, pretty much like partial indexes. And we don't allow WITH
at the end (after WHERE) for indexes:

test=# create index on t (a) where a < 100 with (fillfactor=10);
ERROR:  syntax error at or near "with"
LINE 1: create index on t (a) where a < 100 with (fillfactor=10);
                                            ^
test=# create index on t (a) with (fillfactor=10) where a < 100;
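
For comparison, the analogous forms for CREATE STATISTICS would
presumably be (a sketch, assuming the WITH-options grammar under
discussion and a hypothetical partial-statistics WHERE clause):

    -- current order, mirroring CREATE INDEX:
    CREATE STATISTICS s WITH (dependencies) ON (a, b) FROM t WHERE a < 100;
    -- proposed order, with WITH moved to the end:
    CREATE STATISTICS s ON (a, b) FROM t WHERE a < 100 WITH (dependencies);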


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
>> Also, if it's not there we'd fall back to using plain poll(), which is
>> not so awful that we need to work hard to avoid it.  I'd just as soon
>> keep the number of combinations down.

> Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
> combinations?

True, if you just did it that way unconditionally.  But doesn't that
require an extra kernel call per CreateWaitEventSet()?

regards, tom lane




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Andres Freund
On 2017-04-20 20:05:02 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
> >> So ... what would you say to replacing epoll_create() with
> >> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
> >> represent inheritable-across-exec resources on any platform,
> >> making it a lot easier to deal with the EXEC_BACKEND case.
> 
> > I'm generally quite in favor of using CLOEXEC as much as possible in our
> > tree.  I'm a bit concerned with epoll_create1's availability tho - the
> > glibc support for it was introduced in 2.9, whereas epoll_create is in
> > 2.3.2.  On the other hand 2.9 was released 2008-11-13.
> 
> Also, if it's not there we'd fall back to using plain poll(), which is
> not so awful that we need to work hard to avoid it.  I'd just as soon
> keep the number of combinations down.

Just using fcntl(SET, CLOEXEC) wouldn't increase the number of
combinations?

- Andres




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
>> So ... what would you say to replacing epoll_create() with
>> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
>> represent inheritable-across-exec resources on any platform,
>> making it a lot easier to deal with the EXEC_BACKEND case.

> I'm generally quite in favor of using CLOEXEC as much as possible in our
> tree.  I'm a bit concerned with epoll_create1's availability tho - the
> glibc support for it was introduced in 2.9, whereas epoll_create is in
> 2.3.2.  On the other hand 2.9 was released 2008-11-13.

Also, if it's not there we'd fall back to using plain poll(), which is
not so awful that we need to work hard to avoid it.  I'd just as soon
keep the number of combinations down.

regards, tom lane




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Andres Freund
On 2017-04-20 19:53:02 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund  writes:
> >> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
> >>> or are the HANDLEs in a Windows WaitEventSet not inheritable
> >>> resources?
> 
> >> So that kind of sounds like it should be doable.
> 
> > Ah, good.  I'll add a comment about that and press on.
> 
> So ... what would you say to replacing epoll_create() with
> epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
> represent inheritable-across-exec resources on any platform,
> making it a lot easier to deal with the EXEC_BACKEND case.
> 
> AFAIK, both APIs are Linux-only, and epoll_create1() is not much
> newer than epoll_create(), so it seems like we'd not be giving up
> much portability if we insist on epoll_create1.

I'm generally quite in favor of using CLOEXEC as much as possible in
our tree.  I'm a bit concerned with epoll_create1's availability tho -
the glibc support for it was introduced in 2.9, whereas epoll_create is
in 2.3.2.  On the other hand, 2.9 was released 2008-11-13.  If we remain
concerned, we could just fcntl(fd, F_SETFD, FD_CLOEXEC) instead - that
should only be like three more lines of code, and should be available
for a lot longer.
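
(A sketch of that fallback; the helper name is invented and error
handling is abbreviated:)

    #include <sys/epoll.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Create an epoll instance marked close-on-exec.  With
     * epoll_create1() this is one call; otherwise fall back to
     * epoll_create() plus fcntl(). */
    static int
    make_cloexec_epoll_fd(void)
    {
    #ifdef EPOLL_CLOEXEC
        return epoll_create1(EPOLL_CLOEXEC);
    #else
        int fd = epoll_create(1);   /* size hint ignored by modern kernels */

        if (fd >= 0 && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0)
        {
            close(fd);
            return -1;
        }
        return fd;
    #endif
    }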

- Andres




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
>>> or are the HANDLEs in a Windows WaitEventSet not inheritable
>>> resources?

>> So that kind of sounds like it should be doable.

> Ah, good.  I'll add a comment about that and press on.

So ... what would you say to replacing epoll_create() with
epoll_create1(EPOLL_CLOEXEC) ?  Then a WaitEventSet would not
represent inheritable-across-exec resources on any platform,
making it a lot easier to deal with the EXEC_BACKEND case.

AFAIK, both APIs are Linux-only, and epoll_create1() is not much
newer than epoll_create(), so it seems like we'd not be giving up
much portability if we insist on epoll_create1.

regards, tom lane




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Andres Freund
On 2017-04-20 00:50:13 -0400, Tom Lane wrote:
> I wrote:
> > Alvaro Herrera  writes:
> >> Tom Lane wrote:
> >>> Hm.  Do you have a more-portable alternative?
> 
> >> I was thinking in a WaitEventSet from latch.c.
> 
> My first reaction was that that sounded like a lot more work than removing
> two lines from maybe_start_bgworker and adjusting some comments.  But on
> closer inspection, the slow-bgworker-start issue isn't the only problem
> here.  On a machine that restarts select()'s timeout after an interrupt,
> as (at least) HPUX does, the postmaster will actually never iterate
> ServerLoop's loop except immediately after receiving a new connection
> request.  The stream of interrupts from the autovac launcher is alone
> sufficient to prevent the initial 60-second timeout from ever elapsing.
> So if there are no new connection requests for awhile, none of the
> housekeeping actions in ServerLoop get done.

FWIW, I vaguely remember somewhat related issues on x86/linux too.  On
busy machines in autovacuum_freeze_max_age territory (pretty frequent
thing these days), the signalling frequency caused problems in
postmaster, but I unfortunately don't remember the symptoms very well.


> So it's looking to me like we do need to do something about this, and
> ideally back-patch it all the way.  But WaitEventSet doesn't exist
> before 9.6.  Do we have the stomach for back-patching that into
> stable branches?

Hm, that's not exactly no code - on the other hand it's (excepting
the select(2) path) reasonably well exercised.  What we could do is
convert master, see how the beta process likes it, and then backpatch?
These don't seem like super pressing issues.

- Andres




Re: [HACKERS] pg_dump emits ALTER TABLE ONLY partitioned_table

2017-04-20 Thread Stephen Frost
Amit,

* Amit Langote (langote_amit...@lab.ntt.co.jp) wrote:
> On 2017/04/18 1:43, Stephen Frost wrote:
> > * Amit Langote (langote_amit...@lab.ntt.co.jp) wrote:
> >> OK, I agree.  I tweaked the existing bullet point about differences from
> >> traditional inheritance when using ONLY with partitioned tables.
> > 
> > Please take a look at the attached and let me know your thoughts on it.
> > I changed the code to complain again regarding TRUNCATE ONLY, since that
> > never actually makes sense on a partitioned table, unlike ALTER TABLE
> > ONLY, and adjusted the error messages and added hints.
> 
> Thanks for polishing the patch.  It looks mostly good to me.
> 
> + Once
> +   partitions exist, we do not support using ONLY to
> +   add or drop constraints on only the partitioned table.
> 
> I wonder if the following sounds a bit more informative: Once partitions
> exist, using ONLY will result in an error, because the
> constraint being added or dropped must be added or dropped from the
> partitions too.

My thinking here is that we may add support later on down the line for
using the ONLY keyword, when the constraint has already been created on
the partitions themselves, so I would rather just clarify that we don't
(yet) support that.  Your wording, at least to me, seems to imply that
if the constraint was added to or dropped from the partitions then the
ONLY keyword could be used, which isn't accurate today.

> >> Updated patches attached (0002 and 0003 unchanged).
> > 
> > Regarding these, 0002 changes pg_dump to handle partitions much more
> > like inheritance, which at least seems like a decent idea but makes me a
> > bit concerned about making sure we're doing everything correctly.
> 
> Are you concerned about the correctness of the code in pg_dump or backend?

My concern is with regard to pg_dump in this case, primarily.

> Rules of inheritance enforced by flagInhAttrs() are equally applicable to
> the partitioning case, so actions performed by it for the partitioning
> cases will not emit incorrect text.

I understand that this *should* be the case, but that's not how the code
in pg_dump was written originally when it came to native partitions,
having its own flagPartitions() function in fact, which makes me
hesitate to start assuming what we do for inheritance is going to work
in these cases the same.

> The rule that backend follows is this: if a constraint is added to the
> partitioned table (either a named check constraint or a NOT NULL
> constraint), that constraint becomes an inherited *non-local* constraint
> in all the partitions no matter if it was present in the partitions before
> or not.

Right, I get that.

> > In
> > particular, we should really have regression tests for non-inherited
> > CHECK (and NOT NULL) constraints on partitions.  Perhaps there are tests
> > covering those cases in the standard regression suite, but it's not
> > obvious from the SQL.
> 
> There is one in create_table.sql that looks like this:
> 
> CREATE TABLE part_b PARTITION OF parted (
> b NOT NULL DEFAULT 1 CHECK (b >= 0),
> CONSTRAINT check_a CHECK (length(a) > 0)
> ) FOR VALUES IN ('b');
> -- conislocal should be false for any merged constraints
> SELECT conislocal, coninhcount FROM pg_constraint WHERE conrelid =
> 'part_b'::regclass AND conname = 'check_a';
> 
> But it doesn't really look like it's testing non-inherited constraints a
> whole lot (CHECK (b >= 0) above is a local non-inherited constraint).
> 
> I added a few more tests right below the above test so that there is a bit
> more coverage.  Keep the rule I mentioned above when looking at these.  I
> also added a test in the pg_dump suite.  That's in patch 0001 (since you
> posted your version of what was 0001 before, I'm no longer including it here).

Right, I saw the additional tests in your original patch, though it
DROP'd at least parted_no_parts, meaning that it wasn't being tested as
part of pg_upgrade/pg_dump testing.

> By the way, 0003 is a bug-fix.  WITH OPTIONS that is emitted currently
> like below is not the acceptable syntax:
> 
> CREATE TABLE a_partition OF a_partitioned_table (
> a WITH OPTIONS NOT NULL DEFAULT 1
> ) FOR VALUES blah
> 
> But looking at the pg_dump code closely and also considering our
> discussion upthread, I don't think this form is ever emitted.  Because we
> never emit a partition's column-level constraints and column defaults in
> the CREATE TABLE, but rather separately using ALTER TABLE.  In any case,
> applying 0003 seems to be the right thing to do anyway, in case we might
> rejigger things so that whatever can be emitted within the CREATE TABLE
> text will be.

Hm, ok.

I'm going over your latest set of patches.

Thanks!

Stephen




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
>> or are the HANDLEs in a Windows WaitEventSet not inheritable
>> resources?

> I think we have control over that. According to
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms724466(v=vs.85).aspx
> CreateProcess() has to be called with bInheritHandles = true (which we
> do for backends), and SECURITY_ATTRIBUTES.bInheritHandle has to be true
> too.  The latter we already only do for InitSharedLatch(), but not for
> InitLatch(), nor for the WSACreateEvent's created for sockets - those
> apparently can never be inherited.

> So that kind of sounds like it should be doable.

Ah, good.  I'll add a comment about that and press on.

regards, tom lane




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Andres Freund
On 2017-04-20 19:23:28 -0400, Tom Lane wrote:
> or are the HANDLEs in a Windows WaitEventSet not inheritable
> resources?

I think we have control over that. According to
https://msdn.microsoft.com/en-us/library/windows/desktop/ms724466(v=vs.85).aspx
CreateProcess() has to be called with bInheritHandles = true (which we
do for backends), and SECURITY_ATTRIBUTES.bInheritHandle has to be true
too.  The latter we already only do for InitSharedLatch(), but not for
InitLatch(), nor for the WSACreateEvent's created for sockets - those
apparently can never be inherited.

So that kind of sounds like it should be doable.
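
A tiny sketch of the non-inheritable case (illustrative only; the real
calls live in the win32 latch code):

#include <windows.h>

/*
 * An event created this way is not inherited by children of
 * CreateProcess(..., bInheritHandles = TRUE, ...), because
 * bInheritHandle is FALSE; passing NULL attributes has the same
 * effect.
 */
HANDLE
create_private_event(void)
{
	SECURITY_ATTRIBUTES sa;

	sa.nLength = sizeof(sa);
	sa.lpSecurityDescriptor = NULL;
	sa.bInheritHandle = FALSE;

	return CreateEvent(&sa, TRUE, FALSE, NULL);
}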

- Andres




Re: [HACKERS] Unportable implementation of background worker start

2017-04-20 Thread Tom Lane
I wrote:
> Alvaro Herrera  writes:
>> Tom Lane wrote:
>>> Hm.  Do you have a more-portable alternative?

>> I was thinking in a WaitEventSet from latch.c.

> My first reaction was that that sounded like a lot more work than removing
> two lines from maybe_start_bgworker and adjusting some comments.  But on
> closer inspection, the slow-bgworker-start issue isn't the only problem
> here.  On a machine that restarts select()'s timeout after an interrupt,
> as (at least) HPUX does, the postmaster will actually never iterate
> ServerLoop's loop except immediately after receiving a new connection
> request.

I had a go at making the postmaster use a WaitEventSet, and immediately
ran into problems: latch.c is completely unprepared to be used anywhere
except a postmaster child process.  I think we can get around the issue
for the self-pipe, as per the attached untested patch.  But there remains
a problem: we should do a FreeWaitEventSet() after forking a child
process to ensure that postmaster children aren't running around with
open FDs for the postmaster's stuff.  This is no big problem in a regular
Unix build; we can give ClosePostmasterPorts() the responsibility.
It *is* a problem in EXEC_BACKEND children, which won't have inherited
the WaitEventSet data structure.  Maybe we could ignore the problem for
Unix EXEC_BACKEND builds, since we consider those to be for debug
purposes only, not for production.  But I don't think we can get away
with it for Windows --- or are the HANDLEs in a Windows WaitEventSet
not inheritable resources?

regards, tom lane

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4798370..d4ac54c 100644
*** a/src/backend/storage/ipc/latch.c
--- b/src/backend/storage/ipc/latch.c
*** static volatile sig_atomic_t waiting = f
*** 129,134 
--- 129,136 
  /* Read and write ends of the self-pipe */
  static int	selfpipe_readfd = -1;
  static int	selfpipe_writefd = -1;
+ /* Process owning the self-pipe --- needed for checking purposes */
+ static int	selfpipe_owner_pid = 0;
  
  /* Private function prototypes */
  static void sendSelfPipeByte(void);
*** InitializeLatchSupport(void)
*** 158,164 
  #ifndef WIN32
  	int			pipefd[2];
  
! 	Assert(selfpipe_readfd == -1);
  
  	/*
  	 * Set up the self-pipe that allows a signal handler to wake up the
--- 160,202 
  #ifndef WIN32
  	int			pipefd[2];
  
! 	if (IsUnderPostmaster)
! 	{
! 		/*
! 		 * We might have inherited connections to a self-pipe created by the
! 		 * postmaster.  It's critical that child processes create their own
! 		 * self-pipes, of course, and we really want them to close the
! 		 * inherited FDs for safety's sake.
! 		 */
! 		if (selfpipe_owner_pid != 0)
! 		{
! 			/* Assert we go through here but once in a child process */
! 			Assert(selfpipe_owner_pid != MyProcPid);
! 			/* Release postmaster's pipe */
! 			close(selfpipe_readfd);
! 			close(selfpipe_writefd);
! 			/* Clean up, just for safety's sake; we'll set these in a bit */
! 			selfpipe_readfd = selfpipe_writefd = -1;
! 			selfpipe_owner_pid = 0;
! 		}
! 		else
! 		{
! 			/*
! 			 * Postmaster didn't create a self-pipe ... or else we're in an
! 			 * EXEC_BACKEND build, and don't know because we didn't inherit
! 			 * values for the static variables.  (In that case we'll fail to
! 			 * close the inherited FDs, but that seems acceptable since
! 			 * EXEC_BACKEND builds aren't meant for production on Unix.)
! 			 */
! 			Assert(selfpipe_readfd == -1);
! 		}
! 	}
! 	else
! 	{
! 		/* In postmaster or standalone backend, assert we do this but once */
! 		Assert(selfpipe_readfd == -1);
! 		Assert(selfpipe_owner_pid == 0);
! 	}
  
  	/*
  	 * Set up the self-pipe that allows a signal handler to wake up the
*** InitializeLatchSupport(void)
*** 176,188 
  
  	selfpipe_readfd = pipefd[0];
  	selfpipe_writefd = pipefd[1];
  #else
  	/* currently, nothing to do here for Windows */
  #endif
  }
  
  /*
!  * Initialize a backend-local latch.
   */
  void
  InitLatch(volatile Latch *latch)
--- 214,227 
  
  	selfpipe_readfd = pipefd[0];
  	selfpipe_writefd = pipefd[1];
+ 	selfpipe_owner_pid = MyProcPid;
  #else
  	/* currently, nothing to do here for Windows */
  #endif
  }
  
  /*
!  * Initialize a process-local latch.
   */
  void
  InitLatch(volatile Latch *latch)
*** InitLatch(volatile Latch *latch)
*** 193,199 
  
  #ifndef WIN32
  	/* Assert InitializeLatchSupport has been called in this process */
! 	Assert(selfpipe_readfd >= 0);
  #else
  	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
  	if (latch->event == NULL)
--- 232,238 
  
  #ifndef WIN32
  	/* Assert InitializeLatchSupport has been called in this process */
! 	Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid);
  #else
  	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
  	if (latch->event == NULL)
*** OwnLatch(volatile Latch *latch)
*** 256,262 ***

Re: [HACKERS] WITH clause in CREATE STATISTICS

2017-04-20 Thread Tom Lane
Alvaro Herrera  writes:
> Simon just pointed out that having the WITH clause appear in the middle
> of the CREATE STATISTICS command looks odd; apparently somebody else
> already complained on list about the same.  Other commands put the WITH
> clause at the end, so perhaps we should do likewise in the new command.

> Here's a patch to implement that.  I verified that if I change
> qualified_name to qualified_name_list, bison does not complain about
> conflicts, so this new syntax should support extension to multiple
> relations without a problem.

Yeah, WITH is fully reserved, so as long as the clause looks like
WITH ( stuff... ) you're pretty much gonna be able to drop it
wherever you want.

> Discuss.

+1 for WITH at the end; the existing syntax looks weird to me too.

regards, tom lane




Re: [HACKERS] Removing select(2) based latch (was Unportable implementation of background worker start)

2017-04-20 Thread Tom Lane
I wrote:
> I did my own checking, and concur that only the MinGW buildfarm members
> are reporting lacking poll(2) or poll.h.  Since they also report lacking
> sys/select.h, they must be falling through to the WAIT_USE_WIN32
> implementation.

BTW, another amusing thing I just noted is that given a machine too old
to have poll(2), the existing coding would most likely fail outright,
because <sys/select.h> post-dates poll(2).  SUSv2 specifies including
<sys/time.h> to get select(2); apparently POSIX invented <sys/select.h>
around 2001.  gaur/pademelon indeed lacks <sys/select.h>, so it could
never have reached the WAIT_USE_SELECT implementation without some
readjustment of that #ifdef nest.

regards, tom lane




Re: [HACKERS] Removing select(2) based latch (was Unportable implementation of background worker start)

2017-04-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-04-20 17:27:42 -0400, Tom Lane wrote:
>> In short: yeah, let's nuke the WAIT_USE_SELECT implementation.
>> It's dead code and it's unlikely to get resurrected.

> Ok, cool.  v10 or wait till v11?  I see very little reason to wait
> personally.

I feel no need to wait on that.  Code removal is not a "new feature".

regards, tom lane




Re: [HACKERS] Use sync commit for logical replication apply in TAP tests

2017-04-20 Thread Tom Lane
Peter Eisentraut  writes:
> Double hmm, I ran a few more tests and different machines and get
> consistent but underwhelming improvements.
> I don't mind applying the patch nonetheless, but maybe we can get a few
> more test results from others.
> (Instructions: apply the patch and time make -C src/test/subscription check)

OK, here's some results.  All are typical debug builds (--enable-cassert
in particular), and each number cited is best result of three runs:

             UNPATCHED    PATCHED
sss1         1m13.828s    1m12.763s   (H/W RAID controller + spinning rust)
laptop       1m9.236s     1m7.969s    (Macbook Pro, SSD)
gaur         11m25.58s    11m14.04s   (1990s-era spinning rust)
prairiedog   8m37.848s    9m26.256s   (ATA/66, 5400RPM spinning rust)

(For context, about 2m10s of gaur's cycle time is the temp install
preparation, and 32s of prairiedog's; it's just a few seconds on the
newer machines.)

I have little faith in the numbers from gaur because I saw run-to-run
variations of a couple of minutes; I suspect that the bgworker launching
bug I described yesterday is making itself felt in this test too.
prairiedog also had rather more variance than one would expect, making
me wonder if something similar is happening there.

But based on these results, I would say that as a test-speed enhancement,
this patch is a failure.  I'd also question the wisdom of testing only
a non-default code path.  There could be an argument for running some
of the subscription tests with async commit and some without, just to
improve code coverage.

regards, tom lane




Re: [HACKERS] Removing select(2) based latch (was Unportable implementation of background worker start)

2017-04-20 Thread Andres Freund
Hi,

On 2017-04-20 17:27:42 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-04-19 20:06:05 -0400, Tom Lane wrote:
> >> We should check the buildfarm to see if the select() implementation is
> >> being tested at all.
> 
> > I verified it's currently not (unless I made a mistake):
> 
> I did my own checking, and concur that only the MinGW buildfarm members
> are reporting lacking poll(2) or poll.h.  Since they also report lacking
> sys/select.h, they must be falling through to the WAIT_USE_WIN32
> implementation.  So the WAIT_USE_SELECT implementation is going untested
> in the buildfarm, and has been since before the WaitEventSet API existed
> (I checked buildfarm records back to the start of 2016).

Similarly with the SELECT based unix_latch.c code before...


> In short: yeah, let's nuke the WAIT_USE_SELECT implementation.
> It's dead code and it's unlikely to get resurrected.

Ok, cool.  v10 or wait till v11?  I see very little reason to wait
personally.


> BTW, noting that SUSv2 specifies <poll.h> not <sys/poll.h>, I wonder
> whether we couldn't drop configure's test for the latter along with
> the
> 
> #ifdef HAVE_SYS_POLL_H
> #include <sys/poll.h>
> #endif
> 
> stanzas we have in a couple of places.  But that's a separate issue.

Seems like a good idea; a quick select confirms there are no !win
buildfarm animals without poll.h.

Greetings,

Andres Freund




Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Peter Eisentraut
On 4/20/17 10:19, Petr Jelinek wrote:
> Hmm, well, since this only affects the synchronization of table
> states/names, I guess we could just simply do that before we create the
> slot, as there is no expectation of consistency between the slot and the
> table list snapshot.

I suppose that wouldn't hurt.

Prior to the table sync patch, a missing target relation would just show
up as an error later on in the logs.  So having the error sooner
actually seems like a good change.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Removing select(2) based latch (was Unportable implementation of background worker start)

2017-04-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-04-19 20:06:05 -0400, Tom Lane wrote:
>> We should check the buildfarm to see if the select() implementation is
>> being tested at all.

> I verified it's currently not (unless I made a mistake):

I did my own checking, and concur that only the MinGW buildfarm members
are reporting lacking poll(2) or poll.h.  Since they also report lacking
sys/select.h, they must be falling through to the WAIT_USE_WIN32
implementation.  So the WAIT_USE_SELECT implementation is going untested
in the buildfarm, and has been since before the WaitEventSet API existed
(I checked buildfarm records back to the start of 2016).

Aside from that empirical observation, we can make a standards-compliance
argument, which is that both poll(2) and poll.h are required by the Single
Unix Spec v2 (a/k/a POSIX 1997), which is what we've been taking as our
baseline requirement for Unix platforms for ten or fifteen years now.

In short: yeah, let's nuke the WAIT_USE_SELECT implementation.
It's dead code and it's unlikely to get resurrected.

BTW, noting that SUSv2 specifies <poll.h> not <sys/poll.h>, I wonder
whether we couldn't drop configure's test for the latter along with
the

#ifdef HAVE_SYS_POLL_H
#include <sys/poll.h>
#endif

stanzas we have in a couple of places.  But that's a separate issue.

regards, tom lane




Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Peter Eisentraut
On 4/20/17 08:41, Michael Paquier wrote:
> As a subscription is a self-contained concept, it seems to me that any
> errors happening should at least try to do some cleanup action before
> just giving up; that would be a less frustrating experience.

This is the way it's designed.

The alternative is to do what we currently do for physical replication,
namely requiring the user to set up all the replication slots manually
beforehand.  I don't think that's a better experience.  There was a
thread about having pg_basebackup automatically create replication
slots.  That will have to deal with the same issues.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-20 Thread Claudio Freire
On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas  wrote:
> On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire  
> wrote:
>> In essence, the patch as it is proposed, doesn't *need* a binary
>> search, because the segment list can only grow up to 15 segments at
>> its biggest, and that's a size small enough that linear search will
>> outperform (or at least perform as well as) binary search. Reducing
>> the initial segment size wouldn't change that. If the 12GB limit is
>> lifted, or the maximum segment size reduced (from 1GB to 128MB for
>> example), however, that would change.
>>
>> I'd be more in favor of lifting the 12GB limit than of reducing the
>> maximum segment size, for the reasons above. Raising the 12GB limit
>> has concrete and readily apparent benefits, whereas using bigger (or
>> smaller) segments is far more debatable. Yes, that will need a binary
>> search. But, I was hoping that could be a second (or third) patch, to
>> keep things simple, and benefits measurable.
>
> To me, it seems a bit short-sighted to say, OK, let's use a linear
> search because there's this 12GB limit so we can limit ourselves to 15
> segments.  Because somebody will want to remove that 12GB limit, and
> then we'll have to revisit the whole thing anyway.  I think, anyway.

Ok, attached is an updated patch that implements the binary search.

> What's not clear to me is how sensitive the performance of vacuum is
> to the number of cycles used here.  For a large index, the number of
> searches will presumably be quite large, so it does seem worth
> worrying about performance.  But if we just always used a binary
> search, would that lose enough performance with small numbers of
> segments that anyone would care?  If so, maybe we need to use linear
> search for small numbers of segments and switch to binary search with
> larger numbers of segments.

I just went and tested.

I implemented the hybrid binary search attached, and ran a few tests
with and without the sequential code enabled, at small scales.

The difference is statistically significant, but small (less than 3%).
With proper optimization of the binary search, however, the difference
flips:

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80 fullbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s, elapsed: 18.34 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s, elapsed: 19.75 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s, elapsed: 18.48 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s, elapsed: 20.60 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s, elapsed: 19.16 s.

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80 hybridbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s, elapsed: 19.15 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s, elapsed: 18.40 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s, elapsed: 18.87 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s, elapsed: 26.43 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s, elapsed: 20.02 s.

That's after inlining the compare on both the linear and sequential
code, and it seems it lets the compiler optimize the binary search to
the point where it outperforms the sequential search.

That's not the case when the compare isn't inlined.

That seems in line with [1], that show the impact of various
optimizations on both algorithms. It's clearly a close enough race
that optimizations play a huge role.
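
For illustration, a minimal sketch of the hybrid being discussed (the
8-entry cutoff and all names are made up; the patch's real code works
on the segmented dead-tuples array):

#include <stddef.h>

/* Inline comparator; the inlining is what makes both paths cheap. */
static inline int
cmp_key(int a, int b)
{
	return (a > b) - (a < b);
}

/*
 * Binary-search down to a small window, then finish with a linear
 * scan.  Returns the first index with arr[i] >= key.
 */
static size_t
hybrid_lower_bound(const int *arr, size_t n, int key)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (hi - lo > 8)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (cmp_key(arr[mid], key) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	while (lo < hi && cmp_key(arr[lo], key) < 0)
		lo++;
	return lo;
}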

Since we're not likely to go and implement SSE2-optimized versions, I
believe I'll leave the binary search only. That's the attached patch
set.

I'm running the full test suite, but that takes a very long while.
I'll post the results when they're done.

[1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/
From 709428f67960dd20eaf34846a0b579766e0701f6 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 408 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 +
 3 files changed, 350 insertions(+), 76 deletions(-)
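
[Editorial sketch of the structure this commit message describes; the
names, the 15-segment cap, and malloc in place of palloc are all
illustrative, not the patch's actual code:]

#include <stdlib.h>

#define MAX_SEGMENTS 15			/* cap mentioned in the discussion */

typedef struct
{
	size_t		used;
	size_t		capacity;
	int		   *tids;			/* stand-in for ItemPointerData */
} Segment;

typedef struct
{
	int			nsegments;
	Segment		segments[MAX_SEGMENTS];
} DeadTuples;

/* Reserve space for one entry, growing via exponentially bigger segments. */
static int *
multiarray_reserve(DeadTuples *dt, size_t first_capacity)
{
	Segment    *seg;

	if (dt->nsegments == 0 ||
		dt->segments[dt->nsegments - 1].used ==
		dt->segments[dt->nsegments - 1].capacity)
	{
		size_t		cap = (dt->nsegments == 0) ? first_capacity :
			dt->segments[dt->nsegments - 1].capacity * 2;

		seg = &dt->segments[dt->nsegments++];	/* bounds check omitted */
		seg->used = 0;
		seg->capacity = cap;
		seg->tids = malloc(cap * sizeof(int));	/* OOM check omitted */
	}
	else
		seg = &dt->segments[dt->nsegments - 1];

	return &seg->tids[seg->used++];
}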

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..ddf19d7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,25 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of T

[HACKERS] WITH clause in CREATE STATISTICS

2017-04-20 Thread Alvaro Herrera
Simon just pointed out that having the WITH clause appear in the middle
of the CREATE STATISTICS command looks odd; apparently somebody else
already complained on list about the same.  Other commands put the WITH
clause at the end, so perhaps we should do likewise in the new command.

Here's a patch to implement that.  I verified that if I change
qualified_name to qualified_name_list, bison does not complain about
conflicts, so this new syntax should support extension to multiple
relations without a problem.

Discuss.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 237909c440054e97c09dc6e7711a4ca92892465b
Author: Alvaro Herrera 
AuthorDate: Thu Apr 20 18:19:06 2017 -0300
CommitDate: Thu Apr 20 18:19:06 2017 -0300

Put WITH clause at the end

diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
index edbcf5840b..910d6be8ab 100644
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@@ -22,9 +22,9 @@ PostgreSQL documentation
 
 CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
-    WITH ( option [= value] [, ... ] )
     ON ( column_name, column_name [, ...])
     FROM table_name
+    WITH ( option [= value] [, ... ] )
 
 
  
@@ -158,7 +158,7 @@ CREATE TABLE t1 (
 INSERT INTO t1 SELECT i/100, i/500
  FROM generate_series(1,100) s(i);
 
-CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t1;
+CREATE STATISTICS s1 ON (a, b) FROM t1 WITH (dependencies);
 
 ANALYZE t1;
 
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 89d2836c49..92e9aa8b28 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3847,28 +3847,28 @@ ExistingIndex:	USING INDEX index_name	{ $$ = $3; }
 /*
  *
  *		QUERY :
- *				CREATE STATISTICS stats_name WITH (options) ON (columns) FROM relname
+ *				CREATE STATISTICS stats_name ON (columns) FROM relname WITH (options)
  *
  */
 
-CreateStatsStmt:	CREATE STATISTICS any_name opt_reloptions ON '(' columnList ')' FROM qualified_name
+CreateStatsStmt:	CREATE STATISTICS any_name ON '(' columnList ')' FROM qualified_name opt_reloptions
 					{
 						CreateStatsStmt *n = makeNode(CreateStatsStmt);
 						n->defnames = $3;
-						n->relation = $10;
-						n->keys = $7;
-						n->options = $4;
+						n->relation = $9;
+						n->keys = $6;
+						n->options = $10;
 						n->if_not_exists = false;
 						$$ = (Node *)n;
 					}
-			| CREATE STATISTICS IF_P NOT EXISTS any_name opt_reloptions ON '(' columnList ')' FROM qualified_name
+			| CREATE STATISTICS IF_P NOT EXISTS any_name ON '(' columnList ')' FROM qualified_name opt_reloptions
 					{
 						CreateStatsStmt *n = makeNode(CreateStatsStmt);
 						n->defnames = $6;
-						n->relation = $13;
-						n->keys = $10;
-						n->options = $7;
+						n->relation = $12;
+						n->keys = $9;
+						n->options = $13;
 						n->if_not_exists = true;
 						$$ = (Node *)n;
 					}
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 0d6f65e604..7dc6011e6b 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -389,7 +389,7 @@ EXPLAIN (COSTS OFF)
 (2 rows)
 
 -- create statistics
-CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
+CREATE STATISTICS func_deps_stat ON (a, b, c) FROM functional_dependencies WITH 

Re: [HACKERS] OK, so culicidae is *still* broken

2017-04-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-04-20 16:57:03 +0530, Amit Kapila wrote:
>> Agreed.  I have done some further study by using VMMap tool in Windows
>> and it seems to me that all 64-bit processes use address range
>> (0001 ~ 07FE).  I have attached two screen
>> shots to show the address range (lower range and upper range).  You
>> need to refer the lower half of the window in attached screenshots.
>> At this stage, I am not completely sure whether we can assume some
>> address out of this range to use in MapViewOfFileEx.  Let me know your
>> thoughts.

> That seems like a fairly restricted range (good!), and squares with
> memories of reading about this (can't find the reference anymore).  Just
> using something like 0x0F00 as the address might work
> sufficiently well.  What happens if you just hardcode that in the first
> MapViewOfFileEx call, and change RandomizedBaseAddress="FALSE" in
> src/tools/msvc/VCBuildProject.pm to true?

The material I found about Linux x86_64 addressing explains that the
underlying hardware address space is only 48 bits.  Linux chooses to
normalize this by sign-extending into the upper 16 bits, so that
valid 64-bit addresses are
 .. 7FFF
and
8000 .. 

If I'm counting the bits correctly, Windows is choosing to use only
1/16th of the lower half of the available address space, which seems
a bit odd.  However, the main point here is that there should be
quite a bit of available daylight, if they will let us use addresses
above 07FE at all.  I'd be inclined to do our own low-tech
ASLR by using addresses with a random component, say 0Fxx.
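
A sketch of that low-tech ASLR idea (the exact constants in this thread
were mangled in the archive, so the prefix, shift, and mask below are
placeholders, not proposals):

#include <stdint.h>
#include <stdlib.h>

/*
 * Pick a mapping address of the form prefix-plus-random-byte, placed
 * well above the range Windows hands out by default.
 */
static void *
pick_shmem_base(void)
{
	uint64_t	base = 0x0F00ULL + (uint64_t) (rand() & 0xFF);

	return (void *) (base << 32);
}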

regards, tom lane




Re: [HACKERS] [POC] hash partitioning

2017-04-20 Thread Robert Haas
On Mon, Apr 17, 2017 at 10:50 AM, Yugo Nagata  wrote:
> I thought that the partition constraint could be decided every
> time a new partition is created or attached, and that it would be
> necessary to relocate records automatically when the partition
> configuration changes. However, I have come to think that the
> automatic relocation might not be needed at this point.

Great!  I am glad that we are in agreement about this point.  However,
actually I think the problem is worse than you are supposing.  If
you're restoring from a database dump created by pg_dump, then we will
try to load data into each individual partition using COPY.  Direct
insertions into individual partitions are not subject to tuple routing
-- that only affects inserts into the parent table.  So if the
partition constraint is not correct immediately after creating the
table, the COPY which tries to repopulate that partition will probably
fail with an ERROR, because there will likely be at least one row
(probably many) which match the "final" partition constraint but not
the "interim" partition constraint that we'd have after recreating
some but not all of the hash partitions.  For example, if we had
created 2 partitions so far out of a total of 3, we'd think the
constraint ought to be (hashvalue % 2) == 1 rather than (hashvalue %
3) == 1, which obviously will likely lead to the dump failing to
restore properly.

So, I think we really need something like the syntax in Amul's patch
in order for this to work at all.  Of course, the details can be
changed according to what seems best but I think the overall picture
is about right.
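
A toy demonstration of that restore hazard, using nothing but the
modulus arithmetic above (plain C, not PostgreSQL code):

#include <stdio.h>

/*
 * Rows whose hash satisfies the final constraint (h % 3 == 1) need
 * not satisfy the interim constraint (h % 2 == 1) that would hold
 * while only two of the three partitions exist.
 */
int
main(void)
{
	unsigned int h;

	for (h = 0; h < 12; h++)
		if (h % 3 == 1 && h % 2 != 1)
			printf("hash %u: passes final constraint, fails interim\n", h);
	return 0;
}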

There is another point that I think also needs thought; not sure if
either your patch or Amit's patch handles it: constraint exclusion
will not work for hash partitioning.  For example, if the partitioning
constraint for each partition is of the form (hash(partcol) % 6) ==
SOME_VALUE_BETWEEN_0_AND_5, and the query contains the predicate
partcol == 37, constraint exclusion will not be able to prove anything
about which partitions need to be scanned.  Amit Langote has noted a
few times that partitioning relies on constraint exclusion *for now*,
which implies, I think, that he's thought about changing it to work
differently.  I think that would be a good idea.  For range
partitioning or list partitioning, a special-purpose mechanism for
partitioning could be much faster than constraint exclusion, since it
knows that partcol == 37 can only be true for one partition and can
reuse the tuple-routing infrastructure to figure out which one it is.
And that approach can also work for hash partitioning, where
constraint exclusion is useless.
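
The contrast is easy to sketch: routing needs only the hash and a
modulus, no theorem proving.  (hash_uint32 below is a stand-in for the
real partitioning hash support function.)

#include <stdio.h>

static unsigned int
hash_uint32(unsigned int v)
{
	/* any decent integer mix works for the illustration */
	v ^= v >> 16;
	v *= 0x45d9f3bu;
	v ^= v >> 16;
	return v;
}

int
main(void)
{
	unsigned int nparts = 6;
	unsigned int partcol = 37;

	/* partcol = 37 maps to exactly one of the 6 partitions */
	printf("scan only partition %u\n", hash_uint32(partcol) % nparts);
	return 0;
}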

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-04-20 Thread Robert Haas
On Thu, Apr 20, 2017 at 8:45 AM, Ashutosh Bapat
 wrote:
> I think you are confusing join condition application and partition
> bounds of a join relation.

You're right, I misunderstood what you were talking about.

> But the problem we are trying to solve here is about partition bounds of
> the join relation: what should be the partition bounds of AB, BC or
> AC? When we compare partition bounds of an intermediate join with
> those of another (e.g. AB with those of C), what operator should
> be used? You seem to be suggesting that we keep as many sets of
> partition bounds as there are base relations participating in the join
> and then use appropriate partition bounds based on the columns in the
> join conditions, so that we can use the same operator as used in the
> join condition. That doesn't seem to be a good option since the
> partition bounds will all have the same values, only differing in their
> binary representation because of differences in data types.

Well, actually, I think it is a good option, as I wrote in
http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhsj539wtqa7cayj-hcecc...@mail.gmail.com

In that email, my principal concern was allowing partition-wise join
to succeed even with slightly different sets of partition boundaries
on the two sides of the join; in particular, if we've got A with A1 ..
A10 and B with B1 .. B10 and the DBA adds A11, I don't want
performance to tank until the DBA gets around to adding B11.  Removing
the partition bounds from the PartitionScheme and storing them
per-RelOptInfo fixes that problem; the fact that it also solves this
problem of what happens when we have different data types on the two
sides looks to me like a second reason to go that way.

And there's a third reason, too, which is that the opfamily mechanism
doesn't currently provide any mechanism for reasoning about which data
types are "wider" or "narrower" in the way that you want.  In general,
there's not even a reason why such a relationship has to exist;
consider two data types t1 and t2 with opclasses t1_ops and t2_ops
that are part of the same opfamily t_ops, and suppose that t1 can
represent any positive integer and t2 can represent any even integer,
or in general that each data type can represent some but not all of
the values that can be represented by the other data type.  In such a
case, neither would be "wider" than the other in the sense that you
need; you essentially want to find a data type within the opfamily to
which all values of any of the types involved in the query can be cast
without error, but there is nothing today which requires such a data
type to exist, and no way to identify which one it is.  In practice,
for all of the built-in opfamilies that have more than one opclass,
such a data type always exists but is not always unique -- in
particular, datetime_ops contains date_ops, timestamptz_ops, and
timestamp_ops, and either of the latter two is a plausible choice for
the "widest" data type of the three.  But there's no way to figure
that out from the opfamily or opclass information we have today.

In theory, it would be possible to modify the opfamily machinery so
that every opfamily designates an optional ordering of types from
"narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
guarantee that every value of type t1 can be cast without error to a
value of type t2.  But I think that's a bad plan.  It means that every
opfamily created by either the core code or some extension now needs
to worry about annotating the opclass with this new information, and
we have to add to core the SQL syntax and supporting code to make that
work.  If it were implementing a valuable feature which could not
practically be implemented without extending the opfamily machinery,
then I guess that's what we'd have to suck it up and incur that
complexity, but in this case it does not appear necessary.  Storing
the partition bounds per-RelOptInfo makes this problem -- and a few
others -- go away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] some review comments on logical rep code

2017-04-20 Thread Peter Eisentraut
On 4/20/17 12:30, Fujii Masao wrote:
> I've pushed several patches, and there is now only one remaining patch.
> I posted the review comment on that patch, and I'm expecting that
> Masahiko-san will update the patch. So what about waiting for the updated
> version of the patch by next Monday (April 24th)?

Thank you for pitching in.

Where is your latest patch?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] changing mvstats output to be valid JSON

2017-04-20 Thread Alvaro Herrera
Hi,

While writing the recent ext.stats docs, I became annoyed by the output
functions of the new types used by multivariate statistics: they are
almost JSON, but not quite.  Since they can become largish, I propose
that we make a point of ensuring the output of those types is valid
JSON, so that they can be processed by JSON tools, particularly by
jsonb_pretty().  Note that the internal bytea format does not change;
this is only for human consumption (or for external tools that want to
examine the values), not for the planner.  Also, the input function for
those types rejects input, so nothing needs to care about reading these
values; ANALYZE is the only valid source for them.

With the proposed attached patch, those columns look more convenient to
work with:

alvherre=# select jsonb_pretty(stxndistinct::jsonb), jsonb_pretty(stxdependencies::jsonb) from pg_statistic_ext;
      jsonb_pretty      │       jsonb_pretty
────────────────────────┼────────────────────────────
 {                      │ {
     "1, 2": 33178,     │     "1 => 2": 1.00,
     "1, 5": 33178,     │     "1 => 5": 1.00,
     "2, 5": 27435,     │     "5 => 1": 0.422367,
     "1, 2, 5": 33178   │     "5 => 2": 0.482567,
 }                      │     "1, 2 => 5": 1.00,
                        │     "1, 5 => 2": 1.00,
                        │     "2, 5 => 1": 0.807367
                        │ }

Changes:

* Removed the external [ ]; the whole string is now a single JSON object.
* Removed the bitmapset output artifact.  (I introduced that while
  simplifying the code, but I now think that was a mistake).
* The attribute list is the object key.
* In ndistinct, the value is now an integer rather than a floating point
  number.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 49695b5ef165b1736fab65d6443640c5529b6338
Author: Alvaro Herrera 
AuthorDate: Thu Apr 20 16:26:27 2017 -0300
CommitDate: Thu Apr 20 16:26:27 2017 -0300

Fix extstats output funcs to emit valid JSON

diff --git a/src/backend/statistics/dependencies.c 
b/src/backend/statistics/dependencies.c
index 0890514bf7..cbc2a3c6c3 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -696,7 +696,7 @@ pg_dependencies_out(PG_FUNCTION_ARGS)
MVDependencies *dependencies = statext_dependencies_deserialize(data);
 
initStringInfo(&str);
-   appendStringInfoChar(&str, '[');
+   appendStringInfoChar(&str, '{');
 
for (i = 0; i < dependencies->ndeps; i++)
{
@@ -705,7 +705,7 @@ pg_dependencies_out(PG_FUNCTION_ARGS)
if (i > 0)
appendStringInfoString(&str, ", ");
 
-   appendStringInfoChar(&str, '{');
+   appendStringInfoChar(&str, '"');
for (j = 0; j < dependency->nattributes; j++)
{
if (j == dependency->nattributes - 1)
@@ -715,11 +715,10 @@ pg_dependencies_out(PG_FUNCTION_ARGS)
 
appendStringInfo(&str, "%d", dependency->attributes[j]);
}
-   appendStringInfo(&str, " : %f", dependency->degree);
-   appendStringInfoChar(&str, '}');
+   appendStringInfo(&str, "\": %f", dependency->degree);
}
 
-   appendStringInfoChar(&str, ']');
+   appendStringInfoChar(&str, '}');
 
PG_RETURN_CSTRING(str.data);
 }
diff --git a/src/backend/statistics/mvdistinct.c 
b/src/backend/statistics/mvdistinct.c
index b77113fb39..362c912b2c 100644
--- a/src/backend/statistics/mvdistinct.c
+++ b/src/backend/statistics/mvdistinct.c
@@ -354,21 +354,29 @@ pg_ndistinct_out(PG_FUNCTION_ARGS)
StringInfoData str;
 
initStringInfo(&str);
-   appendStringInfoChar(&str, '[');
+   appendStringInfoChar(&str, '{');
 
for (i = 0; i < ndist->nitems; i++)
{
MVNDistinctItem item = ndist->items[i];
+   int x = -1;
+   boolfirst = true;
 
if (i > 0)
appendStringInfoString(&str, ", ");
 
-   appendStringInfoChar(&str, '{');
-   outBitmapset(&str, item.attrs);
-   appendStringInfo(&str, ", %f}", item.ndistinct);
+   appendStringInfoChar(&str, '"');
+   while ((x = bms_next_member(item.attrs, x)) >= 0)
+   {
+   if (!first)
+   appendStringInfoString(&str, ", ");
+   first = false;
+   appendStringInfo(&str, "%d", x);
+   }
+   appendStringInfo(&str, "\": %d", (int) item.ndistinct);
}
 
-   appendStringInfoChar(&str, ']');
+   appendStringInfoChar(&str, '}');
 
PG_RETURN_CSTRING(str.data);
 }


Re: [HACKERS] Interval for launching the table sync worker

2017-04-20 Thread Peter Eisentraut
On 4/19/17 23:02, Noah Misch wrote:
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

This is currently on the back burner relative to other issues.  Next
check-in from me by Monday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Why does logical replication launcher set application_name?

2017-04-20 Thread Peter Eisentraut
On 4/18/17 13:18, Tom Lane wrote:
> I think you're thinking about it wrong.  To my mind the issue is that
> there should be some generic way to determine that a bgworker process
> is or is not laboring on behalf of an identifiable user.  It's great
> that we can tell which user it is when there is one, but clearly some
> bgworkers will be providing general services that aren't associated with
> a single user.  So it should be possible to set the userID to zero or
> some such when the bgworker is one that isn't associated with a
> particular user.  Maybe the owning user needs to become an additional
> parameter passed in struct BackgroundWorker.

I think this is probably a problem particular to the logical replication
launcher.  Other background workers either do work as a particular user,
as you say, or don't touch the database at all.  So a localized hack or
a simple hide-the-user flag might suffice for now.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Why does logical replication launcher set application_name?

2017-04-20 Thread Peter Eisentraut
On 4/18/17 12:37, Tom Lane wrote:
> Peter Eisentraut  writes:
>> I think showing bgw_name as backend_type always sounds reasonable.  No
>> need to treat external implementations differently.
> 
> That's definitely an approach we could use.  It would encourage people
> to use short bgw_names, which is a constraint that wasn't especially
> apparent before, but I don't think that's a bad thing.

Actually, bgw_name is probably not a good fit for
pg_stat_activity.backend_type, since it's often not the same for all
background workers of the same kind.  For example, it might be "parallel
worker for PID %d".  Ideally, a background worker would have a bgw_type
field and perhaps a bgw_name_extra field.  However, a background worker
might also want to update some part of that dynamically, to change the
process title.  Many details depend on the particular background workers.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] OK, so culicidae is *still* broken

2017-04-20 Thread Andres Freund
On 2017-04-20 16:57:03 +0530, Amit Kapila wrote:
> On Wed, Apr 19, 2017 at 9:01 PM, Andres Freund  wrote:
> > On 2017-04-19 10:15:31 -0400, Tom Lane wrote:
> >> Amit Kapila  writes:
> >> > On Sun, Apr 16, 2017 at 3:04 AM, Tom Lane  wrote:
> >> >> Obviously, any such fix would be a lot more likely to be reliable in
> >> >> 64-bit machines.  There's probably not enough daylight to be sure of
> >> >> making it work in 32-bit Windows, so I suspect we'd need some retry
> >> >> logic anyway for that case.
> >>
> >> > Yeah, that kind of thing can work assuming we don't get conflicts too
> >> > often, but it could be possible that conflicts are not reported from
> >> > ASLR enabled environments because of commit 7f3e17b4.
> >>
> >> Right, but Andres' point is that we should make an effort to undo that
> >> hack and instead allow ASLR to happen.  Not just because it's allegedly
> >> more secure, but because we may have no choice in future Windows versions.
> >
> > FWIW, I think it *also* might make us more secure, because addresses in
> > the postgres binary won't be predictable anymore.
> >
> 
> Agreed.  I have done some further study by using VMMap tool in Windows
> and it seems to me that all 64-bit processes use address range
> (0001 ~ 07FE).  I have attached two screen
> shots to show the address range (lower range and upper range).  You
> need to refer the lower half of the window in attached screenshots.
> At this stage, I am not completely sure whether we can assume some
> address out of this range to use in MapViewOfFileEx.  Let me know your
> thoughts.

That seems like a fairly restricted range (good!), and squares with
memories of reading about this (can't find the reference anymore).  Just
using something like 0x0F00 as the address might work
sufficiently well.  What happens if you just hardcode that in the first
MapViewOfFileEx call, and change RandomizedBaseAddress="FALSE" in
src/tools/msvc/VCBuildProject.pm to true?

Greetings,

Andres Freund




Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c

2017-04-20 Thread Peter Eisentraut
On 4/16/17 16:11, Petr Jelinek wrote:
> Yeah it is, it needs to be fenced to happen only after commit, which is
> not guaranteed at the point of code, we probably need to put the
> pgstat_report_stat() inside the if above after the
> CommitTransactionCommand() (that will make it report stats for changes
> apply did to pg_subscription_rel after next transaction though)

I think to avoid the latter, we should add more pgstat_report_stat()
calls, such as in process_syncing_tables_for_apply().  Basically every
code path that calls CommitTransactionCommand() should have one, no?
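
As a sketch (hypothetical helper name; the in-tree fix would just add
the call at each such site):

#include "postgres.h"

#include "access/xact.h"
#include "pgstat.h"

/*
 * Flush accumulated table-access stats right after each commit, so
 * counts are reported as soon as the transaction that produced them
 * is done rather than deferred to some later cycle.
 */
static void
commit_and_report_stats(void)
{
	CommitTransactionCommand();
	pgstat_report_stat(false);
}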

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-20 Thread Petr Jelinek
On 20/04/17 18:58, Peter Eisentraut wrote:
> On 4/18/17 22:13, Petr Jelinek wrote:
>> So my idea was to add some kind of in-use flag. This turned out to be a bit
>> more complicated in terms of how to clean it up than I would have hoped.
>> This is due to the fact that there is no way to reliably tell if a worker
>> has failed to start when the parent worker crashed while waiting.
>>
>> My solution to that is to use logic similar to autovacuum's, where we use a
>> timeout for the worker to attach to shmem. We do this only if no free slot
>> is found when the launch of a replication worker is requested.
> 
> It looks like launch_time is never set the current time in your patch.
> 

Oops, fixed.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 331cf79d58deff37d2697e1e1fe594c55a92cd42 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Wed, 19 Apr 2017 03:48:53 +0200
Subject: [PATCH] Fix various concurrency issues in logical replication worker
 launching

The code was originally written with the assumption that the launcher is
the only process starting workers. However, that hasn't been true since
commit 7c4f52409, which failed to modify the worker management code
adequately.

This patch adds an in_use field to the LogicalRepWorker struct to
indicate whether the worker slot is being used, and uses proper locking
everywhere this flag is set or read.

However, if the parent process dies while the new worker is starting and
the new worker fails to attach to shared memory, this flag would never
get cleared. We solve this rare corner case by adding a sort of garbage
collector for in_use slots. This uses another field in the
LogicalRepWorker struct, named launch_time, which contains the timestamp
of when the worker was started. If a request to start a new worker does
not find a free slot, we check for workers that were supposed to start
but took too long to actually do so, and reuse their slots.

In passing, also fix possible race conditions when stopping a worker
that hasn't finished starting yet.
---
 src/backend/replication/logical/launcher.c | 166 +++--
 src/include/replication/worker_internal.h  |   9 ++
 2 files changed, 140 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index f6ae610..761fbfa 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -38,6 +38,7 @@
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
 #include "replication/slot.h"
+#include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
 #include "storage/ipc.h"
@@ -76,6 +77,7 @@ static void ApplyLauncherWakeup(void);
 static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
 static void logicalrep_worker_detach(void);
+static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
 
 /* Flags set by signal handlers */
 volatile sig_atomic_t got_SIGHUP = false;
@@ -154,15 +156,19 @@ get_subscription_list(void)
 /*
  * Wait for a background worker to start up and attach to the shmem context.
  *
- * This is like WaitForBackgroundWorkerStartup(), except that we wait for
- * attaching, not just start and we also just exit if postmaster died.
+ * This is only needed for cleaning up the shared memory in case the worker
+ * fails to attach.
  */
-static bool
+static void
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
   
BackgroundWorkerHandle *handle)
 {
BgwHandleStatus status;
int rc;
+   uint16  generation;
+
+   /* Remember generation for future identification. */
+   generation = worker->generation;
 
for (;;)
{
@@ -170,18 +176,29 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
CHECK_FOR_INTERRUPTS();
 
+   LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+   /* Worker either died or started, no need to do anything. */
+   if (!worker->in_use || worker->proc)
+   {
+   LWLockRelease(LogicalRepWorkerLock);
+   return;
+   }
+
+   LWLockRelease(LogicalRepWorkerLock);
+
+		/* Check if worker has died before attaching and clean up after it. */
status = GetBackgroundWorkerPid(handle, &pid);
 
-		/*
-		 * Worker started and attached to our shmem. This check is safe
-		 * because only launcher ever starts the workers, so nobody can steal
-		 * the worker slot.
-		 */
-		if (status == BGWH_STARTED && worker->proc)
-			return true;
-		/* Worker didn't start or died before attaching to our shmem. */
if (sta

Re: [HACKERS] Logical replication and inheritance

2017-04-20 Thread Peter Eisentraut
On 4/18/17 01:58, Amit Langote wrote:
> We could be more explicit here and say the following instead:
> 
> create publication mypub for table p;
> ERROR:  "p" is a partitioned table
> DETAIL:  Adding partitioned tables to publications is not supported.
> 
> Thoughts?  (a patch is attached for consideration)

Committed.  I added an errhint, moved the tests around a bit to match
the existing structure, and added some documentation.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] AGG_HASHED cost estimate

2017-04-20 Thread Jeff Janes
On Thu, Apr 20, 2017 at 3:46 AM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Thu, Apr 20, 2017 at 7:47 AM, Jeff Janes  wrote:
> >
> > In cost_size.c, there is this comment block:
> >
> > +    * Note: in this cost model, AGG_SORTED and AGG_HASHED have exactly the
> > +    * same total CPU cost, but AGG_SORTED has lower startup cost.  If the
> > +    * input path is already sorted appropriately, AGG_SORTED should be
> > +    * preferred (since it has no risk of memory overflow).  This will happen
> > +    * as long as the computed total costs are indeed exactly equal --- but
> > +    * if there's roundoff error we might do the wrong thing.  So be sure
> > +    * that the computations below form the same intermediate values in the
> > +    * same order.
> >
> > But, why should they have the same cost in the first place?  A sorted
> > aggregation just has to do an equality comparison on each adjacent pair,
> > which is costed as (cpu_operator_cost * numGroupCols) * input_tuples.   A
> > hash aggregation has to do a hashcode computation for each input, which
> > apparently is also costed at (cpu_operator_cost * numGroupCols) *
> > input_tuples.
>
> I suspect we don't cost this.
>
> > But it also has to do the equality comparison between the
> > input tuple and any aggregation it is about to be aggregated into, to
> make
> > sure the hashcode is not a false collision.  That should be another
> > (cpu_operator_cost * numGroupCols) * (input_tuples - numGroups),
> shouldn't
> > it?
>
> but cost this without numGroups.
>
> /*
>  * The transCost.per_tuple component of aggcosts should be charged once
>  * per input tuple, corresponding to the costs of evaluating the
> aggregate
>  * transfns and their input expressions (with any startup cost of
> course
>  * charged but once).  The finalCost component is charged once per
> output
>  * tuple, corresponding to the costs of evaluating the finalfns.
>  *
>  * If we are grouping, we charge an additional cpu_operator_cost per
>  * grouping column per input tuple for grouping comparisons.
>  *
>


The hash join code uses cpu_operator_cost * num_hashclauses (in
initial_cost_hashjoin), so I was going by analogy with that in interpreting
what kind of grouping comparison it was referring to here--the initial hash
comparison or the final full-data comparison.  But yeah, I can see how it
was probably meant the other way.  Regardless, it seems like something is
getting overlooked.  The way final_cost_hashjoin charges for the actual
data comparison is via pg_proc.procost, rather than just assuming 1.0.  I
don't know if we would want to go to that effort in cost_agg or not; I
assume that there was a reason the code was put in final_cost_hashjoin
rather than initial_cost_hashjoin.  Anyway, assuming 1.0 is going to be a
lot closer to reality than assuming 0.0 is, if we don't want to go through
the work of looking up the actual procost.

The initial_cost_hashjoin also throws in an addition of cpu_tuple_cost, "to
model the costs of inserting the row into the hashtable". Based on the
gprof and perf output of some very simple aggregates, I would say that
cpu_tuple_cost is if anything an underestimate, and that it applies to all
the hash table look ups, whether they end up inserting (about numGroups) or
finding an existing one (approximately input_tuples - numGroups).
Currently in AGG_HASHED that is charged only for numGroups, although I
don't know if that charge is for inserting into the hash table, or for
walking the hash table at the end, projecting out tuples.   That it is
charged to total_cost rather than startup_cost suggests it is meant to
apply to walking the hash table at the end, rather than inserting into it.
Maybe it should be charged both on the way in and on the way out?
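
To make the accounting concrete, here is a rough sketch (not the current
cost_agg() code) of what the AGG_HASHED charges discussed above could look
like if both the equality recheck and the per-lookup tuple cost were
included; the variable names follow costsize.c:

	/* hash each input tuple */
	total_cost += cpu_operator_cost * numGroupCols * input_tuples;
	/* equality recheck against the matched group (non-insert lookups) */
	total_cost += cpu_operator_cost * numGroupCols *
		(input_tuples - numGroups);
	/* hash-table lookup/insert on the way in */
	total_cost += cpu_tuple_cost * input_tuples;
	/* walking the hash table and projecting on the way out */
	total_cost += cpu_tuple_cost * numGroups;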



>
> The reason may be that hashing isn't as costly as a comparison. I
> don't how true is that.
>

In a simple test case aggregating on md5(random()::text) strings, according
to gprof, it spends about 3 times more time in texteq than it does in
hash_any.  (And ~10% of those hash_any calls are not part of the
AGG_HASHED, but other code paths ).  But according to perf, it is only
about 40% more time in texteq.  I think I'd probably go with perf over
gprof here.

Both gprof and perf agree that tuplehash_insert and ExecStoreMinimalTuple
are quite a bit more expensive than either texteq or hash_any.  This is
with large hash tables (25 million tuples hashed to 3 million aggregates)
and I think a lot of the time goes to CPU cache misses, so they might not
be so bad if the hash tables were smaller.  I don't know how to model this,
though, if we need it to be accurate over both regimes.


Cheers,

Jeff


Re: [HACKERS] Fwd: WIP Patch: Precalculate stable functions

2017-04-20 Thread Tom Lane
Marina Polyakova  writes:
> Now in Postgresql only immutable functions are precalculated; stable 
> functions are calculated for every row so in fact they don't differ from 
> volatile functions.

> There's a proposal to precalculate stable and immutable functions (= 
> calculate once for all output rows, but as many times as function is 
> mentioned in query), if they don't return a set and their arguments are 
> constants or recursively precalculated functions.

Have you looked at the previous efforts in this direction?  The last
discussion I can find is

https://www.postgresql.org/message-id/flat/CABRT9RA-RomVS-yzQ2wUtZ%3Dm-eV61LcbrL1P1J3jydPStTfc6Q%40mail.gmail.com

In particular, that relied on the planner to decide which subtrees were
worth caching and insert marker nodes for the purpose.  I'm not certain
that that's better than putting the intelligence into execExpr.c, but
I'm not sure it isn't either.  In principle we could afford to spend
more effort on making such determinations at plan time than we should
do at executor startup.  Also, the fundamental implementation seemed
less invasive, in that only the marker node type had to know about the
caching behavior, whereas I gather from your description that what you
are doing is going to end up touching almost all node types.

v10's new expression eval technology is sufficiently different that
it may well be that that old approach isn't very relevant anymore.
But it would be a good idea to look.

regards, tom lane




Re: [HACKERS] pgbench more operators & functions

2017-04-20 Thread Fabien COELHO


Here is a v9 which includes some more cleanup, hopefully in the expected
direction, which is to make pgbench expressions behave like SQL
expressions, and I hope it takes all other feedback into account as well.



CONTEXT

Pgbench has been given an expression parser (878fdcb8) which allows the use
of full expressions instead of one-at-a-time operations. This parser has been
extended with functions (7e137f84) & a double type (86c43f4e). The first
batch of functions was essentially a proof of concept of how to add new
functions with various requirements. Pgbench's default "tpcb-like" test takes
advantage of these additions to reduce the number of lines it needs.



MOTIVATION

This patch aims at providing actually useful functions for benchmarking. The
functions and operators provided here are the usual basic operations. They
are not chosen randomly, but are simply taken from existing benchmarks:


In TPC-B 2.0.0 section 5.3.5 and TPC-C 5.11 section 2.5.1.2, the selection of 
accounts uses a test (if ...), logical conditions (AND, OR) and comparisons 
(<, =, >=, >).


In TPC-C 5.11 section 2.1.6, a bitwise or (|) is used to skew a distribution 
based on two uniform distributions.


In TPC-C 5.11 section 5.2.5.4, a log function is used to determine "think 
time", which can be truncated (i.e. "least" function, already in pgbench).



CONTENTS

The attached patch provides a consistent set of functions and operators based 
on the above examples, with operator precedence taken from postgres SQL 
parser:


- "boolean" type support is added, because it has been requested that pgbench 
should be as close as SQL expressions as possible. This induced some renaming 
as some functions & struct fields where named "num" because they where 
expecting an int or a double, but a boolean is not really a numeral.


- SQL comparisons (= <> < > <= >=) plus pg SQL "!=", which result in a 
boolean.


- SQL logical operators (and or not) on booleans.

- SQL bitwise operators taken from pg: | & # << >> ~.

- mod SQL function as a synonym for %.

- ln and exp SQL functions.

- SQL CASE/END conditional structure.

The patch also includes documentation and additional tap tests.
A test script is also provided.

This version is strict about typing, mimicking postgres behavior. For
instance, using an int as a boolean results in an error. It is easy to make it
more tolerant of types, which was the previous behavior before it was
suggested to follow SQL behavior.


Together with another submitted patch about retrieving query results, the
added capabilities allow implementing strictly conforming TPC-B transactions.
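
For illustration, a script fragment along these lines should become
expressible (a sketch only; the exact grammar is as defined by the patch):

\set aid random(1, 100000 * :scale)
\set big :aid > 50000 and :aid % 2 = 0
\set skewed CASE WHEN :big THEN :aid << 1 ELSE mod(:aid, 7) + 1 END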


Given the time scale to get things through, or eventually not,
here is an update I was initially planning to submit later on.

On top of new functions, operators and the boolean type provided in v9:

 - improve boolean conversion for "yes", "on" and other variants,
   in line with pg general behavior.

 - add support for NULL, including IS test variants.

 - conditions are quite permissive, i.e. non-zero numeric values are true
   in a test.

 - TAP tests are removed.
   I think there must be a TAP test, but the pgbench testing
   infrastructure is currently not very adequate.
   I've submitted an independent patch to enhance it:

   https://commitfest.postgresql.org/14/1118/

   that I suggest should be considered first. Once there is such
   convenient infra, I would update this patch to take advantage of it.

 - it cleans up a few things in the implementation

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 1eee8dc..a7a1f4a 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -825,14 +825,31 @@ pgbench  options  dbname
  
   Sets variable varname to a value calculated
   from expression.
-  The expression may contain integer constants such as 5432,
+  The expression may contain the NULL constant,
+  boolean constants TRUE and FALSE,
+  integer constants such as 5432,
   double constants such as 3.14159,
   references to variables :variablename,
-  unary operators (+, -) and binary operators
-  (+, -, *, /,
-  %) with their usual precedence and associativity,
-  function calls, and
-  parentheses.
+  operators
+  with their usual SQL precedence and associativity,
+  function calls,
+  SQL CASE generic conditional expressions
+  and parentheses.
+ 
+
+ 
+  Functions and most operators return NULL on
+  NULL input.
+ 
+
+ 
+  For conditional purposes, non zero numerical values are TRUE,
+  zero numerical values and NULL are FALSE.
+ 
+
+ 
+  When no final ELSE clause is provided to a CASE,
+  the default value is NULL.
  
 
  
@@ -917,6 +934,177 @@ pgbench  options  dbname
   
  
 
+ 
+  Built-In Operators
+
+  
+   The arithmetic, bitwise, comparison and logical operators listed in
+are built into pgbench
+   and may be used in e

Re: [HACKERS] pgbench tap tests & minor fixes

2017-04-20 Thread Fabien COELHO


Hello Nikolay,


> Hi! Since I used to be good at perl, I like tests, and I've dealt with
> postgres TAP tests before, I think I can review your patch. If it is OK
> for everyone.

I think that all good will is welcome, especially concerning code
reviews.

> For now I've just given this patch a quick look through... But
> nevertheless I have something to comment:


>> The attached patch:
>>
>> (1) extends the existing perl tap test infrastructure with methods to
>> test pgbench, i.e. "pgbench" which runs a pgbench test and
>> "pgbench_likes" which allows checking for expectations.



> I do not think it is a good idea to add these functions to
> PostgresNode.pm.

I thought it was :-)

> They are pgbench specific.

Sure.

> I do not think anybody will need them outside of pgbench tests.


Hmmm. The pre-existing TAP test in pgbench is about concurrent commits; it
is not there to test pgbench itself. Pgbench makes it very easy to run some
programmable clients in parallel, which cannot be done simply otherwise. I
think that having it there would encourage such uses.

Another point is that the client needs information held as attributes in
the node in order to connect to the server. Having it outside would mean
poking at those attributes from outside, which is pretty unclean as it
breaks the abstraction. For me, all pg standard executables could have
their methods in PostgresNode.pm.


Finally, it does not cost anything to have it there.
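
To illustrate the intended usage, a test could look roughly like this (a
hypothetical call shape based on the description above; the actual
signature is as defined in the patch):

  use PostgresNode;
  my $node = get_new_node('main');
  $node->init;
  $node->start;
  $node->pgbench(
      '--no-vacuum --client=2 --transactions=10',
      0,                          # expected exit status
      [ qr{processed: 20/20} ],   # expected stdout patterns
      'simple pgbench run');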


> Generally, these functions as they are now should be placed in a separate
> .pm file, like it was done in src/test/ssl/.

I put them there because it already manages both terminal client (psql) &
server side things (initdb, postgres...). Initially pgbench was a "contrib"
module but it was moved to being a standard client a few versions ago, so
for me it seemed logical to have the support there.

> Maybe you should move some code into some kind of helpers and keep them
> in PostgresNode.pm.


Hmmm. I could add a "run any client" function with the same effect and pass
an additional argument to tell that pgbench should be run, but this looks
quite contrived for no special reason.

> Or write universal functions that can be used for testing any postgres
> console tool. Then they can be placed in PostgresNode.pm.

There are already "psql"-specific functions... Why not pgbench-specific
ones?



>> (3) add a lot of new very small tests so that coverage jumps from very
>> low to over 90% for source files. [...]

> What tool did you use to calculate the coverage?

I followed the documented recipe:

https://www.postgresql.org/docs/devel/static/regress-coverage.html


> Lots of small tests look good at first glance, except that adding lots of
> small files with pgbench scripts is not a great idea. It clutters the
> source tree for no practical reason. I am speaking of the
> src/bin/pgbench/t/scripts/001_0* files.
>
> I think all the data from these files should be somehow imported into
> 001_pgbench.pl, and these files should be created only while the tests are
> running, and then removed when testing is done.

Hmmm. I thought of that but was very unconvinced by the approach: pgbench
basically always requires a file, so what the code was doing was writing the
script into a file, checking for possible errors when doing so, then
unlinking the file and checking for errors again after the run. Then you
have to do some escaping in the pgbench scripts themselves, and the perl
script becomes pretty horrible and unreadable, with plenty of perl, SQL and
backslash commands in strings... Finally, if the script is inside the perl
script it is hard to run the test outside of it when a problem is found, so
it is a pain.



> I think that the job of creating and removing these files should be moved
> to pgbench and pgbench_likes. But there is more than one way to do it. ;-)

Hmmm. See above.

> That's what I've noticed so far... I will give this patch a more attentive
> look soon.


--
Fabien.




[HACKERS] Fwd: WIP Patch: Precalculate stable functions

2017-04-20 Thread Marina Polyakova

Sorry, attached patch.

 Original message 
Subject: WIP Patch: Precalculate stable functions
Date: 20-04-2017 19:56
From: Marina Polyakova 
To: pgsql-hackers@postgresql.org

Hello everyone!

Now in PostgreSQL only immutable functions are precalculated; stable
functions are calculated for every row, so in fact they don't differ from
volatile functions.

There's a proposal to precalculate stable and immutable functions (=
calculate once for all output rows, but as many times as the function is
mentioned in the query), if they don't return a set and their arguments are
constants or recursively precalculated functions. The same applies to
operators' functions, strict functions, and tracking functions. It can be
very effective; for example, here's a comparison for full-text search in
messages (Intel® Core™ i5-6500 CPU @ 3.20GHz × 4, RAM 8 GB):


Without precalculation:

EXPLAIN (ANALYZE TRUE, BUFFERS TRUE) SELECT COUNT(*) FROM messages WHERE body_tsvector @@ to_tsquery('postgres');
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=18714.82..18714.83 rows=1 width=8) (actual time=2275.334..2275.334 rows=1 loops=1)
   Buffers: shared hit=309234 read=184261
   ->  Bitmap Heap Scan on messages  (cost=66.93..18702.34 rows=4991 width=0) (actual time=70.661..2247.462 rows=151967 loops=1)
         Recheck Cond: (body_tsvector @@ to_tsquery('postgres'::text))
         Rows Removed by Index Recheck: 118531
         Heap Blocks: exact=56726 lossy=33286
         Buffers: shared hit=309234 read=184261
         ->  Bitmap Index Scan on message_body_idx  (cost=0.00..65.68 rows=4991 width=0) (actual time=54.599..54.599 rows=151967 loops=1)
               Index Cond: (body_tsvector @@ to_tsquery('postgres'::text))
               Buffers: shared hit=1 read=37
 Planning time: 0.493 ms
 Execution time: 2276.412 ms
(12 rows)

With precalculation:

EXPLAIN (ANALYZE TRUE, BUFFERS TRUE) SELECT COUNT(*) FROM messages WHERE body_tsvector @@ to_tsquery('postgres');
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=192269.70..192269.71 rows=1 width=8) (actual time=1458.679..1458.680 rows=1 loops=1)
   Buffers: shared hit=309234 read=184261
   ->  Bitmap Heap Scan on messages  (cost=1445.68..191883.51 rows=154474 width=0) (actual time=70.069..1433.999 rows=151967 loops=1)
         Recheck Cond: (body_tsvector @@ to_tsquery('postgres'::text))
         Rows Removed by Index Recheck: 118531
         Heap Blocks: exact=56726 lossy=33286
         Buffers: shared hit=309234 read=184261
         ->  Bitmap Index Scan on message_body_idx  (cost=0.00..1406.81 rows=154474 width=0) (actual time=56.149..56.149 rows=151967 loops=1)
               Index Cond: (body_tsvector @@ to_tsquery('postgres'::text))
               Buffers: shared hit=1 read=37
 Planning time: 1.644 ms
 Execution time: 1459.836 ms
(12 rows)

The patch is attached. It isn't done yet:
- changing documentation (partly because of the next items);
- precalculation of IS DISTINCT FROM and NULLIF expressions which use
nonvolatile equality operators;
- precalculation of "scalar op ANY/ALL (array)" expressions which use
nonvolatile operators;
- precalculation of row compare expressions which use nonvolatile
operators.
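
For illustration, under the proposed rule these queries would behave
differently (a sketch of the intended behavior, not committed; the
query_text column is hypothetical):

-- stable function with a constant argument: evaluated once per query
SELECT COUNT(*) FROM messages WHERE body_tsvector @@ to_tsquery('postgres');
-- argument varies per row: cannot be precalculated
SELECT COUNT(*) FROM messages m WHERE m.body_tsvector @@ to_tsquery(m.query_text);
-- volatile function: never precalculated
SELECT COUNT(*) FROM messages WHERE random() < 0.5;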


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
+7 926 92 00 265

From 46d590281129083d524751805797ef0a3c386df0 Mon Sep 17 00:00:00 2001
From: Marina Polyakova 
Date: Thu, 20 Apr 2017 19:23:05 +0300
Subject: [PATCH 2/2] Precalculate stable functions

Now in Postgresql only immutable functions are precalculated; stable functions
are calculated for every row so in fact they don't differ from volatile
functions.

In this patch function / operator is precalculated if:
1) it doesn't return set,
2) it's not volatile itself,
3) its arguments are constants or nonvolatile too (functions or operators).
Costs are changed to reflect the changed behaviour.
---
 src/backend/executor/execExpr.c|  41 ++
 src/backend/executor/execExprInterp.c  | 196 +-
 src/backend/optimizer/path/costsize.c  |  84 ++-
 src/include/fmgr.h |   4 +
 src/include/nodes/primnodes.h  |   2 +
 .../expected/precalculate_stable_functions.out | 784 +
 src/test/regress/serial_schedule   |   1 +
 .../regress/sql/precalculate_stable_functions.sql  | 240 +++
 8 files changed, 1321 insertions(+), 31 deletions(-)
 create mode 100644 src/test/regress/expected/precalculate_stable_functions.out
 create mode 100644 src/test/regress/sql/precalculate_stable_functions.sql

diff 



Re: [HACKERS] tablesync patch broke the assumption that logical rep depends on?

2017-04-20 Thread Peter Eisentraut
On 4/18/17 22:13, Petr Jelinek wrote:
> So my idea was to add some kind of inuse flag. This turned out to be bit
> more complicated in terms of how to clean it than I would have hoped.
> This is due to the fact that there is no way to reliably tell if worker
> has failed to start if the parent worker crashed while waiting.
> 
> My solution to that is to use similar logic to autovacuum where we use
> timeout for worker to attach to shmem. We do this only if there is no
> free slot found when launch of replication worker was requested.

It looks like launch_time is never set to the current time in your patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] some review comments on logical rep code

2017-04-20 Thread Fujii Masao
On Thu, Apr 20, 2017 at 12:05 PM, Noah Misch  wrote:
> On Sun, Apr 16, 2017 at 06:14:49AM +, Noah Misch wrote:
>> On Fri, Apr 14, 2017 at 04:47:12AM +0900, Fujii Masao wrote:
>> > Though I've read only a part of the logical rep code yet, I'd like to
>> > share some (relatively minor) review comments that I got so far.
>> >
>> > In ApplyWorkerMain(), DatumGetInt32() should be used to get integer
>> > value from the argument, instead of DatumGetObjectId().
>> >
>> > No one resets on_commit_launcher_wakeup flag to false.
>> >
>> > ApplyLauncherWakeup() should be static function.
>> >
>> > Seems not only CREATE SUBSCRIPTION but also ALTER SUBSCRIPTION's
>> > subcommands (e.g., ENABLED one) should wake the launcher up on commit.
>> >
>> > Why is IsBackendPid() check necessary in ApplyLauncherWakeup()?
>> >
>> > Normal statements that the walsender for logical rep runs are logged
>> > by log_replication_commands. So if log_statement is also set,
>> > those commands are logged twice.
>> >
>> > LogicalRepCtx->launcher_pid isn't reset to 0 when the launcher emits
>> > an error. The callback function to reset it needs to be registered
>> > via on_shmem_exit().
>> >
>> > Typo: "an subscriber" should be "a subscriber" in some places.
>> >
>> > /* Get conninfo */
>> > datum = SysCacheGetAttr(SUBSCRIPTIONOID,
>> > tup,
>> > Anum_pg_subscription_subconninfo,
>> > &isnull);
>> > Assert(!isnull);
>> >
>> > This assertion makes me think that pg_subscription.subconnifo should
>> > have NOT NULL constraint. Also subslotname and subpublications
>> > should have NOT NULL constraint because of the same reason.
>> >
>> > SpinLockAcquire(&MyLogicalRepWorker->relmutex);
>> > MyLogicalRepWorker->relstate =
>> >   GetSubscriptionRelState(MyLogicalRepWorker->subid,
>> > MyLogicalRepWorker->relid,
>> > &MyLogicalRepWorker->relstate_lsn,
>> > false);
>> > SpinLockRelease(&MyLogicalRepWorker->relmutex);
>> >
>> > Non-"short-term" function like GetSubscriptionRelState() should not
>> > be called while spinlock is being held.
>>
>> [Action required within three days.  This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.  Otherwise, please observe the policy on
>> open item ownership[1] and send a status update within three calendar days of
>> this message.  Include a date for your subsequent status update.  Testers may
>> discover new open items at any time, and I want to plan to get them all fixed
>> well in advance of shipping v10.  Consequently, I will appreciate your 
>> efforts
>> toward speedy resolution.  Thanks.
>>
>> [1] 
>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I'm not the owner of this item, but the current status is:

I've pushed several patches, and there is now only one remaining patch.
I posted the review comment on that patch, and I'm expecting that
Masahiko-san will update the patch. So what about waiting for the updated
version of the patch by next Monday (April 24th)?

Regards,

-- 
Fujii Masao




Re: [HACKERS] some review comments on logical rep code

2017-04-20 Thread Fujii Masao
On Tue, Apr 18, 2017 at 5:16 PM, Masahiko Sawada  wrote:
> On Tue, Apr 18, 2017 at 12:24 PM, Kyotaro HORIGUCHI
>  wrote:
>> Hi,
>>
>> Thank you for the revised version.
>>
>> At Mon, 17 Apr 2017 23:29:28 +0900, Masahiko Sawada  
>> wrote in 
>>> On Mon, Apr 17, 2017 at 9:13 PM, Masahiko Sawada  
>>> wrote:
>>> > On Mon, Apr 17, 2017 at 7:39 PM, Kyotaro HORIGUCHI
>>> >  wrote:
>>> >> At Mon, 17 Apr 2017 18:02:57 +0900, Masahiko Sawada 
>>> >>  wrote in 
>>> >> 
>>> >>> On Fri, Apr 14, 2017 at 4:47 AM, Fujii Masao  
>>> >>> wrote:
>>> >>> 1.
>>> >>> >
>>> >>> > In ApplyWorkerMain(), DatumGetInt32() should be used to get integer
>>> >>> > value from the argument, instead of DatumGetObjectId().
>>> >>>
>>> >>> Attached 001 patch fixes it.
>>> >>
>>> >> Hmm. I looked at the launcher side and found that it assigns bare
>>> >> integer to bgw_main_arg. It also should use Int32GetDatum.
>>> >
>>> > Yeah, I agreed. Will fix it.
>>
>> Thanks.
>>
>>> >>> 2.
>>> >>> >
>>> >>> > No one resets on_commit_launcher_wakeup flag to false.
>>> >>>
>>> >>> Attached 002 patch makes on_commit_launcher_wakeup turn off after
>>> >>> waking up the launcher process.
>>> >>
>>> >> It is omitting the abort case. Maybe it should be
>>> >> AtEOXact_ApplyLauncher(bool commit)?
>>> >
>>> > Should we wake up the launcher process even when abort?
>>
>> No, I meant that on_commit_launcher_wakeup should be just reset
>> without waking up launcher on abort.
>
> I understood. Sounds reasonable.

ROLLBACK PREPARED should not wake up the launcher, but that case is not
handled even with the patch. To fix this issue, should we call
AtEOXact_ApplyLauncher() in FinishPreparedTransaction() or somewhere
similar?
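
A sketch of that idea (hypothetical, not a committed change): make the hook
commit-aware and call it from the prepared-transaction path too, so the flag
is always reset and the launcher is only woken on commit:

	void
	AtEOXact_ApplyLauncher(bool isCommit)
	{
		if (isCommit && on_commit_launcher_wakeup)
			ApplyLauncherWakeup();
		on_commit_launcher_wakeup = false;
	}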

Regards,

-- 
Fujii Masao




Re: [HACKERS] [PATCH] New command to monitor progression of long running queries

2017-04-20 Thread Remi Colinet
2017-04-19 18:41 GMT+02:00 Maksim Milyutin :

> On 19.04.2017 17:13, Remi Colinet wrote:
>
>> Maksim,
>>
>>
>> 2017-04-18 20:31 GMT+02:00 Maksim Milyutin:
>>
>> On 18.04.2017 17:39, Remi Colinet wrote:
>>
>>
>> Regarding the queryDesc state of the SQL query upon receiving a request
>> to report its execution progress, it does not bring any issue. The
>> request is noted when the signal is received by the monitored backend.
>> Then, the backend continues its execution code path. When interrupts
>> are checked in the executor code, the request will be dealt with.
>>
>>
>> Yes, interrupts are checked in the CHECK_FOR_INTERRUPTS entries.
>>
>> When the request is being dealt with, the monitored backend will stop
>> its execution and report the progress of the SQL query. Whatever the
>> status of the SQL query, the progress.c code checks the status and
>> reports either that the SQL query does not have a valid status, or
>> otherwise the current execution state of the SQL query.
>>
>> SQL query status checking is about:
>> - idle transaction
>> - out of transaction status
>> - null planned statement
>> - utility command
>> - self monitoring
>>
>> Other tests can be added if needed to exclude some SQL query states.
>> Such checking is done in void HandleProgressRequest(void).
>> I do not see why a SQL query progression would not be possible in this
>> context. Even when the queryDesc is NULL, we can just report an <idle
>> transaction> output. This is currently the case with the patch
>> suggested.
>>
>>
>> It's an interesting question - how usable the state of the active query
>> is at this stage of execution. Tom Lane noticed that
>> CHECK_FOR_INTERRUPTS doesn't give us a 100% guarantee about full
>> consistency [1].
>>
>>
>> I wonder what you mean by a usable state.
>>
>>
> A usable query state is suitable for analysis, IOW we have a consistent
> QueryDesc object. This term was introduced by Tom Lane in [1]. I suppose he
> meant the case when a query fails with an error and, before the transaction
> aborts, we bump into *CHECK_FOR_INTERRUPTS* in a place where QueryDesc may
> be inconsistent and further reading from it will give us an invalid result.
>

I could indeed trigger a segmentation fault because the nodes of the tree
may be in the middle of being freed. Some nodes may be partially filled, for
instance. But each node can be checked against a null pointer once the
monitored backend is no longer executing its query and is dumping its
progress state. So this is not a big deal in fact.


>
>
> Currently, the code suggested tests the queryDesc pointer and all the
>> sub nodes pointers in order to detect NULL pointers. When the progress
>> report is collected by the backend, this backend does the collect and
>> consequently does not run the query. So the query tree is not being
>> modified. At this moment, whatever is the query state, we can manage to
>> deal with its static state. It is only a tree which could also be just a
>> NULL pointer in the most extreme case. Such case is dealt in the current
>> code.
>>
>>
> Perhaps the deep checking of QueryDesc would allow us to consider it as
> consistent.
>
>
> 1. https://www.postgresql.org/message-id/24182.1472745492%40sss.pgh.pa.us
>
>
> --
> Maksim Milyutin
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>


Re: [HACKERS] Range Merge Join v1

2017-04-20 Thread Jeff Davis
On Thu, Apr 20, 2017 at 2:05 AM, Rafia Sabih
 wrote:
> Looks like an interesting idea, however, in an attempt to test this patch I
> found following error when compiling,
> selfuncs.c: In function ‘mergejoinscansel’:
> selfuncs.c:2901:12: error: ‘op_strategy’ undeclared (first use in this
> function)
>&op_strategy,
>

Looks like filterdiff destroyed my patch from git. Attaching unified
version against master 3820c63d.

Thanks!
 Jeff Davis
diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml
index 9557c16..84578a7 100644
--- a/doc/src/sgml/rangetypes.sgml
+++ b/doc/src/sgml/rangetypes.sgml
@@ -522,4 +522,74 @@ INSERT 0 1
 
   
  
+ 
+  Range Join
+
+  
+range type
+join
+  
+
+  
+A Range Join is a join where one of the join conditions is the range
+overlaps operator (see ). In other
+words, rather than returning rows where corresponding fields are equal, it
+returns rows where the corresponding fields overlap.
+  
+  
+For instance, to find users who were active on a website at the same time
+as they were using a mobile app, you might issue a query like:
+
+CREATE TABLE web_session(
+web_session_id text primary key,
+username text,
+during tstzrange);
+CREATE TABLE app_session(
+app_session_id text primary key,
+username text,
+during tstzrange);
+INSERT INTO web_session VALUES
+('0cc175b9c0f1b6a8', 'user1', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+('92eb5ffee6ae2fec', 'user2', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+('4a8a08f09d37b737', 'user3', '[2017-02-01 09:30, 2017-02-01 10:45)');
+INSERT INTO app_session VALUES
+('8277e0910d750195', 'user1', '[2017-02-01 10:30, 2017-02-01 11:45)'),
+('b448797616e091ad', 'user3', '[2017-02-01 09:00, 2017-02-01 11:00)'),
+('95649038408b5f33', 'user4', '[2017-02-01 09:30, 2017-02-01 10:45)');
+
+SELECT w.username,
+   w.during * a.during AS both_during
+FROM  web_session w, app_session a
+WHERE w.username = a.username
+AND w.during && a.during;
+ username | both_during 
+--+-
+ user1| ["2017-02-01 10:30:00-08","2017-02-01 10:45:00-08")
+ user3| ["2017-02-01 09:30:00-08","2017-02-01 10:45:00-08")
+(2 rows)
+
+  
+  
+In addition to the general execution strategies available, Postgres also
+has special support for a variant of Merge Join called Range Merge Join:
+
+ EXPLAIN (costs off) SELECT w.username,
+   w.during * a.during AS both_during
+FROM  web_session w, app_session a
+WHERE w.username = a.username
+AND w.during && a.during;
+  QUERY PLAN  
+--
+ Range Merge Join
+   Merge Cond: ((w.during && a.during) AND (w.username = a.username))
+   ->  Sort
+ Sort Key: w.during, w.username
+ ->  Seq Scan on web_session w
+   ->  Sort
+ Sort Key: a.during, a.username
+ ->  Seq Scan on app_session a
+(8 rows)
+
+  
+ 
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9359d0a..1010668 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -909,7 +909,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			pname = sname = "Nested Loop";
 			break;
 		case T_MergeJoin:
-			pname = "Merge";	/* "Join" gets added by jointype switch */
+			if (((MergeJoin*)plan)->mergeRangeJoin)
+pname = "Range Merge";	/* "Join" gets added by jointype switch */
+			else
+pname = "Merge";	/* "Join" gets added by jointype switch */
+
 			sname = "Merge Join";
 			break;
 		case T_HashJoin:
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 8c483bf..bd312e6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -89,14 +89,67 @@
  *		proceed to another state.  This state is stored in the node's
  *		execution state information and is preserved across calls to
  *		ExecMergeJoin. -cim 10/31/89
+ *
+ *		RANGE MERGE JOIN
+ *
+ *		Range Merge Join is a generalization of merge join. When a join
+ *		predicate involves the range overlaps operator (&&), a merge join
+ *		becomes a range join.
+ *
+ *		The input ranges must be ordered by (lower-bound, upper-bound), which
+ *		is the ordinary range_ops operator class. MJCompare must not simply
+ *		use the operator class's comparison semantics though; instead it
+ *		follows these rules:
+ *
+ *		  - return 0 if the outer range overlaps the inner range
+ *		  - return <0 if the outer range is strictly left-of the inner range
+ *		  - return >0 if the outer range is strictly right-of the inner range
+ *
+ *		NB: the above is a simplification considering only a single merge
+ *		clause. When there are multiple merge clauses, it's possible that one
+ *		tuple is neither right-of, nor left-o

Re: [HACKERS] Re: logical replication and PANIC during shutdown checkpoint in publisher

2017-04-20 Thread Peter Eisentraut
On 4/19/17 23:04, Noah Misch wrote:
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

We have a possible solution but need to work out a patch.  Let's say
next check-in on Monday.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-20 Thread Peter Eisentraut
On 4/20/17 07:52, Petr Jelinek wrote:
> On 20/04/17 05:57, Michael Paquier wrote:
>> 2nd thoughts here... Ah now I see your point. True that there is no
>> way to ensure that an unwanted command is not running when SIGUSR2 is
>> received as the shutdown checkpoint may have already begun. Here is an
>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
>> have the shutdown checkpoint wait until every WAL sender still running
>> has reached that state.
> 
> +1 to this solution

Michael, can you attempt to supply a patch?
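
For reference, a sketch of the idea (the existing values are from
walsender_private.h; WALSNDSTATE_STOPPING is the proposed addition):

	typedef enum WalSndState
	{
		WALSNDSTATE_STARTUP = 0,
		WALSNDSTATE_BACKUP,
		WALSNDSTATE_CATCHUP,
		WALSNDSTATE_STREAMING,
		WALSNDSTATE_STOPPING	/* proposed: reached before the shutdown
								 * checkpoint is allowed to run */
	} WalSndState;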

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Use sync commit for logical replication apply in TAP tests

2017-04-20 Thread Peter Eisentraut
On 4/20/17 09:46, Petr Jelinek wrote:
> On 20/04/17 14:57, Peter Eisentraut wrote:
>> On 4/18/17 22:25, Petr Jelinek wrote:
>>> The commit 887227a1c changed the defaults for subscriptions to do async
>>> commit. But since the tests often wait for disk flush and there is no
>>> concurrent activity this has increased the amount of time needed for
>>> each test. So the attached patch changes the subscriptions created in TAP
>>> tests to use sync commit, which improves performance there (because we
>>> also replicate only very few transactions in the tests).
>>
>> I see only a 1.5% (1s/70s) improvement in the run time of the whole test
>> suite from this.  Is that what you were expecting?
> 
> Hmm, on my machine the difference is around 50% (~1m vs ~2m).

Double hmm, I ran a few more tests on different machines and get
consistent but underwhelming improvements.

I don't mind applying the patch nonetheless, but maybe we can get a few
more test results from others.

(Instructions: apply the patch and time make -C src/test/subscription check)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] dtrace probes

2017-04-20 Thread Jesper Pedersen

Hi,

On 04/20/2017 09:24 AM, Amit Kapila wrote:

>> The lwlock dtrace probes define LWLockMode as int, and the
>> TRACE_POSTGRESQL_LWLOCK methods are called using both a variable and
>> constant definition.
>>
>> This leads to a mix of argument definitions depending on the call site,
>> as seen in the probes.txt file.
>>
>> A fix is to explicitly cast 'mode' to int such that all call sites will
>> use the
>>
>>  argument #2 4 signed   bytes
>>
>> definition. Attached patch does this.

> I think this fix is harmless and has some value in terms of
> consistency.  One minor suggestion is that you should leave a space
> after typecasting.
>
> - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
> + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode);
>
> There should be a space like "(int) mode".




v2 attached.


>> I have verified all dtrace probes for their type, and only the lock__
>> methods don't align with their actual types.

> Do you see any problem with that?

Not really, but it would make it clearer what the value space of each of
the parameters is.
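
For what it's worth, the consistency matters to probe consumers; e.g., a D
script along these lines (hypothetical usage, assuming a server built with
--enable-dtrace; probe names per src/backend/utils/probes.d) relies on the
mode argument always being a 4-byte signed int:

	postgresql$target:::lwlock-wait-start
	{
		printf("%s mode=%d\n", copyinstr(arg0), (int) arg1);
	}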




>> Depending on the feedback I can add this patch to the open item list in
>> order to fix it for PostgreSQL 10.

> Is there any commit in PG-10 which has caused this behavior?  If not,
> then I don't think it should be added to open items of PG-10.

It is really a bug fix, so it could even be back-patched.

Thanks for the feedback !

Best regards,
 Jesper


From 7175dc8e05ff703229bd6cab6b254587ffc076c8 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 18 Apr 2017 11:44:18 -0400
Subject: [PATCH] Explicit cast LWLockMode, which an enum, to int in order to
 match the dtrace definition of the lwlock methods. Thereby all call sites
 will have the same definition instead of a mix between signed and unsigned.

Author: Jesper Pedersen 
---
 src/backend/storage/lmgr/lwlock.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3e13394..abf5fbb 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1226,7 +1226,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 #endif
 
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int)mode);
 
 		for (;;)
 		{
@@ -1248,7 +1248,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		}
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");
@@ -1257,7 +1257,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		result = false;
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), (int)mode);
 
 	/* Add lock to list of locks held by this backend */
 	held_lwlocks[num_held_lwlocks].lock = lock;
@@ -1308,14 +1308,14 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 		RESUME_INTERRUPTS();
 
 		LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), (int)mode);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), (int)mode);
 	}
 	return !mustwait;
 }
@@ -1387,7 +1387,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 #endif
 
 			LWLockReportWaitStart(lock);
-			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), (int)mode);
 
 			for (;;)
 			{
@@ -1405,7 +1405,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 Assert(nwaiters < MAX_BACKENDS);
 			}
 #endif
-			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode);
 			LWLockReportWaitEnd();
 
 			LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened");
@@ -1435,7 +1435,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), (int)mode);
 	}
 	else
 	{
@@ -1443,7 +1443,7 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), (int)mode);
 	}
 
 	return !mustwait;
-- 
2.7.4



Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2017-04-20 Thread Ashutosh Bapat
The subject of this thread is not directly related to this discussion
and we have a new thread [1] for relevant discussion. So, let's
discuss this further on that thread.


[1] 
https://www.postgresql.org/message-id/CAF1DzPU8Kx%2BfMXEbFoP289xtm3bz3t%2BZfxhmKavr98Bh-C0TqQ%40mail.gmail.com

On Tue, Apr 18, 2017 at 9:30 AM, Kyotaro HORIGUCHI
 wrote:
> At Tue, 18 Apr 2017 09:12:07 +0530, Ashutosh Bapat 
>  wrote in 
> 
>> On Tue, Apr 18, 2017 at 8:12 AM, Kyotaro HORIGUCHI
>>  wrote:
>> > At Mon, 17 Apr 2017 17:50:58 +0530, Ashutosh Bapat 
>> >  wrote in 
>> > 
>> >> On Mon, Apr 17, 2017 at 1:53 PM, Kyotaro HORIGUCHI
>> >>  wrote:
>> >> > At Thu, 13 Apr 2017 13:04:12 -0400, Robert Haas  
>> >> > wrote in 
>> >> > 
>> >> >> On Thu, Apr 21, 2016 at 10:53 AM, Robert Haas  
>> >> >> wrote:
>> >> >> > On Thu, Apr 21, 2016 at 8:44 AM, Michael Paquier
>> >> >> >  wrote:
>> >> >> >> On Thu, Apr 21, 2016 at 5:22 PM, Etsuro Fujita
>> >> >> >>  wrote:
>> >> >> >>> Attached is an updated version of the patch, which modified 
>> >> >> >>> Michael's
>> >> >> >>> version of the patch, as I proposed in [1] (see "Other changes:"). 
>> >> >> >>>  I
>> >> >> >>> modified comments for pgfdw_get_result/pgfdw_exec_query also, 
>> >> >> >>> mainly because
>> >> >> >>> words like "non-blocking mode" there seems confusing (note that we 
>> >> >> >>> have
>> >> >> >>> PQsetnonbloking).
>> >> >> >>
>> >> >> >> OK, so that is what I sent except that the comments mentioning 
>> >> >> >> PG_TRY
>> >> >> >> are moved to their correct places. That's fine for me. Thanks for
>> >> >> >> gathering everything in a single patch and correcting it.
>> >> >> >
>> >> >> > I have committed this patch.  Thanks for working on this.  Sorry for 
>> >> >> > the delay.
>> >> >>
>> >> >> This 9.6-era patch, as it turns out, has a problem, which is that we
>> >> >> now respond to an interrupt by sending a cancel request and a
>> >> >> NON-interruptible ABORT TRANSACTION command to the remote side.  If
>> >> >> the reason that the user is trying to interrupt is that the network
>> >> >> connection has been cut, they interrupt the original query only to get
>> >> >> stuck in a non-interruptible wait for ABORT TRANSACTION.  That is
>> >> >> non-optimal.
>> >> >
>> >> > Agreed.
>> >> >
>> >> >> It is not exactly clear to me how to fix this.  Could we get by with
>> >> >> just slamming the remote connection shut, instead of sending an
>> >> >> explicit ABORT TRANSACTION?  The remote side ought to treat a
>> >> >> disconnect as equivalent to an ABORT anyway, I think, but maybe our
>> >> >> local state would get confused.  (I have not checked.)
>> >> >>
>> >> >> Thoughts?
>> >> >
>> >> > Perhaps we will get stuck at query cancellation before ABORT
>> >> > TRANSACTION in that case. A connection will be shut down when
>> >> > anything goes wrong (PQstatus(conn) != CONNECTION_OK and so on)
>> >> > while aborting the local transaction. So I don't think the fdw gets
>> >> > confused, or it shouldn't be confused, by shutting down there.
>> >> >
>> >> > The most significant issue I can see is that the same thing
>> >> > happens in the case of graceful ABORT TRANSACTION. It could be a
>> >> > performance issue.
>> >> >
>> >> > We could set timeout here but maybe we can just slamming the
>> >> > connection down instead of sending a query cancellation. It is
>> >> > caused only by timeout or interrupts so I suppose it is not a
>> >> > problem *with a few connections*.
>> >> >
>> >> >
>> >> > Things are a bit different with hundreds of connections. The
>> >> > penalty of reconnection would be very high in the case.
>> >> >
>> >> > If we are not willing to pay such a high penalty, maybe we should
>> >> > manage the busy-idle time of each connection and try a graceful
>> >> > abort if it is short enough, maybe with a short timeout.
>> >> >
>> >> > Furthermore, if most or all of the hundreds of connections get
>> >> > stuck, such timeouts will accumulate like a mountain...
>> >>
>> >> Even when the transaction is aborted because a user cancels a query,
>> >> we do want to preserve the connection, if possible, to avoid
>> >
>> > Yes.
>> >
>> >> reconnection. If the request to cancel the query itself fails, we
>> >> should certainly drop the connection. Here's the patch to do that.
>> >
>> > A problem I see with this is that we still try to make another
>> > connection for canceling, and it could stall for several minutes per
>> > connection on a packet stall, when it should be done in a second
>> > under ordinary circumstances. Perhaps what we want here is async
>> > canceling with a timeout.
>>
>> I am not sure what you mean by "make another connection for
>> cancelling". Do you mean to say that PQcancel would make another
>> connection?
>
> Yes. It will take about 3 minutes on standard settings when no
> ACK is returned. I thought that this discussion was based on such a
> situation.
>
>> The patch proposed simply improves the condition when PQcancel has
>> returned a failure. Right now we ignore that failure and

Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-04-20 Thread Ashutosh Bapat
On Thu, Apr 20, 2017 at 4:08 PM, Suraj Kharage
 wrote:
> Hello All,
>
> The query on a foreign table hangs for about 16 minutes before exiting
> when the network to the remote server goes down.
> statement_timeout is expected to work in this case as well, but when I
> tried it, it did not work as expected.
>
> Below is a test case to reproduce the issue: [I am testing this on two
> different systems]
>
> 1: Set the postgres_fdw
>
> postgres=# CREATE SERVER myserver FOREIGN DATA WRAPPER postgres_fdw OPTIONS
> (host '172.16.173.237', dbname 'postgres', port '5432',keepalives '1',
> keepalives_interval '3',keepalives_idle '3', keepalives_count '1');
> CREATE SERVER
> postgres=# CREATE USER MAPPING FOR postgres SERVER myserver OPTIONS (user
> 'postgres', password 'edb');
> CREATE USER MAPPING
> postgres=# CREATE FOREIGN TABLE test(id int) SERVER myserver;
> CREATE FOREIGN TABLE
> postgres=# select * from test;
>  id
> 
>   1
>  10
>   2
>   1
> (4 rows)
>
> postgres=# \d
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+--
>  public | test | foreign table | postgres
> (1 row)
> postgres=#
> postgres=# select * from pg_foreign_server ;
>  srvname  | srvowner | srvfdw | srvtype | srvversion | srvacl |
> srvoptions
>
> --+--++-+++--
> ---
>  myserver |   10 |  16387 | |||
> {host=172.16.173.237,dbname=postgres,port=5432,keepalives=1,keepalives_interval=3,keepalives_idle=3,keepalives_co
> unt=1}
> (1 row)
>
> 3. Execute the insert command.
> E.g: insert into test values (generate_series(1,100));
> * You need to take the network to the remote server down during the
> insert command.
>
> postgres=# set statement_timeout to 6000;
> SET
> postgres=# insert into test values (generate_series(1,100));
> WARNING:  server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> CONTEXT:  Remote SQL command: ABORT TRANSACTION
> ERROR:  could not receive data from server: No route to host
>
> CONTEXT:  Remote SQL command: INSERT INTO public.test(id) VALUES ($1)
> Time: 931427.002 ms
>
> It was in a hung state for approx 15-16 minutes before exiting.
>
> In pg_log, i see below error for above query:
> =
> 2017-04-20 15:22:02 IST ERROR:  could not receive data from server: No route
> to host
> 2017-04-20 15:22:02 IST CONTEXT:  Remote SQL command: INSERT INTO
> public.test(id) VALUES ($1)
> 2017-04-20 15:22:02 IST STATEMENT:  insert into test values
> (generate_series(1,100));
> 2017-04-20 15:22:02 IST WARNING:  server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> 2017-04-20 15:22:02 IST CONTEXT:  Remote SQL command: ABORT TRANSACTION
> ==
>
> I tested the patch submitted by Ashutosh Bapat to the community
> [https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com]
> to make statement_timeout work, and I can see statement_timeout
> working, but it is taking some extra time beyond statement_timeout.
>

Thanks for testing the patch.

> postgres=# set statement_timeout to 6000;
> SET
> postgres=#
> postgres=# \timing on
> Timing is on.
> postgres=# insert into test values (generate_series(1,100));
> -- [after executing query immediately disconnect the ethernet on remote
> server ]
> 2017-04-20 07:10:51.588 IST [10467] ERROR:  canceling statement due to
> statement timeout
> 2017-04-20 07:10:51.588 IST [10467] STATEMENT:  insert into test values
> (generate_series(1,100));
> 2017-04-20 07:11:25.590 IST [10467] WARNING:  discarding connection 0xfe4260
> because of failed cancel request: PQcancel() -- connect() failed: No route
> to host
> WARNING:  discarding connection 0xfe4260 because of failed cancel request:
> PQcancel() -- connect() failed: No route to host
>
> ERROR:  canceling statement due to statement timeout
> Time: 40001.765 ms (00:40.002)
> postgres=#
> postgres=#
>
> In the above case, I got the error related to statement timeout after 6
> seconds, but it is taking more time (approx 34 sec, and it varies each
> time, see below) to terminate the connection and to exit from the query.
> As per the tcp keepalives settings for the foreign server, it should take
> max 6 sec to terminate the connection.

The logs above show that 34 seconds elapsed between starting to abort
the transaction and knowing that the foreign server is unreachable. It
looks like it took that much time for the local server to realise that
the foreign server is not reachable. Looking at PQcancel code, it
seems to be trying to connect to the foreign server to cancel the
query. But somehow it doesn't seem to honor connect_timeout setting.
Is that expected?
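
For context, the cancel path boils down to something like this (a simplified
sketch of the postgres_fdw-style cleanup, error handling elided). PQcancel()
opens a fresh connection to the server with a blocking connect(), and
connect_timeout is a libpq connection option that PQcancel() does not read,
which would explain the behavior:

	char		errbuf[256];
	PGcancel   *cancel = PQgetCancel(conn);

	if (cancel != NULL)
	{
		if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
			elog(WARNING, "could not send cancel request: %s", errbuf);
		PQfreeCancel(cancel);
	}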

Irr

Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Petr Jelinek
On 20/04/17 14:41, Michael Paquier wrote:
> On Thu, Apr 20, 2017 at 8:47 PM, Petr Jelinek
>  wrote:
>> Or you can drop the slot manually on upstream.
> 
> Sure, but the point here is that if for example users have
> client_min_messages set at least at warning, they may have no idea
> that an underlying slot has been created. This is a confusing
> experience for users.
> 
> As subscription is a self-contained concept, it seems to me that any
> errors happening should at least try to do some cleanup action before
> just giving up processing, that would be a less frustrating
> experience.
> 

Hmm, well, since this only affects the synchronization of table
states/names, I guess we could simply do that before we create the
slot, as there is no expectation of consistency between the slot and the
table list snapshot.

Any other potential errors will be out of control of CreateSubscription
anyway.

Thoughts?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical replication ApplyContext bloat

2017-04-20 Thread Dilip Kumar
On Thu, Apr 20, 2017 at 7:04 PM, Stas Kelvich  wrote:
> Thanks for noting.
>
> Added short description of ApplyContext and ApplyMessageContext to README.


+ApplyContext --- permanent during whole lifetime of apply worker. It is
+possible to use TopMemoryContext here as well, but for simplicity of
memory usage
+analysys we spin up different context.


Typo

/analysys/analysis


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use sync commit for logical replication apply in TAP tests

2017-04-20 Thread Petr Jelinek
On 20/04/17 14:57, Peter Eisentraut wrote:
> On 4/18/17 22:25, Petr Jelinek wrote:
>> The commit 887227a1c changed the defaults for subscriptions to do async
>> commit. But since the tests often wait for disk flush and there is no
>> concurrent activity this has increased the amount of time needed for
>> each test. So the attached patch changes the subscriptions created in the
>> TAP tests to use sync commit, which improves performance there (because we
>> also replicate only very few transactions in the tests).
> 
> I see only a 1.5% (1s/70s) improvement in the run time of the whole test
> suite from this.  Is that what you were expecting?
> 

Hmm, on my machine the difference is around 50% (~1m vs ~2m).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-04-20 Thread Stas Kelvich

> On 19 Apr 2017, at 16:07, Alvaro Herrera  wrote:
> 
> Stas Kelvich wrote:
> 
>> With patch MemoryContextStats() shows following hierarchy during slot 
>> operations in 
>> apply worker:
>> 
>> TopMemoryContext: 83824 total in 5 blocks; 9224 free (8 chunks); 74600 used
>>  ApplyContext: 8192 total in 1 blocks; 6520 free (4 chunks); 1672 used
>>ApplyMessageContext: 8192 total in 1 blocks; 6632 free (11 chunks); 1560 
>> used
>>  ExecutorState: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used
>>ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
> 
> Please update src/backend/utils/mmgr/README to list the new contexts.


Thanks for noting.

Added short description of ApplyContext and ApplyMessageContext to README.



apply_contexts_2.patch
Description: Binary data




Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dtrace probes

2017-04-20 Thread Amit Kapila
On Tue, Apr 18, 2017 at 9:38 PM, Jesper Pedersen
 wrote:
> Hi,
>
> The lwlock dtrace probes define LWLockMode as int, and the
> TRACE_POSTGRESQL_LWLOCK methods are called using both a variable and
> constant definition.
>
> This leads to a mix of argument definitions depending on the call site, as
> seen in probes.txt file.
>
> A fix is to explicitly cast 'mode' to int so that all call sites will use
> the
>
>  argument #2 4 signed   bytes
>
> definition. Attached patch does this.
>

I think this fix is harmless and has some value in terms of
consistency.  One minor suggestion is that you should leave a space
after typecasting.

- TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode);

There should be a space like "(int) mode".


> I have verified all dtrace probes for their types, and only the lock__
> methods don't align with their actual types.
>

Do you see any problem with that?

>
> Depending on the feedback I can add this patch to the open item list in
> order to fix it for PostgreSQL 10.
>

Is there any commit in PG-10 which has caused this behavior?  If not,
then I don't think it should be added to open items of PG-10.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use sync commit for logical replication apply in TAP tests

2017-04-20 Thread Peter Eisentraut
On 4/18/17 22:25, Petr Jelinek wrote:
> The commit 887227a1c changed the defaults for subscriptions to do async
> commit. But since the tests often wait for disk flush and there is no
> concurrent activity this has increased the amount of time needed for
> each test. So the attached patch changes the subscriptions created in the
> TAP tests to use sync commit, which improves performance there (because we
> also replicate only very few transactions in the tests).

I see only a 1.5% (1s/70s) improvement in the run time of the whole test
suite from this.  Is that what you were expecting?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Highly Variable Planning Times

2017-04-20 Thread Greg Stark
On 19 April 2017 at 22:39, Michael Malis  wrote:

>> *At best*, you're doing substantial work in the
>> planner to avoid the first tree descent step or two in a single
>> non-partial index.

Fwiw, in addition to replacing the first few levels of the descent
with planning-time work, there's also an advantage due to the smaller
keys. Effectively these partial indexes are emulating
prefix-compression in the btree.
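
For readers following along, the shape of index being described is roughly
this (table, column and predicate names are invented for illustration):

CREATE INDEX events_div_promo_idx ON events (occurred_at)
WHERE css_hierarchy ~ 'div\.promo > a';

Because the selector lives in the predicate rather than in the key, the
btree stores only the short indexed column for the few matching rows.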

> While the specific example I gave in the post could be replaced with a
> non-partial index, most of the partial indexes contain predicates that
> are not straightforward to index with non-partial indexes. About 85%
> of the partial indexes contain a regular expression in them for CSS
> selector matching. I've tried using a trigram GIN index, but it wound
> up not working too well due to the data being highly redundant (almost
> every CSS hierarchy contains 'div' in it). Additionally, most of the
> predicates for partial indexes are extremely specific making the
> partial indexes very small. The sum total size of all of the partial
> indexes is still much smaller than if we were to index every necessary
> field with regular indexes.

I wonder if you could implement an FTS parser that tokenized HTML into
just the tokens representing the matching criteria. A GIN index using such
a parser would actually be very similar to what you have, as GIN
indexes are basically a collection of btrees...

The operational problem with that is that I think it would be even harder
to update a parser than to add a new partial index. I don't think you
can effectively upgrade a parser to include new tokens without
rebuilding any indexes using it. If you wanted to add new selector
criteria live, you would probably end up deploying the new parser and
building a new index with CREATE INDEX CONCURRENTLY using the new
parser, and then dropping the old index.
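
Concretely, the swap could look like this (a hedged sketch; the text search
configuration css_sel_v2 and the index names are invented):

CREATE INDEX CONCURRENTLY sessions_css_idx_v2
    ON sessions USING gin (to_tsvector('css_sel_v2', markup));
DROP INDEX CONCURRENTLY sessions_css_idx_v1;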

I'm not sure if it's possible to write an FTS parser that handles
arbitrary CSS selectors, but if you managed that, it would be a very
valuable addition to Postgres, IMHO.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench tap tests & minor fixes

2017-04-20 Thread Nikolay Shaplov
In a message of 17 April 2017 14:51:45, Fabien COELHO wrote:

> When developping new features for pgbench, I usually write some tests
> which are lost when the feature is committed. Given that I have submitted
> some more features and that part of pgbench code may be considered for
> sharing with pgsql, I think that improving the abysmal state of tests
> would be a good move.
Hi! Since I used to be good at Perl, I like tests, and I've dealt with
Postgres TAP tests before, I think I can review your patch, if that is OK
with everyone.

For now I've just given this patch a quick look-through... but nevertheless
I have some things to comment on:

> The attached patch:
> 
> (1) extends the existing perl tap test infrastructure with methods to test
> pgbench, i.e. "pgbench" which runs a pgbench test and "pgbench_likes"
> which allows to check for expectations.
I do not think it is a good idea to add these functions to PostgresNode.pm.
They are pgbench-specific; I do not think anybody will need them outside of
the pgbench tests.

Generally, these functions, as they are now, should be placed in a separate
.pm file, as was done in src/test/ssl/.

Maybe you should move some code into some kind of helpers and keep those in
PostgresNode.pm.

Or write universal functions that can be used for testing any Postgres
console tool; those could then be placed in PostgresNode.pm. See the sketch
below.
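
For illustration, the separate-module variant could look like this
(everything here -- module name, function name, signature -- is a sketch,
not the code from the patch under review):

package PgBenchTest;

use strict;
use warnings;
use Exporter 'import';
use IPC::Run ();
use Test::More;

our @EXPORT = qw(pgbench_likes);

# Run pgbench against $node and check its stdout against a list of regexes.
sub pgbench_likes
{
    my ($node, $opts, $likes, $name) = @_;
    my ($stdout, $stderr) = ('', '');
    local $ENV{PGPORT} = $node->port;
    IPC::Run::run([ 'pgbench', @$opts ], '>', \$stdout, '2>', \$stderr);
    like($stdout, $_, "$name: stdout matches") for @$likes;
}

1;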

> (3) add a lot of new very small tests so that coverage jumps from very low
> to over 90% for source files. I think that derived files (exprparse.c,
> exprscan.c) should be removed from coverage analysis.
> 
> Previous coverage status:
> 
>   exprparse.y 0.0 % 0 / 770.0 %0 / 8
>   exprscan.l  0.0 % 0 / 102   0.0 %0 / 8
>   pgbench.c   28.3 %  485 / 1716  43.1 %  28 / 65
> 
> New status:
> 
>   exprparse.y 96.1 %73 / 76   100.0 %  8 / 8
>   exprscan.l  92.8 %90 / 97   100.0 %  8 / 8
>   pgbench.c   90.4 %  1542 / 1705  96.9 % 63 / 65
> 
> The test runtime is about doubled on my laptop, which is not too bad given
> the coverage achieved.
What tool did you use to calculate the coverage?

Lots of small tests look good at first glance, except that adding lots of
small files with pgbench scripts is not a great idea. It clutters the code
tree for no practical reason. I am speaking of the
src/bin/pgbench/t/scripts/001_0* files.

I think all the data from these files should somehow be imported into
001_pgbench.pl, and the files should be created only while the tests are
running and then removed when testing is done.

I think the job of creating and removing these files should be moved to
pgbench and pgbench_likes. But there is more than one way to do it. ;-)
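
E.g. with File::Temp the script files never have to live in the source tree
at all (a sketch; the script contents are just an example):

use File::Temp qw(tempfile);

# Write the pgbench script on the fly; UNLINK removes it after the test.
my ($fh, $script) = tempfile('001_pgbench_XXXX', SUFFIX => '.sql',
                             UNLINK => 1);
print $fh "\\set n random(1, 1000)\nSELECT :n;\n";
close $fh;
# ... then run: pgbench -f $script ...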



That's what I've noticed so far... I will give this patch a more attentive
look soon.

-- 
Nikolay Shaplov, independent Perl & C/C++ developer. Available for hire.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Michael Paquier
On Thu, Apr 20, 2017 at 8:47 PM, Petr Jelinek
 wrote:
> Or you can drop the slot manually on upstream.

Sure, but the point here is that if, for example, users have
client_min_messages set to at least warning, they may have no idea
that an underlying slot has been created. This is a confusing
experience for users.

As a subscription is a self-contained concept, it seems to me that any
errors happening should at least try to do some cleanup action before
just giving up processing; that would be a less frustrating
experience.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-20 Thread Petr Jelinek
On 20/04/17 05:57, Michael Paquier wrote:
> On Thu, Apr 20, 2017 at 12:40 PM, Michael Paquier
>  wrote:
>> On Thu, Apr 20, 2017 at 4:57 AM, Peter Eisentraut
>>  wrote:
>>> I think the problem with a signal-based solution is that there is no
>>> feedback.  Ideally, you would wait for all walsenders to acknowledge the
>>> receipt of SIGUSR2 (or similar) and only then proceed with the shutdown
>>> checkpoint.
>>
>> Are you sure that it is necessary to go to such extent? Why wouldn't
>> it be enough to prevent any replication commands generating WAL to run
>> when the WAL sender knows that the postmaster is in shutdown mode?
> 
> 2nd thoughts here... Ah now I see your point. True that there is no
> way to ensure that an unwanted command is not running when SIGUSR2 is
> received as the shutdown checkpoint may have already begun. Here is an
> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
> the shutdown checkpoint does not run as long as all WAL senders still
> running do not reach such a state.
> 

+1 to this solution
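
For concreteness, the suggestion amounts to something like this (a sketch;
WALSNDSTATE_STOPPING is the proposed addition, not committed code):

typedef enum WalSndState
{
    WALSNDSTATE_STARTUP = 0,
    WALSNDSTATE_BACKUP,
    WALSNDSTATE_CATCHUP,
    WALSNDSTATE_STREAMING,
    WALSNDSTATE_STOPPING    /* proposed: set once the shutdown request is
                             * acknowledged; the shutdown checkpoint waits
                             * until all live walsenders report it */
} WalSndState;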

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Petr Jelinek
On 20/04/17 09:35, Michael Paquier wrote:
> On Thu, Apr 20, 2017 at 4:22 PM, Michael Paquier
>  wrote:
>> I am adding an open item.
> 
> Just adding something... When a subscription is created, if the step
> synchronizing tables fails then CREATE SUBSCRIPTION fails but the slot
> remains present on the publisher side, so trying to re-create the same
> subscription results in an error:
> 
> =# CREATE SUBSCRIPTION mysub CONNECTION 'port=5432' PUBLICATION mypub,
> insert_only;
> NOTICE:  0: Sleeping now...
> NOTICE:  0: created replication slot "mysub" on publisher
> LOCATION:  CreateSubscription, subscriptioncmds.c:411
> ERROR:  42P01: relation "public.aa" does not exist
> LOCATION:  RangeVarGetRelidExtended, namespace.c:400
> Time: 1033.739 ms (00:01.034)
> =# CREATE SUBSCRIPTION mysub CONNECTION 'port=5432 user=mpaquier
> dbname=mpaquier' PUBLICATION mypub, insert_only;
> NOTICE:  0: Sleeping now...
> LOCATION:  CreateSubscription, subscriptioncmds.c:376
> ERROR:  XX000: could not create replication slot "mysub": ERROR:
> replication slot "mysub" already exists
> LOCATION:  libpqrcv_create_slot, libpqwalreceiver.c:776
> 


CREATE SUBSCRIPTION mysub CONNECTION 'port=5432' PUBLICATION mypub,
insert_only WITH (NOCREATE SLOT);

Or you can drop the slot manually on upstream.



-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Petr Jelinek
On 20/04/17 09:22, Michael Paquier wrote:
> Hi,
> 
> I have noticed the following behavior with DROP SUBSCRIPTION followed
> by a cancel request. If the remote replication slot is dropped, the
> subscription may still be present locally:
> =# CREATE SUBSCRIPTION mysub CONNECTION 'port=5432 user=mpaquier
> dbname=mpaquier' PUBLICATION mypub, insert_only;
> NOTICE:  0: created replication slot "mysub" on publisher
> LOCATION:  CreateSubscription, subscriptioncmds.c:408
> NOTICE:  0: synchronized table states
> LOCATION:  CreateSubscription, subscriptioncmds.c:434
> CREATE SUBSCRIPTION
> =# DROP SUBSCRIPTION mysub;
> ^CCancel request sent
> NOTICE:  0: dropped replication slot "mysub" on publisher
> LOCATION:  DropSubscription, subscriptioncmds.c:873
> ERROR:  57014: canceling statement due to user request
> LOCATION:  ProcessInterrupts, postgres.c:2984
> 
> In this case the subscription is not dropped:
> =# select subname from pg_subscription;
>  subname
> -
>  mysub
> (1 row)
> But trying to issue once again a drop results in an error:
> =# DROP SUBSCRIPTION mysub;
> ERROR:  XX000: could not drop the replication slot "mysub" on publisher
> DETAIL:  The error was: ERROR:  replication slot "mysub" does not exist
> LOCATION:  DropSubscription, subscriptioncmds.c:869
> 
> A subscription with the same name cannot be created either, so there
> is nothing that the user can do except manually drop the slot on the
> publisher. It seems to me that the moment the slot is created
> should be a point of no return: the subscription has to be dropped once
> the replication slot is dropped on the remote.
> 

DROP SUBSCRIPTION mysub NODROP SLOT;


-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Interval for launching the table sync worker

2017-04-20 Thread Petr Jelinek
On 20/04/17 06:21, Masahiko Sawada wrote:
> On Thu, Apr 20, 2017 at 12:30 AM, Petr Jelinek
>  wrote:
>> On 19/04/17 15:57, Masahiko Sawada wrote:
>>> On Wed, Apr 19, 2017 at 10:07 PM, Petr Jelinek
>>>  wrote:
 On 19/04/17 14:42, Masahiko Sawada wrote:
> On Wed, Apr 19, 2017 at 5:12 PM, Kyotaro HORIGUCHI
>  wrote:
>> At Tue, 18 Apr 2017 18:40:56 +0200, Petr Jelinek 
>>  wrote in 
>> 
>>> On 18/04/17 18:14, Peter Eisentraut wrote:
 On 4/18/17 11:59, Petr Jelinek wrote:
> Hmm if we create hashtable for this, I'd say create hashtable for the
> whole table_states then. The reason why it's list now was that it 
> seemed
> unnecessary to have hashtable when it will be empty almost always but
> there is no need to have both hashtable + list IMHO.
>>
>> I understant that but I also don't like the frequent palloc/pfree
>> in long-lasting context and double loop like Peter.
>>
 The difference is that we blow away the list of states when the catalog
 changes, but we keep the hash table with the start times around.  We
 need two things with different life times.
>>
>> On the other hand, hash seems overdone. Addition to that, the
>> hash-version leaks stale entries while subscriptions are
>> modified. But vacuuming them costs high.
>>
>>> Why can't we just update the hashtable based on the catalog? I mean once
>>> the record is not needed in the list, the table has been synced so there
>>> is no need for the timestamp either since we'll not try to start the
>>> worker again.
>
> I guess the table sync worker for the same table could need to be
> started again. For example, please image a case where the table
> belonging to the publication is removed from it and the corresponding
> subscription is refreshed, and then the table is added to it again. We
> have the record of the table with timestamp in the hash table when the
> table sync in the first time, but the table sync after refreshed could
> have to wait for the interval.
>

 But why do we want to wait in such case where user has explicitly
 requested refresh?

>>>
>>> Yeah, sorry, I meant that we don't want to wait but cannot launch the
>>> tablesync worker in such case.
>>>
>>> But after more thought, the latest patch Peter proposed has the same
>>> problem. Perhaps we need to scan always whole pg_subscription_rel and
>>> remove the entry if the corresponding table get synced.
>>>
>>
>> Yes that's what I mean by "Why can't we just update the hashtable based
>> on the catalog". And if we do that then I don't understand why do we
>> need both hastable and linked list if we need to update both based on
>> catalog reads anyway.
> 
> Thanks, I've now understood correctly. Yes, I think you're right. If
> we update the hash table based on the catalog whenever table state is
> invalidated, we don't need to have both hash table and list.
> 
> BTW, in current HEAD SUBREL_STATE_SYNCWAIT is not stored in the
> pg_subscription_rel catalog. So the following condition seems incorrect.
> We should use "syncworker->relstate == SUBREL_STATE_SYNCWAIT"
> instead?
> 
> /*
>  * There is a worker synchronizing the relation and waiting for
>  * apply to do something.
>  */
> if (syncworker && rstate->state == SUBREL_STATE_SYNCWAIT)
> {
>     /*
>      * There are three possible synchronization situations here.
>      *
>      * a) Apply is in front of the table sync: We tell the table
>      *    sync to CATCHUP.
>      *
>      * b) Apply is behind the table sync: We tell the table sync
>      *    to mark the table as SYNCDONE and finish.
>      *
>      * c) Apply and table sync are at the same position: We tell
>      *    table sync to mark the table as READY and finish.
>      *
>      * In any case we'll need to wait for table sync to change
>      * the state in catalog and only then continue ourselves.
>      */
> 

Good catch. Although it's not the comment that's wrong, it's the if
condition. We should not compare rstate->state but syncworker->relstate.
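
I.e. something along these lines (a sketch of the fix, not a committed
patch):

- if (syncworker && rstate->state == SUBREL_STATE_SYNCWAIT)
+ if (syncworker && syncworker->relstate == SUBREL_STATE_SYNCWAIT)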

The reason why SUBREL_STATE_SYNCWAIT is not stored in the catalog is that
we'd have to commit from the sync worker, which could leave the table in a
partially synchronized state on crash, without the ability to resume
afterwards (since the slot on the other side would be gone).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] AGG_HASHED cost estimate

2017-04-20 Thread Dilip Kumar
On Thu, Apr 20, 2017 at 4:16 PM, Ashutosh Bapat
 wrote:
> But we cost this without subtracting numGroups:
>
> /*
>  * The transCost.per_tuple component of aggcosts should be charged once
>  * per input tuple, corresponding to the costs of evaluating the aggregate
>  * transfns and their input expressions (with any startup cost of course
>  * charged but once).  The finalCost component is charged once per output
>  * tuple, corresponding to the costs of evaluating the finalfns.
>  *
>  * If we are grouping, we charge an additional cpu_operator_cost per
>  * grouping column per input tuple for grouping comparisons.
>  *
>
> The reason may be that hashing isn't as costly as a comparison. I
> don't know how true that is.

Earlier, in the GatherMerge thread [1], Rushabh mentioned that HashAggregate
was getting picked where a grouping aggregate with GatherMerge was actually
faster at execution time, and he suspected problems with the costing of
HashAggregate. Maybe this is one of those cases?

[1]https://www.postgresql.org/message-id/CAGPqQf2164iV6k-_M75qEZWiCfRarA_SKSmHjc0Uh1rEf5RJrA%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-04-20 Thread Petr Jelinek
On 20/04/17 02:09, Andres Freund wrote:
> On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
>> I think we might need some more tests for this to be committable, so
>> it might not become committable tomorrow.
> 
> I'm working on some infrastructure around this.  Not sure if it needs to
> be committed, but it's certainly useful for evaluation.  Basically it's
> a small UDF that:
> 1) creates a slot via walsender protocol (to some dsn)
> 2) imports that snapshot into yet another connection to that dsn
> 3) runs some query over that new connection
> 
> That makes it reasonably easy to run e.g. pgbench and continually create
> slots, and use the snapshot to run queries "verifying" that things look
> good.  It's a bit shoestring-ed together, but everything else seems to
> require more code. And it's just test.
> 
> Unless somebody has a better idea?
> 

I don't. I mean, it would be nice to have the isolation tester support the
walsender protocol, but I don't know anything about isolation tester
internals, so I have no idea how much work that is. On top of that, some of
the issues are not even possible to provoke via the isolation tester (or
anything similar that would give us control over timing) unless we expose a
lot of the guts of xlog/xact as UDFs as well. So I think a simple function
that does what you said, plus pgbench, is a reasonable solution. I guess
you plan to make that one of the test/modules or something similar (the
UDF)?
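
E.g. it could then be driven from a pgbench script as simply as the
following, where the function name and arguments are purely illustrative of
the idea:

SELECT test_slot_import_snapshot('dbname=postgres',
                                 'SELECT count(*) FROM pgbench_accounts');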

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] AGG_HASHED cost estimate

2017-04-20 Thread Ashutosh Bapat
On Thu, Apr 20, 2017 at 7:47 AM, Jeff Janes  wrote:
>
> In cost_size.c, there is this comment block:
>
> +* Note: in this cost model, AGG_SORTED and AGG_HASHED have exactly the
> +* same total CPU cost, but AGG_SORTED has lower startup cost.  If the
> +* input path is already sorted appropriately, AGG_SORTED should be
> +* preferred (since it has no risk of memory overflow).  This will happen
> +* as long as the computed total costs are indeed exactly equal --- but
> +* if there's roundoff error we might do the wrong thing.  So be sure
> +* that the computations below form the same intermediate values in the
> +* same order.
>
> But, why should they have the same cost in the first place?  A sorted
> aggregation just has to do an equality comparison on each adjacent pair,
> which is costed as (cpu_operator_cost * numGroupCols) * input_tuples.   A
> hash aggregation has to do a hashcode computation for each input, which
> apparently is also costed at (cpu_operator_cost * numGroupCols) *
> input_tuples.

I suspect we don't cost this.

> But it also has to do the equality comparison between the
> input tuple and any aggregation it is about to be aggregated into, to make
> sure the hashcode is not a false collision.  That should be another
> (cpu_operator_cost * numGroupCols) * (input_tuples - numGroups), shouldn't
> it?

But we cost this without subtracting numGroups:

/*
 * The transCost.per_tuple component of aggcosts should be charged once
 * per input tuple, corresponding to the costs of evaluating the aggregate
 * transfns and their input expressions (with any startup cost of course
 * charged but once).  The finalCost component is charged once per output
 * tuple, corresponding to the costs of evaluating the finalfns.
 *
 * If we are grouping, we charge an additional cpu_operator_cost per
 * grouping column per input tuple for grouping comparisons.
 *

The reason may be that hashing isn't as costly as a comparison. I
don't know how true that is.
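
If we did want to charge it, the AGG_HASHED case would grow an extra term
along these lines (a sketch of Jeff's proposal, not the current code):

/* existing: one grouping-column operation per input tuple */
total_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
/* proposed: equality re-check against the matched hash entry for every
 * tuple that does not start a new group */
total_cost += (cpu_operator_cost * numGroupCols) *
    (input_tuples - numGroups);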

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-04-20 Thread Suraj Kharage
Hello All,

A query on a foreign table hangs for roughly 16 minutes before exiting when
the network to the remote server goes down. statement_timeout is expected
to work in this case as well, but when I tried it, statement_timeout did
not work as expected.

Below is a test case to reproduce the issue [I am testing this on two
different systems]:

1. Set up postgres_fdw:

postgres=# CREATE SERVER myserver FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(host '172.16.173.237', dbname 'postgres', port '5432',keepalives '1',
keepalives_interval '3',keepalives_idle '3', keepalives_count '1');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR postgres SERVER myserver OPTIONS (user
'postgres', password 'edb');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE test(id int) SERVER myserver;
CREATE FOREIGN TABLE
postgres=# select * from test;
 id
----
  1
 10
  2
  1
(4 rows)

postgres=# \d
        List of relations
 Schema | Name |     Type      |  Owner
--------+------+---------------+----------
 public | test | foreign table | postgres
(1 row)
postgres=#
postgres=# select * from pg_foreign_server ;
 srvname  | srvowner | srvfdw | srvtype | srvversion | srvacl | srvoptions
----------+----------+--------+---------+------------+--------+-------------------------------------------------------------------------------------------------------------------------
 myserver |       10 |  16387 |         |            |        | {host=172.16.173.237,dbname=postgres,port=5432,keepalives=1,keepalives_interval=3,keepalives_idle=3,keepalives_count=1}
(1 row)

3. Execute the insert command, e.g.:
insert into test values (generate_series(1,100));
* You need to take the network of the remote server down while the insert
command is running.

postgres=# set statement_timeout to 6000;
SET
postgres=# insert into test values (generate_series(1,100));
WARNING:  server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

CONTEXT:  Remote SQL command: ABORT TRANSACTION
ERROR:  could not receive data from server: No route to host

CONTEXT:  Remote SQL command: INSERT INTO public.test(id) VALUES ($1)
Time: 931427.002 ms

It was hung for approximately 15-16 minutes before exiting.

In pg_log, i see below error for above query:
=
2017-04-20 15:22:02 IST ERROR:  could not receive data from server: No
route to host
2017-04-20 15:22:02 IST CONTEXT:  Remote SQL command: INSERT INTO
public.test(id) VALUES ($1)
2017-04-20 15:22:02 IST STATEMENT:  insert into test values
(generate_series(1,100));
2017-04-20 15:22:02 IST WARNING:  server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-04-20 15:22:02 IST CONTEXT:  Remote SQL command: ABORT TRANSACTION
==

I tested the patch submitted by Ashutosh Bapat on the community list [
https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com]
to make statement_timeout work, and I can see that statement_timeout is
working, but it is taking some extra time beyond the statement_timeout.

postgres=# set statement_timeout to 6000;
SET
postgres=#
postgres=# \timing on
Timing is on.
postgres=# insert into test values (generate_series(1,100));
-- [after executing query immediately disconnect the ethernet on remote
server ]
2017-04-20 07:10:51.588 IST [10467] ERROR:  canceling statement due to
statement timeout
2017-04-20 07:10:51.588 IST [10467] STATEMENT:  insert into test values
(generate_series(1,100));
2017-04-20 07:11:25.590 IST [10467] WARNING:  discarding connection
0xfe4260 because of failed cancel request: PQcancel() -- connect() failed:
No route to host
WARNING:  discarding connection 0xfe4260 because of failed cancel request:
PQcancel() -- connect() failed: No route to host

ERROR:  canceling statement due to statement timeout
Time: 40001.765 ms (00:40.002)
postgres=#
postgres=#

In the above case, I got the error related to statement timeout after 6
seconds, but it is taking more time (approx 34 sec, and it varies each time
if you see below) to terminate the connection and to exit from the query.
As per the TCP keepalives settings for the foreign server, it should take
max 6 sec to terminate the connection.
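
(That expectation follows from the settings above: keepalives_idle = 3 s
before the first probe, plus keepalives_interval = 3 s for the single probe
allowed by keepalives_count = 1, i.e. roughly 3 + 3 = 6 seconds.)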


postgres=#
postgres=# set statement_timeout to 2;
SET
Time: 0.254 ms
postgres=#
postgres=# insert into test values (generate_series(1,100));
2017-04-20 07:13:25.666 IST [10467] ERROR:  canceling statement due to
statement timeout
2017-04-20 07:13:25.666 IST [10467] STATEMENT:  insert into test values
(generate_series(1,100));
2017-04-20 07:13:43.668 IST [10467] WARNING:  discarding connection
0xfe4260 because of failed cancel request: PQcancel() -- connect() failed:
No route to host
WARNING:  discarding connection 0xfe4260 because of failed cancel request:
PQcancel() -- connect() failed: No route to host

ERROR:  canceling statement due to statement timeout
Time: 38004

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-04-20 Thread Ashutosh Bapat
On Thu, Apr 20, 2017 at 3:35 PM, Amit Langote
 wrote:
> On 2017/04/20 15:45, Ashutosh Bapat wrote:
>> On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas  wrote:
>>> I don't understand why you think that partition-wise join needs any
>>> new logic here; if this were a non-partitionwise join, we'd similarly
>>> need to use the correct operator, but the existing code handles that
>>> just fine.  If the join is performed partition-wise, it should use the
>>> same operators that would have been used by a non-partitionwise join
>>> between the same tables.
>>>
>>> I think the choice of operator depends only on the column types, and
>>> that the "width" of those types has nothing to do with it.  For
>>> example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
>>> operator for an A/B join or a B/C join will be the one that appears in
>>> the query; parse analysis will have identified which specific operator
>>> is meant based on the types of the columns.  If the optimizer
>>> subsequently decides to reorder the joins and perform the A/C join
>>> first, it will go hunt down the operator with the same strategy number
>>> in the same operator family that takes the type of A.x on one side and
>>> the type of C.x on the other side.  No problem.  A partition-wise join
>>> between A and C will use that same operator; again, no problem.
>>>
>>> Your example involves joining the output of a join between i4 and i8
>>> against i2, so it seems there is some ambiguity about what the input
>>> type should be.  But, again, the planner already copes with this
>>> problem.  In fact, the join is performed either using i4.x or i8.x --
>>> I don't know what happens, or whether it depends on other details of
>>> the query or the plan -- and the operator which can accept that value
>>> on one side and i2.x on the other side is the one that gets used.
>>
>> I think you are confusing join condition application and partition
>> bounds of a join relation. What you have described above is how
>> operators are chosen to apply join conditions - it picks up the
>> correct operator from the operator family based on the column types
>> being used in join condition. That it can do because the columns being
>> joined are both present the relations being joined, irrespective of
>> which pair of relations is being joined. In your example, A.x, B.x and
>> C.x are all present on one of the sides of join irrespective of
>> whether the join is executed as (AB)C, A(BC) or (AC)B.
>>
>> But the problem we are trying to solve here is about the partition bounds
>> of the join relation: what should be the partition bounds of AB, BC or
>> AC? When we compare the partition bounds of an intermediate join with
>> those of another intermediate join (e.g. AB with those of C), what operator should
>> be used? You seem to be suggesting that we keep as many sets of
>> partition bounds as there are base relations participating in the join
>> and then use appropriate partition bounds based on the columns in the
>> join conditions, so that we can use the same operator as used in the
>> join condition. That doesn't seem to be a good option since the
>> partition bounds will all have same values, only differing in their
>> binary representation because of differences in data types. I am of
>> the opinion that we save a single set of partition bounds. We have to
>> then associate a data type with bounds to know binary representation
>> of partition bound datums. That datatype would be one of the partition
>> key types of the joining relations. I may be wrong in using the term "wider"
>> since it's associated with the length of the binary representation. But we
>> need some logic to coalesce the two data types based on the type of
>> join and key type on the outer side.
>
> FWIW, I think that using any one of the partition bounds of the baserels
> being partitionwise-joined should suffice as the partition bound of any
> combination of joins involving two or more of those baserels, as long as
> the partitioning operator of each of the baserels is in the same operator
> family (I guess that *is* checked somewhere in the partitionwise join
> consideration flow).  IOW, partopfamily[] of all of the baserels should
> match and then the join clause operators involved should belong to the
> same respective operator families.

The partition bounds of different base rels may be different, and we
have to compare them. Even if we say that we join two tables with the same
partition bounds using partition-wise join, we need to make sure that
those partition bounds are indeed the same, which requires comparing them.
And to compare any datum we need to know its type.
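
The case under discussion, sketched as DDL (illustrative only):

CREATE TABLE a (x int4) PARTITION BY RANGE (x);
CREATE TABLE a1 PARTITION OF a FOR VALUES FROM (0) TO (100);

CREATE TABLE b (x int8) PARTITION BY RANGE (x);
CREATE TABLE b1 PARTITION OF b FOR VALUES FROM (0) TO (100);

The bounds 0 and 100 are "the same", but one set is int4 datums and the
other int8 datums; to join a and b partition-wise we must first decide that
the bound sets match, and that comparison has to pick an operator, and
hence a type, for the merged bounds.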

>
> ISTM, the question here is about how to derive the partitioning properties
> of joinrels from those of the baserels involved.  Even if the join
> conditions refer to columns of different types on two sides, as long as
> the partitioning and joining is known to occur using operators of
> compatible semantics, I don't understand what more needs to be considered
> or done.

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-04-20 Thread Amit Langote
On 2017/04/20 15:45, Ashutosh Bapat wrote:
> On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas  wrote:
>> I don't understand why you think that partition-wise join needs any
>> new logic here; if this were a non-partitionwise join, we'd similarly
>> need to use the correct operator, but the existing code handles that
>> just fine.  If the join is performed partition-wise, it should use the
>> same operators that would have been used by a non-partitionwise join
>> between the same tables.
>>
>> I think the choice of operator depends only on the column types, and
>> that the "width" of those types has nothing to do with it.  For
>> example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
>> operator for an A/B join or a B/C join will be the one that appears in
>> the query; parse analysis will have identified which specific operator
>> is meant based on the types of the columns.  If the optimizer
>> subsequently decides to reorder the joins and perform the A/C join
>> first, it will go hunt down the operator with the same strategy number
>> in the same operator family that takes the type of A.x on one side and
>> the type of C.x on the other side.  No problem.  A partition-wise join
>> between A and C will use that same operator; again, no problem.
>>
>> Your example involves joining the output of a join between i4 and i8
>> against i2, so it seems there is some ambiguity about what the input
>> type should be.  But, again, the planner already copes with this
>> problem.  In fact, the join is performed either using i4.x or i8.x --
>> I don't know what happens, or whether it depends on other details of
>> the query or the plan -- and the operator which can accept that value
>> on one side and i2.x on the other side is the one that gets used.
> 
> I think you are confusing join condition application and partition
> bounds of a join relation. What you have described above is how
> operators are chosen to apply join conditions - it picks up the
> correct operator from the operator family based on the column types
> being used in the join condition. That it can do because the columns being
> joined are both present in the relations being joined, irrespective of
> which pair of relations is being joined. In your example, A.x, B.x and
> C.x are all present on one of the sides of join irrespective of
> whether the join is executed as (AB)C, A(BC) or (AC)B.
> 
> But the problem we are trying to solve here is about the partition bounds
> of the join relation: what should be the partition bounds of AB, BC or
> AC? When we compare the partition bounds of an intermediate join with
> those of another intermediate join (e.g. AB with those of C), what operator should
> be used? You seem to be suggesting that we keep as many sets of
> partition bounds as there are base relations participating in the join
> and then use appropriate partition bounds based on the columns in the
> join conditions, so that we can use the same operator as used in the
> join condition. That doesn't seem to be a good option since the
> partition bounds will all have same values, only differing in their
> binary representation because of differences in data types. I am of
> the opinion that we save a single set of partition bounds. We have to
> then associate a data type with bounds to know binary representation
> of partition bound datums. That datatype would be one of the partition
> key types of the joining relations. I may be wrong in using the term "wider"
> since it's associated with the length of the binary representation. But we
> need some logic to coalesce the two data types based on the type of
> join and key type on the outer side.

FWIW, I think that using any one of the partition bounds of the baserels
being partitionwise-joined should suffice as the partition bound of any
combination of joins involving two or more of those baserels, as long as
the partitioning operator of each of the baserels is in the same operator
family (I guess that *is* checked somewhere in the partitionwise join
consideration flow).  IOW, partopfamily[] of all of the baserels should
match and then the join clause operators involved should belong to the
same respective operator families.

ISTM, the question here is about how to derive the partitioning properties
of joinrels from those of the baserels involved.  Even if the join
conditions refer to columns of different types on two sides, as long as
the partitioning and joining is known to occur using operators of
compatible semantics, I don't understand what more needs to be considered
or done.  Although, I haven't studied things in enough detail to say
anything confidently about whether join being INNER or OUTER has any
bearing on the semantics of the partitioning of the joinrels in question.
IIUC, using partitioning properties to apply partitionwise join technique
at successive join levels will be affected by the OUTER considerations
similar to how they affect what levels a given EquivalenceClass clause
could be applied at without causing any semantics violations.

Re: [HACKERS] Range Merge Join v1

2017-04-20 Thread Rafia Sabih
On Thu, Apr 6, 2017 at 2:13 PM, Jeff Davis  wrote:

> =========
> Example:
> =========
>
> Find different people using the same website at the same time:
>
> create table session(sessionid text, username text, during tstzrange);
> SELECT s1.username, s2.username, s1.during * s2.during
>   FROM session s1, session s2
>   WHERE s1.during && s2.during AND s1.username < s2.username
>
>
> =========
> Brief summary of previous discussion:
> =========
>
> http://www.postgresql.org/message-id/1334554850.10878.52.camel@jdavis
>
> - can indexes solve it already (parameterized paths, etc.)?
> - planner complexity (track path keys, mergejoinable equality, etc.)
> - spatial join algorithm choice
> - compare range merge join against existing indexes to see if any
>   indexing approach is viable
>
> =========
> Externals:
> =========
>
> No new syntax or other externals. Just improved execution strategy for
> joining ranges using the overlaps operator (&&).
>
> New syntax is possible, but I don't see a strong reason to support
> special range join syntax at this time.
>
> =========
> Optimizer Design
> =========
>
> I took the path of least resistance, so if I am doing something wrong
> please let me know. I hardwired it to look for the range overlaps
> operator when checking if it could do a merge join, and then add
> range_ops to restrictinfo->mergeopfamilies.
>
> Then, I have to suppress adding it as an equivalence class, because
> overlaps is not equality. It still adds single-member ECs to
> restrictinfo->right_ec and left_ec, but (I think) that's OK.
>
> Also, I have to prevent generating paths for right/full join, because
> the range join algorithm can't (currently) handle those.
>
> =========
> Costing
> =========
>
> Needs more consideration. Seems to do reasonable things in the few
> examples I tried.
>
> =========
> Executor Design
> =========
>
> See detailed comments in nodeMergejoin.c
>
> =========
> Performance
> =========
>
> Seems much better than index nest loop join when the join is not
> selective. I will post detailed numbers soon.
>
> If no index is available, range merge join is the only reasonable way
> to execute a range join. For instance, if the inner is not a leaf in
> the plan.
>
> =========
> Alternatives:
> =========
>
> It was suggested that I approach it as a general spatial-join
> problem. That has merit, but I rejected it for now because the
> algorithm that I think has the most promise is range-oriented
> anyway. By that I mean we would need to extract all of the dimensions
> into ranges first, and then perform the join. With that in mind, it
> makes sense to implement range joins first; and then later provide the
> APIs to get at the spatial dimensions so that we can implement other
> spatial joins as range joins.
>
> Another suggestion was to base it on hash join, but I never understood
> that proposal and it didn't seem to map very well to a spatial join.
>
> Yet another suggestion was to use some kind of temporary index. Some
> brief experiments I did indicated that it would be fairly slow (though
> more investigation might be useful here). Also, it doesn't provide any
> alternative to the nestloop-with-inner-index we already offer at the
> leaf level today.
>
> Regards,
>  Jeff Davis
>

Looks like an interesting idea; however, in an attempt to test this patch I
found the following error when compiling:

selfuncs.c: In function ‘mergejoinscansel’:
selfuncs.c:2901:12: error: ‘op_strategy’ undeclared (first use in this function)
    &op_strategy,
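
Presumably the local declaration went missing in mergejoinscansel(), so a
sketch of the trivial fix would be to add, before the
get_op_opfamily_properties() call:

+   int         op_strategy;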

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


[HACKERS] Patch - Tcl 8.6 version support for PostgreSQL

2017-04-20 Thread Paresh More
Hello Team

This is regarding Tcl version (8.6) support in PostgreSQL.

Currently the PostgreSQL server build file (src/tools/msvc/Mkvcbuild.pm)
supports only up to Tcl version 8.5.

The attached patch adds Tcl 8.6 support.


Please note: in Tcl 8.6, the library name now contains an extra 't' in it.
Tcl 8.6 - tcl86t.lib
Tcl 8.5 - tcl85.lib
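
So the build script has to pick the library name by version, along these
lines (a sketch only; $tcl_dir and the surrounding Mkvcbuild.pm code are
assumptions):

my $tcllib = -e "$tcl_dir\\lib\\tcl86t.lib" ? 'tcl86t.lib' : 'tcl85.lib';
$proj->AddLibrary("$tcl_dir\\lib\\$tcllib");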

Can you please review the patch and commit it to PostgreSQL?


-- 

Thanks & Regards

*Paresh More*


Pune, India.


Server_build_tcl86.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow to store select results into variables

2017-04-20 Thread Tatsuo Ishii
> Hello,
> 
> I think you'd better change the following comments because there's
>> no psqlscan.l or psqlscanslash.l in pgbench source tree.
>>
>> + * underscore.  Keep this in sync with the definition of
>> variable_char in
>> + * psqlscan.l and psqlscanslash.l.
> 
> Here is a v3 with a more precise comment.

Looks good to me. I have marked the patch status as "Ready for
committer".

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Michael Paquier
On Thu, Apr 20, 2017 at 4:22 PM, Michael Paquier
 wrote:
> I am adding an open item.

Just adding something... When a subscription is created, if the step
synchronizing tables fails then CREATE SUBSCRIPTION fails but the slot
remains present on the publisher side, so trying to re-create the same
subscription results in an error:

=# CREATE SUBSCRIPTION mysub CONNECTION 'port=5432' PUBLICATION mypub,
insert_only;
NOTICE:  0: Sleeping now...
NOTICE:  0: created replication slot "mysub" on publisher
LOCATION:  CreateSubscription, subscriptioncmds.c:411
ERROR:  42P01: relation "public.aa" does not exist
LOCATION:  RangeVarGetRelidExtended, namespace.c:400
Time: 1033.739 ms (00:01.034)
=# CREATE SUBSCRIPTION mysub CONNECTION 'port=5432 user=mpaquier
dbname=mpaquier' PUBLICATION mypub, insert_only;
NOTICE:  0: Sleeping now...
LOCATION:  CreateSubscription, subscriptioncmds.c:376
ERROR:  XX000: could not create replication slot "mysub": ERROR:
replication slot "mysub" already exists
LOCATION:  libpqrcv_create_slot, libpqwalreceiver.c:776

I have created a simple table aa (a int) on the publisher first, where
a publication with ALL TABLES has been created:
CREATE PUBLICATION mypub FOR ALL TABLES;
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] DROP SUBSCRIPTION, query cancellations and slot handling

2017-04-20 Thread Michael Paquier
Hi,

I have noticed the following behavior with DROP SUBSCRIPTION followed
by a cancel request. If the remote replication slot is dropped, the
subscription may still be present locally:
=# CREATE SUBSCRIPTION mysub CONNECTION 'port=5432 user=mpaquier
dbname=mpaquier' PUBLICATION mypub, insert_only;
NOTICE:  0: created replication slot "mysub" on publisher
LOCATION:  CreateSubscription, subscriptioncmds.c:408
NOTICE:  0: synchronized table states
LOCATION:  CreateSubscription, subscriptioncmds.c:434
CREATE SUBSCRIPTION
=# DROP SUBSCRIPTION mysub;
^CCancel request sent
NOTICE:  0: dropped replication slot "mysub" on publisher
LOCATION:  DropSubscription, subscriptioncmds.c:873
ERROR:  57014: canceling statement due to user request
LOCATION:  ProcessInterrupts, postgres.c:2984

In this case the subscription is not dropped:
=# select subname from pg_subscription;
 subname
-
 mysub
(1 row)
But trying to issue once again a drop results in an error:
=# DROP SUBSCRIPTION mysub;
ERROR:  XX000: could not drop the replication slot "mysub" on publisher
DETAIL:  The error was: ERROR:  replication slot "mysub" does not exist
LOCATION:  DropSubscription, subscriptioncmds.c:869

A subscription with the same name cannot be created either, so there
is nothing that the user can do except manually drop the slot on the
publisher. It seems to me that the moment the slot is created
should be a point of no return: the subscription has to be dropped once
the replication slot is dropped on the remote.

I am adding an open item.
Thanks,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers