Re: [HACKERS] Testing WAL replay by comparing before and after images again

2015-09-14 Thread Heikki Linnakangas

On 09/04/2015 09:30 PM, Simon Riggs wrote:

On 4 September 2015 at 13:45, Heikki Linnakangas  wrote:


Another issue was with the new speculative insertions. Replaying a
speculative insertion record sets the tuple's CTID to point to itself, like
in a regular insertion. But in the original system, the CTID is set to a
special speculative insertion token. The tool flagged up that difference.

I propose the attached patch (mark-speculative-insertions-in-replay.patch)
to fix that in the replay routine. This is not required for correctness,
but helps this tool, and seems like a good idea for debugging purposes
anyway.
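A minimal sketch of what such replay-time marking could look like; the speculative-flag test and the variable names are illustrative assumptions, not the actual patch (only HeapTupleHeaderSetSpeculativeToken and ItemPointerSet are existing macros):

    /* In the heap-insert redo routine, after reconstructing the tuple header: */
    if (record_is_speculative)      /* assumed: flag decoded from the WAL record */
    {
        /*
         * The original token is not carried in WAL and is irrelevant for
         * recovery; any nonzero value merely marks the tuple as a
         * speculative insertion for debugging/comparison purposes.
         */
        HeapTupleHeaderSetSpeculativeToken(htup, 1);
    }
    else
        ItemPointerSet(&htup->t_ctid, blkno, offnum);   /* regular insert: CTID points to itself */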


ISTM that the WAL record should include the speculative insertion token, so
that replay can set it correctly.


I view this the same as command IDs. We don't restore the original 
command ID of a tuple at WAL replay either, because it's irrelevant for 
recovery and hot standby.



That way we can always re-check that the later update matches the
speculative insertion token we expect, in all cases.


Hmm, I guess that would give a tiny bit of extra sanity checking at 
replay. Doesn't really seem worth the trouble and extra WAL volume to me.



In any case, the assumption that we are replaying all changes in single
threaded mode is not appropriate for use with logical replication.


No such assumption here AFAICS.

- Heikki





Re: [HACKERS] Double linking MemoryContext children

2015-09-14 Thread Jan Wieck

On 09/12/2015 11:35 AM, Kevin Grittner wrote:


On the other hand, a grep indicates that there are two places that
MemoryContextData.nextchild is set (and we therefore probably need
to also set the new field), and Jan's proposed patch only changes
one of them.  If we do this, I think we need to change both places
that are affected, so ResourceOwnerCreate() in resowner.c would
need a line or two added.


ResourceOwnerCreate() sets ResourceOwnerData.nextchild, not 
MemoryContextData.nextchild.
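For reference, a minimal sketch of what the double linking is meant to buy, assuming the proposed prevchild field is added to MemoryContextData (firstchild and nextchild already exist): detaching a context from its parent no longer requires walking the sibling list.

    static void
    MemoryContextDetachFromParent(MemoryContext context)   /* illustrative helper */
    {
        MemoryContext parent = context->parent;

        if (context->prevchild != NULL)
            context->prevchild->nextchild = context->nextchild;
        else
            parent->firstchild = context->nextchild;

        if (context->nextchild != NULL)
            context->nextchild->prevchild = context->prevchild;

        context->parent = NULL;
        context->prevchild = NULL;
        context->nextchild = NULL;
    }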



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info




Re: [HACKERS] WIP: Rework access method interface

2015-09-14 Thread Petr Jelinek

On 2015-09-14 14:34, Oleg Bartunov wrote:


What I don't understand from this thread is whether we should wait for 2ndQuadrant's
sequence and column AMs, or just start working on committing this.
Alvaro, where are you?



I don't see problems with this patch from the sequence AM perspective. 
The next AM type will need to add code to handle the different AM types, but that's 
mostly it.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Amit Kapila
On Mon, Sep 14, 2015 at 3:02 PM, Alexander Korotkov 
wrote:

> On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila 
> wrote:
>
>> On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov 
>> wrote:
>> >
>> > On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
>> i.kurbangal...@postgrespro.ru> wrote:
>> >>
>> >>
>> >> I've looked deeper and I found PgBackendStatus to be not a suitable
>> >> place for keeping information about low level waits. Really,
>> PgBackendStatus
>> >> is used to track high level information about backend. This is why
>> auxiliary
>> >> processes don't have PgBackendStatus, because they don't have such
>> information
>> >> to expose. But when we come to the low level wait events then auxiliary
>> >> processes are as useful for monitoring as backends are. WAL writer,
>> >> checkpointer, bgwriter etc are using LWLocks as well. This is
>> certainly unclear
>> >> why they can't be monitored.
>> >>
>> >> This is why I think we shouldn't place wait event into PgBackendStatus. It
>> >> could be placed into PGPROC or even a separate data structure with a
>> >> different concurrency model which would be most suitable for monitoring.
>> >
>> >
>> > +1 for tracking wait events not only for backends
>> >
>> > Ildus, could you do following?
>> > 1) Extract LWLocks refactoring into separate patch.
>> > 2) Make a patch with storing current wait event information in PGPROC.
>> >
>>
>> Now as Robert has committed standardization of lwlock names in
>> commit - aa65de04, let us try to summarize and work on remaining parts
>> of the patch. So I think now the next set of work is as follows:
>>
>> 1. Modify the tranche mechanism so that information about LWLocks
>> can be tracked easily.  For this already there is some discussion, ideas
>> and initial patch is floated in this thread and there doesn't seem to be
>> much
>> conflicts, so we can write the patch for it.  I am planning to write or
>> modify
>> the existing patch unless you, Ildus, or anyone has objections or wants to
>> write it, please let me know to avoid duplication of work.
>>
>> 2. Track wait_event in pg_stat_activity.  Now here the main point where
>> we don't seem to be in complete agreement is whether we shall keep it
>> as one byte in PgBackendStatus or use two or more bytes to store
>> wait_event information and still there is another point made by you to
>> store it in PGProc rather than in PgBackendStatus so that we can display
>> information of background processes as well.
>>
>> Now as a matter of consensus, I think Tom has raised a fair point [1]
>> against
>> storing this information in PGProc and I feel that it is relatively less
>> important to have such information about background processes and the
>> reason for same is explained upthread [2].  About having more than
>> one-byte
>> to store information about various wait_events, I think now we will not
>> have
>> more than 100 or so events to track, do you really think that anytime in
>> foreseeable
>> future we will have more than 256 important events which we would like to
>> track?
>>
>> So I think about this lets first try to build the consensus and then
>> attempt to
>> write or modify the patch.
>>
>>
>> [1] - http://www.postgresql.org/message-id/4067.1439561...@sss.pgh.pa.us
>> [2] -
>> http://www.postgresql.org/message-id/CAA4eK1JRQGTgRMRao6H65rA=fhifucnqhx7ongy58um6rn1...@mail.gmail.com
>>
>
> In order to build the consensus we need the roadmap for waits monitoring.
> Would single byte in PgBackendStatus be the only way for tracking wait
> events? Could we have pluggable infrastructure in waits monitoring: for
> instance, hooks for wait event begin and end?
> Limit of 256 wait events is probably OK. But we have no room for any
> additional parameters of wait event. For instance, if you notice high
> contention for buffer content lock then it would be nice to figure out: how
> many blocks are involved?, which blocks? We need to track additional
> parameters in order to answer this question.
>

We can track additional parameters by default or based on some
parameter, but do you think that tracking backends' wait_event
information as proposed hinders future extensions in this area
in any way?  The point is that a detailed discussion of other parameters
could be better done separately, unless you think that this can block
future enhancements for waits monitoring.  I see wait monitoring of the overall
database as a really valuable feature, but I am not able to see a compelling
need to sort out everything before this patch.  Having said that, I am open
to discussion if you want specific things to be sorted out before moving
further.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Pavel Stehule
2015-09-14 14:11 GMT+02:00 Tomas Vondra :

>
>
> On 09/14/2015 01:15 PM, Shulgin, Oleksandr wrote:
>
>> On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra wrote:
>>
>>
>> On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:
>>
>> On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra wrote:
>>
>> ...
>>
>>  - Attempts to get plan for simple insert queries like this
>>
>> INSERT INTO x SELECT * FROM x;
>>
>> end with a segfault, because ActivePortal->queryDesc is
>> 0x0 for this
>> query. Needs investigation.
>>
>>
>> Yes, I've hit the same problem after submitting the latest
>> version of
>> the patch.  For now I've just added a check for queryDesc being
>> not
>> NULL, but I guess the top of the current_query_stack might
>> contains
>> something useful.  Something I need to try.
>>
>>
>> Well, the thing is we're able to do EXPLAIN on those queries, and
>> IIRC auto_explain can log them too. So perhaps look into the hooks
>> where they take the queryDesc in those cases - it has to be
>> available somewhere.
>>
>>
>> Yes, this seems to work.  I was able to successfully query "create table
>> as" and even "explain analyze" for the plans.  In both cases
>> ActivePortal doesn't have a valid queryDesc, but the executor hook
>> captures one.
>>
>>  - The lockless approach seems fine to me, although I think
>> the fear
>> of performance issues is a bit moot (I don't think we
>> expect large
>> number of processes calling pg_cmdstatus at the same
>> time). But
>> it's not significantly more complex, so why not.
>>
>>
>> I believe the main benefit of the less-locking approach is that if
>> something goes wrong when two backends tried to communicate it
>> doesn't
>> prevent the rest of them from operating, because there is no
>> shared (and
>> thus locked) communication channel.
>>
>>
>> Sure, but I think it really deserves a bit more detailed explanation
>> of the protocol, and discussion of the expected behavior for some
>> basic failure types.
>>
>> For example - what happens when a backend requests info, but dies
>> before receiving it, and the backed is reused for another
>> connection? Doesn't this introduce a race condition? Perhaps not,
>> I'm just asking.
>>
>>
>> Now explained better in code comments.
>>
>> The worst thing that could happen in this case I believe is that the
>> requesting backend will not receive any response from the second use of
>> its slot due to the slot being marked as processed by the backend that
>> was asked first:
>>
>> A:
>> slot->is_processed = false;
>> slot->is_valid = true;
>> /* signals backend B */;
>> shm_mq_read(); /* blocks */
>>
>> B: /* finds the slot and prepares the response */
>>
>> A: /* decides to bail out */
>> InvalidateCmdStatusSlot();
>> shm_mq_detach();
>> /* exits pg_cmdstatus() */
>>
>> /* pg_cmdstatus() is called again */
>> /* initializes the slot and signals some backend */
>> shm_mq_read(); /* blocks */
>>
>> B: /* completes preparing the response */
>> slot->is_processed = true;
>> shm_mq_send() => SHM_MQ_DETACHED
>> slot->is_valid = false;
>> /* gives up with this slot */
>>
>
Hmm... yes - there has to be a check that the slot is still related to the queue
before changing the is_valid attribute.
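A rough sketch of that check on the sender side, reusing the slot fields from the pseudocode above plus an assumed per-request identifier (shm_mq_send and SHM_MQ_DETACHED are the existing shm_mq API; the rest is illustrative):

    /* Backend B, after preparing the response for request 'req': */
    slot->is_processed = true;
    if (shm_mq_send(mqh, len, data, false) == SHM_MQ_DETACHED)
    {
        /*
         * The requester went away.  Only invalidate the slot if it still
         * belongs to the request we served; a new request may have reused
         * the slot in the meantime.
         */
        if (slot->request_id == req->request_id)    /* assumed ownership check */
            slot->is_valid = false;
    }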

Regards

Pavel


Re: [HACKERS] Scaling PostgreSQL at multicore Power8

2015-09-14 Thread YUriy Zhuravlev
On Monday 31 August 2015 17:43:08 Tomas Vondra wrote:
> Well, I could test the patch on a x86 machine with 4 sockets (64 cores),
> but I wonder whether it makes sense at this point, as the patch really
> is not correct (judging by what Andres says).

Can you test patch from this thread: 
http://www.postgresql.org/message-id/2400449.GjM57CE0Yg@dinodell ?

In our view it is correct, although this is not obvious.

-- 
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] PATCH: index-only scans with partial indexes

2015-09-14 Thread Tomas Vondra



On 09/14/2015 12:51 PM, Kyotaro HORIGUCHI wrote:

I rethought this from the beginning.


Sorry.


Hi, this looks to be a bug of cost_index(). The attached patch
would fix that.


No, that's wrong. Please forget the patch. The qual in qpquals
should be an indexqual that is excluded because it does not
need to be applied. The right way would be to remove the cost
for the qpqual in cost_index().


Your patch allows index only scan even if a qual contains
non-index column when the qual can be removed by implication from
index predicates.

So the 'implied' clauses are not needed after that. They should be
excluded from cost estimation, and they are not needed at execution
time even if an index-only scan finally turns out not to be doable.

So the implied quals may be removed when building index paths, but
I think check_index_only is not the place to do it.

Removing implied quals from the index quals is not only for index-
*only* scans, so the place for removing such quals is in
build_index_paths, in the loop of step 1. After removing the
quals there, check_index_only will naturally give the desired result.

# I remember that I have tried the same or similar thing. I don't
# recall what made me give up then.


I don't think this is particularly related to the patch, because some of 
the anomalies can be observed even on master. For example, let's force 
the index scans to be non-IOS by requesting another column from the heap:


  create table t (a int, b int, c int);
  insert into t select i,i,i from generate_series(1,100) s(i);

  create index idx1 on t (a)   where b between 30 and 60;
  create index idx2 on t (a,b) where b between 30 and 60;

  analyze t;
  vacuum t;

The indexes have exactly the same size (thanks to alignment of 
IndexTuples), and should have exactly the same statistics:


test=# \di+
                       List of relations
 Schema | Name | Type  | Owner | Table |  Size   | Description
--------+------+-------+-------+-------+---------+-------------
 public | idx1 | index | user  | t     | 6600 kB |
 public | idx2 | index | user  | t     | 6600 kB |
(2 rows)

Now, let's see the query reading column 'c' (forcing heap fetches)

explain select c from t where b between 30 and 60;
  QUERY PLAN
---
 Index Scan using idx1 on t  (cost=0.42..10945.99 rows=300971 width=4)
(1 row)

drop index idx1;
set enable_bitmapscan = off;

explain select c from t where b between 30 and 60;
  QUERY PLAN
---
 Index Scan using idx2 on t  (cost=0.42..19688.08 rows=300971 width=4)
   Index Cond: ((b >= 30) AND (b <= 60))
(2 rows)

I claim that either both or none of the indexes should use "Index Cond".

This is exactly the same reason that led to the strange behavior after 
applying the patch, but in this case the heap access actually introduces 
some additional cost so the issue is not that obvious.


But in reality the costs should be pretty much exactly the same - the 
indexes have exactly the same size, statistics, selectivity etc.


Also, the plan difference suggests this is not merely a costing issue, 
because while with idx1 (first plan) it was correctly detected we don't 
need to evaluate the condition on the partial index, on idx2 that's not 
true and we'll waste time doing that. So we probably can't just tweak 
the costing a bit - this probably needs to be addressed when actually 
building the index path.



regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Robert Haas
On Mon, Sep 14, 2015 at 5:32 AM, Alexander Korotkov
 wrote:
> In order to build the consensus we need the roadmap for waits monitoring.
> Would single byte in PgBackendStatus be the only way for tracking wait
> events? Could we have pluggable infrastructure in waits monitoring: for
> instance, hooks for wait event begin and end?

No, it's not the only way of doing it.  I proposed doing that way
because it's simple and cheap, but I'm not hell-bent on it.  My basic
concern here is about the cost of this.  I think that the most data we
can report without some kind of synchronization protocol is one 4-byte
integer.  If we want to report anything more than that, we're going to
need something like the st_changecount protocol, or a lock, and that's
going to add very significantly - and in my view unacceptably - to the
cost.  I care very much about having this facility be something that
we can use in lots of places, even extremely frequent operations like
buffer reads and contended lwlock acquisition.

I think that there may be some *kinds of waits* for which it's
practical to report additional detail.  For example, suppose that when
a heavyweight lock wait first happens, we just report the lock type
(relation, tuple, etc.) but then when the deadlock detector expires,
if we're still waiting, we report the entire lock tag.  Well, that's
going to happen infrequently enough, and is expensive enough anyway,
that the cost doesn't matter.  But if, every time we read a disk
block, we take a lock (or bump a changecount and do a write barrier),
dump the whole block tag in there, release the lock (or do another
write barrier and bump the changecount again) that sounds kind of
expensive to me.  Maybe we can prove that it doesn't matter on any
workload, but I doubt it.  We're fighting for every cycle in some of
these code paths, and there's good evidence that we're burning too
many of them compared to competing products already.

I am not a big fan of hooks as a way of resolving disagreements about
the design.  We may find that there are places where it's useful to
have hooks so that different extensions can do different things, and
that is fine.  But we shouldn't use that as a way of punting the
difficult questions.  There isn't enough common understanding here of
what we're all trying to get done and why we're trying to do it in
particular ways rather than in other ways to jump to the conclusion
that a hook is the right answer.  I'd prefer to have a nice, built-in
system that everyone agrees represents a good set of trade-offs than
an extensible system.

I think it's reasonable to consider reporting this data in the PGPROC
using a 4-byte integer rather than reporting it through a single byte
in the backend status structure.  I believe that addresses the
concerns about reporting from auxiliary processes, and it also allows
a little more data to be reported.  For anything in excess of that, I
think we should think rather harder.  Most likely, such additional
detail should be reported only for certain types of wait events, or on
a delay, or something like that, so that the core mechanism remains
really, really fast.
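A minimal sketch of that style of reporting, assuming a 4-byte field reachable from each process's PGPROC; the field name and the class/event encoding are illustrative, not an existing API:

    /*
     * A single aligned 4-byte store: readers may see a slightly stale value,
     * but never a torn one, so no lock or changecount protocol is needed.
     */
    static inline void
    wait_event_start(volatile uint32 *wait_event_info, uint8 classId, uint16 eventId)
    {
        *wait_event_info = ((uint32) classId << 16) | eventId;
    }

    static inline void
    wait_event_end(volatile uint32 *wait_event_info)
    {
        *wait_event_info = 0;       /* not waiting */
    }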

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Rework access method interface

2015-09-14 Thread Oleg Bartunov
On Fri, Sep 11, 2015 at 4:22 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Mon, Sep 7, 2015 at 9:17 PM, Petr Jelinek  wrote:
>
>> On 2015-09-04 16:26, Alexander Korotkov wrote:
>>
>>>
>>> The attached patch implements this. It doesn't pretend to be a fully
>>> correct implementation, but it should be enough as a proof of concept.
>>> In this patch access method exposes another function: amvalidate. It
>>> takes data structure representing opclass and throws error if it finds
>>> it invalid.
>>> This method is used on new opclass definition (alter operator family
>>> etc. are not yet implemented but planned). Also, there is SQL function
>>> validate_opclass(oid) which is used in regression tests.
>>> Any thoughts?
>>>
>>>
>> This is starting to look good.
>>
>> However I don't like the naming differences between validate_opclass and
>> amvalidate. If you expect that the current amvalidate will only be used for
>> opclass validation then it should be renamed accordingly.
>>
>
> validate_opclass was renamed to amvalidate.
>
>
>> Also GetIndexAmRoutine should check the return type of the amhandler.
>
>
> Fixed.
>
> See the attached patch.
>
>

What I don't understand from this thread is whether we should wait for 2ndQuadrant's
sequence and column AMs, or just start working on committing this.
Alvaro, where are you?



> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
>
>


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Amit Kapila
On Sun, Sep 13, 2015 at 8:06 AM, Robert Haas  wrote:
>
> On Sat, Sep 12, 2015 at 8:24 AM, Amit Kapila 
wrote:
> > 1. Modify the tranche mechanism so that information about LWLocks
> > can be tracked easily.  For this already there is some discussion, ideas
> > and initial patch is floated in this thread and there doesn't seem to be
> > much
> > conflicts, so we can write the patch for it.  I am planning to write or
> > modify
> > the existing patch unless you, Ildus, or anyone has objections or wants to
> > write it, please let me know to avoid duplication of work.
>
> What I'd like to see happen here is two new API calls.  During the
> early initialization (before shared memory sizing, and during
> process_shared_preload_libraries), backends in either the core code or
> a loadable module can call RequestLWLockTranche(char *, int) to
> request a tranche with the given name and number of locks.
>

Won't this new API introduce another failure mode, namely collision of
tranche names?  Will the error for a collision be raised during the Request
call or during shared-memory creation?

I understand that such an API could simplify the maintenance of tranches,
but won't it be slightly less user-friendly, in particular due to the additional
requirement of providing a tranche name whose meaning extension users might
or might not understand?


> Then, when
> shared memory is created, the core code creates a tranche which is
> part of MainLWLockArray.  The base of the tranche points to the first
> lock in that tranche, and the tranche is automatically registered for
> all subsequent backends.
>
I think the same has to be done for tranches outside MainLWLockArray, like WAL
or ReplicationOrigin?

>  In EXEC_BACKEND builds, this requires
> stashing the LWLockTranche and the name to which it points in shared
> memory somewhere, so that exec'd backends can look at shared memory
> and redo the registration.  In non-EXEC_BACKEND builds the values can
> just be inherited via fork.  Then, we add a second API call
> LookupLWTrancheByName(char *) which does just that.  This gets used to
> initialize backend-private pointers to the various tranches.
>

So will this also require changes in the way extensions assign
locks?
They currently call LWLockAssign() during shared-memory initialization, so
will that need to change?
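For context, a minimal sketch of the existing 9.5-era extension pattern this question refers to, which the proposed tranche API would presumably replace (hook wiring and shmem bookkeeping omitted):

    static LWLock *my_lock;

    void
    _PG_init(void)
    {
        if (!process_shared_preload_libraries_in_progress)
            return;
        RequestAddinLWLocks(1);         /* reserve one lock in MainLWLockArray */
    }

    /* later, from the extension's shmem_startup_hook: */
    static void
    my_shmem_startup(void)
    {
        my_lock = LWLockAssign();       /* the call the question asks about */
    }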



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] PATCH: index-only scans with partial indexes

2015-09-14 Thread Kyotaro HORIGUCHI
I rethought this from the beginning.

> Sorry.
> 
> > Hi, this looks to be a bug of cost_index(). The attached patch
> > would fix that.
> 
> No, that's wrong. Please forget the patch. The qual in qpquals
> should be an indexqual that is excluded because it does not
> need to be applied. The right way would be to remove the cost
> for the qpqual in cost_index().

Your patch allows index only scan even if a qual contains
non-index column when the qual can be removed by implication from
index predicates.

So the 'implied' clauses are not needed after that. They should be
excluded from cost estimation, and they are not needed at execution
time even if an index-only scan finally turns out not to be doable.

So the implied quals may be removed when building index paths, but
I think check_index_only is not the place to do it.

Removing implied quals from the index quals is not only for index-
*only* scans, so the place for removing such quals is in
build_index_paths, in the loop of step 1. After removing the
quals there, check_index_only will naturally give the desired result.

# I remember that I have tried the same or similar thing. I don't
# recall what made me give up then.
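A rough sketch of that filtering step; the variable names are illustrative, while predicate_implied_by() is the existing prover already used for partial-index matching:

    /* While collecting clauses for a partial index in build_index_paths: */
    List       *index_clauses = NIL;
    ListCell   *lc;

    foreach(lc, clauses)
    {
        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

        if (index->indpred != NIL &&
            predicate_implied_by(list_make1(rinfo->clause), index->indpred))
            continue;   /* implied by the index predicate: no need to cost or execute it */

        index_clauses = lappend(index_clauses, rinfo);
    }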

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-14 Thread Robert Haas
On Mon, Sep 14, 2015 at 3:29 AM, Noah Misch  wrote:
> SECURITY DEFINER execution blocks SET ROLE, SET SESSION AUTHORIZATION, and
> sometimes "GRANT role1 TO role2".  Otherwise, it works like regular execution.
> Adding exceptions, particularly silent behavior changes as opposed to hard
> errors, is a big deal.

Yeah, that does seem possibly surprising...

> Pondering it afresh this week, I see now that row_security=force itself is the
> problem.  It's a new instance of the maligned "behavior-changing GUC".
> Function authors will not consistently test the row_security=force case, and
> others can run the functions under row_security=force and get novel results.
> This poses a reliability and security threat.  While row_security=force is
> handy for testing, visiting a second role for testing purposes is a fine
> replacement.  Let's remove "force", making row_security a config_bool.  If
> someone later desires to revive the capability, a DDL-specified property of
> each policy would be free from these problems.

...but I'm not sure I like this, either.  Without row_security=force,
it's hard for a table owner to test their policy, unless they have the
ability to assume some other user ID, which some won't.  If someone
puts in place a policy which they believe secures their data properly
but which actually does not, and they are unable to test it properly
for lack of this setting, that too will be a security hole.  We will
be able to attribute it to user error rather than product defect, but
that will be cold comfort to the person whose sensitive data has been
leaked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 2:12 PM, Amit Kapila 
wrote:

> On Mon, Sep 14, 2015 at 2:25 PM, Alexander Korotkov 
> wrote:
>
>> On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila 
>> wrote:
>>
>>> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <
>>> i.kurbangal...@postgrespro.ru> wrote:
>>> >
>>> > On 08/05/2015 09:33 PM, Robert Haas wrote:
>>> >>
>>> >>
>>> >> You're missing the point.  Those multi-byte fields have additional
>>> >> synchronization requirements, as I explained in some detail in my
>>> >> previous email. You can't just wave that away.
>>> >
>>> > I see that now. Thank you for the point.
>>> >
>>> > I've looked deeper and I found PgBackendStatus to be not a suitable
>>> > place for keeping information about low level waits. Really,
>>> PgBackendStatus
>>> > is used to track high level information about backend. This is why
>>> auxiliary
>>> > processes don't have PgBackendStatus, because they don't have such
>>> information
>>> > to expose. But when we come to the low level wait events then auxiliary
>>> > processes are as useful for monitoring as backends are. WAL writer,
>>> > checkpointer, bgwriter etc are using LWLocks as well. This is
>>> certainly unclear
>>> > why they can't be monitored.
>>> >
>>>
>>> I think the chances of background processes stuck in LWLock is quite less
>>> as compare to backends as they do the activities periodically.  As an
>>> example
>>> WALWriter will take WALWriteLock to write the WAL, but actually there
>>> will never
>>> be any much contention for WALWriter. In synchronous_commit = on, the
>>> backends themselves write the WAL so WALWriter won't do much in that
>>> case and for synchronous_commit = off, backends won't write the WAL so
>>> WALWriter won't face any contention unless some buffers have to be
>>> written
>>> by bgwriter or checkpoint for which WAL is not flushed which I don't
>>> think
>>> would lead to any contention.
>>>
>>
>> Hmm, synchronous_commit is per session variable: some transactions could
>> run with synchronous_commit = on, but some with synchronous_commit = off.
>> This is very popular feature of PostgreSQL: achieve better performance by
>> making non-critical transaction asynchronous while leaving critical
>> transactions synchronous. Thus, contention for WALWriteLock between
>> backends and WALWriter could be real.
>>
>>
> I think it is difficult to say that can lead to contention due to periodic
> nature of WALWriter, but I don't deny that there is chance for
> background processes to have contention.
>

We don't know in advance whether there could be contention. This is why we need
monitoring.


> I am not denying from the fact that there could be some contention in rare
>>> scenarios for background processes, but I think tracking them is not as
>>> important as tracking the LWLocks for backends.
>>>
>>
>> I would be more careful in calling some of scenarios rare. As DBMS
>> developers we should do our best to evade contention for LWLocks: any
>> contention, not only between backends and background processes.  One may
>> assume that high LWLock contention is a rare scenario in general. Now that we're
>> here, we don't think so, though.
>> You claim that there couldn't be contention for WALWriteLock between
>> backends and WALWriter. This is unclear to me: I think there could be.
>>
>
> I think there would be more things where background processes could wait
> than LWLocks and I think they are important to track, but could be done
> separately
> from tracking them for pg_stat_activity.  Example, we have a
> pg_stat_bgwriter
> view, can't we think of tracking bgwriter/checkpointer wait information in
> that
> view and similarly for other background processes we can track in other
> views
> if any related view exists or create a new one to track for all background
> processes.
>
>
>> Nobody opposes tracking wait events for backends and tracking them for
>> background processes. I think we need to track both in order to provide
>> full picture to DBA.
>>
>>
> Sure, that is good to do, but can't we do it separately in another patch.
> I think in this patch lets just work for wait_events for backends.
>

Yes, but I think we should have a design for tracking wait events for every
process before implementing this only for backends.


> Also as we are planning to track the wait_event information in
>>> pg_stat_activity
>>> along with other backends information, it will not make sense to include
>>> information about backend processes in this variable as pg_stat_activity
>>> just displays information of backend processes.
>>>
>>
>> I'm not objecting that we should track only backends information in
>> pg_stat_activity. I think we should have also some other way of tracking
>> wait events for background processes. We should think it out before
>> extending pg_stat_activity to evade design issues later.
>>
>>
> I think we  can discuss if you see any specific problems or you want
> 

Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Shulgin, Oleksandr
On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra  wrote:

>
> On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:
>
>> On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra wrote:
>>
> ...
>
>> - Attempts to get plan for simple insert queries like this
>>
>>INSERT INTO x SELECT * FROM x;
>>
>>end with a segfault, because ActivePortal->queryDesc is 0x0 for
>> this
>>query. Needs investigation.
>>
>>
>> Yes, I've hit the same problem after submitting the latest version of
>> the patch.  For now I've just added a check for queryDesc being not
>> NULL, but I guess the top of the current_query_stack might contains
>> something useful.  Something I need to try.
>>
>
> Well, the thing is we're able to do EXPLAIN on those queries, and IIRC
> auto_explain can log them too. So perhaps look into the hooks where they
> take the queryDesc in those cases - it has to be available somewhere.


Yes, this seems to work.  I was able to successfully query "create table
as" and even "explain analyze" for the plans.  In both cases ActivePortal
doesn't have a valid queryDesc, but the executor hook captures one.

- The lockless approach seems fine to me, although I think the fear
>>of performance issues is a bit moot (I don't think we expect large
>>number of processes calling pg_cmdstatus at the same time). But
>>it's not significantly more complex, so why not.
>>
>>
>> I believe the main benefit of the less-locking approach is that if
>> something goes wrong when two backends tried to communicate it doesn't
>> prevent the rest of them from operating, because there is no shared (and
>> thus locked) communication channel.
>>
>
> Sure, but I think it really deserves a bit more detailed explanation of
> the protocol, and discussion of the expected behavior for some basic
> failure types.
>
> For example - what happens when a backend requests info, but dies before
> receiving it, and the backed is reused for another connection? Doesn't this
> introduce a race condition? Perhaps not, I'm just asking.


Now explained better in code comments.

The worst thing that could happen in this case I believe is that the
requesting backend will not receive any response from the second use of its
slot due to the slot being marked as processed by the backend that was
asked first:

A:
slot->is_processed = false;
slot->is_valid = true;
/* signals backend B */;
shm_mq_read(); /* blocks */

B: /* finds the slot and prepares the response */

A: /* decides to bail out */
InvalidateCmdStatusSlot();
shm_mq_detach();
/* exits pg_cmdstatus() */

/* pg_cmdstatus() is called again */
/* initializes the slot and signals some backend */
shm_mq_read(); /* blocks */

B: /* completes preparing the response */
slot->is_processed = true;
shm_mq_send() => SHM_MQ_DETACHED
slot->is_valid = false;
/* gives up with this slot */

Now the backend that has been signaled on the second call to pg_cmdstatus
(it can be either some other backend, or the backend B again) will not find
an unprocessed slot, thus it will not try to attach/detach the queue and
the backend A will block forever.

This requires a really bad timing and the user should be able to interrupt
the querying backend A still.

In any case, the backends that are being asked to send the info will be
able to notice the problem (receiver detached early) and handle it
gracefully.

- The patch contains pretty much no documentation, both comments
>>at the code level and user docs. The lack of user docs is not that
>>a big deal at this point (although the patch seems to be mature
>>enough, although the user-level API will likely change).
>>
>>The lack of code comments is more serious, as it makes the review
>>somewhat more difficult. For example it'd be very nice to document
>>the contract for the lock-less interface.
>>
>>
>> I will add the code comments.  The user docs could wait before we decide
>> on the interface, I think.
>>
>
> Agreed, although I think having rudimentary user documentation would be
> useful for the reviewers - a summary of the goals that are a bit scattered
> over the e-mail thread.


OK

- I agree that pg_cmdstatus() is not the best API. Having something
>>like EXPLAIN PID would be nice, but it does not really work for
>>all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
>>there's not a single API for all cases, i.e. we should use EXPLAIN
>>PID for one case and invent something different for the other?
>>
>>
>> I can think of something like:
>>
>> EXPLAIN [ ( option [, ...] ) ] PROCESS <pid>;
>>
>> where option is extended with:
>>
>>QUERY
>>PROGRESS
>>BACKTRACE
>>
>> in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.
>>
>
> Seems OK. I'm not quite sure what PROGRESS is - I assume it's the same
> thing as ANALYZE? Why not to use the 

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Mon, Sep 14, 2015 at 2:25 PM, Amit Kapila 
wrote:

> On Mon, Sep 14, 2015 at 3:02 PM, Alexander Korotkov 
> wrote:
>
>> On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila 
>> wrote:
>>
>>> On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <
>>> aekorot...@gmail.com> wrote:
>>> >
>>> > On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
>>> i.kurbangal...@postgrespro.ru> wrote:
>>> >>
>>> >>
>>> >> I've looked deeper and I found PgBackendStatus to be not a suitable
>>> >> place for keeping information about low level waits. Really,
>>> PgBackendStatus
>>> >> is used to track high level information about backend. This is why
>>> auxiliary
>>> >> processes don't have PgBackendStatus, because they don't have such
>>> information
>>> >> to expose. But when we come to the low level wait events then
>>> auxiliary
>>> >> processes are as useful for monitoring as backends are. WAL writer,
>>> >> checkpointer, bgwriter etc are using LWLocks as well. This is
>>> certainly unclear
>>> >> why they can't be monitored.
>>> >>
>>> >> This is why I think we shouldn't place wait event into
>>> >> PgBackendStatus. It
>>> >> could be placed into PGPROC or even a separate data structure with a
>>> >> different concurrency model which would be most suitable for monitoring.
>>> >
>>> >
>>> > +1 for tracking wait events not only for backends
>>> >
>>> > Ildus, could you do following?
>>> > 1) Extract LWLocks refactoring into separate patch.
>>> > 2) Make a patch with storing current wait event information in PGPROC.
>>> >
>>>
>>> Now as Robert has committed standardization of lwlock names in
>>> commit - aa65de04, let us try to summarize and work on remaining parts
>>> of the patch. So I think now the next set of work is as follows:
>>>
>>> 1. Modify the tranche mechanism so that information about LWLocks
>>> can be tracked easily.  For this already there is some discussion, ideas
>>> and initial patch is floated in this thread and there doesn't seem to be
>>> much
>>> conflicts, so we can write the patch for it.  I am planning to write or
>>> modify
>>> the existing patch unless you, Ildus, or anyone has objections or wants to
>>> write it, please let me know to avoid duplication of work.
>>>
>>> 2. Track wait_event in pg_stat_activity.  Now here the main point where
>>> we don't seem to be in complete agreement is whether we shall keep it
>>> as one byte in PgBackendStatus or use two or more bytes to store
>>> wait_event information and still there is another point made by you to
>>> store it in PGProc rather than in PgBackendStatus so that we can display
>>> information of background processes as well.
>>>
>>> Now as a matter of consensus, I think Tom has raised a fair point [1]
>>> against
>>> storing this information in PGProc and I feel that it is relatively less
>>> important to have such information about background processes and the
>>> reason for same is explained upthread [2].  About having more than
>>> one-byte
>>> to store information about various wait_events, I think now we will not
>>> have
>>> more than 100 or so events to track, do you really think that anytime in
>>> foreseeable
>>> future we will have more than 256 important events which we would like
>>> to track?
>>>
>>> So I think about this lets first try to build the consensus and then
>>> attempt to
>>> write or modify the patch.
>>>
>>>
>>> [1] - http://www.postgresql.org/message-id/4067.1439561...@sss.pgh.pa.us
>>> [2] -
>>> http://www.postgresql.org/message-id/CAA4eK1JRQGTgRMRao6H65rA=fhifucnqhx7ongy58um6rn1...@mail.gmail.com
>>>
>>
>> In order to build the consensus we need the roadmap for waits monitoring.
>> Would single byte in PgBackendStatus be the only way for tracking wait
>> events? Could we have pluggable infrastructure in waits monitoring: for
>> instance, hooks for wait event begin and end?
>> Limit of 256 wait events is probably OK. But we have no room for any
>> additional parameters of wait event. For instance, if you notice high
>> contention for buffer content lock then it would be nice to figure out: how
>> many blocks are involved?, which blocks? We need to track additional
>> parameters in order to answer this question.
>>
>
> We can track additional parameters by default or based on some
> parameter, but do you think that tracking backends wait_event
> information as proposed hinders in any which way the future extensions
> in this area?  The point is that detailed discussion of other parameters
> could be better done separately unless you think that this can block
> future enhancements for waits monitoring.
>

Yes, I think this can block future enhancements for waits monitoring.
You're currently proposing to store the current wait event in just a single byte.
And it is a single byte not because of space economy; it is so by
concurrency design.
Isn't it natural to ask how we are going to store something more about
the current wait event? Should we store additional 

Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Shulgin, Oleksandr
On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra 
wrote:

>
>> Now the backend that has been signaled on the second call to
>> pg_cmdstatus (it can be either some other backend, or the backend B
>> again) will not find an unprocessed slot, thus it will not try to
>> attach/detach the queue and the backend A will block forever.
>>
>> This requires a really bad timing and the user should be able to
>> interrupt the querying backend A still.
>>
>
> I think we can't rely on the low probability that this won't happen, and
> we should not rely on people interrupting the backend. Being able to detect
> the situation and fail gracefully should be possible.
>
> It may be possible to introduce some lock-less protocol preventing such
> situations, but it's not there at the moment. If you believe it's possible,
> you need to explain and "prove" that it's actually safe.
>
> Otherwise we may need to introduce some basic locking - for example we may
> introduce a LWLock for each slot, and lock it with dontWait=true (and skip
> it if we couldn't lock it). This should prevent most scenarios where one
> corrupted slot blocks many processes.


OK, I will revisit this part then.

In any case, the backends that are being asked to send the info will be
>> able to notice the problem (receiver detached early) and handle it
>> gracefully.
>>
>
> Ummm, how? Maybe I missed something?


Well, I didn't attach the updated patch (doing that now).  The basic idea
is that when the backend that has requested information bails out
prematurely it still detaches from the shared memory queue.  This makes it
possible for the backend being asked to detect the situation either before
attaching to the queue or when trying to send the data, so it won't be
blocked forever if the other backend failed to wait.
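A small sketch of that sender-side detection, using the existing shm_mq calls (error handling and the slot bookkeeping are omitted; the variable names are illustrative):

    /* The backend being asked: attach, try to send, and never block on a dead requester. */
    shm_mq_set_sender(mq, MyProc);
    mqh = shm_mq_attach(mq, seg, NULL);

    if (shm_mq_send(mqh, len, data, false) == SHM_MQ_DETACHED)
    {
        /* requester bailed out or died before reading: just give up */
        shm_mq_detach(mq);
        return;
    }

    shm_mq_detach(mq);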

I don't think we should mix this with monitoring of auxiliary
>> processes. This interface is designed at monitoring SQL queries
>> running in other backends, effectively "remote" EXPLAIN. But those
>> auxiliary processes are not processing SQL queries at all, they're
>> not even using regular executor ...
>>
>> OTOH the ability to request this info (e.g. auxiliary process
>> looking at plans running in backends) seems useful, so I'm ok with
>> tuple slots for auxiliary processes.
>>
>>
>> Now that I think about it, reserving the slots for aux process doesn't
>> let us query their status, it's the other way round.  If we don't
>> reserve them, then aux process would not be able to query any other
>> process for the status.  Likely this is not a problem at all, so we can
>> remove these extra slots.
>>
>
> I don't know. I can imagine using this from background workers, but I
> think those are counted as regular backends (not sure though).


MaxBackends includes the background workers, yes.

--
Alex


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-09-14 Thread Robert Haas
On Sun, Sep 13, 2015 at 5:05 PM, Ildus Kurbangaliev
 wrote:
> Yes, that is because I tried to follow the current convention for working with
> shmem in Postgres (there is one function that returns the size and
> others that initialize that memory). But I like your suggestion about
> API functions; in that case the number of tranches and locks will be known
> during initialization.

I also want to leave the door open to tranches that are registered
after initialization.  At that point, it's too late to put a tranche
in shared memory, but you can still use DSM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Tomas Vondra



On 09/14/2015 01:15 PM, Shulgin, Oleksandr wrote:

On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra wrote:


On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:

On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra wrote:

...

 - Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is
0x0 for this
query. Needs investigation.


Yes, I've hit the same problem after submitting the latest
version of
the patch.  For now I've just added a check for queryDesc being not
NULL, but I guess the top of the current_query_stack might contains
something useful.  Something I need to try.


Well, the thing is we're able to do EXPLAIN on those queries, and
IIRC auto_explain can log them too. So perhaps look into the hooks
where they take the queryDesc in those cases - it has to be
available somewhere.


Yes, this seems to work.  I was able to successfully query "create table
as" and even "explain analyze" for the plans.  In both cases
ActivePortal doesn't have a valid queryDesc, but the executor hook
captures one.

 - The lockless approach seems fine to me, although I think
the fear
of performance issues is a bit moot (I don't think we
expect large
number of processes calling pg_cmdstatus at the same
time). But
it's not significantly more complex, so why not.


I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it
doesn't
prevent the rest of them from operating, because there is no
shared (and
thus locked) communication channel.


Sure, but I think it really deserves a bit more detailed explanation
of the protocol, and discussion of the expected behavior for some
basic failure types.

For example - what happens when a backend requests info, but dies
before receiving it, and the backed is reused for another
connection? Doesn't this introduce a race condition? Perhaps not,
I'm just asking.


Now explained better in code comments.

The worst thing that could happen in this case I believe is that the
requesting backend will not receive any response from the second use of
its slot due to the slot being marked as processed by the backend that
was asked first:

A:
slot->is_processed = false;
slot->is_valid = true;
/* signals backend B */;
shm_mq_read(); /* blocks */

B: /* finds the slot and prepares the response */

A: /* decides to bail out */
InvalidateCmdStatusSlot();
shm_mq_detach();
/* exits pg_cmdstatus() */

/* pg_cmdstatus() is called again */
/* initializes the slot and signals some backend */
shm_mq_read(); /* blocks */

B: /* completes preparing the response */
slot->is_processed = true;
shm_mq_send() => SHM_MQ_DETACHED
slot->is_valid = false;
/* gives up with this slot */

Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.

This requires a really bad timing and the user should be able to
interrupt the querying backend A still.


I think we can't rely on the low probability that this won't happen, and 
we should not rely on people interrupting the backend. Being able to 
detect the situation and fail gracefully should be possible.


It may be possible to introduce some lock-less protocol preventing such 
situations, but it's not there at the moment. If you believe it's 
possible, you need to explain and "prove" that it's actually safe.


Otherwise we may need to introduce some basic locking - for example we 
may introduce a LWLock for each slot, and lock it with dontWait=true 
(and skip it if we couldn't lock it). This should prevent most scenarios 
where one corrupted slot blocks many processes.
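A small sketch of that per-slot guard, using the existing no-wait LWLockConditionalAcquire(); the per-slot lock field and the helper are assumptions:

    /* When looking for a slot to serve, never block on a single bad slot: */
    if (!LWLockConditionalAcquire(slot->lock, LW_EXCLUSIVE))
        continue;                   /* slot busy (or stuck): skip it */

    if (slot->is_valid && !slot->is_processed)
        process_slot(slot);         /* assumed helper that prepares the response */

    LWLockRelease(slot->lock);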




In any case, the backends that are being asked to send the info will be
able to notice the problem (receiver detached early) and handle it
gracefully.


Ummm, how? Maybe I missed something?



I don't think we should mix this with monitoring of auxiliary
processes. This interface is designed at monitoring SQL queries
running in other backends, effectively "remote" EXPLAIN. But those
auxiliary processes are not processing SQL queries at all, they're
not even using regular executor ...

OTOH the ability to request this info (e.g. auxiliary process
looking at plans running in backends) seems useful, 

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Amit Kapila
On Mon, Sep 14, 2015 at 2:25 PM, Alexander Korotkov 
wrote:

> On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila 
> wrote:
>
>> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <
>> i.kurbangal...@postgrespro.ru> wrote:
>> >
>> > On 08/05/2015 09:33 PM, Robert Haas wrote:
>> >>
>> >>
>> >> You're missing the point.  Those multi-byte fields have additional
>> >> synchronization requirements, as I explained in some detail in my
>> >> previous email. You can't just wave that away.
>> >
>> > I see that now. Thank you for the point.
>> >
>> > I've looked deeper and I found PgBackendStatus to be not a suitable
>> > place for keeping information about low level waits. Really,
>> PgBackendStatus
>> > is used to track high level information about backend. This is why
>> auxiliary
>> > processes don't have PgBackendStatus, because they don't have such
>> information
>> > to expose. But when we come to the low level wait events then auxiliary
>> > processes are as useful for monitoring as backends are. WAL writer,
>> > checkpointer, bgwriter etc are using LWLocks as well. This is certainly
>> unclear
>> > why they can't be monitored.
>> >
>>
>> I think the chances of background processes stuck in LWLock is quite less
>> as compare to backends as they do the activities periodically.  As an
>> example
>> WALWriter will take WALWriteLock to write the WAL, but actually there
>> will never
>> be any much contention for WALWriter. In synchronous_commit = on, the
>> backends themselves write the WAL so WALWriter won't do much in that
>> case and for synchronous_commit = off, backends won't write the WAL so
>> WALWriter won't face any contention unless some buffers have to be written
>> by bgwriter or checkpoint for which WAL is not flushed which I don't think
>> would lead to any contention.
>>
>
> Hmm, synchronous_commit is per session variable: some transactions could
> run with synchronous_commit = on, but some with synchronous_commit = off.
> This is very popular feature of PostgreSQL: achieve better performance by
> making non-critical transaction asynchronous while leaving critical
> transactions synchronous. Thus, contention for WALWriteLock between
> backends and WALWriter could be real.
>
>
I think it is difficult to say whether that can lead to contention, due to the periodic
nature of WALWriter, but I don't deny that there is a chance for
background processes to have contention.


> I am not denying from the fact that there could be some contention in rare
>> scenarios for background processes, but I think tracking them is not as
>> important as tracking the LWLocks for backends.
>>
>
> I would be more careful in calling some of scenarios rare. As DBMS
> developers we should do our best to evade contention for LWLocks: any
> contention, not only between backends and background processes.  One may
> assume that high LWLock contention is a rare scenario in general. Now that we're
> here, we don't think so, though.
> You claim that there couldn't be contention for WALWriteLock between
> backends and WALWriter. This is unclear to me: I think there could be.
>

I think there are more things that background processes could wait on than
LWLocks, and I think they are important to track, but that could be done
separately from tracking them for pg_stat_activity.  For example, we have a
pg_stat_bgwriter view; can't we think of tracking bgwriter/checkpointer wait
information in that view, and similarly for other background processes we can
track it in other views if a related view exists, or create a new one to
track it for all background processes.


> Nobody opposes tracking wait events for backends and tracking them for
> background processes. I think we need to track both in order to provide
> full picture to DBA.
>
>
Sure, that is good to do, but can't we do it separately in another patch?
I think in this patch let's just work on wait_events for backends.


> Also as we are planning to track the wait_event information in
>> pg_stat_activity
>> along with other backends information, it will not make sense to include
>> information about backend processes in this variable as pg_stat_activity
>> just displays information of backend processes.
>>
>
> I'm not objecting that we should track only backends information in
> pg_stat_activity. I think we should have also some other way of tracking
> wait events for background processes. We should think it out before
> extending pg_stat_activity to evade design issues later.
>
>
I think we can discuss any specific problems you see or any specific things you
want clarified, but sorting out the complete design of waits monitoring
before this patch could extend its scope beyond what is needed.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.

2015-09-14 Thread Tom Lane
Kevin Grittner  writes:
> Fix an O(N^2) problem in foreign key references.

Judging from the buildfarm, this patch is broken under
CLOBBER_CACHE_ALWAYS.  See friarbird's results in particular.
I might be too quick to finger this patch, but nothing else
lately has touched foreign-key behavior, and the foreign_key
test is where things are going south.

regards, tom lane




[HACKERS] Re: [COMMITTERS] pgsql: Check existency of table/schema for -t/-n option (pg_dump/pg_res

2015-09-14 Thread Robert Haas
On Mon, Sep 14, 2015 at 9:20 AM, Teodor Sigaev  wrote:
> Check existency of table/schema for -t/-n option (pg_dump/pg_restore)
>
> Patch provides command line option --strict-names which requires that at
> least one table/schema should present for each -t/-n option.
>
> Pavel Stehule 

/*
-* We use UNION ALL rather than UNION; this might sometimes result in
-* duplicate entries in the OID list, but we don't care.
+* this might sometimes result in duplicate entries in the OID list,
+* but we don't care.
 */

This looks totally incoherent.  You've removed the thing to which the
word "this" referred and replaced it with nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RLS open items are vague and unactionable

2015-09-14 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> It shouldn't be necessary to change get_policies_for_relation() at all
> to support the RETURNING check. You just need to call it with
> CMD_SELECT.

Yup, that works well.

> BTW, your change to change get_policies_for_relation() has
> a bug -- if the policy is for ALL commands it gets added
> unconditionally to the list of returning_policies, regardless of
> whether it applies to the current user role. That's the risk of
> complicating the function by trying to make it do more than it was
> designed to do.

Yeah, I had added the lappend's directly under the individual commands
at first, thinking back to when we were doing the per-role check there
instead of later and when I realized how you changed it to happen later
I went back and updated them, but apparently missed the 'all' case.

> So for example, build^H^H^Hadd_security_quals() takes 2 lists of
> policies (permissive and restrictive), combines them in the right way
> using OR and AND, does the necessary varno manipulation, and adds the
> resulting quals to the passed-in list of security quals (thereby
> implicitly AND'ing the new quals with any pre-existing ones), which is
> precisely what's needed to support RETURNING.

Right, it just needs to be called twice to have that happen correctly.

> I think this should actually be 2 separate commits, since the
> refactoring and the support for RETURNING are entirely different
> things. It just happens that after the refactoring, the RETURNING
> patch becomes trivial (4 new executable lines of code wrapped in a
> couple of if statements, to fetch and then apply the new policies in
> the necessary cases). At least that's the theory :-)

I had been planning to do them as independent commits in the end, but
thought it easier to review one patch, particularly when the differences
were larger.  I've now reworked adding the RETURNING handling without
changing the other functions, per discussion.

Attached is a git format-patch built series which includes both commits,
now broken out, for review.

Thanks!

Stephen
From f6866952d2b5049acf8eecb45b990c3ab4916f04 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Sun, 13 Sep 2015 09:03:09 -0400
Subject: [PATCH 1/2] RLS refactoring

This refactors rewrite/rowsecurity.c to simplify the handling of the
default deny case (reducing the number of places where we check for and
add the default deny policy from three to one) by splitting up the
retrieval of the policies from the application of them.

This also allowed us to do away with the policy_id field.  A policy_name
field was added for WithCheckOption policies and is used in error
reporting, when available.

Patch by Dean Rasheed, with various mostly cosmetic changes by me.

Back-patch to 9.5 where RLS was introduced to avoid unnecessary
differences, since we're still in alpha, per discussion with Robert.
---
 src/backend/commands/policy.c  |  41 --
 src/backend/executor/execMain.c|  20 +-
 src/backend/nodes/copyfuncs.c  |   1 +
 src/backend/nodes/equalfuncs.c |   1 +
 src/backend/nodes/outfuncs.c   |   1 +
 src/backend/nodes/readfuncs.c  |   1 +
 src/backend/rewrite/rewriteHandler.c   |   5 +-
 src/backend/rewrite/rowsecurity.c  | 816 +++--
 src/backend/utils/cache/relcache.c |   2 -
 src/include/nodes/parsenodes.h |   1 +
 src/include/rewrite/rowsecurity.h  |   3 +-
 .../test_rls_hooks/expected/test_rls_hooks.out |  10 +-
 src/test/modules/test_rls_hooks/test_rls_hooks.c   |   2 -
 13 files changed, 447 insertions(+), 457 deletions(-)

diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 45326a3..8851fe7 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -186,9 +186,6 @@ policy_role_list_to_array(List *roles, int *num_roles)
 /*
  * Load row security policy from the catalog, and store it in
  * the relation's relcache entry.
- *
- * We will always set up some kind of policy here.  If no explicit policies
- * are found then an implicit default-deny policy is created.
  */
 void
 RelationBuildRowSecurity(Relation relation)
@@ -246,7 +243,6 @@ RelationBuildRowSecurity(Relation relation)
 			char	   *with_check_value;
 			Expr	   *with_check_qual;
 			char	   *policy_name_value;
-			Oid			policy_id;
 			bool		isnull;
 			RowSecurityPolicy *policy;
 
@@ -298,14 +294,11 @@ RelationBuildRowSecurity(Relation relation)
 			else
 with_check_qual = NULL;
 
-			policy_id = HeapTupleGetOid(tuple);
-
 			/* Now copy everything into the cache context */
 			MemoryContextSwitchTo(rscxt);
 
 			policy = palloc0(sizeof(RowSecurityPolicy));
 			policy->policy_name = pstrdup(policy_name_value);
-			policy->policy_id = policy_id;
 			policy->polcmd = cmd_value;
 			policy->roles = 

Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Shulgin, Oleksandr
>
> Well, I didn't attach the updated patch (doing that now).
>

This time for real.  Sorry, it's Monday :-p
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..2e3beaf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -43,6 +43,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "utils/cmdstatus.h"
 
 
 shmem_startup_hook_type shmem_startup_hook = NULL;
@@ -139,6 +140,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, CmdStatusShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
+	CmdStatusShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..e637be1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMD_STATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d917af3..1a5b03c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoPending)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 000..5cc6d51
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,691 @@
+/*-
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/procsignal.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_DATA
+} CmdStatusInfoResultCode;
+
+/*
+ * Each process that wants to be able to query other processes for their
+ * status registers itself in a slot in the shared memory by setting the
+ * slot's process ID.  The slots array is indexed by backend ID, so any slot
+ * can be assigned at max to one backend at any given time.
+ *
+ * In order to actually query some backend for its status, the interested
+ * process will initialize its *own* slot, create a DSM segment and initialize
+ * a shared 

Re: [HACKERS] Re: [COMMITTERS] pgsql: Check existency of table/schema for -t/-n option (pg_dump/pg_res

2015-09-14 Thread Teodor Sigaev

 /*
-* We use UNION ALL rather than UNION; this might sometimes result in
-* duplicate entries in the OID list, but we don't care.
+* this might sometimes result in duplicate entries in the OID list,
+* but we don't care.
  */

This looks totally incoherent.  You've removed the thing to which the
word "this" referred and replaced it with nothing.


Oops.

/*
 * The loop below runs multiple SELECTs, which might sometimes result in
 * duplicate entries in the OID list, but we don't care.
 */

looks reasonable?

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-09-14 Thread Thom Brown
On 11 September 2015 at 15:43, Syed, Rahila  wrote:

>
> Hello,
>
> Please find attached updated VACUUM progress checker patch.
> Following have been accomplished in the patch
>
> 1. Accounts for index pages count while calculating  total progress of
> VACUUM.
> 2. Common location for storing progress parameters for any command. Idea
> is every command which needs to report progress can populate and interpret
> the shared variables in its own way.
>  Each can display progress by implementing separate views.
> 3. Separate VACUUM progress view to display various progress parameters
> has been implemented . Progress of various phases like heap scan, index
> scan, total pages scanned along with
> completion percentage is reported.
> 4.This view can display progress for all active backends running VACUUM.
>
> Basic testing has been performed. Thorough testing is yet to be done.
> Marking it as Needs Review in  Sept-Commitfest.
>
> ToDo:
> Display count of heap pages actually vacuumed(marking line pointers unused)
> Display percentage of work_mem being used to store dead tuples.
>

Okay, I've just tested this with a newly-loaded table (1,252,973 rows of jsonb
data), and it works fine during a vacuum.  I can see the scanned_pages,
scanned_heap_pages and percent_complete increasing, but after it's
finished, I end up with this:

json=# select * from pg_stat_vacuum_progress;
-[ RECORD 1 ]---+---
pid | 5569
total_pages | 217941
scanned_pages   | 175243
total_heap_pages| 175243
scanned_heap_pages  | 175243
total_index_pages   | 42698
scanned_index_pages |
percent_complete| 80

This was running with a VACUUM ANALYZE.  This output seems to suggest that
it didn't complete.

After, I ran VACUUM FULL.  pg_stat_vacuum_progress didn't change from
before, so that doesn't appear to show up in the view.

I then deleted 40,000 rows from my table, and ran VACUUM ANALYZE again.
This time it progressed and percent_complete reached 100.

-- 
Thom


[HACKERS] Attach comments to functions' parameters and return value

2015-09-14 Thread Charles Clavadetscher
Hello

In PostgreSQL it is possible to attach comments to almost everything. This
made it possible for us to integrate the wiki that we use for our technical
documentation directly with the database using the MediaWiki [1] extensions
ExternalData [2] and MagicNoCache [3]. The result is a documentation on
tables and related objects (indexes, triggers, etc.) and views that always
shows the current state, i.e. any DDL change or any comment attached to an
object is shown in the wiki immediately (or on refresh if the change was
done after the reader landed on the page).

In order to optimize the query, we wrote a small set of sql functions that
generate wiki markup for the objects queried. The idea behind is that this
is much quicker in PostgreSQL than on a web server hosting MediaWiki,
besides a better control of the privileges for the user retrieving data.

So far we can document in such a way tables and views. I started to create
something similar for functions until I noticed that there is no way to
attach comments on functions' parameters and return value. My first idea was
to add this information in the function description, but this is quite an
ugly solution.

My next workaround is to simulate the behaviour of a COMMENT ON FUNCTION
PARAMETER/RETURNVALUE command inserting comments on these directly in
pg_description. For that I used a convention similar to the one used for
table attributes and defined following pg_description.objsubid:

  -1 = return value
   0 = comment on the function itself (this already exists)
1..n = comment on parameter at position n

An insert would then look like:

INSERT INTO pg_catalog.pg_description
VALUES ('function_name(param_type_list)'::regprocedure,
'pg_proc'::regclass,
parameter_position,
'Comment');

With a simple function similar to the one used to query column descriptions
(pg_catalog.col_description), it is possible to get the comments.

CREATE OR REPLACE FUNCTION pg_catalog.param_description (objoid OID, posnum INTEGER)
RETURNS TEXT
STABLE
AS $$
SELECT description FROM pg_catalog.pg_description
WHERE objoid = $1
AND classoid = 'pg_catalog.pg_proc'::pg_catalog.regclass
AND objsubid = $2;
$$ LANGUAGE SQL;

Example:
INSERT INTO pg_catalog.pg_description
VALUES ('public.create_wiki_doc(integer,text[],text[])'::regprocedure,
'pg_proc'::regclass,
-1,
'Returns a set of TEXT with wiki formatted description of tables and
views');
INSERT INTO pg_catalog.pg_description
VALUES ('public.create_wiki_doc(integer,text[],text[])'::regprocedure,
'pg_proc'::regclass,
1,
'Wiki title level for each table documentation. The number of "=" to
put before and after the name of the object');
INSERT INTO pg_catalog.pg_description
VALUES ('public.create_wiki_doc(integer,text[],text[])'::regprocedure,
'pg_proc'::regclass,
2,
'An array with the list of schemas to be documented');
Etc.

SELECT param_description('public.create_wiki_doc(integer,text[],text[])'::regprocedure, -1);
                             param_description
----------------------------------------------------------------------------
 Returns a set of TEXT with wiki formatted description of tables and views

SELECT param_description('public.create_wiki_doc(integer,text[],text[])'::regprocedure, 1);
                             param_description
---------------------------------------------------------------------------------------------------
 Wiki title level for each table documentation. The number of "=" to put
 before and after the name of the object

SELECT param_description('public.create_wiki_doc(integer,text[],text[])'::regprocedure, 2);
                  param_description
-----------------------------------------------------
 An array with the list of schemas to be documented

 Etc.

As I said, this is just a workaround, and it is not comfortable to manipulate
catalog tables directly. A much better solution would be to have an
implementation of the SQL COMMENT command for parameters and the return value
of functions built into the system. So my questions to the community on this
topic are:

- Is there a reason why this approach should not be followed (currently as a
workaround, later as an implementation in C)?
- Is somebody already doing implementation work in this area, or would anyone
be interested in engaging?
- Is there general interest in this addition?
- Any good advice, tips, or suggestions?
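
Coming back to the built-in command mentioned above, a hedged sketch of what
it could look like, reusing the example function (hypothetical syntax, nothing
of this is implemented anywhere):

-- hypothetical syntax, for discussion only
COMMENT ON FUNCTION PARAMETER 2 OF
    public.create_wiki_doc(integer, text[], text[])
    IS 'An array with the list of schemas to be documented';

COMMENT ON FUNCTION RETURN VALUE OF
    public.create_wiki_doc(integer, text[], text[])
    IS 'Returns a set of TEXT with wiki formatted description of tables and views';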

I was not completely inactive. I started looking into the code and I am,
honestly, a bit puzzled (see below). If I were to take up the job, which I
would love, I guess that this could not be before the beginning of November
this year. With some help, however, I may be able to start earlier.

What I could find so far looking at the documentation and the code is that
there are quite a number of files to be touched. The given snippets are just
for illustration:

- src/include/nodes/parsenodes.h : Define object types e.g.
FUNCTION_PARAMETER, 

Re: [HACKERS] Review: GiST support for UUIDs

2015-09-14 Thread Andres Freund
Please post reviews to the original thread unless that's already humongously 
large. Otherwise somebody needs to manually link up the threads in the CF 
application.

It also makes it much harder to follow the development because there'll likely
be a new version of the patch after a review, which can then be pushed either in
a thread titled 'review', or in an old one, without context.

Andres
--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS open items are vague and unactionable

2015-09-14 Thread Dean Rasheed
On 14 September 2015 at 14:47, Stephen Frost  wrote:
> Attached is a git format-patch built series which includes both commits,
> now broken out, for review.
>

That looks OK to me.

A minor point -- this comment isn't quite right:

/*
 * For the target relation, when there is a returning list, we need to
 * collect up CMD_SELECT policies to add via add_security_quals and
 * add_with_check_options.  This is because, for the RETURNING case, we
 * have to filter any records which are not visible through an ALL or SELECT
 * USING policy.
 *
 * We don't need to worry about the non-target relation case because we are
 * checking the ALL and SELECT policies for those relations anyway (see
 * above).
 */

because the policies that are fetched there are only used for
add_security_quals(), not for add_with_check_options(). It might be
cleaner if the 'if' statement that follows were merged with the
identical one a few lines down, and then those returning policies
could be local to that block, with the 2 pieces of RETURNING handling
done together. Similarly for the upsert block.

Actually, it isn't necessary to test that rt_index ==
root->resultRelation, because for all other relations commandType is
set to CMD_SELECT higher up, so the 'returning' bool variable could
just be replaced with 'root->returningList != NIL' throughout.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Review: GiST support for UUIDs

2015-09-14 Thread Teodor Sigaev

http://www.postgresql.org/message-id/flat/ca+renyvephxto1c7dfbvjp1gymuc0-3qdnwpn30-noo5mpy...@mail.gmail.com#ca+renyvephxto1c7dfbvjp1gymuc0-3qdnwpn30-noo5mpy...@mail.gmail.com

Patch looks perfect but it still needs some work.

0) rebase to current HEAD (done, attached)
1) UUIDSIZE ->  UUID_LEN (it's defined in utils/uuid.h, done)
2)
static double
uuid2num(const pg_uuid_t *i)
{
return *((uint64 *)i);
}
   It doesn't look like a correct transformation to me. Maybe it's better
   to transform to the numeric type (a UUID looks like a 16-digit hexadecimal number)
   and follow the gbt_numeric_penalty() logic (or even call it directly).




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index a4b2cc7..43a6c5d 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -6,16 +6,16 @@ OBJS =  btree_gist.o btree_utils_num.o btree_utils_var.o 
btree_int2.o \
 btree_int4.o btree_int8.o btree_float4.o btree_float8.o btree_cash.o \
 btree_oid.o btree_ts.o btree_time.o btree_date.o btree_interval.o \
 btree_macaddr.o btree_inet.o btree_text.o btree_bytea.o btree_bit.o \
-btree_numeric.o $(WIN32RES)
+btree_numeric.o btree_uuid.o $(WIN32RES)
 
 EXTENSION = btree_gist
-DATA = btree_gist--1.1.sql btree_gist--unpackaged--1.0.sql \
-   btree_gist--1.0--1.1.sql
+DATA = btree_gist--1.2.sql btree_gist--unpackaged--1.0.sql \
+   btree_gist--1.0--1.1.sql btree_gist--1.1--1.2.sql
 PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes"
 
 REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
 time timetz date interval macaddr inet cidr text varchar char bytea \
-bit varbit numeric not_equal
+bit varbit numeric uuid not_equal
 
 SHLIB_LINK += $(filter -lm, $(LIBS))
 
diff --git a/contrib/btree_gist/btree_gist--1.1--1.2.sql 
b/contrib/btree_gist/btree_gist--1.1--1.2.sql
new file mode 100644
index 000..376e884
--- /dev/null
+++ b/contrib/btree_gist/btree_gist--1.1--1.2.sql
@@ -0,0 +1,64 @@
+/* contrib/btree_gist/btree_gist--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION btree_gist UPDATE TO '1.2'" to load this file. \quit
+
+-- Add support for indexing UUID columns
+
+-- define the GiST support methods
+CREATE FUNCTION gbt_uuid_consistent(internal,uuid,int2,oid,internal)
+RETURNS bool
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_uuid_fetch(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_uuid_compress(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_uuid_penalty(internal,internal,internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_uuid_picksplit(internal, internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_uuid_union(bytea, internal)
+RETURNS gbtreekey16
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION gbt_uuid_same(internal, internal, internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT;
+
+-- Create the operator class
+CREATE OPERATOR CLASS gist_uuid_ops
+DEFAULT FOR TYPE uuid USING gist
+AS
+   OPERATOR1   <   ,
+   OPERATOR2   <=  ,
+   OPERATOR3   =   ,
+   OPERATOR4   >=  ,
+   OPERATOR5   >   ,
+   FUNCTION1   gbt_uuid_consistent (internal, uuid, int2, oid, 
internal),
+   FUNCTION2   gbt_uuid_union (bytea, internal),
+   FUNCTION3   gbt_uuid_compress (internal),
+   FUNCTION4   gbt_decompress (internal),
+   FUNCTION5   gbt_uuid_penalty (internal, internal, internal),
+   FUNCTION6   gbt_uuid_picksplit (internal, internal),
+   FUNCTION7   gbt_uuid_same (internal, internal, internal),
+   STORAGE gbtreekey32;
+
+ALTER OPERATOR FAMILY gist_uuid_ops USING gist ADD
+   OPERATOR6   <>  (uuid, uuid) ,
+   FUNCTION9 (uuid, uuid) gbt_uuid_fetch (internal) ;
diff --git a/contrib/btree_gist/btree_gist--1.1.sql 
b/contrib/btree_gist/btree_gist--1.1.sql
deleted file mode 100644
index cdec964..000
--- a/contrib/btree_gist/btree_gist--1.1.sql
+++ /dev/null
@@ -1,1570 +0,0 @@
-/* contrib/btree_gist/btree_gist--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION btree_gist" to load this file. \quit
-
-CREATE FUNCTION gbtreekey4_in(cstring)
-RETURNS gbtreekey4
-AS 'MODULE_PATHNAME', 'gbtreekey_in'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION 

Re: [HACKERS] RLS open items are vague and unactionable

2015-09-14 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> On 14 September 2015 at 14:47, Stephen Frost  wrote:
> > Attached is a git format-patch built series which includes both commits,
> > now broken out, for review.
> 
> That looks OK to me.

Excellent.

> A minor point -- this comment isn't quite right:
> 
> /*
>  * For the target relation, when there is a returning list, we need to
>  * collect up CMD_SELECT policies to add via add_security_quals and
>  * add_with_check_options.  This is because, for the RETURNING case, we
>  * have to filter any records which are not visible through an ALL or 
> SELECT
>  * USING policy.
>  *
>  * We don't need to worry about the non-target relation case because we 
> are
>  * checking the ALL and SELECT policies for those relations anyway (see
>  * above).
>  */
> 
> because the policies that are fetched there are only used for
> add_security_quals(), not for add_with_check_options(). It might be
> cleaner if the 'if' statement that follows were merged with the
> identical one a few lines down, and then those returning policies
> could be local to that block, with the 2 pieces of RETURNING handling
> done together. Similarly for the upsert block.

Hmm, ok, will take a look at doing that.

> Actually, it isn't necessary to test that rt_index ==
> root->resultRelation, because for all other relations commandType is
> set to CMD_SELECT higher up, so the 'returning' bool variable could
> just be replaced with 'root->returningList != NIL' throughout.

I had thought something similar originally and ran into a case where
that didn't quite work.  That was a few revisions ago though, so perhaps
there was something else going on.  I'll take a look at making this
change also (which was actually how I had implemented it initially).

I'll be offline for a few hours as I'm about to fly to Dallas, but I'll
get to this tomorrow morning, at the latest.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-14 Thread Andres Freund
On 2015-09-14 13:16:46 +0300, YUriy Zhuravlev wrote:
> On Friday 11 September 2015 18:50:35 you wrote:

> > a) As I said upthread there's a patch to remove these locks entirely

> It is very interesting. Could you provide a link?

http://archives.postgresql.org/message-id/CA%2BTgmoYE4t-Pt%2Bv08kMO5u_XN-HNKBWtfMgcUXEGBrQiVgdV9Q%40mail.gmail.com

> And it's not very good,
> since there is a bottleneck PinBuffer / UnpinBuffer instead of
> LWLocks.

Where the bottleneck is entirely depends on your workload. If you have a
high cache replacement ratio the mapping partition locks are frequently
going to be held exclusively.

> > b) It doesn't matter anyway. Not every pin goes through the buffer
> >mapping table. StrategyGetBuffer(), SyncOneBuffer(), ...

> StrategyGetBuffer call only from BufferAlloc .

It gets called without buffer mapping locks held. And it can (and
frequently will!) access all the buffers in the buffer pool.

> SyncOneBuffer not problem too because:

> PinBuffer_Locked(bufHdr);

Which you made ineffective because PinBuffer() doesn't take a lock
anymore. Mutual exclusion through locks only works if all participants
take the locks.

> We checked all functions with refcount and usage_count.

Adding lockless behaviour by just taking out locks without analyzing the
whole isn't going to fly. You either need to provide backward
compatibility (a LockBuffer that provides actual exclusion) or you
actually need to go carefully through all relevant code and make it
lock-free.

I pointed out how you can actually make this safely lock-free giving you
the interesting code.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GiST support for UUIDs

2015-09-14 Thread Teodor Sigaev



Andres Freund wrote:

Please post reviews to the original thread unless that's already humongously 
large.

I've lost the original mail. Sorry.


Otherwise somebody needs to manually link up the threads in the CF application.


Of course, I did it



It also makes it much harder to follow the development because there'll likely
be a new version of the patch
after a review, which can then be pushed either in a thread titled 'review', or in
an old one, without context.


Sorry, I tried to fix that as well as possible.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spurious standby query cancellations

2015-09-14 Thread Jeff Janes
On Fri, Sep 4, 2015 at 3:25 PM, Simon Riggs  wrote:

> On 27 August 2015 at 22:55, Jeff Janes  wrote:
>
>> In ResolveRecoveryConflictWithLock, there is the comment:
>>
>> /*
>>  * If blowing away everybody with conflicting locks doesn't work,
>> after
>>  * the first two attempts then we just start blowing everybody away
>> until
>>  * it does work.
>>
>>
>> But what it does is something different than that.
>>
>> At least under some conditions, it will wait patiently for all initially
>> conflicting locks to go away.  If that doesn't work twice (because new
>> lockers joined while we were waiting for the old ones to go away), then it
>> will wait patiently for all transactions throughout the system to go away
>> even if they don't conflict, perhaps not even realizing that the lock has
>> become free in the mean time.
>>
>> Then when its patience runs out, it kills everything on the system.  But
>> it never tried to blow away just the conflicting locks, instead it just
>> tried to wait them out.
>>
>> The fact that trying to wait them out didn't work (twice) doesn't mean
>> that killing them wouldn't have worked.
>>
>> I think that it was intended that num_attempts would get incremented only
>> once WaitExceedsMaxStandbyDelay becomes true, but that is not what happens
>> with the current code.
>>
>> Currently WaitExceedsMaxStandbyDelay only has one caller.  I think it we
>> should take the sleep code out of that function and move it into the
>> existing call site, and then have ResolveRecoveryConflictWithLock check
>> with WaitExceedsMaxStandbyDelay before incrementing num_attempts.
>>
>
> I agree with the problem in the current code, thank you for spotting it.
>
> I also agree that the proposed solution would work-ish, if we are
> suggesting backpatching this.
>

I wasn't specifically thinking of backpatching it, but if it doesn't
operate the way you intended it to, it might make sense to do that.

I stumbled on this while experimentally looking into a different issue (the
ProcArrayLock contention issue in HeapTupleSatisfiesMVCC that was recently
ameliorated in 8a7d0701814, but the variant of that issue reported
in "[PERFORM] Strange query stalls on replica in 9.3.9").  I haven't heard
of any complaints from the field about too-frequent query cancellations,
but I also don't have my "ear to the ground" in that area.


>
> It's now possible to fix this by putting a lock wait on the actual lock
> request, which wasn't available when I first wrote that, hence the crappy
> wait loop. Using the timeout handler would now be the preferred way to
> solve this. We can backpatch that to 9.3 if needed, where they were
> introduced.
>
> There's an example of how to use lock waits further down
> on ResolveRecoveryConflictWithBufferPin(). Could you have a look at doing
> it that way?
>

It looks like this will take some major surgery.  The heavy weight lock
manager doesn't play well with others when it comes to timeouts other than
its own.  LockBufferForCleanup is a simple retry loop, but the lock manager
is much more complicated than that.

Here are some approaches I can think of:

Approach one, replace LockAcquireExtended's dontWait boolean with a
"waitUntil" timeout value.  Queue the lock like ordinary
(non-recovery-backend)  lock requests are queued so that new requestors are
not granted new locks which conflict with what we want.  If the timeout
expires without the lock being acquired, dequeue ourselves and clean up,
and then return LOCKACQUIRE_NOT_AVAIL.  I think something would have to be
done about deadlocks, as well, as I don't think the regular deadlock
detector works in the standby startup process.

Approach two, same as one but when the timeout expires we invoke a callback
to terminate the blocking processes without dequeueing ourselves.  That way
no one can sneak in and grab the lock during the time we were doing the
terminations.  This way we are guaranteed to succeed after one round of
terminations, and there is no need to terminate innocent bystanders.  I
think this is my preferred solution to have.  (But I haven't had much luck
implementing it beyond the hand-wavy stages.)

Approach three, ask for the lock with dontWait as we do now, but if we
don't get the lock then somehow arrange for us to be signalled when the
lock becomes free, without being queued for it.  Then we would have to
retry, with no guarantee the retry would be successful.  Functionally this
would be no different than the simple patch I gave, except the polling loop
is much more efficient.  You still have to terminate processes if a stream
of new requestors take the lock in a constantly overlapping basis.


> We probably also need to resurrect my earlier patch to avoid logging
> AccessExclusiveLocks on temp tables.
>

I couldn't find that, can you provide a link?

Another source of frequent AccessExclusiveLocks is vacuum truncation
attempts.  If a table has some occupied 

Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-14 Thread Andres Freund
On 2015-09-14 12:03:31 -0500, Jim Nasby wrote:
> On 9/13/15 2:43 AM, David Rowley wrote:
> >Are you worried about this because I've not focused on optimising float
> >timestamps as much as int64 timestamps?  Are there many people compiling
> >with float timestamps in the real world?
> 
> My $0.02: the default was changed some 5 years ago so FP time is probably
> pretty rare now. I don't think it's worth a bunch of extra work to speed
> them up.

Agreed. That's not why I'm arguing for it. It's about having kind-of-duplicate
code that's not exercised at all by any common setup.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Attach comments to functions' parameters and return value

2015-09-14 Thread Jim Nasby

On 9/14/15 8:59 AM, Charles Clavadetscher wrote:

Hello

In PostgreSQL it is possible to attach comments to almost everything. This
made it possible for us to integrate the wiki that we use for our technical
documentation directly with the database using the MediaWiki [1] extensions
ExternalData [2] and MagicNoCache [3]. The result is a documentation on
tables and related objects (indexes, triggers, etc.) and views that always
shows the current state, i.e. any DDL change or any comment attached to an
object is shown in the wiki immediately (or on refresh if the change was
done after the reader landed on the page).


Neat! I hope you'll open source that. :)


In order to optimize the query, we wrote a small set of sql functions that
generate wiki markup for the objects queried. The idea behind is that this
is much quicker in PostgreSQL than on a web server hosting MediaWiki,
besides a better control of the privileges for the user retrieving data.


And that! :)


So far we can document in such a way tables and views. I started to create
something similar for functions until I noticed that there is no way to
attach comments on functions' parameters and return value. My first idea was
to add this information in the function description, but this is quite an
ugly solution.

My next workaround is to simulate the behaviour of a COMMENT ON FUNCTION
PARAMETER/RETURNVALUE command inserting comments on these directly in
pg_description. For that I used a convention similar to the one used for
table attributes and defined following pg_description.objsubid:

   -1 = return value
0 = comment on the function itself (this already exists)
1..n = comment on parameter at position n


At first glance, seems reasonable.


   - Add function to get the position of the parameter, e.g.
LookupFuncNamePos(function_name, param_name) or a function to create the
whole ObjectAddress e.g. .


Something similar might exist already. TBH, to start with, I would only 
support position number. You'll have to support that case anyway, and it 
should be simpler.



To long time PostgreSQL developers this may look straightforward. For the
moment I am not even sure if that is correct and if there are other places
that would need additions, apart from the obvious display in psql.


I suspect that changes to support this should be pretty localized. I 
suggest you look at other recent patches that have added COMMENT 
functionality to see what they did.


BTW, I'm also interested in this but I'm not sure when I'd have time to 
work on it.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Pavel Stehule
2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr 
:

> On Mon, Sep 14, 2015 at 3:09 PM, Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <
>> tomas.von...@2ndquadrant.com> wrote:
>>
>>>
 Now the backend that has been signaled on the second call to
 pg_cmdstatus (it can be either some other backend, or the backend B
 again) will not find an unprocessed slot, thus it will not try to
 attach/detach the queue and the backend A will block forever.

 This requires a really bad timing and the user should be able to
 interrupt the querying backend A still.

>>>
>>> I think we can't rely on the low probability that this won't happen, and
>>> we should not rely on people interrupting the backend. Being able to detect
>>> the situation and fail gracefully should be possible.
>>>
>>> It may be possible to introduce some lock-less protocol preventing such
>>> situations, but it's not there at the moment. If you believe it's possible,
>>> you need to explain and "prove" that it's actually safe.
>>>
>>> Otherwise we may need to introduce some basic locking - for example we
>>> may introduce a LWLock for each slot, and lock it with dontWait=true (and
>>> skip it if we couldn't lock it). This should prevent most scenarios where
>>> one corrupted slot blocks many processes.
>>
>>
>> OK, I will revisit this part then.
>>
>
> I have a radical proposal to remove the need for locking: make the
> CmdStatusSlot struct consist of a mere dsm_handle and move all the required
> metadata like sender_pid, request_type, etc. into the shared memory segment
> itself.
>
> If we allow only the requesting process to update the slot (that is
> the handle value itself) this removes the need for locking between sender
> and receiver.
>
> The sender will walk through the slots looking for a non-zero dsm handle
> (according to dsm_create() implementation 0 is considered an invalid
> handle), and if it finds a valid one, it will attach and look inside, to
> check if it's destined for this process ID.  At first that might sound
> strange, but I would expect 99% of the time that the only valid slot would
> be for the process that has been just signaled.
>
> The sender process will then calculate the response message, update the
> result_code in the shared memory segment and finally send the message
> through the queue.  If the receiver has since detached we get a detached
> result code and bail out.
>
> Clearing the slot after receiving the message should be the requesting
> process' responsibility.  This way the receiver only writes to the slot and
> the sender only reads from it.
>
> By the way, is it safe to assume atomic read/writes of dsm_handle
> (uint32)?  I would be surprised if not.
>

I don't see any reason why it should not work - only a few processes will
wait for data - so lost attach/detach shm operations will not cost too much.

Pavel


>
> --
> Alex
>
>


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Jim Nasby

On 9/11/15 7:45 AM, Anastasia Lubennikova wrote:

This idea has obvious restriction. We can set unique only for first
index columns.
There is no clear way to maintain following index.
CREATE INDEX index ON table (c1, c2, c3) UNIQUE ON (c1, c3);

So I suggest following syntax:
CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}} INDEX ON
table_name (column_name1, column_name2 ...);


I would use the first (simple) syntax and just throw an error if the 
user tries to skip a column on the UNIQUE clause.
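
For instance, with the simpler syntax quoted above (hypothetical, not
implemented; table and column names are made up):

CREATE INDEX ON tbl (c1, c2, c3) UNIQUE ON (c1, c2);  -- unique over a leading prefix: OK
CREATE INDEX ON tbl (c1, c2, c3) UNIQUE ON (c1, c3);  -- skips c2: would raise an error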


Have you by chance looked to see what other databases have done for 
syntax? I'm guessing this isn't covered by ANSI but maybe there's 
already an industry consensus.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Do Layered Views/Relations Preserve Sort Order ?

2015-09-14 Thread Jim Nasby

On 9/9/15 7:55 PM, Charles Sheridan wrote:

The better question is how expensive is it to sort already sorted
data.  If its cheap, and it likely is, then placing explicit sorting
where you care is the best solution regardless of your level of
confidence that lower level sorting is being maintained.

...


David,  yes, I agree that sorting at the end is the highest-confidence
approach.  I don't (yet) have a large stack of views with an assumption
of a guaranteed underlying sort order, I'm just trying to get a better
sense of what Postgres behavior I can reasonably expect here.


BTW, I believe there is some code in the planner to remove useless 
ORDER-BYs.
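
A minimal illustration of the point made upthread, i.e. putting the explicit
sort where it actually matters (table and view names are made up):

CREATE VIEW recent_orders AS
    SELECT * FROM orders ORDER BY created_at;    -- ordering here is not a guarantee

SELECT * FROM recent_orders ORDER BY created_at;  -- explicit top-level sort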

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-14 Thread Jim Nasby

On 9/13/15 2:43 AM, David Rowley wrote:

Are you worried about this because I've not focused on optimising float
timestamps as much as int64 timestamps?  Are there many people compiling
with float timestamps in the real world?


My $0.02: the default was changed some 5 years ago so FP time is 
probably pretty rare now. I don't think it's worth a bunch of extra work 
to speed them up.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] jsonb_set array append hack?

2015-09-14 Thread Thom Brown
Hi,

I've noticed that if you use a string for an element key in jsonb_set with
create_missing set to true, you can use it to append to an array:

postgres=# SELECT jsonb_set(
'{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb,
   '{vehicle_types,nonsense}',
   '"motorcycle"', true);
   jsonb_set

 {"name": "Joe", "vehicle_types": ["car", "van", "motorcycle"]}
(1 row)

What this really should match is a nested element inside "vehicle_types"
called "nonsense".  But this seems to be a hack to get an element added to
an array.  To do it properly currently requires specifying an arbitrary
number in the hope that it will exceed the number of elements you have in
the array.

e.g.

postgres=# SELECT jsonb_set(
   '{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb,
   '{vehicle_types,10}',
   '"motorcycle"', true);
   jsonb_set

 {"name": "Joe", "vehicle_types": ["car", "van", "motorcycle"]}
(1 row)
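
For completeness, a hedged alternative sketch that avoids guessing an array
index: replace the whole array, appending to it with the jsonb concatenation
operator (||) that was introduced alongside jsonb_set (untested here):

SELECT jsonb_set(
   '{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb,
   '{vehicle_types}',
   -- assumes jsonb || jsonb treats the scalar as a one-element array
   ('{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb -> 'vehicle_types')
       || '"motorcycle"'::jsonb);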

But I'm guessing people shouldn't be relying on the hack in the first
example.  Isn't this a bug?  If so, wouldn't this also be a bug?:

postgres=# SELECT jsonb_set(
   '{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb,
   array['vehicle_types',NULL],
   '"motorcycle"', true);

Thom


[HACKERS] [PATCH] add --log-output to pg_ctl on Windows

2015-09-14 Thread Egon Kocjan

Hello

The patch implements a command line option for pg_ctl on Windows to 
redirect logging of errors (write_stderr). The possible outputs are: 
default (eventlog if running as a service, otherwise stderr), stderr or 
eventlog.


The previous discussion in BUG #13594:
http://www.postgresql.org/message-id/cab7npqq+cybqd8-memc0qqruwc3ecotgqpmtskfsvbm6gdb...@mail.gmail.com

Best regards
Egon Kocjan




20150914_pgctl_winlogoutput.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Shulgin, Oleksandr
On Mon, Sep 14, 2015 at 3:09 PM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>>
>>> Now the backend that has been signaled on the second call to
>>> pg_cmdstatus (it can be either some other backend, or the backend B
>>> again) will not find an unprocessed slot, thus it will not try to
>>> attach/detach the queue and the backend A will block forever.
>>>
>>> This requires a really bad timing and the user should be able to
>>> interrupt the querying backend A still.
>>>
>>
>> I think we can't rely on the low probability that this won't happen, and
>> we should not rely on people interrupting the backend. Being able to detect
>> the situation and fail gracefully should be possible.
>>
>> It may be possible to introduce some lock-less protocol preventing such
>> situations, but it's not there at the moment. If you believe it's possible,
>> you need to explain and "prove" that it's actually safe.
>>
>> Otherwise we may need to introduce some basic locking - for example we
>> may introduce a LWLock for each slot, and lock it with dontWait=true (and
>> skip it if we couldn't lock it). This should prevent most scenarios where
>> one corrupted slot blocks many processes.
>
>
> OK, I will revisit this part then.
>

I have a radical proposal to remove the need for locking: make the
CmdStatusSlot struct consist of a mere dsm_handle and move all the required
metadata like sender_pid, request_type, etc. into the shared memory segment
itself.

If we allow only the requesting process to update the slot (that is the
handle value itself) this removes the need for locking between sender and
receiver.

The sender will walk through the slots looking for a non-zero dsm handle
(according to dsm_create() implementation 0 is considered an invalid
handle), and if it finds a valid one, it will attach and look inside, to
check if it's destined for this process ID.  At first that might sound
strange, but I would expect 99% of the time that the only valid slot would
be for the process that has been just signaled.

The sender process will then calculate the response message, update the
result_code in the shared memory segment and finally send the message
through the queue.  If the receiver has since detached we get a detached
result code and bail out.

Clearing the slot after receiving the message should be the requesting
process' responsibility.  This way the receiver only writes to the slot and
the sender only reads from it.

By the way, is it safe to assume atomic read/writes of dsm_handle
(uint32)?  I would be surprised if not.

--
Alex


Re: [HACKERS] Review: GiST support for UUIDs

2015-09-14 Thread Teodor Sigaev



Paul Jungwirth wrote:

2)
 static double
 uuid2num(const pg_uuid_t *i)
 {
 return *((uint64 *)i);
 }
It doesn't look like a correct transformation to me. Maybe it's better
to transform to the numeric type (a UUID looks like a 16-digit hexadecimal
number)
and follow the gbt_numeric_penalty() logic (or even call it directly).


Thanks for the review! A UUID is actually not stored as a string of
hexadecimal digits. (It is normally displayed that way, but with 32
digits, not 16.) Rather it is stored as an unstructured 128-bit value
(which in C is 16 unsigned chars). Here is the easy-to-misread
declaration from src/backend/utils/adt/uuid.c:
I missed the number of digits, but nevertheless it doesn't matter for the idea.
The original coding uses only 8 bytes out of 16 to compute the penalty, which
could cause a problem with index performance. A simple way is just to print each
4 bits with the %02d modifier into a string and then make a numeric value with
the help of numeric_in.


Or something like this in pseudocode:

numeric = int8_numeric(*(uint64 *)(&uuid->data[0])) * int8_numeric(MAX_INT64) +
          int8_numeric(*(uint64 *)(&uuid->data[8]))



The only other 128-bit type I found in btree_gist was Interval. For that
type we convert to a double using INTERVAL_TO_SEC, then call
penalty_num. By my read that accepts a similar loss of precision.
Right, but the precision of double is enough to represent a one-century
interval with 0.1-second accuracy, which is enough for practical
usage. In the UUID case you would take into account only half of the value. Of
course, GiST will work even with a penalty function returning a constant, but
each scan could become a full-index scan.




If I'm mistaken about 128-bit integer support, let me know, and maybe we
can do the penalty computation on the whole UUID. Or maybe I should just
convert the uint64 to a double before calling penalty_num? I don't
completely understand what the penalty calculation is all about, so I
welcome suggestions here.


The penalty method calculates how much the union key will be enlarged if the
insert goes into the current subtree. It directly affects the selectivity of the subtree.


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Jim Nasby

On 9/14/15 1:50 PM, Thomas Munro wrote:

CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}}
INDEX ON
table_name (column_name1, column_name2 ...);


I would use the first (simple) syntax and just throw an error if the
user tries to skip a column on the UNIQUE clause.

Seems, second option looks as more natural extension of CREATE
UNIQUE INDEX


True, but it's awfully verbose. :( And...


It surprised me that you can INCLUDE extra columns on non-UNIQUE
indexes, since you could just add them as regular indexed columns for
the same effect.  It looks like when you do that in SQL Server, the
extra columns are only stored on btree leaf pages and so can't be used
for searching or ordering.  I don't know how useful that is or if we
would ever want it... but I just wanted to note that difference, and
that the proposed UNIQUE ON FIRST n COLUMNS syntax and catalog change
can't express that.


... we might want to support INCLUDE at some point. It enhances covering 
scans without bloating the heck out of the btree. (I'm not sure if it 
would help other index types...) So it seems like a bad idea to preclude 
that.


I don't see that UNIQUE ON FIRST precludes also supporting INCLUDE. 
Presumably we could do either


CREATE INDEX ... ON table (f1, f2, f3) UNIQUE(f1, f2) INCLUDE(f4);
or
CREATE UNIQUE ON FIRST 2 COLUMNS INDEX ... ON table (f1, f2, f3) 
INCLUDE(f4);


Personally, I find the first form easier to read.

Are we certain that no index type could ever support an index on (f1, 
f2, f3) UNIQUE(f1, f3)? Even if it doesn't make sense for btree, perhaps 
some other index could handle it.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Thom Brown
On 14 September 2015 at 22:44, Jim Nasby  wrote:

> On 9/14/15 1:50 PM, Thomas Munro wrote:
>
>> CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}}
>> INDEX ON
>> table_name (column_name1, column_name2 ...);
>>
>>
>> I would use the first (simple) syntax and just throw an error if
>> the
>> user tries to skip a column on the UNIQUE clause.
>>
>> Seems, second option looks as more natural extension of CREATE
>> UNIQUE INDEX
>>
>
> True, but it's awfully verbose. :( And...
>
> It surprised me that you can INCLUDE extra columns on non-UNIQUE
>> indexes, since you could just add them as regular indexed columns for
>> the same effect.  It looks like when you do that in SQL Server, the
>> extra columns are only stored on btree leaf pages and so can't be used
>> for searching or ordering.  I don't know how useful that is or if we
>> would ever want it... but I just wanted to note that difference, and
>> that the proposed UNIQUE ON FIRST n COLUMNS syntax and catalog change
>> can't express that.
>>
>
> ... we might want to support INCLUDE at some point. It enhances covering
> scans without bloating the heck out of the btree. (I'm not sure if it would
> help other index types...) So it seems like a bad idea to preclude that.
>
> I don't see that UNIQUE ON FIRST precludes also supporting INCLUDE.
> Presumably we could do either
>
> CREATE INDEX ... ON table (f1, f2, f3) UNIQUE(f1, f2) INCLUDE(f4);
> or
> CREATE UNIQUE ON FIRST 2 COLUMNS INDEX ... ON table (f1, f2, f3)
> INCLUDE(f4);
>
> Personally, I find the first form easier to read.
>

+1

I guess the standard CREATE UNIQUE INDEX can be seen as shorthand for
CREATE INDEX with all columns listed in the UNIQUE clause.
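
In other words, with the hypothetical syntax from upthread:

CREATE UNIQUE INDEX ON t (c1, c2);
-- would just be shorthand for
CREATE INDEX ON t (c1, c2) UNIQUE ON (c1, c2);   -- hypothetical syntax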

> Are we certain that no index type could ever support an index on (f1, f2,
> f3) UNIQUE(f1, f3)? Even if it doesn't make sense for btree, perhaps some
> other index could handle it.
>

That's certainly an interesting question.  At the moment, only btree is
capable of enforcing uniqueness, but that's not to say it will always be
that way.  But I guess you'd need a way for the access method to define
whether it's capable of multi-column indexes with out-of-order
unique columns (or some more sensible way of describing it).

Thom


Re: [HACKERS] Re: [COMMITTERS] pgsql: Check existency of table/schema for -t/-n option (pg_dump/pg_res

2015-09-14 Thread Robert Haas
On Mon, Sep 14, 2015 at 10:54 AM, Teodor Sigaev  wrote:
>>  /*
>> -* We use UNION ALL rather than UNION; this might sometimes result in
>> -* duplicate entries in the OID list, but we don't care.
>> +* this might sometimes result in duplicate entries in the OID list,
>> +* but we don't care.
>>   */
>>
>> This looks totally incoherent.  You've removed the thing to which the
>> word "this" referred and replaced it with nothing.
>
>
> Oops.
>
> /*
>  * The loop below runs multiple SELECTs, which might sometimes result in
>  * duplicate entries in the OID list, but we don't care.
>  */
>
> looks reasonable?

Sure, that seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-14 Thread Peter Eisentraut
On 9/8/15 4:56 PM, Andrew Dunstan wrote:
> The problem is that at least this user's system had something odd about
> it. so that I wouldn't entirely trust the output of
> 
>select is_supported
>from information_schema.sql_features
>where feature_name = 'XML type';
> 
> to reflect the config.

This should be a built-in function, not dependent on the state of the
catalogs, like pg_build_option('xml') returns boolean.

> I also have cases where clients don't want to give me superuser access,
> and sometimes not even shell access, and it could well be useful to me
> to be able to say to them "OK, you need to make sure that this file in
> this location has this entry".

Not sure what this has to do with this.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Can extension build own SGML document?

2015-09-14 Thread Alvaro Herrera
Peter Eisentraut wrote:
> On 9/14/15 5:35 AM, Kouhei Kaigai wrote:
> > Hello,
> > 
> > The pgxs makefile (pgxs.mk) says:
> > 
> > #   DOCS -- random files to install under $PREFIX/doc/$MODULEDIR
> > 
> > It is a bunch of files to be copied to the document directory; however,
> > it does not look like a variable for specifying SGML sources the way
> > PostgreSQL core does.
> > 
> > Do we have a way to build the SGML sources of an extension using the makefile?
> 
> No.
> 
> We could, but a project that's large enough to want this kind of thing
will likely have its own ideas and requirements, so it's unclear how
> difficult it would be to support that.

FWIW bdr has such a thing.  Mostly, they replaced postgres.sgml with
bdr.sgml and stripped down and edited the Makefile to match.  The
resulting doc is a standalone "book" (in docbook parlance).  You can see
the source code here
http://git.postgresql.org/gitweb/?p=2ndquadrant_bdr.git;a=tree;f=doc;hb=HEAD
Note there's no support for anything other than HTML --- no PDF, no
EPUB, etc.

The resulting HTML is published as http://bdr-project.org/docs/stable/


I think the only way upstream Postgres could offer this is as a way to
build a separate "book", i.e. not a chapter/section within the main
book.  I think it would require huge complications in doc/src/sgml's
Makefile.  Not sure it's worthwhile.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Can extension build own SGML document?

2015-09-14 Thread Michael Paquier
On Tue, Sep 15, 2015 at 6:01 AM, Alvaro Herrera wrote:
> I think the only way upstream Postgres could offer this is as a way to
> build a separate "book", i.e. not a chapter/section within the main
> book.  I think it would require huge complications in doc/src/sgml's
> Makefile.  Not sure it's worthwhile.

I am not sure either, and as each project/developer always has
different requirements, I am convinced that this would end up being
overridden by project-specific Makefile rules anyway, so it is
really unclear what the potential benefit of a centralized facility
would be. Take man pages, for example: those should not be
installed in share/doc/extension/, which is the default path, but in
$(DESTDIR)$(mandir)/man1...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Oleg Bartunov
On Tue, Sep 15, 2015 at 12:44 AM, Jim Nasby 
wrote:

> On 9/14/15 1:50 PM, Thomas Munro wrote:
>
>> CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}}
>> INDEX ON
>> table_name (column_name1, column_name2 ...);
>>
>>
>> I would use the first (simple) syntax and just throw an error if
>> the
>> user tries to skip a column on the UNIQUE clause.
>>
>> Seems, second option looks as more natural extension of CREATE
>> UNIQUE INDEX
>>
>
> True, but it's awfully verbose. :( And...
>
> It surprised me that you can INCLUDE extra columns on non-UNIQUE
>> indexes, since you could just add them as regular indexed columns for
>> the same effect.  It looks like when you do that in SQL Server, the
>> extra columns are only stored on btree leaf pages and so can't be used
>> for searching or ordering.  I don't know how useful that is or if we
>> would ever want it... but I just wanted to note that difference, and
>> that the proposed UNIQUE ON FIRST n COLUMNS syntax and catalog change
>> can't express that.
>>
>
> ... we might want to support INCLUDE at some point. It enhances covering
> scans without bloating the heck out of the btree. (I'm not sure if it would
> help other index types...) So it seems like a bad idea to preclude that.
>
> I don't see that UNIQUE ON FIRST precludes also supporting INCLUDE.
> Presumably we could do either
>
> CREATE INDEX ... ON table (f1, f2, f3) UNIQUE(f1, f2) INCLUDE(f4);
> or
> CREATE UNIQUE ON FIRST 2 COLUMNS INDEX ... ON table (f1, f2, f3)
> INCLUDE(f4);
>
> Personally, I find the first form easier to read.
>

Why not the normal syntax with an optional INCLUDE?

CREATE UNIQUE INDEX ON table (f1,f2,f3)  INCLUDE (f4)


>
> Are we certain that no index type could ever support an index on (f1, f2,
> f3) UNIQUE(f1, f3)? Even if it doesn't make sense for btree, perhaps some
> other index could handle it.
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Thom Brown
On 14 September 2015 at 23:12, Oleg Bartunov  wrote:

>
>
>
> On Tue, Sep 15, 2015 at 12:44 AM, Jim Nasby 
> wrote:
>
>> On 9/14/15 1:50 PM, Thomas Munro wrote:
>>
>>> CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}}
>>> INDEX ON
>>> table_name (column_name1, column_name2 ...);
>>>
>>>
>>> I would use the first (simple) syntax and just throw an error if
>>> the
>>> user tries to skip a column on the UNIQUE clause.
>>>
>>> Seems, second option looks as more natural extension of CREATE
>>> UNIQUE INDEX
>>>
>>
>> True, but it's awfully verbose. :( And...
>>
>> It surprised me that you can INCLUDE extra columns on non-UNIQUE
>>> indexes, since you could just add them as regular indexed columns for
>>> the same effect.  It looks like when you do that in SQL Server, the
>>> extra columns are only stored on btree leaf pages and so can't be used
>>> for searching or ordering.  I don't know how useful that is or if we
>>> would ever want it... but I just wanted to note that difference, and
>>> that the proposed UNIQUE ON FIRST n COLUMNS syntax and catalog change
>>> can't express that.
>>>
>>
>> ... we might want to support INCLUDE at some point. It enhances covering
>> scans without bloating the heck out of the btree. (I'm not sure if it would
>> help other index types...) So it seems like a bad idea to preclude that.
>>
>> I don't see that UNIQUE ON FIRST precludes also supporting INCLUDE.
>> Presumably we could do either
>>
>> CREATE INDEX ... ON table (f1, f2, f3) UNIQUE(f1, f2) INCLUDE(f4);
>> or
>> CREATE UNIQUE ON FIRST 2 COLUMNS INDEX ... ON table (f1, f2, f3)
>> INCLUDE(f4);
>>
>> Personally, I find the first form easier to read.
>>
>
> Why not the normal syntax with an optional INCLUDE?
>
> CREATE UNIQUE INDEX ON table (f1,f2,f3)  INCLUDE (f4)
>

That's not the same thing.  Then you're including f3 in the unique
constraint, which you may not want for uniqueness purposes, but may want in
the index for sorting.  But having said that, if f1 and f2 are unique,
you'd only get 1 value of f3 for each combination of f1 and f2, so sorting
probably isn't useful.  You might as well only INCLUDE f3 rather than have
it in the multi-column index for sorting.  So to adjust your example:

CREATE UNIQUE INDEX ON table (f1,f2)  INCLUDE (f3, f4);

Is there a scenario anyone can think of where f3 here:

CREATE INDEX ... ON table (f1, f2, f3) UNIQUE(f1, f2) INCLUDE(f4);

would be advantageous outside of INCLUDE?

Out of curiosity, why is this only being discussed for unique indexes?
What if you want additional columns included on non-unique indexes?

Thom


Re: [HACKERS] [PATCH] add --log-output to pg_ctl on Windows

2015-09-14 Thread Michael Paquier
On Mon, Sep 14, 2015 at 7:42 PM, Egon Kocjan  wrote:
> Hello
>
> The patch implements a command line option for pg_ctl on Windows to redirect
> logging of errors (write_stderr). The possible outputs are: default
> (eventlog if running as a service, otherwise stderr), stderr or eventlog.
>
> The previous discussion in BUG #13594:
> http://www.postgresql.org/message-id/cab7npqq+cybqd8-memc0qqruwc3ecotgqpmtskfsvbm6gdb...@mail.gmail.com

Thanks for picking this up!

Documentation is missing. Adding this patch to the next commit fest of
November would be good as well:
https://commitfest.postgresql.org/7/
Feel free to assign me as a reviewer at the same time.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GiST support for UUIDs

2015-09-14 Thread Paul Jungwirth

Or something like this in pseudocode:

numeric = int8_numeric(*(uint64 *)(&uuid->data[0])) *
int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&uuid->data[8]))


This is more like what I was hoping for, rather than converting to a 
string and back. Would you mind confirming for me: int8_numeric turns an 
int64 into an arbitrary-precision numeric Datum? So there is no overflow 
risk here?


But it looks like int8_numeric takes a *signed* integer. Isn't that a 
problem? I suppose I could get it working though by jumping through some 
hoops.



In the UUID case you will take into account only half of the value. Of
course, GiST will work even with a penalty function returning a constant, but
each scan could become a full-index scan.


Yes, but that seems like an unrealistic concern. Even "only" 2^64 is 
18446744073709551616 records. Or another way of thinking about it, if 64 
bits (or 32) is a good enough penalty input for all the other data 
types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly 
distributed. Perhaps we could use the bottom half (instead of the top) 
to ensure even distribution for type 1 and 2 too.


It seems to me that using only the top half should be okay, but if you 
believe it's not I'll go with the int8_numeric approach (in three chunks 
instead of two to work around signed-vs-unsigned).
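
For what it's worth, here is roughly what I have in mind for the chunked
conversion, splitting the 16 bytes into four 32-bit pieces so the signed
int64 conversions can never overflow. This is only a sketch (uuid_to_numeric
and the chunking are my own invention, not an existing API, and it assumes we
can get at pg_uuid_t's raw data bytes), using int8_numeric, numeric_mul and
numeric_add from numeric.c:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"		/* int8_numeric, numeric_mul, numeric_add */
#include "utils/uuid.h"

static Datum
uuid_to_numeric(const pg_uuid_t *uuid)
{
	/* 2^32 fits comfortably in a signed int64 */
	Datum		shift = DirectFunctionCall1(int8_numeric,
											Int64GetDatum((int64) 1 << 32));
	Datum		result = DirectFunctionCall1(int8_numeric, Int64GetDatum(0));
	int			i;

	for (i = 0; i < 4; i++)
	{
		/* next 32 bits, taken in UUID (big-endian) byte order */
		uint32		chunk = ((uint32) uuid->data[i * 4] << 24) |
							((uint32) uuid->data[i * 4 + 1] << 16) |
							((uint32) uuid->data[i * 4 + 2] << 8) |
							((uint32) uuid->data[i * 4 + 3]);

		/* result = result * 2^32 + chunk */
		result = DirectFunctionCall2(numeric_mul, result, shift);
		result = DirectFunctionCall2(numeric_add, result,
									 DirectFunctionCall1(int8_numeric,
														 Int64GetDatum((int64) chunk)));
	}

	return result;
}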


Thanks,
Paul



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cache type info in json_agg and friends

2015-09-14 Thread Andrew Dunstan



On 09/14/2015 03:42 PM, Teodor Sigaev wrote:

Currently, json_agg, jsonb_agg, json_object_agg and jsonb_object_agg do
type classification on their arguments on each call to the transition
function. This is quite unnecessary, as the argument types won't change.
This patch remedies the defect by caching the necessary values in the
aggregate state object.

+1



While this doesn't change the performance much, since these functions
are essentially dominated by other bits of the processing, I think it is
nevertheless worth doing.

Agree

After a quick look at your patch, why don't you use FmgrInfo 
instead of JsonAggState.val_output_func/JsonAggState.key_category? 
The FmgrInfo could be filled by fmgr_info_cxt() in the aggcontext memory 
context. I suppose direct usage of FmgrInfo with FunctionCall is a bit 
faster than OidFunctionCall.



Well, we need the category to help datum_to_json(b) do its work. 
Nevertheless, it might be doable to pass an FmgrInfo* to datum_to_json. 
I'll see what I can do.
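
Something along these lines is what I'm thinking of trying -- a sketch only,
with made-up field and function names, not the final shape of the patch
(getTypeOutputInfo is from lsyscache.h; the datum_to_json(b) internals would
need to accept the FmgrInfo pointer for this to work all the way down):

/* in the aggregate state, alongside the cached category */
typedef struct JsonAggState
{
	StringInfo	str;
	JsonTypeCategory val_category;
	FmgrInfo	val_output_finfo;	/* output function, looked up once */
} JsonAggState;

/* on the first call of the transition function, in aggcontext */
static void
json_agg_cache_output_func(JsonAggState *state, Oid val_type,
						   MemoryContext aggcontext)
{
	Oid			typoutput;
	bool		typisvarlena;

	getTypeOutputInfo(val_type, &typoutput, &typisvarlena);
	fmgr_info_cxt(typoutput, &state->val_output_finfo, aggcontext);
}

/* then per row, instead of OidOutputFunctionCall(typoutput, val) */
static char *
json_agg_output_value(JsonAggState *state, Datum val)
{
	return OutputFunctionCall(&state->val_output_finfo, val);
}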


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Can extension build own SGML document?

2015-09-14 Thread Peter Eisentraut
On 9/14/15 5:35 AM, Kouhei Kaigai wrote:
> Hello,
> 
> The pgxs makefile (pgxs.mk) says:
> 
> #   DOCS -- random files to install under $PREFIX/doc/$MODULEDIR
> 
> It is a bunch of files to be copied to the document directory; however,
> it does not look like a variable for specifying SGML sources the way
> PostgreSQL core does.
> 
> Do we have a way to build the SGML sources of an extension using the makefile?

No.

We could, but a project that's large enough to want this kind of thing
will likely have its own ideas and requirements, so it's unclear how
difficult it would be to support that.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-14 Thread David Rowley
On 15 September 2015 at 05:52, Alvaro Herrera 
wrote:

> Jim Nasby wrote:
> > On 9/13/15 2:43 AM, David Rowley wrote:
> > >Are you worried about this because I've not focused on optimising float
> > >timestamps as much as int64 timestamps?  Are there many people compiling
> > >with float timestamps in the real world?
> >
> > My $0.02: the default was changed some 5 years ago so FP time is probably
> > pretty rare now.
>
> The default was FP for 8.3 and was changed before 8.4.  Anybody who was
> running with the default back then and who has pg_upgraded all the way
> to current releases is still using floating-point date/times.
>
> > I don't think it's worth a bunch of extra work to speed them up.
>
> Not sure about that.
>
>
It's not like nothing is improved in float timestamps with this patch, it's
only appending the non-zero fractional seconds that I've left alone. Every
other portion of the timestamp has been improved.

I made a quick test as a demo. This is compiled with float timestamps.

create table ts (ts timestamp not null);
insert into ts select generate_series('2010-01-01 00:00:00', '2011-01-01
00:00:00', '1 sec'::interval);
vacuum freeze ts;
create table ts2 (ts timestamp not null);
insert into ts2 select generate_series('2010-01-01 00:00:00', '2010-01-01
00:00:31.536001', '1 microsecond'::interval);
vacuum freeze ts2;

Master:
copy ts to '/dev/null'; -- Test 1
Time: 18252.128 ms
Time: 18230.650 ms
Time: 18247.622 ms

copy ts2 to '/dev/null'; -- Test 2
Time: 30928.999 ms
Time: 30973.576 ms
Time: 30935.489 ms

Patched

copy ts to '/dev/null'; -- Test 1 (247%)
Time: 7378.243 ms
Time: 7383.486 ms
Time: 7385.675 ms

copy ts2 to '/dev/null'; -- Test 2 (142%)
Time: 21761.047 ms
Time: 21757.026 ms
Time: 21759.621 ms

The patched difference between ts and ts2 can be accounted for by the fact
that I didn't find a better way to do:

  if (fillzeros)
  sprintf(cp, "%0*.*f", precision + 3, precision, fabs(sec + fsec));
  else
  sprintf(cp, "%.*f", precision, fabs(sec + fsec));

A fast path exists when the fractional seconds is zero, which is why
there's such a variation between the 2 tests.
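
For reference, the shape of the fast path is roughly this (a simplified
sketch of the seconds-appending code, not the exact patch):

	if (fsec == 0)
	{
		/* common case: whole seconds only, avoid sprintf entirely */
		*cp++ = '0' + sec / 10;
		*cp++ = '0' + sec % 10;
		*cp = '\0';
	}
	else
	{
		/* fractional seconds still go through sprintf in the float build */
		sprintf(cp, "%0*.*f", precision + 3, precision, fabs(sec + fsec));
	}

so ts, with no fractional seconds, takes the cheap branch on every row, while
ts2 pays for sprintf.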

I did, however, spend some time a few weekends ago writing a function to
improve this, which is more aimed at improving the poor performance of
float4out and float8out. It's quite likely that if I can get it finished one
day, it can help improve this case too.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Gavin Flower

On 15/09/15 09:44, Jim Nasby wrote:

On 9/14/15 1:50 PM, Thomas Munro wrote:

CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}}
INDEX ON
table_name (column_name1, column_name2 ...);


I would use the first (simple) syntax and just throw an error if the
user tries to skip a column on the UNIQUE clause.

Seems, second option looks as more natural extension of CREATE
UNIQUE INDEX


True, but it's awfully verbose. :( And...


It surprised me that you can INCLUDE extra columns on non-UNIQUE
indexes, since you could just add them as regular indexed columns for
the same effect.  It looks like when you do that in SQL Server, the
extra columns are only stored on btree leaf pages and so can't be used
for searching or ordering.  I don't know how useful that is or if we
would ever want it... but I just wanted to note that difference, and
that the proposed UNIQUE ON FIRST n COLUMNS syntax and catalog change
can't express that.


... we might want to support INCLUDE at some point. It enhances 
covering scans without bloating the heck out of the btree. (I'm not 
sure if it would help other index types...) So it seems like a bad 
idea to preclude that.


I don't see that UNIQUE ON FIRST precludes also supporting INCLUDE. 
Presumably we could do either


CREATE INDEX ... ON table (f1, f2, f3) UNIQUE(f1, f2) INCLUDE(f4);

Of the formats I've seen so far, I prefer this one.

I think using "[ALSO] INCLUDE(f4)" might potentially be more readable 
than just "INCLUDE(f4)".  Even if not used, the noise word ALSO would 
help people understand that the other fields mentioned are already 
covered.


If it is not too difficult, then allowing the unique fields to be separated 
by other fields could be useful - in the example, allowing "UNIQUE(f1, 
f3)" - especially if the index is likely to be used to CLUSTER a table, 
where the order f1, f2, ... is important.




or
CREATE UNIQUE ON FIRST 2 COLUMNS INDEX ... ON table (f1, f2, f3) 
INCLUDE(f4);


Personally, I find the first form easier to read.

Are we certain that no index type could ever support an index on (f1, 
f2, f3) UNIQUE(f1, f3)? Even if it doesn't make sense for btree, 
perhaps some other index could handle it.




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-14 Thread Noah Misch
On Mon, Sep 14, 2015 at 07:30:33AM -0400, Robert Haas wrote:
> On Mon, Sep 14, 2015 at 3:29 AM, Noah Misch  wrote:
> > Pondering it afresh this week, I see now that row_security=force itself is 
> > the
> > problem.  It's a new instance of the maligned "behavior-changing GUC".
> > Function authors will not consistently test the row_security=force case, and
> > others can run the functions under row_security=force and get novel results.
> > This poses a reliability and security threat.  While row_security=force is
> > handy for testing, visiting a second role for testing purposes is a fine
> > replacement.  Let's remove "force", making row_security a config_bool.  If
> > someone later desires to revive the capability, a DDL-specified property of
> > each policy would be free from these problems.

[A variation on that idea is "ALTER TABLE foo FORCE ROW LEVEL SECURITY",
affecting all policies of one table rather than individual policies.]

> ...but I'm not sure I like this, either.  Without row_security=force,
> it's hard for a table owner to test their policy, unless they have the
> ability to assume some other user ID, which some won't.  If someone
> puts in place a policy which they believe secures their data properly
> but which actually does not, and they are unable to test it properly
> for lack of this setting, that too will be a security hole.  We will
> be able to attribute it to user error rather than product defect, but
> that will be cold comfort to the person whose sensitive data has been
> leaked.

The testing capability is nice, all else being equal.  Your scenario entails a
data custodian wishing to test with row_security=force but willing to entrust
sensitive data to an untested policy.  It also requires a DBA unwilling to
furnish test accounts to custodians of sensitive data.  With or without
row_security=force, such a team is on the outer perimeter of the audience able
to benefit from RLS.  Nonetheless, I'd welcome a replacement test aid.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-14 Thread Tom Lane
Noah Misch  writes:
> On Mon, Sep 14, 2015 at 07:30:33AM -0400, Robert Haas wrote:
>> ...but I'm not sure I like this, either.  Without row_security=force,
>> it's hard for a table owner to test their policy, unless they have the
>> ability to assume some other user ID, which some won't.  If someone
>> puts in place a policy which they believe secures their data properly
>> but which actually does not, and they are unable to test it properly
>> for lack of this setting, that too will be a security hole.  We will
>> be able to attribute it to user error rather than product defect, but
>> that will be cold comfort to the person whose sensitive data has been
>> leaked.

> The testing capability is nice, all else being equal.  Your scenario entails a
> data custodian wishing to test with row_security=force but willing to entrust
> sensitive data to an untested policy.  It also requires a DBA unwilling to
> furnish test accounts to custodians of sensitive data.  With or without
> row_security=force, such a team is on the outer perimeter of the audience able
> to benefit from RLS.  Nonetheless, I'd welcome a replacement test aid.

I have to say I'm with Noah on this.  I do not think we should create
potential security holes to help users with uncooperative DBAs.  Those
users have got problems far worse than whether they can test their RLS
policies; or in other words, what in God's name are you doing storing
sensitive data in a system with a hostile DBA?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Attach comments to functions' parameters and return value

2015-09-14 Thread Charles Clavadetscher

Hello Àlvaro

On 14/09/2015 20:02, Alvaro Herrera wrote:

Jim Nasby wrote:

On 9/14/15 8:59 AM, Charles Clavadetscher wrote:



To long time PostgreSQL developers this may look straightforward. For the
moment I am not even sure if that is correct and if there are other places
that would need additions, apart from the obvious display in psql.


I suspect that changes to support this should be pretty localized. I suggest
you look at other recent patches that have added COMMENT functionality to
see what they did.


This sequence of commits may be helpful.
http://git.postgresql.org/gitweb/?p=postgresql.git;a=history;f=src/test/regress/sql/object_address.sql;hb=HEAD


Thank you. From the first impression I think that I was not too far away 
from a correct procedure ;-) I will take a closer look as soon as I can, 
but I am confident that this will help.


Bye
Charles


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Attach comments to functions' parameters and return value

2015-09-14 Thread Charles Clavadetscher

Hello Jim

On 14/09/2015 19:23, Jim Nasby wrote:

On 9/14/15 8:59 AM, Charles Clavadetscher wrote:

Hello

In PostgreSQL it is possible to attach comments to almost everything.
This
made it possible for us to integrate the wiki that we use for our
technical
documentation directly with the database using the MediaWiki [1]
extensions
ExternalData [2] and MagicNoCache [3]. The result is a documentation on
tables and related objects (indexes, triggers, etc.) and views that
always
shows the current state, i.e. any DDL change or any comment attached
to an
object is shown in the wiki immediately (or on refresh if the change was
done after the reader landed on the page).


Neat! I hope you'll open source that. :)


Thank you for your answer. Sure I will put some information on it and 
the source on my website soon and let you know. There is not really much 
magic to it, but it can be quite helpful.



My next workaround is to simulate the behaviour of a COMMENT ON FUNCTION
PARAMETER/RETURNVALUE command inserting comments on these directly in
pg_description. For that I used a convention similar to the one used for
table attributes and defined following pg_description.objsubid:

   -1 = return value
0 = comment on the function itself (this already exists)
1..n = comment on parameter at position n


At first glance, seems reasonable.


   - Add function to get the position of the parameter, e.g.
LookupFuncNamePos(function_name, param_name) or a function to create the
whole ObjectAddress e.g. .


Something similar might exist already. TBH, to start with, I would only
support position number. You'll have to support that case anyway, and it
should be simpler.


I agree. Since parameter names are optional, supporting the position for 
comments definition is mandatory and is probably simpler.



To long time PostgreSQL developers this may look straightforward. For the
moment I am not even sure if that is correct and if there are other
places
that would need additions, apart from the obvious display in psql.


I suspect that changes to support this should be pretty localized. I
suggest you look at other recent patches that have added COMMENT
functionality to see what they did.


Good idea. As a matter of fact, Àlvaro sent a sequence of commits that 
relate in part to it. I will have a look to it as soon as I can.




BTW, I'm also interested in this but I'm not sure when I'd have time to
work on it.


Good to know, thank you. As I mention in my original post I should be 
able to work on that in November this year. If I find some time I may 
start things earlier. Let's keep in touch over this list. Before any 
coding is done there should be an agreement on the syntax to be used in 
the statement and the way the information is stored.


About the latter, I think with the -1,0,1..n solution we are on a good 
track. Concerning the syntax, I suggest the following.


COMMENT ON FUNCTION function_name ( [ [ argmode ] [ argname ] argtype [, 
...] ] ) PARAMETER param_position IS 'text';


COMMENT ON FUNCTION function_name ( [ [ argmode ] [ argname ] argtype [, 
...] ] ) RETURN VALUE IS 'text';


An alternative to "RETURN VALUE" could be "RETURNS", which would make 
the statement shorter, but I think this may be confusing.


The parameter list of the function is only required to identify the 
function, also in cases where it exists with the same name in different 
flavours. This sticks to the general syntax of the command and should be 
easy to understand.


Your ideas?

Thanks
Charles


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-09-14 Thread Haribabu Kommi
On Thu, Sep 10, 2015 at 2:12 PM, Amit Kapila  wrote:
> On Thu, Sep 10, 2015 at 4:16 AM, Robert Haas  wrote:
>>
>> On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila 
>> wrote:
>> > On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi
>> > 
>> > wrote:
>> >> With subquery, parallel scan is having some problem, please refer
>> >> below.
>> >>
>> >> postgres=# explain analyze select * from test01 where kinkocord not in
>> >> (select kinkocord from test02 where tenpocord = '001');
>> >> ERROR:  badly formatted node string "SUBPLAN :subLinkType 2
>> >> :testexpr"...
>> >> CONTEXT:  parallel worker, pid 32879
>> >> postgres=#
>> >
>> > The problem here is that readfuncs.c doesn't have support for reading
>> > SubPlan nodes. I have added support for some of the nodes, but it seems
>> > SubPlan node also needs to be added.  Now I think this is okay if the
>> > SubPlan
>> > is any node other than Funnel, but if Subplan contains Funnel, then each
>> > worker needs to spawn other workers to execute the Subplan which I am
>> > not sure is the best way.  Another possibility could be store the
>> > results of
>> > Subplan in some tuplestore or some other way and then pass those to
>> > workers
>> > which again doesn't sound to be promising way considering we might have
>> > hashed SubPlan for which we need to build a hashtable.  Yet another way
>> > could be for such cases execute the Filter in master node only.
>>
>> IIUC, there are two separate issues here:
>>
>
> Yes.
>
>> 1. We need to have readfuncs support for all the right plan nodes.
>> Maybe we should just bite the bullet and add readfuncs support for all
>> plan nodes.  But if not, we can add support for whatever we need.
>>
>> 2. I think it's probably a good idea - at least for now, and maybe
>> forever - to avoid nesting parallel plans inside of other parallel
>> plans.  It's hard to imagine that being a win in a case like this, and
>> it certainly adds a lot more cases to think about.
>>
>
> I also think that avoiding nested parallel plans is a good step forward.
>

I reviewed the parallel_seqscan_funnel_v17.patch and following are my comments.
I will continue my review with the parallel_seqscan_partialseqscan_v17.patch.

+ if (inst_options)
+ {
+ instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+ *inst_options = *instoptions;
+ if (inst_options)

Same pointer variable check, it should be if (*inst_options) as per the
estimate and store functions.


+ if (funnelstate->ss.ps.ps_ProjInfo)
+ slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+ else
+ slot = funnelstate->ss.ss_ScanTupleSlot;

Currently, there will not be a projinfo for the funnel node, so it always uses
the scan tuple slot. In case it is different, we need to add an ExecProject
call in the ExecFunnel function. Currently it is not present; either we can
document it or add the function call.


+ if (!((*dest->receiveSlot) (slot, dest)))
+ break;

and

+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+ if (funnel)
+ {
+ int i;
+ shm_mq_handle *mqh;
+ shm_mq   *mq;
+ for (i = 0; i < funnel->nqueues; i++)
+ {
+ mqh = funnel->queue[i];
+ mq = shm_mq_get_queue(mqh);
+ shm_mq_detach(mq);
+ }
+ }
+}


Using this function, the backend detaches from the message queue, so
that the workers which are trying to put results into the queues get an
error of SHM_MQ_DETACHED. The worker then finishes the execution of the
plan. For this reason all the printtup return types are changed from
void to bool.

But this way the worker doesn't exit until it tries to put a tuple in
the queue. If there are no valid tuples that satisfy the condition, then
it may take time for the workers to exit. Am I correct? I am not sure
how frequently such scenarios can occur.


+ if (parallel_seqscan_degree >= MaxConnections)
+ {
+ write_stderr("%s: parallel_scan_degree must be less than
max_connections\n", progname);
+ ExitPostmaster(1);
+ }

The error condition works only during server start. A user can still set
parallel_seqscan_degree to more than max_connections at the superuser
session level, etc.
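
Maybe this is better enforced with a GUC check hook so that it also applies
to SET; something like the sketch below (just an illustration -- the hook
name is made up, and cross-variable checks in hooks have their own ordering
caveats):

static bool
check_parallel_seqscan_degree(int *newval, void **extra, GucSource source)
{
	if (*newval >= MaxConnections)
	{
		GUC_check_errdetail("parallel_seqscan_degree must be less than max_connections.");
		return false;
	}

	return true;
}

with the hook registered in the parallel_seqscan_degree definition in guc.c.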


+ if (!parallelstmt->inst_options)
+ (*receiver->rDestroy) (receiver);

Why does the receiver need to be destroyed only when there is no
instrumentation?


Regards,
Hari Babu
Fujitsu Australia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-09-14 Thread Thomas Munro
On Fri, Sep 11, 2015 at 6:41 AM, Beena Emerson 
wrote:

> Hello,
>
> Please find attached the WIP patch for the proposed feature. It is built
> based on the already discussed design.
>
> Changes made:
> - add new parameter "sync_file" to provide the location of the pg_syncinfo
> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
> pg_ident file.
> - pg_syncinfo file will hold the sync rep information in the approved JSON
> format.
> - synchronous_standby_names can be set to 'pg_syncinfo.conf'  to read the
> JSON value stored in the file.
> - All the standbys mentioned in the s_s_names or the pg_syncinfo file
> currently get the priority as 1 and all others as  0 (async)
> - Various functions in syncrep.c to read the json file and store the
> values in a struct to be used in checking the quorum status of syncrep
> standbys (SyncRepGetQuorumRecPtr function).
>
> It does not support the current behavior for synchronous_standby_names =
> '*'. I am yet to thoroughly test the patch.
>
> Thoughts?
>

This is a great feature, thanks for working on it!

Here is some initial feedback after a quick eyeballing of the patch and a
couple of test runs.  I will have more soon after I figure out how to
really test it and try out the configuration system...

It crashes when async standbys connect, as already reported by Sameer
Thakur.  It doesn't crash with this change:

@@ -700,6 +700,9 @@ SyncRepGetStandbyPriority(void)
if (am_cascading_walsender)
return 0;

+   if (SyncRepStandbyInfo == NULL)
+   return 0;
+
if (CheckNameList(SyncRepStandbyInfo, application_name, false))
return 1;

I got the following error from clang-602.0.53 on my Mac:

walsender.c:1955:11: error: passing 'char volatile[8192]' to parameter of
type 'void *' discards qualifiers
[-Werror,-Wincompatible-pointer-types-discards-qualifiers]
memcpy(walsnd->name, application_name,
strlen(application_name));
   ^~~~

I think your memcpy and explicit null termination could be replaced with
strcpy, or maybe something to limit buffer overrun damage in case of sizing
bugs elsewhere.  But to get rid of that warning you'd still need to cast
away volatile...  I note that you do that in SyncRepGetQuorumRecPtr when
you read the string with strcmp.  But is that actually safe, with respect
to load/store reordering around spinlock operations?  Do we actually need
volatile-preserving cstring copy and compare functions for this type of
thing?
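
Something like this is what I had in mind -- a sketch only, assuming the new
name field is NAMEDATALEN bytes and is protected by the existing
per-walsender spinlock:

	SpinLockAcquire(&walsnd->mutex);
	strlcpy((char *) walsnd->name, application_name, NAMEDATALEN);
	SpinLockRelease(&walsnd->mutex);

strlcpy guarantees termination and never writes past the buffer, and the
explicit cast documents that we're deliberately dropping the volatile
qualifier for the copy (which, if spinlock acquire/release act as compiler
barriers in recent branches, should be OK).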

In walsender_private.h:

+#define MAX_APPLICATION_NAME_LEN 8192

What is the basis for this size?  application_name is a GUC with
GUC_IS_NAME set.  As far as I can see, it's limited to NAMEDATALEN
(including null terminator), so why not use the exact same buffer size?

In load_syncinfo:

+   len = strlen(standby_name);
+   temp->name = malloc(len);
+   memcpy(temp->name, standby_name, len);
+   temp->name[len] = '\0';

This buffer is one byte too short, and doesn't handle malloc failure.  And
generally, this code is equivalent to strdup, and could instead be pstrdup
(which raises an error on allocation failure for free).  But I'm not sure
which memory context is appropriate and when this should be freed.

Same problem in sync_info_scalar:

+   state->cur_node->name = (char *)
malloc(len);
+   memcpy(state->cur_node->name, token,
strlen(token));
+   state->cur_node->name[len] = '\0';

In SyncRepGetQuorumRecPtr, some extra curly braces:

+   if (node->next)
+   {
+   SyncRepGetQuorumRecPtr(node->next, lsnlist,
node->priority_group);
+   }

... and:

+   if (*lsnlist == NIL)
+   {
+   *lsnlist = lappend(*lsnlist, lsn);
+   }

In sync_info_object_field_start:

+   ereport(ERROR,
+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("Unrecognised key
\"%s\" in file \"%s\"",
+   fname,
SYNC_FILENAME)));

I think this should use US spelling (-ized) as you have it elsewhere.  Also
the primary error message should not be capitalised according to the "Error
Message Style Guide".

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-14 Thread Andres Freund
On 2015-09-14 17:41:42 +0200, Andres Freund wrote:
> I pointed out how you can actually make this safely lock-free giving you
> the interesting code.

And here's an actual implementation of that approach. It's definitely
work-in-progress and could easily be optimized further. Don't have any
big machines to play around with right now tho.

Andres
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3e70792 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -95,12 +95,9 @@ InitBufferPool(void)
 			BufferDesc *buf = GetBufferDescriptor(i);
 
 			CLEAR_BUFFERTAG(buf->tag);
-			buf->flags = 0;
-			buf->usage_count = 0;
-			buf->refcount = 0;
-			buf->wait_backend_pid = 0;
 
-			SpinLockInit(&buf->buf_hdr_lock);
+			pg_atomic_init_u32(&buf->state, 0);
+			buf->wait_backend_pid = 0;
 
 			buf->buf_id = i;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8c0358e..345322a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -51,6 +51,8 @@
 #include "utils/resowner_private.h"
 #include "utils/timestamp.h"
 
+#define likely(x)   __builtin_expect((x),1)
+#define unlikely(x) __builtin_expect((x),0)
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -774,9 +776,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		if (isLocalBuf)
 		{
+			uint32		state;
+
+			state = pg_atomic_read_u32(&bufHdr->state);
 			/* Only need to adjust flags */
-			Assert(bufHdr->flags & BM_VALID);
-			bufHdr->flags &= ~BM_VALID;
+			Assert(state & BM_VALID);
+			state &= ~BM_VALID;
+			pg_atomic_write_u32(&bufHdr->state, state);
 		}
 		else
 		{
@@ -788,8 +794,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			do
 			{
 LockBufHdr(bufHdr);
-Assert(bufHdr->flags & BM_VALID);
-bufHdr->flags &= ~BM_VALID;
+Assert(pg_atomic_read_u32(&bufHdr->state) & BM_VALID);
+pg_atomic_fetch_and_u32(&bufHdr->state, ~BM_VALID);
 UnlockBufHdr(bufHdr);
 			} while (!StartBufferIO(bufHdr, true));
 		}
@@ -807,7 +813,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 * it's not been recycled) but come right back here to try smgrextend
 	 * again.
 	 */
-	Assert(!(bufHdr->flags & BM_VALID));		/* spinlock not needed */
+	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
 
 	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
 
@@ -885,7 +891,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	if (isLocalBuf)
 	{
 		/* Only need to adjust flags */
-		bufHdr->flags |= BM_VALID;
+		pg_atomic_fetch_or_u32(&bufHdr->state, BM_VALID);
 	}
 	else
 	{
@@ -939,7 +945,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferTag	oldTag;			/* previous identity of selected buffer */
 	uint32		oldHash;		/* hash value for oldTag */
 	LWLock	   *oldPartitionLock;		/* buffer partition lock for it */
-	BufFlags	oldFlags;
+	uint32		oldFlags;
 	int			buf_id;
 	volatile BufferDesc *buf;
 	bool		valid;
@@ -1013,10 +1019,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		buf = StrategyGetBuffer(strategy);
 
-		Assert(buf->refcount == 0);
+		Assert((pg_atomic_read_u32(&buf->state) & BUF_REFCOUNT_MASK) == 0);
 
 		/* Must copy buffer flags while we still hold the spinlock */
-		oldFlags = buf->flags;
+		oldFlags = pg_atomic_read_u32(&buf->state) & BUF_FLAG_MASK;
 
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
@@ -1210,8 +1216,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * recycle this buffer; we must undo everything we've done and start
 		 * over with a new victim buffer.
 		 */
-		oldFlags = buf->flags;
-		if (buf->refcount == 1 && !(oldFlags & BM_DIRTY))
+		oldFlags = pg_atomic_read_u32(&buf->state) & BUF_FLAG_MASK;
+		if ((pg_atomic_read_u32(&buf->state) & BUF_REFCOUNT_MASK) == 1 &&
+			!(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf);
@@ -1232,12 +1239,19 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 * 1 so that the buffer can survive one clock-sweep pass.)
 	 */
 	buf->tag = newTag;
-	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
+	pg_atomic_fetch_and_u32(&buf->state,
+			~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
+			  BM_CHECKPOINT_NEEDED | BM_IO_ERROR |
+			  BM_PERMANENT |
+			  BUF_USAGECOUNT_MASK));
 	if (relpersistence == RELPERSISTENCE_PERMANENT)
-		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
+		pg_atomic_fetch_or_u32(&buf->state,
+			   BM_TAG_VALID | BM_PERMANENT |
+			   BUF_USAGECOUNT_ONE);
 	else
-		buf->flags |= BM_TAG_VALID;
-	buf->usage_count = 1;
+		

Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-14 Thread Alvaro Herrera
David Rowley wrote:

> It's not like nothing is improved in float timestamps with this patch, it's
> only appending the non-zero fractional seconds that I've left alone. Every
> other portion of the timestamp has been improved.

OK, sounds good enough to me.

> I did, however spend some time a few weekends ago writing a function to
> improve this, which is more aimed at improving poor performance of
> float4out and float8out. It's quite likely if I can get finished one day,
> it can help improve this case too.

Even better, prolly not a blocker for this patch.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mention column name in error messages

2015-09-14 Thread Franck Verrot
On Wed, Aug 19, 2015 at 11:31 PM, Jeff Janes  wrote:
>
> I took this for a test drive, and had some comments.on the user visible
> parts.
> [...]
> But I think these belong as CONTEXT or as DETAIL, not as HINT.  The
> messages are giving me details about where (which column)  the error
> occurred, they aren't giving suggestions to me about what to do about it.
>

Hi Jeff,

Somehow my email never went through. So thank you for the test drive, I've
tried to update my patch (that you will find attached) to reflect what you
said (DETAIL instead of HINT).

Thanks Tom for pointing me at the docs, I clearly wasn't respecting the
guidelines.

Cheers,
Franck


0001-Report-column-for-which-type-coercion-fails.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mention column name in error messages

2015-09-14 Thread Alvaro Herrera
Franck Verrot wrote:
> On Wed, Aug 19, 2015 at 11:31 PM, Jeff Janes  wrote:
> >
> > I took this for a test drive, and had some comments.on the user visible
> > parts.
> > [...]
> > But I think these belong as CONTEXT or as DETAIL, not as HINT.  The
> > messages are giving me details about where (which column)  the error
> > occurred, they aren't giving suggestions to me about what to do about it.
> >
> 
> Hi Jeff,
> 
> Somehow my email never went through. So thank you for the test drive, I've
> tried to update my patch (that you will find attached) to reflect what you
> said (DETAIL instead of HINT).

I think you need errcontext(), not errdetail().  The most important
difference is that you can have multiple CONTEXT lines but only one
DETAIL; I think if you attach a second errdetail(), the first one is
overwritten.
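
The usual pattern is an error context callback around the code that can
throw; a minimal sketch (struct, callback and names are invented for the
example) would be:

typedef struct CoercionErrorCallbackArg
{
	const char *relname;
	const char *colname;
} CoercionErrorCallbackArg;

static void
coercion_error_callback(void *arg)
{
	CoercionErrorCallbackArg *carg = (CoercionErrorCallbackArg *) arg;

	errcontext("coercing value for column \"%s\" of table \"%s\"",
			   carg->colname, carg->relname);
}

	/* around the coercion that may ereport() */
	ErrorContextCallback errcallback;
	CoercionErrorCallbackArg carg = {"mytable", "mycolumn"};

	errcallback.callback = coercion_error_callback;
	errcallback.arg = &carg;
	errcallback.previous = error_context_stack;
	error_context_stack = &errcallback;

	/* ... do the coercion ... */

	error_context_stack = errcallback.previous;

That way the column shows up as a CONTEXT line no matter how deep the error
is raised.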

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi-tenancy with RLS

2015-09-14 Thread Haribabu Kommi
On Fri, Sep 11, 2015 at 7:50 AM, Joe Conway  wrote:
> On 09/01/2015 11:25 PM, Haribabu Kommi wrote:
>> If any user is granted any permissions on that object then that user
>> can view the metadata of that object from the catalog tables.
>> To check the permissions of the user on the object, instead of
>> checking each and every available option, I just added a new
>> privilege check option called "any". If user have any permissions on
>> the object, the corresponding permission check function returns
>> true. Patch attached for the same.
>>
>> Any thoughts/comments?
>
> Thanks for working on this! Overall I like the concept and know of use
> cases where it is critical and should be supported. Some comments:

Thanks for the review, I will take care of the comments in the next patch.

I didn't find any better approach other than creating policies automatically
or granting permission to the superuser on system catalog tables. If everyone
feels this is the best approach, then I will create policies for all catalog
tables in the next version.

Regards,
Hari Babu
Fujitsu Australia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] row_security GUC, BYPASSRLS

2015-09-14 Thread Noah Misch
On Thu, Sep 10, 2015 at 03:23:13PM -0400, Stephen Frost wrote:
> First off, thanks again for your review and comments on RLS.  Hopefully
> this addresses the last remaining open item from that review.

Item (3a), too, remains open.

> * Noah Misch (n...@leadboat.com) wrote:
> > On Wed, Feb 25, 2015 at 11:37:24PM -0500, Stephen Frost wrote:
> > > + /*
> > > +  * Row-level security should be disabled in the case where a foreign-key
> > > +  * relation is queried to check existence of tuples that references the
> > > +  * primary-key being modified.
> > > +  */
> > > + temp_sec_context = save_sec_context | SECURITY_LOCAL_USERID_CHANGE;
> > > + if (qkey->constr_queryno == RI_PLAN_CHECK_LOOKUPPK
> > > + || qkey->constr_queryno == RI_PLAN_CHECK_LOOKUPPK_FROM_PK
> > > + || qkey->constr_queryno == RI_PLAN_RESTRICT_DEL_CHECKREF
> > > + || qkey->constr_queryno == RI_PLAN_RESTRICT_UPD_CHECKREF)
> > > + temp_sec_context |= SECURITY_ROW_LEVEL_DISABLED;

> > Is SECURITY_ROW_LEVEL_DISABLED mode even needed?  We switch users to the 
> > owner
> > of the FROM-clause table before running an RI query.  That means use of this
> > mode can only matter under row_security=force, right?  I would rest easier 
> > if
> > this mode went away, because it is a security landmine.  If user code 
> > managed
> > to run in this mode, it would bypass every policy in the database.  (I find 
> > no
> > such vulnerability today, because we use the mode only for parse analysis of
> > ri_triggers.c queries.)
> 
> You're right, the reason for it to exist was to address the
> 'row_security = force' case.  I spent a bit of time considering that and
> have come up with the attached for consideration (which needs additional
> work before being committed, but should get the idea across clearly).
> 
> This removes the notion of SECURITY_ROW_LEVEL_DISABLED entirely and
> instead ignores 'row_security = force' if InLocalUserIdChange() is true.

> The biggest drawback that I can see to this is that users won't be able
> to set the row_security GUC inside of a security definer function.  I
> can imagine use-cases where someone might want to change it in a
> security definer function but it doesn't seem like they're valuable
> enough to counter your argument (which I agree with) that
> SECURITY_ROW_LEVEL_DISABLED is a security landmine.

SECURITY DEFINER execution blocks SET ROLE, SET SESSION AUTHORIZATION, and
sometimes "GRANT role1 TO role2".  Otherwise, it works like regular execution.
Adding exceptions, particularly silent behavior changes as opposed to hard
errors, is a big deal.

When I wrote the 2015-07-03 review, I had in mind to just let policies run
normally during referential integrity checks.  Table owners, which can already
break referential integrity with triggers and rules, would be responsible for
crafting policies accordingly.

Pondering it afresh this week, I see now that row_security=force itself is the
problem.  It's a new instance of the maligned "behavior-changing GUC".
Function authors will not consistently test the row_security=force case, and
others can run the functions under row_security=force and get novel results.
This poses a reliability and security threat.  While row_security=force is
handy for testing, visiting a second role for testing purposes is a fine
replacement.  Let's remove "force", making row_security a config_bool.  If
someone later desires to revive the capability, a DDL-specified property of
each policy would be free from these problems.

On a related topic, check_enable_rls() has two kinds of RLS bypass semantics.
Under row_security=off|on, superusers bypass every policy, and table owners
bypass policies of their own tables.  Under row_security=off only, roles with
BYPASSRLS privilege additionally bypass every policy; BYPASSRLS doesn't affect
semantics under row_security=on.  Is that distinction a good thing?  Making
BYPASSRLS GUC-independent, like superuser bypass, would simplify things for
the user and would make row_security no longer a behavior-changing GUC.
row_security would come to mean simply "if a policy would otherwise attach to
one of my queries, raise an error instead."  Again, if you have cause to
toggle BYPASSRLS, use multiple roles.  Implementing that with GUCs creates
harder problems.

In summary, I propose to remove row_security=force and make BYPASSRLS
effective under any setting of the row_security GUC.

Some past, brief discussion leading to the current semantics:
http://www.postgresql.org/message-id/CA+TgmoYA=uixxmn390sfgfqgvmll-as5bjal0om7yrppvwn...@mail.gmail.com
http://www.postgresql.org/message-id/cakrt6cqnghzwugwb5pkwg5gfxwd+-joy8mmmenqh+o6vply...@mail.gmail.com
http://www.postgresql.org/message-id/20150729130927.gs3...@tamriel.snowman.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Vladimir Borodin

> On 12 Sept 2015, at 14:05, Amit Kapila wrote:
> 
> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev 
> > wrote:
> >
> > On 08/05/2015 09:33 PM, Robert Haas wrote:
> >>
> >>
> >> You're missing the point.  Those multi-byte fields have additional
> >> synchronization requirements, as I explained in some detail in my
> >> previous email. You can't just wave that away.
> >
> > I see that now. Thank you for the point.
> >
> > I've looked deeper and I found PgBackendStatus to be not a suitable
> > place for keeping information about low level waits. Really, PgBackendStatus
> > is used to track high level information about backend. This is why auxiliary
> > processes don't have PgBackendStatus, because they don't have such 
> > information
> > to expose. But when we come to the low level wait events then auxiliary
> > processes are as useful for monitoring as backends are. WAL writer,
> > checkpointer, bgwriter etc are using LWLocks as well. This is certainly 
> > unclear
> > why they can't be monitored.
> >
> 
> I think the chances of background processes getting stuck on an LWLock are
> quite low compared to backends, as they do their activities periodically.
> As an example, WALWriter will take WALWriteLock to write the WAL, but
> actually there will never be much contention for WALWriter. With
> synchronous_commit = on, the backends themselves write the WAL so WALWriter
> won't do much in that case, and with synchronous_commit = off, backends
> won't write the WAL so WALWriter won't face any contention unless some
> buffers have to be written by bgwriter or checkpointer for which WAL is not
> yet flushed, which I don't think would lead to any contention. 

WALWriter is not a good example, IMHO. And monitoring LWLocks is not the only 
thing that waits monitoring brings us. Here [0] is an example where 
understanding what was happening inside the startup process took a long time 
and required GDB. With waits monitoring I could have done a couple of SELECTs 
and used oid2name to understand the reason for the problem.

Also we should consider that PostgreSQL has a good infrastructure for 
parallelizing many auxiliary processes. Can we be sure that we will always have 
exactly one WAL writer process? Perhaps some time in the future we would need 
several of them, and there would be contention for WALWriteLock between them. 
Perhaps WAL writer is not a good example here either, but having multiple 
checkpointer or bgwriter processes in the near future seems very likely, no?

[0] 
http://www.postgresql.org/message-id/fe82a9a7-0d52-41b5-a9ed-967f6927c...@simply.name
 

> 
> I am not denying that there could be some contention in rare
> scenarios for background processes, but I think tracking them is not as
> important as tracking the LWLocks for backends.
> 
> Also, as we are planning to track the wait_event information in 
> pg_stat_activity along with other backend information, it will not make
> sense to include information about background processes in this variable,
> as pg_stat_activity just displays information about backend processes.
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com 

--
May the force be with you…
https://simply.name



Re: [HACKERS] PATCH: index-only scans with partial indexes

2015-09-14 Thread Kyotaro HORIGUCHI
Hi,

At Sun, 13 Sep 2015 23:21:30 +0200, Tomas Vondra  
wrote in <55f5e8da.8080...@2ndquadrant.com>
> That's indeed strange, but after poking into that for a while, it
> seems rather like a costing issue. Let me demonstrate:
...
> Now, both plans are index only scans, but the first one has Index Cond
> and the other one does not!

The seemingly removed IndexCond qual is counted among the non-index
quals at the end of cost_index. The quals that the partial index
implies should be ignored during cost estimation.

> I've put a bunch of logging into cost_index(), and turns out that
> while the final cost is *exactly* the same, it's most likely by
> chance. After we call amcostestimate, we get these two results:

So it is *not by chance* but a stable behavior defined by the
algorithm.

> It seems that the problem is related to this:
> 
> qpquals
> = extract_nonindex_conditions(baserel->baserestrictinfo,
>   path->indexquals);
> 
> while the "larger" index on (a,b) gets
> 
> path->indexquals=(b BETWEEN 30 AND 60)
> qpquals=NIL
> 
> the "smaller" index on (a) gets
> 
> path->indexquals=NIL
> qpquals=(b BETWEEN 30 AND 60)
> 
> And so the larger index gets qpqual_cost=0, the smaller one gets
> qpqual_cost=0.005, and so cpu_per_tuple is either 0.01 or 0.015.
> 
> Which is exactly the difference between costs from amcostestimate
> 
> idx1: 4769.115000 + 0.015 * 297823 = 9236.46
> idx2: 6258.23 + 0.010 * 297823 = 9236.46

These calculations are exactly right, but you overlooked the
breakdown of indexTotalCost for idx2.

> Spooky! Although it seems like a mere coincidence, thanks to the nice
> round numbers of tuples in the table, and lucky choice of two
> conditions.

As said above, it is not a coincidence. Exactly the same
calculation involving baserestrictinfo is simply performed in
different places, cost_index for the former and
btcostestimate(genericcostestimate) for the latter.

We should properly ignore or remove the implicitly-applied quals
for partial indexes on cost estimation.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parser emits mysterious error message for very long tokens

2015-09-14 Thread Kyotaro HORIGUCHI
Hello, thank you for the opinion.

At Fri, 11 Sep 2015 09:31:30 -0400, Tom Lane  wrote in 
<884.1441978...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > Hello, this is a problem in an extreme situation.
> > When the parser encounters very long tokens, it returns an error
> > message which can be mysterious to users.
> 
> >> $ perl -e "print \"select '\" . \"x\"x(512*1024*1024) . \"'\"" | psql 
> >> postgres
> >> ERROR:  invalid memory alloc request size 1073741824
> 
> I can't get terribly excited about that, because there is not that
> much daylight between there and where the query fails because the
> entire input string exceeds 1GB.  Moreover, anyone who tries working
> with literals in this size range will soon learn not to ;-).  So
> it seems quite an artificial example to me ...
> 
>   regards, tom lane

Yes, so 'an extreme situation'. I wanted to know what to think
about this kind of situation.

Thanks for the comment. I'm now convinced that fixing this and
similar cases in a clean way would be more trouble than it's worth.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: index-only scans with partial indexes

2015-09-14 Thread Tomas Vondra

On 09/14/2015 09:35 AM, Kyotaro HORIGUCHI wrote:

Hi,

...

Which is exactly the difference between costs from amcostestimate

idx1: 4769.115000 + 0.015 * 297823 = 9236.46
idx2: 6258.23 + 0.010 * 297823 = 9236.46


These calculations are exactly right, but you overlooked the
breakedown of indexTotalCost for idx2.


Sppoky! Although it seems like a mere coincidence, thanks to the nice
round numbers of tuples in the table, and lucky choice of two
conditions.


As said above, it is not a conincidence. The exactly same
calculation about baserestrictinfo is simply calculated in
different places, cost_index for the former and
btcostestiamte(genericcostestimate) for the latter.


By "coincidence" I meant that we happened to choose such a number of 
conditions in the index predicate & query that this perfect match is 
possible. Apparently there are two places that manipulate the costs and 
in this particular case happen to perfectly compensate for each other's effects.


As demonstrated by the example with a single condition, the costs may 
actually differ for different numbers of clauses (e.g. using a single 
clause makes the wider index - unexpectedly - cheaper).




We should properly ignore or remove the implicitly-applied quals
for partial indexes on cost estimation.


Probably. So far I've traced the difference to build_index_paths() where 
we build index_clauses by iterating over index columns - the smaller 
index does not have the column from the predicate, so we don't add the 
clause. I'm not particularly familiar with this part of the code, so I 
wonder where's the best place to fix this, though.


regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-14 Thread Alvaro Herrera
Jim Nasby wrote:
> On 9/13/15 2:43 AM, David Rowley wrote:
> >Are you worried about this because I've not focused on optimising float
> >timestamps as much as int64 timestamps?  Are there many people compiling
> >with float timestamps in the real world?
> 
> My $0.02: the default was changed some 5 years ago so FP time is probably
> pretty rare now.

The default was FP for 8.3 and was changed before 8.4.  Anybody who was
running with the default back then and who has pg_upgraded all the way
to current releases is still using floating-point date/times.

> I don't think it's worth a bunch of extra work to speed them up.

Not sure about that.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Attach comments to functions' parameters and return value

2015-09-14 Thread Alvaro Herrera
Jim Nasby wrote:
> On 9/14/15 8:59 AM, Charles Clavadetscher wrote:

> >To long time PostgreSQL developers this may look straightforward. For the
> >moment I am not even sure if that is correct and if there are other places
> >that would need additions, apart from the obvious display in psql.
> 
> I suspect that changes to support this should be pretty localized. I suggest
> you look at other recent patches that have added COMMENT functionality to
> see what they did.

This sequence of commits may be helpful.
http://git.postgresql.org/gitweb/?p=postgresql.git;a=history;f=src/test/regress/sql/object_address.sql;hb=HEAD

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Teodor Sigaev

CREATE INDEX index ON table (c1, c2, c3) UNIQUE ON (c1, c3);

CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}} INDEX ON
table_name (column_name1, column_name2 ...);


I would use the first (simple) syntax and just throw an error if the
user tries to skip a column on the UNIQUE clause.

The second option seems like a more natural extension of CREATE UNIQUE INDEX




Have you by chance looked to see what other databases have done for
syntax? I'm guessing this isn't covered by ANSI but maybe there's
already an industry consensus.


MS SQL and DB/2 suggest (with changes for PostgreSQL):
CREATE UNIQUE INDEX i ON t (a,b) INCLUDE (c)

MS SQL supports both unique and non-unique covering indexes, DB/2 only unique 
indexes. Oracle and MySQL don't support covering indexes. Read at 
http://use-the-index-luke.com/sql/clustering/index-only-scan-covering-index



The MS SQL/DB/2 variant looks good too.
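
For instance, hypothetical usage if we adopted that syntax (table and column
names are made up; none of this is implemented yet):

  -- uniqueness enforced only on (a, b); c is merely carried in the index,
  -- so the query below can use an index-only scan
  CREATE UNIQUE INDEX i ON t (a, b) INCLUDE (c);
  SELECT a, b, c FROM t WHERE a = 1 AND b = 2;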

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cache type info in json_agg and friends

2015-09-14 Thread Teodor Sigaev

Currently, json_agg, jsonb_agg, json_object_agg and jsonb_object_agg do
type classification on their arguments on each call to the transition
function. This is quite unnecessary, as the argument types won't change.
This patch remedies the defect by caching the necessary values in the
aggregate state object.

+1



While this doesn't change the performance much, since these functions
are essentially dominated by other bits of the processing, I think it is
nevertheless worth doing.

Agree

From a quick look at your patch: why not use FmgrInfo instead of 
JsonAggState.val_output_func/JsonAggState.key_category? The FmgrInfo 
could be filled by fmgr_info_cxt() in the aggcontext memory 
context. I suppose direct use of FmgrInfo with FunctionCall is a bit 
faster than OidFunctionCall.
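
Something like this, roughly (just a sketch; the val_output_finfo field name
is invented for illustration):

  /* once, when setting up the aggregate state in aggcontext */
  fmgr_info_cxt(outfuncoid, &state->val_output_finfo, aggcontext);

  /* per row: same effect as OidOutputFunctionCall(), minus the repeated lookup */
  outputstr = OutputFunctionCall(&state->val_output_finfo, val);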

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cache type info in json_agg and friends

2015-09-14 Thread Alvaro Herrera
Andrew Dunstan wrote:

> Currently, json_agg, jsonb_agg, json_object_agg and jsonb_object_agg do type
> classification on their arguments on each call to the transition function.
> This is quite unnecessary, as the argument types won't change. This patch
> remedies the defect by caching the necessary values in the aggregate state
> object.

Seems a reasonable idea to me.  This is 9.6 only, right?

What's the reason for this pattern?

!   json_categorize_type(val_type, &tcategory, &outfuncoid);
!   state->val_category = tcategory;
!   state->val_output_func = outfuncoid;

I think you could just as well call json_categorize_type() with the
final pointer values and do away with the two separate variables, as there is
no gain in clarity or ease of reading; I mean

!   json_categorize_type(tmptyp, &state->val_category, &state->val_output_func);

Also, and this is not new in this patch, this code reuses a variable
named "val_type" for both values and keys, which reads a bit odd.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GiST support for UUIDs

2015-09-14 Thread Paul Jungwirth

2)
 static double
 uuid2num(const pg_uuid_t *i)
 {
 return *((uint64 *)i);
 }
It isn't looked as correct transformation for me. May be, it's better
to transform to numeric type (UUID looks like a 16-digit hexademical
number)
and follow  gbt_numeric_penalty() logic (or even call directly).


Thanks for the review! A UUID is actually not stored as a string of 
hexadecimal digits. (It is normally displayed that way, but with 32 
digits, not 16.) Rather it is stored as an unstructured 128-bit value 
(which in C is 16 unsigned chars). Here is the easy-to-misread 
declaration from src/backend/utils/adt/uuid.c:


#define UUID_LEN 16
struct pg_uuid_t
{
unsigned char data[UUID_LEN];
};

I would love to just cast this to a 128-bit unsigned int. But it looks 
like Postgres doesn't always have 128-bit ints available, so my code 
takes the top half and uses that for penalty calculations. It seemed to 
me that was "good enough" for this purpose.


The only other 128-bit type I found in btree_gist was Interval. For that 
type we convert to a double using INTERVAL_TO_SEC, then call 
penalty_num. By my read that accepts a similar loss of precision.


If I'm mistaken about 128-bit integer support, let me know, and maybe we 
can do the penalty computation on the whole UUID. Or maybe I should just 
convert the uint64 to a double before calling penalty_num? I don't 
completely understand what the penalty calculation is all about, so I 
welcome suggestions here.
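
In other words, something along these lines (just a sketch of the explicit
conversion I mean; the memcpy also avoids assuming the pointer is aligned):

  static double
  uuid2num(const pg_uuid_t *i)
  {
      uint64  hi;

      memcpy(&hi, i->data, sizeof(hi));  /* top 64 of the 128 bits */
      return (double) hi;                /* lossy, but it's only a penalty hint */
  }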


Thanks again,
Paul

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-14 Thread Thomas Munro
On Tue, Sep 15, 2015 at 6:08 AM, Teodor Sigaev  wrote:

> CREATE INDEX index ON table (c1, c2, c3) UNIQUE ON (c1, c3);
>>>
>>> CREATE [UNIQUE {ON FIRST {COLUMN | n_unique_column COLUMNS}} INDEX ON
>>> table_name (column_name1, column_name2 ...);
>>>
>>
>> I would use the first (simple) syntax and just throw an error if the
>> user tries to skip a column on the UNIQUE clause.
>>
> Seems, second option looks as more natural extension of CREATE UNIQUE INDEX
>
>
>
>> Have you by chance looked to see what other databases have done for
>> syntax? I'm guessing this isn't covered by ANSI but maybe there's
>> already an industry consensus.
>>
>
> MS SQL and DB/2 suggests (with changes for postgresql):
> CREATE UNIQUE INDEX i ON t (a,b) INCLUDE (c)
>
> MS SQL supports both unique and non-unique indexes, DB/2 only unique
> indexes. Oracle/MySQL doesn't support covering indexes. Readed at
> http://use-the-index-luke.com/sql/clustering/index-only-scan-covering-index


It surprised me that you can INCLUDE extra columns on non-UNIQUE indexes,
since you could just add them as regular indexed columns for the same
effect.  It looks like when you do that in SQL Server, the extra columns
are only stored on btree leaf pages and so can't be used for searching or
ordering.  I don't know how useful that is or if we would ever want it...
but I just wanted to note that difference, and that the proposed UNIQUE ON
FIRST n COLUMNS syntax and catalog change can't express that.

http://sqlperformance.com/2014/07/sql-indexes/new-index-columns-key-vs-include

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] jsonb_set array append hack?

2015-09-14 Thread Andrew Dunstan



On 09/14/2015 01:29 PM, Thom Brown wrote:

Hi,

I've noticed that if you use a string for an element key in jsonb_set 
with create_missing set to true, you can use it to append to an array:


postgres=# SELECT jsonb_set(
'{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb,
   '{vehicle_types,nonsense}',
   '"motorcycle"', true);
jsonb_set

 {"name": "Joe", "vehicle_types": ["car", "van", "motorcycle"]}
(1 row)

What this really should match is a nested element inside 
"vehicle_types" called "nonsense".  But this seems to be a hack to get 
an element added to an array.  To do it properly currently requires 
specifying an arbitrary number in the hope that it will exceed the 
number of elements you have in the array.



That's a bug and we should fix it.




e.g.

postgres=# SELECT jsonb_set(
   '{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb,
   '{vehicle_types,10}',
   '"motorcycle"', true);
jsonb_set

 {"name": "Joe", "vehicle_types": ["car", "van", "motorcycle"]}
(1 row)

But I'm guessing people shouldn't be relying on the hack in the first 
example.  Isn't this a bug?  If so, wouldn't this also be a bug?:


postgres=# SELECT jsonb_set(
   '{"name": "Joe", "vehicle_types": ["car","van"]}'::jsonb,
   array['vehicle_types',NULL],
   '"motorcycle"', true);




I think that's a bug too.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] cache type info in json_agg and friends

2015-09-14 Thread Andrew Dunstan


Currently, json_agg, jsonb_agg, json_object_agg and jsonb_object_agg do 
type classification on their arguments on each call to the transition 
function. This is quite unnecessary, as the argument types won't change. 
This patch remedies the defect by caching the necessary values in the 
aggregate state object.


While this doesn't change the performance much, since these functions 
are essentially dominated by other bits of the processing, I think it is 
nevertheless worth doing.


There are other areas where we might attack this, also. In particular, 
if one of the arguments is a record, then composite_to_json(b) will do 
this for every attribute of every record. However, it's much less clear 
to me how we can cache this information sensibly.


cheers

andrew
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 8d04347..d199421 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -68,6 +68,15 @@ typedef enum	/* type categories for datum_to_json */
 	JSONTYPE_OTHER/* all else */
 } JsonTypeCategory;
 
+typedef struct JsonAggState
+{
+	StringInfo str;
+	JsonTypeCategory   key_category;
+	Oid				key_output_func;
+	JsonTypeCategory   val_category;
+	Oid				val_output_func;
+} JsonAggState;
+
 static inline void json_lex(JsonLexContext *lex);
 static inline void json_lex_string(JsonLexContext *lex);
 static inline void json_lex_number(JsonLexContext *lex, char *s, bool *num_err);
@@ -1858,18 +1867,10 @@ to_json(PG_FUNCTION_ARGS)
 Datum
 json_agg_transfn(PG_FUNCTION_ARGS)
 {
-	Oid			val_type = get_fn_expr_argtype(fcinfo->flinfo, 1);
 	MemoryContext aggcontext,
 oldcontext;
-	StringInfo	state;
+	JsonAggState	*state;
 	Datum		val;
-	JsonTypeCategory tcategory;
-	Oid			outfuncoid;
-
-	if (val_type == InvalidOid)
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("could not determine input data type")));
 
 	if (!AggCheckCallContext(fcinfo, &aggcontext))
 	{
@@ -1879,50 +1880,62 @@ json_agg_transfn(PG_FUNCTION_ARGS)
 
 	if (PG_ARGISNULL(0))
 	{
+		Oid val_type = get_fn_expr_argtype(fcinfo->flinfo, 1);
+		JsonTypeCategory tcategory;
+		Oid outfuncoid;
+
+		if (val_type == InvalidOid)
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+	 errmsg("could not determine input data type")));
+
 		/*
-		 * Make this StringInfo in a context where it will persist for the
+		 * Make this state object in a context where it will persist for the
 		 * duration of the aggregate call.  MemoryContextSwitchTo is only
 		 * needed the first time, as the StringInfo routines make sure they
 		 * use the right context to enlarge the object if necessary.
 		 */
 		oldcontext = MemoryContextSwitchTo(aggcontext);
-		state = makeStringInfo();
+		state = (JsonAggState *) palloc(sizeof(JsonAggState));
+		state->str = makeStringInfo();
 		MemoryContextSwitchTo(oldcontext);
 
-		appendStringInfoChar(state, '[');
+		appendStringInfoChar(state->str, '[');
+		json_categorize_type(val_type, &tcategory, &outfuncoid);
+		state->val_category = tcategory;
+		state->val_output_func = outfuncoid;
 	}
 	else
 	{
-		state = (StringInfo) PG_GETARG_POINTER(0);
-		appendStringInfoString(state, ", ");
+		state = (JsonAggState *) PG_GETARG_POINTER(0);
+		appendStringInfoString(state->str, ", ");
 	}
 
 	/* fast path for NULLs */
 	if (PG_ARGISNULL(1))
 	{
-		datum_to_json((Datum) 0, true, state, JSONTYPE_NULL, InvalidOid, false);
+		datum_to_json((Datum) 0, true, state->str, JSONTYPE_NULL,
+	  InvalidOid, false);
 		PG_RETURN_POINTER(state);
 	}
 
 	val = PG_GETARG_DATUM(1);
 
-	/* XXX we do this every time?? */
-	json_categorize_type(val_type,
-		 &tcategory, &outfuncoid);
-
 	/* add some whitespace if structured type and not first item */
 	if (!PG_ARGISNULL(0) &&
-		(tcategory == JSONTYPE_ARRAY || tcategory == JSONTYPE_COMPOSITE))
+		(state->val_category == JSONTYPE_ARRAY ||
+		 state->val_category == JSONTYPE_COMPOSITE))
 	{
-		appendStringInfoString(state, "\n ");
+		appendStringInfoString(state->str, "\n ");
 	}
 
-	datum_to_json(val, false, state, tcategory, outfuncoid, false);
+	datum_to_json(val, false, state->str, state->val_category,
+  state->val_output_func, false);
 
 	/*
 	 * The transition type for array_agg() is declared to be "internal", which
 	 * is a pass-by-value type the same size as a pointer.  So we can safely
-	 * pass the ArrayBuildState pointer through nodeAgg.c's machinations.
+	 * pass the JsonAggState pointer through nodeAgg.c's machinations.
 	 */
 	PG_RETURN_POINTER(state);
 }
@@ -1933,19 +1946,21 @@ json_agg_transfn(PG_FUNCTION_ARGS)
 Datum
 json_agg_finalfn(PG_FUNCTION_ARGS)
 {
-	StringInfo	state;
+	JsonAggState	*state;
 
 	/* cannot be called directly because of internal-type argument */
 	Assert(AggCheckCallContext(fcinfo, NULL));
 
-	state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0);
+	state = PG_ARGISNULL(0) ?
+		NULL :
+		(JsonAggState *) PG_GETARG_POINTER(0);
 
 	/* NULL result for 

Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Shulgin, Oleksandr
On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra  wrote:

> Hi,
>
> I did a quick initial review of this patch today, so here are my comments
> so far:
>

Hi Tomas,

First of all, thanks for the review!

- ipcs.c should include utils/cmdstatus.h (the compiler complains
>   about implicit declaration of two functions)
>

Correct, though the file name is ipci.c.

- Attempts to get plan for simple insert queries like this
>
>   INSERT INTO x SELECT * FROM x;
>
>   end with a segfault, because ActivePortal->queryDesc is 0x0 for this
>   query. Needs investigation.
>

Yes, I've hit the same problem after submitting the latest version of the
patch.  For now I've just added a check for queryDesc being not NULL, but I
guess the top of the current_query_stack might contain something useful.
Something I need to try.

- The lockless approach seems fine to me, although I think the fear
>   of performance issues is a bit moot (I don't think we expect large
>   number of processes calling pg_cmdstatus at the same time). But
>   it's not significantly more complex, so why not.
>

I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends try to communicate, it doesn't
prevent the rest of them from operating, because there is no shared (and
thus locked) communication channel.

- The patch contains pretty much no documentation, both comments
>   at the code level and user docs. The lack of user docs is not that
>   a big deal at this point (although the patch seems to be mature
>   enough, although the user-level API will likely change).
>
>   The lack of code comments is more serious, as it makes the review
>   somewhat more difficult. For example it'd be very nice to document
>   the contract for the lock-less interface.
>

I will add the code comments.  The user docs can wait until we decide on
the interface, I think.

- I agree that pg_cmdstatus() is not the best API. Having something
>   like EXPLAIN PID would be nice, but it does not really work for
>   all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
>   there's not a single API for all cases, i.e. we should use EXPLAIN
>   PID for one case and invent something different for the other?
>

I can think of something like:

EXPLAIN [ ( option [, ...] ) ] PROCESS <pid>;

where option is extended with:

  QUERY
  PROGRESS
  BACKTRACE

in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.
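
For example (purely hypothetical, since none of this is implemented yet):

  -- plain plan of whatever backend 12345 is currently running
  EXPLAIN PROCESS 12345;

  -- progress-instrumented plan, as JSON
  EXPLAIN (PROGRESS, FORMAT json) PROCESS 12345;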

- Is there a particular reason why we allocate slots for auxiliary
>   processes and not just for backends (NumBackends)? Do we expect those
>   auxiliary processes to ever use this API?
>

If we extend the interface to a more general one, there still might be some
room for querying the status of the checkpointer or bgwriter.

- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
>   the need for the second argument, or the internal slot variable. Why
>   not to simply use the MyCmdStatusSlot directly?
>

Good point.

- I also don't quite understand why we need to track css_pid for the
>   slot? In what scenario will this actually matter?
>

I think it's being only used for error reporting and could help in
debugging, but for now that's it.

- While being able to get EXPLAIN from the running process is nice,
>   I'm especially interested in getting EXPLAIN ANALYZE to get insight
>   into the progress of the execution. The are other ways to get the
>   EXPLAIN, e.g. by opening a different connection and actually running
>   it (sure, the plan might have changed since then), but currently
>   there's no way to get insight into the progress.
>
>   From the thread I get the impression that Oleksandr also finds this
>   useful - correct? What are the plans in this direction?
>
>   ISTM we need at least two things for that to work:
>
>   (a) Ability to enable instrumentation on all queries (effectively
>   what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
>   on the queries later. But auto_explain is an extension, so that
>   does not seem as a good match if this is supposed to be in core.
>   In that case a separate GUC seems appropriate.
>
>   (b) Being able to run the InstrEnd* methods repeatedly - the initial
>   message in this thread mentions issues with InstrEndLoop for
>   example. So perhaps this is non-trivial.
>

I was able to make this work with a simple change to InstrEndLoop and the
callers.  Basically, adding a bool parameter in_explain and passing an
appropriate value.  I guess that's not the best approach, but it appears to
work.

Adding a GUC to enable instrumentation sounds reasonable.

Do you believe it makes sense to add instrumentation support in this same
patch, or is it better to focus on making the simplest thing work first?

- And finally, I think we should really support all existing EXPLAIN
>   formats, not just text. We need to support the other formats (yaml,
>   json, xml) if we want to use the EXPLAIN PID approach, and it also

Re: [HACKERS] extend pgbench expressions with functions

2015-09-14 Thread Kyotaro HORIGUCHI
Hi, I had a look at this; it gives amazing performance. (I
haven't tried it myself yet :-p) Anyway, the new syntax looks
far smarter than preparing arbitrary metacommands.

michael> I have moved this patch to the next CF.

> > Hello Heikki,
> >
> >>> As soon as we add more functions, the way they are documented needs to be
> >>> reworked too; we'll need to add a table in the manual to list them.
> >>
> >>
> >> Here is a v8 with "abs", "min", "max", "random", "gaussian" et
> >> "exponential".
> >>
> >> [...]
> >>
> >> There is no real doc, WIP...

From a quick glance at the patch, I have two simple comments.
Would you consider them?

1. evalInt and evalDouble call each other on a single expr
 node. It looks simple and seems to work, but it could easily break,
 fall into infinite mutual recursion, and quickly exceed the
 stack depth.

 I think it is better not to do so. For example, merging
 evalDouble and evalInt into one evaluateExpr() which can return
 both double and int results would avoid that. The interface would be
 something like the following, or similar. Some individual
 functions might be better split out.

  static ExprRetType evaluateExpr(TState, CState, PgBenchExpr,
                                  int *iret, double *dret);

  or

  typedef struct EvalRetVal {
    bool    is_double;
    union {
      int    *ival;
      double *dval;
    } val;
  } EvalRetVal;
  static bool evaluateExpr(..., EvalRetVal *ret);


2. You combined abs/sqrt/ddebug and then branch on
  them. It'd be better to split them even if the eval*() code looks
  somewhat duplicated.


regards,


> > Here is a v9 with a doc. The implicit typing of expressions is improved.
> >
> > I also added two debug functions which allow to show intermediate integer
> > (idebug) or double (ddebug).
> >
> >   \set i idebug(random(1, 10))
> >
> > will print the random value and assign it to i.
> >
> >
> > I updated the defaut scripts, which seems to speed up meta command
> > evaluations. The initial version does less than 2 million evals per second:
> >
> >   sh> cat before.sql
> >   \set naccounts 10 * :scale
> >   \setrandom aid 1 :naccounts
> >
> >   sh> ./pgbench -T 3 -P 1 -f before.sql
> >   [...]
> >   tps = 1899004.793098 (excluding connections establishing)
> >
> >
> > The updated version does more than 3 million evals per second:
> >
> >   sh> cat after.sql
> >   \set aid random(1, 10 * :scale)
> >
> >   sh> ./pgbench -T 3 -P 1 -f after.sql
> >   [...]
> >   tps = 3143168.813690 (excluding connections establishing)
> >
> >
> > Suggestion:
> >
> > A possibility would be to remove altogether the \setrandom stuff as the
> > functionality is available through \set, maybe with an error message to
> > advise about using \set with one of the random functions. That would remove
> > about 200 ugly locs...
> >
> > Another option would be to translate the setrandom stuff into a \set
> > expression, that would maybe save 100 locs for the eval, but keep and expand
> > a little the "manual" parser part.
> 
> I have moved this patch to the next CF.
> -- 
> Michael

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila 
wrote:

> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <
> i.kurbangal...@postgrespro.ru> wrote:
> >
> > On 08/05/2015 09:33 PM, Robert Haas wrote:
> >>
> >>
> >> You're missing the point.  Those multi-byte fields have additional
> >> synchronization requirements, as I explained in some detail in my
> >> previous email. You can't just wave that away.
> >
> > I see that now. Thank you for the point.
> >
> > I've looked deeper and I found PgBackendStatus to be not a suitable
> > place for keeping information about low level waits. Really,
> PgBackendStatus
> > is used to track high level information about backend. This is why
> auxiliary
> > processes don't have PgBackendStatus, because they don't have such
> information
> > to expose. But when we come to the low level wait events then auxiliary
> > processes are as useful for monitoring as backends are. WAL writer,
> > checkpointer, bgwriter etc are using LWLocks as well. This is certainly
> unclear
> > why they can't be monitored.
> >
>
> I think the chances of background processes stuck in LWLock is quite less
> as compare to backends as they do the activities periodically.  As an
> example
> WALWriter will take WALWriteLock to write the WAL, but actually there will
> never
> be any much contention for WALWriter. In synchronous_commit = on, the
> backends themselves write the WAL so WALWriter won't do much in that
> case and for synchronous_commit = off, backends won't write the WAL so
> WALWriter won't face any contention unless some buffers have to be written
> by bgwriter or checkpoint for which WAL is not flushed which I don't think
> would lead to any contention.
>

Hmm, synchronous_commit is a per-session variable: some transactions could
run with synchronous_commit = on, and some with synchronous_commit = off.
This is a very popular feature of PostgreSQL: achieving better performance by
making non-critical transactions asynchronous while leaving critical
transactions synchronous. Thus, contention for WALWriteLock between
backends and WALWriter could be real.

I am not denying from the fact that there could be some contention in rare
> scenarios for background processes, but I think tracking them is not as
> important as tracking the LWLocks for backends.
>

I would be more careful about calling some of these scenarios rare. As DBMS
developers we should do our best to avoid contention for LWLocks: any
contention, not only between backends and background processes. One may
assume that high LWLock contention is a rare scenario in general; given
that we're discussing it here, we evidently don't think so.
You claim that there couldn't be contention for WALWriteLock between
backends and WALWriter. This is unclear to me: I think there could be.
Tracking wait events for backends and tracking them for background
processes are not mutually exclusive. I think we need to track both in order
to provide the full picture to the DBA.

Also as we are planning to track the wait_event information in
> pg_stat_activity
> along with other backends information, it will not make sense to include
> information about backend processes in this variable as pg_stat_activity
> just displays information of backend processes.
>

I'm not objecting to tracking only backend information in
pg_stat_activity. I think we should also have some other way of tracking
wait events for background processes. We should think this through before
extending pg_stat_activity, to avoid design issues later.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Alexander Korotkov
On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila 
wrote:

> On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov 
> wrote:
> >
> > On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
> i.kurbangal...@postgrespro.ru> wrote:
> >>
> >>
> >> I've looked deeper and I found PgBackendStatus to be not a suitable
> >> place for keeping information about low level waits. Really,
> PgBackendStatus
> >> is used to track high level information about backend. This is why
> auxiliary
> >> processes don't have PgBackendStatus, because they don't have such
> information
> >> to expose. But when we come to the low level wait events then auxiliary
> >> processes are as useful for monitoring as backends are. WAL writer,
> >> checkpointer, bgwriter etc are using LWLocks as well. This is certainly
> unclear
> >> why they can't be monitored.
> >>
> >> This is why I think we shoudn't place wait event into PgBackendStatus.
> It
> >> could be placed into PGPROC or even separate data structure with
> different
> >> concurrency model which would be most suitable for monitoring.
> >
> >
> > +1 for tracking wait events not only for backends
> >
> > Ildus, could you do following?
> > 1) Extract LWLocks refactoring into separate patch.
> > 2) Make a patch with storing current wait event information in PGPROC.
> >
>
> Now as Robert has committed standardization of lwlock names in
> commit - aa65de04, let us try to summarize and work on remaining parts
> of the patch. So I think now the next set of work is as follows:
>
> 1. Modify the tranche mechanism so that information about LWLocks
> can be tracked easily.  For this already there is some discussion, ideas
> and initial patch is floated in this thread and there doesn't seem to be
> much
> conflicts, so we can write the patch for it.  I am planning to write or
> modify
> the existing patch unless you, IIdus or anyone has objections or want to
> write it, please let me know to avoid duplication of work.
>
> 2. Track wait_event in pg_stat_activity.  Now here the main point where
> we doesn't seem to be in complete agreement is that shall we keep it
> as one byte in PgBackendStatus or use two or more bytes to store
> wait_event information and still there is another point made by you to
> store it in PGProc rather than in PgBackendStatus so that we can display
> information of background processes as well.
>
> Now as a matter of consensus, I think Tom has raised a fair point [1]
> against
> storing this information in PGProc and I feel that it is relatively less
> important to have such information about background processes and the
> reason for same is explained upthread [2].  About having more than one-byte
> to store information about various wait_events, I think now we will not
> have
> more than 100 or so events to track, do you really think that anytime in
> forseeable
> future we will have more than 256 important events which we would like to
> track?
>
> So I think about this lets first try to build the consensus and then
> attempt to
> write or modify the patch.
>
>
> [1] - http://www.postgresql.org/message-id/4067.1439561...@sss.pgh.pa.us
> [2] -
> http://www.postgresql.org/message-id/CAA4eK1JRQGTgRMRao6H65rA=fhifucnqhx7ongy58um6rn1...@mail.gmail.com
>

In order to build consensus we need a roadmap for waits monitoring.
Would a single byte in PgBackendStatus be the only way of tracking wait
events? Could we have a pluggable infrastructure for waits monitoring: for
instance, hooks for wait event begin and end?
A limit of 256 wait events is probably OK. But we have no room for any
additional parameters of a wait event. For instance, if you notice high
contention on a buffer content lock, it would be nice to figure out how
many blocks are involved, and which blocks. We need to track additional
parameters in order to answer such questions.
Comments about tracking only backends are in [1] and [2].
PGPROC is not the only way to track events with parameters for every
process. We could try to extend PgBackendStatus or build some other
secondary data structure.

However, this depends on our general roadmap. If pg_stat_activity is
just the default, while fuller waits monitoring could be a plugin, then
it's probably OK.

[1] -
http://www.postgresql.org/message-id/capphfdsnju40qud2sqjs_iu+z-jm-b_f39foc7eydjvza5e...@mail.gmail.com
[2] -
http://www.postgresql.org/message-id/9a99c2a7-1760-419f-bdc9-a2cf99ecd...@simply.name

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Can extension build own SGML document?

2015-09-14 Thread Kouhei Kaigai
Hello,

The pgxs makefile (pgxs.mk) says:

#   DOCS -- random files to install under $PREFIX/doc/$MODULEDIR

It is a bunch of files to be copied to the documentation directory;
however, it does not look like a variable for specifying SGML sources
to be built, the way PostgreSQL core does.

Do we have a way to build the SGML sources of an extension using the makefile?
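
For reference, DOCS is typically used like this (a minimal sketch with made-up
file names); it only installs files that already exist, it does not build
anything from SGML sources:

  MODULES = my_ext
  EXTENSION = my_ext
  DATA = my_ext--1.0.sql
  DOCS = README.my_ext

  PG_CONFIG = pg_config
  PGXS := $(shell $(PG_CONFIG) --pgxs)
  include $(PGXS)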

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned checkpointing

2015-09-14 Thread Takashi Horikawa
Hi,

I wrote:
> The original purpose is to mitigate full-page-write rush that occurs at
> immediately after the beginning of each checkpoint.
> The amount of FPW at each checkpoint is reduced to 1/16 by the
> 'Partitioned checkpointing.'
Let me show another set of measurement results that clearly illustrates this 
point. Please find DBT2-sync.jpg and DBT2-sync-FPWoff.jpg.

At first, I noticed the performance dip due to checkpointing when I conducted 
some performance measurements using DBT-2, which implements transactions based on 
the TPC-C specification. As can be seen in DBT2-sync.jpg, the original 9.5alpha2 
showed sharp periodic dips in throughput.

The point here is that I identified those dips as being caused by the 
full-page-write rush that occurs immediately after the beginning of each 
checkpoint. As shown in DBT2-sync-FPWoff.jpg, those dips were eliminated when the 
GUC parameter 'full_page_writes' was set to 'off.' This also indicates that 
the existing mechanism of spreading buffer sync operations over time worked 
effectively. As the only difference between the original 9.5alpha2 case in 
DBT2-sync.jpg and DBT2-sync-FPWoff.jpg was the setting of 
'full_page_writes,' it follows that those dips were attributable to 
full-page writes.

The 'Partitioned checkpointing' was implemented to mitigate the dips by 
spreading full-page writes over time, and it worked exactly as designed (see 
DBT2-sync.jpg). It also had a good effect on pgbench, so I posted an 
article with a Partitioned-checkpointing.patch to this mailing list.

As to pgbench, however, I have found that full-page-writes did not cause the 
performance dips, because the dips also occurred when 'full_page_writes' was 
set to 'off.' So, honestly, I do not exactly know why 'Partitioned 
checkpointing' mitigated the dips in pgbench executions. 

However, it is certain that there are workloads other than pgbench for 
PostgreSQL in which the full-page-write rush causes performance dips and 
'Partitioned checkpointing' is effective at eliminating (or mitigating) them; DBT-2 
is an example.

Also, it is worth studying why 'Partitioned checkpointing' is effective in 
pgbench executions; doing so may lead to devising better approaches.
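
(For readers skimming the thread: the buffer-to-partition mapping discussed in
the quoted message below boils down to the following sketch; the names are
invented for illustration and are not taken from the patch.)

  #define CKPT_PARTITIONS 16

  /* partition k owns buffers k, k+16, k+32, ... -- interleaved across the
   * whole pool so each mini-checkpoint sees a similar mix of buffers */
  static inline int
  buffer_partition(int buf_id)
  {
      return buf_id % CKPT_PARTITIONS;
  }
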
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Takashi Horikawa
> Sent: Saturday, September 12, 2015 12:50 PM
> To: Simon Riggs; Fabien COELHO
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Partitioned checkpointing
> 
> Hi,
> 
> > I understand that what this patch does is cutting the checkpoint
> > of buffers in 16 partitions, each addressing 1/16 of buffers, and each
> with
> > its own wal-log entry, pacing, fsync and so on.
> Right.
> However,
> > The key point is that we spread out the fsyncs across the whole checkpoint
> > period.
> this is not the key point of the 'partitioned checkpointing,' I think.
> The original purpose is to mitigate full-page-write rush that occurs at
> immediately after the beginning of each checkpoint.
> The amount of FPW at each checkpoint is reduced to 1/16 by the
> 'Partitioned checkpointing.'
> 
> > This method interacts with the current proposal to improve the
> > checkpointer behavior by avoiding random I/Os, but it could be combined.
> I agree.
> 
> > Splitting with N=16 does nothing to guarantee the partitions are equally
> > sized, so there would likely be an imbalance that would reduce the
> > effectiveness of the patch.
> May be right.
> However, current method was designed with considering to split
> buffers so as to balance the load as equally as possible;
> current patch splits the buffer as
> ---
> 1st round: b[0], b[p], b[2p], … b[(n-1)p]
> 2nd round: b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
> …
> p-1 th round:b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
> ---
> where N is the number of buffers,
> p is the number of partitions, and n = (N / p).
> 
> It would be extremely unbalance if buffers are divided as follow.
> ---
> 1st round: b[0], b[1], b[2], … b[n-1]
> 2nd round: b[n], b[n+1], b[n+2], … b[2n-1]
> …
> p-1 th round:b[(p-1)n], b[(p-1)n+1], b[(p-1)n+2], … b[(p-1)n+(n-1)]
> ---
> 
> 
> I'm afraid that I miss the point, but
> > 2.
> > Assign files to one of N batches so we can make N roughly equal sized
> > mini-checkpoints
> Splitting buffers with considering the file boundary makes FPW related
> processing
> (in xlog.c and xloginsert.c) complicated intolerably, as 'Partitioned
> checkpointing' is strongly related to the decision of whether this buffer
> is necessary to FPW or not at the time of inserting the xlog record.
> # 'partition id = buffer id % number of partitions' is fairly simple.
> 
> Best regards.
> --
> Takashi Horikawa
> NEC Corporation
> Knowledge Discovery Research Laboratories
> 
> 
> 
> > -Original Message-
> > From: Simon Riggs 

Re: [HACKERS] RLS open items are vague and unactionable

2015-09-14 Thread Dean Rasheed
On 13 September 2015 at 22:54, Stephen Frost  wrote:
> Not in front of my laptop and will review it when I get back in more detail,
> but the original approach that I tried was changing
> get_policies_for_relation to try and build everything necessary, which
> didn't work as we need to OR the various ALL/SELECT policies together and
> then AND the result to apply the filtering.
>

Yes that wouldn't work because get_policies_for_relation() isn't the
place for that. It has a clear responsibility which is to build and
return 2 lists of policies (permissive and restrictive) for the
specified relation, command type and user role. It doesn't have any
say in what is done with those policies (how their quals get
combined), it just fetches them.

It shouldn't be necessary to change get_policies_for_relation() at all
to support the RETURNING check. You just need to call it with
CMD_SELECT. BTW, your change to get_policies_for_relation() has
a bug -- if the policy is for ALL commands it gets added
unconditionally to the list of returning_policies, regardless of
whether it applies to the current user role. That's the risk of
complicating the function by trying to make it do more than it was
designed to do.


> Seems like that might not be an issue with your proposed approach, but
> wouldn't we need to either pass in fresh lists and then append them to the
> existing lists, or change the functions to work with non-empty lists? (I had
> thought about changing that anyway since there's often cases where it's
> useful to be able to call a function which adds to an existing list).
> Further, actually, we'd still have to figure out how to build up the OR'd
> qual from the ALL/SELECT policies and then add that to the restrictive set.
> That didn't appear easy to do from get_policies_for_relation as all the
> ChangeVarNode work is handled in the build_* functions, which have the info
> necessary.
>

No, the lists of policies built and returned by
get_policies_for_relation() should not be appended to any other lists.
That would have the effect of OR'ing the SELECT policy quals with the
UPDATE/DELETE policy quals, which is wrong. The user needs to have
both SELECT and UPDATE/DELETE permissions on each row, so the OR'ed
set of permissive SELECT quals needs to be AND'ed with the OR'ed set
of permissive UPDATE/DELETE quals. The build_* functions do precisely
what is needed for that, and shouldn't need changing either (other
than the rename discussed previously).

So for example, build^H^H^Hadd_security_quals() takes 2 lists of
policies (permissive and restrictive), combines them in the right way
using OR and AND, does the necessary varno manipulation, and adds the
resulting quals to the passed-in list of security quals (thereby
implicitly AND'ing the new quals with any pre-existing ones), which is
precisely what's needed to support RETURNING.
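
In other words, the RETURNING support should reduce to something like this (a
sketch only -- the argument lists are from memory, so treat the exact
signatures as assumptions):

  if (commandType != CMD_SELECT && hasReturning)
  {
      List   *ret_permissive = NIL;
      List   *ret_restrictive = NIL;

      /* fetch the SELECT policies that apply to this rel and role ... */
      get_policies_for_relation(rel, CMD_SELECT, user_id,
                                &ret_permissive, &ret_restrictive);
      /* ... and AND them into the existing security quals */
      add_security_quals(rt_index, ret_permissive, ret_restrictive,
                         securityQuals, hasSubLinks);
  }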


>> Then it isn't necessary to modify get_policies_for_relation(),
>> build_security_quals() and build_with_check_options() to know anything
>> specific about returning. They're just another set of permissive and
>> restrictive policies to be fetched and added to the command.
>
>
> The ALL/SELECT policies need to be combined with OR's (just like all
> permissive sets of policies) and then added to the restrictive set of quals,
> to ensure that they are evaluated as a restriction and not just another set
> of permissive policies, which wouldn't provide the required filtering.
>

Right, and that's what the add_* functions do. The separation of
concerns between get_policies_for_relation() and the add_* functions
was intended to make just this kind of change trivial at a higher
level.

I think this should actually be 2 separate commits, since the
refactoring and the support for RETURNING are entirely different
things. It just happens that after the refactoring, the RETURNING
patch becomes trivial (4 new executable lines of code wrapped in a
couple of if statements, to fetch and then apply the new policies in
the necessary cases). At least that's the theory :-)

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Pavel Stehule
2015-09-14 10:23 GMT+02:00 Shulgin, Oleksandr 
:

>
>
>
> I can think of something like:
>
> EXPLAIN [ ( option [, ...] ) ] PROCESS <pid>;
>
> where option is extended with:
>
>   QUERY
>   PROGRESS
>   BACKTRACE
>
> in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.
>
>
It can work

Regards

Pavel


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-14 Thread Tomas Vondra



On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:

On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra
> wrote:

...

- Attempts to get plan for simple insert queries like this

   INSERT INTO x SELECT * FROM x;

   end with a segfault, because ActivePortal->queryDesc is 0x0 for this
   query. Needs investigation.


Yes, I've hit the same problem after submitting the latest version of
the patch.  For now I've just added a check for queryDesc being not
NULL, but I guess the top of the current_query_stack might contains
something useful.  Something I need to try.


Well, the thing is we're able to do EXPLAIN on those queries, and IIRC 
auto_explain can log them too. So perhaps look into the hooks where they 
take the queryDesc in those cases - it has to be available somewhere.



- The lockless approach seems fine to me, although I think the fear
   of performance issues is a bit moot (I don't think we expect large
   number of processes calling pg_cmdstatus at the same time). But
   it's not significantly more complex, so why not.


I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it doesn't
prevent the rest of them from operating, because there is no shared (and
thus locked) communication channel.


Sure, but I think it really deserves a bit more detailed explanation of 
the protocol, and discussion of the expected behavior for some basic 
failure types.


For example - what happens when a backend requests info, but dies before 
receiving it, and the backend is reused for another connection? Doesn't 
this introduce a race condition? Perhaps not, I'm just asking.




- The patch contains pretty much no documentation, both comments
   at the code level and user docs. The lack of user docs is not that
   a big deal at this point (although the patch seems to be mature
   enough, although the user-level API will likely change).

   The lack of code comments is more serious, as it makes the review
   somewhat more difficult. For example it'd be very nice to document
   the contract for the lock-less interface.


I will add the code comments.  The user docs could wait before we decide
on the interface, I think.


Agreed, although I think having rudimentary user documentation would be 
useful for the reviewers - a summary of the goals that are a bit 
scattered over the e-mail thread.



- I agree that pg_cmdstatus() is not the best API. Having something
   like EXPLAIN PID would be nice, but it does not really work for
   all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
   there's not a single API for all cases, i.e. we should use EXPLAIN
   PID for one case and invent something different for the other?


I can think of something like:

EXPLAIN [ ( option [, ...] ) ] PROCESS <pid>;

where option is extended with:

   QUERY
   PROGRESS
   BACKTRACE

in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.


Seems OK. I'm not quite sure what PROGRESS is - I assume it's the same 
thing as ANALYZE? Why not use that keyword, then?




- Is there a particular reason why we allocate slots for auxiliary
   processes and not just for backends (NumBackends)? Do we expect those
   auxiliary processes to ever use this API?


If we extend the interface to a more general one, there still might be
some space for querying status of checkpointer of bgwriter.


I don't think we should mix this with monitoring of auxiliary processes. 
This interface is designed for monitoring SQL queries running in other 
backends, effectively "remote" EXPLAIN. But those auxiliary processes 
are not processing SQL queries at all, they're not even using regular 
executor ...


OTOH the ability to request this info (e.g. auxiliary process looking at 
plans running in backends) seems useful, so I'm ok with tuple slots for 
auxiliary processes.




- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
   the need for the second argument, or the internal slot variable. Why
   not to simply use the MyCmdStatusSlot directly?


Good point.

- I also don't quite understand why we need to track css_pid for the
   slot? In what scenario will this actually matter?


I think it's being only used for error reporting and could help in
debugging, but for now that's it.


I recommend getting rid of it, unless we have a clear idea of how to use 
it. Otherwise I envision we won't really keep it updated properly 
(because it's "just for debugging"), and then one day someone actually 
starts using it and gets bitten by it.




- While being able to get EXPLAIN from the running process is nice,
   I'm especially interested in getting EXPLAIN ANALYZE to get insight
   into the progress of the execution. The are other ways to get the
   EXPLAIN, e.g. by opening a different 

[HACKERS] Re: Review: check existency of table for -t option (pg_dump) when pattern...

2015-09-14 Thread Teodor Sigaev

(new version in attach), but patch shows some inconsistent output:
% pg_dump -t 'aaa*'  postgres
pg_dump: No matching tables were found
% pg_dump -t 'aaa*' --strict-names postgres
pg_dump: Table "aaa*" not found.



There are two different situation - first message says "there are not any table
for work", second says "There are not specific table(s) with specific names
(mask)". I am thinking so this information is enough interesting for showing.

Agree, but "aaa*" is not a table



Can be changed to "No matching table(s) were found for filter "aaa*" " ?


"TBL" is not a filter. May be:

No matching table(s) were found for pattern "aaa*" ?
No matching table(s) were found for pattern "TBL" ?

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: check existency of table for -t option (pg_dump) when pattern...

2015-09-14 Thread Pavel Stehule
2015-09-14 12:05 GMT+02:00 Teodor Sigaev :

> (new version in attach), but patch shows some inconsistent output:
>> % pg_dump -t 'aaa*'  postgres
>> pg_dump: No matching tables were found
>> % pg_dump -t 'aaa*' --strict-names postgres
>> pg_dump: Table "aaa*" not found.
>>
>
> There are two different situation - first message says "there are not any
>> table
>> for work", second says "There are not specific table(s) with specific
>> names
>> (mask)". I am thinking so this information is enough interesting for
>> showing.
>>
> Agree, but "aaa*" is not a table
>
>
>> Can be changed to "No matching table(s) were found for filter "aaa*" " ?
>>
>
> "TBL" is not a filter. May be:
>
> No matching table(s) were found for pattern "aaa*" ?
> No matching table(s) were found for pattern "TBL" ?


+1

it is better

Pavel


>
>
> --
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>WWW:
> http://www.sigaev.ru/
>


Re: [HACKERS] PATCH: index-only scans with partial indexes

2015-09-14 Thread Kyotaro HORIGUCHI
Hi, this looks like a bug in cost_index(). The attached patch
would fix that.

=
The following part in cost_index,

> cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
> 
> run_cost += cpu_per_tuple * tuples_fetched;

Adds *cpu_tuple_cost* (which is 0.01) + qpqual_cost.per_tuple
(0.0025) per tuple even though they are index tuples. On the other hand,
genericcostestimate adds the following value for the same purpose.

> indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + 
> qual_op_cost);

cpu_index_tuple_cost is 0.005, just half of cpu_tuple_cost by
default. I think this is the culprit of the difference.

For confirmation, setting cpu_tuple_cost to 0.005 to equate it with
cpu_index_tuple_cost (or the opposite) makes the estimate for both
indexes the same value.


set cpu_tuple_cost to 0.005;
explain select a from t where b < 30;
QUERY PLAN 
---
 Index Only Scan using idx2 on t  (cost=0.42..7022.06 rows=297876 width=4)
   Index Cond: (b < 30)
(2 rows)

explain select a from t where b < 30;
QUERY PLAN 
---
 Index Only Scan using idx1 on t  (cost=0.42..7022.66 rows=297876 width=4)
(1 row)

This should be a bug.  The attached patch would fix this, and
the costs for all of your examples should then match except for
double-precision rounding errors. I think it is enough, since an
index-only scan may not have quals on columns outside the index in
question, so the qpquals should be index quals.

regards,


At Mon, 14 Sep 2015 10:00:24 +0200, Tomas Vondra  
wrote in <55f67e98.5050...@2ndquadrant.com>
> On 09/14/2015 09:35 AM, Kyotaro HORIGUCHI wrote:
> > Hi,
> ,,,
> >> Which is exactly the difference between costs from amcostestimate
> >>
> >> idx1: 4769.115000 + 0.015 * 297823 = 9236.46
> >> idx2: 6258.23 + 0.010 * 297823 = 9236.46
> >
> > These calculations are exactly right, but you overlooked the
> > breakedown of indexTotalCost for idx2.
> >
> >> Sppoky! Although it seems like a mere coincidence, thanks to the nice
> >> round numbers of tuples in the table, and lucky choice of two
> >> conditions.
> >
> > As said above, it is not a conincidence. The exactly same
> > calculation about baserestrictinfo is simply calculated in
> > different places, cost_index for the former and
> > btcostestiamte(genericcostestimate) for the latter.
> 
> By "coincidence" I meant that we happened to choose such a number of
> conditions in the index predicate & query that this perfect match is
> possible. Apparently there are two places that manipulate the costs
> and in this particular case happen to perfectly compensate the
> effects.

Ok, I understood.

> As demonstrated by the example with a single condition, the costs may
> actually differ for different numbers of clauses (e.g. using a single
> clause makes the wider index - unexpectedly - cheaper).
> 
> >
> > We should properly ignore or remove the implicitly-applied quals
> > for partial indexes on cost estimation.
> 
> Probably. So far I've traced the difference to build_index_paths()
> where we build index_clauses by iterating over index columns - the
> smaller index does not have the column from the predicate, so we don't
> add the clause. I'm not particularly familiar with this part of the
> code, so I wonder where's the best place to fix this, though.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d107d76..d354dc2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -516,7 +516,12 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	cost_qual_eval(&qpqual_cost, qpquals, root);
 
 	startup_cost += qpqual_cost.startup;
-	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+
+	/* Indexonly scan applies this qual on index tuples */
+	if (indexonly)
+		cpu_per_tuple = cpu_index_tuple_cost + qpqual_cost.per_tuple;
+	else
+		cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
 	run_cost += cpu_per_tuple * tuples_fetched;
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: index-only scans with partial indexes

2015-09-14 Thread Kyotaro HORIGUCHI
Sorry.

> Hi, this looks to be a bug of cost_index(). The attached patch
> would fix that.

No, that's wrong; please forget the patch. The quals in qpquals
should be index quals, which were excluded because they do not
need to be applied. The right way would be to remove the cost
for the qpquals in cost_index().

> =
> The following part in cost_index,
> 
> > cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
> > 
> > run_cost += cpu_per_tuple * tuples_fetched;
> 
> Adds, *cpu_tuple_cost* (which is 0.01) + qpqual_cost.per_tuple
> (0.0025) per tuple even they are index tuples. On the other hand
> getnericcostestimate adds the following value for the same deed.
> 
> > indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + 
> > qual_op_cost);
> 
> cpu_index_tuple_cost is 0.005, just a half of cpu_tuple cost as
> default. I think this should be the culprit of the difference.
> 
> For confirmation, setting cpu_tuple_cost to 0.05 to equate with
> cpu_index_tuple_cost and the oppisit makes the estimate for both
> indexes the same value.
> 
> 
> set cpu_tuple_cost to 0.005;
> explain select a from t where b < 30;
> QUERY PLAN 
> ---
>  Index Only Scan using idx2 on t  (cost=0.42..7022.06 rows=297876 width=4)
>Index Cond: (b < 30)
> (2 rows)
> 
> explain select a from t where b < 30;
> QUERY PLAN 
> ---
>  Index Only Scan using idx1 on t  (cost=0.42..7022.66 rows=297876 width=4)
> (1 row)
> 
> This should be a bug.  The attached patch would fix it, and the
> costs for all of your examples should then match except for
> double-precision rounding errors. I think that is enough, since an
> IndexOnlyScan cannot have quals on columns outside the index in
> question, so qpquals should consist only of index quals.
> 
> regards,
> 
> 
> At Mon, 14 Sep 2015 10:00:24 +0200, Tomas Vondra 
>  wrote in <55f67e98.5050...@2ndquadrant.com>
> > On 09/14/2015 09:35 AM, Kyotaro HORIGUCHI wrote:
> > > Hi,
> > ,,,
> > >> Which is exactly the difference between costs from amcostestimate
> > >>
> > >> idx1: 4769.115000 + 0.015 * 297823 = 9236.46
> > >> idx2: 6258.23 + 0.010 * 297823 = 9236.46
> > >
> > > These calculations are exactly right, but you overlooked the
> > > breakdown of indexTotalCost for idx2.
> > >
> > >> Spooky! Although it seems like a mere coincidence, thanks to the nice
> > >> round numbers of tuples in the table, and lucky choice of two
> > >> conditions.
> > >
> > > As said above, it is not a coincidence. The exact same
> > > calculation over baserestrictinfo is simply performed in
> > > different places, cost_index for the former and
> > > btcostestimate(genericcostestimate) for the latter.
> > 
> > By "coincidence" I meant that we happened to choose such a number of
> > conditions in the index predicate & query that this perfect match is
> > possible. Apparently there are two places that manipulate the costs
> > and in this particular case happen to perfectly compensate the
> > effects.
> 
> Ok, I understood.
> 
> > As demonstrated by the example with a single condition, the costs may
> > actually differ for different numbers of clauses (e.g. using a single
> > clause makes the wider index - unexpectedly - cheaper).
> > 
> > >
> > > We should properly ignore or remove the implicitly-applied quals
> > > for partial indexes during cost estimation.
> > 
> > Probably. So far I've traced the difference to build_index_paths()
> > where we build index_clauses by iterating over index columns - the
> > smaller index does not have the column from the predicate, so we don't
> > add the clause. I'm not particularly familiar with this part of the
> > code, so I wonder where's the best place to fix this, though.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-09-14 Thread YUriy Zhuravlev
On Friday 11 September 2015 18:50:35 you wrote:
> a) As I said upthread there's a patch to remove these locks entirely
It is very interesting. Could you provide a link? Even so, it is not 
ideal, since the bottleneck then just moves to PinBuffer / UnpinBuffer 
instead of the LWLocks.
> b) It doesn't matter anyway. Not every pin goes through the buffer
> mapping table. StrategyGetBuffer(), SyncOneBuffer(), ...
StrategyGetBuffer() is called only from BufferAlloc().
SyncOneBuffer() is not a problem either, because it pins the buffer like
this:
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
And please read the comment before LockBufHdr(bufHdr) in SyncOneBuffer.
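
For reference, the relevant part of SyncOneBuffer() looks roughly like
this (condensed and paraphrased, not a verbatim quote of the sources):

	LockBufHdr(bufHdr);

	if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
	{
		/* nothing to write for this buffer */
		UnlockBufHdr(bufHdr);
		return result;
	}

	/*
	 * Pin it, share-lock it, write it.  PinBuffer_Locked() bumps the
	 * refcount under the already-held buffer header spinlock and then
	 * releases that spinlock; no buffer mapping partition lock is
	 * taken anywhere on this path.
	 */
	PinBuffer_Locked(bufHdr);
	LWLockAcquire(bufHdr->content_lock, LW_SHARED);

	FlushBuffer(bufHdr, NULL);

	LWLockRelease(bufHdr->content_lock);
	UnpinBuffer(bufHdr, true);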

We have checked all the functions that work with refcount and usage_count.

Thanks! ^_^
-- 
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers