Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-27 Thread Andres Freund
On 2014-08-26 22:19:47 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  The biggest concern previously were some benchmarks. I'm not entirely
  sure where to get a good testcase for this that's not completely
  artificial - most simpler testcases don't pin many buffers.
 
 FWIW, I think that's by design; we don't ever want to pin more than one
 buffer per relation+index in use in a given query.

Right.

 You could certainly
 build complicated queries joining many tables in order to push up the
 number of pinned buffers, but whether buffer pin manipulations would be
 the bottleneck in such cases is pretty dubious.

Yea, I actually tried that and I didn't see anything.
 
 I would say that the issue most deserving of performance testing is your
 sizing of the linear-search array --- it's not obvious that 8 is a good
 size.

It's about the size of a cacheline on all common architectures, that's
how I found it. I don't think it makes a very big difference whether we
make it 4 or 12, but outside of that range I think it'll be unlikely to
be beneficial. The regression tests never go above three or four pins or
so currently, so I think that's a number unlikely to be crossed regularly
in practice.

 Another thing to think about: a way to get to larger numbers of pinned
 buffers without especially-complex queries is to have nested queries,
 such as SQL queries inside plpgsql functions inside outer queries.

What I did was hack together a pgbench script that does a lot of
DECLARE c_01 CURSOR FOR SELECT * FROM pg_attribute WHERE ctid = '(0, 1)';
FETCH NEXT FROM c_01;

I couldn't measure any larger slowdown for the new code than for the old
one (and that path has to be executed for every xact).

 Does the patch logic take any advantage of the locality-of-reference
 that will occur in such scenarios?

Yes. Whenever a buffer is pinned/unpinned that's not in the array it'll
displace an entry from the array into the hashtable. Even though the
replacement is simplistic/linear I think that should nearly always end
up with the last used buffers in the array.
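
To sketch the idea (illustrative names and a trimmed-down lookup; the
overflow hash table and the displacement logic are elided here):

#include <stddef.h>

#define REFCOUNT_ARRAY_ENTRIES 8	/* roughly one cache line */

typedef struct PrivateRefCountEntry
{
	int			buffer;			/* buffer id, 0 while the slot is unused */
	int			refcount;		/* this backend's pin count on the buffer */
} PrivateRefCountEntry;

static PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];

/*
 * Fast path: scan the cache-line sized array.  A miss means the caller
 * consults the overflow hash table and, when pinning, displaces one array
 * entry into it, so recently used buffers tend to stay in the array.
 */
static PrivateRefCountEntry *
GetRefCountFromArray(int buffer)
{
	int			i;

	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
	{
		if (PrivateRefCountArray[i].buffer == buffer)
			return &PrivateRefCountArray[i];
	}
	return NULL;
}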

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-27 Thread Andres Freund
On 2014-08-26 22:04:03 -0400, Robert Haas wrote:
 On Tue, Aug 26, 2014 at 7:52 PM, Andres Freund and...@2ndquadrant.com wrote:
  Here's the next version of this patch.
 
 + *much never requried. So we keep a small array of reference counts
 
 Typo.  But I think you could just drop the whole sentence about how
 things used to be, especially since it's recapitulated elsewhere.

Ok. I actually wonder about chucking out the whole explanation in
buf_init.c. There's been something there historically, but it's not
really a better place than just keeping everything in bufmgr.c.

 +#define REFCOUNT_ARRAY_ENTRIES 8	/* one full cacheline */
 
 Obviously that's not always going to be the case.  You could say
 "about", or just drop the comment.  Shouldn't "cache line" be two
 words?

Ok, will make it /* one cache line in common architectures */ - I want
the reasoning for the current size somewhere...

 + * refcounts are kept track of in the array, after that new array entries
 
 s/, after that/; after that,/
 
 +if (!found && !create)
 +else if (!found && free != NULL)
 +else if (!found)
 +else if (found && !do_move)
 +else if (found && free != NULL)
 +else if (found)
 +Assert(false); /* unreachable */
 +return res;
 
 There's not much point in testing found when you've already handled
 the not-found cases.  But I'd reorganize this whole thing like this:
 
 if (!found) { if (!create) { return; } if (free != NULL) { stuff;
 return }; stuff; return; }
 if (!do_move) { return; } if (free != NULL) { stuff; return; } stuff; return;

The current if () ... isn't particularly nice, I agree.
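
Spelled out with a toy entry type (just the control flow; the slot handling
below is a stand-in, not the patch's actual code):

#include <stdbool.h>
#include <stddef.h>

typedef struct RefCountEntry
{
	int			buffer;
	int			refcount;
} RefCountEntry;

/* Toy stand-in showing only the early-return structure suggested above. */
static RefCountEntry *
get_entry(RefCountEntry *res, RefCountEntry *free_slot,
		  bool found, bool create, bool do_move, int buffer)
{
	if (!found)
	{
		if (!create)
			return NULL;			/* caller only wanted an existing entry */
		if (free_slot != NULL)
		{
			free_slot->buffer = buffer;	/* claim the free array slot */
			free_slot->refcount = 0;
			return free_slot;
		}
		/* no free array slot: a real implementation would insert into
		 * the overflow hash table here */
		return NULL;
	}

	if (!do_move)
		return res;					/* found, and allowed to leave it in place */

	if (free_slot != NULL)
	{
		*free_slot = *res;			/* pull the entry back into the array */
		res->buffer = 0;
		return free_slot;
	}
	return res;						/* found, but nowhere better to move it */
}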

 That's all I see on a first-read through.  There might be other
 issues, and I haven't checked through it in great detail for mundane
 bugs, but generally, I favor pressing on relatively rapidly toward a
 commit.  It seems highly likely that this idea is a big win, and if
 there's some situation in which it's a loss, we're more likely to find
 out with it in the tree (and thus likely to be tested by many more
 people) than by analysis from first principles.

I agree. As long as people are happy with the approach I think we can
iron out performance edge cases later.

I'll try to send a cleaned up version soon. I'm currently wondering
about adding some minimal regression test coverage for this. What I have
right now is stuff like
DECLARE c_01 CURSOR FOR SELECT * FROM pg_attribute WHERE ctid = '(0, 1)';
DECLARE c_02 CURSOR FOR SELECT * FROM pg_attribute WHERE ctid = '(1, 1)';
...
FETCH NEXT FROM c_01;
FETCH NEXT FROM c_02;
...
CLOSE c_01;
...

While that provides some coverage, I'm unconvinced that it's appropriate
for the regression tests?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-27 Thread Andres Freund
On 2014-08-26 20:44:32 -0500, Jim Nasby wrote:
 These results look very encouraging, especially thinking about the
 cache impact.

Yep. I've seen PrivateRefCount array accesses show up prominently as a
source of cache misses on big servers.

 It occurs to me that it'd also be nice to have some
 stats available on how this is performing; perhaps a dtrace probe for
 whenever we overflow to the hash table, and one that shows maximum
 usage for a statement? (Presumably that's not much extra code or
 overhead...)

I don't use dtrace, so *I* won't do that. Personally I just dynamically
add probes using perf probe when I need to track something like this.

I don't see how you could track maximum usage without more
complications/slowdowns than warranted.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2014-08-27 Thread Andres Freund
On 2014-08-27 11:00:56 +0900, Michael Paquier wrote:
 On Tue, Jan 21, 2014 at 10:12 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  I have realigned this patch with latest head (d2458e3)... In case
  someone is interested at some point...
 Attached is a patch for REINDEX CONCURRENTLY rebased on HEAD
 (d7938a4), as some people are showing interest in it by reading recent
 discussions. Patch compiles and passes regression as well as isolation
 tests.

Can you add it to the next CF? I'll try to look earlier, but can't
promise anything.

I very much would like this to get committed in some form or another.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Proposal for CSN based snapshots

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 08:23 AM, Jeff Davis wrote:

On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:

Yeah. This patch in the current state is likely much much slower than
unpatched master, except in extreme cases where you have thousands of
connections and short transactions so that without the patch, you spend
most of the time acquiring snapshots.


What else are you looking to accomplish with this patch during this
'fest? Bug finding? Design review? Performance testing?


Design review, mostly. I know the performance still sucks. Although if 
you can foresee some performance problems, aside from the extra CSNLOG 
lookups, it would be good to know.



I think there's already at least one design issue to consider, which is
whether we care about CLOG/CSNLOG access for hinted records where the
xid < snapshot->xmin (that is, accesses that previously would have
looked at xip). Would more discussion help here or do we need to wait
for performance numbers?


I think my current plan is to try to make that CSNLOG lookup fast. In 
essence, rewrite SLRUs to be more performant. That would help with the 
current clog, too, which would make it more feasible to set hint bits 
less often. In particular, avoid dirtying pages just to set hint bits. 
I'm not sure if that's enough - you can't beat checking a single hint 
bit in the same cache line as the XID - but I think it's worth a try.
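
The hint-bit argument in a nutshell: the common case is one flag test on 
data that is already in cache, while the fallback is an SLRU lookup that 
may go to disk. A simplified sketch, not the actual PostgreSQL macros:

#include <stdbool.h>

typedef unsigned int TransactionId;

typedef struct TupleHeader
{
	TransactionId xmin;			/* inserting transaction */
	unsigned short infomask;	/* hint bits live here, next to xmin */
} TupleHeader;

#define XMIN_COMMITTED 0x0100	/* illustrative flag value */

/* Stand-in for the CLOG/CSNLOG lookup; the real one may hit disk. */
static bool
slru_xid_committed(TransactionId xid)
{
	(void) xid;
	return true;
}

static bool
xmin_committed(TupleHeader *tup)
{
	/* fast path: one bit test, same cache line as the xid itself */
	if (tup->infomask & XMIN_COMMITTED)
		return true;
	/* slow path: the lookup the CSN patch would need to make fast */
	return slru_xid_committed(tup->xmin);
}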


- Heikki





Re: [HACKERS] Proposal for CSN based snapshots

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 09:40 AM, Heikki Linnakangas wrote:

On 08/27/2014 08:23 AM, Jeff Davis wrote:

On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:

Yeah. This patch in the current state is likely much much slower than
unpatched master, except in extreme cases where you have thousands of
connections and short transactions so that without the patch, you spend
most of the time acquiring snapshots.


What else are you looking to accomplish with this patch during this
'fest? Bug finding? Design review? Performance testing?


Design review, mostly. I know the performance still sucks. Although if
you can foresee some performance problems, aside from the extra CSNLOG
lookups, it would be good to know.


I think for this commitfest, I've gotten as much review of this patch 
as I can hope for. Marking as Returned with Feedback. But of course, 
feel free to continue reviewing and commenting ;-).


- Heikki





Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2014-08-27 Thread Michael Paquier
On Wed, Aug 27, 2014 at 3:41 PM, Andres Freund and...@2ndquadrant.com wrote:
 Can you add it to the next CF? I'll try to look earlier, but can't
 promise anything.

 I very much would like this to get committed in some form or another.
Added it here to keep track of it:
https://commitfest.postgresql.org/action/patch_view?id=1563
Regards,
-- 
Michael




Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-27 Thread Magnus Hagander
On Wed, Aug 27, 2014 at 5:16 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 6:16 AM, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Aug 26, 2014 at 10:46 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2014-08-26 16:41:44 -0400, Peter Eisentraut wrote:
 On 8/26/14 12:40 PM, Magnus Hagander wrote:
  I think the first reason is gone now, and the risk/damage of the two
  connections is probably smaller than running out of WAL. -x is a good
  default for smaller systems, but -X is a safer one for bigger ones. So
  I agree that changing the default mode would make sense.

 I would seriously consider just removing one of the modes.  Having two
 modes is complex enough, and then having different defaults in different
 versions, and fuzzy recommendations like "it's better for smaller
 systems", it's quite confusing.

 Happy with removing the option and just accepting -X for backward
 compat.

 Works for me - this is really the cleaner way of doing it...

 We cannot use -X stream with tar output format mode. So I'm afraid that
 removing -X fetch would make people using tar output format feel disappointed.
 Or we should make -X stream work with tar mode.

Ah, yes, I've actually had that on my TODO for some time.

I think the easy way of doing that is to just create an xlog.tar file.
Since we already create base.tar and possibly n*tablespace.tar,
adding one more file shouldn't be a big problem, and would make such
an implementation much easier. Would be trivial to do .tar.gz for it
as well, just like for the others.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 02:26 AM, Kevin Grittner wrote:

spi-tuplestore-registry allows tuplestores, with associated name
and TupleDesc, to be registered with the current SPI connection.
Queries planned or executed on that connection will recognize the
name as a tuplestore relation.  It doesn't care who is registering
the tuplestores or what happens to them.  It doesn't depend on
anything else.
5 files changed, 445 insertions(+)

...

plpgsql-after-trigger-transition-tables takes the tuplestores from
TriggerData and registers them with SPI before trigger planning and
execution.  It depends on the trigger-transition-tables and
spi-tuplestore-registry patches to build, and won't do anything
useful at run time without the executor-tuplestore-relations patch.
3 files changed, 37 insertions(+), 11 deletions(-)


This is a surprising way to expose the NEW/OLD relations to the 
planner/executor. The problem is the same as with making PL/pgSQL 
variables available to the planner/executor in queries within a PL/pgSQL 
function, and the solution we have for that is the parser hooks you 
pass to SPI_prepare_params. This tuplestore registry is a different 
solution to the same problem - we could've implemented parameters with a 
registry like this as well. Let's not mix two different designs.


I suggest adding a new hook to the ParseState struct, (p_rangevar_hook 
?). The planner calls it whenever it sees a reference to a table, and 
the hook function returns back some sort of placeholder reference to the 
tuplestore. With variables, the hook returns a Param node, and at 
execution time, the executor calls the paramFetch hook to fetch the 
value of the param. For relations/tuplestores, I guess we'll need to 
invent something like a Param node, but for holding information about 
the relation. Like your TsrData struct, but without the pointer to the 
tuplestore. At execution time, in the SPI_execute call, you pass the 
pointer to the tuplestore in the ParamListInfo struct, like you pass 
parameter values.


Does this make sense? In essence, make the relations work like PL/pgSQL 
variables do. If you squint a little, the new/old relation is a variable 
from the function's point of view, and a parameter from the 
planner/executor's point of view. It's just a variable/parameter that 
holds a set of tuples, instead of a single Datum.
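
To make the shape of that concrete (purely hypothetical declarations; none 
of these names exist in PostgreSQL, they only illustrate the parse-time 
placeholder / execution-time fetch split):

/*
 * A parse-time hook resolves an otherwise unknown relation name to a
 * placeholder (like a Param, but describing a set of tuples), and an
 * execution-time hook hands the executor the actual tuplestore, like
 * paramFetch does for scalar parameter values.
 */
typedef struct TuplestoreRef
{
	int			tsrno;			/* which registered tuplestore */
	/* tuple descriptor, column names, etc. would live here */
} TuplestoreRef;

/* parse time: called when the parser sees an unknown relation name */
typedef TuplestoreRef *(*RangeVarHook) (void *pstate, const char *refname);

/* execution time: called to obtain the tuplestore itself */
typedef void *(*TuplestoreFetchHook) (void *paramlistinfo, int tsrno);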


- Heikki





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello Andres,


[...]
I think you're misunderstanding how spread checkpoints work.


Yep, definitely:-) On the other hand I thought I was seeking something 
simple, namely correct latency under small load, that I would expect out 
of the box.


What you describe is reasonable, and is more or less what I was hoping 
for, although I thought that bgwriter was involved from the start and 
checkpoint would only do what is needed in the end. My mistake.


When the checkpointer process starts a spread checkpoint it first writes 
all buffers to the kernel in a paced manner.
That pace is determined by checkpoint_completion_target and 
checkpoint_timeout.


This pacing does not seem to work, even at a slow pace.


If you have a stall of roughly the same magnitude (say a factor
of two different), the smaller once a minute, the larger once an
hour. Obviously the once-an-hour one will have a better latency in many,
many more transactions.


I do not believe that delaying writes to disk as much as possible is a 
viable strategy for handling a small load.  However, to show my good will, 
I have tried to follow your advice: I've launched a 5000 second test with 
50 segments, 30 min timeout, 0.9 completion target, at 25 tps, which is 
less than 1/10 of the maximum throughput.


There are only two time-triggered checkpoints:

  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 48725 buffers (47.6%);
  1 transaction log file(s) added, 0 removed, 0 recycled;
  write=1619.750 s, sync=27.675 s, total=1647.932 s;
  sync files=14, longest=27.593 s, average=1.976 s

  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 22533 buffers (22.0%);
  0 transaction log file(s) added, 0 removed, 23 recycled;
  write=826.919 s, sync=9.989 s, total=837.023 s;
  sync files=8, longest=6.742 s, average=1.248 s

For the first one, 48725 buffers is 380MB. 1800 * 0.9 = 1620 seconds to 
complete, so it means 30 buffer writes per second... should be ok. However 
sync costs 27 seconds nevertheless, and the server was more or less 
offline for about 30 seconds flat. For the second one, 180 MB to write, 10 
seconds offline. For some reason the target time is reduced. I have also 
tried with the deadline IO scheduler, which makes more sense than the 
default cfq, but the result was similar. Not sure how software RAID 
interacts with IO scheduling, though.


Overall result: over the 5000s test, I have lost (i.e. more than 200ms 
behind schedule) more than 2.5% of transactions (1/40). Due to the 
unfinished cycle, the long term average is probably about 3%. Although it 
is better than 10%, it is not good. I would expect/hope for something 
pretty close to 0, even with ext4 on Linux, for a dedicated host which has 
nothing else to do but handle two dozen transactions per second.


Current conclusion: I have not found any way to bring the situation to a 
good state with parameters from the configuration. Currently a small load 
results in periodic offline time, which can be delayed but not avoided. The 
delaying tactic results in less frequent but longer downtime. I prefer 
frequent very short downtime instead.


I really think that something is amiss. Maybe pg does not handle pacing as 
it should.


For the record, a 25tps bench with a small config (default 3 segments, 
5min timeout, 0.5 completion target) and with the following running in 
parallel:


while true ; do echo "CHECKPOINT;"; sleep 0.2s; done | psql

results in losing only 0.01% of transactions (12 transactions out of 
125893 were behind by more than 200ms in 5000 seconds). Although you may 
think it stupid, from my point of view it shows that it is possible to 
coerce pg to behave.


With respect to the current status:

(1) the ability to put checkpoint_timeout to values smaller than 30s could 
help, although obviously there would be other consequences. But the 
ability to avoid periodic offline time looks like a desirable objective.


(2) I still think that a parameter to force bgwriter to write more stuff 
could help, but this is not tested.


(3) Any other effective idea to configure for responsiveness is welcome!

If someone wants to repeat these tests, it is easy and only takes a few 
minutes:


  sh> createdb test
  sh> pgbench -i -s 100 -F 95 test
  sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out

Note: the -L option to limit latency is a submitted patch. Without this, 
unresponsiveness shows up as increasing lag time.


--
Fabien.




Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 08:13 AM, Jeff Davis wrote:

On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:

your patch seems to be about 2x-3x as slow as unpatched master. So this
needs some optimization. A couple of ideas:


I didn't see anywhere near that kind of regression. On unpatched master,
with your test case, I saw it stabilize to about 680ms. With
similar-escape-1, I saw about 775ms (15% regression). Are those at all
close to your numbers? Is there a chance you used an unoptimized build
for one of them, or left asserts enabled?


Oh. I can't now reproduce my earlier results either, I must've messed up 
something. I'm now seeing similar numbers as you.



Attached a patch implementing the same idea though: only use the
multibyte path if *both* the escape char and the current character from
the pattern are multibyte.

I also changed the comment to more clearly state the behavior upon which
we're relying. I hope what I said is accurate.


s/the the/the/. Other than that, looks good to me.

- Heikki





Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 03:47 AM, Rukh Meski wrote:

Hi Fabien,

On Tue, Aug 26, 2014 at 04:07 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:


Please find attached a new version which fixes these two points.


Indeed it does.  Marking the patch ready for a committer.


I find the definition of the latency limit a bit strange. It's a limit 
on how late a transaction can *start* compared to its scheduled 
starting time, not how long a query is allowed to last. How do you figure 
out what it should be set to?


That model might make some sense if you think e.g. of a web application, 
where the web server has a timeout for how long it waits to get a 
database connection from a pool, but once a query is started, the 
transaction is considered a success no matter how long it takes. The 
latency limit would be that timeout. But I think a more useful model is 
that when the user clicks a button, he waits at most X seconds for the 
result. If that deadline is exceeded, the web server will give a 404, or 
the user will simply get bored and go away, and the transaction is 
considered a failure.


So I think a more useful model is that new queries arrive at a given 
rate, and each query is expected to finish in X milliseconds from its 
arrival time (i.e the time the query is scheduled to begin, not the time 
it was sent to the server) or it's counted as failed. If a transaction 
cannot even be started by that deadline, because the connection is still 
busy with the previous query, it's counted as failed without even 
sending it to the server.


With that definition, it makes sense to specify the latency limit even 
without --rate. In that case, it's simply a limit on how long each 
query's execution is allowed to last until it's considered as failed. 
IOW, each query's scheduled start time is when the previous query ends.
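
In code terms the accounting would be roughly this (hypothetical helpers, 
not pgbench code):

#include <stdbool.h>
#include <stdint.h>

/*
 * The deadline is measured from the scheduled start time, whether or not
 * the connection was free at that moment.
 */

/* Before sending: can the transaction still meet its deadline at all? */
static bool
should_skip(int64_t scheduled_start_us, int64_t now_us, int64_t limit_us)
{
	return now_us >= scheduled_start_us + limit_us;
}

/* After execution: did it finish within the limit? */
static bool
is_late(int64_t scheduled_start_us, int64_t finish_us, int64_t limit_us)
{
	return finish_us > scheduled_start_us + limit_us;
}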


- Heikki





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 09:32:16 +0200, Fabien COELHO wrote:
 
 Hello Andres,
 
 [...]
 I think you're misunderstanding how spread checkpoints work.
 
 Yep, definitely:-) On the other hand I thought I was seeking something
 simple, namely correct latency under small load, that I would expect out
 of the box.

Yea. The current situation *sucks*. Both from the utterly borked
behaviour of ext4 and other filesystems and the lack of workaround from 
postgres.

 When the checkpointer process starts a spread checkpoint it first writes
 all buffers to the kernel in a paced manner.
 That pace is determined by checkpoint_completion_target and
 checkpoint_timeout.
 
 This pacing does not seem to work, even at slow pace.

It definitely does in some cases. What's your evidence the pacing
doesn't work? Afaik it's the fsync that causes the problem, not the
writes themselves.

 If you have a stall of roughly the same magnitude (say a factor
 of two different), the smaller once a minute, the larger once an
 hour. Obviously the once-an-hour one will have a better latency in many,
 many more transactions.
 
 I do not believe that delaying writes to disk as much as possible is a
 viable strategy for handling a small load.  However, to show my good will,
 I have tried to follow your advice: I've launched a 5000 second test with 50
 segments, 30 min timeout, 0.9 completion target, at 25 tps, which is less
 than 1/10 of the maximum throughput.
 
 There are only two time-triggered checkpoints:
 
   LOG:  checkpoint starting: time
   LOG:  checkpoint complete: wrote 48725 buffers (47.6%);
   1 transaction log file(s) added, 0 removed, 0 recycled;
   write=1619.750 s, sync=27.675 s, total=1647.932 s;
   sync files=14, longest=27.593 s, average=1.976 s
 
   LOG:  checkpoint starting: time
   LOG:  checkpoint complete: wrote 22533 buffers (22.0%);
   0 transaction log file(s) added, 0 removed, 23 recycled;
   write=826.919 s, sync=9.989 s, total=837.023 s;
   sync files=8, longest=6.742 s, average=1.248 s

The write pacing itself doesn't seem to be bad. The bad thing is the
'sync' times here. Those are *NOT* paced, and the kernel has probably
delayed flushing out much of the writes...

 (1) the ability to put checkpoint_timeout to values smaller than 30s could
 help, although obviously there would be other consequences. But the ability
 to avoid periodic offline time looks like a desirable objective.

I'd rather not do that. It's an utterly horrible hack to go this route.

 (2) I still think that a parameter to force bgwriter to write more stuff
 could help, but this is not tested.

It's going to be random writes. That's not going to be helpful.

 (3) Any other effective idea to configure for responsiveness is welcome!

I've a couple of ideas how to improve the situation, but so far I've not
had the time to investigate them properly. Would you be willing to test
a couple of simple patches?

Did you test xfs already?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


[...] What's your evidence the pacing doesn't work? Afaik it's the fsync 
that causes the problem, not the writes themselves.


Hmmm. My (poor) understanding is that fsync would work fine if everything 
was already written beforehand:-) that is, it has nothing to do but confirm 
that all is already written. If there is remaining write work, it starts 
doing it now, with the disastrous effects I'm complaining about.


When I say the pacing does not work, I mean that things were not written 
out to disk by the OS; it does not mean that pg did not ask for it.


However it does not make much sense for an OS scheduler to wait several 
minutes with tens of thousands of pages to write and do nothing about 
it... So I'm wondering.



[...]

(1) the ability to put checkpoint_timeout to values smaller than 30s could
help, although obviously there would be other consequences. But the ability
to avoid periodic offline time looks like a desirable objective.


I'd rather not do that. It's a utterly horrible hack to go this write.


Hmmm. It does solve the issue, though:-) It would be the administrator's 
choice. It is better than nothing, which is the current status.



(2) I still think that a parameter to force bgwriter to write more stuff
could help, but this is not tested.


It's going to be random writes. That's not going to be helpful.


The -N small OLTP load on a large (GB) table *is* random writes anyway, 
whether they occur at checkpoint or at any other time. Random writes are 
fine in this case, the load is small, there should be no problem.


(3) Any other effective idea to configure for responsiveness is 
welcome!


I've a couple of ideas how to improve the situation, but so far I've not
had the time to investigate them properly. Would you be willing to test
a couple of simple patches?


I can test a couple of patches. I already did one on someone's advice (make 
bgwriter round all stuff in 1s instead of 120s), without positive effect.



Did you test xfs already?


No. I cannot without reinstalling, which I cannot do on a remote host, and 
I will probably not have time to do it when I have physical access. 
Only one partition on the host. My mistake. Will not do it again. Shame on 
me.


If someone out there has an XFS setup, it is very easy to test and only 
takes a couple of minutes, really. It takes less time to do it than to 
write a mail about it afterwards:-)


I have tested FreeBSD/UFS with similar results, a few periodic offlines. 
UFS journaled file system is probably not ideal for database work, but yet 
again the load is small, it should be able to cope without going offline.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
 I can test a couple of patches. I already did one on someone's advice (make
 bgwriter round all stuff in 1s instead of 120s), without positive effect.

I've quickly cobbled together the attached patch (which at least doesn't
seem to crash & burn). It tries to trigger pages being flushed out
during the paced phase of checkpoints instead of the fsync phase. The
sync_on_checkpoint_flush can be used to enable/disable that behaviour.

I'd be interested to hear whether that improves your latency numbers. I
unfortunately don't have more time to spend on this right now :(.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-27 Thread David Rowley
On Wed, Aug 27, 2014 at 1:40 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 08/26/2014 03:28 PM, David Rowley wrote:

 Any ideas or feedback on this would be welcome


 Before someone spends time reviewing this patch, are you sure this is
 worth the effort? It seems like a very narrow use case to me. I understand
 removing LEFT and INNER joins, but the case for SEMI and ANTI joins seems a
 lot thinner. Unnecessary LEFT and INNER joins can easily creep into a query
 when views are used, for example, but I can't imagine that happening for a
 SEMI or ANTI join. Maybe I'm lacking imagination. If someone has run into a
 query in the wild that would benefit from this, please raise your hand.


I agree that the use case for removal of SEMI and ANTI joins is a lot
thinner than for LEFT and INNER joins. My longer term goal here is to add join
removal support for INNER joins. In order to do this I need the foreign key
infrastructure which is included in this patch. I held back from just going
ahead and writing the INNER JOIN removal patch as I didn't want to waste
the extra effort in doing that if someone was to find a show stopper
problem with using foreign keys the way I am with this patch. I was kind of
hoping someone would be able to look at this patch a bit more and confirm
whether or not it's safe to do this before I go ahead and write the
inner join version.


 If I understood correctly, you're planning to work on INNER join removal
 too. How much of the code in this patch is also required for INNER join
 removal, and how much is specific to SEMI and ANTI joins?


Apart from the extra lines of code in remove_useless_joins(), there are 3
functions added here which won't be needed at all for INNER
JOINs; semiorantijoin_is_removable(), convert_semijoin_to_isnotnull_quals()
and convert_antijoin_to_isnull_quals(). Not including the regression tests,
this is 396 lines with comments and 220 lines without. All of these
functions are static and in analyzejoins.c.

The benchmarks I posted a few weeks back show that the overhead of
performing the semi/anti join removal checks is quite low. I measured an
extra 400 or so nanoseconds for a successful removal on my i5 laptop. Or
just 15 nanoseconds on the earliest fast path for a non-removal. This
accounted for between 0.008% and 0.2% of planning time for the queries I
tested.

Regards

David Rowley


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
 On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
  I can test a couple of patches. I already did one on someone's advice (make
  bgwriter round all stuff in 1s instead of 120s), without positive effect.
 
 I've quickly cobbled together the attached patch (which at least doesn't
 seem to crash & burn). It tries to trigger pages being flushed out
 during the paced phase of checkpoints instead of the fsync phase. The
 sync_on_checkpoint_flush can be used to enable/disable that behaviour.
 
 I'd be interested to hear whether that improves your latency numbers. I
 unfortunately don't have more time to spend on this right now :(.

And actually attached. Note that it's linux only...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4576970..eab30a6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -78,6 +78,7 @@ bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
 double		bgwriter_lru_multiplier = 2.0;
 bool		track_io_timing = false;
+bool		sync_on_checkpoint_flush = false;
 
 /*
  * How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -356,7 +357,7 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int	SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
 static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -369,7 +370,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
 			bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -947,7 +948,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
			  smgr->smgr_rnode.node.relNode);
 
-FlushBuffer(buf, NULL);
+FlushBuffer(buf, NULL, false);
 LWLockRelease(buf->content_lock);
 
 TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1573,7 +1574,7 @@ BufferSync(int flags)
 		 */
 		if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+			if (SyncOneBuffer(buf_id, false, true) & BUF_WRITTEN)
 			{
 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
 BgWriterStats.m_buf_written_checkpoints++;
@@ -1850,7 +1851,7 @@ BgBufferSync(void)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			buffer_state = SyncOneBuffer(next_to_clean, true);
+		int			buffer_state = SyncOneBuffer(next_to_clean, true, false);
 
 		if (++next_to_clean = NBuffers)
 		{
@@ -1927,7 +1928,7 @@ BgBufferSync(void)
  * Note: caller must have done ResourceOwnerEnlargeBuffers.
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk)
 {
 	volatile BufferDesc *bufHdr = BufferDescriptors[buf_id];
 	int			result = 0;
@@ -1966,7 +1967,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 	PinBuffer_Locked(bufHdr);
 	LWLockAcquire(bufHdr->content_lock, LW_SHARED);
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, flush_to_disk);
 
 	LWLockRelease(bufHdr-content_lock);
 	UnpinBuffer(bufHdr, true);
@@ -2230,7 +2231,7 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
  * as the second parameter.  If not, pass NULL.
  */
 static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
@@ -2342,6 +2343,10 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
 	   reln->smgr_rnode.node.spcNode,
 	   reln->smgr_rnode.node.dbNode,
 	   reln->smgr_rnode.node.relNode);
+	if (flush_to_disk && sync_on_checkpoint_flush)
+		smgrflush(reln,
+  buf->tag.forkNum,
+  buf->tag.blockNum);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
@@ -2769,7 +2774,7 @@ FlushRelationBuffers(Relation rel)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(bufHdr-content_lock, LW_SHARED);
-			FlushBuffer(bufHdr, rel->rd_smgr);
+			

Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Fabien COELHO


Hello Heikki,

I find the definition of the latency limit a bit strange. It's a limit on how 
late a transaction can *start* compared to its scheduled starting time, not 
how long a query is allowed to last.


Yes. This is what can be done easily with pgbench under throttling. Note 
that if transactions take long, that is recorded (average, stddev...) so it 
appears elsewhere.



How do figure out what it should be set to?


It is really just a simple tool to measure unresponsiveness under 
throttling, which I'm testing and complaining about in another thread.


The underlying model I have in mind would be some timeout from an 
application, say a web server, or a pooling process which is handling a 
queue of requests...


Now, if I describe that as a "lag limit" instead of a "latency limit", maybe
it is clearer and better?


That model might make some sense if you think e.g. of a web application,
[...]


Yep, that is what I had in mind, but the primary objective is really to 
check whether pg is responsive or not.


So I think a more useful model is that new queries arrive at a given 
rate, and each query is expected to finish in X milliseconds from its 
arrival time (i.e the time the query is scheduled to begin, not the time 
it was sent to the server) or it's counted as failed. If a transaction 
cannot even be started by that deadline, because the connection is still 
busy with the previous query, it's counted as failed without even 
sending it to the server. With that definition, it makes sense to 
specify the latency limit even without --rate.


Yep. But that is not what I'm doing here. It would be interesting as well. 
It would be another patch.


In that case, it's simply a limit on how long each query's 
execution is allowed to last until it's considered as failed. IOW, each 
query's scheduled start time is when the previous query ends.


Not under --rate... that is the point of throttling!  Under throttling, 
the latency should really be computed wrt the scheduled start time and 
not the actual start time, which may be 10 seconds afterwards when things 
are going bad... Also, there is the question of whether the failed query 
is executed or not. Here I'm not executing them, in effect they were 
aborted by the application. With your suggestion they would be executed 
anyway but considered failed.


So what you are suggesting is another (interesting) functionality, which 
could indeed be named "latency limit" (count queries slower than a 
threshold); what I'm doing here is a "lag limit" (scheduled queries that could 
not start on time are skipped; this is really specific to --rate).


In the updated patch attached, I changed the explanations, documentation 
and name to "lag limit" instead of "latency limit" to clarify this point. 
It was really a misnomer.


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..d7ee1c9 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,13 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * When under throttling, execution time slots which are more than
+ * this late (in us) are skipped, and the corresponding transaction
+ * will be counted as somehow aborted.
+ */
+int64		throttle_lag_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +245,7 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_lag_skipped; /* lagging transactions skipped */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +258,7 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_lag_skipped;
 } TResult;
 
 /*
@@ -367,6 +376,9 @@ usage(void)
 		   -f, --file=FILENAME  read transaction script from FILENAME\n
 		 -j, --jobs=NUM   number of threads (default: 1)\n
 		 -l, --log            write transaction times to log file\n
+		 -L, --lag-limit=NUM  under throttling (--rate), skip transactions that\n
+		  are lagging more than NUM ms behind schedule\n
+		  (default: do not skip)\n
 		 -M, --protocol=simple|extended|prepared\n
 		  protocol for submitting queries (default: simple)\n
 		 -n, --no-vacuum  do not run VACUUM before tests\n
@@ -1016,6 +1028,24 @@ top:
 
 		thread->throttle_trigger += wait;
 
+		if (throttle_lag_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - throttle_lag_limit)
+			{
+/* if too far behind, this slot is skipped, and we
+ * iterate till the next nearly on time slot.
+ */
+int64 wait = (int64) (throttle_delay *
+	1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+

[HACKERS] re-reading SSL certificates during server reload

2014-08-27 Thread Alexey Klyukin
Greetings,

Is there a strong reason to disallow reloading server key and cert files
during the PostgreSQL reload?

Basically, once you run multiple databases in a cluster and use different
DNS names to connect to different databases (in order for those databases
to be moved somewhere without changing the client code), and enable SSL
certificate checking, the problem becomes evident: in order to add a new
database to the existing cluster you have to add its name to the SSL
certificate for the server, and in order for these changes to come into
effect you have to restart the server.

In the documentation for the server cert and key files there is a notice that
these parameters can only be changed at server start. It seems that
the only place the backend certificates are loaded is inside
secure_initialize, which, in turn, calls initialize_SSL().

From my point of view, I see nothing preventing separation of the
certificate reload code and SSL library initialization and calling the
former during the server reload.  It might happen that with the new
certificate file some of the existing connections will be unable to
reconnect, or, if the certificate is invalid, the server will be unable to
restart, but these are the sort of problems that also happen with reloads of
pg_hba.conf, so this alone does not sound like a significant
showstopper.

-- 
Regards,
Alexey Klyukin


Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 12:41 PM, Fabien COELHO wrote:


Hello Heikki,


I find the definition of the latency limit a bit strange. It's a limit on how
late a transaction can *start* compared to its scheduled starting time, not
how long a query is allowed to last.


Yes. This is what can be done easily with pgbench under throttling. Note
that if transactions take long it is recorded (average, stddev...) so it
appears elsewhere.


How do figure out what it should be set to?


It is really just a simple tool to measure unresponsiveness under
throttling, which I'm testing and complaining about in another thread.


Ok, but wouldn't the definition I gave be just as useful for that 
purpose, and more useful in general?


You didn't really answer my question: How do you figure out what to set 
it to? With a latency limit on when the query should finish, as opposed 
to how late it can start, it's a lot easier to give a number. For 
example, your requirements might state that a user must always get a 
response to a click on a web page in 200 ms, so you set the limit to 200 ms.



So I think a more useful model is that new queries arrive at a given
rate, and each query is expected to finish in X milliseconds from its
arrival time (i.e the time the query is scheduled to begin, not the time
it was sent to the server) or it's counted as failed. If a transaction
cannot even be started by that deadline, because the connection is still
busy with the previous query, it's counted as failed without even
sending it to the server. With that definition, it makes sense to
specify the latency limit even without --rate.


Yep. But that is not what I'm doing here. It would be interesting as well.
It would be another patch.


Why is your patch more interesting than what I described? I'm pretty 
sure we don't need both.



In that case, it's simply a limit on how long each query's
execution is allowed to last until it's considered as failed. IOW, each
query's scheduled start time is when the previous query ends.


Not under --rate... that is the point of throttling!


Right, I got that. With "in that case", I meant when you're not throttling.


Under throttling,
the latency should really be computed wrt to the schedule start time and
not the actual start time which may be 10 seconds afterwards when things
are going bad... Also, there is the question of whether the failed query
is executed or not. Here I'm not executing them, in effect they were
aborted by the application. With your suggestion they would be executed
anyway but considered failed.


I was thinking that if a query is already late when the connection 
becomes free to execute it, it would not be executed. It would be 
skipped, just as in your patch.



So what you are suggesting is another (interesting) functionality, that
could indeed be named latency limit (count slow above a threshold
queries), what I'm doing here is lag limit (scheduled query could not
start on time and are skipped, this is really specific to --rate).


Ok, but *why* are you doing a lag limit, and not a latency limit? 
Under what circumstances is the lag limit a more useful setting?


- Heikki




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 11:19:22 +0200, Andres Freund wrote:
 On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
  On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
   I can test a couple of patches. I already did one on someone's advice (make
   bgwriter round all stuff in 1s instead of 120s), without positive effect.
  
  I've quickly cobbled together the attached patch (which at least doesn't
  seem to crash & burn). It tries to trigger pages being flushed out
  during the paced phase of checkpoints instead of the fsync phase. The
  sync_on_checkpoint_flush can be used to enable/disable that behaviour.
  
  I'd be interested to hear whether that improves your latency numbers. I
  unfortunately don't have more time to spend on this right now :(.
 
 And actually attached. Note that it's linux only...

I got curious and ran a quick test:

config:
log_checkpoints=on
checkpoint_timeout=1min
checkpoint_completion_target=0.95
checkpoint_segments=100
synchronous_commit=on
fsync=on
huge_pages=on
max_connections=200
shared_buffers=6GB
wal_level=hot_standby

off:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 20189
latency average: 23.136 ms
latency stddev: 59.044 ms
rate limit schedule lag: avg 4.599 (max 199.975) ms
number of skipped transactions: 1345 (6.246 %)
tps = 167.664290 (including connections establishing)
tps = 167.675679 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12754 buffers (1.6%); 0 transaction log 
file(s) added, 0 removed, 2 recycled; write=56.928 s, sync=3.639 s, 
total=60.749 s; sync files=20, longest=2.741 s, average=0.181 s
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12269 buffers (1.6%); 0 transaction log 
file(s) added, 0 removed, 6 recycled; write=20.701 s, sync=8.568 s, 
total=29.444 s; sync files=10, longest=3.568 s, average=0.856 s


on:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 21327
latency average: 20.735 ms
latency stddev: 14.643 ms
rate limit schedule lag: avg 4.965 (max 185.003) ms
number of skipped transactions: 1 (0.005 %)
tps = 177.214391 (including connections establishing)
tps = 177.225476 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12217 buffers (1.6%); 0 transaction log 
file(s) added, 0 removed, 1 recycled; write=57.022 s, sync=0.203 s, 
total=57.377 s; sync files=19, longest=0.033 s, average=0.010 s
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 13185 buffers (1.7%); 0 transaction log 
file(s) added, 0 removed, 6 recycled; write=56.628 s, sync=0.019 s, 
total=56.803 s; sync files=11, longest=0.017 s, average=0.001 s


That machine is far from idle right now, so the noise is pretty
high. But rather nice initial results.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] re-reading SSL certificates during server reload

2014-08-27 Thread Magnus Hagander
On Wed, Aug 27, 2014 at 11:56 AM, Alexey Klyukin al...@hintbits.com wrote:
 Greetings,

 Is there a strong reason to disallow reloading server key and cert files
 during the PostgreSQL reload?

Key and cert files are loaded in the postmaster. We'd need to change
that. I'm not saying that's not a good idea, but it's not as easy as
just allowing it :)


 Basically, once you run multiple databases in a cluster and use different
 DNS names to connect to different databases (in order for those databases to
 be moved somewhere without changing the client code), and enable SSL
 certificate checking, the problem becomes evident: in order to add a new
 database to the existing cluster you have to add its name to the SSL
 certificate for the server, and in order for these changes to come into
 effect you have to restart the server.

That's certainly an issue. Potentially bigger ones are that you cannot
replace an expired certificate or CRL without a restart.


 In the documentation for server cert and key file there is a notice that
 this parameter can only be reloaded during the server start. It seems that
 the only place the backend certificates are loaded is inside the
 secure_initialize, which, in order, calls initialize_SSL().

 From my point of view, I see nothing preventing separation of the
 certificate reload code and SSL library initialization and calling the
 former during the server reload.  It might happen that with the new
 certificate file that some of the existing connections will be unable to
 reconnect, or, if the certificate is invalid, the server will be unable to
 restart, but this are the sort of problems that also happen with reload of
 pg_hba.conf as well, so these alone does not sound like a significant
 showstopper.

I agree that separating this out would probably be a useful idea. We
should probably treat a failed load of certificates the same way we
do with pg_hba if we can - which is log an error and revert back to
the currently loaded one.
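
A sketch of that reload-and-revert behaviour (hypothetical stand-ins, not 
the actual backend code):

#include <stdio.h>

/* Hypothetical stand-in for whatever context the SSL library hands back. */
typedef struct SslContext { int dummy; } SslContext;

static SslContext *current_ssl_context;

/* Stand-in loader; the real code would parse ssl_cert_file/ssl_key_file. */
static SslContext *
load_ssl_context(const char *certfile, const char *keyfile)
{
	(void) certfile;
	(void) keyfile;
	return NULL;				/* pretend the new files failed to load */
}

/* On reload: try the new files, keep the old context if they don't load. */
static void
reload_ssl_files(const char *certfile, const char *keyfile)
{
	SslContext *newctx = load_ssl_context(certfile, keyfile);

	if (newctx == NULL)
	{
		fprintf(stderr, "could not load new SSL files, keeping old ones\n");
		return;					/* revert: current_ssl_context stays as is */
	}
	current_ssl_context = newctx;	/* freeing the old context omitted */
}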

Some of this is going to have to be at least partially reworked anyway
in the work that Heikki has been doing to support non-openssl
libraries. Making a change like this at the same time is probably a
good idea.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Sample LDIF for pg_service.conf no longer works

2014-08-27 Thread Magnus Hagander
On Sat, Aug 16, 2014 at 5:31 AM, Noah Misch n...@leadboat.com wrote:
 When using pg_service.conf with LDAP, we document[1] the following sample LDIF
 for populating the LDAP server:

 version:1
 dn:cn=mydatabase,dc=mycompany,dc=com
 changetype:add
 objectclass:top
 objectclass:groupOfUniqueNames
 cn:mydatabase
 uniqueMember:host=dbserver.mycompany.com
 uniqueMember:port=5439
 uniqueMember:dbname=mydb
 uniqueMember:user=mydb_user
 uniqueMember:sslmode=require

 That presumably worked at one point, but OpenLDAP 2.4.23 and OpenLDAP 2.4.39
 both reject it cryptically:

 ldap_add: Invalid syntax (21)
 additional info: uniqueMember: value #0 invalid per syntax

 uniqueMember is specified to bear a distinguished name.  While OpenLDAP does
 not verify that uniqueMember values correspond to known DNs, it does verify
 that the value syntactically could be a DN.  To give examples, o=foobar is
 always accepted, but xyz=foobar is always rejected: xyz is not an LDAP DN
 attribute type.  Amid the LDAP core schema, device is the best-fitting
 objectClass having the generality required.  Let's convert to that, as
 attached.  I have verified that this works end-to-end.

+1.

I've run into that problem as well, just not had time to prepare a
proper example in the core schema :)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Missing comment block at the top of streamutil.h and receivelog.h

2014-08-27 Thread Fujii Masao
On Wed, Aug 27, 2014 at 2:07 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 12:31 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 10:34 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 I think that it's better to add an include guard like

 #ifndef RECEIVELOG_H
 #define RECEIVELOG_H
 ...
 #endif
 Oh indeed, that was missing as well! Please see attached.

Thanks! Applied.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Fabien COELHO


Hello Heikki,


[...]
With a latency limit on when the query should finish, as opposed to how 
late it can start, it's a lot easier to give a number. For example, your 
requirements might state that a user must always get a response to a click on 
a web page in 200 ms, so you set the limit to 200 ms.


Yep. See below for the details.


[...] Why is your patch more interesting than what I described?


It is more interesting because it exists, it is short and simple, it 
works, and it is useful right now to test pg responsiveness and also to 
model some timeout behavior on the client side?



I'm pretty sure we don't need both.


Why not? Testing performance is tricky enough, the tool must be flexible.

I'm pretty sure that I'm interested in testing pg responsiveness right 
now, so I did the simpler one I need for that purpose. It somehow models 
an application/pooler queue management timeout, that would anyway proceed 
with what is already started.



[...]

I was thinking that if a query is already late when the connection becomes 
free to execute it, it would not be executed. It would be skipped, just as in 
your patch.


As for an actual latency limit under throttling, this is significantly 
more tricky and invasive to implement... ISTM that it would mean:


 - if the tx is not started and the latency is already consumed, SKIP++.

 - if the tx is after its schedule start time but under latency, then
   start it, and maybe inject a SET TIMEOUT

 - if a tx is being processed but reaches its latency limit (after
   schedule start time), abort it coldly, ROLLBACK++ (well if the tx is
   really started, there could also be shell commands and \set stuff in a
   pgbench script, which means started is not really started, so it would
   be INTERRUPT++ if no BEGIN was sent).

 - if a tx is finished but the final commit returned after the latency
   deadline, you cannot abort it anymore but it is late nevertheless,
   LATE++.

This is doable but far beyond my current needs. Moreover, I'm not sure 
that such a patch would pass because of invasiveness and complexity, so it 
could be a total loss of time.
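
Purely to illustrate the bookkeeping behind the outcomes listed above (a sketch with invented names, not pgbench code; the BEGIN-sent vs. not-sent distinction is omitted), the classification reduces to comparing timestamps against the scheduled start time plus the limit:

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { TX_SKIPPED, TX_STARTED, TX_INTERRUPTED, TX_LATE, TX_OK } TxOutcome;

    /* all times in microseconds since a common epoch */
    static TxOutcome
    classify_before_start(int64_t now, int64_t sched_start, int64_t latency_limit)
    {
        /* deadline already passed before the tx could begin: skip it */
        if (now >= sched_start + latency_limit)
            return TX_SKIPPED;
        return TX_STARTED;          /* start it; it may still end up late */
    }

    static TxOutcome
    classify_at_end(int64_t end_time, int64_t sched_start, int64_t latency_limit,
                    bool still_running)
    {
        bool        late = end_time > sched_start + latency_limit;

        if (still_running && late)
            return TX_INTERRUPTED;  /* deadline hit mid-transaction: abort it */
        return late ? TX_LATE : TX_OK;
    }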



Ok, but *why* are you doing a lag limit, and not a latency limit?


Because it is much simpler (see above) and is enough for testing pg 
responsiveness issue, which is my current objective, and models some 
client timeout behavior.


Under what circumstances is the lag limit a more useful setting?


It is not more useful per se, it is what I'm using to test pg 
unresponsiveness with a simple-to-define-and-interpret measure wrt 
throttling.


If I would do latency limit under throttling, it would be (1) more time 
to develop, more complex, more invasive in the code (see above, + also the 
implementation when not under throttling), (2) more complex to interpret, 
with at least 5 possible outcomes (skipped, interrupted, committed on 
time, committed but late, aborted), (3) this added information would not 
be useful to me.


I've submitted this simple lag limit version because being able to 
measure quickly and simply (un)responsiveness seems like a good idea, 
especially given the current state of things.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Audit of logout

2014-08-27 Thread Fujii Masao
On Sat, Aug 23, 2014 at 3:44 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Aug 5, 2014 at 8:04 PM, Fujii Masao masao.fu...@gmail.com wrote:

 Yep, the attached patch introduces PGC_SU_BACKEND and
 changes the contexts of log_connections and log_disconnections
 to PGC_SU_BACKEND. Review?


Thanks for reviewing the patch!

 1.
 ! else if (context != PGC_POSTMASTER && context != PGC_SU_BACKEND &&
 ! context != PGC_SU_BACKEND && source != PGC_S_CLIENT)

 In the above, the check for PGC_SU_BACKEND is repeated; one of the
 checks should be PGC_SU_BACKEND and the other should be PGC_BACKEND.

Right. Fixed. Attached is the updated version of the patch.
BTW, I also added the following into the document of log_connections
and log_disconnections.

Only superusers can change this setting at session start.

 2.
 + case PGC_SU_BACKEND:
 + if (context == PGC_BACKEND)
 + {
 ..
 ..
 + return 0;
 + }
   case PGC_BACKEND:
   if (context == PGC_SIGHUP)

 Changing a PGC_SU_BACKEND parameter (log_connections) is
 visible even to a non-superuser client due to the above code.
 Shouldn't it be visible only for superuser logins?

 Simple steps to reproduce the problem:
 a. start Server (default configuration)
 b. connect with superuser
 c. change in log_connections to on in postgresql.conf
 d. perform select pg_reload_conf();
 e. connect with non-super-user
 f.  show log_connections;  --This step shows the value as on,
--whereas I think it should have been
 off

In this case, log_connections is changed in postgresql.conf and it's
reloaded, so ISTM that it's natural that even a non-superuser sees the
changed value. No? Maybe I'm missing something.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 4231,4236  local0.*/var/log/postgresql
--- 4231,4237 
    <para>
  Causes each attempted connection to the server to be logged,
  as well as successful completion of client authentication.
+ Only superusers can change this setting at session start.
  This parameter cannot be changed after session start.
  The default is off.
    </para>
***
*** 4258,4263  local0.*/var/log/postgresql
--- 4259,4265 
     <varname>log_connections</varname> but at session termination,
  and includes the duration of the session.  This is off by
  default.
+ Only superusers can change this setting at session start.
  This parameter cannot be changed after session start.
    </para>
   </listitem>
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***
*** 3258,3264  get_stats_option_name(const char *arg)
   * argv[0] is ignored in either case (it's assumed to be the program name).
   *
   * ctx is PGC_POSTMASTER for secure options, PGC_BACKEND for insecure options
!  * coming from the client, or PGC_SUSET for insecure options coming from
   * a superuser client.
   *
   * If a database name is present in the command line arguments, it's
--- 3258,3264 
   * argv[0] is ignored in either case (it's assumed to be the program name).
   *
   * ctx is PGC_POSTMASTER for secure options, PGC_BACKEND for insecure options
!  * coming from the client, or PGC_SU_BACKEND for insecure options coming from
   * a superuser client.
   *
   * If a database name is present in the command line arguments, it's
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***
*** 957,963  process_startup_options(Port *port, bool am_superuser)
  	GucContext	gucctx;
  	ListCell   *gucopts;
  
! 	gucctx = am_superuser ? PGC_SUSET : PGC_BACKEND;
  
  	/*
  	 * First process any command-line switches that were included in the
--- 957,963 
  	GucContext	gucctx;
  	ListCell   *gucopts;
  
! 	gucctx = am_superuser ? PGC_SU_BACKEND : PGC_BACKEND;
  
  	/*
  	 * First process any command-line switches that were included in the
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 509,514  const char *const GucContext_Names[] =
--- 509,515 
  	 /* PGC_INTERNAL */ "internal",
  	 /* PGC_POSTMASTER */ "postmaster",
  	 /* PGC_SIGHUP */ "sighup",
+ 	 /* PGC_SU_BACKEND */ "superuser-backend",
  	 /* PGC_BACKEND */ "backend",
  	 /* PGC_SUSET */ "superuser",
  	 /* PGC_USERSET */ "user"
***
*** 907,913  static struct config_bool ConfigureNamesBool[] =
  		NULL, NULL, NULL
  	},
  	{
! 		{"log_connections", PGC_BACKEND, LOGGING_WHAT,
  			gettext_noop("Logs each successful connection."),
  			NULL
  		},
--- 908,914 
  		NULL, NULL, NULL
  	},
  	{
! 		{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
  			gettext_noop("Logs each successful connection."),
  			NULL
  		},
***
*** 916,922  static struct config_bool ConfigureNamesBool[] =
  		NULL, NULL, NULL
  	},
  	{
! 		{"log_disconnections", PGC_BACKEND, LOGGING_WHAT,
  			gettext_noop("Logs end of a 

Re: [HACKERS] Specifying the unit in storage parameter

2014-08-27 Thread Fujii Masao
On Tue, Aug 26, 2014 at 3:27 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Fujii Masao wrote:
 On Thu, Aug 21, 2014 at 4:20 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

  Looking at the patch, the parameter fillfactor in the category
  RELOPT_KIND_HEAP (the first element in intRelOpts of reloptions.c) is
  not updated with the new field. It is only a one-line change.
  @@ -97,7 +97,7 @@ static relopt_int intRelOpts[] =
  Packs table pages only to this percentage,
  RELOPT_KIND_HEAP
  },
  -   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100
  +   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100, 0
  },

 Oh, good catch. I wonder why I did such a mistake...

 Uninitialized elements at end of struct are filled with zeroes.

Yeah, that's the reason why I could not notice the problem at compile time.

 We do
 have other examples of this -- for instance, config_generic in the guc.c
 tables are almost always only 5 members long even though the struct is
 quite a bit longer than that.  Most entries do not even have flags set.

So you imply that the trailing zero (which the patch adds as a flag)
in the reloption struct should be dropped?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Simplify calls of pg_class_aclcheck when multiple modes are used

2014-08-27 Thread Michael Paquier
Hi all,

In a couple of code paths we do the following to check permissions on an
object:
if (pg_class_aclcheck(relid, userid, ACL_USAGE) != ACLCHECK_OK &&
pg_class_aclcheck(relid, userid, ACL_UPDATE) != ACLCHECK_OK)
ereport(ERROR, blah);

Wouldn't it be better to simplify that with a single call of
pg_class_aclcheck, gathering together the modes that need to be checked? In
the case above, the call to pg_class_aclcheck would become like that:
if (pg_class_aclcheck(relid, userid,
 ACL_USAGE | ACL_UPDATE) != ACLCHECK_OK)
ereport(ERROR, blah);

That's not a critical thing, but it would save some cycles. Patch is
attached.
Regards,
-- 
Michael
From e6b23e537d223e4bdb3abada2d761e630c8b27c0 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Wed, 27 Aug 2014 20:45:31 +0900
Subject: [PATCH] Minimize calls of pg_class_aclcheck to minimum necessary

In a couple of code paths, pg_class_aclcheck is called in succession with
multiple different modes set. This patch combines those modes to have a
single call of this function and reduce a bit process overhead for
permission checking.
---
 src/backend/commands/sequence.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 3b89dd0..6d5f65b 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -546,8 +546,8 @@ nextval_internal(Oid relid)
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, elm, seqrel);
 
-	if (pg_class_aclcheck(elm->relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK &&
-		pg_class_aclcheck(elm->relid, GetUserId(), ACL_UPDATE) != ACLCHECK_OK)
+	if (pg_class_aclcheck(elm->relid, GetUserId(),
+		  ACL_USAGE | ACL_UPDATE) != ACLCHECK_OK)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied for sequence %s",
@@ -759,8 +759,8 @@ currval_oid(PG_FUNCTION_ARGS)
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, elm, seqrel);
 
-	if (pg_class_aclcheck(elm->relid, GetUserId(), ACL_SELECT) != ACLCHECK_OK &&
-		pg_class_aclcheck(elm->relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK)
+	if (pg_class_aclcheck(elm->relid, GetUserId(),
+		  ACL_SELECT | ACL_USAGE) != ACLCHECK_OK)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied for sequence %s",
@@ -801,8 +801,8 @@ lastval(PG_FUNCTION_ARGS)
 	/* nextval() must have already been called for this sequence */
 	Assert(last_used_seq->last_valid);
 
-	if (pg_class_aclcheck(last_used_seq->relid, GetUserId(), ACL_SELECT) != ACLCHECK_OK &&
-		pg_class_aclcheck(last_used_seq->relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK)
+	if (pg_class_aclcheck(last_used_seq->relid, GetUserId(),
+		  ACL_SELECT | ACL_USAGE) != ACLCHECK_OK)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  errmsg("permission denied for sequence %s",
-- 
2.1.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] After switching primary server while using replication slot.

2014-08-27 Thread Fujii Masao
On Fri, Aug 22, 2014 at 11:29 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2014-08-20 13:14:30 -0400, Robert Haas wrote:
 On Tue, Aug 19, 2014 at 6:25 AM, Fujii Masao masao.fu...@gmail.com wrote:
  On Mon, Aug 18, 2014 at 11:16 PM, Sawada Masahiko sawada.m...@gmail.com 
  wrote:
  Hi all,
  After switching the primary server while using replication slots, the
  standby server will not be able to connect to the new primary server.
  Imagine this situation: the primary server has two ASYNC standby
  servers, each using its own replication slot.
  One standby (A) applies WAL without problems, but the other
  standby (B) has stopped after connecting to the primary server
  (or WAL sending is badly delayed).

  In this situation, standby (B) has not received WAL segment files
  while it was stopped,
  and the primary server can not remove WAL segments which have not been
  received by all standbys.
  Therefore the primary server has to keep the WAL segment files which
  have not been received by all standbys.
  But standby (A) can do checkpoints itself, and then it is possible to
  recycle WAL segments,
  so the number of WAL segments on each server is different
  (standby (A) has fewer WAL files than the primary
  server).
  After the primary server crashes and standby (A) is promoted to primary,
  we can try to connect standby (B) to standby (A) as a new standby server.
  But that will fail because standby (A) might not have the WAL
  segment files that standby (B) requires.
 
 This sounds like a valid concern.
 
  To resolve this situation, I think that we should make the master server
  notify all standby servers about the removal of WAL segments,
  and have the standby servers recycle WAL segment files based on that
  information.

 I think that'll end up being really horrible, at least if done in an
 obligatory fashion. In a cascaded setup it's really sensible to only
 retain WAL on the intermediate nodes. Consider e.g. a setup - rather
 common these days actually - where there's a master somewhere and then a
 cascading standby on each continent feeding off to further nodes on that
 continent. You don't want to retain WAL on each continent (or on the
 primary) just because one node somewhere is down for maintenance.


 If you really want something like this we should probably add the
 infrastructure for one standby to maintain a replication slot on another
 standby server. So, if you have a setup like:

          A
         / \
        /   \
       B     C
      / \   / \
    ..  .. ..  ..

 B and C can coordinate that they keep enough WAL for each other. You can
 actually easily write an external tool for that today. Just create a
 replication slot on B for C and the other way round and have a tool
 update them once a minute or so.

 I'm not sure if we want that builtin.

  Thought?
 
  How does the server recycle WAL files after it's promoted from the
  standby to master?
  It does that as it likes? If yes, your approach would not be enough.
 
  The approach prevents unexpected removal of WAL files while the standby
  is running. But after the standby is promoted to master, it might recycle
  needed WAL files immediately. So another standby may still fail to retrieve
  the required WAL file after the promotion.
 
  ISTM that, in order to address this, we might need to log all the 
  replication
  slot activities and replicate them to the standby. I'm not sure if this
  breaks the design of replication slot at all, though.

 Yes, that'd break it. You can't WAL log anything on a standby, and
 replication slots can be modified on standbys.

So the current solution for the problem Sawada reported is probably to
increase wal_keep_segments on the standby to a high enough value.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello Amit,

 I see there is some merit in your point, which is to make bgwriter more 
 useful than in its current form. I can see 3 top-level points to think 
 about, where improvement in any of them could improve the current 
 situation:



a. Scanning of buffer pool to find the dirty buffers that can
be flushed.


Yep, that could be more aggressive, or the aggressiveness could be made 
into an adjustable parameter. There is a comment about that in the source. 
However I tested setting the round to 1s instead of 120s, and it had no 
positive effect on my test.



b. Deciding on what is criteria to flush a buffer


Indeed.

For instance, when the IO load is low, there is no reason not to send out 
some buffers; it is a free lunch even if it must be done again later. If 
the load is high, this is another matter. So it would mean being able to 
decide whether the current IO load is low or not. It could be different on 
different disks... Hmmm, not that simple.



c. Sync of buffers


Yes. That is more or less the effect of my rough approach of calling 
CHECKPOINT every 0.2 seconds.  Andres Freund just implemented a quick 
linux-specific patch which does that within CHECKPOINT pacing, and which 
greatly improves the situation, although it could still be a little 
better.



[separate xlog test]


I do not have a setup available for that right now. Not sure this would be 
an issue for low loads.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Sequence Scan doubts

2014-08-27 Thread Robert Haas
On Thu, Aug 21, 2014 at 2:47 AM, Haribabu Kommi
kommi.harib...@gmail.com wrote:
 Implementation of Parallel Sequence Scan

 Approach:

 1. Parallel Sequence Scan can be achieved by using background
 workers to do the job of the actual sequence scan, including the
 qualification check.

 2. The planner generates the parallel scan plan by checking whether
 the table meets the criteria for a parallel scan, and generates the
 tasks (ranges of blocks; see the sketch below).

 3. In the executor Init phase, copy the necessary data required
 by the workers and start the workers.

 4. In the executor run phase, just get the tuples sent by
 the workers and process them further in the plan node execution.
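
As a purely hypothetical illustration of the task generation mentioned in step 2 (invented types and names, not taken from any patch):

    #include <stdint.h>

    typedef struct
    {
        uint32_t    start_block;
        uint32_t    nblocks;
    } BlockRange;

    /* split total_blocks into one contiguous range per worker */
    static void
    make_block_ranges(uint32_t total_blocks, int nworkers, BlockRange *ranges)
    {
        uint32_t    chunk = total_blocks / nworkers;
        uint32_t    rest = total_blocks % nworkers;
        uint32_t    next = 0;

        for (int i = 0; i < nworkers; i++)
        {
            ranges[i].start_block = next;
            ranges[i].nblocks = chunk + (i < (int) rest ? 1 : 0);
            next += ranges[i].nblocks;
        }
    }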

Well, this is something I've thought quite a bit about already.  Many
of my thoughts on parallelism are here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Although the page title is parallel sort, many of the concerns are
applicable to parallelism of any sort.

I posted some patches containing some of the necessary infrastructure here:

http://archives.postgresql.org/message-id/CA+Tgmoam66dTzCP8N2cRcS6S6dBMFX+JMba+mDf68H=kakn...@mail.gmail.com

I seem to have forgotten to add that message to the CommitFest.  Crap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function to know last log write timestamp

2014-08-27 Thread Fujii Masao
On Tue, Aug 19, 2014 at 1:07 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Aug 15, 2014 at 7:17 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2014-08-14 14:37:22 -0400, Robert Haas wrote:
 On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-08-14 14:19:13 -0400, Robert Haas wrote:
  That's about the idea. However, what you've got there is actually
  unsafe, because shmem->counter++ is not an atomic operation.  It reads
  the counter (possibly even as two separate 4-byte loads if the counter
  is an 8-byte value), increments it inside the CPU, and then writes the
  resulting value back to memory.  If two backends do this concurrently,
  one of the updates might be lost.
 
  All these are only written by one backend, so it should be safe. Note
  that that coding pattern, just without memory barriers, is all over
  pgstat.c

 Ah, OK.  If there's a separate slot for each backend, I agree that it's 
 safe.

 We should probably add barriers to pgstat.c, too.

 Yea, definitely. I think this is rather borked on weaker
 architectures. It's just that the consequences of an out of date/torn
 value are rather low, so it's unlikely to be noticed.

 Imo we should encapsulate the changecount modifications/checks somehow
 instead of repeating the barriers, Asserts, comments et al everywhere.

 So what about applying the attached patch first, which adds the macros
 to load and store the changecount with the memory barriers, and changes
 pgstat.c to use them. Maybe this patch needs to be back-patched to at least 9.4?

 After applying the patch, I will rebase the pg_last_xact_insert_timestamp
 patch and post it again.

 That looks OK to me on a relatively-quick read-through.  I was
 initially a bit worried about this part:

   do
   {
 ! pgstat_increment_changecount_before(beentry);
   } while ((beentry->st_changecount & 1) == 0);

 pgstat_increment_changecount_before is an increment followed by a
 write barrier.  This seemed like funny coding to me at first because
 while-test isn't protected by any sort of barrier.  But now I think
 it's correct, because there's only one process that can possibly write
 to that data, and that's the one that is making the test, and it had
 certainly better see its own modifications in program order no matter
 what.
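
For concreteness, a rough generic sketch of that single-writer changecount scheme (this is not pgstat.c; the type, field and fence names below are invented, and C11 fences stand in for whatever barrier macros end up being used):

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct
    {
        volatile uint32_t changecount;  /* odd while an update is in progress */
        int64_t     payload;            /* stand-in for the entry's real fields */
    } Entry;

    /* writer side: only one process ever writes a given entry */
    static void
    entry_update(Entry *e, int64_t value)
    {
        e->changecount++;                               /* becomes odd */
        atomic_thread_fence(memory_order_release);      /* "write barrier" */
        e->payload = value;
        atomic_thread_fence(memory_order_release);
        e->changecount++;                               /* becomes even again */
    }

    /* reader side: retry until a consistent (even, unchanged) count is seen */
    static int64_t
    entry_read(const Entry *e)
    {
        uint32_t    before, after;
        int64_t     value;

        do
        {
            before = e->changecount;
            atomic_thread_fence(memory_order_acquire);  /* "read barrier" */
            value = e->payload;
            atomic_thread_fence(memory_order_acquire);
            after = e->changecount;
        } while (before != after || (before & 1) != 0);

        return value;
    }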

 I wouldn't object to back-patching this to 9.4 if we were earlier in
 the beta cycle, but at this point I'm more inclined to just put it in
 9.5.  If we get an actual bug report about any of this, we can always
 back-patch the fix at that time.  But so far that seems mostly
 hypothetical, so I think the less-risky course of action is to give
 this a longer time to bake before it hits an official release.

Sounds reasonable. So, barring any objection, I will apply the patch
only to the master branch.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] re-reading SSL certificates during server reload

2014-08-27 Thread Stephen Frost
* Magnus Hagander (mag...@hagander.net) wrote:
 That's certainly an issue. Potentially bigger ones are that you cannot
 replace an expired certificate or CRL without a restart.

+100.  I had forgotten about that issue- but it definitely sucks. :(

 Some of this is going to have to be at least partially reworked anyway
 in the work that Heikki has been doing to support non-OpenSSL
 libraries. Making a change like this at the same time is probably a
 good idea.

Agreed.

Thanks,

Stephen




Re: [HACKERS] Code bug or doc bug?

2014-08-27 Thread Robert Haas
On Sun, Aug 24, 2014 at 6:26 PM, Josh Berkus j...@agliodbs.com wrote:
 Quoth our docs
 (http://www.postgresql.org/docs/9.3/static/sql-alterdatabase.html):

 The fourth form changes the default tablespace of the database. Only
 the database owner or a superuser can do this; you must also have create
 privilege for the new tablespace. This command physically moves any
 tables or indexes in the database's old default tablespace to the new
 tablespace. Note that tables and indexes in non-default tablespaces are
 not affected.

 Yet:

 jberkus=# alter database phc set tablespace ssd;
 ERROR:  some relations of database phc are already in tablespace ssd
 HINT:  You must move them back to the database's default tablespace
 before using this command.

 Aside from being a stupid limitation (I need to copy the tables back to
 the old tablespace so that I can recopy them to the new one?), the above
 seems to be in direct contradiction to the docs.

I think that it works OK to move objects from tablespace A to tablespace B
while there are also objects in tablespace C, where B != C, but not to
move objects from tablespace A to tablespace B while there are already
objects in tablespace B.  So I think the documentation is right as far
as it goes, but there's an undocumented limitation there.

The reasons for the limitation are:

1. We can't move a database while there are users connected to it.
This means that we can't modify any of the data in the database in the
process of relocating it.  In particular, we can't update its copy of
pg_class.

2. By convention, pg_class.reltablespace = 0 when the relation is in
the database's default tablespace, and only contains a non-zero OID
when the relation is in some other tablespace.  This is what lets this
feature work at all: the pg_class.reltablespace value for every
relation we're moving is guaranteed to be 0 before the move, and is
still correctly valued as 0 after the move.  But it also means there
can't be any relations from that database in the new tablespace,
because any such relations would need pg_class.reltablespace to get
updated from the OID of that tablespace to 0.

I don't see any easy way to lift this limitation.  If it were possible
to move a database while users are connected to it, then of course you
could connect to the database to move it and update pg_class, but
you'd have to take an exclusive lock on every relation in the database
simultaneously, which might blow out the lock table, deadlock against
other sessions, and other messy things.

Another idea is to have a command that you can run, while connected to
a particular database, that updates the default tablespace for that
database without actually moving any data on disk - i.e. it sets
pg_database.dattablespace, and then updates every pg_class row where
reltablespace = 0 to the old default tablespace, and every pg_class row
where reltablespace = the new tablespace ID to 0.  Then you can move
individual relations afterwards if you feel like it.  But that might
still require a lot of locks, and I think we also have a limitation
that some relations (the mapped ones?) have to be in the database's
default tablespace, which obviously wouldn't work here.

So it's a tricky problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
  that causes the problem, not the writes themselves.


 Hmmm. My (poor) understanding is that fsync would work fine if everything
 was already written beforehand:-) that is it has nothing to do but assess
 that all is already written. If there is remaining write work, it starts
 doing it now with the disastrous effects I'm complaining about.

  When I say pacing does not work, I mean that things were not written out
 to disk by the OS, it does not mean that pg did not ask for it.

 However it does not make much sense for an OS scheduler to wait several
 minutes with tens of thousands of pages to write and do nothing about it...
 So I'm wondering.

Maybe what's needed is to slightly tweak the checkpoint logic to give the
kernel some time to flush buffers.

Correct me if I'm wrong, but the checkpointer does the sync right
after the reads. Of course there will be about 30s-worth of
accumulated writes (it's the default amount of time the kernel holds
on to dirty buffers).

Perhaps it should be delayed a small time, say 30s, to let the kernel
do the writing on its own.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:10 AM, Claudio Freire klaussfre...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
 that causes the problem, not the the writes themselves.


 Hmmm. My (poor) understanding is that fsync would work fine if everything
 was already written beforehand:-) that is it has nothing to do but assess
 that all is already written. If there is remaining write work, it starts
 doing it now with the disastrous effects I'm complaining about.

 When I say pacing does not work, I mean that things where not written out
 to disk by the OS, it does not mean that pg did not ask for it.

 However it does not make much sense for an OS scheduler to wait several
 minutes with tens of thousands of pages to write and do nothing about it...
 So I'm wondering.

 Maybe what's needed, is to slightly tweak checkpoint logic to give the
 kernel some time to flush buffers.

 Correct me if I'm wrong, but the checkpointer does the sync right
 after the reads. Of course there will be about 30s-worth of
 accumulated writes (it's the default amount of time the kernel holds
 on to dirty buffers).

 Perhaps it should be delayed a small time, say 30s, to let the kernel
 do the writing on its own.


Errata: just after the writes :-p


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
 On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
  [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
  that causes the problem, not the the writes themselves.
 
 
  Hmmm. My (poor) understanding is that fsync would work fine if everything
  was already written beforehand:-) that is it has nothing to do but assess
  that all is already written. If there is remaining write work, it starts
  doing it now with the disastrous effects I'm complaining about.
 
  When I say pacing does not work, I mean that things where not written out
  to disk by the OS, it does not mean that pg did not ask for it.
 
  However it does not make much sense for an OS scheduler to wait several
  minutes with tens of thousands of pages to write and do nothing about it...
  So I'm wondering.
 
 Maybe what's needed, is to slightly tweak checkpoint logic to give the
 kernel some time to flush buffers.
 
 Correct me if I'm wrong, but the checkpointer does the sync right
 after the reads. Of course there will be about 30s-worth of
 accumulated writes (it's the default amount of time the kernel holds
 on to dirty buffers).
 
 Perhaps it should be delayed a small time, say 30s, to let the kernel
 do the writing on its own.

The kernel *starts* to write out pages after 30s, it doesn't finish
doing so. So I don't think that's going to work.

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is going to more likely to be successful.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:15 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
 On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
  [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
  that causes the problem, not the the writes themselves.
 
 
  Hmmm. My (poor) understanding is that fsync would work fine if everything
  was already written beforehand:-) that is it has nothing to do but assess
  that all is already written. If there is remaining write work, it starts
  doing it now with the disastrous effects I'm complaining about.
 
  When I say pacing does not work, I mean that things where not written out
  to disk by the OS, it does not mean that pg did not ask for it.
 
  However it does not make much sense for an OS scheduler to wait several
  minutes with tens of thousands of pages to write and do nothing about it...
  So I'm wondering.

 Maybe what's needed, is to slightly tweak checkpoint logic to give the
 kernel some time to flush buffers.

 Correct me if I'm wrong, but the checkpointer does the sync right
 after the reads. Of course there will be about 30s-worth of
 accumulated writes (it's the default amount of time the kernel holds
 on to dirty buffers).

 Perhaps it should be delayed a small time, say 30s, to let the kernel
 do the writing on its own.

 The kernel *starts* to write out pages after 30s, it doesn't finish
 doing so. So I don't think that's going to work.

 I think a somewhat smarter version of the explicit flushes in the
 hack^Wpatch I posted nearby is going to more likely to be successful.


That path is dangerous (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 1:35 AM, Andrew Gierth
and...@tao11.riddles.org.uk wrote:
 If you look at the latest patch post, there's a small patch in it that
 does nothing but unreserve the keywords and fix ruleutils to make
 deparse/parse work. The required fix to ruleutils is an example of
 violating your four kinds of keywords principle, but quoting
 keywords still works.

I think it would be intolerable to lose the ability to quote keywords.
That could easily create situations where there's no reasonable way to
dump an older database in such a fashion that it can be reloaded into
a newer database.  So it's good that you avoided that.

The four kinds of keywords principle is obviously much less
absolute.  We've talked before about introducing additional categories
of keywords, and that might be a good thing to do for one reason or
another.  But I think it's not good to do it in a highly idiosyncratic
way: I previously proposed reserving concurrently only when it follows
CREATE INDEX, and not in any other context, but Tom argued that it had
to become a type_func_name_keyword since users would be confused to
find that concurrently (but not any other keyword) needed quoting
there.  In retrospect, I tend to think he probably had it right.
There is a good amount of third-party software out there that tries to
be smart about quoting PostgreSQL keywords - for example, pgAdmin has
code for that, or did last I looked - so by making things more
complicated, we run the risk not only of bugs in our own software but
also bugs in other people's software, as well as user confusion.  So I
still think the right solution is probably to reserve CUBE across the
board, and not just in the narrowest context that we can get away
with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Code bug or doc bug?

2014-08-27 Thread Bruce Momjian
On Wed, Aug 27, 2014 at 09:05:41AM -0400, Robert Haas wrote:
 Another idea is to have a command that you can run, while connected to
 a particular database, that updates the default tablespace for that
 database without actually moving any data on disk - i.e. it sets
 pg_database.dattablespace, and then updates every pg_class row where
 reltablespace = 0 to the old default tablespace, and pg_class row
 where reltablespace = the new tablespace ID to 0.  Then you can move
 individual relations afterwards if you feel like it.  But that might
 still require a lot of locks, and I think we also have a limitation
 that some relations (the mapped ones?) have to be in the database's
 default tablespace, which obviously wouldn't work here.
 
 So it's a tricky problem.

Is there a doc patch to make here?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
  I think a somewhat smarter version of the explicit flushes in the
  hack^Wpatch I posted nearby is going to more likely to be successful.
 
 
 That path is dangerous (as in, may not work as intended) if the
 filesystem doesn't properly understand range flushes (ehem, like
 ext3).

The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
guaranteeing durability. And - afaik - it's not implemented in a
filesystem-specific manner. It just initiates writeback for individual
pages. It doesn't cause barriers, journal flushes or anything else to be
issued. That's still done by the fsync() later.

The big disadvantage is that it's an OS-specific solution, but I don't
think we're going to find anything that isn't in this area.
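
For readers unfamiliar with the call, a minimal Linux-only sketch of the idea (write through the OS cache, hint the kernel to start writeback for that range, and leave durability to a later fsync(); error handling omitted, and this is not the posted patch):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static void
    write_and_hint(int fd, const char *buf, size_t len, off_t offset)
    {
        /* dirty the OS cache as usual */
        (void) pwrite(fd, buf, len, offset);

        /*
         * Ask the kernel to start writing this range back now.  This issues
         * no barriers or journal flushes and gives no durability guarantee.
         */
        (void) sync_file_range(fd, offset, (off_t) len, SYNC_FILE_RANGE_WRITE);
    }

    static void
    make_durable(int fd)
    {
        /* durability still comes from the eventual fsync() */
        (void) fsync(fd);
    }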

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Set new system identifier using pg_resetxlog

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 4:06 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I didn't understand this one. But it seems like the obvious solution is to
 not use the consumer's system identifier as the slot name. Or rename it
 afterwards.

You can't use the consumer's system identifier as the slot name,
because you have to create the slot before you create the consumer.
But you could rename it afterwards, or just use some other naming
convention entirely, which is why I'm -0.25 on this whole proposal.
What the 2ndQuadrant folks are proposing is not unreasonable (which is
why I'm only -0.25) but it opens an (admittedly small) can of worms
that I see no real need to open.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Code bug or doc bug?

2014-08-27 Thread David G Johnston
Bruce Momjian wrote
 On Wed, Aug 27, 2014 at 09:05:41AM -0400, Robert Haas wrote:
 Another idea is to have a command that you can run, while connected to
 a particular database, that updates the default tablespace for that
 database without actually moving any data on disk - i.e. it sets
 pg_database.dattablespace, and then updates every pg_class row where
 reltablespace = 0 to the old default tablespace, and pg_class row
 where reltablespace = the new tablespace ID to 0.  Then you can move
 individual relations afterwards if you feel like it.  But that might
 still require a lot of locks, and I think we also have a limitation
 that some relations (the mapped ones?) have to be in the database's
 default tablespace, which obviously wouldn't work here.
 
 So it's a tricky problem.
 
 Is there a doc patch to make here?

1. Last sentence change suggestion: "The target tablespace must be empty."

2. Based on Robert's comments it sounds like a "You cannot change the
default tablespace of the current database." comment should be added as
well.

Side note: I have no clue what the mapped relations Robert refers to
are...

If the locking problem is unsolvable, which seems to be the only realistic
reason why updating pg_class cannot be done somewhere in the process, could
we make it so that the same physical tablespace location can have multiple
pointers?  The problem here would be that a subsequent move would only grab
those relations that are in the current tablespace by default and would
leave the ones that were present originally - unless they get moved in the
interim to the default tablespace (in this case by changing their oid to 0
manually first).

David J.






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:20 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
  I think a somewhat smarter version of the explicit flushes in the
  hack^Wpatch I posted nearby is going to more likely to be successful.


 That path is dangerous (as in, may not work as intended) if the
 filesystem doesn't properly understand range flushes (ehem, like
 ext3).

 The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
 guaranteeing durability. And - afaik - not implemented in a file system
 specific manner. It just initiates writeback for individual pages. It
 doesn't cause barrier, journal flushes or anything to be issued. That's
 still done by the fsync() later.

 The big disadvantage is that it's a OS specific solution, but I don't
 think we're going to find anything that isn't in this area.

I guess it should work then.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER SYSTEM RESET?

2014-08-27 Thread Fujii Masao
On Mon, Aug 25, 2014 at 1:34 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Wed, Jul 30, 2014 at 9:11 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
 I have verified the patch and found that it works well for
 all scenarios.  A few minor suggestions:

 1.
 !values to the <filename>postgresql.auto.conf</filename> file.
 !Setting the parameter to <literal>DEFAULT</literal>, or using the
 !<command>RESET</command> variant, removes the configuration entry
 from

 It would be better if we could write a separate line for RESET ALL,
 as is done for both ALTER DATABASE and ALTER ROLE in their
 respective documentation.

 2.
 ! %type <vsetstmt> generic_set set_rest set_rest_more generic_reset
 reset_rest SetResetClause FunctionSetResetClause

 Good to break it into 2 lines.

 3. I think we can add some text on top of the function
 AlterSystemSetConfigFile() to explain its functionality w.r.t. RESET ALL.

 I have updated the patch to address the above points.

 I will mark this patch as Ready For Committer as most of the
 review comments were already addressed by Vik and the remaining ones
 I have addressed in the attached patch.

The patch looks good to me. One minor comment is; probably you need to
update the tab-completion code.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hardening pg_upgrade

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 3:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bruce Momjian br...@momjian.us writes:
 OK, I will move in the direction of removing 8.3 support and use a
 single query to pull schema information.   I was hesitant to remove 8.3
 support as I know we have kept pg_dump support all the way back to 7.0,
 but it seems pg_upgrade need not have the same version requirements.

 Not really related, but ... I've been thinking that it's time to rip out
 pg_dump's support for server versions before 7.3 or 7.4.  That would let
 us get rid of a lot of klugy code associated with the lack of schemas
 and dependency info in the older versions.  It's possible that we should
 move the cutoff even further --- I've not looked closely at how much could
 be removed by dropping versions later than 7.3.

 Aside from the question of how much old code could be removed, there's the
 salient point of how do we test pg_dump against such old branches?  The
 further back you go the harder it is to even build PG on modern platforms,
 and the less likely it will work (I note for example that pre-8.0
 configure doesn't try to use -fwrapv, let alone some of the other switches
 we've found necessary on recent gcc).  I've usually tested pg_dump patches
 against old servers by running them against builds I have in captivity on
 my old HPPA box ... but once that dies, I'm *not* looking forward to
 trying to rebuild 7.x on my current machines.

I think it's fine for us to start requiring two-step upgrades beyond a
certain point, and I think removing pg_dump support for pre-7.4
versions of the server is very reasonable.  I also think removing
pg_upgrade support for 8.3 is reasonable.  Many products require
multi-step upgrades when crossing multiple release versions, and I
think we can, too.

Having said that, there are obviously advantages for our users if we
don't get too crazy about requiring that.  I've used products in the
past where to get from version 3 to version 11 you have to upgrade
from 3 to 5, then 5 to 7, then 7 to 9, and then 9 to 11.  That's
somewhat understandable from the vendor's point of view, but it's not
a lot of fun, and I think we should definitely avoid imposing those
kinds of requirements on our users.

What we're talking about here is much milder than that.  For the
pg_upgrade case, you can upgrade from 8.3 to any of 8.4, 9.0, 9.1,
9.2, 9.3, and 9.4.  You only need to do a two-step upgrade if you want
to leapfrog more than 6 major release versions.  That seems like a
wide-enough window that it shouldn't inconvenience many people.  For
the pg_dump case, you can upgrade from 7.2 or 7.3 to 7.4, 8.0, 8.1,
8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, or 9.4; that is, 11 or 12 major
releases.  The number of people who want to skip more than a dozen
releases in a single upgrade should be very small, and we might
counsel those people that they'd be better off with a step-wise
upgrade for other reasons - like the application-level compatibility
breaks we've made over the years - anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-08-27 Thread Robert Haas
On Mon, Aug 25, 2014 at 8:58 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:
 Reading the code, I noticed that the pushed down UPDATE or DELETE statement 
 is executed
 during postgresBeginForeignScan rather than during postgresIterateForeignScan.
 It probably does not matter, but is there a reason to do it differently from 
 the normal scan?

Hmm, I'm worried that may be an API contract violation.  ISTM that we
might initialize nodes that we never read from - they can show up in
the EXPLAIN-plan as (never executed) - and things that aren't executed
shouldn't do work, especially work that permanently modifies data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specifying the unit in storage parameter

2014-08-27 Thread Alvaro Herrera
Fujii Masao wrote:
 On Tue, Aug 26, 2014 at 3:27 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  Fujii Masao wrote:
  On Thu, Aug 21, 2014 at 4:20 PM, Michael Paquier
  michael.paqu...@gmail.com wrote:
 
   Looking at the patch, the parameter fillfactor in the category
   RELOPT_KIND_HEAP (the first element in intRelOpts of reloptions.c) is
   not updated with the new field. It is only a one-line change.
   @@ -97,7 +97,7 @@ static relopt_int intRelOpts[] =
   Packs table pages only to this percentage,
   RELOPT_KIND_HEAP
   },
   -   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100
   +   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100, 0
   },
 
  Oh, good catch. I wonder why I did such a mistake...
 
  Uninitialized elements at end of struct are filled with zeroes.
 
 Yeah, that's the reason why I could not notice the problem at compile time.

Right -- it's not something the compiler would warn you about.

  We do
  have other examples of this -- for instance, config_generic in the guc.c
  tables are almost always only 5 members long even though the struct is
  quite a bit longer than that.  Most entries do not even have flags set.
 
 So you imply that the trailing zero (which the patch adds as flag)
 in the reloption struct should be dropped?

Not necessarily, because it's harmless.  It's there for purely
aesthetic reasons, so it's your choice whether to add it or not.
Having it there is slightly easier on somebody reading the code,
perhaps.
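
The C rule being relied on here, shown with a generic struct rather than the reloptions table itself: members left out of an initializer are implicitly zero-filled.

    struct opt { int def; int min; int max; int flags; };

    static struct opt with_flag    = { 100, 10, 100, 0 };  /* explicit 0       */
    static struct opt without_flag = { 100, 10, 100 };     /* .flags is also 0 */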

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-08-27 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Aug 25, 2014 at 8:58 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:
 Reading the code, I noticed that the pushed down UPDATE or DELETE statement 
 is executed
 during postgresBeginForeignScan rather than during 
 postgresIterateForeignScan.
 It probably does not matter, but is there a reason to do it different from 
 the normal scan?

 Hmm, I'm worried that may be an API contract violation.

Indeed it is.  You could get away with it if you check the
EXEC_FLAG_EXPLAIN_ONLY flag before doing anything with visible
side-effects, but it's still pretty ugly.
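
The kind of guard being described would look roughly like this in an FDW's BeginForeignScan callback (a sketch only, not the actual postgres_fdw code):

    #include "postgres.h"
    #include "executor/executor.h"      /* EXEC_FLAG_EXPLAIN_ONLY */
    #include "nodes/execnodes.h"        /* ForeignScanState */

    static void
    sketchBeginForeignScan(ForeignScanState *node, int eflags)
    {
        /*
         * For EXPLAIN without ANALYZE the node may be initialized but never
         * executed, so nothing with visible side effects may happen here.
         */
        if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
            return;

        /* ... set up (or better, defer) the remote statement here ... */
    }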

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication commands and log_statements

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 7:17 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Aug 20, 2014 at 1:14 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Wed, Aug 20, 2014 at 2:06 AM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Aug 16, 2014 at 10:27 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  I think ideally it would have been better if we could have logged
  replication commands under a separate log level, but as there is still
  no consensus on extending log_statement and nobody is even
  willing to pursue it, it seems okay to go ahead and log these under
  the 'all' level.

 I think the consensus is clearly for a separate GUC.

 +1.

 Okay. Attached is the updated version of the patch which I posted before.
 This patch follows the consensus and adds separate parameter
 log_replication_command.

Looks fine to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-08-27 Thread Tom Lane
I wrote:
 Robert Haas robertmh...@gmail.com writes:
 Hmm, I'm worried that may be an API contract violation.

 Indeed it is.  You could get away with it if you check the
 EXEC_FLAG_EXPLAIN_ONLY flag before doing anything with visible
 side-effects, but it's still pretty ugly.

Actually, there's another problem there.  What of UPDATE or DELETE with a
LIMIT clause, which is something that seems to be coming down the pike:
https://commitfest.postgresql.org/action/patch_view?id=1550

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-27 Thread Alvaro Herrera
Andres Freund wrote:
 On 2014-08-26 22:19:47 -0400, Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:

  I would say that the issue most deserving of performance testing is your
  sizing of the linear-search array --- it's not obvious that 8 is a good
  size.
 
 It's about the size of a cacheline on all common architectures, that's
 how I found it. I don't think it makes a very big difference whether we
 make it 4 or 12, but outside of that range I think it'll be unlikely to
 be beneficial. The regression tests never go about three or four pins or
 so currently, so I think that's a number unlikely to regularly be
 crossed in practice.

FWIW scanning a minmax index will keep three pages pinned IIRC
(metapage, current revmap page, current regular page).
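
As a generic sketch of the lookup side of such a scheme (a cacheline-sized array searched linearly, spilling to a hash table only when many buffers are pinned at once; the hash helper is invented and this is not the actual patch):

    #include <stdint.h>

    #define REFCOUNT_ARRAY_ENTRIES 8    /* roughly one cache line */

    typedef uint32_t Buffer;            /* 0 = invalid/unused */

    typedef struct
    {
        Buffer      buffer;
        int32_t     refcount;
    } RefCountEntry;

    static RefCountEntry refcount_array[REFCOUNT_ARRAY_ENTRIES];

    extern int32_t *overflow_hash_lookup(Buffer buf);   /* invented helper */

    static int32_t *
    find_refcount(Buffer buf)
    {
        /* fast path: recently pinned buffers are expected to be here */
        for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        {
            if (refcount_array[i].buffer == buf)
                return &refcount_array[i].refcount;
        }

        /* slow path: only needed when more than 8 buffers are pinned at once */
        return overflow_hash_lookup(buf);
    }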

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hardening pg_upgrade

2014-08-27 Thread Bruce Momjian
On Wed, Aug 27, 2014 at 09:54:11AM -0400, Robert Haas wrote:
 Having said that, there are obviously advantages for our users if we
 don't get too crazy about requiring that.  I've used products in the
 past where to get from version 3 to version 11 you have to upgrade
 from 3 to 5, then 5 to 7, then 7 to 9, and then 9 to 11.  That's
 somewhat understandable from the vendor's point of view, but it's not
 a lot of fun, and I think we should definitely avoid imposing those
 kinds of requirements on our users.
 
 What we're talking about here is much milder than that.  For the
 pg_upgrade case, you can upgrade from 8.3 to any of 8.4, 9.0, 9.1,
 9.2, 9.3, and 9.4.  You only need to do a two-step upgrade if you want
 to leapfrog more than 6 major release versions.  That seems like a
 wide-enough window that it shouldn't inconvenience many people.  For
 the pg_dump case, you can upgrade from 7.2 or 7.3 to 7.4, 8.0, 8.1,
 8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, or 9.4; that is, 11 or 12 major
 releases.  The number of people who want to skip more than a dozen
 releases in a single upgrade should be very small, and we might
 counsel those people that they'd be better off with a step-wise
 upgrade for other reasons - like the application-level compatibility
 breaks we've made over the years - anyway.

Two things --- first, removing 8.3 support in pg_upgrade allowed me to
remove lots of dead code, so it was a win.  Second, I think you need to
look at the time span from old to new versions to understand if a
double-step release is reasonable.  If that 3-5-7-9 release step spans
two years, it is too short --- if it spans 15 years, it is probably fine
as few people would wait 15 years to upgrade.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Aidan Van Dyk
On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:


 Hello Andres,

  [...]

 I think you're misunderstanding how spread checkpoints work.


 Yep, definitely:-) On the other hand I thought I was seeking something
 simple, namely correct latency under small load, that I would expect out
 of the box.

 What you describe is reasonable, and is more or less what I was hoping
 for, although I thought that bgwriter was involved from the start and
 checkpoint would only do what is needed in the end. My mistake.


If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?
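
For what it's worth, the knobs being referred to are the Linux vm.dirty_*
sysctls. A minimal sketch of the kind of settings involved (the values below
are purely illustrative, not a recommendation; they have to be sized to the
machine and its storage):

    # e.g. in /etc/sysctl.conf, applied with "sysctl -p" (root required)
    vm.dirty_background_bytes = 67108864    # start background writeback at ~64MB of dirty data
    vm.dirty_bytes = 536870912              # block writers once ~512MB is dirty

These are global, per-host settings rather than anything PostgreSQL can tune
per process, which is part of the objection raised in the replies below.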


a.


-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a
slave.


Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 02:37 PM, Fabien COELHO wrote:

As for an actual latency limit under throttling, this is significantly
more tricky and invasive to implement... ISTM that it would mean:

   - if the tx is not started and the latency is already consumed, SKIP++.

   - if the tx is after its schedule start time but under latency, then
 start it, and maybe inject a SET TIMEOUT

   - if a tx is being processed but reaches its latency limit (after
 schedule start time), abort it coldly, ROLLBACK++ (well if the tx is
 really started, there could also be shell commands and \set stuff in a
 pgbench script, which mean started is not really started, so it would
 be INTERRUPT++ if no BEGIN was sent).

   - if a tx is finished but the final commit returned after the latency
 deadline, you cannot abort it anymore but it is late nevertheless,
 LATE++.


Yeah, something like that. I don't think it would be necessary to set 
statement_timeout, you can inject that in your script or postgresql.conf 
if you want. I don't think aborting a transaction that's already started 
is necessary either. You could count it as LATE, but let it finish first.



This is doable but far beyond my current needs. Moreover, I'm not sure
that such a patch would pass because of invasiveness and complexity, so it
could be a total loss of time.


Ok, but *why* are you doing a lag limit, and not a latency limit?


Because it is much simpler (see above) and is enough for testing pg
responsiveness issue, which is my current objective, and models some
client timeout behavior.


Under what circumstances is the lag limit a more useful setting?


It is not more useful per se, it is what I'm using to test pg
unresponsiveness with a simple-to-define-and-interpret measure wrt
throttling.

If I would do latency limit under throttling, it would be (1) more time
to develop, more complex, more invasive in the code (see above, + also the
implementation when not under throttling), (2) more complex to interpret,
with at least 5 possible outcomes (skipped, interrupted, committed on
time, committed but late, aborted), (3) this added information would not
be useful to me.

I've submitted this simple lag limit version because being able to
measure quickly and simply (un)responsiveness seems like a good idea,
especially given the current state of things.


Ok, fair enough. I don't think doing a latency limit would be 
significantly harder, but I can't force you. I'll mark this as Returned 
with Feedback then.


- Heikki





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:32:19 -0400, Aidan Van Dyk wrote:
 On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 
 
  Hello Andres,
 
   [...]
 
  I think you're misunderstanding how spread checkpoints work.
 
 
  Yep, definitely:-) On the other hand I though I was seeking something
  simple, namely correct latency under small load, that I would expect out
  of the box.
 
  What you describe is reasonable, and is more or less what I was hoping
  for, although I thought that bgwriter was involved from the start and
  checkpoint would only do what is needed in the end. My mistake.
 
 
 If all you want is to avoid the write storms when fsyncs start happening on
 slow storage, can you not just adjust the kernel vm.dirty* tunables to
 start making the kernel write out dirty buffers much sooner instead of
 letting them accumulate until fsyncs force them out all at once?

Well. For one that's an OS-specific global tunable requiring root to be
adjustable. For another we actually do want some buffering: If a backend
writes out a buffer's data itself (happens very frequently) it *should*
get buffered... So I don't think a process independent tunable is going
to do the trick.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: [BUGS] Re: BUG #9555: pg_dump for tables with inheritance recreates the table with the wrong order of columns

2014-08-27 Thread Bruce Momjian
On Mon, Mar 17, 2014 at 07:12:12PM -0400, Noah Misch wrote:
 On Fri, Mar 14, 2014 at 12:33:04PM -0300, Alvaro Herrera wrote:
  Tom Lane wrote:
   Alvaro Herrera alvhe...@2ndquadrant.com writes:
 I wonder if the real fix here is to have ALTER / INHERIT error out if
the columns in B are not a prefix of those in A.
   
   Years ago, we sweated quite a lot of blood to make these cases work.
   I'm not thrilled about throwing away all that effort because one person
   doesn't like the behavior.
 
 Agreed.  That also makes the current pg_dump behavior a bug.  Column order
 matters; pg_dump is failing to recreate a semantically-equivalent database.
 
  Hm, well in that case it makes sense to consider the original
  suggestion: if the columns in the parent are not a prefix of those of
  the child, use ALTER INHERIT after creating both tables rather than
  CREATE TABLE INHERITS.
  
  It'd be a lot of new code in pg_dump though.  I am not volunteering ...
 
 pg_dump --binary-upgrade already gets this right.  Perhaps it won't take too
 much code to make dumpTableSchema() reuse that one part of its binary-upgrade
 approach whenever the columns of B are not a prefix of those in A.

[thread moved to hackers]

I looked at this issue from March and I think we need to do something. 
In summary, the problem is that tables using inheritance can be dumped
and reloaded with columns in a different order from the original
cluster.  What is basically happening is that these queries:

CREATE TABLE A(a int, b int, c int);
CREATE TABLE B(a int, c int);
ALTER TABLE A INHERIT B;

cause pg_dump to generate this:

CREATE TABLE b (
a integer,
c integer
);
CREATE TABLE a (
a integer,
b integer,
c integer
)
INHERITS (b);

which issues these warnings when run:

NOTICE:  merging column a with inherited definition
NOTICE:  merging column c with inherited definition

and produces this table a:

test2=> \d a
      Table "public.a"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
-->  c  | integer |
 b      | integer |

Notice the column reordering.  The logic is that a CREATE TABLE INHERITS
should place the inherited parent columns _first_.  This can't be done
by ALTER TABLE INHERIT because the table might already contain data.

I think we have several options:

1.  document this behavior
2.  have ALTER TABLE INHERIT issue a warning about future reordering
3.  use the pg_dump binary-upgrade code when such cases happen

My crude approach for #3 would be for pg_dump to loop over the columns
and, where pg_attribute.attinhcount == 0, check to see if there is a
matching column name in any inherited table.  Will such tables load fine
because pg_dump binary-upgrade mode doesn't do any data loading?
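
For the example above, the binary-upgrade-style output that option #3 points
at would essentially be the following (a hand-written sketch, not actual
pg_dump output): create the child with its own column order and attach the
parent afterwards.

    CREATE TABLE b (
        a integer,
        c integer
    );
    CREATE TABLE a (
        a integer,
        b integer,
        c integer
    );
    -- attaching the parent afterwards preserves the child's column order
    ALTER TABLE a INHERIT b;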

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello,


If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?


I can try, when I have finished with the current round of testing.

Note that, as Andres put it, it currently sucks. Having to tinker with 
linux kernel parameters just to handle a small load without being offline 
10% of the time does not look very good, so even if it works, ISTM that a 
pg side solution is desirable.


--
Fabien.




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 8:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello,
 Thank you for comments.

Could you tell me where the patch for single block in one run is?
 Please find attached patch for single block compression in one run.

 Thanks! I ran the benchmark using pgbench and compared the results.
 I'd like to share the results.

 [RESULT]
 Amount of WAL generated during the benchmark. Unit is MB.

            Multiple   Single
  off          202.0    201.5
  on          6051.0   6053.0
  pglz        3543.0   3567.0
  lz4         3344.0   3485.0
  snappy      3354.0   3449.5

 Latency average during the benchmark. Unit is ms.

            Multiple   Single
  off           19.1     19.0
  on            55.3     57.3
  pglz          45.0     45.9
  lz4           44.2     44.7
  snappy        43.4     43.3

 These results show that FPW compression is really helpful for decreasing
 the WAL volume and improving the performance.

Yeah, those look like good numbers.  What happens if you run it at
full speed, without -R?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hardening pg_upgrade

2014-08-27 Thread Robert Haas
On Wed, Aug 27, 2014 at 10:13 AM, Bruce Momjian br...@momjian.us wrote:
 On Wed, Aug 27, 2014 at 09:54:11AM -0400, Robert Haas wrote:
 Having said that, there are obviously advantages for our users if we
 don't get too crazy about requiring that.  I've used products in the
 past where to get from version 3 to version 11 you have to upgrade
 from 3 to 5, then 5 to 7, then 7 to 9, and then 9 to 11.  That's
 somewhat understandable from the vendor's point of view, but it's not
 a lot of fun, and I think we should definitely avoid imposing those
 kinds of requirements on our users.

 What we're talking about here is much milder than that.  For the
 pg_upgrade case, you can upgrade from 8.3 to any of 8.4, 9.0, 9.1,
 9.2, 9.3, and 9.4.  You only need to do a two-step upgrade if you want
 to leapfrog more than 6 major release versions.  That seems like a
 wide-enough window that it shouldn't inconvenience many people.  For
 the pg_dump case, you can upgrade from 7.2 or 7.3 to 7.4, 8.0, 8.1,
 8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, or 9.4; that is, 11 or 12 major
 releases.  The number of people who want to skip more than a dozen
 releases in a single upgrade should be very small, and we might
 counsel those people that they'd be better off with a step-wise
 upgrade for other reasons - like the application-level compatibility
 breaks we've made over the years - anyway.

 Two things --- first, removing 8.3 support in pg_upgrade allowed me to
 remove lots of dead code, so it was a win.  Second, I think you need to
 look at the time span from old to new versions to understand if a
 double-step release is reasonable.  If that 3-5-7-9 release step spans
 two years, it is too short --- if it spans 15 years, it is probably fine
 as few people would wait 15 years to upgrade.

Right, I agree with all of that and was not intending to dispute any of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Amit Kapila amit.kapil...@gmail.com writes:
 On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:
 I think you should get rid of BufFreelistLock completely and just
  decide that freelist_lck will protect everything: the freeNext links, plus
 everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
 will protect nextVictimBuffer and nothing else.

 Another point is I think it will be better to protect
  StrategyControl->completePasses with victimbuf_lck rather than
 freelist_lck, as when we are going to update it we will already be
 holding the victimbuf_lck and it doesn't make much sense to release
 the victimbuf_lck and reacquire freelist_lck to update it.

 I'm rather concerned by this cavalier assumption that we can protect
 fields a,b,c with one lock and fields x,y,z in the same struct with some
 other lock.

 A minimum requirement for that to work safely at all is that the fields
 are of atomically fetchable/storable widths; which might be okay here
 but it's a restriction that bears thinking about (and documenting).

 But quite aside from safety, the fields are almost certainly going to
 be in the same cache line which means contention between processes that
 are trying to fetch or store them concurrently.  For a patch whose sole
 excuse for existence is to improve performance, that should be a very
 scary concern.

 (And yes, I realize these issues already affect the freelist.  Perhaps
 that's part of the reason we have performance issues with it.)

False sharing is certainly a concern that has crossed my mind while
looking at Amit's work, but the performance numbers he's posted
upthread are stellar.  Maybe we can squeeze some additional
performance out of this by padding out the cache lines, but it's
probably minor compared to the gains he's already seeing.  I think we
should focus on trying to lock in those gains, and then we can
consider what further things we may want to do after that.  If it
turns out that structure-padding is among those things, that's easy
enough to do as a separate patch.
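
As an illustration of what padding out the cache lines would mean here, a
stand-alone sketch (the struct and field names are hypothetical, not the
actual StrategyControl layout, and a 64-byte cache line is assumed):

    #include <stdio.h>
    #include <stddef.h>

    #define CACHE_LINE_SIZE 64          /* assumed typical x86-64 line size */

    typedef struct
    {
        /* fields protected by freelist_lck */
        int          firstFreeBuffer;
        int          lastFreeBuffer;
        char         pad1[CACHE_LINE_SIZE - 2 * sizeof(int)];

        /* field protected by victimbuf_lck, pushed onto its own line */
        unsigned int nextVictimBuffer;
        char         pad2[CACHE_LINE_SIZE - sizeof(unsigned int)];
    } PaddedStrategyControl;

    int main(void)
    {
        /* prints 64: the victim counter no longer shares a line with the freelist */
        printf("nextVictimBuffer offset: %zu\n",
               offsetof(PaddedStrategyControl, nextVictimBuffer));
        return 0;
    }

The point is only that fields updated under different spinlocks end up on
different cache lines, so concurrent updaters stop invalidating each other's
lines; whether that buys anything on top of Amit's numbers is exactly the
open question above.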

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Today, while working on updating the patch to improve locking
 I found that as now we are going to have a new process, we need
 a separate latch in StrategyControl to wakeup that process.
 Another point is I think it will be better to protect
  StrategyControl->completePasses with victimbuf_lck rather than
 freelist_lck, as when we are going to update it we will already be
 holding the victimbuf_lck and it doesn't make much sense to release
 the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable.  I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer.  I don't see much point
in fine-tuning the locking regimen until that's done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Fabien COELHO



As for an actual latency limit under throttling, this is significantly
more tricky and invasive to implement... ISTM that it would mean:
[...] 


Yeah, something like that. I don't think it would be necessary to set 
statement_timeout, you can inject that in your script or postgresql.conf if 
you want. I don't think aborting a transaction that's already started is 
necessary either. You could count it as LATE, but let it finish first.


If you remove all difficult cases from the spec, it is obviously much 
simpler to implement:-) It seems that your simplified version of latency 
limit would be just to distinguish LATE from ONTIME among the committed 
ones, compared to the current version, and not to actually limit the 
latency, which is the tricky part.



I've submitted this simple lag limit version because being able to
measure quickly and simply (un)responsiveness seems like a good idea,
especially given the current state of things.


Ok, fair enough. I don't think doing a latency limit would be significantly 
harder, but I can't force you. I'll mark this as Returned with Feedback then.


Hmmm. I can distinguish just the two cases. Rather, mark it as waiting on
author; I may give it a go.


--
Fabien.




Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 06:08 PM, Fabien COELHO wrote:

I've submitted this simple lag limit version because being able to
measure quickly and simply (un)responsiveness seems like a good idea,
especially given the current state of things.


Ok, fair enough. I don't think doing a latency limit would be significantly
harder, but I can't force you. I'll mark this as Returned with Feedback then.


Hmmm. I can distinguish just the two cases. Rather mark it as waiting on
author, I may give it a go.


Feel free to mark it as such if you think you can get a new version 
posted in the next few days.


- Heikki




Re: [HACKERS] [BUGS] Re: BUG #9555: pg_dump for tables with inheritance recreates the table with the wrong order of columns

2014-08-27 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 I looked at this issue from March and I think we need to do something. 
 In summary, the problem is that tables using inheritance can be dumped
 and reloaded with columns in a different order from the original
 cluster.

Yeah ... this has been a well-understood issue for a dozen years, and
pg_dump goes to considerable trouble to get it right.

 I think we have several options:

 1.  document this behavior

That one.

regards, tom lane




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-27 Thread Arthur Silva
Em 26/08/2014 09:16, Fujii Masao masao.fu...@gmail.com escreveu:

 On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com
wrote:
  Hello,
  Thank you for comments.
 
 Could you tell me where the patch for single block in one run is?
  Please find attached patch for single block compression in one run.

 Thanks! I ran the benchmark using pgbench and compared the results.
 I'd like to share the results.

 [RESULT]
 Amount of WAL generated during the benchmark. Unit is MB.

            Multiple   Single
  off          202.0    201.5
  on          6051.0   6053.0
  pglz        3543.0   3567.0
  lz4         3344.0   3485.0
  snappy      3354.0   3449.5

 Latency average during the benchmark. Unit is ms.

            Multiple   Single
  off           19.1     19.0
  on            55.3     57.3
  pglz          45.0     45.9
  lz4           44.2     44.7
  snappy        43.4     43.3

 These results show that FPW compression is really helpful for decreasing
 the WAL volume and improving the performance.

 The compression ratio by lz4 or snappy is better than that by pglz. But
 it's difficult to conclude which lz4 or snappy is best, according to these
 results.

 ISTM that compression-of-multiple-pages-at-a-time approach can compress
 WAL more than compression-of-single-... does.

 [HOW TO BENCHMARK]
 Create pgbench database with scale factor 1000.

 Change the data type of the column filler on each pgbench table
 from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
 gen_random_uuid() in order to avoid empty column, e.g.,

  alter table pgbench_accounts alter column filler type text using
 gen_random_uuid()::text

 After creating the test database, run the pgbench as follows. The
 number of transactions executed during benchmark is almost same
 between each benchmark because -R option is used.

   pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

 checkpoint_timeout is 5min, so it's expected that checkpoint was
 executed at least two times during the benchmark.

 Regards,

 --
 Fujii Masao



It'd be interesting to check avg cpu usage as well.


[HACKERS] Missing plpgsql.o Symbols on OS X

2014-08-27 Thread David E . Wheeler
Hackers,

I’m trying to build Pavel’s plpgsql_check against the 9.4 beta on OS X 10.9, 
but get these errors:

make
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv  
-I/usr/local/pgsql/lib/pgxs/src/makefiles/../../src/pl/plpgsql/src -bundle 
-multiply_defined suppress -o plpgsql_check.so plpgsql_check.o 
-L/usr/local/pgsql/lib -L/usr/local/lib  -L/usr/local/lib 
-Wl,-dead_strip_dylibs   -bundle_loader /usr/local/pgsql/bin/postgres
Undefined symbols for architecture x86_64:
 _exec_get_datum_type, referenced from:
 _check_target in plpgsql_check.o
 _plpgsql_build_datatype, referenced from:
 _check_stmt in plpgsql_check.o
 _plpgsql_compile, referenced from:
 _check_plpgsql_function in plpgsql_check.o
 _plpgsql_parser_setup, referenced from:
 _prepare_expr in plpgsql_check.o
 _plpgsql_stmt_typename, referenced from:
 _put_error in plpgsql_check.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [plpgsql_check.so] Error 1

Which is odd, because plpgsql_check.c includes plpgsql.h, and those symbols do 
appear to be in plpgsql.so:

$ nm /usr/local/pgsql/lib/plpgsql.so | grep _exec_get_datum_type
f110 T _exec_get_datum_type
f380 T _exec_get_datum_type_info

So, uh, what gives? Do I need to do something extra to get it to properly find 
plpgsql.so?

Thanks,

David





Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Greg Sabino Mullane



Stephen Frost wrote:

 To try to clarify that a bit, as it comes across as rather opaque even
 on my re-reading, consider a case where you can't have the
 credit_card_number field ever exported to an audit or log file, but
 you're required to log all other changes to a table.  Then consider that
 such a situation extends to individual INSERT or UPDATE commands- you
 need the command logged, but you can't have the contents of that column
 in the log file.

Perhaps you need a better example. Storing raw credit cards in the database 
is a bad idea (and potential PCI violation); audit/log files are only one 
of the many ways things can leak out. Encrypting sensitive columns is a 
solution that solves your auditing problem, and works on all current versions 
of Postgres. :)

 Our current capabilities around logging and auditing are dismal

No arguments there.

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201408271200
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8






Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-27 Thread Alvaro Herrera
Thomas Munro wrote:
 On 25 August 2014 02:57, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
  Thomas Munro wrote:
  The difficulty of course will be testing all these racy cases 
  reproducibly...
 
  Does this help?
  http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com
  The useful trick there is forcing a query to get its snapshot and then
  go to sleep before actually doing anything, by way of an advisory lock.
 
 Yes it does, thanks Alvaro and Craig.  I think the attached spec
 reproduces the problem using that trick, ie shows NOWAIT blocking,
 presumably in EvalPlanQualFetch (though I haven't stepped through it
 with a debugger yet).  I'm afraid I'm out of Postgres hacking cycles
 for a few days, but next weekend I should have a new patch that fixes
 this by teaching EvalPlanQualFetch about wait policies, with isolation
 tests for NOWAIT and SKIP LOCKED.

Hmm, http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 04:20 PM, Andres Freund wrote:

On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is more likely to be successful.



That path is dangerous (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).


The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
guaranteeing durability. And - afaik - it is not implemented in a
filesystem-specific manner. It just initiates writeback for individual
pages. It doesn't cause barriers, journal flushes or anything to be
issued. That's still done by the fsync() later.

The big disadvantage is that it's an OS-specific solution, but I don't
think we're going to find anything that isn't in this area.
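
For reference, the call being described above is roughly the following
(Linux-only, stand-alone sketch; the actual patch wires this into the
buffer-write path rather than into a separate program):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Ask the kernel to start writeback of a byte range now, without
     * waiting for completion and without any durability guarantee;
     * the real flush still comes from the later fsync(). */
    static void
    hint_writeback(int fd, off_t offset, off_t nbytes)
    {
        if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
            perror("sync_file_range");      /* purely advisory here */
    }

    int main(int argc, char **argv)
    {
        int     fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_WRONLY);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        hint_writeback(fd, 0, 8192);    /* e.g. right after an 8kB block was written at offset 0 */
        fsync(fd);                      /* durability still comes from fsync */
        close(fd);
        return 0;
    }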


I've been thinking for a long time that we should interleave the writes 
and the fsyncs. That still forces up to 1GB of dirty buffers to disk at 
once, causing a spike, but at least not more than that. Also, the 
scheduling of a spread checkpoint is currently a bit bogus; we don't 
take into account the time needed for the fsync phase.


A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and 
write them out in order 
(http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp). 
The performance impact of that was inconclusive, but one thing that it 
allows nicely is to interleave the fsyncs, so that you write all the 
buffers for one file, then fsync it, then next file and so on. IIRC the 
biggest worry with that patch was that sorting the buffers requires a 
fairly large amount of memory, and making a large allocation in the 
checkpointer might cause an out-of-memory, which would be bad.


I don't think anyone's seriously worked on this area since. If the 
impact on responsiveness or performance is significant, I'm pretty sure 
the OOM problem could be alleviated somehow.


For kicks, I wrote a quick & dirty patch for interleaving the 
fsyncs, see attached. It works by repeatedly scanning the buffer pool, 
writing buffers belonging to a single relation segment at a time. I 
would be interested to hear how this performs in your test case.


- Heikki

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 938c554..0f2e4e0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1217,6 +1217,12 @@ BufferSync(int flags)
 	int			num_to_write;
 	int			num_written;
 	int			mask = BM_DIRTY;
+	RelFileNode target_rnode = { 0, 0, 0 };
+	ForkNumber	target_forkNum = InvalidForkNumber;
+	int			target_segno = 0;
+	bool		has_target = false;
+	int			outer_num_to_scan = 0;
+	int			outer_buf_id = 0;
 
 	/* Make sure we can handle the pin inside SyncOneBuffer */
 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1281,10 +1287,30 @@ BufferSync(int flags)
 	buf_id = StrategySyncStart(NULL, NULL);
 	num_to_scan = NBuffers;
 	num_written = 0;
-	while (num_to_scan-- > 0)
+
+	for (;;)
 	{
 		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
 
+		if (num_to_scan == 0)
+		{
+			if (has_target)
+			{
+/*
+ * We have finished writing out buffers belonging to this
+ * relation segment. fsync it now.
+ */
+mdsync_seg(target_rnode, target_forkNum, target_segno);
+
+/* continue the outer scan where we left */
+num_to_scan = outer_num_to_scan;
+buf_id = outer_buf_id;
+has_target = false;
+			}
+			else
+break; /* all done! */
+		}
+
 		/*
 		 * We don't need to acquire the lock here, because we're only looking
 		 * at a single bit. It's possible that someone else writes the buffer
@@ -1299,7 +1325,36 @@ BufferSync(int flags)
 		 */
 		if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+			RelFileNode this_rnode = bufHdr->tag.rnode;
+			ForkNumber	this_forkNum = bufHdr->tag.forkNum;
+			int			this_segno = bufHdr->tag.blockNum / (RELSEG_SIZE / BLCKSZ);
+			bool		skip = false;
+
+			if (has_target)
+			{
+if (!RelFileNodeEquals(this_rnode, target_rnode) ||
+	this_forkNum != target_forkNum ||
+	this_segno != target_segno)
+{
+	/*
+	 * This buffer belongs to another relation than the one
+	 * we're targeting right now. We'll revisit it later.
+	 */
+	skip = true;
+}
+			}
+			else
+			{
+target_rnode = this_rnode;
+target_forkNum = this_forkNum;
+target_segno = this_segno;
+has_target = true;
+/* remember where we left the outer scan */
+outer_buf_id = buf_id;
+outer_num_to_scan = num_to_scan;
+			}
+
+			if (!skip && SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
 			{
 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
 BgWriterStats.m_buf_written_checkpoints++;
@@ -1328,6 +1383,7 @@ BufferSync(int flags)
 
 		if (++buf_id >= NBuffers)
 			buf_id = 0;
+		num_to_scan--;
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c 

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
 On 08/27/2014 04:20 PM, Andres Freund wrote:
 On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
 I think a somewhat smarter version of the explicit flushes in the
 hack^Wpatch I posted nearby is going to more likely to be successful.
 
 
 That path is dangerous (as in, may not work as intended) if the
 filesystem doesn't properly understand range flushes (ehem, like
 ext3).
 
 The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
 guaranteeing durability. And - afaik - not implemented in a file system
 specific manner. It just initiates writeback for individual pages. It
 doesn't cause barrier, journal flushes or anything to be issued. That's
 still done by the fsync() later.
 
 The big disadvantage is that it's a OS specific solution, but I don't
 think we're going to find anything that isn't in this area.
 
 I've been thinking for a long time that we should interleave the writes and
 the fsyncs. That still forces up to 1GB of dirty buffers to disk at once,
 causing a spike, but at least not more than that.

I think there are considerable benefits to sorting checkpoint io by file
and offset in that file. Obviously the likelihood of sequential IO is
higher; but there's also less chance that other processes write out
dirty buffers that have to be flushed out by the fsync() in a drive by
manner.

I don't think it's good enough to solve the problem Fabien is talking
about though. 1GB is heck of a lot of IO to submit at once. That'll
cause latency issues unless you have a write back controller with more
than 1GB of cache.

So I think we need both, control over the amount of dirty data in the
kernel *and* sorted writeouts. To the point that I've been tinkering
with converting buftable.c into a radix tree. That'd allow to
efficiently scan all buffers of a filenode in order. Useful for
checkpoints, but also for lots of other things. Unfortunately our buffer
tags are freakishly huge, making the worst case memory requirements and
the depth of tree quite bad.

 A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
 them out in order 
 (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
 The performance impact of that was inconclusive, but one thing that it
 allows nicely is to interleave the fsyncs, so that you write all the buffers
 for one file, then fsync it, then next file and so on. IIRC the biggest
 worry with that patch was that sorting the buffers requires a fairly large
 amount of memory, and making a large allocation in the checkpointer might
 cause an out-of-memory, which would be bad.
 
 I don't think anyone's seriously worked on this area since. If the impact on
 responsiveness or performance is significant, I'm pretty sure the OOM
 problem could be alleviated somehow.

It's a major problem imo.

What I'd been thinking of is to do checkpoint writeout in batches. Collect
100k buffers, sort them, write them out. Go to the next 100k.
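
In rough terms, each batch would be collected into an array and sorted with a
comparator along these lines (the struct is hypothetical, for illustration
only; the real buffer tag also carries tablespace and database OIDs, which
would be part of the key):

    #include <stdint.h>
    #include <stdlib.h>

    /* one entry per dirty buffer collected for the current batch */
    typedef struct
    {
        uint32_t    relNode;        /* relation file node */
        int32_t     forkNum;
        uint32_t    blockNum;
        int         buf_id;         /* index back into the buffer pool */
    } CkptSortItem;

    /* order by file, then fork, then block, so the writes walk each
     * segment sequentially and its fsync can follow immediately */
    static int
    ckpt_item_cmp(const void *a, const void *b)
    {
        const CkptSortItem *x = a;
        const CkptSortItem *y = b;

        if (x->relNode != y->relNode)
            return x->relNode < y->relNode ? -1 : 1;
        if (x->forkNum != y->forkNum)
            return x->forkNum < y->forkNum ? -1 : 1;
        if (x->blockNum != y->blockNum)
            return x->blockNum < y->blockNum ? -1 : 1;
        return 0;
    }

    /* per batch: qsort(items, nitems, sizeof(CkptSortItem), ckpt_item_cmp); */

A fixed-size batch keeps the sort memory bounded, which sidesteps the
out-of-memory worry mentioned above for sorting the whole buffer pool at once.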

 For the kicks, I wrote a quick  dirty patch for interleaving the fsyncs,
 see attached. It works by repeatedly scanning the buffer pool, writing
 buffers belonging to a single relation segment at a time. I would be
 interested to hear how this performs in your test case.

I bet it's not fundamentally changing the amount of transactions that
don't make the deadline - there's more than enough dirty buffers in one
1GB segment to cause issues. But I think it might already be a
significant benefit for peak throughput *and* latency if you combine it
with my approach of initiating writeout to disk during the whole
sync. My patch can slow things down considerably in the worst case by
causing superflous random IO, which your patch should alleviate.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Jim Nasby

On 8/26/14, 8:45 PM, Michael Paquier wrote:

Hi all,

As mentioned here, we support multiple logging format:
http://www.postgresql.org/docs/devel/static/runtime-config-logging.html
Now what about a json format logging with one json object per log entry?

A single json entry would need more space than a csv one as we need to
track the field names with their values. Also, there is always the
argument that if an application needs json-format logs, it could use
csvlog on Postgres-side and do the transformation itself. But wouldn't
it be a win for application or tools if such an option is available
in-core?

Note that I am not planning to work on that in a close future, but it
would be a good TODO item for beginners if it is worth pursuing.


Perhaps instead of doing this in-core it would be better to make log handling more 
extensible? I'm thinking add a specific binary format and an external tool 
that can parse that and do whatever the user wants with it. That means we don't have to 
keep adding more complexity to the internal log handling (which already has the risk of 
being a bottleneck), while allowing maximum user flexibility.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Jim Nasby

On 8/27/14, 2:23 AM, Heikki Linnakangas wrote:

Does this make sense? In essence, make the relations work like PL/pgSQL 
variables do. If you squint a little, the new/old relation is a variable from 
the function's point of view, and a parameter from the planner/executor's point 
of view. It's just a variable/parameter that holds a set of tuples, instead of 
a single Datum.


Something to keep in mind is that users will definitely think about NEW/OLD as 
tables. I suspect that it won't be long after release before someone asks why 
they can't create an index on it. :)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Andres Freund
On 2014-08-26 23:04:48 -0500, Jim Nasby wrote:
 On 8/26/14, 8:45 PM, Michael Paquier wrote:
 Hi all,
 
 As mentioned here, we support multiple logging format:
 http://www.postgresql.org/docs/devel/static/runtime-config-logging.html
 Now what about a json format logging with one json object per log entry?
 
 A single json entry would need more space than a csv one as we need to
 track the field names with their values. Also, there is always the
 argument that if an application needs json-format logs, it could use
 csvlog on Postgres-side and do the transformation itself. But wouldn't
 it be a win for application or tools if such an option is available
 in-core?
 
 Note that I am not planning to work on that in a close future, but it
 would be a good TODO item for beginners if it is worth pursuing.
 
 Perhaps instead of doing this in-core it would be better to make log handling 
 more extensible? I'm thinking add a specific binary format and an external 
 tool that can parse that and do whatever the user wants with it. That means 
 we don't have to keep adding more complexity to the internal log handling 
 (which already has the risk of being a bottleneck), while allowing maximum 
 user flexibility.

There's a logging hook. Most of this should be doable from there.
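
For reference, a bare-bones sketch of such a module built on emit_log_hook
(loaded via shared_preload_libraries; proper JSON escaping and real output
handling are deliberately left out):

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/elog.h"

    PG_MODULE_MAGIC;

    void        _PG_init(void);

    static emit_log_hook_type prev_emit_log_hook = NULL;

    /* Emit one line per log entry; a real module would write properly
     * escaped JSON to its own file or socket instead of stderr. */
    static void
    json_emit_log(ErrorData *edata)
    {
        fprintf(stderr, "{\"elevel\": %d, \"message\": \"%s\"}\n",
                edata->elevel,
                edata->message ? edata->message : "");

        if (prev_emit_log_hook)
            prev_emit_log_hook(edata);
    }

    void
    _PG_init(void)
    {
        prev_emit_log_hook = emit_log_hook;
        emit_log_hook = json_emit_log;
    }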

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO



off:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
number of skipped transactions: 1345 (6.246 %)

on:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
number of skipped transactions: 1 (0.005 %)



That machine is far from idle right now, so the noise is pretty high.


What is the OS and FS? Could it be XFS?


But rather nice initial results.


Indeed, I can confirm:

I did 5000s 25tps tests:
 - Off: 8002 transactions lost (6.3%)
 - On: 158 transactions lost (0.12%).

Although it is still 13 times larger than the 12 (0.01%) lost with my 
every 0.2s CHECKPOINT hack, it is nevertheless much much better than 
before!


The bad news: under pgbench unthrottled load, the tps is divided by 2 (300
-> 150, could have been worse), *BUT* it is also much smoother; the tps is
not going to 0, but stays in the 50-100 range before the next spike.


I'm wondering about the order of operations. It seems to me that you sync 
just after giving back a buffer. Maybe it would be better to pipeline it, 
that is something like:


  round 0:
send buffers 0
sleep?

  round N:
sync buffers N-1
send buffers N
sleep?

  final N sync:
sync buffer N

I have not found how to control the checkpoint pacing interval, if there 
is such a thing. With a 200ms lag limit on pgbench, it would be nice if it 
is less than 200ms.


I found this old thread Add basic checkpoint sync spreading by Greg 
Smith and Simon Riggs, dating from 2010: 
http://www.postgresql.org/message-id/4ce07548.4030...@2ndquadrant.com 
https://commitfest.postgresql.org/action/patch_view?id=431 which ends up 
returned with feedback.


--
Fabien.




Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-27 Thread Thomas Munro
On 27 August 2014 17:18, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Thomas Munro wrote:
 On 25 August 2014 02:57, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
  Thomas Munro wrote:
  The difficulty of course will be testing all these racy cases 
  reproducibly...
 
  Does this help?
  http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com
  The useful trick there is forcing a query to get its snapshot and then
  go to sleep before actually doing anything, by way of an advisory lock.

 Yes it does, thanks Alvaro and Craig.  I think the attached spec
 reproduces the problem using that trick, ie shows NOWAIT blocking,
 presumably in EvalPlanQualFetch (though I haven't stepped through it
 with a debugger yet).  I'm afraid I'm out of Postgres hacking cycles
 for a few days, but next weekend I should have a new patch that fixes
 this by teaching EvalPlanQualFetch about wait policies, with isolation
 tests for NOWAIT and SKIP LOCKED.

 Hmm, http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com

Thanks, I hadn't seen this, I should have checked the archives better.
I have actually already updated my patch to handle EvalPlanQualFetch
with NOWAIT and SKIP LOCKED with isolation specs, see attached.  I
will compare with Craig's and see if I screwed anything up... of
course I am happy to merge and submit a new patch on top of Craig's if
it's going to be committed.

I haven't yet figured out how to get get into a situation where
heap_lock_updated_tuple_rec waits.

Best regards,
Thomas Munro
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 231dc6a..0469705 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -45,7 +45,7 @@ SELECT [ ALL | DISTINCT [ ON ( replaceable class=parameterexpression/replac
 [ LIMIT { replaceable class=parametercount/replaceable | ALL } ]
 [ OFFSET replaceable class=parameterstart/replaceable [ ROW | ROWS ] ]
 [ FETCH { FIRST | NEXT } [ replaceable class=parametercount/replaceable ] { ROW | ROWS } ONLY ]
-[ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF replaceable class=parametertable_name/replaceable [, ...] ] [ NOWAIT ] [...] ]
+[ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF replaceable class=parametertable_name/replaceable [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 phrasewhere replaceable class=parameterfrom_item/replaceable can be one of:/phrase
 
@@ -1283,7 +1283,7 @@ FETCH { FIRST | NEXT } [ replaceable class=parametercount/replaceable ] {
 The locking clause has the general form
 
 synopsis
-FOR replaceablelock_strength/ [ OF replaceable class=parametertable_name/replaceable [, ...] ] [ NOWAIT ]
+FOR replaceablelock_strength/ [ OF replaceable class=parametertable_name/replaceable [, ...] ] [ NOWAIT | SKIP LOCKED ]
 /synopsis
 
 where replaceablelock_strength/ can be one of
@@ -1359,11 +1359,17 @@ KEY SHARE
 
para
 To prevent the operation from waiting for other transactions to commit,
-use the literalNOWAIT/ option.  With literalNOWAIT/, the statement
-reports an error, rather than waiting, if a selected row
-cannot be locked immediately.  Note that literalNOWAIT/ applies only
-to the row-level lock(s) mdash; the required literalROW SHARE/literal
-table-level lock is still taken in the ordinary way (see
+use either the literalNOWAIT/ or literalSKIP LOCKED/literal
+option.  With literalNOWAIT/, the statement reports an error, rather
+than waiting, if a selected row cannot be locked immediately.
+With literalSKIP LOCKED/literal, any selected rows that cannot be
+immediately locked are skipped.  Skipping locked rows provides an
+inconsistent view of the data, so this is not suitable for general purpose
+work, but can be used to avoid lock contention with multiple consumers
+accessing a queue-like table.  Note that literalNOWAIT/
+and literalSKIP LOCKED/literal apply only to the row-level lock(s)
+mdash; the required literalROW SHARE/literal table-level lock is
+still taken in the ordinary way (see
 xref linkend=mvcc).  You can use
 xref linkend=sql-lock
 with the literalNOWAIT/ option first,
@@ -1386,14 +1392,14 @@ KEY SHARE
/para
 
para
-Multiple locking
-clauses can be written if it is necessary to specify different locking
-behavior for different tables.  If the same table is mentioned (or
-implicitly affected) by more than one locking clause,
-then it is processed as if it was only specified by the strongest one.
-Similarly, a table is processed
-as literalNOWAIT/ if that is specified in any of the clauses
-affecting it.
+Multiple locking clauses can be written if it is necessary to specify
+different locking behavior for different tables.  If the same table is
+mentioned (or implicitly affected) by more than one locking clause, then
+it is processed as if it was only specified by the strongest one.
+

Re: [HACKERS] pgbench throttling latency limit

2014-08-27 Thread Fabien COELHO



[...]


Yeah, something like that. I don't think it would be necessary to set 
statement_timeout, you can inject that in your script or postgresql.conf if 
you want. I don't think aborting a transaction that's already started is 
necessary either. You could count it as LATE, but let it finish first.


I've implemented something along these simplified lines. The latency is 
not limited as such, but slow (over the limit) queries are counted and 
reported.


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..96e5fb9 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64		latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
+	int64		latency_late; /* late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
+	int64		latency_late;
 } TResult;
 
 /*
@@ -367,6 +383,10 @@ usage(void)
 		   -f, --file=FILENAME  read transaction script from FILENAME\n
 		 -j, --jobs=NUM   number of threads (default: 1)\n
 		 -l, --logwrite transaction times to log file\n
+		 -L, --limit=NUM  count transactions lasting more than NUM ms.\n
+		  under throttling (--rate), transactions behind schedule\n
+		  more than NUM ms are skipped, and those finishing more\n
+		  than NUM ms after their scheduled start are counted.\n
 		 -M, --protocol=simple|extended|prepared\n
 		  protocol for submitting queries (default: simple)\n
 		 -n, --no-vacuum  do not run VACUUM before tests\n
@@ -1016,6 +1036,24 @@ top:
 
 		thread->throttle_trigger += wait;
 
+		if (latency_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - latency_limit)
+			{
+/* if too far behind, this slot is skipped, and we
+ * iterate till the next nearly on time slot.
+ */
+int64 wait = (int64) (throttle_delay *
+	1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+thread->throttle_trigger += wait;
+thread->throttle_latency_skipped ++;
+			}
+		}
+
 		st->until = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
@@ -1080,13 +1118,17 @@ top:
 			thread->exec_count[cnum]++;
 		}
 
-		/* transaction finished: record latency under progress or throttling */
-		if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+		/* transaction finished: record latency under progress or throttling,
+		 * or if latency limit is set
+		 */
+		if ((progress || throttle_delay || latency_limit) &&
+			commands[st->state + 1] == NULL)
 		{
 			instr_time	diff;
-			int64		latency;
+			int64		latency, now;
 
 			INSTR_TIME_SET_CURRENT(diff);
+			now = INSTR_TIME_GET_MICROSEC(diff);
 			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
 			latency = INSTR_TIME_GET_MICROSEC(diff);
 			st->txn_latencies += latency;
@@ -1099,6 +1141,19 @@ top:
 			 * would take 256 hours.
 			 */
 			st->txn_sqlats += latency * latency;
+
+			/* record over the limit transactions if needed.
+			 */
+			if (latency_limit)
+			{
+/* Under throttling, late means wrt to the initial schedule,
+ * not the actual transaction start
+ */
+if (throttle_delay)
+	latency = now - thread->throttle_trigger;
+if (latency > latency_limit)
+	thread->latency_late++;
+			}
 		}
 
 		/*
@@ -1294,7 +1349,7 @@ top:
 	}
 
 	/* Record transaction start time under logging, progress or throttling */
-	if ((logfile || progress || throttle_delay) && st->state == 0)
+	if ((logfile || progress || throttle_delay || latency_limit) && st->state == 0)
 		INSTR_TIME_SET_CURRENT(st->txn_begin);
 
 	/* Record statement start time if per-command latencies are requested */
@@ -2351,7 +2406,8 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, 

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 19:00:12 +0200, Fabien COELHO wrote:
 
 off:
 
 $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 
 200
 number of skipped transactions: 1345 (6.246 %)
 
 on:
 
 $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 
 200
 number of skipped transactions: 1 (0.005 %)
 
 That machine is far from idle right now, so the noise is pretty high.
 
 What is the OS and FS? Could it be XFS?

That's linux v3.17-rc2 + ext4.

 But rather nice initial results.
 
 Indeed, I can confirm:
 
 I did 5000s 25tps tests:
  - Off: 8002 transactions lost (6.3%)
  - On: 158 transactions lost (0.12%).
 
 Although it is still 13 times larger than the 12 (0.01%) lost with my every
 0.2s CHECKPOINT hack, it is nevertheless much much better than before!
 
 The bad news, under pgbench unthrottled load, the tps is divided by 2 (300
 - 150, could have been worse), *BUT* is also much smoother, the tps is not
 going to 0, but stay in 50-100 range before the next spike.

Yea, I'm not surprised. With a sensible (aka larger) checkpoint_timeout
the performance penalty isn't that big, but it's there. That's why I
think (as mentioned to Heikki nearby) it needs to be combined with
sorting during the checkpoint phase.

 I'm wondering about he order of operations. It seems to me that you sync
 just after giving back a buffer.

Yep. Was just a rather quick patch...

 Maybe it would be better to pipeline it,
 that is something like:
 
   round 0:
 send buffers 0
 sleep?
 
   round N:
 sync buffers N-1
 send buffers N
 sleep?
 
   final N sync:
 sync buffer N

Yes, I think we're going to need to leave a bit more room for write
combining and such here. But I think it's going to be better to issue
flushes for several buffers together - just not after each write(). To
be really beneficial it needs sorted output though.

 I have not found how to control the checkpoint pacing interval, if there is
 such a thing. With a 200ms lag limit on pgbench, it would be nice if it is
 less than 200ms.

Not sure what you mean.

 I found this old thread Add basic checkpoint sync spreading by Greg Smith
 and Simons Riggs, dating from 2010:
 http://www.postgresql.org/message-id/4ce07548.4030...@2ndquadrant.com
 https://commitfest.postgresql.org/action/patch_view?id=431 which ends up
 returned with feedback.

I didn't really like the unapplied remainder of what was proposed in
there.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-27 Thread Jim Nasby

On 8/26/14, 8:40 AM, Heikki Linnakangas wrote:

Just so everyone is on the same page on what kind of queries this helps with, 
here are some examples from the added regression tests:


-- Test join removals for semi and anti joins
CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY, val INT);
CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT REFERENCES b(id));
-- should remove semi join to b
EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE b_id IN(SELECT id FROM b);

snip

SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);


I also fail to see a use for examples that are that silly *unless* we're 
talking machine-generated SQL, but I suspect that normally uses JOINS.

Where I would expect this to be useful is in cases where we can pre-evaluate 
some other condition in the subqueries to make the subqueries useless (ie: 
SELECT id FROM b WHERE 1=1), or where the condition could be passed through 
(ie: SELECT id FROM b WHERE id=42). Another possibility would be if there's a 
condition in the subquery that could trigger constraint elimination.

Those are the real world cases I'd expect to see from anything reasonably sane (an 
adjective that doesn't always apply to some of the users I have to support...) My $0.01 
on the burden of carrying the useless tests and code around is that it 
doesn't seem like all that much overhead...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Similar to csvlog but not really, json logs?

2014-08-27 Thread Petr Jelinek

On 27/08/14 18:53, Andres Freund wrote:

On 2014-08-26 23:04:48 -0500, Jim Nasby wrote:

On 8/26/14, 8:45 PM, Michael Paquier wrote:

Hi all,

As mentioned here, we support multiple logging format:
http://www.postgresql.org/docs/devel/static/runtime-config-logging.html
Now what about a json format logging with one json object per log entry?

A single json entry would need more space than a csv one as we need to
track the field names with their values. Also, there is always the
argument that if an application needs json-format logs, it could use
csvlog on Postgres-side and do the transformation itself. But wouldn't
it be a win for application or tools if such an option is available
in-core?

Note that I am not planning to work on that in a close future, but it
would be a good TODO item for beginners if it is worth pursuing.


Perhaps instead of doing this in-core it would be better to make log handling more 
extensible? I'm thinking add a specific binary format and an external tool 
that can parse that and do whatever the user wants with it. That means we don't have to 
keep adding more complexity to the internal log handling (which already has the risk of 
being a bottleneck), while allowing maximum user flexibility.


There's a logging hook. Most of this should be doable from there.



Yes, as demonstrated by https://github.com/mpihlak/pg_logforward


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread David Fetter
On Wed, Aug 27, 2014 at 11:51:40AM -0500, Jim Nasby wrote:
 On 8/27/14, 2:23 AM, Heikki Linnakangas wrote:
 Does this make sense? In essence, make the relations work like
 PL/pgSQL variables do. If you squint a little, the new/old relation
 is a variable from the function's point of view, and a parameter
 from the planner/executor's point of view. It's just a
 variable/parameter that holds a set of tuples, instead of a single
 Datum.
 
 Something to keep in mind is that users will definitely think about
 NEW/OLD as tables. I suspect that it won't be long after release
 before someone asks why they can't create an index on it. :)

Continuing with this digression, that request seems more likely with
views and foreign tables, given that they persist across statements.
I'm given to understand that other systems have at least the former.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Parallel Sequence Scan doubts

2014-08-27 Thread Jim Nasby

On 8/24/14, 6:22 AM, Haribabu Kommi wrote:

Yes, we are mainly targeting CPU-limited sequential scans. For that reason
I want the worker to handle the predicates as well, not just read the
tuples from disk.


In that case, I would suggest focusing on parallel execution of conditions 
regardless of where they show up in the query plan. In my experience, they 
often have nothing to do with a seqscan.

Here's a real-world example. We have a view that pivots our application's 
accounting journal into a ledger. The expensive part of the view is this:

sum(
    CASE
        WHEN b.tag::text = 'installment_principal'::text THEN b.type_cd  -- type_cd is either 1, 0, or -1
        ELSE 0::numeric
    END
) * transaction_amount AS installment_principal

The view with this pivot has about 100 of these CASE expressions. Frequently we 
only reference a few of them, but anytime we need to refer to 20+ of them, the 
evaluation of that expression gets VERY CPU-expensive compared to the rest of 
the query.

The other thing I would look at before seqscan filters is join processing and 
bitmap index combining (ie: ANDing together the results of several bitmap 
index scans). Those are things that can be very CPU-intensive even when doing 
simple equality comparisons.
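
For instance, against a made-up table (hypothetical names, just to show the plan shape I mean):

-- Hypothetical schema, for illustration only.
CREATE TABLE orders (
    order_id    bigint,
    customer_id int,
    order_date  date,
    amount      numeric
);
CREATE INDEX ON orders (customer_id);
CREATE INDEX ON orders (order_date);

-- With enough rows matching each condition, the planner may combine the two
-- bitmap index scans with a BitmapAnd before touching the heap -- CPU-heavy
-- work even though the individual comparisons are simple.
EXPLAIN
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_date >= DATE '2014-01-01';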

BTW, it's also possible that these cases would be good fits for GPU parallel 
execution.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Function to know last log write timestamp

2014-08-27 Thread Jim Nasby

On 8/27/14, 7:33 AM, Fujii Masao wrote:

On Tue, Aug 19, 2014 at 1:07 AM, Robert Haas robertmh...@gmail.com wrote:

On Fri, Aug 15, 2014 at 7:17 AM, Fujii Masao masao.fu...@gmail.com wrote:

On Fri, Aug 15, 2014 at 3:40 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-08-14 14:37:22 -0400, Robert Haas wrote:

On Thu, Aug 14, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-08-14 14:19:13 -0400, Robert Haas wrote:

That's about the idea. However, what you've got there is actually
unsafe, because shmem->counter++ is not an atomic operation.  It reads
the counter (possibly even as two separate 4-byte loads if the counter
is an 8-byte value), increments it inside the CPU, and then writes the
resulting value back to memory.  If two backends do this concurrently,
one of the updates might be lost.


All these are only written by one backend, so it should be safe. Note
that that coding pattern, just without memory barriers, is all over
pgstat.c


Ah, OK.  If there's a separate slot for each backend, I agree that it's safe.

We should probably add barriers to pgstat.c, too.


Yea, definitely. I think this is rather borked on weaker
architectures. It's just that the consequences of an out of date/torn
value are rather low, so it's unlikely to be noticed.

Imo we should encapsulate the changecount modifications/checks somehow
instead of repeating the barriers, Asserts, comments et al everywhere.


So what about applying the attached patch first, which adds the macros
to load and store the changecount with the memory barriers, and changes
pgstat.c to use them. Maybe this patch needs to be back-patched to at least 9.4?

After applying the patch, I will rebase the pg_last_xact_insert_timestamp
patch and post it again.


That looks OK to me on a relatively-quick read-through.  I was
initially a bit worried about this part:

   do
   {
! pgstat_increment_changecount_before(beentry);
   } while ((beentry->st_changecount & 1) == 0);

pgstat_increment_changecount_before is an increment followed by a
write barrier.  This seemed like funny coding to me at first because the
while-test isn't protected by any sort of barrier.  But now I think
it's correct, because there's only one process that can possibly write
to that data, and that's the one that is making the test, and it had
certainly better see its own modifications in program order no matter
what.

I wouldn't object to back-patching this to 9.4 if we were earlier in
the beta cycle, but at this point I'm more inclined to just put it in
9.5.  If we get an actual bug report about any of this, we can always
back-patch the fix at that time.  But so far that seems mostly
hypothetical, so I think the less-risky course of action is to give
this a longer time to bake before it hits an official release.


Sounds reasonable. So, barring any objection, I will apply the patch
only to the master branch.


It's probably worth adding a comment explaining why it's safe to do this 
without a barrier...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-27 Thread Jim Nasby

On 8/27/14, 1:38 AM, Andres Freund wrote:

It occurs to me that it'd also be nice to have some
stats available on how this is performing; perhaps a dtrace probe for
whenever we overflow to the hash table, and one that shows maximum
usage for a statement? (Presumably that's not much extra code or
overhead...)

I don't use dtrace, so *I* won't do that. Personally I just dynamically
add probes using perf probe when I need to track something like this.


Yeah, I didn't mean dtrace directly; don't we have some macro that equates to 
dtrace or perf-probe depending on architecture?


I don't see how you could track maximum usage without more
complications/slowdowns than warranted.


I was thinking we'd only show maximum if we overflowed, but maybe it's still 
too much overhead in that case.

In any case, I was thinking this would be trivial to add now, but if it's not 
then someone can do it when there's actual need.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-27 Thread Tom Lane
Jim Nasby j...@nasby.net writes:
 On 8/26/14, 8:40 AM, Heikki Linnakangas wrote:
 Just so everyone is on the same page on what kind of queries this helps 
 with, here are some examples from the added regression tests:
 
 -- Test join removals for semi and anti joins
 CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY, val INT);
 CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT REFERENCES b(id));
 -- should remove semi join to b
 EXPLAIN (COSTS OFF)
 SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
 snip
 SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);

 I also fail to see a use for examples that are that silly *unless* we're 
 talking machine-generated SQL, but I suspect that normally uses JOINS.

 Where I would expect this to be useful is in cases where we can pre-evaluate 
 some other condition in the subqueries to make the subqueries useless (ie: 
 SELECT id FROM b WHERE 1=1), or where the condition could be passed through 
 (ie: SELECT id FROM b WHERE id=42). Another possibility would be if there's a 
 condition in the subquery that could trigger constraint elimination.

Unless I'm misunderstanding something, pretty much *any* WHERE restriction
in the subquery would defeat this optimization, since it would no longer
be certain that there was a match to an arbitrary outer-query row.  So
it seems unlikely to me that this would fire in enough real-world cases
to be worth including.  I am definitely not a fan of carrying around
deadwood in the planner.

If the majority of the added code is code that will be needed for
less-bogus optimizations, it might be all right; but I'd kind of want to
see the less-bogus optimizations working first.

regards, tom lane




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-27 Thread Arthur Silva
On Wed, Aug 27, 2014 at 1:09 AM, Arthur Silva arthur...@gmail.com wrote:

 It won't be faster by any means, but it should definitely be incorporated
 if any format changes are made (like Tom already suggested).

 I think it's important we gather at least 2 more things before making any
 calls:
 * Josh tests w/ cache aware patch, which should confirm cache aware is
 indeed preferred
 * Tests with toast hacked to use lz4 instead, which might ease any
 decisions


 --
 Arthur Silva



 On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com
 wrote:
  The difference is small but it's definitely faster, which makes sense
 since
  cache line misses are probably slightly reduced.
  As in the previous runs, I ran the query a dozen times and took the
 average
  after excluding runs with a high deviation.

 I'm not surprised that it hasn't beaten HEAD. I haven't studied the
 problem in detail, but I don't think that the cache awareness of the
 new revision is necessarily a distinct advantage.

 --
 Peter Geoghegan



I'm attaching a quick-n-dirty patch that uses lz4 compression instead of
pglz in case someone wants to experiment with it. Seems to work in my test
env; I'll run more tests when I get home.

PS: gotta love gmail fixed defaults of top-posting...


lz4.patch
Description: Binary data



[HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
Hi

one user asked about using partitioning for faster aggregate queries.

I found that there is currently no such optimization.

create table x1(a int, d date);
create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);

When I have this schema, the optimizer does this:

postgres=# explain verbose select max(a) from x1 group by d order by d;
                                  QUERY PLAN
------------------------------------------------------------------------------
 GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
   Output: max(x1.a), x1.d
   Group Key: x1.d
   ->  Sort  (cost=684.79..706.19 rows=8561 width=8)
         Output: x1.d, x1.a
         Sort Key: x1.d
         ->  Append  (cost=0.00..125.60 rows=8561 width=8)
               ->  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
                     Output: x1.d, x1.a
               ->  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140 width=8)
                     Output: x_1.d, x_1.a
               ->  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140 width=8)
                     Output: x_2.d, x_2.a
               ->  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140 width=8)
                     Output: x_3.d, x_3.a
               ->  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140 width=8)
                     Output: x_4.d, x_4.a
 Planning time: 0.333 ms

It can be reduced to:

  sort by d
  Append
    Aggregate (a), d
      seq scan from x_1
    Aggregate (a), d
      seq scan from x_2
Are there some plans to use partitioning for aggregation?

Regards

Pavel


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Merlin Moncure
On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Hi

 one user asked about using a partitioning for faster aggregates queries.

 I found so there is not any optimization.

 create table x1(a int, d date);
 create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
 create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
 create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);

 When I have this schema, then optimizer try to do

 postgres=# explain verbose select max(a) from x1 group by d order by d;
QUERY PLAN
 
  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
Output: max(x1.a), x1.d
Group Key: x1.d
-  Sort  (cost=684.79..706.19 rows=8561 width=8)
  Output: x1.d, x1.a
  Sort Key: x1.d
  -  Append  (cost=0.00..125.60 rows=8561 width=8)
-  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
  Output: x1.d, x1.a
-  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_1.d, x_1.a
-  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_2.d, x_2.a
-  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_3.d, x_3.a
-  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_4.d, x_4.a
  Planning time: 0.333 ms

 It can be reduced to:

   sort by d
   Append
Aggegate (a), d
  seq scan from x_1
Aggregate (a), d
  seq scan from x_2

 Are there some plans to use partitioning for aggregation?

Besides min/max, what other aggregates (mean/stddev come to mind)
would you optimize and how would you determine which ones could be?
Where is that decision made?

For example, could user defined aggregates be pushed down if you had a
reaggregation routine broken out from the main one?

merlin




Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 4:41 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 Hi

 one user asked about using a partitioning for faster aggregates queries.

 I found so there is not any optimization.

 create table x1(a int, d date);
 create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
 create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
 create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);

 When I have this schema, then optimizer try to do

 postgres=# explain verbose select max(a) from x1 group by d order by d;
QUERY PLAN
 
  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
Output: max(x1.a), x1.d
Group Key: x1.d
-  Sort  (cost=684.79..706.19 rows=8561 width=8)
  Output: x1.d, x1.a
  Sort Key: x1.d
  -  Append  (cost=0.00..125.60 rows=8561 width=8)
-  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
  Output: x1.d, x1.a
-  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_1.d, x_1.a
-  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_2.d, x_2.a
-  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_3.d, x_3.a
-  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_4.d, x_4.a
  Planning time: 0.333 ms

 It can be reduced to:

   sort by d
   Append
Aggegate (a), d
  seq scan from x_1
Aggregate (a), d
  seq scan from x_2

 Are there some plans to use partitioning for aggregation?

 Besides min/max, what other aggregates (mean/stddev come to mind)
 would you optimize and how would you determine which ones could be?
 Where is that decision made?


You can't with mean and stddev, only with associative aggregates.

That's min, max, sum, bit_and, bit_or, bool_and, bool_or, count.




Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
2014-08-27 21:41 GMT+02:00 Merlin Moncure mmonc...@gmail.com:

 On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule pavel.steh...@gmail.com
 wrote:
  Hi
 
  one user asked about using a partitioning for faster aggregates queries.
 
  I found so there is not any optimization.
 
  create table x1(a int, d date);
  create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
  create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
  create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);
 
  When I have this schema, then optimizer try to do
 
  postgres=# explain verbose select max(a) from x1 group by d order by d;
 QUERY PLAN
 
 
   GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
 Output: max(x1.a), x1.d
 Group Key: x1.d
 -  Sort  (cost=684.79..706.19 rows=8561 width=8)
   Output: x1.d, x1.a
   Sort Key: x1.d
   -  Append  (cost=0.00..125.60 rows=8561 width=8)
 -  Seq Scan on public.x1  (cost=0.00..0.00 rows=1
 width=8)
   Output: x1.d, x1.a
 -  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_1.d, x_1.a
 -  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_2.d, x_2.a
 -  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_3.d, x_3.a
 -  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_4.d, x_4.a
   Planning time: 0.333 ms
 
  It can be reduced to:
 
sort by d
Append
 Aggegate (a), d
   seq scan from x_1
 Aggregate (a), d
   seq scan from x_2
 
  Are there some plans to use partitioning for aggregation?

 Besides min/max, what other aggregates (mean/stddev come to mind)
 would you optimize and how would you determine which ones could be?
 Where is that decision made?


I am thinking that all aggregates are possible.

When you have partitions by column X, then you have natural sets by X,
so you can directly calculate any aggregate on any column when the GROUP BY
clause is exactly GROUP BY X, can't you?

Probably some similar optimizations are possible when you have GROUP BY
X, Y -- at minimum you have more sets, and you can do the aggregations on
smaller sets.

Pavel



 For example, could user defined aggregates be pushed down if you had a
 reaggregation routine broken out from the main one?

 merlin



Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
2014-08-27 21:46 GMT+02:00 Claudio Freire klaussfre...@gmail.com:

 On Wed, Aug 27, 2014 at 4:41 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule pavel.steh...@gmail.com
 wrote:
  Hi
 
  one user asked about using a partitioning for faster aggregates queries.
 
  I found so there is not any optimization.
 
  create table x1(a int, d date);
  create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
  create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
  create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);
 
  When I have this schema, then optimizer try to do
 
  postgres=# explain verbose select max(a) from x1 group by d order by d;
 QUERY PLAN
 
 
   GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
 Output: max(x1.a), x1.d
 Group Key: x1.d
 -  Sort  (cost=684.79..706.19 rows=8561 width=8)
   Output: x1.d, x1.a
   Sort Key: x1.d
   -  Append  (cost=0.00..125.60 rows=8561 width=8)
 -  Seq Scan on public.x1  (cost=0.00..0.00 rows=1
 width=8)
   Output: x1.d, x1.a
 -  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_1.d, x_1.a
 -  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_2.d, x_2.a
 -  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_3.d, x_3.a
 -  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
  width=8)
   Output: x_4.d, x_4.a
   Planning time: 0.333 ms
 
  It can be reduced to:
 
sort by d
Append
 Aggegate (a), d
   seq scan from x_1
 Aggregate (a), d
   seq scan from x_2
 
  Are there some plans to use partitioning for aggregation?
 
  Besides min/max, what other aggregates (mean/stddev come to mind)
  would you optimize and how would you determine which ones could be?
  Where is that decision made?


 You can't with mean and stddev, only with associative aggregates.

 That's min, max, sum, bit_and, bit_or, bool_and, bool_or, count.


I don't think so.

If I have partitions by X and my query has a GROUP BY X clause,
then I can calculate any aggregate.

Pavel


Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Merlin Moncure
On Wed, Aug 27, 2014 at 2:46 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 4:41 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 2:07 PM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 Hi

 one user asked about using a partitioning for faster aggregates queries.

 I found so there is not any optimization.

 create table x1(a int, d date);
 create table x_1 ( check(d = '2014-01-01'::date)) inherits(x1);
 create table x_2 ( check(d = '2014-01-02'::date)) inherits(x1);
 create table x_3 ( check(d = '2014-01-03'::date)) inherits(x1);

 When I have this schema, then optimizer try to do

 postgres=# explain verbose select max(a) from x1 group by d order by d;
QUERY PLAN
 
  GroupAggregate  (cost=684.79..750.99 rows=200 width=8)
Output: max(x1.a), x1.d
Group Key: x1.d
-  Sort  (cost=684.79..706.19 rows=8561 width=8)
  Output: x1.d, x1.a
  Sort Key: x1.d
  -  Append  (cost=0.00..125.60 rows=8561 width=8)
-  Seq Scan on public.x1  (cost=0.00..0.00 rows=1 width=8)
  Output: x1.d, x1.a
-  Seq Scan on public.x_1  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_1.d, x_1.a
-  Seq Scan on public.x_2  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_2.d, x_2.a
-  Seq Scan on public.x_3  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_3.d, x_3.a
-  Seq Scan on public.x_4  (cost=0.00..31.40 rows=2140
 width=8)
  Output: x_4.d, x_4.a
  Planning time: 0.333 ms

 It can be reduced to:

   sort by d
   Append
Aggegate (a), d
  seq scan from x_1
Aggregate (a), d
  seq scan from x_2

 Are there some plans to use partitioning for aggregation?

 Besides min/max, what other aggregates (mean/stddev come to mind)
 would you optimize and how would you determine which ones could be?
 Where is that decision made?


 You can't with mean and stddev, only with associative aggregates.

The associative bit just makes it easier (which is important, of course!).
mean, for example, can be pushed down if the 'pushed down' aggregates
return the count to the reaggregator so that you can weight the
final average.  That's a lot more complicated though.
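
For example, that count-weighted reaggregation is easy to spell out by hand in SQL against the x_N toy tables from upthread (a sketch of the idea only):

-- Partial sum and count per partition are enough to rebuild the global mean.
WITH partials AS (
    SELECT d, sum(a) AS s, count(a) AS n FROM ONLY x_1 GROUP BY d
    UNION ALL
    SELECT d, sum(a), count(a) FROM ONLY x_2 GROUP BY d
)
SELECT d, sum(s)::numeric / sum(n) AS avg_a
FROM partials
GROUP BY d;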

merlin




Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-27 Thread Alvaro Herrera
Thomas Munro wrote:
 On 27 August 2014 17:18, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
  Thomas Munro wrote:

  Yes it does, thanks Alvaro and Craig.  I think the attached spec
  reproduces the problem using that trick, ie shows NOWAIT blocking,
  presumably in EvalPlanQualFetch (though I haven't stepped through it
  with a debugger yet).  I'm afraid I'm out of Postgres hacking cycles
  for a few days, but next weekend I should have a new patch that fixes
  this by teaching EvalPlanQualFetch about wait policies, with isolation
  tests for NOWAIT and SKIP LOCKED.
 
  Hmm, http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com
 
 Thanks, I hadn't seen this, I should have checked the archives better.
 I have actually already updated my patch to handle EvalPlanQualFetch
 with NOWAIT and SKIP LOCKED with isolation specs, see attached.  I
 will compare with Craig's and see if I screwed anything up... of
 course I am happy to merge and submit a new patch on top of Craig's if
 it's going to be committed.

I tried Craig's patch with your test case and found that it stalls in
XactLockTableWait inside EPQFetch because it doesn't throw an error in
the noWait case before waiting.  I think I will fix that and push,
including both test cases.  Then we can see about rebasing your patch.

I am wondering about backpatching Craig's fix.  It looks to me like it
should be backpatched as far back as NOWAIT exists, but that was in 8.1
and we haven't ever gotten a complaint until Craig's report AFAIK, which
I understand wasn't coming from a user finding a problem but rather some
new development.  So I hesitate.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 associative bit just makes it easier (which is important of course!).
 mean for example can be pushed down if the 'pushed down' aggregates
 return to the count to the reaggregator so that you can weight the
 final average.  that's a lot more complicated though.

The real question is what you're expecting to get out of such an
optimization.  If the aggregate has to visit all rows then it's
not apparent to me that any win emerges from the extra complication.

We do already have optimization of min/max across inheritance trees,
and that's certainly a win because you don't have to visit all rows.

regression=# create table pp(f1 int unique);
CREATE TABLE
regression=# create table cc(unique(f1)) inherits(pp);
CREATE TABLE
regression=# create table cc2(unique(f1)) inherits(pp);
CREATE TABLE
regression=# explain select max(f1) from pp;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Result  (cost=0.51..0.52 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.46..0.51 rows=1 width=4)
           ->  Merge Append  (cost=0.46..267.71 rows=4777 width=4)
                 Sort Key: pp.f1
                 ->  Index Only Scan Backward using pp_f1_key on pp  (cost=0.12..8.14 rows=1 width=4)
                       Index Cond: (f1 IS NOT NULL)
                 ->  Index Only Scan Backward using cc_f1_key on cc  (cost=0.15..85.94 rows=2388 width=4)
                       Index Cond: (f1 IS NOT NULL)
                 ->  Index Only Scan Backward using cc2_f1_key on cc2  (cost=0.15..85.94 rows=2388 width=4)
                       Index Cond: (f1 IS NOT NULL)
 Planning time: 0.392 ms
(12 rows)

That doesn't currently extend to the GROUP BY case unfortunately.

regards, tom lane




Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-08-27 Thread Pavel Stehule
Hi

I chose \? xxx because it is related to psql features. I added the commands:

\? options
\? variables

comments?

Regards

Pavel



2014-08-26 13:48 GMT+02:00 Andres Freund and...@2ndquadrant.com:

 On 2014-08-26 13:44:16 +0200, Pavel Stehule wrote:
  2014-08-26 13:30 GMT+02:00 Petr Jelinek p...@2ndquadrant.com:
 
   On 26/08/14 13:20, Andres Freund wrote:
  
  
   I'm looking at committing this, but I wonder: Shouldn't this be
   accessible from inside psql as well? I.e. as a backslash command?
  
  
   +1
  
 
  have you idea about command name?  \?+

 Some ideas:

 \hv
 \help-variables
 \? set
 \? variables


 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

commit 1c00470629f31d61886ff9ae95a6855693f358c8
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Wed Aug 27 22:47:07 2014 +0200

access to help_variables and usage from psql via psql command

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 74d4618..6d3189d 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -568,6 +568,15 @@ EOF
   </listitem>
 </varlistentry>
 
+<varlistentry>
+  <term><option>--help-variables</option></term>
+  <listitem>
+  <para>
+  Show help about <application>psql</application> variables,
+  and exit.
+  </para>
+  </listitem>
+</varlistentry>
   </variablelist>
 </refsect1>
 
@@ -2572,10 +2581,12 @@ testdb=&gt; <userinput>\setenv LESS -imx4F</userinput>
 
 
   <varlistentry>
-<term><literal>\?</literal></term>
+<term><literal>\? [ options | variables ]</literal></term>
 <listitem>
 <para>
-Shows help information about the backslash commands.
+Shows help information about the backslash commands.  This command can have a
+option variables or options to take help for psql configuration variables
+or psql command line options.
 </para>
 </listitem>
   </varlistentry>
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index e27ff8c..12cbb20 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1503,7 +1503,19 @@ exec_command(const char *cmd,
 
 	/* \? -- slash command help */
 	else if (strcmp(cmd, "?") == 0)
-		slashUsage(pset.popt.topt.pager);
+	{
+		char	   *opt0 = psql_scan_slash_option(scan_state,
+	OT_NORMAL, NULL, false);
+
+		if (!opt0)
+			slashUsage(pset.popt.topt.pager);
+		else if (strcmp(opt0, "variables") == 0)
+			help_variables(pset.popt.topt.pager);
+		else if (strcmp(opt0, "options") == 0)
+			usage(pset.popt.topt.pager);
+		else
+			slashUsage(pset.popt.topt.pager);
+	}
 
 #if 0
 
diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c
index f8f000f..5e7953d 100644
--- a/src/bin/psql/help.c
+++ b/src/bin/psql/help.c
@@ -46,11 +46,12 @@
 #define ON(var) (var ? _("on") : _("off"))
 
 void
-usage(void)
+usage(unsigned short int pager)
 {
 	const char *env;
 	const char *user;
 	char	   *errstr;
+	FILE	   *output;
 
 	/* Find default user, in case we need it. */
 	user = getenv("PGUSER");
@@ -64,77 +65,82 @@ usage(void)
 		}
 	}
 
-	printf(_("psql is the PostgreSQL interactive terminal.\n\n"));
-	printf(_("Usage:\n"));
-	printf(_("  psql [OPTION]... [DBNAME [USERNAME]]\n\n"));
+	output = PageOutput(59, pager);
+
+	fprintf(output, _("psql is the PostgreSQL interactive terminal.\n\n"));
+	fprintf(output, _("Usage:\n"));
+	fprintf(output, _("  psql [OPTION]... [DBNAME [USERNAME]]\n\n"));
 
-	printf(_("General options:\n"));
+	fprintf(output, _("General options:\n"));
 	/* Display default database */
 	env = getenv("PGDATABASE");
 	if (!env)
 		env = user;
-	printf(_("  -c, --command=COMMAND    run only single command (SQL or internal) and exit\n"));
-	printf(_("  -d, --dbname=DBNAME      database name to connect to (default: \"%s\")\n"), env);
-	printf(_("  -f, --file=FILENAME      execute commands from file, then exit\n"));
-	printf(_("  -l, --list               list available databases, then exit\n"));
-	printf(_("  -v, --set=, --variable=NAME=VALUE\n"
-			 "                           set psql variable NAME to VALUE\n"));
-	printf(_("  -V, --version            output version information, then exit\n"));
-	printf(_("  -X, --no-psqlrc          do not read startup file (~/.psqlrc)\n"));
-	printf(_("  -1 (\"one\"), --single-transaction\n"
+	fprintf(output, _("  -c, --command=COMMAND    run only single command (SQL or internal) and exit\n"));
+	fprintf(output, _("  -d, --dbname=DBNAME      database name to connect to (default: \"%s\")\n"), env);
+	fprintf(output, _("  -f, --file=FILENAME      execute commands from file, then exit\n"));
+	fprintf(output, _("  -l, --list               list available databases, then exit\n"));
+	fprintf(output, _("  -v, --set=, --variable=NAME=VALUE\n"
+			 "                           set psql variable NAME to VALUE e.g.: -v ON_ERROR_STOP=1\n"));
+	fprintf(output, _("  -V, --version            output version information, then exit\n"));
+	fprintf(output, _("  -X, --no-psqlrc 

Re: [HACKERS] possible optimization: push down aggregates

2014-08-27 Thread Pavel Stehule
2014-08-27 22:27 GMT+02:00 Tom Lane t...@sss.pgh.pa.us:

 Merlin Moncure mmonc...@gmail.com writes:
  associative bit just makes it easier (which is important of course!).
  mean for example can be pushed down if the 'pushed down' aggregates
  return to the count to the reaggregator so that you can weight the
  final average.  that's a lot more complicated though.

 The real question is what you're expecting to get out of such an
 optimization.  If the aggregate has to visit all rows then it's
 not apparent to me that any win emerges from the extra complication.


I expect to remove the hashing or sorting part of the aggregation. It can reduce
the aggregation to a seq scan only.

Pavel



 We do already have optimization of min/max across inheritance trees,
 and that's certainly a win because you don't have to visit all rows.

 regression=# create table pp(f1 int unique);
 CREATE TABLE
 regression=# create table cc(unique(f1)) inherits(pp);
 CREATE TABLE
 regression=# create table cc2(unique(f1)) inherits(pp);
 CREATE TABLE
 regression=# explain select max(f1) from pp;
  QUERY PLAN

 
  Result  (cost=0.51..0.52 rows=1 width=0)
InitPlan 1 (returns $0)
  -  Limit  (cost=0.46..0.51 rows=1 width=4)
-  Merge Append  (cost=0.46..267.71 rows=4777 width=4)
  Sort Key: pp.f1
  -  Index Only Scan Backward using pp_f1_key on pp
 (cost=0.12..8.14 rows=1 width=4)
Index Cond: (f1 IS NOT NULL)
  -  Index Only Scan Backward using cc_f1_key on cc
 (cost=0.15..85.94 rows=2388 width=4)
Index Cond: (f1 IS NOT NULL)
  -  Index Only Scan Backward using cc2_f1_key on cc2
 (cost=0.15..85.94 rows=2388 width=4)
Index Cond: (f1 IS NOT NULL)
  Planning time: 0.392 ms
 (12 rows)

 That doesn't currently extend to the GROUP BY case unfortunately.

 regards, tom lane



Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Kevin Grittner
Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 08/27/2014 02:26 AM, Kevin Grittner wrote:

 spi-tuplestore-registry allows tuplestores, with associated name
 and TupleDesc, to be registered with the current SPI connection.
 Queries planned or executed on that connection will recognize the
 name as a tuplestore relation.  It doesn't care who is registering
 the tuplestores or what happens to them.  It doesn't depend on
 anything else.
 5 files changed, 445 insertions(+)

 ...

 plpgsql-after-trigger-transition-tables takes the tuplestores from
 TriggerData and registers them with SPI before trigger planning and
 execution.  It depends on the trigger-transition-tables and
 spi-tuplestore-registry patches to build, and won't do anything
 useful at run time without the executor-tuplestore-relations patch.
 3 files changed, 37 insertions(+), 11 deletions(-)

 This is a surprising way to expose the NEW/OLD relations to the
 planner/executor. The problem is the same as with making PL/pgSQL
 variables available to the planner/executor in queries within a PL/pgSQL
 function, and the solution we have for that is the parser hooks you
 pass to SPI_prepare_params. This tuplestore registry is a different
 solution to the same problem - we could've implemented parameters with a
 registry like this as well. Let's not mix two different designs.

 I suggest adding a new hook to the ParseState struct, (p_rangevar_hook
 ?). The planner calls it whenever it sees a reference to a table, and
 the hook function returns back some sort of placeholder reference to the
 tuplestore. With variables, the hook returns a Param node, and at
 execution time, the executor calls the paramFetch hook to fetch the
 value of the param. For relations/tuplestores, I guess we'll need to
 invent something like a Param node, but for holding information about
 the relation. Like your TsrData struct, but without the pointer to the
 tuplestore. At execution time, in the SPI_execute call, you pass the
 pointer to the tuplestore in the ParamListInfo struct, like you pass
 parameter values.

 Does this make sense?

I see your point, but SPI first has to be made aware of the
tuplestores and their corresponding names and TupleDesc structures.
Does it make sense to keep the SPI_register_tuplestore() and
SPI_unregister_tuplestore() functions for the client side of the
API, and pass things along to the parse analysis through execution
phases using the techniques you describe?  That would eliminate the
need for the SPI_get_caller_tuplestore() function and the
parse_tuplestore.[ch] files, and change how the data is fetched in
parse analysis and execution phases, but that seems fairly minimal
-- there are exactly three places that would need to call the new
hooks where the patch is now getting the information from the
registry.

 In essence, make the relations work like PL/pgSQL
 variables do. If you squint a little, the new/old relation is a variable
 from the function's point of view, and a parameter from the
 planner/executor's point of view. It's just a variable/parameter that
 holds a set of tuples, instead of a single Datum.

I don't have to squint that hard -- I've always been comfortable
with the definition of a table as a relation variable, and it's not
too big a stretch to expand that to a tuplestore.  ;-)  In fact, I
will be surprised if someone doesn't latch onto this to create a
new declared temporary table that only exists within the scope of
a compound statement (i.e., a BEGIN/END block).  You would DECLARE
them just like you would a scalar variable in a PL, and they would
have the same scope.

I'll take a look at doing this in the next couple days, and see
whether doing it that way is as easy as it seems on the face of it.

Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Kevin Grittner
Jim Nasby j...@nasby.net wrote:

 Something to keep in mind is that users will definitely think about NEW/OLD as
 tables. I suspect that it won't be long after release before someone asks
 why they can't create an index on it. :)

I'm comfortable saying No to that.  But it's a good point -- I'll 
review error checking and documentation to make sure that it is 
clear.
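
For context, the kind of usage being discussed would look roughly like this, using the SQL-standard REFERENCING syntax (a hypothetical sketch: the table names and the exact syntax the patch ends up exposing are assumptions on my part):

-- Hypothetical statement-level AFTER trigger reading the OLD delta relation.
-- Assumes tables some_table and audit_log(removed_count bigint) exist.
CREATE FUNCTION log_big_deletes() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- "old_rows" is queried like an ordinary read-only relation.
    INSERT INTO audit_log (removed_count)
    SELECT count(*) FROM old_rows;
    RETURN NULL;
END;
$$;

CREATE TRIGGER t_log_deletes
    AFTER DELETE ON some_table
    REFERENCING OLD TABLE AS old_rows
    FOR EACH STATEMENT
    EXECUTE PROCEDURE log_big_deletes();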

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Stephen Frost
* Kevin Grittner (kgri...@ymail.com) wrote:
 Heikki Linnakangas hlinnakan...@vmware.com wrote:
  In essence, make the relations work like PL/pgSQL
  variables do. If you squint a little, the new/old relation is a variable
  from the function's point of view, and a parameter from the
  planner/executor's point of view. It's just a variable/parameter that
  holds a set of tuples, instead of a single Datum.
 
 I don't have to squint that hard -- I've always been comfortable
 with the definition of a table as a relation variable, and it's not
 too big a stretch to expand that to a tuplestore.  ;-)  In fact, I
 will be surprised if someone doesn't latch onto this to create a
 new declared temporary table that only exists within the scope of
 a compound statement (i.e., a BEGIN/END block).  You would DECLARE
 them just like you would a scalar variable in a PL, and they would
 have the same scope.
 
 I'll take a look at doing this in the next couple days, and see
 whether doing it that way is as easy as it seems on the face of it.

(not following this very closely, but saw this...)

Yes, please? :)

Thanks!

Stephen




Re: [HACKERS] delta relations in AFTER triggers

2014-08-27 Thread Heikki Linnakangas

On 08/28/2014 12:03 AM, Kevin Grittner wrote:

Heikki Linnakangas hlinnakan...@vmware.com wrote:

I suggest adding a new hook to the ParseState struct, (p_rangevar_hook
?). The planner calls it whenever it sees a reference to a table, and
the hook function returns back some sort of placeholder reference to the
tuplestore. With variables, the hook returns a Param node, and at
execution time, the executor calls the paramFetch hook to fetch the
value of the param. For relations/tuplestores, I guess we'll need to
invent something like a Param node, but for holding information about
the relation. Like your TsrData struct, but without the pointer to the
tuplestore. At execution time, in the SPI_execute call, you pass the
pointer to the tuplestore in the ParamListInfo struct, like you pass
parameter values.

Does this make sense?


I see your point, but SPI first has to be made aware of the
tuplestores and their corresponding names and TupleDesc structures.
Does it make sense to keep the SPI_register_tuplestore() and
SPI_unregister_tuplestore() functions for the client side of the
API, and pass things along to the parse analysis through execution
phases using the techniques you describe?


Sorry, I didn't understand that. What do you mean by "first", and "the 
client side of the API"? I don't see any need for the 
SPI_register_tuplestore() and SPI_unregister_tuplestore() functions 
if you use the hooks.



In essence, make the relations work like PL/pgSQL
variables do. If you squint a little, the new/old relation is a variable
from the function's point of view, and a parameter from the
planner/executor's point of view. It's just a variable/parameter that
holds a set of tuples, instead of a single Datum.


I don't have to squint that hard -- I've always been comfortable
with the definition of a table as a relation variable, and it's not
too big a stretch to expand that to a tuplestore.  ;-)  In fact, I
will be surprised if someone doesn't latch onto this to create a
new declared temporary table that only exists within the scope of
a compound statement (i.e., a BEGIN/END block).  You would DECLARE
them just like you would a scalar variable in a PL, and they would
have the same scope.


Yeah, that would be cool :-).

- Heikki




