Re: [PERFORM] Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Graeme B. Bell
 No, of course it doesn't.  It appears that you didn't look at the repo or 
 read my previous mail before you wrote this. 
 
 FFS, I *ran* some of the tests and reported on results. With you in CC.

Just checked back. So you did. I'm sorry, I made the mistake I accused you of. 

But... why then did you say I hadn't provided him with individual functions, 
when you've seen the repo yourself? I don't understand. You knew they were there.

 What I mean is that I don't just run random code from some random github
 repository.

Sure, but surely that's not an issue when the SQL functions are also separately 
provided and clearly labelled in the repo?

Do you feel there is a difference in trustworthiness between isolated files 
containing an SQL function presented in a github repo, and SQL functions 
presented in an email?

I am not sure I can agree with that idea; I think they are both just SQL 
functions. The difference is that one also offers you a bit more if you want to 
check/try it.

 I do not wish to antagonise you either, so please go and look at the
 repo before you write the next reply.
 
 Over and out.

It seems there has been a misunderstanding here and I feel I'm still missing 
something in what you're saying. Sorry Andres. Let's just forget this. I don't 
think we disagree especially on this and I am not looking to make an enemy here.

Also, thanks for running the benchmarks to get some numbers.

Graeme. 

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Graeme B. Bell

On 09 Jul 2015, at 15:22, Thomas Kellerer spam_ea...@gmx.net wrote:

 Graeme B. Bell schrieb am 09.07.2015 um 11:44:
 I don't recall seeing a clear statement telling me I should mark pl/pgsql
 functions nonvolatile wherever possible or throw all performance and
 scalability out the window. 
 
 From: http://www.postgresql.org/docs/current/static/xfunc-volatility.html
 
   For best optimization results, you should label your functions 
with the strictest volatility category that is valid for them.


Hi Thomas,

Thank you very much for the link.

However, the point I was making wasn't that no sentence exists anywhere. My 
point was that I've read the docs more than anyone else in my institute and I 
was completely unaware of this. 

It is also quite vague - if you hand that to a younger programmer in particular, 
how do they implement it in practice? When is it important to do it?  If this 
one factor silently breaks multiprocessor scaling of pl/pgsql, and 
multiprocessing is the biggest trend in CPU processing of the last decade 
(comparing server CPUs of 2005 with 2015), then why is this information not up 
front and clear?
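[Editor's sketch, for readers arriving via search: the fix under discussion is a one-word change to the function definition. The function body below is illustrative, not taken from the benchmark repo; the volatility-label syntax is standard PostgreSQL:]

```sql
-- Without a volatility label this defaults to VOLATILE: a fresh snapshot
-- is taken per statement, which is the parallel-scaling killer under
-- discussion in this thread.
CREATE OR REPLACE FUNCTION is_prime(n bigint) RETURNS boolean AS $$
DECLARE
    i bigint := 2;
BEGIN
    IF n < 2 THEN RETURN false; END IF;
    WHILE i * i <= n LOOP
        IF n % i = 0 THEN RETURN false; END IF;
        i := i + 1;
    END LOOP;
    RETURN true;
END;
$$ LANGUAGE plpgsql IMMUTABLE;  -- pure computation: strictest valid label
```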


A second point to keep in mind is that optimisation and 
parallelisation/scalability are not always the same thing. 

For example, in one project I took a bunch of looped parallel UPDATEs on a set 
of 50 tables, and rewrote them so as to run the loop all at once inside a 
pl/pgsql function. Crudely, I took out the table-level for loop and put it at 
row-level instead. 

I expected they'd execute much faster if UPDATEs were using data still in 
cache. Also, I would be updating without writing out WAL entries to disk 
repeatedly. 

It turns out the update per row ran much faster - as expected - when I used one 
table, but when I ran it in parallel on many tables, the performance was even 
worse than when I started. If you look at the benchmarks, you'll see that 
performance drops through the floor at 8-16 cores. I think that was when I 
first noticed this bug/feature.

[If anyone is curious, the way I solved that one in the end was to 
pre-calculate every possible way the tables might be updated after N loops of 
updates using Python, and import that as a lookup table into PG. It turns out 
that although we had tens of GB of data per table, there were only about 100,000 
different types of situation, and only e.g. 80 iterations to consider. Then I 
ran a single set of UPDATEs with no pl/pgsql. It was something like a 1x 
performance improvement.]
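[Editor's sketch of that precalculation idea, with a made-up transition rule standing in for the real UPDATE logic - not the actual project code:]

```python
# If each row's state evolves by a deterministic rule and the number of
# distinct (state, iteration) combinations is small, an iterative UPDATE
# loop collapses into a precomputed lookup table.

def step(state: int) -> int:
    """Stand-in for the effect of one UPDATE iteration on a row's state."""
    return (state * 3 + 1) % 100  # arbitrary deterministic rule


def build_lookup(num_states: int, iterations: int) -> dict:
    """Map (initial_state, n_iterations) -> resulting state."""
    table = {}
    for s in range(num_states):
        cur = s
        for n in range(1, iterations + 1):
            cur = step(cur)
            table[(s, n)] = cur
    return table


lookup = build_lookup(num_states=100, iterations=80)
# A single set-based UPDATE joined against this table then replaces
# 80 looped UPDATEs per row.
```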

Graeme.



Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Graeme B. Bell
 
 3. I don't disagree that the benchmark code is objectively 'bad' in the 
 sense that it is missing an important optimisation.
 
 Particularly with regards documentation, a patch improving things is
 much more likely to improve the situation than griping.  Also,
 conversation on this list gets recorded for posterity and google is
 remarkably good at matching people looking for problems with
 solutions.  So, even in absence of a patch perhaps we've made the
 lives of future head-scratchers a little bit easier with this
 discussion.

I agree that patch > gripe, and about the google aspect. But nonetheless, a 
well-intentioned gripe is > ignorance of a problem. 

As mentioned earlier, I'm sick just now and will be back in hospital again 
tomorrow & Monday, so a patch may be a little bit much to ask from me here :-) 
It's a bit much even keeping up with the posts on the thread so far.

I might try to fix the documentation a bit later, though as someone with no 
experience in marking up volatility on pl/pgsql functions I doubt my efforts 
would be that great. I also have other OSS project contributions that need some 
attention first. 

Re: the google effect. Are these mailing list archives mirrored anywhere, 
incidentally? For example, I notice we just lost http://reddit.com/r/amd at the 
weekend; all the discussion of the last few years on that forum is out of 
reach.  

Graeme Bell




Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Merlin Moncure
On Thu, Jul 9, 2015 at 10:12 AM, Graeme B. Bell graeme.b...@nibio.no wrote:

 3. I don't disagree that the benchmark code is objectively 'bad' in the 
 sense that it is missing an important optimisation.

 Particularly with regards documentation, a patch improving things is
 much more likely to improve the situation than griping.  Also,
 conversation on this list gets recorded for posterity and google is
 remarkably good at matching people looking for problems with
 solutions.  So, even in absence of a patch perhaps we've made the
 lives of future head-scratchers a little bit easier with this
 discussion.

 I agree that patch > gripe, and about the google aspect. But nonetheless, a 
 well-intentioned gripe is > ignorance of a problem.

 As mentioned earlier, I'm sick just now and will be back in hospital again 
 tomorrow & Monday, so a patch may be a little bit much to ask from me here 
 :-) It's a bit much even keeping up with the posts on the thread so far.

 I might try to fix the documentation a bit later, though as someone with no 
 experience in marking up volatility on pl/pgsql functions I doubt my efforts 
 would be that great. I also have other OSS project contributions that need 
 some attention first.

 Re: the google effect. Are these mailing list archives mirrored anywhere, 
 incidentally? For example, I notice we just lost http://reddit.com/r/amd at the 
 weekend, all the discussion of the last few years on that forum is out of 
 reach.

The community maintains its own mailing list archives at
postgresql.org.  Short of an array of tactical nuclear strikes, this is
going to be preserved, because it represents the history of the project
and in many ways is as important as the source code itself.

The archives are also mirrored by a number of high-quality providers
such as Nabble (which tends to beat our archives in google rankings --
likely due to the improved interface).

merlin




Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Graeme B. Bell

On 09 Jul 2015, at 17:42, Merlin Moncure mmonc...@gmail.com wrote:

 The community maintains its own mailing list archives at
 postgresql.org.  Short of an array of tactical nuclear strikes this is
 going to be preserved 

Good to know, I've seen a lot of dead software projects throughout my life. 

But still - we will have to pray that Kim Jong Un never decides to become a 
MySQL contributor... :)

Graeme.



Re: [PERFORM] pg_stat_all_indexes understand

2015-07-09 Thread Jeff Janes
On Thu, Jul 9, 2015 at 5:20 AM, Nicolas Paris nipari...@gmail.com wrote:

 Hello,

 I wonder how pg_stat_all_indexes works.

 When I run an explain, some indexes are not used, but
 pg_stat_all_indexes.idx_scan is incremented for those indexes.


When the planner considers using a merge join on an indexed column, it uses
an index to check the endpoints of the column (the min and the max) to make
sure it has the latest values, so it gets the most accurate estimate.  This
causes the usage counts to get incremented, even when it doesn't end up
using the merge join.


 Does this mean idx_scan is incremented each time the planner checks if an
 index could be used, even when it won't use it?


Not in general, only in a few peculiar cases.

Cheers,

Jeff
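
[Editor's sketch, relating to the original question about finding deletable indexes: a common starting point is to sort pg_stat_all_indexes by idx_scan, bearing in mind the planner-probe caveat above and that a statistics reset clears the counters. Column names below are from the standard statistics views:]

```sql
-- Candidate indexes for removal: few recorded scans, not enforcing
-- uniqueness (unique/PK indexes may exist for constraints, not lookups).
SELECT s.schemaname, s.relname, s.indexrelname, s.idx_scan
FROM pg_stat_all_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE NOT i.indisunique
  AND s.schemaname NOT IN ('pg_catalog', 'pg_toast')
ORDER BY s.idx_scan ASC;
```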


Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Andres Freund
On 2015-07-08 23:38:38 -0400, Tom Lane wrote:
 and...@anarazel.de (Andres Freund) writes:
  On 2015-07-08 15:38:24 -0700, Craig James wrote:
  From my admittedly naive point of view, it's hard to see why any of this
  matters. I have functions that do purely CPU-intensive mathematical
  calculations ... you could imagine something like is_prime(N) that
  determines if N is a prime number. I have eight clients that connect to
  eight backends. Each client issues an SQL command like, select
  is_prime(N) where N is a simple number.

  I mostly replied to Merlin's general point (additionally in the context of
  plpgsql).

  But I have a hard time seeing that postgres would be the bottleneck for an
  is_prime() function (or something with similar characteristics) that's
  written in C where the average runtime is more than, say, a couple
  thousand cycles.  I'd like to see a profile of that.

 But that was not the case that Graeme was complaining about.

No, Craig was complaining about that case...

 One of my Salesforce colleagues has been looking into ways that we could
 decide to skip the per-statement snapshot acquisition even in volatile
 functions, if we could be sure that a particular statement isn't going to
 do anything that would need a snapshot.

Yea, I actually commented about that on IRC as well.

I was thinking about actually continuing to get a snapshot, but marking it
as 'complete on usage', i.e. calling GetSnapshotData() only when the
snapshot is used to decide about visibility.  We probably can't do that
in the toplevel visibility case because it'll probably have noticeable
semantic effects, but ISTM it should be doable for the volatile function
using SPI case.




Re: [PERFORM] Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Andres Freund
On 2015-07-09 10:30:35 +, Graeme B. Bell wrote:
  Well, that requires reviewing the source code of the run script and
  such.
 
 No, of course it doesn't.  It appears that you didn't look at the repo or 
 read my previous mail before you wrote this. 

FFS, I *ran* some of the tests and reported on results. With you in CC.

What I mean is that I don't just run random code from some random github
repository.

 I do not wish to antagonise you either, so please go and look at the
 repo before you write the next reply.

Over and out.




Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Graeme B. Bell
On 08 Jul 2015, at 22:27, Andres Freund and...@anarazel.de wrote:

 On 2015-07-08 13:46:53 -0500, Merlin Moncure wrote:
 On Wed, Jul 8, 2015 at 12:48 PM, Craig James cja...@emolecules.com wrote:
 
 Well, right, which is why I mentioned even with dozens of clients.
 Shouldn't that scale to at least all of the CPUs in use if the function is
 CPU intensive (which it is)?
 
 only in the absence of inter-process locking and cache line bouncing.
 
 And additionally memory bandwidth (shared between everything, even in
 the numa case), cross socket/bus bandwidth (absolutely performance
 critical in multi-socket configurations), cache capacity (shared between
 cores, and sometimes even sockets!).

1. Note for future readers - it's also worth noting that depending on the 
operation, and on your hardware, you may have fewer CPU cores than you think 
to parallelise across.

1a. For example, AMD CPUs list the number of integer cores (e.g. 16), but there 
are actually only half as many cores available for floating-point work (8). So 
if your functions need to use floating point, your scaling will suffer badly on 
FP functions. 

https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
  In terms of hardware complexity and functionality, this module is equal to 
a dual-core processor in its integer power, and to a single-core processor in 
its floating-point power: for each two integer cores, there is one 
floating-point core.


1b. Or, if you have hyper-threading enabled on an Intel CPU, you may think you 
have e.g. 8 cores, but if all the threads are running the same type of 
operation repeatedly, it won't be possible for the hyper-threading to work well 
and you'll only get 4 in practice. Maybe less due to overheads. Or, if your 
work is continually going to main memory for data (e.g. limited by the memory 
bus), it will run at 4-core speed, because the cores have to share the same 
memory bus. 

Hyper-threading depends on the 2 logical cores being asked to perform two 
different types of tasks at once (each having relatively lower demands on 
memory).

When execution resources would not be used by the current task in a processor 
without hyper-threading, and especially when the processor is stalled, a 
hyper-threading equipped processor can use those execution resources to execute 
another scheduled task.
https://en.wikipedia.org/wiki/Hyper-threading
https://en.wikipedia.org/wiki/Superscalar


2. Keep in mind also when benchmarking that it's normal to see a small 
drop-off when you hit the maximum number of cores for your system. 
After all, the O/S, the benchmark program and anything else you have running 
will need a core or two.
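
[Editor's sketch of a rough way to see points 1 and 2 empirically, outside the database: time the same CPU-bound function at increasing process counts and watch where the speedup flattens - around the physical core count - or dips near saturation. Plain Python, not tied to the benchmark repo:]

```python
# Rough harness for the caveats above: run the same CPU-bound job at
# increasing process counts and watch where the speedup curve flattens
# (hyper-threading / shared FP cores) or dips (all cores saturated).
import os
import time
from multiprocessing import Pool

def is_prime(n: int) -> bool:
    """Trial-division primality test; deliberately CPU-bound."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(bounds):
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

if __name__ == "__main__":
    limit = 50_000
    for procs in (1, 2, 4, os.cpu_count() or 1):
        # Split [0, limit) into equal chunks, one per worker process.
        chunks = [(k * limit // procs, (k + 1) * limit // procs)
                  for k in range(procs)]
        t0 = time.perf_counter()
        with Pool(procs) as pool:
            total = sum(pool.map(count_primes, chunks))
        print(f"{procs:2d} procs: {time.perf_counter() - t0:.2f}s, {total} primes")
```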



[PERFORM] pg_stat_all_indexes understand

2015-07-09 Thread Nicolas Paris
Hello,

I wonder how pg_stat_all_indexes works.

When I run an explain, some indexes are not used, but
pg_stat_all_indexes.idx_scan is incremented for those indexes.

Does this mean idx_scan is incremented each time the planner checks if an
index could be used, even when it won't use it?

Is there a better way to check which indexes could be deleted?

Thanks in advance.


Re: [PERFORM] Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Graeme B. Bell

On 08 Jul 2015, at 13:20, Andres Freund and...@anarazel.de wrote:

 On 2015-07-08 11:13:04 +, Graeme B. Bell wrote:
 I'm guessing you are maybe pressed for time at the moment because I
 already clearly included this on the last email, as well as the links
 to the alternative benchmarks with the same problem I referred to on
 both of my last emails which are also trivial to drop into pgbench
 (cut/paste).
 
 You realize that you want something here, not Merlin, right?

Hi Andres,

My email was saying it's not helpful for anyone on the list for him to keep 
asking me to give him X and me to keep sending it.  Do you disagree with that 
idea?

I tried to phrase my request politely, but perhaps I failed. If you have 
suggestions for better ways to say "I already sent it, twice" more politely in 
this situation, I'd welcome them off-list. 

He asked me to disclose the function body I was testing. I did that, *and* also 
disclosed the entire approach to the benchmark too in a way that made it 
trivial for him or others to replicate the situation I'd found. I'm pretty sure 
you should not be discouraging this kind of thing in bug/performance reports. 

I get your point that when you're asking for other people to look at something 
with you, don't antagonise them. 

I didn't intend it as antagonising and Merlin hasn't mailed me anything to say 
he was antagonised. I'm quite sure he's capable of defending himself or 
communicating with me himself if he does feel antagonised by something. I hope 
we can end the discussion of that here?

Merlin, if you were antagonised, sorry, I did not mean to antagonise you. I 
just wanted to make it clear that I'd sent you what you asked for, + more, 
and that I was surprised you hadn't noticed it. 

 To clear up the issue I built a little test harness around your comment 
 below.
 http://github.com/gbb/t
 
 Well, that requires reviewing the source code of the run script and
 such.

No, of course it doesn't.  It appears that you didn't look at the repo or read 
my previous mail before you wrote this. 

I do not wish to antagonise you either, so please go and look at the repo 
before you write the next reply. 

http://github.com/gbb/t
Just pick any function you like, there are 6 there, and 3 of them demonstrate 2 
different problems, all of it is clearly documented.

When you open up the repo, there are the tests
https://github.com/gbb/t/tree/master/tests

You don't need to review any code from the run script. The functions are there 
as isolated files and what they are intended to demonstrate is clearly 
described with text and graphics. I could see your point if I had mailed out 
some giant script with a bunch of SQL calls embedded in its guts, but that's 
the opposite of what I did here.  

Did you find it difficult to navigate the repo structure (2 folders, a few 
files)? If so please let me know off-list what was difficult and I will see if 
I can improve it. 

 I think we shouldn't discuss this on two threads (-performance, -bugs),
 that makes it hard to follow. Given Tom's more detailed answer I think
 the -bugs thread already contains more pertinent information.

I don't necessarily disagree with this idea, but...

How many people concerned with performance are following the -bugs list? How 
much space is there for discussion of this on -bugs? Since the only working 
solutions for this performance problem so far are all user-side rather than 
committer-side, why would you want to restrict that information to a 
committer-side list?

It has developed this way because I noticed it as a performance issue first, 
then decided to report it as a potential bug.

Perhaps it would be useful to keep the discussion separate as the committer-side 
aspects (how to fix this at the server level) and user-side (what you can 
do to improve performance right now).  I will defer to general opinion on this 
in my follow-up posts. 

Graeme. 



Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Graeme B. Bell
On 09 Jul 2015, at 05:38, Tom Lane t...@sss.pgh.pa.us wrote:

 If you
 write your is_prime function purely in plpgsql, and don't bother to mark
 it nonvolatile, *it will not scale*.  

 much for properly written plpgsql; but there's an awful lot of bad plpgsql
 code out there, and it can make a huge difference for that.


Hi Tom, 

I object to phrases like 'don't bother to mark it' and 'bad plpgsql' here. That 
is putting the blame on programmers. Clearly, if there is no end of code out 
there that isn't right in this regard, there's something very wrong in the 
project documentation.

1. I have been writing pl/pgsql on and off for a couple of years now and I've 
read quite a bit of the postgres documentation, but I don't recall seeing a 
clear statement telling me I should mark pl/pgsql functions nonvolatile 
wherever possible or throw all performance and scalability out the window. I'm 
sure there may be a line hidden somewhere in the docs, but judging from the 
impact it has in practice, this seems like a very fundamental concept that 
should be repeatedly and clearly marked in the docs. 

2. Furthermore, I have never come across anything in the documentation that 
made it clear to me that any pl/pgsql function I write will, by default, be 
taking out locks for every single statement in the code. I've written code in I 
dunno, maybe 15-20 different languages in my life, and I can't think of another 
language offhand that does that by default. From the reactions on this thread 
to this benchmark and the par_psql benchmarks, it doesn't seem that it was even 
immediately obvious to many postgres enthusiasts and developers.

3. I don't disagree that the benchmark code is objectively 'bad' in the sense 
that it is missing an important optimisation. 

But I really don't think it helps to frame this as laziness or bad in any 
other sense of the word e.g. 'clumsy'.

Let's look at the postgresql documentation for some examples of 'bad' and lazy 
code: 

http://www.postgresql.org/docs/9.3/static/plpgsql-structure.html
http://www.postgresql.org/docs/9.3/static/plpgsql-declarations.html

There are about 13 functions on that page.
How many functions on that page make use of non-volatile or immutable wherever 
it would be appropriate?  
zero.

or this one: 
http://www.postgresql.org/docs/9.3/static/plpgsql-control-structures.html
zero

or this one: 
http://www.postgresql.org/docs/9.3/static/plpgsql-cursors.html#PLPGSQL-CURSOR-USING
zero

The reason 90% of people out there are 'not bothering' and 'writing bad code' 
is because **99% of the postgresql documentation teaches them to do it that 
way**. 

So when you talk about other people 'not bothering' to do things - who is 
really at fault here for what you see as endemic 'bad' or 'lazy' code? Is 
it the new postgres programmers, or the people that taught them with bad 
examples consistently throughout the *entire project documentation*, starting 
from the very first example? 

I think I'm going to raise this as a documentation bug. 

Graeme. 



Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Merlin Moncure
On Wed, Jul 8, 2015 at 5:38 PM, Craig James cja...@emolecules.com wrote:
 On Wed, Jul 8, 2015 at 1:27 PM, Andres Freund and...@anarazel.de wrote:

 On 2015-07-08 13:46:53 -0500, Merlin Moncure wrote:
  On Wed, Jul 8, 2015 at 12:48 PM, Craig James cja...@emolecules.com
  wrote:
   On Tue, Jul 7, 2015 at 10:31 PM, Joshua D. Drake
   j...@commandprompt.com
   Using Apache Fast-CGI, you are going to fork a process for each
   instance
   of the function being executed and that in turn will use all CPUs up
   to the
   max available resource.
  
   With PostgreSQL, that isn't going to happen unless you are running
   (at
   least) 8 functions across 8 connections.
  
  
   Well, right, which is why I mentioned even with dozens of clients.
   Shouldn't that scale to at least all of the CPUs in use if the
   function is
   CPU intensive (which it is)?
 
  only in the absence of inter-process locking and cache line bouncing.

 And additionally memory bandwidth (shared between everything, even in
 the numa case), cross socket/bus bandwidth (absolutely performance
 critical in multi-socket configurations), cache capacity (shared between
 cores, and sometimes even sockets!).


 From my admittedly naive point of view, it's hard to see why any of this
 matters. I have functions that do purely CPU-intensive mathematical
 calculations ... you could imagine something like is_prime(N) that
 determines if N is a prime number. I have eight clients that connect to
 eight backends. Each client issues an SQL command like, select is_prime(N)
 where N is a simple number.

 Are you saying that in order to calculate is_prime(N), all of that stuff
 (inter-process locking, memory bandwidth, bus bandwidth, cache capacity,
 etc.) is even relevant? And if so, how is it that Postgres is so different
 from an Apache fast-CGI program that runs the exact same is_prime(N)
 calculation?

 Just curious ... as I said, I've already implemented a different solution.

If your is_prime() was written in C and was written so that it did not
utilize the database API, it should scale up quite nicely.  This can
be easily proved.  On my quad core workstation,

postgres=# select 12345! * 0;
 ?column?
──
0
(1 row)

Time: 10435.554 ms


...which is heavily cpu bound, takes about 10 seconds.  scaling out to
4 threads via:

time ~/pgdev/bin/pgbench -n -t1 -c4 -f <(echo "select 12345! * 0;")

yields:
real 0m11.317s
user 0m0.001s
sys 0m0.005s

...I'll call that pretty good scaling.  The reason why this scales so
well is that the numeric code is all operating on local data
structures and is not involving backend componentry with its various
attached complexity, such as having to be checked for being visible to
the current transaction.

I submit that toy benchmarks like factoring or pi digits are not
really good indicators of language scaling and performance because
just about all real world code involves data structures, i/o, memory
allocation, amateur coders, etc.   Java tends to approach C in
benchmark shootouts but woefully underperforms my expectations
relative to C in code that does things that are actually useful (aside:
if you think I'm knocking java, the situation is even worse with most
other languages I come across).

pl/pgsql is simply not optimized for that style of coding, although if
you know postgres you can start to tickle the limits of what's
expected from the language.  If that isn't working for you, pl/v8
strikes me as the best alternative due to its performance and good
integration with postgres data structures (in fact, I'd be arguing for
it to be moved to core if the v8 dependency wasn't so capricious).
Either way, I'll advocate any solution that allows you to code inside
the database environment as opposed to the client side.

merlin




[PERFORM] Re: Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Thomas Kellerer
Graeme B. Bell schrieb am 09.07.2015 um 11:44:
 I don't recall seeing a clear statement telling me I should mark pl/pgsql
 functions nonvolatile wherever possible or throw all performance and
 scalability out the window. 

From: http://www.postgresql.org/docs/current/static/xfunc-volatility.html

   For best optimization results, you should label your functions 
with the strictest volatility category that is valid for them.



Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Merlin Moncure
On Thu, Jul 9, 2015 at 4:44 AM, Graeme B. Bell graeme.b...@nibio.no wrote:
 On 09 Jul 2015, at 05:38, Tom Lane t...@sss.pgh.pa.us wrote:

 If you
 write your is_prime function purely in plpgsql, and don't bother to mark
 it nonvolatile, *it will not scale*.

 much for properly written plpgsql; but there's an awful lot of bad plpgsql
 code out there, and it can make a huge difference for that.


 Hi Tom,

 I object to phrases like 'don't bother to mark it' and 'bad plpgsql' here. 
 That is putting the blame on programmers. Clearly, if there is no end of code 
 out there that isn't right in this regard, there's something very wrong in 
 the project documentation.

 1. I have been writing pl/pgsql on and off for a couple of years now and I've 
 read quite a bit of the postgres documentation, but I don't recall seeing a 
 clear statement telling me I should mark pl/pgsql functions nonvolatile 
 wherever possible or throw all performance and scalability out the window. 
 I'm sure there may be a line hidden somewhere in the docs, but judging from 
 the impact it has in practice, this seems like a very fundamental concept 
 that should be repeatedly and clearly marked in the docs.

 2. Furthermore, I have never come across anything in the documentation that 
 made it clear to me that any pl/pgsql function I write will, by default, be 
 taking out locks for every single statement in the code. I've written code in 
 I dunno, maybe 15-20 different languages in my life, and I can't think of 
 another language offhand that does that by default. From the reactions on 
 this thread to this benchmark and the par_psql benchmarks, it doesn't seem 
 that it was even immediately obvious to many postgres enthusiasts and 
 developers.

 3. I don't disagree that the benchmark code is objectively 'bad' in the sense 
 that it is missing an important optimisation.

Particularly with regards documentation, a patch improving things is
much more likely to improve the situation than griping.  Also,
conversation on this list gets recorded for posterity and google is
remarkably good at matching people looking for problems with
solutions.  So, even in absence of a patch perhaps we've made the
lives of future head-scratchers a little bit easier with this
discussion.

merlin




Re: [PERFORM] QUERY PLANNER - Indexe mono column VS composite Index

2015-07-09 Thread Guillaume Lelarge
2015-07-09 22:34 GMT+02:00 Nicolas Paris nipari...@gmail.com:

 Hello,

 My 9.4 database is used as a data warehouse. I can't change the queries
 generated.

 first index  : INDEX COL (A,B,C,D,E)


 In case of a query based on COL A, the query planner sometimes goes to a seq
 scan instead of using the first composite index.

 The solution is to add a second (redundant) index:
 second index : INDEX COL (A)

 In case of a query based on COL A, B, C, D (without E) as well, it doesn't
 use the first index and prefers a seq scan.

 I could create a third index:
 third index  : INDEX COL (A,B,C,D)

 But I hope there is another solution for that (the table is huge).

 It seems that the penalty for using composite indexes is high.

 Question is: is there a way to make the composite index more attractive
 to the query planner? (ideally equivalent to a single-column index)


There's no way we can answer that without seeing actual queries and query
plans.
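
For what it's worth, plans for both query shapes can be captured and shared like this (the table and column names below are hypothetical stand-ins for the ones in the question):

```sql
-- Hypothetical table with the composite index from the question.
-- EXPLAIN (ANALYZE, BUFFERS) shows the chosen plan plus timing and I/O,
-- which is what's needed to diagnose why the planner prefers a seq scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM facts WHERE a = 42;                      -- leading column only

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM facts
WHERE a = 42 AND b = 7 AND c = 1 AND d = 3;            -- A..D, without E
```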


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


[PERFORM] QUERY PLANNER - Indexe mono column VS composite Index

2015-07-09 Thread Nicolas Paris
Hello,

My 9.4 database is used as a data warehouse. I can't change the queries
generated.

first index: INDEX COL (A,B,C,D,E)


For a query based on COL A alone, the planner sometimes chooses a seq
scan instead of using the composite index.

The solution is to add a second (redundant) index:
second index: INDEX COL (A)

Likewise, for a query based on COL A, B, C, D (without E), it doesn't
use the first index and prefers a seq scan.

I could create a third index:
third index: INDEX COL (A,B,C,D)

But I hope there is another solution (the table is huge).

It seems the penalty for using composite indexes is high.

Question: is there a way to make the composite index more attractive to
the planner? (Ideally, equivalent to a single-column index.)


Thanks in advance


Re: [PERFORM] [BUGS] BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

2015-07-09 Thread Graeme B. Bell

This is a reply to Andres's post on the #13495 documentation thread in 
-bugs. 
I am responding to it here because it relates only to #13493.

Andres wrote, re: #13493

 This issue is absolutely critical for performance and scalability of code,

 Pft. In most cases it doesn't actually matter that much because the
 contained queries are the expensive stuff. It's just when you do lots of
 very short and cheap things that it has such a big effect.  Usually the
 effect on the planner is bigger.

Hi Andres,

'Pft' is kinda rude - I wouldn't comment on it normally,  but seeing as you 
just lectured me on -performance on something you perceived as impolite (just 
like you lectured me on not spreading things onto multiple threads), can you 
please try to set a good example? You don't encourage new contributors into 
open source communities this way. 

Getting to the point. I think the gap between our viewpoints comes from the 
fact that I (and others here at my institute) have a bunch of pl/pgsql code 
with for loops and calculations, which we see as 'code'. Thinking of all the 
users I know myself, there are plenty of GIS people out there using for loops 
and pl/pgsql to simulate models on data in the DB, and I expect the same is 
true among e.g. older scientists with DB datasets. 

Whereas it sounds like you and Tom see pl/pgsql as 'glue' and don't see any 
problem. As I have never seen statistics on pl/pgsql use-cases among users at 
large, I don't know what happens everywhere else outside of GIS-world and 
pgdev-world. Have you any references/data you can share on that? I would be 
interested to know because I don't want to overclaim on the importance of these 
bugs or any other bugs in future. In this case, #13493 wrecked the code for 
estimates on a 20 million euro national roadbuilding project here and it cost 
me a few weeks of my life, but for all I know you're totally right about the 
general importance to the world at large.

Though keep in mind: this isn't only about scaling up one program; it's a 
db-level problem. If you have a large GIS DB server with many users, 
long-running queries etc. on large amounts of data, then you only need e.g. 2-3 
people running code with for loops or a long series of calculations in 
pl/pgsql, and everything falls apart in pgsql-land. 

Last point. When I wrote 'absolutely critical' I was under the impression this 
bug could have some serious impact on postgis/pgrouting. Since I wanted to 
double check what you said about 'expensive stuff' vs 'short/cheap stuff', I 
ran some benchmarks to check on a few functions. 

You are right that only short, looped things are affected, e.g. for loops with 
calculations and so on. I didn't see any trouble with the calls I made to 
PostGIS inside or outside of pl/pgsql. This confirms/replicates your findings. 
Updated numbers/tests will be posted to GitHub shortly.
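
To illustrate, the kind of compute-bound loop that shows the effect might look roughly like this (the function name and body are hypothetical, not taken from the benchmark repo):

```sql
-- Hypothetical compute-bound pl/pgsql loop of the kind discussed above.
-- With no volatility keyword it would default to VOLATILE; marking it
-- IMMUTABLE (valid here, since the result depends only on the argument)
-- is the sort of change that restores concurrent scaling.
CREATE OR REPLACE FUNCTION heavy_loop(n integer) RETURNS double precision
LANGUAGE plpgsql IMMUTABLE AS $$
DECLARE
  acc double precision := 0;
BEGIN
  FOR i IN 1..n LOOP
    acc := acc + sqrt(i::double precision);
  END LOOP;
  RETURN acc;
END;
$$;
```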

Regards

Graeme Bell
