Re: [PERFORM] prepared query performs much worse than regular query

2010-05-21 Thread Matthew Wakeling

On Fri, 21 May 2010, Richard Yen wrote:

Any ideas why the query planner chooses a different query plan when using 
prepared statements?


This is a FAQ. Preparing a statement makes Postgres create a plan, without 
knowing the values that you will plug in, so it will not be as optimal as 
if the values were available. The whole idea is to avoid the planning cost 
each time the query is executed, but if your data is unusual it can 
result in worse plans.
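
A minimal sketch of one common workaround, assuming you control the SQL
(the function name below is illustrative, not from the thread): plan at
execution time with dynamic SQL, so the actual values are visible to the
planner. RETURN QUERY EXECUTE ... USING needs 8.4 or later:

create or replace function submissions_for(p_owner int, p_assignment int)
returns setof submissions as $$
begin
  -- the query string is planned on each call, with the real values
  -- available for selectivity estimation
  return query execute
    'select * from submissions where owner = $1 and assignment = $2'
    using p_owner, p_assignment;
end;
$$ language plpgsql;

With DBD::Pg, setting pg_server_prepare => 0 achieves much the same thing
by sending literal values instead of doing a server-side prepare.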


Matthew

--
Existence is a convenient concept to designate all of the files that an
executable program can potentially process.   -- Fortran77 standard



Re: [PERFORM] prepared query performs much worse than regular query

2010-05-21 Thread Rosser Schwarz
On Fri, May 21, 2010 at 4:53 PM, Richard Yen wrote:
> Any ideas why the query planner chooses a different query plan when using 
> prepared statements?

A prepared plan is the best one the planner can come up with *in
general* for the query in question.  If the distribution of the values
you're querying against -- in your case, "owner" and "assignment" --
isn't relatively uniform, that plan is going to be suboptimal, if not
downright pathological, for the more extreme outlying values.

Looking at your prepared plan, it seems that, on average, there are
177 rows for every "assignment", and 184 per "owner".  As it turns
out, though, nearly a quarter of your table has an "owner" of -1.
It's not terribly surprising, with a table that big and a distribution
skew of that magnitude, that this query plan, with these arguments,
ends up pretty firmly in the "pathological" category.
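
A hedged way to confirm that skew (assuming the table has been ANALYZEd)
is to look at the statistics the planner averages over; pg_stats exposes
them per column:

SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'submissions' AND attname = 'owner';

-- if -1 shows up with a frequency near 0.25, a generic plan built
-- for ~184 rows per owner has no chance for that value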

rls

-- 
:wq



[PERFORM] prepared query performs much worse than regular query

2010-05-21 Thread Richard Yen
Hi everyone,

I use DBD::Pg to interface with our 8.4.2 database, but for a particular query, 
performance is horrible.  I'm assuming that the behavior of $dbh->prepare is as 
if I did PREPARE foo AS (query), so I did an explain analyze in the commandline:
> db_alpha=# prepare foo6 as (SELECT me.id, me.assignment, me.title,
> me.x_firstname, me.x_lastname, me.owner, me.node, me.grade, me.folder,
> me.word_count, me.char_length, me.char_count, me.page_count FROM
> submissions me WHERE ( ( owner = $1 AND me.assignment = $2 ) ));
> PREPARE
> db_alpha=# explain analyze execute foo6('-1', '8996557');
>                                   QUERY PLAN
> ---------------------------------------------------------------------------
>  Bitmap Heap Scan on submissions me  (cost=38.84..42.85 rows=1 width=70) (actual time=346567.665..346567.665 rows=0 loops=1)
>    Recheck Cond: ((assignment = $2) AND (owner = $1))
>    ->  BitmapAnd  (cost=38.84..38.84 rows=1 width=0) (actual time=346567.642..346567.642 rows=0 loops=1)
>          ->  Bitmap Index Scan on submissions_assignment_idx  (cost=0.00..19.27 rows=177 width=0) (actual time=0.038..0.038 rows=2 loops=1)
>                Index Cond: (assignment = $2)
>          ->  Bitmap Index Scan on submissions_owner_idx  (cost=0.00..19.32 rows=184 width=0) (actual time=346566.501..346566.501 rows=28977245 loops=1)
>                Index Cond: (owner = $1)
>  Total runtime: 346567.757 ms
> (8 rows)


Now, if I run it without preparing it--just run it directly in the 
commandline--I get this plan:
> db_alpha=# explain analyze SELECT me.id, me.assignment, me.title,
> me.x_firstname, me.x_lastname, me.owner, me.node, me.grade, me.folder,
> me.word_count, me.char_length, me.char_count, me.page_count FROM
> submissions me WHERE ( ( owner = -1 AND me.assignment = 8996557 ) );
>                                   QUERY PLAN
> ---------------------------------------------------------------------------
>  Index Scan using submissions_assignment_idx on submissions me  (cost=0.00..549.15 rows=36 width=70) (actual time=0.021..0.021 rows=0 loops=1)
>    Index Cond: (assignment = 8996557)
>    Filter: (owner = (-1))
>  Total runtime: 0.042 ms
> (4 rows)

submissions has ~124 million rows, and owner -1 is a placeholder in my 
database, to fulfill a foreign key requirement.  I tried REINDEXing 
submissions_owner_idx and performing a VACUUM ANALYZE on the submissions table, 
but nothing seems to make a difference for this query.  One other thing to note 
is that if I use any other value for the owner column, it comes back really 
fast (< 0.04 ms).

Any ideas why the query planner chooses a different query plan when using 
prepared statements?

--Richard


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Stephen Frost
* Yeb Havinga (yebhavi...@gmail.com) wrote:
>> Normalizing by date parts was fast. Partitioning the tables by year  
>> won't do much good -- users will probably choose 1900 to 2009,  
>> predominately.
> Ok, in that case it is a bad idea.

Yeah, now that I understand what the user actually wants, I can
certainly understand that you wouldn't want to partition by year.  It
does strike me that perhaps you could partition by day ranges, but you'd
have to store them as something other than the 'date' type, which is
certainly frustrating, but you're not really operating on these in a
'normal' fashion as you would with a date.

The next question I would have, however, is whether you could
pre-aggregate some of this data.  If users are going to typically use
1900-2009 for years, then could the information about all of those years
be aggregated a priori to make those queries faster?
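
As a sketch of what that pre-aggregation could look like -- table and
column names here are guesses based on the thread, not the real schema:

CREATE TABLE measurement_summary AS
SELECT station_id, category,
       extract(doy from measurement_date)::int AS doy,
       count(*) AS n,
       avg(amount) AS avg_amount
FROM measurement
GROUP BY 1, 2, 3;

Queries spanning 1900-2009 could then read this summary instead of
re-scanning a century of raw rows.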

>> I thought about splitting the data by station by category, but that's  
>> ~73000 tables.

Do not get hung up on having to have a separate table for every unique
value in the column -- you don't need that.  constraint_exclusion will
work just fine with ranges too -- the problem is that you need to have
ranges that make sense with the data type you're using and with the
queries you're running.  That doesn't really work here with the
measurement_date, but it might work just fine with your station_id
field.
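
Something along these lines, with names and bounds purely illustrative:

CREATE TABLE measurement_s0001_1000
  (CHECK (station_id BETWEEN 1 AND 1000)) INHERITS (measurement);
CREATE TABLE measurement_s1001_2000
  (CHECK (station_id BETWEEN 1001 AND 2000)) INHERITS (measurement);
-- ...a dozen or so ranges instead of 12139 per-station tables,
-- with constraint_exclusion = on so the planner can prune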

>> I also thought about splitting the data by station district by  
>> category -- there are 79 districts, yielding 474 child tables, which  
>> is ~575000 rows per child table. Most of the time I'd imagine only one  
>> or two districts would be selected. (Again, hard to know exactly.)

Also realize that PG will use multiple files for a single table once the
size of that table goes beyond 1G.

> I agree with Matthew Wakeling in a different post: it's probably wise to
> first see how fast things can get by using indexes. Only if that fails
> to be fast, partitioning might be an option. (Though sequentially
> scanning 0.5M rows is not cheap).

I would agree with this too- get it working first, then look at
partitioning.  Even more so- work on a smaller data set to begin with
while you're figuring out how to get the right answer in a generally
efficient way (not doing seq. scans through everything because you're
operating on every row for something).  It needs to be a couple
hundred-thousand rows, but it doesn't need to be the full data set, imv.

Thanks,

Stephen




Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Matthew Wakeling
Regarding the leap year problem, you might consider creating a modified day 
of year field, which always assumes that the year contains a leap day. Then 
a given number always resolves to a given date, regardless of year. If you 
then partition (or index) on that field, then you may get a benefit.

On Fri, 21 May 2010, Yeb Havinga wrote:
Shouldn't it be just the other way around -- assume all years are non
leap years for the doy part field to be indexed?


The mapping doesn't matter massively, as long as all days of the year can 
be mapped uniquely onto a number, and the numbers are sequential. Your 
suggestion does not satisfy the first of those two requirements.


If you assume that all years are leap years, then you merely skip a number
in the middle of the year, which isn't a problem when you want to check
for days between two bounds. However, if you assume non leap years, then
there is no representation for the 29th of February, so not all data
points will have a representative number to insert into the database.
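
A sketch of such a leap-normalized day-of-year (the function name is
illustrative; like the doy() function elsewhere in this thread, it is
declared immutable so it can be indexed, glossing over the timezone
dependence of timestamptz):

create or replace function leap_doy(timestamptz) returns int as $$
  select extract(doy from $1)::int
         + case when extract(month from $1) > 2
                 and not ((extract(year from $1)::int % 4 = 0
                           and extract(year from $1)::int % 100 <> 0)
                          or extract(year from $1)::int % 400 = 0)
                then 1 else 0 end
  -- dates after Feb 28 in non-leap years shift up by one, so every
  -- calendar day keeps the same number in every year
$$ language sql immutable strict;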


Matthew

--
No, C++ isn't equal to D. 'C' is undeclared, so we assume it's an int,
with a default value of zero.  Hence, C++ should really be called 1.
-- met24, commenting on the quote "C++ -- shouldn't it be called D?"



Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga

David Jarvis wrote:

> Hi, Yeb.
>
> This is starting to go back to the design I used with MySQL:
>
> * YEAR_REF - Has year and station
> * MONTH_REF - Has month, category, and year reference
> * MEASUREMENT - Has month reference, amount, and day
>
> Normalizing by date parts was fast. Partitioning the tables by year
> won't do much good -- users will probably choose 1900 to 2009,
> predominately.

Ok, in that case it is a bad idea.

> I thought about splitting the data by station by category, but that's
> ~73000 tables. My understanding is that PostgreSQL uses files per
> index, which would be messy at the OS level (Linux 2.6.31). Even by
> station alone is 12139 tables, which might be tolerable for now, but
> with an order of magnitude more stations on the distant horizon, it
> will not scale.

Yes, I've read a few times now that PG's partitioning doesn't scale
beyond a few hundred partitions.

> I also thought about splitting the data by station district by
> category -- there are 79 districts, yielding 474 child tables, which
> is ~575000 rows per child table. Most of the time I'd imagine only one
> or two districts would be selected. (Again, hard to know exactly.)

I agree with Matthew Wakeling in a different post: it's probably wise to
first see how fast things can get by using indexes. Only if that fails
to be fast, partitioning might be an option. (Though sequentially
scanning 0.5M rows is not cheap).


I experimented a bit with a doy and year function.

-- note: leap year fix must still be added
create or replace function doy(timestamptz) RETURNS float8
as 'select extract(doy from $1);'
language sql
immutable
strict;
create or replace function year(timestamptz) RETURNS float8
as 'select extract(year from $1);'
language sql
immutable
strict;

\d parent
         Table "public.parent"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 t      | timestamp with time zone |
 y      | smallint                 |
Indexes:
    "doy_i" btree (doy(t))
    "year_i" btree (year(t))

A plan like the following is probably what you want

test=# explain select * from parent where doy(t) between 10 and 20 and
year(t) between 1900 and 2009;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Bitmap Heap Scan on parent  (cost=9.95..14.97 rows=1 width=10)
   Recheck Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision) AND (doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
   ->  BitmapAnd  (cost=9.95..9.95 rows=1 width=0)
         ->  Bitmap Index Scan on year_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision))
         ->  Bitmap Index Scan on doy_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
(7 rows)

regards,
Yeb Havinga






Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga

Matthew Wakeling wrote:

> On Fri, 21 May 2010, Yeb Havinga wrote:
>> For time based data I would for sure go for year based indexing.
>
> On the contrary, most of the queries seem to be over many years, but
> rather restricting on the time of year. Therefore, partitioning by
> month or some other per-year method would seem sensible.
The fact is that at the time I wrote my mail, I had not read a
specification of the distribution of parameters (or I missed it). That's
why the sentence of my mail before the one you quoted said: "the
partitioning is only useful for speed, if it matches how your queries
select data." In most of the databases I've worked with, the recent data
was queried most (accounting, medical), but I can see that for climate
analysis this might be different.
> Regarding the leap year problem, you might consider creating a
> modified day of year field, which always assumes that the year
> contains a leap day. Then a given number always resolves to a given
> date, regardless of year. If you then partition (or index) on that
> field, then you may get a benefit.

Shouldn't it be just the other way around -- assume all years are non
leap years for the doy part field to be indexed?


regards,
Yeb Havinga




Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread David Jarvis
Hi, Yeb.

This is starting to go back to the design I used with MySQL:

   - YEAR_REF - Has year and station
   - MONTH_REF - Has month, category, and year reference
   - MEASUREMENT - Has month reference, amount, and day

Normalizing by date parts was fast. Partitioning the tables by year won't do
much good -- users will probably choose 1900 to 2009, predominately.

I thought about splitting the data by station by category, but that's ~73000
tables. My understanding is that PostgreSQL uses files per index, which
would be messy at the OS level (Linux 2.6.31). Even by station alone is
12139 tables, which might be tolerable for now, but with an order of
magnitude more stations on the distant horizon, it will not scale.

I also thought about splitting the data by station district by category --
there are 79 districts, yielding 474 child tables, which is ~575000 rows per
child table. Most of the time I'd imagine only one or two districts would be
selected. (Again, hard to know exactly.)

Dave


Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Matthew Wakeling

On Fri, 21 May 2010, Yeb Havinga wrote:

For time based data I would for sure go for year based indexing.


On the contrary, most of the queries seem to be over many years, but 
rather restricting on the time of year. Therefore, partitioning by month 
or some other per-year method would seem sensible.


Regarding the leap year problem, you might consider creating a modified 
day of year field, which always assumes that the year contains a leap day. 
Then a given number always resolves to a given date, regardless of year. 
If you then partition (or index) on that field, then you may get a 
benefit.


In this case, partitioning is only really useful when you are going to be 
forced to do seq scans. If you can get a suitably selective index, in the 
case where you are selecting a small proportion of the data, then I would 
concentrate on getting the index right, rather than the partition, and 
maybe even not do partitioning.
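
For example -- a hedged sketch, assuming measurement_date is of type date
so the expression is immutable, with illustrative names:

CREATE INDEX measurement_station_doy_idx
  ON measurement (station_id, (extract(doy from measurement_date)));

A query filtering on station and time-of-year can then be answered from
this one index, with no partitioning at all.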


Matthew

--
Trying to write a program that can't be written is... well, it can be an
enormous amount of fun! -- Computer Science Lecturer



Re: [PERFORM] old server, new server, same performance

2010-05-21 Thread Piotr Legiecki

Scott Marlowe writes:


Is one connecting via SSL?  Is this a simple flat switched network, or
are these machines on different segments connected via routers?


SSL is disabled.
It is switched network, all tested computers are in the same segment.


Finally I have switched the production database from the old server to
the new one, and strange things happened. The same query that I had run
before on the new server with 30 sec results now runs in about 9 sec (so
the same time as on the old production server). Hm...


Is it possible that a database under some load performs better because it
is doing something that an unused database is not?



P.



Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga

There is a thing that might lead to confusion in the previous post:

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql
immutable
strict;

is equivalent to

create or replace function yearmod(int) RETURNS int
as 'select (($1 / 4) %32);'
language sql
immutable
strict;

and that is the function that was used with all the other output (it can 
be seen inlined in the explain output). I did not catch this until after 
the post.
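
A quick sanity check that the two forms agree over the years in play
(they only diverge for negative arguments):

test=# select bool_and((x >> 2) % 32 = (x / 4) % 32)
       from generate_series(1900, 2010) as x(x);
 bool_and
----------
 t
(1 row)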


regards,
Yeb Havinga





Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread Yeb Havinga
David Jarvis wrote: 


>> Also, you're trying to do constraint_exclusion, but have you made sure
>> that it's turned on?  And have you made sure that those constraints are
>> really the right ones and that they make sense?  You're using a bunch of
>> extract()'s there too, why not just specify a CHECK constraint on the
>> date ranges which are allowed in the table..?
>
> I don't know what the date ranges are? So I can't partition them by year?
>
> Right now I created 72 child tables by using the category and month.
> This may have been a bad choice. But at least all the data is in the
> system now so dissecting or integrating it back in different ways
> shouldn't take days.
>
> Thanks everyone for all your help, I really appreciate the time you've
> taken to guide me in the right direction to make the system as fast as
> it can be.


My $0.02 -- it's hard to comment inline due to the number of responses,
but: the partitioning is only useful for speed, if it matches how your
queries select data. For time based data I would for sure go for year
based indexing. If you want a fixed number of partitions, you could
perhaps do something like year % 64. I did a test to see if the
constraint exclusion could work with extract, but that failed:


test=# create table parent(t timestamptz);
test=# create table child1(check ((extract(year from t)::int % 2)=0)) 
inherits( parent);
test=# create table child2(check ((extract(year from t)::int % 2)=1)) 
inherits(parent);
test=# explain select * from parent where (extract(year from t)::int %
2) = 0;
                               QUERY PLAN
------------------------------------------------------------------------
 Result  (cost=0.00..158.40 rows=33 width=8)
   ->  Append  (cost=0.00..158.40 rows=33 width=8)
         ->  Seq Scan on parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)
         ->  Seq Scan on child1 parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)
         ->  Seq Scan on child2 parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)

It hits all partitions even when I requested a single year.

So an extra column would be needed, attempt 2 with added year smallint.

test=# create table parent(t timestamptz, y smallint);
test=# create table child1(check ((y % 2)=0)) inherits( parent);
test=# create table child2(check ((y % 2)=1)) inherits( parent);
test=# explain select * from parent where (y % 2) between 0 and 0;
                               QUERY PLAN
------------------------------------------------------------------------
 Result  (cost=0.00..122.00 rows=20 width=10)
   ->  Append  (cost=0.00..122.00 rows=20 width=10)
         ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))
         ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))

This works: only one child table hit.

That made me think: if you'd scan two consecutive years, you'd always
hit two different partitions. For your use case it'd be nice if runs of
consecutive years fell in the same partition. The following query
shows partition numbers for 1900-2010 with 4 consecutive years in the
same partition. It also shows that in this case 32 partitions are enough:


test=# select x, (x / 4) % 32 from generate_series(1900,2010) as x(x);
  x   | ?column?
------+----------
 1900 |       27
 1901 |       27
 1902 |       27
 1903 |       27
 1904 |       28
 1905 |       28
 etc
 1918 |       31
 1919 |       31
 1920 |        0
 1921 |        0
 etc
 2005 |       21
 2006 |       21
 2007 |       21
 2008 |       22
 2009 |       22
 2010 |       22
(111 rows)

This would mean that an extra smallint column is needed, which would
inflate the 300M relation by almost a GB, but I still think it'd be
a good idea.


create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql
immutable
strict;

create table parent(t timestamptz, y smallint);

select 'create table child'||x||'(check (yearmod(y)='||x-1||'))
inherits(parent);' from generate_series(1,32) as x(x);
                           ?column?
--------------------------------------------------------------
 create table child1(check (yearmod(y)=0)) inherits(parent);
 create table child2(check (yearmod(y)=1)) inherits(parent);
 create table child3(check (yearmod(y)=2)) inherits(parent);
 etc
 create table child30(check (yearmod(y)=29)) inherits(parent);
 create table child31(check (yearmod(y)=30)) inherits(parent);
 create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste output of this query in psql to create child tables.

Example