Re: [PERFORM] prepared query performs much worse than regular query
On Fri, 21 May 2010, Richard Yen wrote:
> Any ideas why the query planner chooses a different query plan when using
> prepared statements?

This is a FAQ. Preparing a statement makes Postgres create a plan without knowing the values that you will plug in, so it will not be as optimal as if the values were available. The whole idea is to avoid the planning cost each time the query is executed, but if your data is unusual it can result in worse plans.

Matthew

--
Existence is a convenient concept to designate all of the files that an
executable program can potentially process.   -- Fortran77 standard

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] prepared query performs much worse than regular query
On Fri, May 21, 2010 at 4:53 PM, Richard Yen wrote:
> Any ideas why the query planner chooses a different query plan when using
> prepared statements?

A prepared plan is the best one the planner can come up with *in general* for the query in question. If the distribution of the values you're querying against -- in your case, "owner" and "assignment" -- isn't relatively uniform, that plan is going to be suboptimal, if not downright pathological, for the more outlying-ly distributed values.

Looking at your prepared plan, it seems that, on average, there are 177 rows for every "assignment", and 184 per "owner". As it turns out, though, nearly a quarter of your table has an "owner" of -1. It's not terribly surprising, with a table that big and a distribution skew of that magnitude, that this query plan, with these arguments, ends up pretty firmly in the "pathological" category.

rls

--
:wq
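The numbers quoted above make the mismatch easy to see. A back-of-the-envelope sketch (figures taken from the EXPLAIN ANALYZE output elsewhere in the thread; the variable names are mine):

```python
# Why the generic plan's uniform-distribution estimate fails on skewed data.
# Numbers from the thread: ~124M rows in "submissions", the planner expects
# ~184 rows per "owner" on average, but owner = -1 (a placeholder value)
# actually matches ~29M rows.
total_rows = 124_000_000
estimated_rows_per_owner = 184            # planner's per-value average
actual_rows_for_placeholder = 28_977_245  # rows= from the EXPLAIN ANALYZE

# The generic plan assumes the bitmap index scan on "owner" touches ~184
# rows; in reality it touches vastly more, so BitmapAnd-ing both indexes
# loses badly to scanning the selective "assignment" index and filtering.
error_factor = actual_rows_for_placeholder / estimated_rows_per_owner
fraction = actual_rows_for_placeholder / total_rows
print(f"estimate off by a factor of ~{error_factor:,.0f}")
print(f"owner = -1 covers ~{fraction:.0%} of the table")
```

This is just arithmetic on the posted plan figures, but it shows why a plan that is reasonable "in general" can be pathological for one specific parameter value.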
[PERFORM] prepared query performs much worse than regular query
Hi everyone,

I use DBD::Pg to interface with our 8.4.2 database, but for a particular query, performance is horrible. I'm assuming that the behavior of $dbh->prepare is as if I did PREPARE foo AS (query), so I did an explain analyze on the command line:

> db_alpha=# prepare foo6 as (SELECT me.id, me.assignment, me.title,
> me.x_firstname, me.x_lastname, me.owner, me.node, me.grade, me.folder,
> me.word_count, me.char_length, me.char_count, me.page_count FROM submissions
> me WHERE ( ( owner = $1 AND me.assignment = $2 ) ));
> PREPARE
> db_alpha=# explain analyze execute foo6('-1', '8996557');
>                                    QUERY PLAN
> ---------------------------------------------------------------------------
>  Bitmap Heap Scan on submissions me  (cost=38.84..42.85 rows=1 width=70)
>    (actual time=346567.665..346567.665 rows=0 loops=1)
>    Recheck Cond: ((assignment = $2) AND (owner = $1))
>    ->  BitmapAnd  (cost=38.84..38.84 rows=1 width=0)
>        (actual time=346567.642..346567.642 rows=0 loops=1)
>          ->  Bitmap Index Scan on submissions_assignment_idx
>              (cost=0.00..19.27 rows=177 width=0)
>              (actual time=0.038..0.038 rows=2 loops=1)
>                Index Cond: (assignment = $2)
>          ->  Bitmap Index Scan on submissions_owner_idx
>              (cost=0.00..19.32 rows=184 width=0)
>              (actual time=346566.501..346566.501 rows=28977245 loops=1)
>                Index Cond: (owner = $1)
>  Total runtime: 346567.757 ms
> (8 rows)

Now, if I run it without preparing it -- just run it directly on the command line -- I get this plan:

> db_alpha=# explain analyze SELECT me.id, me.assignment, me.title,
> me.x_firstname, me.x_lastname, me.owner, me.node, me.grade, me.folder,
> me.word_count, me.char_length, me.char_count, me.page_count FROM submissions
> me WHERE ( ( owner = -1 AND me.assignment = 8996557 ) );
>                                    QUERY PLAN
> ---------------------------------------------------------------------------
>  Index Scan using submissions_assignment_idx on submissions me
>    (cost=0.00..549.15 rows=36 width=70)
>    (actual time=0.021..0.021 rows=0 loops=1)
>    Index Cond: (assignment = 8996557)
>    Filter: (owner = (-1))
>  Total runtime: 0.042 ms
> (4 rows)

submissions has ~124 million rows, and owner -1 is a placeholder in my database, to fulfill a foreign key requirement. I tried REINDEXing submissions_owner_idx and performing a VACUUM ANALYZE on the submissions table, but nothing seems to make a difference for this query. One other thing to note is that if I use any other value for the owner column, it comes back really fast (< 0.04 ms).

Any ideas why the query planner chooses a different query plan when using prepared statements?

--Richard
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
* Yeb Havinga (yebhavi...@gmail.com) wrote:
>> Normalizing by date parts was fast. Partitioning the tables by year
>> won't do much good -- users will probably choose 1900 to 2009,
>> predominately.
>
> Ok, in that case it is a bad idea.

Yeah, now that I understand what the user actually wants, I can certainly understand that you wouldn't want to partition by year.

It does strike me that perhaps you could partition by day ranges, but you'd have to store them as something other than the 'date' type, which is certainly frustrating, but you're not really operating on these in a 'normal' fashion as you would with a date.

The next question I would have, however, is whether you could pre-aggregate some of this data. If users are typically going to use 1900-2009 for years, could the information about all of those years be aggregated a priori to make those queries faster?

>> I thought about splitting the data by station by category, but that's
>> ~73000 tables.

Do not get hung up on having to have a separate table for every unique value in the column -- you don't need that. constraint_exclusion will work just fine with ranges too -- the problem is that you need to have ranges that make sense with the data type you're using and with the queries you're running. That doesn't really work here with the measurement_date, but it might work just fine with your station_id field.

>> I also thought about splitting the data by station district by
>> category -- there are 79 districts, yielding 474 child tables, which
>> is ~575000 rows per child table. Most of the time I'd imagine only one
>> or two districts would be selected. (Again, hard to know exactly.)

Also realize that PG will use multiple files for a single table once the size of that table goes beyond 1G.

> I agree with Matthew Wakeling in a different post: it's probably wise to
> first see how fast things can get by using indexes. Only if that fails
> to be fast, partitioning might be an option. (Though sequentially
> scanning 0.5M rows is not cheap.)

I would agree with this too -- get it working first, then look at partitioning. Even more so -- work on a smaller data set to begin with while you're figuring out how to get the right answer in a generally efficient way (not doing seq. scans through everything because you're operating on every row for something). It needs to be a couple hundred-thousand rows, but it doesn't need to be the full data set, imv.

Thanks,
Stephen
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
Regarding the leap year problem, you might consider creating a modified day of year field, which always assumes that the year contains a leap day. Then a given number always resolves to a given date, regardless of year. If you then partition (or index) on that field, then you may get a benefit.

On Fri, 21 May 2010, Yeb Havinga wrote:
> Shouldn't it be just the other way around - assume all years are non leap
> years for the doy part field to be indexed.

The mapping doesn't matter massively, as long as all days of the year can be mapped uniquely onto a number, and the numbers are sequential. Your suggestion does not satisfy the first of those two requirements.

If you assume that all years are leap years, then you merely skip a number in the middle of the year, which isn't a problem when you want to check for days between two bounds. However, if you assume non leap years, then there is no representation for the 29th of February, so not all data points will have a representative number to insert into the database.

Matthew

--
No, C++ isn't equal to D. 'C' is undeclared, so we assume it's an int,
with a default value of zero.  Hence, C++ should really be called 1.
-- met24, commenting on the quote "C++ -- shouldn't it be called D?"
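The two mappings can be sketched quickly with Python's datetime; the choice of 2000 (a leap year) and 1999 (a non-leap year) as reference years is mine, purely for illustration:

```python
# Sketch of the two "modified day of year" mappings discussed above.
from datetime import date

def doy_leap(month: int, day: int) -> int:
    """Day-of-year assuming every year contains a leap day (reference: 2000)."""
    return date(2000, month, day).timetuple().tm_yday

def doy_nonleap(month: int, day: int) -> int:
    """Day-of-year assuming no year contains a leap day (reference: 1999)."""
    return date(1999, month, day).timetuple().tm_yday

# Under the leap-year assumption every real date gets a unique, sequential
# number; data from non-leap years simply never uses number 60, which is
# harmless when checking for days between two bounds.
assert doy_leap(2, 29) == 60
assert doy_leap(3, 1) == 61

# Under the non-leap assumption, Feb 29 has no representation at all:
try:
    doy_nonleap(2, 29)
except ValueError:
    print("no day number exists for Feb 29 in a non-leap year")
```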
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
David Jarvis wrote:
> Hi, Yeb.
>
> This is starting to go back to the design I used with MySQL:
> * YEAR_REF - Has year and station
> * MONTH_REF - Has month, category, and year reference
> * MEASUREMENT - Has month reference, amount, and day
>
> Normalizing by date parts was fast. Partitioning the tables by year
> won't do much good -- users will probably choose 1900 to 2009,
> predominately.

Ok, in that case it is a bad idea.

> I thought about splitting the data by station by category, but that's
> ~73000 tables. My understanding is that PostgreSQL uses files per index,
> which would be messy at the OS level (Linux 2.6.31). Even by station
> alone is 12139 tables, which might be tolerable for now, but with an
> order of magnitude more stations on the distant horizon, it will not
> scale.

Yes, I've read a few times now that PG's partitioning doesn't scale beyond a few 100 partitions.

> I also thought about splitting the data by station district by
> category -- there are 79 districts, yielding 474 child tables, which
> is ~575000 rows per child table. Most of the time I'd imagine only one
> or two districts would be selected. (Again, hard to know exactly.)

I agree with Matthew Wakeling in a different post: it's probably wise to first see how fast things can get by using indexes. Only if that fails to be fast, partitioning might be an option. (Though sequentially scanning 0.5M rows is not cheap.)

I experimented a bit with a doy and year function.
-- note: leap year fix must still be added
create or replace function doy(timestamptz) RETURNS float8
as 'select extract(doy from $1);'
language sql immutable strict;

create or replace function year(timestamptz) RETURNS float8
as 'select extract(year from $1);'
language sql immutable strict;

\d parent
        Table "public.parent"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 t      | timestamp with time zone |
 y      | smallint                 |
Indexes:
    "doy_i" btree (doy(t))
    "year_i" btree (year(t))

A plan like the following is probably what you want:

test=# explain select * from parent
       where doy(t) between 10 and 20
       and year(t) between 1900 and 2009;
                      QUERY PLAN
-------------------------------------------------------
 Bitmap Heap Scan on parent  (cost=9.95..14.97 rows=1 width=10)
   Recheck Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision) AND (doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
   ->  BitmapAnd  (cost=9.95..9.95 rows=1 width=0)
         ->  Bitmap Index Scan on year_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision))
         ->  Bitmap Index Scan on doy_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
(7 rows)

regards,
Yeb Havinga
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
Matthew Wakeling wrote:
> On Fri, 21 May 2010, Yeb Havinga wrote:
>> For time based data I would for sure go for year based indexing.
>
> On the contrary, most of the queries seem to be over many years, but
> rather restricting on the time of year. Therefore, partitioning by month
> or some other per-year method would seem sensible.

The fact is that at the time I wrote my mail, I had not read a specification of the distribution of parameters (or I missed it). That's why the sentence of my mail before the one you quoted said: "the partitioning is only useful for speed, if it matches how your queries select data." In most of the databases I've worked with, the recent data was queried most (accounting, medical), but I can see that for climate analysis this might be different.

> Regarding the leap year problem, you might consider creating a modified
> day of year field, which always assumes that the year contains a leap
> day. Then a given number always resolves to a given date, regardless of
> year. If you then partition (or index) on that field, then you may get a
> benefit.

Shouldn't it be just the other way around -- assume all years are non leap years for the doy part field to be indexed?

regards,
Yeb Havinga
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
Hi, Yeb.

This is starting to go back to the design I used with MySQL:

- YEAR_REF - Has year and station
- MONTH_REF - Has month, category, and year reference
- MEASUREMENT - Has month reference, amount, and day

Normalizing by date parts was fast. Partitioning the tables by year won't do much good -- users will probably choose 1900 to 2009, predominately.

I thought about splitting the data by station by category, but that's ~73000 tables. My understanding is that PostgreSQL uses files per index, which would be messy at the OS level (Linux 2.6.31). Even by station alone is 12139 tables, which might be tolerable for now, but with an order of magnitude more stations on the distant horizon, it will not scale.

I also thought about splitting the data by station district by category -- there are 79 districts, yielding 474 child tables, which is ~575000 rows per child table. Most of the time I'd imagine only one or two districts would be selected. (Again, hard to know exactly.)

Dave
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
On Fri, 21 May 2010, Yeb Havinga wrote:
> For time based data I would for sure go for year based indexing.

On the contrary, most of the queries seem to be over many years, but rather restricting on the time of year. Therefore, partitioning by month or some other per-year method would seem sensible.

Regarding the leap year problem, you might consider creating a modified day of year field, which always assumes that the year contains a leap day. Then a given number always resolves to a given date, regardless of year. If you then partition (or index) on that field, then you may get a benefit.

In this case, partitioning is only really useful when you are going to be forced to do seq scans. If you can get a suitably selective index, in the case where you are selecting a small proportion of the data, then I would concentrate on getting the index right, rather than the partition, and maybe even not do partitioning.

Matthew

--
Trying to write a program that can't be written is... well, it can be an
enormous amount of fun!   -- Computer Science Lecturer
Re: [PERFORM] old server, new server, same performance
Scott Marlowe writes:
> Is one connecting via SSL? Is this a simple flat switched network, or are
> these machines on different segments connected via routers?

SSL is disabled. It is a switched network; all tested computers are in the same segment.

Finally I have switched the production database from the old server to the new one, and strange things happened. The same query on the new server that previously took 30 sec now runs in about 9 sec (so the same time as on the old production server).

Hm... Is it possible that a database under some load performs better because it is doing something that an idle database is not?

P.
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
There is a thing that might lead to confusion in the previous post:

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql immutable strict;

is equivalent with

create or replace function yearmod(int) RETURNS int
as 'select (($1 / 4) %32);'
language sql immutable strict;

and that is the function that was used with all the other output (it can be seen inlined in the explain output). I did not catch this until after the post.

regards,
Yeb Havinga
Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?
David Jarvis wrote:
>> Also, you're trying to do constraint_exclusion, but have you made sure
>> that it's turned on? And have you made sure that those constraints are
>> really the right ones and that they make sense? You're using a bunch of
>> extract()'s there too, why not just specify a CHECK constraint on the
>> date ranges which are allowed in the table..?
>
> I don't know what the date ranges are, so I can't partition them by year.
>
> Right now I created 72 child tables by using the category and month. This
> may have been a bad choice. But at least all the data is in the system
> now, so dissecting or integrating it back in different ways shouldn't
> take days.
>
> Thanks everyone for all your help, I really appreciate the time you've
> taken to guide me in the right direction to make the system as fast as
> it can be.

My $0.02 -- it's hard to comment inline due to the number of responses, but: the partitioning is only useful for speed if it matches how your queries select data. For time based data I would for sure go for year based indexing. If you want a fixed number of partitions, you could perhaps do something like year % 64.
I did a test to see if the constraint exclusion could work with extract, but that failed:

test=# create table parent(t timestamptz);
test=# create table child1(check ((extract(year from t)::int % 2)=0)) inherits(parent);
test=# create table child2(check ((extract(year from t)::int % 2)=1)) inherits(parent);
test=# explain select * from parent where (extract(year from t)::int % 2) = 0;
                           QUERY PLAN
---------------------------------------------------------------------
 Result  (cost=0.00..158.40 rows=33 width=8)
   ->  Append  (cost=0.00..158.40 rows=33 width=8)
         ->  Seq Scan on parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)
         ->  Seq Scan on child1 parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)
         ->  Seq Scan on child2 parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)

It hits all partitions even when I requested a single year, so an extra column is needed. Attempt 2, with an added year smallint:

test=# create table parent(t timestamptz, y smallint);
test=# create table child1(check ((y % 2)=0)) inherits(parent);
test=# create table child2(check ((y % 2)=1)) inherits(parent);
test=# explain select * from parent where (y % 2) between 0 and 0;
                           QUERY PLAN
---------------------------------------------------------------------
 Result  (cost=0.00..122.00 rows=20 width=10)
   ->  Append  (cost=0.00..122.00 rows=20 width=10)
         ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))
         ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))

This works: only one child table is hit.

That made me think: if you'd scan two consecutive years, you'd always hit two different partitions. For your use case it'd be nice if some year wraparounds would fall in the same partition. The following query shows partition numbers for 1900 - 2010 with 4 consecutive years in the same partition.
It also shows that in this case 32 partitions is enough:

test=# select x, (x / 4) % 32 from generate_series(1900,2010) as x(x);
  x   | ?column?
------+----------
 1900 |       27
 1901 |       27
 1902 |       27
 1903 |       27
 1904 |       28
 1905 |       28
 etc
 1918 |       31
 1919 |       31
 1920 |        0
 1921 |        0
 etc
 2005 |       21
 2006 |       21
 2007 |       21
 2008 |       22
 2009 |       22
 2010 |       22
(111 rows)

This would mean that an extra smallint column is needed, which would inflate the 300M relation with almost a GB, but I still think it'd be a good idea.

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql immutable strict;

create table parent(t timestamptz, y smallint);

select 'create table child'||x||'(check (yearmod(y)='||x-1||')) inherits(parent);'
from generate_series(1,32) as x(x);
                          ?column?
-------------------------------------------------------------
 create table child1(check (yearmod(y)=0)) inherits(parent);
 create table child2(check (yearmod(y)=1)) inherits(parent);
 create table child3(check (yearmod(y)=2)) inherits(parent);
 etc
 create table child30(check (yearmod(y)=29)) inherits(parent);
 create table child31(check (yearmod(y)=30)) inherits(parent);
 create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste the output of this query in psql to create the child tables. Example
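The bucketing above is easy to sanity-check outside the database. A small Python sketch of the same yearmod mapping (the function name mirrors the SQL one; everything else is mine):

```python
# Check of the yearmod bucketing proposed above: (year >> 2) % 32 groups
# four consecutive years per partition and cycles through 32 buckets, so
# 1900-2010 fits without consecutive years ever spanning more than two
# partitions, and year ranges that cross 19xx/20xx stay cheap.
def yearmod(year: int) -> int:
    # equivalent to (year / 4) % 32 with integer division, as noted in the thread
    return (year >> 2) % 32

# yearmod agrees with the division form for the whole range of interest
assert all(yearmod(y) == (y // 4) % 32 for y in range(1900, 2011))

buckets = [yearmod(y) for y in range(1900, 2011)]
print(buckets[:6])                    # → [27, 27, 27, 27, 28, 28]
print(yearmod(1919), yearmod(1920))   # wraparound from bucket 31 back to 0
print(len(set(buckets)))              # distinct partitions actually used
```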