Re: [PERFORM] Postgres 8.x on Windows Server in production
We're running 8.3, but when we started this server about 2 years ago it was an earlier 8.x, I don't remember which. --Rainer > -Original Message- > From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance- > ow...@postgresql.org] On Behalf Of Scott Marlowe > Sent: Monday, April 13, 2009 8:41 AM > To: Rainer Mager > Cc: Ognjen Blagojevic; pgsql-performance@postgresql.org > Subject: Re: [PERFORM] Postgres 8.x on Windows Server in production > > On Sun, Apr 12, 2009 at 5:13 PM, Rainer Mager > wrote: > > We use Postgres 8.x in production on Windows Server 2003. We have not > done a > > direct head-to-head comparison against any *nix environment, so I > can't > > really compare them, but I can still give a few comments. > > Just wondering, what version are you actually running? Big > differences from 8.0, 8.1, 8.2, 8.3 and soon 8.4. For people taking > your advice on running on windows, it helps them make a decision on > whether or not to upgrade. > > > First of all, it seems that some of the popular file systems in *nix > are > > more robust at preventing disk fragmentation than NTFS is. Because of > this I > > definitely recommend have some defragging solution. What we've > settled on in > > Linux file systems still fragment, they just don't tend to fragment as > much. As the drive gets closer to being full fragmentation will > become more of a problem. > > -- > Sent via pgsql-performance mailing list (pgsql- > performa...@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Postgres 8.x on Windows Server in production
We use Postgres 8.x in production on Windows Server 2003. We have not done a direct head-to-head comparison against any *nix environment, so I can't really compare them, but I can still give a few comments. First of all, it seems that some of the popular file systems in *nix are more robust at preventing disk fragmentation than NTFS is. Because of this I definitely recommend have some defragging solution. What we've settled on in O&O (that's the company name) Defrag Server version. Their software has some nice features related to defragging in the background while monitoring system usage so an to impact performance minimally. Second, one big difficulty with running on Windows is that most of the Postgres expertise seems is around *nix environments. This means that when you do need to investigate a performance issue it can be more difficult to get direct advice. For example, perusing this mailing list will show lot's of tips suggesting running various tools to show io performance, etc. Well, on Windows the toolset is different. All in all we've been happy enough with Windows. Certainly we've never considered migrating to *nix because of difficulties with it. --Rainer > -Original Message- > From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance- > ow...@postgresql.org] On Behalf Of Ognjen Blagojevic > Sent: Friday, April 10, 2009 6:47 PM > To: pgsql-performance@postgresql.org > Subject: [PERFORM] Postgres 8.x on Windows Server in production > > Hi all, > > What are your experiences with Postgres 8.x in production use on > Windows > Server 2003/2008? Are there any limitations, trade-offs or quirks? > > My client is accustomed to Windows Server environment, but it seems > hard > to google good information about these types of installations. > > Regards, > Ognjen > > -- > Sent via pgsql-performance mailing list (pgsql- > performa...@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] difficulties with time based queries
Thanks for all of the suggestions so far. I've been trying to reduce the number of indices I have, but I'm running into a problem. I have a need to do queries on this table with criteria applied to the date and possibly any or all of the other key columns. As a reminder, here's my table: Table "public.ad_log" Column |Type | Modifiers +-+- --- ad_log_id | integer | not null default nextval('ad_log_ad_log_id_seq'::regclass) channel| integer | not null player | integer | not null ad | integer | not null start_time | timestamp without time zone | end_time | timestamp without time zone | So, I need indices that make it fast querying against start_time as well as all possible combinations of channel, player, and ad. Below is a sample query that uses all of these (note that I've removed actual strings to protect customer data). The result is fine in terms of speed, but since it's using the ad_log_ad_date index I'm wondering what the best strategy is to cover queries that don't specify an ad. Should I have 4 indices, one with just the start_time (for when no other columns are specified) and the other three each with the start_time and the three other criteria: channel, player, and ad? I'm currently experimenting with various options, but since it takes a couple of hours to create a particular index this is taking a while. # explain analyze SELECT ad_log.ad_log_id, channels.name as channel_name, players.name as player_name, ads.name as ad_name, start_time, end_time, (data IS NOT NULL) AS has_screenshot FROM channels, players, players_history, ads, ad_log LEFT OUTER JOIN ad_log_screenshot USING (ad_log_id) WHERE channel=channels.id AND player=players_history.id AND players_history.player_instance = players.id AND ad=ads.id AND channels.name LIKE '' AND players.name LIKE '' AND ads.name LIKE '' AND date(start_time) BETWEEN '2009-01-20' AND date('2009-01-21') ORDER BY channels.name, players.name, start_time, ads.name LIMIT 100 OFFSET 100; QUERY PLAN -- Limit (cost=7425.26..7425.26 rows=1 width=120) (actual time=1256.116..1256.202 rows=39 loops=1) -> Sort (cost=7425.26..7425.26 rows=1 width=120) (actual time=1255.965..1256.068 rows=139 loops=1) Sort Key: channels.name, players.name, ad_log.start_time, ads.name Sort Method: quicksort Memory: 35kB -> Nested Loop Left Join (cost=0.01..7425.25 rows=1 width=120) (actual time=179.086..1255.451 rows=139 loops=1) -> Nested Loop (cost=0.01..7417.06 rows=1 width=88) (actual time=137.488..1212.531 rows=139 loops=1) Join Filter: (ad_log.channel = channels.id) -> Nested Loop (cost=0.01..7415.73 rows=1 width=60) (actual time=120.308..1192.867 rows=139 loops=1) Join Filter: (players_history.id = ad_log.player) -> Nested Loop (cost=0.00..36.92 rows=1 width=17) (actual time=21.960..23.405 rows=1 loops=1) Join Filter: (players.id = players_history.player_instance) -> Seq Scan on players (cost=0.00..11.80 rows=1 width=17) (actual time=5.981..6.083 rows=1 loops=1) Filter: (name ~~ ''::text) -> Seq Scan on players_history (cost=0.00..14.50 rows=850 width=8) (actual time=15.880..16.592 rows=850 loops=1) -> Nested Loop (cost=0.01..7371.03 rows=622 width=51) (actual time=75.161..1156.076 rows=15600 loops=1) -> Seq Scan on ads (cost=0.00..72.79 rows=1 width=27) (actual time=15.776..31.975 rows=1 loops=1) Filter: (name ~~ ''::text) -> Index Scan using ad_log_ad_date on ad_log (cost=0.01..7267.77 rows=2438 width=32) (actual time=59.375..1095.229 rows=15600 loops=1) Index Cond: ((ad_log.ad = ads.id) AND (date(ad_log.start_time) >= '2009-01-20'::date) AND (date(ad_log.start_time) <= '2009-01-21'::date)) -> Seq Scan on channels (cost=0.00..1.31 rows=1 width=36) (actual time=0.128..0.132 rows=1 loops=139) Filter: (channels.name ~~ ''::text) -> Index Scan using ad_log_screenshot_pkey on ad_log_screenshot (cost=0.00..8.18 rows=1 width=36) (actual time=0.304..0.304 rows=0 loops=139) Index Cond: (ad_log.ad_log_id = ad_log_screenshot.ad_log_id) Total runtime
Re: [PERFORM] difficulties with time based queries
So, I defragged my disk and reran my original query and it got a little better, but still far higher than I'd like. I then rebuilt (dropped and recreated) the ad_log_date_all index and reran the query and it is quite a bit better: # explain analyze select * from ad_log where date(start_time) < date('2009-03-31') and date(start_time) >= date('2009-03-30'); QUERY PLAN -- Bitmap Heap Scan on ad_log (cost=64770.21..3745596.62 rows=2519276 width=32) (actual time=1166.479..13862.107 rows=2275167 loops=1) Recheck Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date)) -> Bitmap Index Scan on ad_log_date_all (cost=0.00..64140.39 rows=2519276 width=0) (actual time=1143.582..1143.582 rows=2275167 loops=1) Index Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date)) Total runtime: 14547.885 ms During the query the disk throughput peaked at 30MB/s and was mostly at around 20MB/s, much better. So, a few questions: What can I do to prevent the index from getting bloated, or in whatever state it was in? What else can I do to further improve queries on this table? Someone suggested posting details of my conf file. Which settings are most likely to be useful for this? Any other suggestions? Thanks, --Rainer
Re: [PERFORM] difficulties with time based queries
> -Original Message- > From: Tom Lane [mailto:t...@sss.pgh.pa.us] > "Rainer Mager" writes: > >> From: Tom Lane [mailto:t...@sss.pgh.pa.us] > >> Hmm ... it's pretty unusual to see the index fetch portion of a > bitmap > >> scan take the bulk of the runtime. Usually that part is fast and > where > >> the pain comes is in fetching from the heap. I wonder whether that > >> index has become bloated. How big are the table and the index > >> physically? (Look at pg_class.relpages, or if you want a really > >> accurate number try pg_relation_size().) > > > Can you give me some more info on how to look at these stats? > > Since you've got 8.3 it's easy: select pg_relation_size('tablename') > (or indexname). The result is in bytes, so you might want to > divide by 1K or 1M to keep the number readable. Ok, nice and simple...I like it: The result for the table ad_log, is 30,063 MB. The result for the index, ad_log_date_all, is 17,151 MB. I guess this roughly makes sense since the index is on 4 fields and the table only has 6 fields. For the particular query I'm trying to optimize at the moment I believe I should be able to use an index that references only 2 fields, which, I imagine, should reduce the time needed to read it. I'll play with this a bit and see what happens. Any other suggestions? --Rainer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] difficulties with time based queries
Thanks for all the replies, I'll try to address the follow up questions: > From: David Wilson [mailto:david.t.wil...@gmail.com] > > The stats look good and it's using a viable index for your query. What > kind of hardware is this on, and what are the relevant postgresql.conf > lines? (Or, for that matter, what does iostat say while this query's > running?) I'm running on Windows, so I don't have iostat, but perfmon tells me my Avg. Disk Queue Length went up to 1.2 during the query (versus a normal value of about 0.02). Also disk throughput was at about 1.2 MB/s during the query. I don't know how much of this is random versus linear. > From: PFC [mailto:li...@peufeu.com] > > With this quantity of rows, you want to try to make the disk > accesses as > linear as possible. > This means your table should be organized on disk by date, at > least > roughly. > If your data comes from an import that was sorted on some other > column, > this may not be the case. > > What kind of bytes/s do you get from the drives ? The data should be mostly ordered by date. It is all logged in semi-realtime such that 99% will be logged within an hour of the timestamp. Also, as stated above, during this query it was about 1.2 MB/s, which I know isn't great. I admit this isn't the best hardware in the world, but I would expect better than that for linear queries. > Do you UPDATE or DELETE a lot from this table ? Is it vacuum'd > enough ? No, this table has no UPDATEs or DELETEs. It is auto vacuum'd, but no manual vacuuming. In regards to clustering, I'm hesitant to do that unless I have no other choice. My understanding is that I would need to do periodic re-clustering to maintain it, and during that time the table is very busy. > From: Tom Lane [mailto:t...@sss.pgh.pa.us] > Hmm ... it's pretty unusual to see the index fetch portion of a bitmap > scan take the bulk of the runtime. Usually that part is fast and where > the pain comes is in fetching from the heap. I wonder whether that > index has become bloated. How big are the table and the index > physically? (Look at pg_class.relpages, or if you want a really > accurate number try pg_relation_size().) Can you give me some more info on how to look at these stats? That is, what/where is pg_class.relpages, etc. I'll also do some searching for this info. > What Postgres version is this, exactly? 8.3.3 > BTW, I think you've gone way overboard in your indexing of this table; > those indexes are certainly consuming well more space than the table > does, and a lot of them are redundant. Agreed, I need to look carefully at all of the queries we do on this table and reduce this. --Rainer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] difficulties with time based queries
I have a somewhat large table (more than 100 million rows) that contains log data with start_time and end_time columns. When I try to do queries on this table I always find them slower than what I need and what I believe should be possible. For example, I limited the following query to just a single day and it still is much slower than what I would expect. In reality I need to do queries that span a few weeks. explain analyze select * from ad_log where date(start_time) < date('2009-03-31') and date(start_time) >= date('2009-03-30'); Bitmap Heap Scan on ad_log (cost=73372.57..3699152.24 rows=2488252 width=32) (actual time=49792.862..64611.255 rows=2268490 loops=1) Recheck Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date)) -> Bitmap Index Scan on ad_log_date_all (cost=0.00..72750.51 rows=2488252 width=0) (actual time=49776.332..49776.332 rows=2268490 loops=1) Index Cond: ((date(start_time) < '2009-03-31'::date) AND (date(start_time) >= '2009-03-30'::date)) Total runtime: 65279.352 ms The definition of the table is: Column |Type | Modifiers +-+- --- ad_log_id | integer | not null default nextval('ad_log_ad_log_id_seq'::regclass) channel| integer | not null player | integer | not null ad | integer | not null start_time | timestamp without time zone | end_time | timestamp without time zone | Indexes: "ad_log_pkey" PRIMARY KEY, btree (ad_log_id) "ad_log_unique" UNIQUE, btree (channel, player, ad, start_time, end_time) "ad_log_ad" btree (ad) "ad_log_ad_date" btree (ad, date(start_time)) "ad_log_channel" btree (channel) "ad_log_channel_date" btree (channel, date(start_time)) "ad_log_date_all" btree (date(start_time), channel, player, ad) "ad_log_player" btree (player) "ad_log_player_date" btree (player, date(start_time)) Foreign-key constraints: "ad_log_ad_fkey" FOREIGN KEY (ad) REFERENCES ads(id) "ad_log_channel_fkey" FOREIGN KEY (channel) REFERENCES channels(id) "ad_log_player_fkey" FOREIGN KEY (player) REFERENCES players_history(id) Triggers: rollup_ad_logs_daily AFTER INSERT ON ad_log FOR EACH ROW EXECUTE PROCEDURE rollup_ad_logs_daily() Any suggestions would be appreciated. --Rainer
[PERFORM] speeding up table creation
I have an interesting performance improvement need. As part of the automatic test suite we run in our development environment, we re-initialize our test database a number of times in order to ensure it is clean before running a test. We currently do this by dropping the public schema and then recreating our tables (roughly 30 tables total). After that we do normal inserts, etc, but never with very much data. My question is, what settings can we tweak to improve performance is this scenario? Specifically, if there was a way to tell Postgres to keep all operations in memory, that would probably be ideal. We actually tried running Postgres off of a RAM disk and this did help a reasonable amount, but we're running under Windows and setting up the RAM disk is a hassle and all of our developers would need to do it. Any tips would be appreciated. --Rainer
[PERFORM] why does this use the wrong index?
I have two identical queries except for the date range. In the first case, with the wider date range, the correct (I believe) index is used. In the second case where the date range is smaller a different index is used and a less efficient plan is chosen. In the second query the problem seems to be CPU resoures; while it is running 1 core of the CPU is 100% busy. Note, if I drop the ad_log_date index then this query is always fast, but some other queries I do require that index. So, What can I do to encourage Postgres to use the first index even when the date range is smaller. # explain analyze SELECT name FROM players AS foo WHERE EXISTS (SELECT 1 FROM ad_log WHERE player = foo.id AND date(start_time) BETWEEN E'2008-09-14' AND E'2008-09-18' LIMIT 1) ORDER BY name; QUERY PLAN - Sort (cost=1573.74..1574.31 rows=230 width=13) (actual time=28.421..28.505 rows=306 loops=1) Sort Key: foo.name Sort Method: quicksort Memory: 28kB -> Seq Scan on players foo (cost=0.00..1564.72 rows=230 width=13) (actual time=0.104..27.876 rows=306 loops=1) Filter: (subplan) SubPlan -> Limit (cost=0.01..3.39 rows=1 width=0) (actual time=0.058..0.058 rows=1 loops=460) -> Index Scan using ad_log_player_date on ad_log (cost=0.01..34571.03 rows=10228 width=0) (actual time=0.056..0.056 rows=1 loops=460) Index Cond: ((player = $0) AND (date(start_time) >= '2008-09-14'::date) AND (date(start_time) <= '2008-09-18'::date)) Total runtime: 28.623 ms (10 rows) # explain analyze SELECT name FROM players AS foo WHERE EXISTS (SELECT 1 FROM ad_log WHERE player = foo.id AND date(start_time) BETWEEN E'2008-09-18' AND E'2008-09-18' LIMIT 1) ORDER BY name; QUERY PLAN - Index Scan using players_name_key on players foo (cost=0.00..8376.84 rows=230 width=13) (actual time=813.695..143452.810 rows=301 loops=1) Filter: (subplan) SubPlan -> Limit (cost=0.01..18.14 rows=1 width=0) (actual time=311.846..311.846 rows=1 loops=460) -> Index Scan using ad_log_date on ad_log (cost=0.01..18.14 rows=1 width=0) (actual time=311.844..311.844 rows=1 loops=460) Index Cond: ((date(start_time) >= '2008-09-18'::date) AND (date(start_time) <= '2008-09-18'::date)) Filter: (player = $0) Total runtime: 143453.100 ms (8 rows) Thanks, --Rainer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] best use of another drive
Thanks. I should have mentioned the existing disk and the new one are already both mirrored (not together, though :-p). So we had 2 drives that were mirrored and just added 2 more that are mirrored. --Rainer -Original Message- From: Joshua Drake [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 09, 2008 1:16 PM To: Scott Marlowe Cc: Rainer Mager; pgsql-performance@postgresql.org Subject: Re: [PERFORM] best use of another drive On Mon, 8 Sep 2008 22:11:25 -0600 "Scott Marlowe" <[EMAIL PROTECTED]> wrote: > On Mon, Sep 8, 2008 at 8:19 PM, Rainer Mager <[EMAIL PROTECTED]> > wrote: > > I've recently installed another drive in my db server and was > > wondering what the best use of it is. Some thoughts I have are: > > Being a DBA, I'd tend to say make it a mirror of the first drive. > +1 -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] best use of another drive
I've recently installed another drive in my db server and was wondering what the best use of it is. Some thoughts I have are: 1. Move some of the databases to the new drive. If this is a good idea, is there a way to do this without a dump/restore? I'd prefer to move the folder if possible since that would be much faster. 2. Move some logs to the new drive. Again, if this is recommended I'd be happy to, but some directions on the right procedures would be appreciated. 3. Other...any other ideas? Thanks, --Rainer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] indexing for distinct search in timestamp based table
Thanks for the suggestion. This seems to work pretty well on 8.3, but not so well on 8.2. We were planning on upgrading to 8.3 soon anyway, we just have to move up our schedule a bit. I think that this type of algorithm would make sense in core. I suspect that being in there some further optimizations could be done that pl/pgsql can't do. --Rainer From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Vladimir Sitnikov Sent: Saturday, September 06, 2008 12:11 AM To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] indexing for distinct search in timestamp based table You might get great improvement for '%' cases using index on channel_name(, start_time) and a little bit of pl/pgsql Basically, you need to implement the following algorithm: 1) curr_ = ( select min() from ad_log ) 2) record_exists = ( select 1 from ad_log where =cur_ and _all_other_conditions limit 1 ) 3) if record_exists==1 then add curr_ to the results 3) curr_ = (select min() from ad_log where > curr_ ) 4) if curr_ is not null then goto 2 I believe it might make sense implement this approach in the core (I would call it "index distinct scan") That could dramatically improve "select distinct from " and "select from group by " kind of queries when there exists an index on and a particular column has very small number of distinct values. For instance: say a table has 10'000'000 rows, while column of interest has only 20 distinct values. In that case, the database will be able to get every of those 20 values in virtually 20 index lookups. What does the community think about that?
Re: [PERFORM] indexing for distinct search in timestamp based table
Thanks for the suggestions. David's assumption is correct in that this is a log table with no updates or deletes. I tried making it CLUSTERED in our test environment, but it didn't seem to make any difference for my query. It did take about 6 hours to do the transformation, so it would be difficult to find the time to do in production, but I'm sure we could work something out if it really looked beneficial. Unfortunately, as I said, initial tests don't seem to indicate any benefit. I believe that my performance difficulty comes from the need for DISTINCT (or GROUP BY) data. That is, the normal start_time index seems fine for limiting the date range, but when I need to select DISTINCT data from the date range it seems that Postgres still needs to scan the entire limited date range. Unfortunately, we support arbitrary date range queries on this table, so I don't think the partitioning idea is an option for us. What I'm playing with now is creating separate tables to hold the channel_name, ad_name, and player_name data with PRIMARY KEY ids. Since there are very few of these compared to the number of rows in the main table, this will give me a quick way to get the DISTINCT values over the entire data set. My problem then will be reducing that to the DISTINCT values for a limited date range. As a side effect bonus of this I expect the database to shrink considerably as these text fields, although not that long (roughly 20 to 50 characters), are certainly longer than a simple foreign key reference. --Rainer From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Scott Carey Sent: Friday, August 29, 2008 8:02 AM To: David Rowley Cc: Rainer Mager; pgsql-performance@postgresql.org Subject: Re: [PERFORM] indexing for distinct search in timestamp based table Another suggestion is to partition the table by date ranges. If most of the range queries occur on particular batches of time, this will make all queries more efficient, and improve locality and efficiency of all indexes on the table. This is more work than simply a table CLUSTER, especially in maintenance overhead, but it will generally help a lot in cases like these. Additionally, if these don't change much after some period of time the tables older than the modification window can be vacuumed, clustered, and reindexed if needed to make them as efficient as possible and maintenance free after that point (other than backups and archives). Another benefit of clustering is in backup / restore. You can incrementally back up only the index partitions that have changed -- for large databases this reduces pg_dump and pg_restore times substantially. To do this you combine regular expressions with the pg_dump "exclude tables" or "include tables" flags. On Thu, Aug 28, 2008 at 3:48 PM, David Rowley <[EMAIL PROTECTED]> wrote: I once also had a similar performance problem when looking for all matching rows between two timestamps. In fact that's why I'm here today. The problem was with MySQL. I had some tables of around 10 million rows and all my searching was timestamp based. MySQL didn't do what I wanted. I found that using a CLUSTERED index with postgresql to be lightning quick. Yet mostly the matching rows I was working with was not much over the 100k mark. I'm wondering if clustering the table on ad_log_start_time will help cut down on random reads. That's if you can afford to block the users while postgresql clusters the table. If you're inserting in order of the start_time column (which I was) then the cluster should almost maintain itself (I think), providing you're not updating or deleting anyway, I'd assume that since it looks like a log table. David. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Rainer Mager Sent: 28 August 2008 09:06 To: pgsql-performance@postgresql.org Subject: [PERFORM] indexing for distinct search in timestamp based table I'm looking for some help in speeding up searches. My table is pretty simple (see below), but somewhat large, and continuously growing. Currently it has about 50 million rows. The table is (I know I have excessive indexes, I'm trying to get the appropriate ones and drop the extras): Table "public.ad_log" Column|Type | Modifiers --+-+--- - ad_log_id| integer | not null default nextval('ad_log_ad_log_id_seq'::regclass) channel_name | text| not null player_name | text| not null ad_name | text| not null start_time | timestamp without time zone | not null end_time | timestamp without time zone | not null Inde
[PERFORM] indexing for distinct search in timestamp based table
I'm looking for some help in speeding up searches. My table is pretty simple (see below), but somewhat large, and continuously growing. Currently it has about 50 million rows. The table is (I know I have excessive indexes, I'm trying to get the appropriate ones and drop the extras): Table "public.ad_log" Column|Type | Modifiers --+-+--- - ad_log_id| integer | not null default nextval('ad_log_ad_log_id_seq'::regclass) channel_name | text| not null player_name | text| not null ad_name | text| not null start_time | timestamp without time zone | not null end_time | timestamp without time zone | not null Indexes: "ad_log_pkey" PRIMARY KEY, btree (ad_log_id) "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name, ad_name, start_time, end_time) "ad_log_ad_and_start" btree (ad_name, start_time) "ad_log_ad_name" btree (ad_name) "ad_log_all" btree (channel_name, player_name, start_time, ad_name) "ad_log_channel_name" btree (channel_name) "ad_log_end_time" btree (end_time) "ad_log_player_and_start" btree (player_name, start_time) "ad_log_player_name" btree (player_name) "ad_log_start_time" btree (start_time) The query I'm trying to speed up is below. In it the tag can be one of channel_name, player_name, or ad_name. I'm actually trying to return the distinct values and I found GROUP BY to be slightly faster than using DISTINCT. Also, any of those fields may be unspecified in the WHERE clauses in which case we use '%', but it seems Postgres optimizes that pretty well. SELECT FROM ad_log WHERE channel_name LIKE :channel_name AND player_name LIKE :player_name AND ad_name LIKE :ad_name AND start_time BETWEEN :start_date AND (date(:end_date) + 1) GROUP BY ORDER BY A typical query is: explain analyze SELECT channel_name FROM ad_log WHERE channel_name LIKE '%' AND ad_name LIKE '%' AND start_time BETWEEN '2008-07-01' AND (date('2008-07-28') + 1) GROUP BY channel_name ORDER BY channel_name; with the result being: QUERY PLAN --- Sort (cost=1163169.02..1163169.03 rows=5 width=10) (actual time=75460.187..75460.192 rows=15 loops=1) Sort Key: channel_name Sort Method: quicksort Memory: 17kB -> HashAggregate (cost=1163168.91..1163168.96 rows=5 width=10) (actual time=75460.107..75460.114 rows=15 loops=1) -> Bitmap Heap Scan on ad_log (cost=285064.30..1129582.84 rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296 loops=1) Recheck Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date)) Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~ '%'::text)) -> Bitmap Index Scan on ad_log_start_time (cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443 rows=13701296 loops=1) Index Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date)) Total runtime: 75460.361 ms It seems to me there should be some way to create an index to speed this up, but the various ones I've tried so far haven't helped. Any suggestions would be greatly appreciated. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance