2010/12/17 Royce Ausburn <ro...@inomial.com>
> Hi all,
>
> I have a table that in the typical case holds two minute sample data for a
> few thousand sources.  Often we need to report on these data for a
> particular source over a particular time period and we're finding this
> query tends to get a bit slow.
>
> The structure of the table:
>
>                        Table "public.sample"
>       Column       |           Type           |                    Modifiers
> -------------------+--------------------------+-------------------------------------------------
>  client            | integer                  | not null
>  aggregateid       | bigint                   | not null
>  sample            | bigint                   | not null default nextval('samplekey'::regclass)
>  customer          | integer                  |
>  period            | integer                  | not null
>  starttime         | integer                  | not null
>  duration          | integer                  | not null
>  ip                | text                     |
>  tariff            | integer                  |
>  bytessentrate     | bigint                   |
>  bytessent         | bigint                   |
>  bytesreceived     | bigint                   |
>  packets           | integer                  | not null
>  queuetype         | integer                  | not null default 0
>  collection        | integer                  |
>  bytesreceivedrate | bigint                   |
>  greatestrate      | bigint                   |
>  invalidated       | timestamp with time zone |
> Indexes:
>     "sample_pkey" PRIMARY KEY, btree (sample)
>     "sample_collection_starttime_idx" btree (collection, starttime)
>     "sample_customer_starttime_idx" btree (customer, starttime)
>     "sample_sample_idx" btree (client, sample)
> Foreign-key constraints:
>     "sample_client_fkey" FOREIGN KEY (client) REFERENCES client(client)
>
>
> fc=# explain analyse select collection, period, tariff, sum(bytesSent),
> sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600 as
> startchunk from sample_20101001 where starttime between 1287493200 and
> 1290171599 and collection=128 and ip = '10.9.125.207' group by
> startchunk, tariff, collection, period;
>                                                                       QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=34959.01..34959.03 rows=1 width=44) (actual time=67047.850..67047.850 rows=0 loops=1)
>    ->  Bitmap Heap Scan on sample_20101001  (cost=130.56..34958.91 rows=5 width=44) (actual time=67047.847..67047.847 rows=0 loops=1)
>          Recheck Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
>          Filter: (ip = '10.9.125.207'::text)
>          ->  Bitmap Index Scan on sample_20101001_collection_starttime_idx  (cost=0.00..130.56 rows=9596 width=0) (actual time=9806.115..9806.115 rows=6830 loops=1)
>                Index Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
>  Total runtime: 67048.201 ms
> (7 rows)

how about (auto)vacuuming?
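A quick way to check -- assuming PostgreSQL 8.3 or later, where
pg_stat_user_tables also reports live/dead tuple counts -- is to look at the
stats for the partition from your plan:

  -- when was this partition last (auto)vacuumed/analyzed, and how many dead tuples does it carry?
  select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
         n_live_tup, n_dead_tup
  from pg_stat_user_tables
  where relname = 'sample_20101001';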
>
> I figure at most there should only be ~20,000 rows to be read from disk,
> and I expect that the index is doing a pretty good job of making sure only
> the rows that need reading are read.  Inclusion of the ip in the query is
> almost redundant as most of the time an ip has its own collection....  My
> suspicion is that the rows that we're interested in are very sparsely
> distributed on disk, so we're having to read too many pages for the
> query...

you can test this suspicion in a very simple way (see also the sketch at the
end of this mail):
- create a test table (like yours, including indexes and constraints, but with no data)
- insert into test select * from yours order by <your clustering key>
- analyze the test table
- test the query on the new table

If the new query is much faster, and if you have intensive random
UPDATE/DELETE/INSERT activity, a periodic CLUSTER could be a good idea... but
it depends on actual usage patterns (SELECT/modify ratio, types of updates,
and so on).

> All of the queries on this table are reporting on a single collection, so
> ideally a collection's data would all be stored in the same part of the
> disk... or at least clumped together.  This can be achieved using
> "cluster", however as far as I know there's no automated, non-cronesque
> means of clustering, and having the table become unusable during the
> cluster is not ideal.

cron is a way of automation, isn't it :-)

> I've considered partitioning, but I don't think that's going to give the
> effect I need.  Apparently partitioning is only going to scale to a few
> dozen child tables, so that's only going to give one order of magnitude of
> performance gain for significant complexity.

regarding partitioning: I guess it starts to make sense around 10M rows or
10 GB in one table.

regarding clustering: it does not help with index bloat.

and finally, you did not specify which PostgreSQL version you are using.

cheers,
Filip
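PS: a rough sketch of the test-table idea above, in case it helps.  It assumes
PostgreSQL 8.3 or later for CREATE TABLE ... (LIKE ... INCLUDING INDEXES), and
it assumes (collection, starttime) as the ordering key -- that choice, and the
sample_test name, are my guesses, not something taken from your setup:

  -- empty copy of the partition with the same defaults and indexes
  -- (foreign keys are not copied, which is fine for a read-only test)
  create table sample_test
    (like sample_20101001 including defaults including indexes);

  -- load it in "clustered" order, matching sample_collection_starttime_idx
  insert into sample_test
    select * from sample_20101001 order by collection, starttime;

  analyze sample_test;

  -- now re-run your EXPLAIN ANALYSE against sample_test and compare timings

  -- if it does help, a periodic CLUSTER keeps the live table in that order
  -- (8.4+ syntax; older versions use CLUSTER indexname ON tablename):
  cluster sample_20101001 using sample_20101001_collection_starttime_idx;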