2010/12/17 Royce Ausburn <ro...@inomial.com>

> Hi all,
>
> I have a table that in the typical case holds two minute sample data for a
> few thousand sources.  Often we need to report on these data for a
> particular source over a particular time period and we're finding this query
> tends to get a bit slow.
>
> The structure of the table:
>
>                                      Table "public.sample"
>       Column       |           Type           |
>  Modifiers
>
> -------------------+--------------------------+-------------------------------------------------
>  client            | integer                  | not null
>  aggregateid       | bigint                   | not null
>  sample            | bigint                   | not null default
> nextval('samplekey'::regclass)
>  customer          | integer                  |
>  period            | integer                  | not null
>  starttime         | integer                  | not null
>  duration          | integer                  | not null
>  ip                | text                     |
>  tariff            | integer                  |
>  bytessentrate     | bigint                   |
>  bytessent         | bigint                   |
>  bytesreceived     | bigint                   |
>  packets           | integer                  | not null
>  queuetype         | integer                  | not null default 0
>  collection        | integer                  |
>  bytesreceivedrate | bigint                   |
>  greatestrate      | bigint                   |
>  invalidated       | timestamp with time zone |
> Indexes:
>     "sample_pkey" PRIMARY KEY, btree (sample)
>     "sample_collection_starttime_idx" btree (collection, starttime)
>     "sample_customer_starttime_idx" btree (customer, starttime)
>     "sample_sample_idx" btree (client, sample)
> Foreign-key constraints:
>     "sample_client_fkey" FOREIGN KEY (client) REFERENCES client(client)
>
>
> fc=# explain  analyse select collection, period, tariff, sum(bytesSent),
> sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600 as
> startchunk from sample_20101001 where starttime between 1287493200 and
> 1290171599  and collection=128    and ip = '10.9.125.207' group by
> startchunk, tariff, collection, period;
>                                         QUERY PLAN
>
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=34959.01..34959.03 rows=1 width=44) (actual
> time=67047.850..67047.850 rows=0 loops=1)
>    ->  Bitmap Heap Scan on sample_20101001  (cost=130.56..34958.91 rows=5
> width=44) (actual time=67047.847..67047.847 rows=0 loops=1)
>          Recheck Cond: ((collection = 128) AND (starttime >= 1287493200)
> AND (starttime <= 1290171599))
>          Filter: (ip = '10.9.125.207'::text)
>          ->  Bitmap Index Scan on sample_20101001_collection_starttime_idx
>  (cost=0.00..130.56 rows=9596 width=0) (actual time=9806.115..9806.115
> rows=6830 loops=1)
>                Index Cond: ((collection = 128) AND (starttime >=
> 1287493200) AND (starttime <= 1290171599))
>  Total runtime: 67048.201 ms
> (7 rows)
>
>
how about (auto)vacuuming?
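
As a quick sanity check (assuming 8.3 or later, where pg_stat_user_tables has
the dead/live tuple counters; the table name is taken from your EXPLAIN):

select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup
from   pg_stat_user_tables
where  relname = 'sample_20101001';

A large n_dead_tup relative to n_live_tup would suggest vacuuming is not
keeping up with this table.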


>
> I figure at most there should only be ~20,000 rows to be read from disk,
> and I expect that the index is doing a pretty good job of making sure only
> the rows that need reading are read. inclusion of the ip in the query is
> almost redundant as most of the time an ip has its own collection....  My
> suspicion is that the rows that we're interested in are very sparsely
> distributed on disk, so we're having to read too many pages for the query...
>


you can test this suspicion in a very simple way:
- create a test table (like yours, including indexes and constraints, but
with no data)
- insert into test select * from yours, ordered by collection and starttime
(the order your reports read the data in)
- analyze the test table
- run the query against the new table
(a sketch of these steps in SQL follows below)
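
Roughly, in SQL (just a sketch; I'm assuming the partition name
sample_20101001 from your EXPLAIN and an ordering of (collection, starttime)
to match your reporting queries; CREATE TABLE ... LIKE ... INCLUDING INDEXES
needs 8.3 or later, on older versions create the indexes by hand):

-- LIKE copies column definitions, defaults, CHECK constraints and indexes,
-- but not foreign keys; that doesn't matter for this test.
create table sample_test (like sample_20101001
    including defaults including constraints including indexes);

-- rewrite the data in the physical order the reports read it in
insert into sample_test
select * from sample_20101001
order by collection, starttime;

analyze sample_test;

-- then re-run your query against the test table
explain analyse
select collection, period, tariff, sum(bytessent), sum(bytesreceived),
       sum(packets), max(sample), (starttime / 3600) * 3600 as startchunk
from   sample_test
where  starttime between 1287493200 and 1290171599
  and  collection = 128
  and  ip = '10.9.125.207'
group by startchunk, tariff, collection, period;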

If the new query is much faster, and if you have intensive random
UPDATE/DELETE/INSERT activity, a periodic CLUSTER could be a good idea...
but it depends on actual usage patterns (SELECT/modify ratio, types of
updates, and so on).
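
For the record, the CLUSTER syntax (the USING form is 8.3 and later, the
older form works before that):

-- 8.3 and later
cluster sample_20101001 using sample_collection_starttime_idx;
-- pre-8.3
cluster sample_collection_starttime_idx on sample_20101001;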



>
> All of the queries on this table are reporting on a single collection, so
> ideally a collection's data would all be stored in the same part of the
> disk... or at least clumped together.  This can be achieved using "cluster",
> however as far as I know there's no automated, non-cronesque means of
> clustering and having the table become unusable during the cluster is not
> ideal.
>

cron is a way of automation, isn't it :-)
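
For example, a crontab entry along these lines (purely illustrative: I'm
guessing the database name "fc" from your psql prompt and a quiet Sunday
03:00 window; adjust to taste):

# weekly re-cluster of the current sample partition
0 3 * * 0  psql -d fc -c "CLUSTER sample_20101001 USING sample_collection_starttime_idx;"

The exclusive lock you mention still applies for the duration, of course.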



>
>
> I've considered partitioning, but I don't think that's going to give the
> effect I need.  Apparently partitioning is only going to scale to a few
> dozen child tables, so that's only going to give one order of magnitude of
> performance improvement for significant complexity.
>
>


regarding partitioning: I guess it starts to make sense around 10M rows or
10 GB in one table.
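
You can check where you stand with something like this
(pg_total_relation_size counts indexes and TOAST as well):

select pg_size_pretty(pg_total_relation_size('sample_20101001'));
select reltuples::bigint as approx_rows
from   pg_class where relname = 'sample_20101001';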

regarding clustering: it does not help with index bloat.

and finally, you did not specify which PostgreSQL version you are using.
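
You can get it with:

select version();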


cheers,
Filip
