Re: [PERFORM] Auto-clustering?

2010-12-18 Thread Royce Ausburn

On 17/12/2010, at 8:27 PM, Filip Rembiałkowski wrote:

 
 2010/12/17 Royce Ausburn ro...@inomial.com
 Hi all,
 
 I have a table that in the typical case holds two-minute sample data for a 
 few thousand sources.  Often we need to report on these data for a particular 
 source over a particular time period and we're finding this query tends to 
 get a bit slow.
 
 
 how about (auto)vacuuming?

A key piece of information I left out: we almost never update rows in this 
table.

  
 
 I figure at most there should only be ~20,000 rows to be read from disk, and 
 I expect that the index is doing a pretty good job of making sure only the 
 rows that need reading are read. Inclusion of the ip in the query is almost 
 redundant, as most of the time an ip has its own collection. My suspicion 
 is that the rows that we're interested in are very sparsely distributed on 
 disk, so we're having to read too many pages for the query...
 
 
 you can test this suspicion in a very simple way:
 - create a test table (like yours, including indexes and constraints, but 
 with no data)
 - insert into test select * from yours order by (your clustering columns)
 - analyze the test table
 - test the query on the new table
 
 If the new query is much faster, and if you have intensive random UPD/DEL/INS 
 activity, periodic CLUSTER could be a good idea... 
 but it depends on actual usage patterns (SELECT/modify ratio, types of 
 updates, and so on).

Good idea!  This vastly improves query times.
 
 
 and finally, you did not specify which PostgreSQL version you are using.

In the case I've been working with it's 8.1 =(  But we have a few instances of 
this database... I believe the rest are a mixture of 8.4s and they all have the 
same problem.

--Royce

Re: [PERFORM] Auto-clustering?

2010-12-18 Thread Royce Ausburn

On 17/12/2010, at 9:20 PM, Pierre C wrote:

 
 fc=# explain  analyse select collection, period, tariff, sum(bytesSent), 
 sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600 as 
 startchunk from sample_20101001 where starttime between 1287493200 and 
 1290171599  and collection=128 and ip = '10.9.125.207' group by 
 startchunk, tariff, collection, period;
 
 If CLUSTER locks bother you, and you don't do UPDATEs, you might consider 
 doing something like this:
 
 - accumulate the rows in a recent table
 - every hour, INSERT INTO archive SELECT * FROM recent ORDER BY (your cluster 
 fields)
 - DELETE FROM recent the rows you just inserted
 - VACUUM recent
 
 The cluster in your archive table will not be perfect but at least all rows 
 from 1 source in 1 hour will be stored close together. But clustering doesn't 
 need to be perfect either; if you get 100x better locality, that's already 
 good!

That's a really decent idea and can slot in perfectly well with how the 
application already works!  We have existing DBAO code that handles monthly 
tables; it'll happily pop data into a recent table. In fact we can probably 
tolerate having a today table.  Thanks!

--Royce





[PERFORM] Auto-clustering?

2010-12-17 Thread Royce Ausburn
Hi all,

I have a table that in the typical case holds two-minute sample data for a few 
thousand sources.  Often we need to report on these data for a particular 
source over a particular time period and we're finding this query tends to get 
a bit slow.

The structure of the table:

              Table "public.sample"
      Column       |           Type           |                    Modifiers
-------------------+--------------------------+--------------------------------------------------
 client            | integer                  | not null
 aggregateid       | bigint                   | not null
 sample            | bigint                   | not null default nextval('samplekey'::regclass)
 customer          | integer                  | 
 period            | integer                  | not null
 starttime         | integer                  | not null
 duration          | integer                  | not null
 ip                | text                     | 
 tariff            | integer                  | 
 bytessentrate     | bigint                   | 
 bytessent         | bigint                   | 
 bytesreceived     | bigint                   | 
 packets           | integer                  | not null
 queuetype         | integer                  | not null default 0
 collection        | integer                  | 
 bytesreceivedrate | bigint                   | 
 greatestrate      | bigint                   | 
 invalidated       | timestamp with time zone | 
Indexes:
    "sample_pkey" PRIMARY KEY, btree (sample)
    "sample_collection_starttime_idx" btree (collection, starttime)
    "sample_customer_starttime_idx" btree (customer, starttime)
    "sample_sample_idx" btree (client, sample)
Foreign-key constraints:
    "sample_client_fkey" FOREIGN KEY (client) REFERENCES client(client)


fc=# explain  analyse select collection, period, tariff, sum(bytesSent), 
sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600 as 
startchunk from sample_20101001 where starttime between 1287493200 and 
1290171599  and collection=128 and ip = '10.9.125.207' group by startchunk, 
tariff, collection, period; 
                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=34959.01..34959.03 rows=1 width=44) (actual time=67047.850..67047.850 rows=0 loops=1)
   ->  Bitmap Heap Scan on sample_20101001  (cost=130.56..34958.91 rows=5 width=44) (actual time=67047.847..67047.847 rows=0 loops=1)
         Recheck Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
         Filter: (ip = '10.9.125.207'::text)
         ->  Bitmap Index Scan on sample_20101001_collection_starttime_idx  (cost=0.00..130.56 rows=9596 width=0) (actual time=9806.115..9806.115 rows=6830 loops=1)
               Index Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
 Total runtime: 67048.201 ms
(7 rows)


I figure at most there should only be ~20,000 rows to be read from disk, and I 
expect that the index is doing a pretty good job of making sure only the rows 
that need reading are read. Inclusion of the ip in the query is almost 
redundant, as most of the time an ip has its own collection. My suspicion is 
that the rows that we're interested in are very sparsely distributed on disk, 
so we're having to read too many pages for the query...

All of the queries on this table are reporting on a single collection, so 
ideally a collection's data would all be stored in the same part of the disk... 
or at least clumped together.  This can be achieved using CLUSTER; however, as 
far as I know there's no automated, non-cronesque means of clustering, and 
having the table become unusable during the CLUSTER is not ideal.

I've considered partitioning, but I don't think that's going to give the effect 
I need.  Apparently partitioning is only going to scale to a few dozen child 
tables, so that's only going to give one order of magnitude of performance gain 
for significant complexity.

Are there any other options?

Cheers!

--Royce



Re: [PERFORM] Auto-clustering?

2010-12-17 Thread Filip Rembiałkowski
2010/12/17 Royce Ausburn ro...@inomial.com

 Hi all,

 I have a table that in the typical case holds two-minute sample data for a
 few thousand sources.  Often we need to report on these data for a
 particular source over a particular time period and we're finding this query
 tends to get a bit slow.

 The structure of the table:

               Table "public.sample"
       Column       |           Type           |                    Modifiers
 -------------------+--------------------------+--------------------------------------------------
  client            | integer                  | not null
  aggregateid       | bigint                   | not null
  sample            | bigint                   | not null default nextval('samplekey'::regclass)
  customer          | integer                  | 
  period            | integer                  | not null
  starttime         | integer                  | not null
  duration          | integer                  | not null
  ip                | text                     | 
  tariff            | integer                  | 
  bytessentrate     | bigint                   | 
  bytessent         | bigint                   | 
  bytesreceived     | bigint                   | 
  packets           | integer                  | not null
  queuetype         | integer                  | not null default 0
  collection        | integer                  | 
  bytesreceivedrate | bigint                   | 
  greatestrate      | bigint                   | 
  invalidated       | timestamp with time zone | 
 Indexes:
     "sample_pkey" PRIMARY KEY, btree (sample)
     "sample_collection_starttime_idx" btree (collection, starttime)
     "sample_customer_starttime_idx" btree (customer, starttime)
     "sample_sample_idx" btree (client, sample)
 Foreign-key constraints:
     "sample_client_fkey" FOREIGN KEY (client) REFERENCES client(client)


 fc=# explain  analyse select collection, period, tariff, sum(bytesSent),
 sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600 as
 startchunk from sample_20101001 where starttime between 1287493200 and
 1290171599  and collection=128 and ip = '10.9.125.207' group by
 startchunk, tariff, collection, period;
 QUERY PLAN
 ------------------------------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=34959.01..34959.03 rows=1 width=44) (actual time=67047.850..67047.850 rows=0 loops=1)
    ->  Bitmap Heap Scan on sample_20101001  (cost=130.56..34958.91 rows=5 width=44) (actual time=67047.847..67047.847 rows=0 loops=1)
          Recheck Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
          Filter: (ip = '10.9.125.207'::text)
          ->  Bitmap Index Scan on sample_20101001_collection_starttime_idx  (cost=0.00..130.56 rows=9596 width=0) (actual time=9806.115..9806.115 rows=6830 loops=1)
                Index Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
  Total runtime: 67048.201 ms
 (7 rows)


how about (auto)vacuuming?



 I figure at most there should only be ~20,000 rows to be read from disk,
 and I expect that the index is doing a pretty good job of making sure only
 the rows that need reading are read. Inclusion of the ip in the query is
 almost redundant, as most of the time an ip has its own collection. My
 suspicion is that the rows that we're interested in are very sparsely
 distributed on disk, so we're having to read too many pages for the query...



you can test this suspicion in a very simple way (see the sketch after this
list):
- create a test table (like yours, including indexes and constraints, but
with no data)
- insert into test select * from yours order by (your clustering columns)
- analyze the test table
- test the query on the new table
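
Concretely, against the sample table from the original post the test could look
like this (ordering by collection, starttime is an assumption based on the
reporting query; the LIKE ... INCLUDING options are newer than 8.1, so on old
servers create the indexes by hand):

    create table sample_test
        (like sample including defaults including constraints including indexes);
    insert into sample_test select * from sample order by collection, starttime;
    analyze sample_test;
    -- now re-run the reporting query against sample_test and compare timings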

If the new query is much faster, and if you have intensive random UPD/DEL/INS
activity, periodic CLUSTER could be a good idea...
but it depends on actual usage patterns (SELECT/modify ratio, types of
updates, and so on).




 All of the queries on this table are reporting on a single collection, so
 ideally a collection's data would all be stored in the same part of the
 disk... or at least clumped together.  This can be achieved using CLUSTER;
 however, as far as I know there's no automated, non-cronesque means of
 clustering, and having the table become unusable during the CLUSTER is not
 ideal.


cron is a way of automation, isn't it :-)





 I've considered partitioning, but I don't think that's going to give the
 effect I need.  Apparently partitioning is only going to scale to a few dozen
 child tables, so that's only going to give one order of magnitude of
 performance gain for significant complexity.




regarding partitioning: I guess it starts to make sense at around 10M rows or
10 GB in one table.

regarding clustering: it does not help with index bloat.

and finally, you did not specify which PostgreSQL version you are using.

Re: [PERFORM] Auto-clustering?

2010-12-17 Thread Marti Raudsepp
2010/12/17 Filip Rembiałkowski filip.rembialkow...@gmail.com:
 regarding clustering: it does not help with index bloat.

I'm almost sure it does; CLUSTER re-creates all indexes from scratch
after copying the tuples.

Regards,
Marti



Re: [PERFORM] Auto-clustering?

2010-12-17 Thread Filip Rembiałkowski
you are right, I must have missed it...

          Table "public.u"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 id     | integer                     | 
 t      | timestamp without time zone | 
 d      | text                        | 
Indexes:
    "u_d" btree (d)
    "u_id" btree (id)
    "u_t" btree (t)

fi...@filip=# select oid, relname, pg_relation_size(oid) from pg_class where
relname in ('u','u_id','u_t','u_d');
  oid  | relname | pg_relation_size
-------+---------+------------------
 64283 | u       |         15187968
 64289 | u_id    |          6758400
 64290 | u_t     |          6086656
 64291 | u_d     |         16482304
(4 rows)

fi...@filip=# CLUSTER u USING u_t;
CLUSTER
fi...@filip=# select oid, relname, pg_relation_size(oid) from pg_class where
relname in ('u','u_id','u_t','u_d');
  oid  | relname | pg_relation_size
-------+---------+------------------
 64283 | u       |         12115968
 64289 | u_id    |          3391488
 64290 | u_t     |          3391488
 64291 | u_d     |          8216576
(4 rows)


So CLUSTER is effectively CLUSTER + REINDEX... nice.


On 17 December 2010 at 10:41, Marti Raudsepp ma...@juffo.org wrote:

 2010/12/17 Filip Rembiałkowski filip.rembialkow...@gmail.com:
  regarding clustering: it does not help with index bloat.

 I'm almost sure it does; CLUSTER re-creates all indexes from scratch
 after copying the tuples.

 Regards,
 Marti



Re: [PERFORM] Auto-clustering?

2010-12-17 Thread Pierre C


fc=# explain  analyse select collection, period, tariff, sum(bytesSent),  
sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600  
as startchunk from sample_20101001 where starttime between 1287493200  
and 1290171599  and collection=128 and ip = '10.9.125.207' group by  
startchunk, tariff, collection,  
period;


If CLUSTER locks bother you, and you don't do UPDATEs, you might consider  
doing something like this:


- accumulate the rows in a recent table
- every hour, INSERT INTO archive SELECT * FROM recent ORDER BY (your  
cluster fields)

- DELETE FROM recent the rows you just inserted
- VACUUM recent

The cluster in your archive table will not be perfect but at least all  
rows from 1 source in 1 hour will be stored close together. But clustering  
doesn't need to be perfect either, if you get 100x better locality, that's  
already good !
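
Concretely, that hourly job might look like the sketch below (recent and  
archive are the table names suggested above; using the sequence-backed sample  
column as a high-water mark, and collection, starttime as the cluster fields,  
are assumptions):

    BEGIN;
    -- remember the newest row to be moved, so concurrent inserts are left alone
    CREATE TEMP TABLE hiwater AS SELECT max(sample) AS s FROM recent;
    INSERT INTO archive
        SELECT * FROM recent
        WHERE sample <= (SELECT s FROM hiwater)
        ORDER BY collection, starttime;       -- your cluster fields
    DELETE FROM recent WHERE sample <= (SELECT s FROM hiwater);
    COMMIT;
    VACUUM recent;      -- VACUUM cannot run inside the transaction
    DROP TABLE hiwater;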


Now, if you have a huge amount of data but never query it with a precision  
exceeding 1 hour, you might consider creating an aggregate table where, at  
the end of every hour, you only store sum(), min(), max() of the last  
hour's data, grouped by the fields you want. You could also  
use a trigger, but that would generate a huge amount of UPDATEs.


For the above query you'd do:

INSERT INTO stats_by_hour (columns...) SELECT
collection, ip, period, tariff, sum(bytesSent),
sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600
as startchunk from sample_20101001 WHERE starttime > some value
GROUP BY collection, ip, period, tariff, startchunk

Then you can run aggregates against this much smaller table instead.
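
For the report above, that query could look like this (the stats_by_hour column
names are assumed to mirror the INSERT):

    SELECT collection, period, tariff, startchunk,
           sum(bytessent), sum(bytesreceived), sum(packets), max(sample)
    FROM stats_by_hour
    WHERE startchunk BETWEEN 1287493200 AND 1290171599
      AND collection = 128 AND ip = '10.9.125.207'
    GROUP BY startchunk, tariff, collection, period;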



Re: [PERFORM] Auto-clustering?

2010-12-17 Thread phb07


Royce Ausburn wrote:
All of the queries on this table are reporting on a single collection, so 
ideally a collection's data would all be stored in the same part of the disk... 
or at least clumped together.  This can be achieved using CLUSTER; however, as 
far as I know there's no automated, non-cronesque means of clustering, and 
having the table become unusable during the CLUSTER is not ideal.

If the lock level used by CLUSTER is a problem for you, you could 
consider the pg_reorg contrib module. AFAIK, it does similar work to CLUSTER, 
but allows concurrent read and write activity on the table.


Regards. Philippe.

