On Wed, 2004-12-22 at 21:02, Sean Chittenden wrote:
> > explain analyze SELECT distinct ip FROM rawdata WHERE sid='2';
> 
> <puts_on_database_consulting_hat/>
> 
> The selectivity of your qual is terrible.  sid=2::SMALLINT is returning 
> half of the table as a potential match!  It should be no surprise then 
> that the planner decided to do a sequential scan.  You have a *lot* of 
> duplicate IP addresses in that table.

I know it's terrible..  It's actually horrible, because sid=2 equals
100% of the table at this time.

> Look at the estimated rows (566130) and actual rows (2232934) for the 
> sid qual.  You could potentially improve the estimates by increasing 
> the stats target for your sid column.

Yup, 2.2mill records is correct. I looked at the wrong figure.

> ALTER TABLE tbl ALTER COLUMN sid SET STATISTICS 20;
> Assuming you're at the default of 10.  Increase them by 2x at a time 
> until you get something that gives you a good estimate.  But you'll 
> likely never get the planner to use anything but a sequential scan 
> given the number of rows it has to examine.

This corrected the estimated rows count. But no change in planning/time.

> You'll likely need to use a stored procedure and shovel off IPs to a 
> different table that uniquifies the IPs that way the database only has 
> to go over a small number of IPs that are unique instead of a large 
> number of IPs that contains duplicates.  It'd be faster potentially for 
> SELECTs if you're doing this frequently, but you're essentially asking 
> for a sequential scan given your data set.

Damn, I had hoped for some kind of trick to use an unique index and
extract the data from there.

> Hope that helps.  If you'd like more assistance than what's above, send 
> me an email off list.  -sc

Well, thanks a lot so far. I'll go back to redesigning the database now.



Some extra background info:
The database is supposed to record every bytecount of internet traffic
from each server every 10min.

Currently only 1 of our 3 firewalls is enabled to report usage, thus
the sid=2 problem. So today we insert about 2-3000 records per hour,
that will likely atleast double soon. The retention time is 5 years!
That gives me an estimate of about 100-150million records..

It's kind of an extreme database in my opinion, and might give me
problems that will be similar to dbmail problems on big systems.

I guess I'll separate the IP out of the table, but unfortunately it
still will be slow on other queries.

Maybe I'll go learn stored procedures some day..

-HK

Reply via email to