On Wed, 2004-12-22 at 21:02, Sean Chittenden wrote:

> > explain analyze SELECT distinct ip FROM rawdata WHERE sid='2';
>
> <puts_on_database_consulting_hat/>
>
> The selectivity of your qual is terrible. sid=2::SMALLINT is returning
> half of the table as a potential match! It should be no surprise then
> that the planner decided to do a sequential scan. You have a *lot* of
> duplicate IP addresses in that table.
I know it's terrible.. It's actually horrible, because sid=2 matches 100%
of the table at this time.

> Look at the estimated rows (566130) and actual rows (2232934) for the
> sid qual. You could potentially improve the estimates by increasing
> the stats target for your sid column.

Yup, 2.2 million records is correct. I looked at the wrong figure.

> ALTER TABLE tbl ALTER COLUMN sid SET STATISTICS 20;
>
> Assuming you're at the default of 10. Increase them by 2x at a time
> until you get something that gives you a good estimate. But you'll
> likely never get the planner to use anything but a sequential scan
> given the number of rows it has to examine.

This corrected the estimated row count, but there was no change in the
plan or the timing.

> You'll likely need to use a stored procedure and shovel off IPs to a
> different table that uniquifies the IPs that way the database only has
> to go over a small number of IPs that are unique instead of a large
> number of IPs that contains duplicates. It'd be faster potentially for
> SELECTs if you're doing this frequently, but you're essentially asking
> for a sequential scan given your data set.

Damn, I had hoped for some kind of trick to use a unique index and
extract the data from there.

> Hope that helps. If you'd like more assistance than what's above, send
> me an email off list. -sc

Well, thanks a lot so far. I'll go back to redesigning the database now.

Some extra background info: the database is supposed to record the byte
count of internet traffic from each server every 10 minutes. Currently
only 1 of our 3 firewalls is enabled to report usage, hence the sid=2
problem. So today we insert about 2-3000 records per hour, and that will
likely at least double soon. The retention time is 5 years! That gives me
an estimate of about 100-150 million records.. It's kind of an extreme
database in my opinion, and might give me problems similar to the ones
dbmail has on big systems.
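For anyone following along, Sean's stats-target suggestion can be applied and verified roughly like this (a sketch: `rawdata` and `sid` are from the thread, the target value 40 is just illustrative):

```sql
-- Keep more histogram/most-common-value entries for rawdata.sid
-- (the default target is 10); higher targets give better row estimates.
ALTER TABLE rawdata ALTER COLUMN sid SET STATISTICS 40;

-- The new target only takes effect after the table is re-analyzed.
ANALYZE rawdata;

-- Then re-check the planner's estimated rows against the actual rows.
EXPLAIN ANALYZE SELECT DISTINCT ip FROM rawdata WHERE sid = 2;
```

As Sean notes, this only fixes the estimate; with a qual matching most of the table, the planner will still correctly pick a sequential scan.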
I guess I'll separate the IPs out into their own table, but unfortunately other queries will still be slow. Maybe I'll go learn stored procedures some day.. -HK
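Sean's "shovel IPs off to a uniquifying table" idea could be sketched with a trigger instead of a stored procedure you call by hand. Table, function, and trigger names below are hypothetical, and this assumes PL/pgSQL is installed (the quoted-string function body is the pre-8.0 style):

```sql
-- Hypothetical side table holding each IP exactly once.
CREATE TABLE unique_ips (ip inet PRIMARY KEY);

-- On every insert into rawdata, record the IP unless it is already known.
-- (Concurrent inserts of the same new IP could still race on the primary
-- key; a sketch, not a hardened implementation.)
CREATE FUNCTION record_ip() RETURNS trigger AS '
BEGIN
    PERFORM 1 FROM unique_ips WHERE ip = NEW.ip;
    IF NOT FOUND THEN
        INSERT INTO unique_ips (ip) VALUES (NEW.ip);
    END IF;
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER rawdata_record_ip
    BEFORE INSERT ON rawdata
    FOR EACH ROW EXECUTE PROCEDURE record_ip();

-- The expensive DISTINCT then becomes a scan of the small table:
SELECT ip FROM unique_ips;
```

The per-insert lookup cost is small next to a sequential scan over 100+ million rows at query time.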