The selectivity of your qual is terrible. sid=2::SMALLINT is returning
half of the table as a potential match!  It should be no surprise then
that the planner decided to do a sequential scan.  You have a *lot* of
duplicate IP addresses in that table.

I know it's terrible... It's actually horrible, because sid=2 matches
100% of the table at this time.

Ugh, this is a worst-case scenario and any RDBMS would fall over here (i.e., this isn't something that's particular to PostgreSQL).

Look at the estimated rows (566130) and actual rows (2232934) for the
sid qual.  You could potentially improve the estimates by increasing
the stats target for your sid column.

Yup, 2.2 million records is correct. I looked at the wrong figure.

:) np, it's easy to do as there are lots of numbers in an EXPLAIN ANALYZE.

ALTER TABLE tbl ALTER COLUMN sid SET STATISTICS 20;
That assumes you're at the default of 10. Increase it by 2x at a time
until you get something that gives you a good estimate, but you'll
likely never get the planner to use anything but a sequential scan
given the number of rows it has to examine.
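One thing the one-liner above leaves out: the new target doesn't kick in until the column is re-analyzed. The check cycle looks roughly like this (a sketch only; I'm using the rawdata table name from the example further down and guessing at the shape of your query):

ANALYZE rawdata;  -- recompute the column statistics with the new target
EXPLAIN ANALYZE SELECT DISTINCT ip FROM rawdata WHERE sid = 2::SMALLINT;  -- compare estimated vs. actual rows again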

This corrected the estimated row count, but there was no change in the plan or execution time.

Figured it'd fix the estimate, but there's no way to get it to run any faster when it has to examine 2.2M rows to uniquify the data.

You'll likely need to use a stored procedure and shovel the IPs off to
a different table that keeps them unique. That way the database only
has to go over a small number of unique IPs instead of a large number
of IPs that contains duplicates. It'd potentially be faster for
SELECTs if you're doing this frequently, but you're essentially asking
for a sequential scan given your data set.

Damn, I had hoped for some kind of trick to use a unique index and
extract the data from there.

Well, there is a trick for this.... Here's the necessary goo to fix your problem so it's *FAST*. This comes with no warranty, etc., etc.... unless you want me to do some consulting to actually do this and certify my work. That said... here's how you solve your problem so you've got a lightning-fast set of unique IPs (I haven't attached it to an sid, but I'll leave that as an exercise for the reader). The following assumes you have plpgsql installed as a language (ex: `createlang plpgsql`):

CREATE TABLE rawdata_unique_ip (
  id SERIAL NOT NULL,
  ip INET NOT NULL,
  PRIMARY KEY(id)
);
-- One row per distinct IP; the unique index is what keeps lookups cheap
CREATE UNIQUE INDEX rawdata_unique_ip_udx ON rawdata_unique_ip (ip);
-- Seed the table with the IPs already sitting in rawdata
INSERT INTO rawdata_unique_ip (ip) SELECT DISTINCT ip FROM rawdata;
ANALYZE rawdata_unique_ip;
-- Keep rawdata honest: every IP it stores must exist in the unique table
ALTER TABLE rawdata ADD FOREIGN KEY(ip) REFERENCES rawdata_unique_ip(ip) MATCH FULL ON UPDATE CASCADE ON DELETE CASCADE;

CREATE FUNCTION tbl_trg_ins() RETURNS TRIGGER AS 'BEGIN
        -- Add the IP to the unique table only if it is not already there
        PERFORM TRUE FROM rawdata_unique_ip AS uip WHERE uip.ip = NEW.ip;
        IF NOT FOUND THEN
                INSERT INTO rawdata_unique_ip (ip) VALUES (NEW.ip);
        END IF;
        RETURN NEW;
END;' LANGUAGE 'plpgsql';
CREATE TRIGGER tbl_trg_ins BEFORE INSERT ON rawdata FOR EACH ROW EXECUTE PROCEDURE tbl_trg_ins();

Now instead of SELECTing out of rawdata, SELECT your IPs out of the rawdata_unique_ip table. I hope the above makes enough sense that you can follow it and make it a reality. Not to suggest that you have design issues, but the above at least gives you a boost on your current design w/o a serious redesign. An INT4 is only 4 bytes on disk too, whereas an INET is 12 bytes, so you could also save some space if that's an issue, though I suspect not.
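For instance, in sketch form (remember the unique table isn't tied to an sid here, so it simply holds every distinct IP):

-- old: sequential scan plus uniquify over ~2.2M rows
SELECT DISTINCT ip FROM rawdata WHERE sid = 2::SMALLINT;
-- new: the trigger already maintains one row per IP
SELECT ip FROM rawdata_unique_ip;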

[snip]
Currently only 1 of our 3 firewalls is enabled to report usage, thus
the sid=2 problem. So today we insert about 2,000-3,000 records per hour,
and that will likely at least double soon. The retention time is 5 years!
That gives me an estimate of about 100-150 million records.
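(Back-of-the-envelope: 2,500 records/hour * 24 * 365 * 5 comes out to roughly 110 million rows, so 100-150 million is the right ballpark at today's rate, before the doubling.)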

The good news is that's very doable. The biggest Pg database I've seen has tens of billions of rows and it's rumored that the largest Pg database has trillions of rows... you'll be in very good company.

It's kind of an extreme database in my opinion, and it might give me
problems similar to the ones dbmail has on big systems.

Well... dbmail has its own set of issues that shouldn't exist, but do because of the desire to retain support for MySQL (*pukes*). dbmail could easily be 5x faster than it currently is if done right, and I firmly believe that a Pg+dbmail version would be faster than a MySQL+dbmail version... but that's another rant altogether that's not appropriate for a public mailing list.

Maybe I'll go learn stored procedures some day.

Take the boost I gave you above and run w/ it... pl languages aren't too hard to use.... remember, SQL was designed so accountants could use it. -sc

--
Sean Chittenden
