The selectivity of your qual is terrible. sid=2::SMALLINT is returning
half of the table as a potential match!  It should be no surprise then
that the planner decided to do a sequential scan.  You have a *lot* of
duplicate IP addresses in that table.

I know it's terrible... It's actually horrible, because sid=2 matches
100% of the table at this time.

Ugh, this is a worst-case scenario and any RDBMS would fall over here (i.e., this isn't something that's particular to PostgreSQL).

Look at the estimated rows (566130) and actual rows (2232934) for the
sid qual.  You could potentially improve the estimates by increasing
the stats target for your sid column.

Yup, 2.2 million records is correct. I looked at the wrong figure.

:) np, it's easy to do as there are lots of numbers in an EXPLAIN ANALYZE.

ALTER TABLE tbl ALTER COLUMN sid SET STATISTICS 20;
That assumes you're at the default of 10. Increase it by 2x at a time
until you get something that gives you a good estimate, but you'll
likely never get the planner to use anything but a sequential scan
given the number of rows it has to examine.
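One thing the one-liner above leaves out: the new target doesn't kick in until the column is re-analyzed. The check cycle looks roughly like this (a sketch only; I'm using the rawdata table name from the example further down and guessing at the shape of your query):

ANALYZE rawdata;  -- recompute the column statistics with the new target
EXPLAIN ANALYZE SELECT DISTINCT ip FROM rawdata WHERE sid = 2::SMALLINT;  -- compare estimated vs. actual rows again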

This corrected the estimated row count, but there was no change in the plan or execution time.

Figured it'd fix the estimate, but there's no way to get it to run any faster when it has to examine 2.2M rows to uniquify the data.

You'll likely need to use a stored procedure and shovel the IPs off to
a different table that keeps them unique. That way the database only
has to go over a small number of unique IPs instead of a large number
of IPs that contains duplicates. It'd potentially be faster for
SELECTs if you're doing this frequently, but you're essentially asking
for a sequential scan given your data set.

Damn, I had hoped for some kind of trick to use a unique index and
extract the data from there.

Well, there is a trick for this.... Here's the necessary goo to fix your problem so it's *FAST*. This comes with no warranty, etc., etc.... unless you want me to do some consulting to actually do this and certify my work. That said... here's how you solve your problem so you've got a lightning-fast set of unique IPs (I haven't attached it to an sid, but I'll leave that as an exercise for the reader). The following assumes you have plpgsql installed as a language (ex: `createlang plpgsql`):

CREATE TABLE rawdata_unique_ip (
  id SERIAL NOT NULL,
  ip INET NOT NULL,
  PRIMARY KEY(id)
);
-- One row per distinct IP; the unique index is what keeps lookups cheap
CREATE UNIQUE INDEX rawdata_unique_ip_udx ON rawdata_unique_ip (ip);
-- Seed the table with the IPs already sitting in rawdata
INSERT INTO rawdata_unique_ip (ip) SELECT DISTINCT ip FROM rawdata;
ANALYZE rawdata_unique_ip;
-- Keep rawdata honest: every IP it stores must exist in the unique table
ALTER TABLE rawdata ADD FOREIGN KEY(ip) REFERENCES rawdata_unique_ip(ip) MATCH FULL ON UPDATE CASCADE ON DELETE CASCADE;

CREATE FUNCTION tbl_trg_ins() RETURNS TRIGGER AS 'BEGIN
        -- Add the IP to the unique table only if it is not already there
        PERFORM TRUE FROM rawdata_unique_ip AS uip WHERE uip.ip = NEW.ip;
        IF NOT FOUND THEN
                INSERT INTO rawdata_unique_ip (ip) VALUES (NEW.ip);
        END IF;
        RETURN NEW;
END;' LANGUAGE 'plpgsql';
CREATE TRIGGER tbl_trg_ins BEFORE INSERT ON rawdata FOR EACH ROW EXECUTE PROCEDURE tbl_trg_ins();

Now instead of SELECTing out of rawdata, SELECT your IPs out of the rawdata_unique_ip table. I hope the above makes enough sense that you can follow it and make it a reality. Not to suggest that you have design issues, but the above at least gives you a boost on your current design w/o a serious redesign. An INT4 is only 4 bytes on disk too, whereas an INET is 12 bytes, so you could also save some space if that's an issue, though I suspect not.
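For instance, in sketch form (remember the unique table isn't tied to an sid here, so it simply holds every distinct IP):

-- old: sequential scan plus uniquify over ~2.2M rows
SELECT DISTINCT ip FROM rawdata WHERE sid = 2::SMALLINT;
-- new: the trigger already maintains one row per IP
SELECT ip FROM rawdata_unique_ip;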

[snip]
Currently only 1 of our 3 firewalls is enabled to report usage, thus
the sid=2 problem. So today we insert about 2,000-3,000 records per hour,
and that will likely at least double soon. The retention time is 5 years!
That gives me an estimate of about 100-150 million records.
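(Back-of-the-envelope: 2,500 records/hour * 24 * 365 * 5 comes out to roughly 110 million rows, so 100-150 million is the right ballpark at today's rate, before the doubling.)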

The good news is that's very doable. The biggest Pg database I've seen has tens of billions of rows and it's rumored that the largest Pg database has trillions of rows... you'll be in very good company.

It's kind of an extreme database in my opinion, and it might give me
problems similar to the ones dbmail has on big systems.

Well... dbmail has its own set of issues that shouldn't exist, but do because of the desire to retain support for MySQL (*pukes*). dbmail could easily be 5x faster than it currently is if done right, and I firmly believe that a Pg+dbmail version would be faster than a MySQL+dbmail version... but that's another rant altogether that's not appropriate for a public mailing list.

Maybe I'll go learn stored procedures some day.

Take the boost I gave you above and run w/ it... pl languages aren't too hard to use.... remember, SQL was designed so accountants could use it. -sc

--
Sean Chittenden
