[SQL] SQL report
I have the following senario. I have a tracking system. The system will record the status of an object regularly, all the status records are stored in one table. And it will keep a history of maximum 1000 status record for each object it tracks. The maximum objects the system will track is 100,000. Which means I will potentially have a table size of 100 million records. I have to generate a report on the latest status of all objects being tracked at a particular point in time, and also I have to allow user to sort and filter on different columes in the status record displayed in the report. The following is a brief description in the status record (they are not actual code) ObjectRecord( objectId bigint PrimaryKey desc varchar ) StatusRecord ( id bigint PrimaryKey objectId bigint indexed datetime bigint indexed capacity double reliability double efficiency double ) I have tried to do the following, it works very well with around 20,000 objects. (The query return in less than 10s) But when I have 100,000 objects it becomes very very slow. (I don't even have patience to wait for it to return I kill it after 30 mins) select * from statusrecord s1 INNER JOIN ( SELECT objectId , MAX(datetime) AS msdt FROM statusrecord WHERE startDatetime <= 1233897527657 GROUP BY objectId ) AS s2 ON ( s1.objectId = s2.objectId AND s1.datetime = s2.msdt ) where ( capacity < 10.0 ) order by s1.datetime DESC, s1.objectId DESC; I did try to write a store procedure like below, for 100,000 objects and 1000 status records / object, it returns in around 30 mins. CREATE OR REPLACE FUNCTION getStatus(pitvalue BIGINT) RETURNS SETOF statusrecord AS $BODY$ DECLARE id VARCHAR; status statusrecord%ROWTYPE; BEGIN FOR object IN SELECT * FROM objectRecord LOOP EXECUTE 'SELECT * FROM statusrecord WHERE objectId = ' || quote_literal(object.objectId) || ' AND datetime <= ' || quote_literal(pitvalue) || ' ORDER BY datetime DESC' INTO status; IF FOUND THEN RETURN NEXT status; END IF; END LOOP; RETURN; END $BODY$ LANGUAGE plpgsql; Just wanna to know if anyone have a different approach to my senario. Thanks alot. John
Re: [SQL] SQL report
Hi Steve, Thanks for you suggestions. In my senario, what is current depends on users. Because if user wants a status report at 00:00 1st Jan 2009, then 00:00 1st Jan 2009 is current. So it is not possible to flag any records as current unless the user tells us what is current. cheers John On Jul 31, 2009 2:41am, Steve Crawford wrote: wkipj...@gmail.com wrote: I have the following senario. I have a tracking system. The system will record the status of an object regularly, all the status records are stored in one table. And it will keep a history of maximum 1000 status record for each object it tracks. The maximum objects the system will track is 100,000. Which means I will potentially have a table size of 100 million records. I have to generate a report on the latest status of all objects being tracked at a particular point in time, and also I have to allow user to sort and filter on different columes in the status record displayed in the report. ... Just wanna to know if anyone have a different approach to my senario. Thanks alot. Not knowing all the details of your system, here are some things you could experiment with: 1. Add a "latest record id" field in your object table (automatically updated with a trigger) that would allow you to do a simple join with the tracking table. I suspect that such a join will be far faster than calculating "max" 100,000 times at the expense of a slightly larger main table. 2. Add a "current record flag" in the status table that simply flags the most recent record for each object (again, use triggers to keep the flag appropriately updated). This would also eliminate the need for the "max" subquery. You could even create a partial index filtering on the "current record flag" which could speed things up if the reporting query is written correctly. 3. Partition the table into a "current status table" and "historical status table" (each inheriting from the main table). Use a trigger so that anytime a new status record in added, the old "current" record is moved from the "current" to the "historical" table and the new one added to the "current" table. The latest status report will only need a simple join on the "current" table with a max size of 100,000 rather than a more complex query over a 100,000,000 record table. Cheers, Steve
Re: [SQL] SQL report
Hi Rob, I have default B-Tree indexes created for each of the indexed columes and primary key columes. (No multiple columes indexe or NULL FIRST or DESC/ASC). I am using PostgreSQL 8.3 with the auto vacuum daemon on. I assume analyse will be automatically run to collect statistics for use by the planner and there is no maintainance for B-tree indexes once it is created. (Please point me out if I am wrong about this) I will probably try to partition the status table to group more recent status records together to minimize the dataset I am querying. Thx John On Jul 31, 2009 1:16am, Rob Sargent wrote: I would be curious to know the performance curve for let's say 20K, 40K , 60K, 80K, 100K records. And what sort of indexing you have, whether or not it's clustered, re-built and so on. One could envision partitioning the status table such that recent records were grouped together (on the assumption that they will be most frequently "reported"). wkipj...@gmail.com wrote: I have the following senario. I have a tracking system. The system will record the status of an object regularly, all the status records are stored in one table. And it will keep a history of maximum 1000 status record for each object it tracks. The maximum objects the system will track is 100,000. Which means I will potentially have a table size of 100 million records. I have to generate a report on the latest status of all objects being tracked at a particular point in time, and also I have to allow user to sort and filter on different columes in the status record displayed in the report. The following is a brief description in the status record (they are not actual code) ObjectRecord( objectId bigint PrimaryKey desc varchar ) StatusRecord ( id bigint PrimaryKey objectId bigint indexed datetime bigint indexed capacity double reliability double efficiency double ) I have tried to do the following, it works very well with around 20,000 objects. (The query return in less than 10s) But when I have 100,000 objects it becomes very very slow. (I don't even have patience to wait for it to return I kill it after 30 mins) select * from statusrecord s1 INNER JOIN ( SELECT objectId , MAX(datetime) AS msdt FROM statusrecord WHERE startDatetime I did try to write a store procedure like below, for 100,000 objects and 1000 status records / object, it returns in around 30 mins. CREATE OR REPLACE FUNCTION getStatus(pitvalue BIGINT) RETURNS SETOF statusrecord AS $BODY$ DECLARE id VARCHAR; status statusrecord%ROWTYPE; BEGIN FOR object IN SELECT * FROM objectRecord LOOP EXECUTE 'SELECT * FROM statusrecord WHERE objectId = ' || quote_literal(object.objectId) || ' AND datetime INTO status; IF FOUND THEN RETURN NEXT status; END IF; END LOOP; RETURN; END $BODY$ LANGUAGE plpgsql; Just wanna to know if anyone have a different approach to my senario. Thanks alot. John