Re: [GENERAL] temporary table as a subset of an existing table and indexes
Basically, I take the same query as above and replace all occurences of tables logs and tags with temp_logs and temp_tags, created as follow: CREATE TEMPORARY TABLE temp_logs ON COMMIT DROP AS SELECT * FROM logs WHERE condition ORDER BY date DESC LIMIT max_size; CREATE TEMPORARY TABLE temp_tags ON COMMIT DROP AS SELECT * FROM tags WHERE logid IN (SELECT logid FROM temp_logs); With condition usually defining a date window. As we are experimenting with this approach, date has become a forced criteria. I have experimented with partitioning, but it led to the logid primary key not being unique anymore, which was a problem when joining data with the tags table. So the queries are pretty much the same, the boost in speed being simply due to the limitation of the search space. How are you partitioning the tags? Is the partitioned query doing the same job as the non partitioned query? Is date a forced criteria? (and if it is, have you considered date partition/brute force?) merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] temporary table as a subset of an existing table and indexes
On Tue, Nov 9, 2010 at 4:47 AM, Matthieu Huin wrote: > Hello Merlin, > > So far the improvement in responsiveness has been very noticeable, even > without indexing the temporary tables. Of course, this is just trading > accuracy for speed as I simply narrow arbitrarily the search space ... > > The schema I am working on is close to the one I am referencing in this > thread : http://archives.postgresql.org/pgsql-general/2010-11/msg00272.php > > Since we want to implement full text search and tags querying, it can lead > to rather complex autogenerated queries such as this one (find log lines > with the word 'root' in it, dated from 11/04 to 11/06, where the 'program' > tag is sshd and the 'severity_code' tag is less than 3) : > > > EXPLAIN ANALYZE SELECT r.logid FROM (((SELECT lcond84.logid, lcond84.date > FROM > ( SELECT tmp84.logid, tmp84.date FROM logs tmp84 > WHERE TRUE AND ( to_tsvector('simple',tmp84.body) @@ > plainto_tsquery('simple','root') ) AND tmp84.date > '2010-11-04 10:22:06.26' > AND tmp84.date < '2010-11-06 10:22:06.26' ) AS lcond84 > NATURAL JOIN > ( SELECT tmp85.logid FROM tags tmp85 WHERE > FALSE > OR (tmp85.name='severity_code' AND num_lt(tmp85.value, 3) ) > OR (tmp85.name='program' AND tmp85.value = CAST(ROW('sshd') AS tagvalue) ) > GROUP BY tmp85.logid HAVING COUNT(tmp85.logid) = 2 ) AS tcond84 ) > > )) AS r ORDER BY r.date DESC LIMIT 1000; > > > > Giving the following query plan : > > Limit (cost=765445.54..765445.56 rows=9 width=16) (actual > time=34744.257..34744.257 rows=0 loops=1) > -> Sort (cost=765445.54..765445.56 rows=9 width=16) (actual > time=34744.255..34744.255 rows=0 loops=1) > Sort Key: tmp84.date > Sort Method: quicksort Memory: 17kB > -> Hash Join (cost=765005.46..765445.40 rows=9 width=16) (actual > time=34744.202..34744.202 rows=0 loops=1) > Hash Cond: (tmp85.logid = tmp84.logid) > -> HashAggregate (cost=758440.29..758669.77 rows=15299 > width=8) (actual time=33343.816..33343.816 rows=0 loops=1) > Filter: (count(tmp85.logid) = 2) > -> Bitmap Heap Scan on tags tmp85 > (cost=92363.26..757225.45 rows=242968 width=8) (actual > time=20676.354..33294.252 rows=32864 loops=1) > Recheck Cond: ((name = 'severity_code'::text) OR > (name = 'program'::text)) > Filter: (((name = 'severity_code'::text) AND > num_lt(value, 3::double precision)) OR ((name = 'program'::text) AND > ((value).storedvalue = 'sshd'::text))) > -> BitmapOr (cost=92363.26..92363.26 > rows=2148625 width=0) (actual time=19688.915..19688.915 rows=0 loops=1) > -> Bitmap Index Scan on nameval_idx > (cost=0.00..30388.35 rows=707841 width=0) (actual time=19337.358..19337.358 > rows=708719 loops=1) > Index Cond: (name = > 'severity_code'::text) > -> Bitmap Index Scan on nameval_idx > (cost=0.00..61853.42 rows=1440784 width=0) (actual time=351.551..351.551 > rows=1484703 loops=1) > Index Cond: (name = 'program'::text) > -> Hash (cost=6553.06..6553.06 rows=969 width=16) (actual > time=1400.378..1400.378 rows=32516 loops=1) > -> Bitmap Heap Scan on logs tmp84 > (cost=215.51..6553.06 rows=969 width=16) (actual time=1104.226..1383.745 > rows=32516 loops=1) > Recheck Cond: (to_tsvector('simple'::regconfig, > body) @@ '''root'''::tsquery) > Filter: ((date > '2010-11-04 > 10:22:06.26+01'::timestamp with time zone) AND (date < '2010-11-06 > 10:22:06.26+01'::timestamp with time zone)) > -> Bitmap Index Scan on fulltext_body_idx > (cost=0.00..215.27 rows=1740 width=0) (actual time=1097.874..1097.874 > rows=64340 loops=1) > Index Cond: > (to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery) > Total runtime: 34756.938 ms > > This one isn't too bad, but the runtime seems to increase exponentially with > the tables size. Therefore, using a temporary table based on the date > condition can cut the query time by a factor of up to ten (table creation > included, and provided the resulting table isn't too big - I make a COUNT > check prior to creation so that I will eventually limit manually the table > size.). But of course, I'd rather have speed AND accuracy ... > > To make things worse, the tables tend to grow very quickly since as you > might have guessed, I am working on the database part of a logs collector; > the current implementation doesn't scale well along the data. > > I hope this makes things clearer. Feel free to ask if you need more > clarifications, and thanks for your time. How are you partitioning the tags? Is the partitioned query doing the same job as the non partitioned query? Is date a forced criteria? (and if it is, have
Re: [GENERAL] temporary table as a subset of an existing table and indexes
Hello Merlin, So far the improvement in responsiveness has been very noticeable, even without indexing the temporary tables. Of course, this is just trading accuracy for speed as I simply narrow arbitrarily the search space ... The schema I am working on is close to the one I am referencing in this thread : http://archives.postgresql.org/pgsql-general/2010-11/msg00272.php Since we want to implement full text search and tags querying, it can lead to rather complex autogenerated queries such as this one (find log lines with the word 'root' in it, dated from 11/04 to 11/06, where the 'program' tag is sshd and the 'severity_code' tag is less than 3) : EXPLAIN ANALYZE SELECT r.logid FROM (((SELECT lcond84.logid, lcond84.date FROM ( SELECT tmp84.logid, tmp84.date FROM logs tmp84 WHERE TRUE AND ( to_tsvector('simple',tmp84.body) @@ plainto_tsquery('simple','root') ) AND tmp84.date > '2010-11-04 10:22:06.26' AND tmp84.date < '2010-11-06 10:22:06.26' ) AS lcond84 NATURAL JOIN ( SELECT tmp85.logid FROM tags tmp85 WHERE FALSE OR (tmp85.name='severity_code' AND num_lt(tmp85.value, 3) ) OR (tmp85.name='program' AND tmp85.value = CAST(ROW('sshd') AS tagvalue) ) GROUP BY tmp85.logid HAVING COUNT(tmp85.logid) = 2 ) AS tcond84 ) )) AS r ORDER BY r.date DESC LIMIT 1000; Giving the following query plan : Limit (cost=765445.54..765445.56 rows=9 width=16) (actual time=34744.257..34744.257 rows=0 loops=1) -> Sort (cost=765445.54..765445.56 rows=9 width=16) (actual time=34744.255..34744.255 rows=0 loops=1) Sort Key: tmp84.date Sort Method: quicksort Memory: 17kB -> Hash Join (cost=765005.46..765445.40 rows=9 width=16) (actual time=34744.202..34744.202 rows=0 loops=1) Hash Cond: (tmp85.logid = tmp84.logid) -> HashAggregate (cost=758440.29..758669.77 rows=15299 width=8) (actual time=33343.816..33343.816 rows=0 loops=1) Filter: (count(tmp85.logid) = 2) -> Bitmap Heap Scan on tags tmp85 (cost=92363.26..757225.45 rows=242968 width=8) (actual time=20676.354..33294.252 rows=32864 loops=1) Recheck Cond: ((name = 'severity_code'::text) OR (name = 'program'::text)) Filter: (((name = 'severity_code'::text) AND num_lt(value, 3::double precision)) OR ((name = 'program'::text) AND ((value).storedvalue = 'sshd'::text))) -> BitmapOr (cost=92363.26..92363.26 rows=2148625 width=0) (actual time=19688.915..19688.915 rows=0 loops=1) -> Bitmap Index Scan on nameval_idx (cost=0.00..30388.35 rows=707841 width=0) (actual time=19337.358..19337.358 rows=708719 loops=1) Index Cond: (name = 'severity_code'::text) -> Bitmap Index Scan on nameval_idx (cost=0.00..61853.42 rows=1440784 width=0) (actual time=351.551..351.551 rows=1484703 loops=1) Index Cond: (name = 'program'::text) -> Hash (cost=6553.06..6553.06 rows=969 width=16) (actual time=1400.378..1400.378 rows=32516 loops=1) -> Bitmap Heap Scan on logs tmp84 (cost=215.51..6553.06 rows=969 width=16) (actual time=1104.226..1383.745 rows=32516 loops=1) Recheck Cond: (to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery) Filter: ((date > '2010-11-04 10:22:06.26+01'::timestamp with time zone) AND (date < '2010-11-06 10:22:06.26+01'::timestamp with time zone)) -> Bitmap Index Scan on fulltext_body_idx (cost=0.00..215.27 rows=1740 width=0) (actual time=1097.874..1097.874 rows=64340 loops=1) Index Cond: (to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery) Total runtime: 34756.938 ms This one isn't too bad, but the runtime seems to increase exponentially with the tables size. Therefore, using a temporary table based on the date condition can cut the query time by a factor of up to ten (table creation included, and provided the resulting table isn't too big - I make a COUNT check prior to creation so that I will eventually limit manually the table size.). But of course, I'd rather have speed AND accuracy ... To make things worse, the tables tend to grow very quickly since as you might have guessed, I am working on the database part of a logs collector; the current implementation doesn't scale well along the data. I hope this makes things clearer. Feel free to ask if you need more clarifications, and thanks for your time. Matthieu Le 08/11/2010 22:26, Merlin Moncure a écrit : On Mon, Nov 8, 2010 at 12:15 PM, Matthieu Huin wrote: Greetings all, I am trying to optimize SELECT queries on a large table (10M rows and more) by using temporary tables that are subsets of my main table, thus narrowing the search space to a more m
Re: [GENERAL] temporary table as a subset of an existing table and indexes
On Mon, Nov 8, 2010 at 12:15 PM, Matthieu Huin wrote: > Greetings all, > > I am trying to optimize SELECT queries on a large table (10M rows and more) > by using temporary tables that are subsets of my main table, thus narrowing > the search space to a more manageable size. > Is it possible to transfer indices (or at least use the information from > existing indices) from the big table to its subset in a reasonable amount of > time ? Are you sure that the benefit of creating scratch tables is worth the overhead? Can you give explain/analyze of the query you are trying to optimize? merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] temporary table as a subset of an existing table and indexes
Greetings all, I am trying to optimize SELECT queries on a large table (10M rows and more) by using temporary tables that are subsets of my main table, thus narrowing the search space to a more manageable size. Is it possible to transfer indices (or at least use the information from existing indices) from the big table to its subset in a reasonable amount of time ? When I try : CREATE TEMPORARY TABLE tmp AS SELECT * FROM big_table WHERE condition; The table creation is fast ( a few seconds ) as there are indices on the big table that are optimized for condition, but then indexing the data is rather costly (the new table would have around 100k rows) and takes a few minutes to complete. This is not acceptable as the whole process aims at reducing the query time. I get even worse results with the following transaction : CREATE TEMPORARY TABLE tmp ( LIKE big_table INCLUDING INDEXES ); INSERT INTO tmp SELECT * FROM big_table WHERE condition; Also, partitioning my big table from the very beginning is not an option, as it doesn't guarantee index key unicity ( according to http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html ). Any suggestions on this ? Kind regards, Matthieu Huin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general