Re: [GENERAL] Planner: rows=1 after similar to where condition.
-Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Wednesday, 5 March 2008 0:29 To: Joris Dobbelsteen Cc: Gregory Stark; Scott Marlowe; pgsql-general@postgresql.org Subject: Re: [GENERAL] Planner: rows=1 after similar to where condition. Joris Dobbelsteen [EMAIL PROTECTED] writes: [snip] Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) selects rows that match the regex with much higher probability than the general row population does. Since we don't yet have any cross-column statistics the planner has no chance of realizing that. Yes, that was it, should have realized that. Thanks, - Joris ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org/
Re: [GENERAL] Planner: rows=1 after similar to where condition.
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Joris Dobbelsteen Sent: Monday, 25 February 2008 17:08 To: Tom Lane Cc: Gregory Stark; Scott Marlowe; pgsql-general@postgresql.org Subject: Re: [GENERAL] Planner: rows=1 after similar to where condition. -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Monday, 25 February 2008 16:34 To: Joris Dobbelsteen Cc: Gregory Stark; Scott Marlowe; pgsql-general@postgresql.org Subject: Re: [GENERAL] Planner: rows=1 after similar to where condition. Joris Dobbelsteen [EMAIL PROTECTED] writes: Bitmap Heap Scan on log_syslog syslog (cost=11168.32..16988.84 rows=1 width=221) (actual time=11145.729..30067.606 rows=212 loops=1) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+, [][0-9\\.]* [^]+ - [^]+, Message-ID: [^]+, (Resent-Message-ID: [^]+, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, queued.as: [^ ,]+, [0-9]+ ms)$'::text)) It's not too surprising that you'd get a small selectivity estimate for such a long regexp; the default estimate is just based on the amount of fixed text in the pattern, and you've got a lot. If you increase the stats target for the column to 100 or more then it will try actually applying the regexp to all the histogram entries. That might or might not give you a better estimate. I will try that, expect result back within a few days (have it collect some better sample set). Unfortunally the regex is not so much for narrowing down the selection, but rather guarenteeing the format of the messages. You seem to consider the common case differently, and I can agree for most part. Unfortunally my use-case is different from the expected. That said, might a less aggressive selectivity estimation for long strings work better in the common case? A new test case (I did a fresh VACUUM ANALYZE with your statistics for text set to 100): Arround 5288 rows out of 4.3 Million match. Bitmap Heap Scan on log_syslog syslog (cost=1.94..53522.27 rows=1 width=226) (actual time=41661.354..92719.083 rows=5288 loops=1) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+, [][0-9\\.]* [^]+ - [^]+, Message-ID: [^]+, (Resent-Message-ID: [^]+, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, queued.as: [^ ,]+, [0-9]+ ms)$'::text)) - BitmapAnd (cost=1.94..1.94 rows=15279 width=0) (actual time=4641.009..4641.009 rows=0 loops=1) - Bitmap Index Scan on IX_log_syslog_program (cost=0.00..2908.86 rows=113370 width=0) (actual time=2913.718..2913.718 rows=113897 loops=1) Index Cond: ((program)::text = 'amavis'::text) - Bitmap Index Scan on IX_log_syslog_facility (cost=0.00..14868.57 rows=591426 width=0) (actual time=1715.591..1715.591 rows=586509 loops=1) Index Cond: ((facility)::text = 'mail'::text) Total runtime: 92738.389 ms Unfortunally, Tom, it seems the data varies to much and is not included in the histogram. Probably the data varies too much. In this case, a regex NOT for selection but rather for forcing the input format should be done differently. My construction with the regex as substring() construction and a WHERE substring() IS NOT NULL seems to give a better estimate in these cases. The result seems equivalent. Bitmap Heap Scan on log_syslog syslog (cost=17783.78..53966.33 rows=5844 width=226) (actual time=59095.076..110913.152 rows=5295 loops=1) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: (((priority)::text = 'notice'::text) AND (substring((text)::text, 'amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed \\[A-Za-z0-9]+\\, [][0-9.]* [^]+ - [^]+, Message-ID: [^]+, (Resent-Message-ID: [^]+, |)mail_id: [^ ,]+, Hits: [-+0-9.,]+, queued_as: [^ ,]+, [0-9]+ ms'::text, '\\'::text) IS NOT NULL)) - BitmapAnd (cost=17783.78..17783.78 rows=15279 width=0) (actual time=4003.657..4003.657 rows=0 loops=1) - Bitmap Index Scan on IX_log_syslog_program (cost=0.00..2908.86 rows=113370 width=0) (actual time=1652.278..1652.278 rows=113939 loops=1) Index Cond: ((program)::text = 'amavis'::text) - Bitmap Index Scan on IX_log_syslog_facility (cost=0.00..14868.57 rows=591426 width=0) (actual time=2339.943..2339.943 rows=586653 loops=1) Index Cond: ((facility)::text = 'mail'::text) Total runtime: 110921.978 ms Note: few added rows in second run is due to the fact that this is a live table that receives input continuesly. Concluding: Your estimator is really great and seems to give pretty good estimates! Except for regular expressions, which seem more tricky in this regard. A good note might
Re: [GENERAL] Planner: rows=1 after similar to where condition.
Joris Dobbelsteen [EMAIL PROTECTED] writes: From: Tom Lane [mailto:[EMAIL PROTECTED] If you increase the stats target for the column to 100 or more then it will try actually applying the regexp to all the histogram entries. That might or might not give you a better estimate. A new test case (I did a fresh VACUUM ANALYZE with your statistics for text set to 100): Arround 5288 rows out of 4.3 Million match. Ah, you had not given us that number before. That's one in 800 rows, more or less, which means that there's no chance of getting a well-founded statistical estimate with less than 800 items in the stats collection. Does it do any better with stats target set to 1000? I think though that the real problem may be that the index condition Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) selects rows that match the regex with much higher probability than the general row population does. Since we don't yet have any cross-column statistics the planner has no chance of realizing that. My construction with the regex as substring() construction and a WHERE substring() IS NOT NULL seems to give a better estimate in these cases. The result seems equivalent. Actually, it's got exactly 0 knowledge about substring() and is giving you a completely generic guess for this clause :-( regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [GENERAL] Planner: rows=1 after similar to where condition.
-Original Message- From: Scott Marlowe [mailto:[EMAIL PROTECTED] Sent: Monday, 25 February 2008 7:14 To: Joris Dobbelsteen Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Planner: rows=1 after similar to where condition. On Sun, Feb 24, 2008 at 4:35 PM, Joris Dobbelsteen [EMAIL PROTECTED] wrote: I seem to have some planner oddity, where it seems to completely mispredict the output after a regex compare. I've seem it on other occasions, where it completely screws up the join. You can note the rows=1 after the filter. A similar sitution has occurred when doing a regex filter in a subquery, which was subsequently predited as 1 row and triggered (oddly enough) a sequencial scan. Doing the same using equality on the result to substring(text from regex) seemed to work and produced a useful plan, since it did a hash-join (as it should have). Is this a known problem? Otherwise I think I should build a smaller test case... Using Postgresql 8.2.6 from Debian Etch-backports. Should be: PostGreSQL 8.2.5 on x86_64-pc-linux-gnu (GCC 4.1.2.20061115) (Debian 4.1.1-21). Should have paid closer attention. Bitmap Heap Scan on log_syslog syslog (cost=13124.26..51855.25 rows=1 width=270) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: ***SOME VERY LONG SIMILAR TO REGEX - BitmapAnd (cost=13124.26..13124.26 rows=18957 width=0) - Bitmap Index Scan on IX_log_syslog_program (cost=0.00..2223.95 rows=92323 width=0) Index Cond: ((program)::text = 'amavis'::text) - Bitmap Index Scan on IX_log_syslog_facility (cost=0.00..10899.81 rows=463621 width=0) Index Cond: ((facility)::text = 'mail'::text) It's not saying it will only get one row back for sure, it's saying it thinks it will return one row. and depending on your query, it might. What's the query, and what's the explain analyze of that query? See the attached file for the query and the explain (hopefully this gives a consistent view and maintains the layout for easier reading). The point is that it will NOT, not even close. The planner guesses 1 row, but the output was arround 13000 rows (of the 2.2M rows in the table). Oddly enough the 18k rows on the bitmap and seems a very good estimation. In fact, if I omit the SIMILAR TO, it estimates ~12000 rows, which is spot on. So it seems the SIMILAR TO really gets the planner confused. The real value was 12981 rows that were returned by the query in the first case. However, since I removed this data from the original table (its now somewhere else), I cannot present the original EXPLAIN ANALYZE any more. The new dataset only contains ~137 (but I still have the old statistics, I think, or at least they provide the same predictions). I also included a run after EXPLAIN ANALYZE on the current dataset. Hopefully this helps. Thanks, - Joris QUERY= SELECT a.msgid, a.rcv_time, a.sent_time, a.hostname, a.passedas, a.from, a.to, a.message_id, a.resent_message_id, a.mail_id, a.queue_id, convert_score(a.spamscore) AS spamscore, CAST(a.duration AS integer) FROM(SELECT syslog.msgid, syslog.rcv_time, syslog.sent_time, syslog.hostname, substring(syslog.text from 'amavis% Passed #%#, % % - %, Message-ID: %, mail_id: %, Hits: %, queued_as: %, % ms' for '#') as passedas, substring(syslog.text from 'amavis% Passed %, % #%# - %, Message-ID: %, mail_id: %, Hits: %, queued_as: %, % ms' for '#') as from, substring(syslog.text from 'amavis% Passed %, % % - #%#, Message-ID: %, mail_id: %, Hits: %, queued_as: %, % ms' for '#') as to, substring(syslog.text from 'amavis% Passed %, % % - %, Message-ID: #[^]+#, (Resent-Message-ID: #%#, |)mail_id: %, Hits: %, queued_as: %, % ms' for '#') as message_id, substring(syslog.text from 'amavis% Passed %, % % - %, Message-ID: %, Resent-Message-ID: #%#, mail_id: %, Hits: %, queued_as: %, % ms' for '#') as resent_message_id, substring(syslog.text from 'amavis% Passed %, % % - %, Message-ID: %, mail_id: #%#, Hits: %, queued_as: %, % ms' for '#') as mail_id, substring(syslog.text from 'amavis% Passed %, % % - %, Message-ID: %, mail_id: %, Hits: #%#, queued_as: %, % ms' for '#') as spamscore, substring(syslog.text from 'amavis% Passed %, % % - %, Message-ID: %, mail_id: %, Hits: %, queued_as: #%#, % ms' for '#') as queue_id, substring(syslog.text from 'amavis% Passed %, % % - %, Message-ID: %, mail_id: %, Hits: %, queued_as: %, #%# ms' for '#') as duration, syslog.text, syslog.facility, syslog.priority, syslog.program FROM public.log_syslog syslog WHERE syslog.program = 'amavis' AND syslog.facility = 'mail' AND syslog.priority
Re: [GENERAL] Planner: rows=1 after similar to where condition.
Joris Dobbelsteen [EMAIL PROTECTED] writes: Should be: PostGreSQL 8.2.5 on x86_64-pc-linux-gnu (GCC 4.1.2.20061115) (Debian 4.1.1-21). In this case that may matter. One of the changes in the 8.2.6 update was: . Improve planner's handling of LIKE/regex estimation in non-C locales (Tom) I seem to recall this largely had to do with negated regexp matches but I might only be remembering part of it. I would start by taking all the fixes for known bugs before trying to diagnose a new one :) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services! ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org/
Re: [GENERAL] Planner: rows=1 after similar to where condition.
-Original Message- From: Gregory Stark [mailto:[EMAIL PROTECTED] Sent: Monday, 25 February 2008 12:31 To: Joris Dobbelsteen Cc: Scott Marlowe; pgsql-general@postgresql.org Subject: Re: Planner: rows=1 after similar to where condition. Joris Dobbelsteen [EMAIL PROTECTED] writes: Should be: PostGreSQL 8.2.5 on x86_64-pc-linux-gnu (GCC 4.1.2.20061115) (Debian 4.1.1-21). In this case that may matter. One of the changes in the 8.2.6 update was: . Improve planner's handling of LIKE/regex estimation in non-C locales (Tom) I seem to recall this largely had to do with negated regexp matches but I might only be remembering part of it. I would start by taking all the fixes for known bugs before trying to diagnose a new one :) It seems debian backports are not upgraded on my system. So I performed the upgrade to: PostgreSQL 8.2.6 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) And indeed, I should have upgraded it before to the latest version. It still spits out the same EXPLAIN ANALYZE result. Bitmap Heap Scan on log_syslog syslog (cost=11168.32..16988.84 rows=1 width=221) (actual time=11145.729..30067.606 rows=212 loops=1) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+, [][0-9\\.]* [^]+ - [^]+, Message-ID: [^]+, (Resent-Message-ID: [^]+, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, queued.as: [^ ,]+, [0-9]+ ms)$'::text)) - BitmapAnd (cost=11168.32..11168.32 rows=2095 width=0) (actual time=608.771..608.771 rows=0 loops=1) - Bitmap Index Scan on IX_log_syslog_program (cost=0.00..490.06 rows=19160 width=0) (actual time=94.982..94.982 rows=85238 loops=1) Index Cond: ((program)::text = 'amavis'::text) - Bitmap Index Scan on IX_log_syslog_facility (cost=0.00..10677.75 rows=426214 width=0) (actual time=504.960..504.960 rows=455084 loops=1) Index Cond: ((facility)::text = 'mail'::text) Total runtime: 30068.868 ms From what little I know statistics and how, in postgres, the filter expressions are computed, the rows=1 prediction seems still a bit odd. It seems the fix did not effect this query. Thanks for your support so far, - Joris ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] Planner: rows=1 after similar to where condition.
Joris Dobbelsteen [EMAIL PROTECTED] writes: Bitmap Heap Scan on log_syslog syslog (cost=11168.32..16988.84 rows=1 width=221) (actual time=11145.729..30067.606 rows=212 loops=1) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+, [][0-9\\.]* [^]+ - [^]+, Message-ID: [^]+, (Resent-Message-ID: [^]+, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, queued.as: [^ ,]+, [0-9]+ ms)$'::text)) It's not too surprising that you'd get a small selectivity estimate for such a long regexp; the default estimate is just based on the amount of fixed text in the pattern, and you've got a lot. If you increase the stats target for the column to 100 or more then it will try actually applying the regexp to all the histogram entries. That might or might not give you a better estimate. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] Planner: rows=1 after similar to where condition.
Resent due to bounce... orange.nl #5.0.0 X-SMTP-Server; host sss.pgh.pa.us[66.207.139.130] said: 550 -Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Monday, 25 February 2008 16:34 To: Joris Dobbelsteen Cc: Gregory Stark; Scott Marlowe; pgsql-general@postgresql.org Subject: Re: [GENERAL] Planner: rows=1 after similar to where condition. Joris Dobbelsteen [EMAIL PROTECTED] writes: Bitmap Heap Scan on log_syslog syslog (cost=11168.32..16988.84 rows=1 width=221) (actual time=11145.729..30067.606 rows=212 loops=1) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+, [][0-9\\.]* [^]+ - [^]+, Message-ID: [^]+, (Resent-Message-ID: [^]+, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, queued.as: [^ ,]+, [0-9]+ ms)$'::text)) It's not too surprising that you'd get a small selectivity estimate for such a long regexp; the default estimate is just based on the amount of fixed text in the pattern, and you've got a lot. If you increase the stats target for the column to 100 or more then it will try actually applying the regexp to all the histogram entries. That might or might not give you a better estimate. I will try that, expect result back within a few days (have it collect some better sample set). Unfortunally the regex is not so much for narrowing down the selection, but rather guarenteeing the format of the messages. You seem to consider the common case differently, and I can agree for most part. Unfortunally my use-case is different from the expected. That said, might a less aggressive selectivity estimation for long strings work better in the common case? Might an alternative be to use a function and check for a positive result, i.e. something that the predictor cannot take into account? - Joris ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] Planner: rows=1 after similar to where condition.
-Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Monday, 25 February 2008 16:34 To: Joris Dobbelsteen Cc: Gregory Stark; Scott Marlowe; pgsql-general@postgresql.org Subject: Re: [GENERAL] Planner: rows=1 after similar to where condition. Joris Dobbelsteen [EMAIL PROTECTED] writes: Bitmap Heap Scan on log_syslog syslog (cost=11168.32..16988.84 rows=1 width=221) (actual time=11145.729..30067.606 rows=212 loops=1) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+, [][0-9\\.]* [^]+ - [^]+, Message-ID: [^]+, (Resent-Message-ID: [^]+, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, queued.as: [^ ,]+, [0-9]+ ms)$'::text)) It's not too surprising that you'd get a small selectivity estimate for such a long regexp; the default estimate is just based on the amount of fixed text in the pattern, and you've got a lot. If you increase the stats target for the column to 100 or more then it will try actually applying the regexp to all the histogram entries. That might or might not give you a better estimate. I will try that, expect result back within a few days (have it collect some better sample set). Unfortunally the regex is not so much for narrowing down the selection, but rather guarenteeing the format of the messages. You seem to consider the common case differently, and I can agree for most part. Unfortunally my use-case is different from the expected. That said, might a less aggressive selectivity estimation for long strings work better in the common case? Might an alternative be to use a function and check for a positive result, i.e. something that the predictor cannot take into account? - Joris ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[GENERAL] Planner: rows=1 after similar to where condition.
I seem to have some planner oddity, where it seems to completely mispredict the output after a regex compare. I've seem it on other occasions, where it completely screws up the join. You can note the rows=1 after the filter. A similar sitution has occurred when doing a regex filter in a subquery, which was subsequently predited as 1 row and triggered (oddly enough) a sequencial scan. Doing the same using equality on the result to substring(text from regex) seemed to work and produced a useful plan, since it did a hash-join (as it should have). Is this a known problem? Otherwise I think I should build a smaller test case... Using Postgresql 8.2.6 from Debian Etch-backports. Bitmap Heap Scan on log_syslog syslog (cost=13124.26..51855.25 rows=1 width=270) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: ***SOME VERY LONG SIMILAR TO REGEX - BitmapAnd (cost=13124.26..13124.26 rows=18957 width=0) - Bitmap Index Scan on IX_log_syslog_program (cost=0.00..2223.95 rows=92323 width=0) Index Cond: ((program)::text = 'amavis'::text) - Bitmap Index Scan on IX_log_syslog_facility (cost=0.00..10899.81 rows=463621 width=0) Index Cond: ((facility)::text = 'mail'::text) - Joris ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] Planner: rows=1 after similar to where condition.
On Sun, Feb 24, 2008 at 4:35 PM, Joris Dobbelsteen [EMAIL PROTECTED] wrote: I seem to have some planner oddity, where it seems to completely mispredict the output after a regex compare. I've seem it on other occasions, where it completely screws up the join. You can note the rows=1 after the filter. A similar sitution has occurred when doing a regex filter in a subquery, which was subsequently predited as 1 row and triggered (oddly enough) a sequencial scan. Doing the same using equality on the result to substring(text from regex) seemed to work and produced a useful plan, since it did a hash-join (as it should have). Is this a known problem? Otherwise I think I should build a smaller test case... Using Postgresql 8.2.6 from Debian Etch-backports. Bitmap Heap Scan on log_syslog syslog (cost=13124.26..51855.25 rows=1 width=270) Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text)) Filter: ***SOME VERY LONG SIMILAR TO REGEX - BitmapAnd (cost=13124.26..13124.26 rows=18957 width=0) - Bitmap Index Scan on IX_log_syslog_program (cost=0.00..2223.95 rows=92323 width=0) Index Cond: ((program)::text = 'amavis'::text) - Bitmap Index Scan on IX_log_syslog_facility (cost=0.00..10899.81 rows=463621 width=0) Index Cond: ((facility)::text = 'mail'::text) It's not saying it will only get one row back for sure, it's saying it thinks it will return one row. and depending on your query, it might. What's the query, and what's the explain analyze of that query? ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq