Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Matteo Beccati wrote: Stefan Kaltenbrunner wrote: too bad - however any idea on one of the other troubling queries (q21) I mentioned in the mail I resent to the list (after the original one got lost)? http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php What happens if you increase statistics for l_orderkey? The statistics target is already at 1000 ... Stefan ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner wrote: too bad - however any idea on one of the other troubling queries (q21) I mentioned in the mail I resent to the list (after the original one got lost)? http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php What happens if you increase statistics for l_orderkey? Best regards -- Matteo Beccati http://phpadsnew.com http://phppgads.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt The next problem seems to be the drastic misestimation of this join size: - Nested Loop (cost=0.00..6872092.36 rows=135 width=28) (actual time=94.762..14429291.129 rows=3554044 loops=1) - Merge Join (cost=0.00..519542.74 rows=449804 width=16) (actual time=48.197..49636.006 rows=474008 loops=1) Merge Cond: (part.p_partkey = partsupp.ps_partkey) - Index Scan using pk_part on part (cost=0.00..105830.22 rows=112447 width=4) (actual time=34.646..14381.644 rows=118502 loops=1) Filter: ((p_name)::text ~~ '%ghost%'::text) - Index Scan using i_ps_partkey on partsupp (cost=0.00..388943.05 rows=8000278 width=12) (actual time=13.511..22659.364 rows=7999685 loops=1) - Index Scan using i_l_suppkey_partkey on lineitem (cost=0.00..14.11 rows=1 width=24) (actual time=4.415..30.310 rows=7 loops=474008) Index Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey)) With a factor-of-25000 error in that rowcount estimate, it's amazing the plans aren't worse than they are. It evidently thinks that most of the rows in the join of part and partsupp won't have any matching rows in lineitem, whereas on average there are about 7 matching rows apiece. So that's totally wacko, and it's not immediately obvious why. Could we see the pg_stats entries for part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey, lineitem.l_partkey, lineitem.l_suppkey? http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt Stefan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: It evidently thinks that most of the rows in the join of part and partsupp won't have any matching rows in lineitem, whereas on average there are about 7 matching rows apiece. So that's totally wacko, and it's not immediately obvious why. Could we see the pg_stats entries for part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey, lineitem.l_partkey, lineitem.l_suppkey? http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt OK, so we have 2 million parts and 10 suppliers, and ANALYZE doesn't seem to have been too far off at estimating either of those numbers. I think the problem is that there are not very many suppliers for any particular part, and thus the condition part match AND supplier match is really not much more selective than part match alone. The planner is supposing that their selectivities are independent, which they aren't. Offhand I don't see any good way to fix this without multi-column statistics, which is something that's certainly not happening for 8.2 :-( regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
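To put rough numbers on the independence problem described above, here is a back-of-the-envelope sketch (illustrative C, not planner code; the figures come from the plans and stats in this thread, except the distinct-supplier count, which is an assumption since the figure quoted above appears truncated):

    #include <stdio.h>

    int main(void)
    {
        double joinrel_rows  = 474008.0;    /* actual rows out of part JOIN partsupp */
        double lineitem_rows = 60000000.0;  /* ~6e7 lineitem rows per the plans */
        double n_partkeys    = 2000000.0;   /* "2 million parts" */
        double n_suppkeys    = 100000.0;    /* ASSUMED distinct supplier keys */

        /* Treating the two join clauses as independent multiplies their
         * selectivities: 1/n_partkeys * 1/n_suppkeys. */
        double independent = joinrel_rows * lineitem_rows / (n_partkeys * n_suppkeys);

        /* In reality each (partkey, suppkey) pair matches about 7 lineitem rows. */
        double actual = joinrel_rows * 7.0;

        printf("independent-clauses estimate: %8.0f rows\n", independent); /* ~142 */
        printf("actual (about 7 per pair):    %8.0f rows\n", actual);      /* ~3.3 million */
        return 0;
    }

Roughly speaking, multiplying the two per-clause selectivities lands near the planner's rows=135 estimate shown earlier in the thread, while the "about 7 matching rows apiece" observation puts the real answer in the millions.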
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: It evidently thinks that most of the rows in the join of part and partsupp won't have any matching rows in lineitem, whereas on average there are about 7 matching rows apiece. So that's totally wacko, and it's not immediately obvious why. Could we see the pg_stats entries for part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey, lineitem.l_partkey, lineitem.l_suppkey? http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt OK, so we have 2 million parts and 10 suppliers, and ANALYZE doesn't seem to have been too far off at estimating either of those numbers. I think the problem is that there are not very many suppliers for any particular part, and thus the condition part match AND supplier match is really not much more selective than part match alone. The planner is supposing that their selectivities are independent, which they aren't. looks like there are exactly 4 suppliers for any given part so that seems indeed like the problem :-( Offhand I don't see any good way to fix this without multi-column statistics, which is something that's certainly not happening for 8.2 :-( too bad - however any idea on one of the other troubling queries (q21) I mentioned in the mail I resent to the list (after the original one got lost)? http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php Stefan ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
[RESENT in a modified version since the original reply seems to have been lost by the listserver which seems to happen sometimes lately] Tom Lane wrote: Matteo Beccati [EMAIL PROTECTED] writes: Tom Lane ha scritto: Matteo Beccati [EMAIL PROTECTED] writes: I cannot see anything bad by using something like that: if (histogram is large/representative enough) Well, the question is exactly what is large enough? I feel a bit uncomfortable about applying the idea to a histogram with only 10 entries (especially if we ignore two of 'em). With 100 or more, it sounds all right. What's the breakpoint? Yes, I think 100-200 could be a good breakpoint. I've committed this change with (for now) 100 as the minimum histogram size to use. Stefan, are you interested in retrying your benchmark? spent some time retesting that and I got the following results(this is the same box as before but with a much slower disk-setup and a newly initdb'd cluster): http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt all that is with 2GB of effective_cache_size(plan does not change with much smaller settings btw) and a statistics target of 1000 and the following parameters: default planner settings for the EXPLAIN and the first EXPLAIN ANALYZE - then the same query with disabled hash_joins, the next one is with disabled nest_loops and the fastest one is with both nest_loop and hash_joins disabled (all run in that order - so there are possible caching effects). in comparison to: http://www.kaltenbrunner.cc/files/analyze_q9.txt we nearly got a 7 figure speedup due to the latest changes(much better estimates at least) - however the mergejoin+sort only plan is still faster. the other troubling query is the following: query: http://www.kaltenbrunner.cc/files/7/power1/db/plans/power_query21.txt plans: http://www.kaltenbrunner.cc/files/analyze_q21_beta1.txt (default,default,enable_nestloop=off,enable_nestloop=off and enable_hashjoin=off) despite having not-too-bad estimates for most of the key points in the plan the actual runtime of the chosen plan is quite disappointing. Stefan ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt

The next problem seems to be the drastic misestimation of this join size:

->  Nested Loop  (cost=0.00..6872092.36 rows=135 width=28) (actual time=94.762..14429291.129 rows=3554044 loops=1)
      ->  Merge Join  (cost=0.00..519542.74 rows=449804 width=16) (actual time=48.197..49636.006 rows=474008 loops=1)
            Merge Cond: (part.p_partkey = partsupp.ps_partkey)
            ->  Index Scan using pk_part on part  (cost=0.00..105830.22 rows=112447 width=4) (actual time=34.646..14381.644 rows=118502 loops=1)
                  Filter: ((p_name)::text ~~ '%ghost%'::text)
            ->  Index Scan using i_ps_partkey on partsupp  (cost=0.00..388943.05 rows=8000278 width=12) (actual time=13.511..22659.364 rows=7999685 loops=1)
      ->  Index Scan using i_l_suppkey_partkey on lineitem  (cost=0.00..14.11 rows=1 width=24) (actual time=4.415..30.310 rows=7 loops=474008)
            Index Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey))

With a factor-of-25000 error in that rowcount estimate, it's amazing the plans aren't worse than they are. It evidently thinks that most of the rows in the join of part and partsupp won't have any matching rows in lineitem, whereas on average there are about 7 matching rows apiece. So that's totally wacko, and it's not immediately obvious why. Could we see the pg_stats entries for part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey, lineitem.l_partkey, lineitem.l_suppkey? regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Matteo Beccati [EMAIL PROTECTED] writes: Tom Lane ha scritto: Matteo Beccati [EMAIL PROTECTED] writes: I cannot see anything bad by using something like that: if (histogram is large/representative enough) Well, the question is exactly what is large enough? I feel a bit uncomfortable about applying the idea to a histogram with only 10 entries (especially if we ignore two of 'em). With 100 or more, it sounds all right. What's the breakpoint? Yes, I think 100-200 could be a good breakpoint. I've committed this change with (for now) 100 as the minimum histogram size to use. Stefan, are you interested in retrying your benchmark? sure - but I'm having hardware (harddisk firmware) related issues on my testbox which will take a few further days to be resolved ... Stefan ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Matteo Beccati [EMAIL PROTECTED] writes: Tom Lane ha scritto: Matteo Beccati [EMAIL PROTECTED] writes: I cannot see anything bad by using something like that: if (histogram is large/representative enough) Well, the question is exactly what is large enough? I feel a bit uncomfortable about applying the idea to a histogram with only 10 entries (especially if we ignore two of 'em). With 100 or more, it sounds all right. What's the breakpoint? Yes, I think 100-200 could be a good breakpoint. I've committed this change with (for now) 100 as the minimum histogram size to use. Stefan, are you interested in retrying your benchmark? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Hi, Tom Lane wrote: I've committed this change with (for now) 100 as the minimum histogram size to use. Stefan, are you interested in retrying your benchmark? A first try with ltree gave big improvements on my smaller data set: the estimated row count is correct or off by only 1 row. I'm now restoring a bigger database to get more reliable results. I hope Stefan can confirm the improvement on dbt3 too. Thanks Tom :) Best regards -- Matteo Beccati http://phpadsnew.com http://phppgads.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
On Mon, 2006-09-18 at 17:46 +0200, Matteo Beccati wrote: Tom Lane ha scritto: Matteo Beccati [EMAIL PROTECTED] writes: I cannot see anything bad by using something like that: if (histogram is large/representative enough) Well, the question is exactly what is large enough? I feel a bit uncomfortable about applying the idea to a histogram with only 10 entries (especially if we ignore two of 'em). With 100 or more, it sounds all right. What's the breakpoint? Yes, I think 100-200 could be a good breakpoint. I don't actually know what is the current usage of SET STATISTICS, I usually set it to 1000 for columns which need more precise selectivity. The breakpoint could be set even higher (500?) so there is space to increase statistics without enabling the histogram check, but I don't feel very comfortable though suggesting this kind of possibly undocumented side effect... Hi everyone, You may be interested to have a look at the statistics collector for the geometry type within PostGIS. In order to prevent very large or very small geometries from ruining the statistics histogram and generating incorrect query plans, we make the assumption that the column distribution is likely to be close to normal, and then remove any ANALYZE-collected geometries from the set that lie outside +/- 3.25 standard deviations from the mean before creating the final histogram (removes just under 1% of the data from each end of an assumed normal distribution). This works well and AFAIK we've only ever had one reported case of an incorrect query plan being generated using this method. Kind regards, Mark. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
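A minimal sketch of the trimming step Mark describes (my own illustration in C, not the PostGIS source): drop sample values lying more than 3.25 standard deviations from the mean before the histogram is built, so a handful of extreme values cannot dominate the bucket boundaries.

    #include <math.h>
    #include <stddef.h>

    /* Returns the new sample count; kept values are compacted to the front of v[]. */
    size_t
    trim_outliers(double *v, size_t n)
    {
        double sum = 0.0, sumsq = 0.0;
        double mean, sd, lo, hi;
        size_t i, kept = 0;

        if (n < 2)
            return n;

        for (i = 0; i < n; i++)
        {
            sum += v[i];
            sumsq += v[i] * v[i];
        }
        mean = sum / (double) n;
        sd = sqrt(fmax(sumsq / (double) n - mean * mean, 0.0));
        lo = mean - 3.25 * sd;
        hi = mean + 3.25 * sd;

        for (i = 0; i < n; i++)
        {
            if (v[i] >= lo && v[i] <= hi)
                v[kept++] = v[i];
        }
        return kept;
    }

Under the assumed-normal distribution this removes a little under 1% of the sample from each tail, as described above.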
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: I'm not sure what we could do about the concurrent-sessions issue, but we could make some sort of attack on the query complexity issue by pro-rating the effective_cache_size among all the tables used by a query. hmm not sure i understand what you mean here :-( Per the comment for index_pages_fetched: * We assume that effective_cache_size is the total number of buffer pages * available for both table and index, and pro-rate that space between the * table and index. (Ideally other_pages should include all the other * tables and indexes used by the query too; but we don't have a good way * to get that number here.) A first-order approximation to this would be to add up the total sizes of all the other tables used in the query. I am thinking of leaving out other indexes, mainly because we can't tell at this level which other indexes are actually gonna get used. This would tend to underestimate by leaving out indexes, but not by a lot if you assume indexes are much smaller than their tables. It would also be an overestimate because tables that are not indexscanned concurrently with the one under consideration probably shouldn't be counted anyway. So one might hope these effects would more or less cancel out. Anyway it seems to be a better idea than what we have now. aah - I think I understand that logic now - thanks for the reference to the source :-) I will redo with lower settings - do you have any suggestions for that ? Try reducing effective_cache_size to maybe a fourth of what it is now. If that helps the thing pick better plans for these multi-table queries, then we should try changing the other_pages calculation as above. ok - the planner switches to a different plan at about 2.5GB of effective_cache_size resulting in the following plan: http://www.kaltenbrunner.cc/files/analyze_q7_1GB.txt (3 consecutive runs - starting with cold caches) with 6GB I get: http://www.kaltenbrunner.cc/files/analyze_q7_6GB.txt (single run - immediately after the above ones) Stefan ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: ok - the planner switches to a different plan at about 2.5GB of effective_cache_size resulting in the following [ much better ] plan: OK, so it sounds like it'd be a good idea to try to pro-rate effective_cache_size among all the tables in the query. I'll see if I can get that in without kluging the code up too much. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Hi, Tom Lane wrote: Hmm ... pattern_sel already applies the operator directly to the most_common_vals, but in this situation those aren't common enough to help much. With such an extensive histogram it is awfully tempting to assume that the histogram members are a representative sample, and take the selectivity as being the fraction of histogram entries that match the pattern. Maybe drop the first and last histogram entries on the grounds they're probably outliers. Thoughts? What would be a reasonable minimum histogram size to enable using this approach instead of the guess-on-the-basis-of-the-pattern code? That's what I was suggesting here respectively for ltree operators and like: http://archives.postgresql.org/pgsql-patches/2006-05/msg00178.php http://archives.postgresql.org/pgsql-performance/2006-01/msg00083.php My original ltree patch was stripped of the histogram matching code and I will need to re-patch 8.2 when deploying it to get decent performance with a couple of queries, but it would be very nice to avoid it ;) I cannot see anything bad by using something like that: if (histogram is large/representative enough) { recalculate_selectivity_matching_histogram_values() if (new_selectivity old_selectivity) return new_selectivity else return old_selectivity } Best regards -- Matteo Beccati http://phpadsnew.com http://phppgads.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Matteo Beccati [EMAIL PROTECTED] writes: I cannot see anything bad by using something like that: if (histogram is large/representative enough) Well, the question is exactly what is large enough? I feel a bit uncomfortable about applying the idea to a histogram with only 10 entries (especially if we ignore two of 'em). With 100 or more, it sounds all right. What's the breakpoint? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane ha scritto: Matteo Beccati [EMAIL PROTECTED] writes: I cannot see anything bad by using something like that: if (histogram is large/representative enough) Well, the question is exactly what is large enough? I feel a bit uncomfortable about applying the idea to a histogram with only 10 entries (especially if we ignore two of 'em). With 100 or more, it sounds all right. What's the breakpoint? Yes, I think 100-200 could be a good breakpoint. I don't actually know what is the current usage of SET STATISTICS, I usually set it to 1000 for columns which need more precise selectivity. The breakpoint could be set even higher (500?) so there is space to increase statistics without enabling the histogram check, but I don't feel very comfortable though suggesting this kind of possibly undocumented side effect... Best regards -- Matteo Beccati http://phpadsnew.com http://phppgads.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
[already sent a variant of that yesterday but it doesn't look like it made it to the list] Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: Apparently we've made the planner a bit too optimistic about the savings that can be expected from repeated indexscans occurring on the inside of a join. effective_cache_size was set to 10GB(my fault for copying over the conf from a 16GB box) during the run - lowering it just a few megabytes(!) or to a more realistic 6GB results in the following MUCH better plan: http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt Interesting. It used to be that effective_cache_size wasn't all that critical... what I think this report is showing is that with the 8.2 changes to try to account for caching effects in repeated indexscans, we've turned that into a pretty significant parameter. took me a while due to hardware issues on my testbox - but there are new results(with 6GB for effective_cache_size) up at: http://www.kaltenbrunner.cc/files/5/ there are still a few issues with the validity of the run like the rf tests not actually being done right - but lowering effective_cache_size gave a dramatic speedup on Q5, Q7 and Q8. that is the explain for the 4h+ Q9: http://www.kaltenbrunner.cc/files/analyze_q9.txt increasing the statistic_target up to 1000 does not seem to change the plan btw. disabling nested loop leads to the following (4 times faster) plan: http://www.kaltenbrunner.cc/files/analyze_q9_no_nest.txt since the hash-joins in there look rather slow (inappropriate hashtable set up due to the wrong estimates?) I disabled hash_joins too: http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt and amazingly this plan is by far the fastest one in runtime (15min vs 4,5h ...) except that the planner thinks it is 20 times more expensive ... Stefan ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner wrote: [already sent a variant of that yesterday but it doesn't look like it made it to the list] Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: Apparently we've made the planner a bit too optimistic about the savings that can be expected from repeated indexscans occurring on the inside of a join. effective_cache_size was set to 10GB(my fault for copying over the conf from a 16GB box) during the run - lowering it just a few megabytes(!) or to a more realistic 6GB results in the following MUCH better plan: http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt Interesting. It used to be that effective_cache_size wasn't all that critical... what I think this report is showing is that with the 8.2 changes to try to account for caching effects in repeated indexscans, we've turned that into a pretty significant parameter. took me a while due to hardware issues on my testbox - but there are new results(with 6GB for effective_cache_size) up at: http://www.kaltenbrunner.cc/files/5/ there are still a few issues with the validity of the run like the rf tests not actually being done right - but lowering effective_cache_size gave a dramatic speedup on Q5, Q7 and Q8. that is the explain for the 4h+ Q9: http://www.kaltenbrunner.cc/files/analyze_q9.txt increasing the statistic_target up to 1000 does not seem to change the plan btw. disabling nested loop leads to the following (4 times faster) plan: http://www.kaltenbrunner.cc/files/analyze_q9_no_nest.txt since the hash-joins in there look rather slow (inappropriate hashtable set up due to the wrong estimates?) I disabled hash_joins too: http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt and amazingly this plan is by far the fastest one in runtime (15min vs 4,5h ...) except that the planner thinks it is 20 times more expensive ... some additional numbers(first one is with default settings, second is with enable_nestloop = 'off', third one is with enable_nestloop = 'off' and enable_hashjoin='off'): http://www.kaltenbrunner.cc/files/analyze_q7.txt here we have a 3x speedup from disabling nested loops and a 2x speedup (over the original plan) with nested loops and hashjoins disabled. http://www.kaltenbrunner.cc/files/analyze_q20.txt here we have a 180x(!) speedup with both disabled planner options ... it is worth mentioning that for both queries the estimated costs in relation to each other look quite reasonable as soon as enable_nestloop = 'off' (ie 5042928 vs 10715247 with 344sec vs 514 for Q7 and 101441851 vs 101445468 with 10sec vs 11sec) Stefan ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: that is the explain for the 4h+ Q9: http://www.kaltenbrunner.cc/files/analyze_q9.txt The big problem there seems to be the drastic misestimation of the number of rows matching the p_name ~~ '%ghost%' condition. What does pg_stats have for the p_name column? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: that is the explain for the 4h+ Q9: http://www.kaltenbrunner.cc/files/analyze_q9.txt The big problem there seems to be the drastic misestimation of the number of rows matching the p_name ~~ '%ghost%' condition. What does pg_stats have for the p_name column? http://www.kaltenbrunner.cc/files/pg_stat_p_name.txt Stefan ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: some additional numbers(first one is with default settings, second is with enable_nestloop = 'off', third one is with enable_nestloop = 'off' and enable_hashjoin='off'): http://www.kaltenbrunner.cc/files/analyze_q7.txt I'm inclined to think you still have effective_cache_size set too high; or at least that the planner is being too optimistic about how much cache space is actually available to each indexscan. With the code as it currently stands, effective_cache_size has some of the same properties as work_mem: the planner effectively assumes that that much space is available to *each* indexscan, and so you'd need to de-rate the setting based on the complexity of queries and the number of concurrent sessions. I'm not sure what we could do about the concurrent-sessions issue, but we could make some sort of attack on the query complexity issue by pro-rating the effective_cache_size among all the tables used by a query. http://www.kaltenbrunner.cc/files/analyze_q20.txt here we have a 180x(!) speedup with both disabled planner options ... There's something awfully bogus about that one --- how is it that the aggregate subplan, with the exact same plan and same number of executions in all three cases, has an actual runtime 200x more in the first case? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: The big problem there seems to be the drastic misestimation of the number of rows matching the p_name ~~ '%ghost%' condition. What does pg_stats have for the p_name column? http://www.kaltenbrunner.cc/files/pg_stat_p_name.txt Hmm ... pattern_sel already applies the operator directly to the most_common_vals, but in this situation those aren't common enough to help much. With such an extensive histogram it is awfully tempting to assume that the histogram members are a representative sample, and take the selectivity as being the fraction of histogram entries that match the pattern. Maybe drop the first and last histogram entries on the grounds they're probably outliers. Thoughts? What would be a reasonable minimum histogram size to enable using this approach instead of the guess-on-the-basis-of-the-pattern code? regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
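A toy sketch of the heuristic being proposed here (illustrative only, not the code that was actually committed): count the fraction of histogram entries that match the pattern, skip the two endpoint entries as probable outliers, and only trust the result when the histogram is large enough to be treated as a representative sample.

    #include <stddef.h>
    #include <string.h>

    /*
     * Estimate the selectivity of a substring pattern (e.g. LIKE '%ghost%')
     * as the fraction of histogram members matching it.  Returns -1 when the
     * histogram is too small, so the caller can fall back to the existing
     * guess-from-the-pattern code.
     */
    double
    histogram_like_selectivity(const char **hist, size_t nhist, const char *needle)
    {
        size_t i;
        size_t nmatch = 0;

        if (nhist < 100)                    /* "large enough" breakpoint */
            return -1.0;

        for (i = 1; i + 1 < nhist; i++)     /* ignore the two endpoint entries */
        {
            if (strstr(hist[i], needle) != NULL)
                nmatch++;
        }
        return (double) nmatch / (double) (nhist - 2);
    }

With a statistics target of 1000 the p_name histogram has on the order of a thousand members, so the matching fraction should track the true selectivity of '%ghost%' far better than a pattern-length guess can.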
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: some additional numbers(first one is with default settings, second is with enable_nestloop = 'off', third one is with enable_nestloop = 'off' and enable_hashjoin='off'): http://www.kaltenbrunner.cc/files/analyze_q7.txt I'm inclined to think you still have effective_cache_size set too high; or at least that the planner is being too optimistic about how much cache space is actually available to each indexscan. I have long term external monitoring on that server and it indeed shows that there was never less than about 5.8G of buffercache used (or more than 2.2GB used by other means). So 6G might still be a bit on the optimistic side but it is not actually that far off from reality. I will redo with lower settings - do you have any suggestions for that ? With the code as it currently stands, effective_cache_size has some of the same properties as work_mem: the planner effectively assumes that that much space is available to *each* indexscan, and so you'd need to de-rate the setting based on the complexity of queries and the number of concurrent sessions. concurrency is 1 here - there is never more than a single query running in parallel in those tests. I'm not sure what we could do about the concurrent-sessions issue, but we could make some sort of attack on the query complexity issue by pro-rating the effective_cache_size among all the tables used by a query. hmm not sure i understand what you mean here :-( http://www.kaltenbrunner.cc/files/analyze_q20.txt here we have a 180x(!) speedup with both disabled planner options ... There's something awfully bogus about that one --- how is it that the aggregate subplan, with the exact same plan and same number of executions in all three cases, has an actual runtime 200x more in the first case? hmm - good question. I will redo those in a bit ... Stefan ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: I'm not sure what we could do about the concurrent-sessions issue, but we could make some sort of attack on the query complexity issue by pro-rating the effective_cache_size among all the tables used by a query. hmm not sure i understand what you mean here :-( Per the comment for index_pages_fetched: * We assume that effective_cache_size is the total number of buffer pages * available for both table and index, and pro-rate that space between the * table and index. (Ideally other_pages should include all the other * tables and indexes used by the query too; but we don't have a good way * to get that number here.) A first-order approximation to this would be to add up the total sizes of all the other tables used in the query. I am thinking of leaving out other indexes, mainly because we can't tell at this level which other indexes are actually gonna get used. This would tend to underestimate by leaving out indexes, but not by a lot if you assume indexes are much smaller than their tables. It would also be an overestimate because tables that are not indexscanned concurrently with the one under consideration probably shouldn't be counted anyway. So one might hope these effects would more or less cancel out. Anyway it seems to be a better idea than what we have now. I will redo with lower settings - do you have any suggestions for that ? Try reducing effective_cache_size to maybe a fourth of what it is now. If that helps the thing pick better plans for these multi-table queries, then we should try changing the other_pages calculation as above. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
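In outline, the first-order approximation sketched here might look something like this (illustrative only, not costsize.c): give an index scan only the share of effective_cache_size proportional to the size of its own table-plus-index relative to everything else the query touches.

    /* All sizes are in buffer pages; names are illustrative, not PostgreSQL's. */
    double
    prorated_cache_pages(double effective_cache_size,
                         double this_table_pages,
                         double this_index_pages,
                         double other_table_pages)   /* sum of the query's other tables */
    {
        double total = this_table_pages + this_index_pages + other_table_pages;

        if (total <= 0.0)
            return effective_cache_size;

        /* share of the cache this scan can plausibly claim for itself */
        return effective_cache_size * (this_table_pages + this_index_pages) / total;
    }

As discussed above, leaving other indexes out of other_table_pages underestimates the competition for cache, while counting tables that are not scanned concurrently overestimates it, so the two errors are hoped to roughly cancel.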
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: Could we see the actual EXPLAIN ANALYZE results for the slow plan? http://www.kaltenbrunner.cc/files/dbt3_explain_analyze.txt Well, indeed it seems that the hash join is just an innocent bystander: the bulk of the runtime (all but about 120 sec in fact) is spent here: - Nested Loop (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1) - Merge Join (cost=0.00..10248.66 rows=2 width=41) (actual time=16.654..2578.060 rows=19837 loops=1) ... - Bitmap Heap Scan on lineitem (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837) Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey) - Bitmap Index Scan on i_l_suppkey (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837) Index Cond: (lineitem.l_suppkey = supplier.s_suppkey) I suppose that the profile result you showed was taken during the startup transient where it was computing the hashtables that this loop's results are joined to ... but that's not where the problem is. The problem is repeating that bitmap scan on lineitem for nearly 2 different l_suppkeys. possible - I actually took them over a longer period of time Apparently we've made the planner a bit too optimistic about the savings that can be expected from repeated indexscans occurring on the inside of a join. The other plan uses a different join order and doesn't try to join lineitem until it's got orders.o_orderkey, whereupon it does a mergejoin against an indexscan on lineitem: - Index Scan using i_l_orderkey on lineitem (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1) The runtimes for the remainders of the plans are roughly comparable, so it's the cost of joining lineitem that is hurting here. Is lineitem sorted (or nearly sorted) by l_orderkey? Part of the problem could be overestimating the cost of this indexscan. What are the physical sizes of lineitem and its indexes, and how do those compare to your RAM? What are you using for planner settings (particularly effective_cache_size)? ouch - you are right(as usual) here. effective_cache_size was set to 10GB(my fault for copying over the conf from a 16GB box) during the run - lowering it just a few megabytes(!) or to a more realistic 6GB results in the following MUCH better plan: http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt as for the relation sizes: dbt3=# select pg_relation_size('lineitem'); pg_relation_size -- 10832764928 (1 row) dbt3=# select pg_total_relation_size('lineitem'); pg_total_relation_size 22960259072 (1 row) there are nine btree indexes on lineitem all between 1,1GB and 1,4GB in size. Stefan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: Apparently we've made the planner a bit too optimistic about the savings that can be expected from repeated indexscans occurring on the inside of a join. effective_cache_size was set to 10GB(my fault for copying over the conf from a 16GB box) during the run - lowering it just a few megabytes(!) or to a more realistic 6GB results in the following MUCH better plan: http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt Interesting. It used to be that effective_cache_size wasn't all that critical... what I think this report is showing is that with the 8.2 changes to try to account for caching effects in repeated indexscans, we've turned that into a pretty significant parameter. It'd be nice not to have to depend on the DBA to give us a good number for this setting. But I don't know of any portable ways to find out how much RAM is in the box, let alone what fraction of it we should assume is available per-query. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: Apparently we've made the planner a bit too optimistic about the savings that can be expected from repeated indexscans occurring on the inside of a join. effective_cache_size was set to 10GB(my fault for copying over the conf from a 16GB box) during the run - lowering it just a few megabytes(!) or to a more realistic 6GB results in the following MUCH better plan: http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt Interesting. It used to be that effective_cache_size wasn't all that critical... what I think this report is showing is that with the 8.2 changes to try to account for caching effects in repeated indexscans, we've turned that into a pretty significant parameter. yes I'm a bit worried about that too - it has been a bit of conventional wisdom that setting effective_cache_size optimistically will never hurt and that it encourages postgresql to sometimes get a better plan by favouring index-scans. It'd be nice not to have to depend on the DBA to give us a good number for this setting. But I don't know of any portable ways to find out how much RAM is in the box, let alone what fraction of it we should assume is available per-query. well there are really a number of things for which the dba had better give the database accurate information - though in that case we might go from "too much won't hurt" to "too much will hurt" ... Stefan ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
On Wed, Sep 13, 2006 at 10:47:09AM -0400, Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: Apparently we've made the planner a bit too optimistic about the savings that can be expected from repeated indexscans occurring on the inside of a join. effective_cache_size was set to 10GB(my fault for copying over the conf from a 16GB box) during the run - lowering it just a few megabytes(!) or to a more realistic 6GB results in the following MUCH better plan: http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt Interesting. It used to be that effective_cache_size wasn't all that critical... what I think this report is showing is that with the 8.2 changes to try to account for caching effects in repeated indexscans, we've turned that into a pretty significant parameter. It'd be nice not to have to depend on the DBA to give us a good number for this setting. But I don't know of any portable ways to find out how much RAM is in the box, let alone what fraction of it we should assume is available per-query. That's fairly straight-forward, if a little crude. We ask the DBA and provide some tools for estimating and tuning same. :) Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778AIM: dfetter666 Skype: davidfetter Remember to vote! ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
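For what it's worth, one common but non-portable way to ask the OS for physical RAM is sysconf(); _SC_PHYS_PAGES is available on Linux and some other systems but is not guaranteed everywhere, which is exactly the portability problem raised above (and it still says nothing about what fraction is really available per query). A minimal sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long pages = sysconf(_SC_PHYS_PAGES);   /* not available on all platforms */
        long page_size = sysconf(_SC_PAGESIZE);

        if (pages < 0 || page_size < 0)
        {
            fprintf(stderr, "cannot determine physical memory on this platform\n");
            return 1;
        }
        printf("physical RAM: %ld MB\n", pages / 1024 * page_size / 1024);
        return 0;
    }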
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: btw - the "hashjoin is bad" was more or less based on the observation that nearly all of the cpu is burned in hash-related functions in the profile (when profiling over a longer period of time those accumulate even more % of the time than in the short profile I included in the original report) [ shrug... ] Two out of the three functions you mentioned are not used by hash join, and anyway the other plan probably has a comparable execution density in sort-related functions; does that make it bad? hmm sorry for that - I should have checked the source before I made that assumption :-( It's possible that the large time for ExecScanHashBucket has something to do with skewed usage of the hash buckets due to an unfortunate data distribution, but that's theorizing far in advance of the data. http://www.kaltenbrunner.cc/files/4/ has preliminary data of the dbt3/scaling 10 run I did which seems to imply we have at least 4 queries in there that take an excessive amount of time (query 5 is the one I started the complaint with). However those results have to be taken with a grain of salt since there is an apparent bug in the dbt3 code which seems to rely on add_missing_from=on (as can be seen in some of the errorlogs of the database) and towards the end of the throughput run I did some of the explain analyzes for the report (those are the small 100% spikes in the graph due to the box using the second CPU to run them). I will redo those tests later this week though ... Stefan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: Tom Lane wrote: Could we see the actual EXPLAIN ANALYZE results for the slow plan? http://www.kaltenbrunner.cc/files/dbt3_explain_analyze.txt

Well, indeed it seems that the hash join is just an innocent bystander: the bulk of the runtime (all but about 120 sec in fact) is spent here:

->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1)
      ->  Merge Join  (cost=0.00..10248.66 rows=2 width=41) (actual time=16.654..2578.060 rows=19837 loops=1)
            ...
      ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837)
            Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
            ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837)
                  Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)

I suppose that the profile result you showed was taken during the startup transient where it was computing the hashtables that this loop's results are joined to ... but that's not where the problem is. The problem is repeating that bitmap scan on lineitem for nearly 2 different l_suppkeys. Apparently we've made the planner a bit too optimistic about the savings that can be expected from repeated indexscans occurring on the inside of a join. The other plan uses a different join order and doesn't try to join lineitem until it's got orders.o_orderkey, whereupon it does a mergejoin against an indexscan on lineitem:

->  Index Scan using i_l_orderkey on lineitem  (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)

The runtimes for the remainders of the plans are roughly comparable, so it's the cost of joining lineitem that is hurting here. Is lineitem sorted (or nearly sorted) by l_orderkey? Part of the problem could be overestimating the cost of this indexscan. What are the physical sizes of lineitem and its indexes, and how do those compare to your RAM? What are you using for planner settings (particularly effective_cache_size)? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
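As a side note on reading those numbers: the "actual time" figures EXPLAIN ANALYZE prints are per-loop averages, so a node's total contribution is time-per-loop times loops. A quick sketch with the bitmap heap scan figures above shows that the inner side accounts for essentially the whole runtime of the nested loop:

    #include <stdio.h>

    int main(void)
    {
        double loops         = 19837.0;   /* merge-join rows driving the inner scan */
        double rows_per_loop = 600.0;     /* lineitem rows fetched per l_suppkey */
        double ms_per_loop   = 1254.535;  /* upper "actual time" of the bitmap heap scan */

        printf("lineitem rows fetched: %.0f\n", loops * rows_per_loop);          /* ~11.9 million */
        printf("inner-scan time: %.0f seconds\n", loops * ms_per_loop / 1000.0); /* ~24,900 s */
        return 0;
    }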
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: I'm about to do some benchmarking on -HEAD on some hardware I have available and it seems I'm hitting a rather weird issue causing the osdl dbt3 benchmark to run very slow and eating CPU time for hours ... Could we see the actual EXPLAIN ANALYZE results for the slow plan? I'm unconvinced by your "hash join is bad" analysis, especially in the cases where you're giving it lots of work_mem. I think it's got something to do with the different join orders. The rowcount estimates in the fast plan all seem pretty good, but I'm betting something is wrong with some of them in the slow case. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Tom Lane wrote: Stefan Kaltenbrunner [EMAIL PROTECTED] writes: I'm about to do some benchmarking on -HEAD on some hardware I have available and it seems I'm hitting a rather weird issue causing the osdl dbt3 benchmark to run very slow and eating CPU time for hours ... Could we see the actual EXPLAIN ANALYZE results for the slow plan? I'm unconvinced by your "hash join is bad" analysis, especially in the cases where you're giving it lots of work_mem. I think it's got something to do with the different join orders. The rowcount estimates in the fast plan all seem pretty good, but I'm betting something is wrong with some of them in the slow case. will do - but that will take a while - the slow one runs for 12h or so even without explain analyze overhead ... Stefan ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] -HEAD planner issue wrt hash_joins on dbt3 ?
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: btw - the "hashjoin is bad" was more or less based on the observation that nearly all of the cpu is burned in hash-related functions in the profile (when profiling over a longer period of time those accumulate even more % of the time than in the short profile I included in the original report) [ shrug... ] Two out of the three functions you mentioned are not used by hash join, and anyway the other plan probably has a comparable execution density in sort-related functions; does that make it bad? It's possible that the large time for ExecScanHashBucket has something to do with skewed usage of the hash buckets due to an unfortunate data distribution, but that's theorizing far in advance of the data. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
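To illustrate the "skewed usage of the hash buckets" theory with made-up numbers: if the build side's join key is badly skewed, most inner tuples land in a single bucket, and every probe that hashes there has to walk nearly the whole relation instead of a short chain. A toy sketch (purely illustrative values, not executor code):

    #include <stdio.h>

    int main(void)
    {
        double inner_tuples = 1000000.0;   /* tuples loaded into the hash table */
        double nbuckets     = 131072.0;    /* buckets sized for ~8 tuples each */

        /* uniform key distribution: average chain length walked per probe */
        double uniform_chain = inner_tuples / nbuckets;

        /* skewed distribution: 90% of the tuples share one hash bucket */
        double skewed_chain = inner_tuples * 0.9;

        printf("chain per probe, uniform keys: %.1f tuples\n", uniform_chain);
        printf("chain per probe on the hot bucket: %.0f tuples\n", skewed_chain);
        return 0;
    }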