On Fri, Jun 5, 2020 at 2:30 PM Pavel Stehule <pavel.steh...@gmail.com> wrote:
>
> On Fri, Jun 5, 2020 at 8:19 AM David Rowley <dgrowle...@gmail.com> wrote:
>>
>> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1...@gmail.com> wrote:
>> > The one-line fix describes the exact idea in my mind:
>> >
>> > +++ b/src/backend/optimizer/path/costsize.c
>> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>> >
>> >     cpu_run_cost += cpu_per_tuple * tuples_fetched;
>> >
>> > +  /*
>> > +   * To make the planner more robust against inaccurate statistics,
>> > +   * add an extra cost per qpqual, so that the fewer qpquals a path
>> > +   * has, the lower its cost.
>> > +   */
>> > +  cpu_run_cost += 0.01 * list_length(qpquals);
>> > +
>> >
>> > This change does fix the issue above, but will it make some other cases
>> > worse? My answer is no, based on my current knowledge, and this is the
>> > most important point I want advice on. The main impact I can see is: it
>> > not only changes the cost difference between (a, b) and (a, c), it also
>> > changes the cost difference between an index scan on (a, c) and a seq
>> > scan. However, the cost difference between an index scan and a seq scan
>> > is huge in practice, so I don't think this impact is harmful.
>>
>> Didn't that idea already get shot down in the final paragraph of [1]?
>>
>> I understand that you wish to increase the cost by some seemingly
>> innocent constant to fix your problem case. Here are my thoughts about
>> that: telling lies is not really that easy to pull off. Bad liars think
>> it's easy and good ones know it's hard. The problem is that the lies can
>> start small, but at some point the future you must fashion some more
>> lies to account for your initial lie. Rinse and repeat that a few times
>> and before you know it, your course is set well away from the point of
>> truth. I feel the part about "rinse and repeat" applies reasonably well
>> to how join costing works.
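To make that compounding concrete, here is a toy model (pure illustration, not PostgreSQL code; the row counts, costs, and function are made up): a constant added to an inner scan's cost gets multiplied by the outer row count at every nested-loop level, so the absolute error grows with join depth.

```python
# Toy sketch of how a small constant cost "lie" compounds with join
# depth.  Illustration only -- this is not how costsize.c computes costs.

def nestloop_chain_cost(outer_rows, inner_scan_cost, depth):
    """Cost of `depth` stacked nested loops, where each level rescans
    the level below once per outer row."""
    cost = inner_scan_cost
    for _ in range(depth):
        cost = outer_rows * cost
    return cost

TRUE_SCAN = 100.0                   # hypothetical true inner scan cost
FUDGED_SCAN = TRUE_SCAN + 0.01 * 3  # + 0.01 per qpqual, as in the patch

for depth in (1, 2, 3):
    error = (nestloop_chain_cost(50, FUDGED_SCAN, depth)
             - nestloop_chain_cost(50, TRUE_SCAN, depth))
    print(f"depth {depth}: absolute cost error = {error:.2f}")
```

With 50 outer rows per level, a 0.03 nudge on the base scan becomes an error of 1.50 at one join level, 75.00 at two, and 3750.00 at three.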
>> The lie is likely to be amplified as the join level gets deeper.
>>
>> I think you need to think of a more generic solution and propose that
>> instead. There are plenty of other quirks in the planner that can cause
>> suffering due to inaccurate or non-existent statistics. For example,
>> because we multiply individual selectivity estimates, having a few
>> correlated columns in a join condition can cause the number of join
>> rows to be underestimated. Subsequent joins can then end up making bad
>> choices of join operator based on those inaccurate row estimates.
>> There's also a problem with WHERE <x> ORDER BY col LIMIT n sometimes
>> choosing an index that provides pre-sorted input to the ORDER BY but
>> cannot use <x> as an indexable condition. We don't record any stats to
>> make better choices there; maybe we should, but either way, we're
>> taking a big risk, as all the rows matching <x> might be right at the
>> end of the index and we might need to scan the entire thing before
>> hitting the LIMIT. For now, we just assume a completely even
>> distribution of rows, i.e. if there are 50 rows estimated in the path
>> and the limit is for 5 rows, then we assume we need to read 10% of
>> those before finding all the ones we need. In reality, everything
>> matching <x> might be 95% of the way through the index and we could end
>> up reading 100% of the rows. That particular problem is caused not just
>> by the uneven distribution of rows in the index, but also by
>> selectivity underestimation.
>>
>> I'd more recently wondered if we shouldn't have some sort of "risk"
>> factor in the cost model. I really don't have ideas on how exactly we
>> would calculate the risk factor in all cases, but in this case, say the
>> risk factor always starts out as 1. We could multiply that risk factor
>> by some constant > 1 each time we find another index filter qual.
>> add_path() could then prefer lower-risk paths when the costs are
>> similar.
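For what it's worth, that risk-factor idea might look something like the rough sketch below. All of the names, constants, and the tie-breaking rule are mine, made up for illustration; add_path() in PostgreSQL does not do anything like this today.

```python
# Rough sketch of a per-path "risk" multiplier (illustration only).

RISK_PER_FILTER_QUAL = 1.2  # made-up constant > 1, per index filter qual
COST_FUZZ = 1.1             # treat costs within 10% as "similar"

def path_risk(n_index_filter_quals):
    """Risk starts at 1 and is multiplied once per index filter qual."""
    risk = 1.0
    for _ in range(n_index_filter_quals):
        risk *= RISK_PER_FILTER_QUAL
    return risk

def prefer(path_a, path_b):
    """Pick between two (cost, risk) pairs: cheaper wins, unless the
    costs are similar, in which case the lower-risk path wins."""
    (cost_a, risk_a), (cost_b, risk_b) = path_a, path_b
    if max(cost_a, cost_b) <= COST_FUZZ * min(cost_a, cost_b):
        return path_a if risk_a <= risk_b else path_b
    return path_a if cost_a < cost_b else path_b

# Index (a, b): no filter quals.  Index (a, c): one filter qual, and
# marginally cheaper due to an inaccurate estimate.
ab = (100.0, path_risk(0))
ac = (98.0, path_risk(1))
print(prefer(ab, ac))  # costs within 10%, so the lower-risk (a, b) wins
```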
>> Unsure what the exact add_path() logic would be; perhaps a GUC would
>> need to assist with the decision there. Likewise, with NestedLoopPaths
>> that have a large number of join quals, the risk factor could go up a
>> bit so that we give stronger consideration to hash or merge joins
>> instead.
>
> I thought about these ideas too. And I am not alone.
>
> https://hal.archives-ouvertes.fr/hal-01316823/document
>

Thanks for the document. After checking it, it looks more like Oracle's
statistics feedback [1]. Hope we can have it someday :)

[1] https://blogs.oracle.com/optimizer/cardinality-feedback

-- 
Best Regards
Andy Fan
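PS: David's LIMIT example can be put into numbers with a trivial sketch. This only mirrors the even-distribution assumption he describes, not the planner's actual code:

```python
# The even-distribution assumption for LIMIT: if `limit_rows` of an
# estimated `est_rows` are needed, assume we scan limit/est of the
# input.  (Illustration of the assumption, not PostgreSQL source.)

def assumed_fraction_scanned(est_rows, limit_rows):
    return min(1.0, limit_rows / est_rows)

# David's numbers: 50 estimated rows, LIMIT 5 -> assume 10% scanned.
print(assumed_fraction_scanned(50, 5))  # 0.1

# Worst case: all matching rows sit at the very end of the index, so
# the real fraction scanned is 1.0 regardless -- a 10x error here, and
# worse still if selectivity was also underestimated.
```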