On Fri, Jun 5, 2020 at 2:30 PM Pavel Stehule <pavel.steh...@gmail.com> wrote:
>
> On Fri, Jun 5, 2020 at 8:19 AM David Rowley <dgrowle...@gmail.com> wrote:
>>
>> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1...@gmail.com> wrote:
>> > The one-line fix describes the exact idea in my mind:
>> >
>> > +++ b/src/backend/optimizer/path/costsize.c
>> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>> >
>> >     cpu_run_cost += cpu_per_tuple * tuples_fetched;
>> >
>> > +  /*
>> > +   * To make the planner more robust against inaccurate statistics,
>> > +   * add an extra cost per qpqual, so that the fewer qpquals a path
>> > +   * has, the lower its cost.
>> > +   */
>> > +  cpu_run_cost += 0.01 * list_length(qpquals);
>> > +
>> >
>> > This change does fix the issue above, but will it make some other cases
>> > worse? My answer is no, based on my current knowledge, and this is the
>> > most important point I want advice on. The main impact I can see is: it
>> > not only changes the cost difference between (a, b) and (a, c), it also
>> > changes the cost difference between an index scan on (a, c) and a seq
>> > scan. However, the cost difference between an index scan and a seq scan
>> > is huge in practice, so I don't think this impact is harmful.
>>
>> Didn't that idea already get shot down in the final paragraph of [1]?
>>
>> I understand that you wish to increase the cost by some seemingly
>> innocent constant to fix your problem case. Here are my thoughts about
>> that: telling lies is not really that easy to pull off. Bad liars think
>> it's easy and good ones know it's hard. The problem is that the lies can
>> start small, but at some point the future you must fashion some more
>> lies to account for your initial lie. Rinse and repeat that a few times
>> and before you know it, your course is set well away from the point of
>> truth. I feel the part about "rinse and repeat" applies reasonably well
>> to how join costing works.
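To make that compounding concrete, here is a toy model (pure illustration, not PostgreSQL code; the row counts, costs, and function are made up): a constant added to an inner scan's cost gets multiplied by the outer row count at every nested-loop level, so the absolute error grows with join depth.

```python
# Toy sketch of how a small constant cost "lie" compounds with join
# depth.  Illustration only -- this is not how costsize.c computes costs.

def nestloop_chain_cost(outer_rows, inner_scan_cost, depth):
    """Cost of `depth` stacked nested loops, where each level rescans
    the level below once per outer row."""
    cost = inner_scan_cost
    for _ in range(depth):
        cost = outer_rows * cost
    return cost

TRUE_SCAN = 100.0                   # hypothetical true inner scan cost
FUDGED_SCAN = TRUE_SCAN + 0.01 * 3  # + 0.01 per qpqual, as in the patch

for depth in (1, 2, 3):
    error = (nestloop_chain_cost(50, FUDGED_SCAN, depth)
             - nestloop_chain_cost(50, TRUE_SCAN, depth))
    print(f"depth {depth}: absolute cost error = {error:.2f}")
```

With 50 outer rows per level, a 0.03 nudge on the base scan becomes an error of 1.50 at one join level, 75.00 at two, and 3750.00 at three.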
>> The lie is likely to be amplified as the join level gets deeper.
>>
>> I think you need to think of a more generic solution and propose that
>> instead. There are plenty of other quirks in the planner that can cause
>> suffering due to inaccurate or non-existent statistics. For example,
>> because we multiply individual selectivity estimates, having a few
>> correlated columns in a join condition can cause the number of join
>> rows to be underestimated. Subsequent joins can then end up making bad
>> choices of join operator based on those inaccurate row estimates.
>> There's also a problem with WHERE <x> ORDER BY col LIMIT n sometimes
>> choosing an index that provides pre-sorted input to the ORDER BY but
>> cannot use <x> as an indexable condition. We don't record any stats to
>> make better choices there; maybe we should, but either way, we're
>> taking a big risk, as all the rows matching <x> might be right at the
>> end of the index and we might need to scan the entire thing before
>> hitting the LIMIT. For now, we just assume a completely even
>> distribution of rows, i.e. if there are 50 rows estimated in the path
>> and the limit is for 5 rows, then we assume we need to read 10% of
>> those before finding all the ones we need. In reality, everything
>> matching <x> might be 95% of the way through the index and we could end
>> up reading 100% of the rows. That particular problem is caused not just
>> by the uneven distribution of rows in the index, but also by
>> selectivity underestimation.
>>
>> I'd more recently wondered if we shouldn't have some sort of "risk"
>> factor in the cost model. I really don't have ideas on how exactly we
>> would calculate the risk factor in all cases, but in this case, say the
>> risk factor always starts out as 1. We could multiply that risk factor
>> by some constant > 1 each time we find another index filter qual.
>> add_path() could then prefer lower-risk paths when the costs are
>> similar.
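For what it's worth, that risk-factor idea might look something like the rough sketch below. All of the names, constants, and the tie-breaking rule are mine, made up for illustration; add_path() in PostgreSQL does not do anything like this today.

```python
# Rough sketch of a per-path "risk" multiplier (illustration only).

RISK_PER_FILTER_QUAL = 1.2  # made-up constant > 1, per index filter qual
COST_FUZZ = 1.1             # treat costs within 10% as "similar"

def path_risk(n_index_filter_quals):
    """Risk starts at 1 and is multiplied once per index filter qual."""
    risk = 1.0
    for _ in range(n_index_filter_quals):
        risk *= RISK_PER_FILTER_QUAL
    return risk

def prefer(path_a, path_b):
    """Pick between two (cost, risk) pairs: cheaper wins, unless the
    costs are similar, in which case the lower-risk path wins."""
    (cost_a, risk_a), (cost_b, risk_b) = path_a, path_b
    if max(cost_a, cost_b) <= COST_FUZZ * min(cost_a, cost_b):
        return path_a if risk_a <= risk_b else path_b
    return path_a if cost_a < cost_b else path_b

# Index (a, b): no filter quals.  Index (a, c): one filter qual, and
# marginally cheaper due to an inaccurate estimate.
ab = (100.0, path_risk(0))
ac = (98.0, path_risk(1))
print(prefer(ab, ac))  # costs within 10%, so the lower-risk (a, b) wins
```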
>> Unsure what the exact add_path() logic would be; perhaps a GUC would
>> need to assist with the decision there. Likewise, with NestedLoopPaths
>> that have a large number of join quals, the risk factor could go up a
>> bit so that we give stronger consideration to hash or merge joins
>> instead.
>
> I thought about these ideas too. And I am not alone.
>
> https://hal.archives-ouvertes.fr/hal-01316823/document
>

Thanks for the document. After checking it, it looks more like Oracle's
statistics feedback [1]. Hope we can have it someday :)

[1] https://blogs.oracle.com/optimizer/cardinality-feedback

-- 
Best Regards
Andy Fan
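PS: David's LIMIT example can be put into numbers with a trivial sketch. This only mirrors the even-distribution assumption he describes, not the planner's actual code:

```python
# The even-distribution assumption for LIMIT: if `limit_rows` of an
# estimated `est_rows` are needed, assume we scan limit/est of the
# input.  (Illustration of the assumption, not PostgreSQL source.)

def assumed_fraction_scanned(est_rows, limit_rows):
    return min(1.0, limit_rows / est_rows)

# David's numbers: 50 estimated rows, LIMIT 5 -> assume 10% scanned.
print(assumed_fraction_scanned(50, 5))  # 0.1

# Worst case: all matching rows sit at the very end of the index, so
# the real fraction scanned is 1.0 regardless -- a 10x error here, and
# worse still if selectivity was also underestimated.
```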