Hi! Thank you for your research; I'm sure it will help me fix the problem of calculating selectivity faster :) I'm sorry I didn't answer right away. To be honest, my schedule was full of urgent matters at work, so I didn't have enough time to work on this patch properly.

The optimizer will itself do a limited form of "normalizing to CNF".
Are you familiar with extract_restriction_or_clauses(), from
orclauses.c? Comments above the function have an example of how this
can work:

  * Although a join clause must reference multiple relations overall,
  * an OR of ANDs clause might contain sub-clauses that reference just one
  * relation and can be used to build a restriction clause for that rel.
  * For example consider
  *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45));
  * We can transform this into
  *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
  *          AND (a.x = 42 OR a.x = 44)
  *          AND (b.y = 43 OR b.z = 45);
  * which allows the latter clauses to be applied during the scans of a and b,
  * perhaps as index qualifications, and in any case reducing the number of
  * rows arriving at the join.  In essence this is a partial transformation to
  * CNF (AND of ORs format).  It is not complete, however, because we do not
  * unravel the original OR --- doing so would usually bloat the qualification
  * expression to little gain.
This is an interesting feature. I hadn't noticed this function before; I had studied consider_new_or_clause(), which is called from there, many times. As far as I know, a selectivity calculation happens there, and as far as I remember I tried calling it after my transformation, but unfortunately it didn't solve my problem with calculating selectivity. I'll look at it again; maybe I can find something I missed.
Of course this immediately makes me wonder: shouldn't your patch be
able to perform an additional transformation here? You know, by
transforming "a.x = 42 OR a.x = 44" into "a IN (42, 44)"? Although I
haven't checked for myself, I assume that this doesn't happen right
now, since your patch currently performs all of its transformations
during parsing.
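
To make that concrete, here is a sketch of what the combined result could
look like (the join condition and table definitions are assumed purely for
illustration; this is what the final qual might be, not what either
transformation currently produces):

    -- Starting qual:
    --   ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
    -- extract_restriction_or_clauses() derives the single-relation OR
    -- on a.x, which the patch could then rewrite as a SAOP:
    SELECT *
    FROM a JOIN b ON a.id = b.a_id       -- join condition assumed
    WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
      AND a.x = ANY (ARRAY[42, 44])      -- i.e. a.x IN (42, 44)
      AND (b.y = 43 OR b.z = 45);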

I also noticed that the same comment block goes on to say something
about "clauselist_selectivity's inability to recognize redundant
conditions". Perhaps that is relevant to the problems you were having
with selectivity estimation, back when the code was in
preprocess_qual_conditions() instead? I have no reason to believe that
there should be any redundancy left behind by your transformation, so
this is just one possibility to consider.
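
To illustrate the redundancy issue (a hedged example, assuming a table a
with an integer column x): clauselist_selectivity() doesn't notice when one
qual is implied by another, so it multiplies the selectivities as if they
were independent, underestimating the result:

    -- Estimated as roughly sel(x = 42) * sel(x = 42 OR x = 44),
    -- rather than just sel(x = 42), because the redundancy between
    -- the two quals isn't recognized:
    EXPLAIN SELECT * FROM a
    WHERE a.x = 42
      AND (a.x = 42 OR a.x = 44);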
Separately, the commit message of commit 25a9e54d2d says something
about how the planner builds RestrictInfos, which seems
possibly-relevant. That commit enhanced extended statistics for OR
clauses, so the relevant paragraph describes a limitation of extended
statistics with OR clauses specifically. I'm just guessing, but it
still seems like it might be relevant to the problem you ran into with
selectivity estimation. Another possibility to consider.
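
For reference, the extended statistics in question are the MCV kind (a
minimal sketch, assuming a table t with correlated columns x and y):

    CREATE STATISTICS t_mcv (mcv) ON x, y FROM t;
    ANALYZE t;
    -- Since 25a9e54d2d, the multi-column MCV list can also improve
    -- estimates for OR clauses such as:
    EXPLAIN SELECT * FROM t WHERE x = 42 OR y = 43;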

I understood what is said about AND clauses in this comment. It seems to me that AND clauses are stored as a list, (BoolExpr *) expr->args -> (RestrictInfo *) clauseA -> (RestrictInfo *) clauseB, while OR clauses are stored as (BoolExpr *) expr->orclause -> (RestrictInfo *) clauseA -> (RestrictInfo *) clauseB.

As I understand it, selectivity is calculated for each expression. But I'll explore it more deeply, because I think this place may hold the answer to what's wrong with the selectivity calculation in my patch.

BTW, I sometimes use rr to help improve my understanding of the planner:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework
The planner has particularly complicated control flow, which has
unique challenges -- just knowing where to begin can be difficult
(unlike most other areas). I find that setting watchpoints to see when
and where the planner modifies state using rr is far more useful than
it would be with regular GDB. Once I record a query, I find that I can
"map out" what happens in the planner relatively easily.
Thank you for sharing this resource! I didn't know about it before, and it will definitely make it easier for me to understand the optimizer.

I understand what you mean; I studied the optimizer in a similar way, stepping through it with GDB and reading the comments and code in PostgreSQL. That approach is complicated, and I didn't always correctly understand what a given variable was doing at a particular point, which created some difficulties for me.

So, thank you for the link!

Many interesting cases won't get SAOP transformation from the patch,
simply because of the or_transform_limit GUC's default of 500. I don't
think that that design makes too much sense. It made more sense back
when the focus was on expression evaluation overhead. But that's only
one of the benefits that we now expect from the patch, right? So it
seems like something that should be revisited soon.

I'm not suggesting that there is no need for some kind of limit. But
it seems like a set of heuristics might be a better approach. Although
I would like to get a better sense of the costs of the transformation
before saying much more.

Yes, this may be revised in the future, after some further transformations. Initially, I was solving the problem described in [0], and after the testing described in [1], I came to the conclusion that 500 is the ideal value for or_transform_limit.
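
For anyone who wants to try it, the effect can be observed like this (a
sketch; or_transform_limit is a GUC added by the patch, not part of core
PostgreSQL, and the expected plan output is my assumption):

    -- With the patch applied:
    SET or_transform_limit = 0;  -- transform regardless of the number of ORs
    EXPLAIN SELECT * FROM t WHERE x = 1 OR x = 2 OR x = 3;
    -- The qual should then appear as: x = ANY ('{1,2,3}'::integer[])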

[0] https://www.postgresql.org/message-id/919bfbcb-f812-758d-d687-71f89f0d9a68%40postgrespro.ru

[1] https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru

--
Regards,
Alena Rybakina
Postgres Professional


