[GitHub] [arrow-datafusion] pjmore commented on issue #1972: DataFusion Optimizer framework discussion

GitBox Wed, 23 Mar 2022 20:55:16 -0700


pjmore commented on issue #1972:
URL: 
https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1077041378

TLDR: egg optimizer struggles with some things which have many
representations and suffers due to poor rewrite scheduling. Has some nice
properties, likely able to implement cascades based framework on top of egg.
Egg might be better suited to planner/physical operating like cascades does,
lowering the logical plan into physical operators.

I've been researching cascades a bit recently and here are some general
thoughts on optimizer stuff. This is kind of off the top of my head with
regards to the egg stuff so it might be disorganized since I haven't thought
about it for a bit.

1. The current egg based optimizer works alright, but suffers heavily in
certain situations due to rule scheduling issues/the number of potential
queries exploding. Structured rule application like cascade does might help
with that.
2. In a related issue egg does not handle large expressions/trees well due
to the large number of potential representations. This is apparent when the
filter predicate in TPCH 19 which is a fairly deep and wide expression. In
release mode it can take ~5 only focused on optimizing the filter to properly
transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A AND (B OR
C OR D). Egg does not record which rules only need to be applied once which can
cause a rule like A OR B -> B OR A to be run on already processed nodes doing
nothing. This is mostly an issue when rules that are going to match a lot and
have to match over a large number of equivalent representations. Extremely
commonly a2. In a related issue egg does not handle large expressions/trees
well due to the large number of potential representations. This is apparent
when the filter predicate in TPCH 19 which is a fairly deep and wide
expression. In release mode it can take ~5 only focused on optimizing the filt
er to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D)
-> A AND (B OR C OR D). Egg does not record which rules only need to be
applied once which can cause a rule like A OR B -> B OR A to be run on already
processed nodes doing nothing. This is mostly an issue when rules that are
going to match a lot and have to match over a large number of equivalent
representations. Extremely commonly applied rules that are almost always
positive such the boolean simplification in TPCH 19 should probably exist as
specialized optimization rules as they exist in their current form as they are
likely to be faster. pplied rules that are almost always positive such the
boolean simplification in TPCH 19 should probably exist as specialized
optimization rules as they exist in their current form as they are likely to be
faster.
3. Egg provides a lot of the properties that are desirable for a query
optimizer, namely it already has a fast memo data structure and provides cost
memoized cost function calculation. I'm unsure how well something like join
order could would run within egg though.
4. Egg based optimizer might be better suited to physical optimization or as
a planner where table statistics can be considered.
5. I think that the egg based optimizer and cascades actually a number of
similarities, namely considering the memo datastructure that cascades uses.

Here is an image from [tikdb's article on their cascade
optimizer](https://developpaper.com/unveiling-tidb-new-optimizer-analysis-of-cascade-planner-principle)
that demonstrates general idea of a memo data structure.

![memo
explaination](https://imgs.developpaper.com/imgs/2310010508-c90cfa59d4bf9972_articlex.jpg)

And from [egg's website](https://egraphs-good.github.io/), this image
doesn't show up well on dark themed github so check it out on the website if
you can't see it properly.

![Egg's demo](https://egraphs-good.github.io/egraphs.svg)

The dotted lines around expression are effectively the same thing as
cascade's groups. I think egg might be slightly better for this as groups are
automatically merged together if they share equivalent terms, I'm not sure if
this is something that is typically done in cascades based optimizer.

Another [example from the CMU advanced database course](
https://15721.courses.cs.cmu.edu/spring2017/slides/15-optimizer2.pdf). The
relevant slides are around slide 34.

You can see that an egraph, which egg uses, and the memo that cascade uses
have a bunch of similar properties.
- Single node can be represented in many different ways.
- Parent <-> child relation ship is loosely coupled. Parent has group as
child which can have multiple different representations.

I think that it is possible to build a cascade optimizer using egg as a fast
memo structure. If we are going to build a more advanced optimization framework
I think it makes sense to focus a bit more on the physical optimizations as
nearly all of the logical optimizations could be done on the physical
expressions instead, and optimizations like join order, which require table
statistics/sampling, can cause order of magnitudes of difference in query
performance. I think if this is a route that people are interested in pursuing
this makes sense to develop a egg based planner which lowers the logical plan
into a physical one, but I'd love to hear other people's thoughts on this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] pjmore commented on issue #1972: DataFusion Optimizer framework discussion

Reply via email to