pjmore commented on issue #1972:
URL: 
https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1077041378


   TLDR: egg optimizer struggles with some things which have many 
representations and suffers due to poor rewrite scheduling. Has some nice 
properties, likely able to implement cascades based framework on top of egg. 
Egg might be better suited to planner/physical operating like cascades does, 
lowering the logical plan into physical operators. 
   
   I've been researching cascades a bit recently and here are some general 
thoughts on optimizer stuff. This is kind of off the top of my head with 
regards to the egg stuff so it might be disorganized since I haven't thought 
about it for a bit.
   
   1. The current egg based optimizer works alright, but suffers heavily in 
certain situations due to rule scheduling issues/the number of potential 
queries exploding. Structured rule application like cascade does might help 
with that.
   2.  In a related issue egg does not handle large expressions/trees well due 
to the large number of potential representations. This is apparent when the 
filter predicate in TPCH 19 which is a fairly deep and wide expression. In 
release mode it can take ~5 only focused on optimizing the filter to properly 
transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A  AND (B OR 
C OR D). Egg does not record which rules only need to be applied once which can 
cause a rule like A OR B -> B OR A to be run on already processed nodes doing 
nothing. This is mostly an issue when rules that are going to match a lot and 
have to match over a large number of equivalent representations. Extremely 
commonly a2.  In a related issue egg does not handle large expressions/trees 
well due to the large number of potential representations. This is apparent 
when the filter predicate in TPCH 19 which is a fairly deep and wide 
expression. In release mode it can take ~5 only focused on optimizing the filt
 er to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D) 
-> A  AND (B OR C OR D). Egg does not record which rules only need to be 
applied once which can cause a rule like A OR B -> B OR A to be run on already 
processed nodes doing nothing. This is mostly an issue when rules that are 
going to match a lot and have to match over a large number of equivalent 
representations. Extremely commonly applied rules that are almost always 
positive such the boolean simplification in TPCH 19 should probably exist as 
specialized optimization rules as they exist in their current form as they are 
likely to be faster.  pplied rules that are almost always positive such the 
boolean simplification in TPCH 19 should probably exist as specialized 
optimization rules as they exist in their current form as they are likely to be 
faster.  
   3. Egg provides a lot of the properties that are desirable for a query 
optimizer, namely it already has a fast memo data structure and provides cost 
memoized cost function calculation. I'm unsure how well something like join 
order could would run within egg though.
   4. Egg based optimizer might be better suited to physical optimization or as 
a planner where table statistics can be considered.
   5. I think that the egg based optimizer and cascades actually a number of 
similarities, namely considering the memo datastructure that cascades uses.
   
    Here is an image from [tikdb's article on their cascade 
optimizer](https://developpaper.com/unveiling-tidb-new-optimizer-analysis-of-cascade-planner-principle)
 that demonstrates general idea of a memo data structure.
   
   ![memo 
explaination](https://imgs.developpaper.com/imgs/2310010508-c90cfa59d4bf9972_articlex.jpg)
   
   And from [egg's website](https://egraphs-good.github.io/), this image 
doesn't show up well on dark themed github so check it out on the website if 
you can't see it properly.
   
   ![Egg's demo](https://egraphs-good.github.io/egraphs.svg)
   
   The dotted lines around expression are effectively the same thing as 
cascade's groups. I think egg might be slightly better for this as groups are 
automatically merged together if they share equivalent terms, I'm not sure if 
this is something that is typically done in cascades based optimizer. 
   
   Another [example from the CMU advanced database course]( 
https://15721.courses.cs.cmu.edu/spring2017/slides/15-optimizer2.pdf). The 
relevant slides are around slide 34.
   
   You can see that an egraph, which egg uses, and the memo that cascade uses 
have a bunch of similar properties. 
   - Single node can be represented in many different ways.
   - Parent <-> child relation ship is loosely coupled. Parent has group as 
child which can have multiple different representations.
   
   I think that it is possible to build a cascade optimizer using egg as a fast 
memo structure. If we are going to build a more advanced optimization framework 
 I think it makes sense to focus a bit more on the physical optimizations as 
nearly all of the logical optimizations could be done on the physical 
expressions instead, and optimizations like join order, which require table 
statistics/sampling, can cause order of magnitudes of difference in query 
performance. I think if this is a route that people are interested in pursuing 
this makes sense to develop a egg based planner which lowers the logical plan 
into a physical one, but I'd love to hear other people's thoughts on this.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to