alamb commented on issue #116:
URL: 
https://github.com/apache/arrow-datafusion/issues/116#issuecomment-826831711


   Comment from Daniël Heres (Dandandan) @ 2020-11-24T18:56:34.496+0000:
   A summary of some points I found when coming up with a new optimizer framework for Polars (https://github.com/ritchie46/polars) that uses some ideas from the Spark Catalyst optimizer [1].
   
   Some points I liked about the design from Spark Catalyst:
    * Recursion over the tree should only be needed once, not once per optimization pass, so every rule can be written as simple pattern matching. This can be captured in some kind of framework (a minimal sketch follows this list).
    * The set of optimization rules should run a number of times until a fixed point is reached, i.e. until the logical plan no longer changes. For this to terminate, it is important that every rule makes the tree "smaller" in some sense: it should either reduce the number of nodes or make the plan "cheaper".
   
   The optimizer I made for Polars is in very early stages, but I did some design work and first iterations to come up with a first version of an optimization framework.
    * In Rust, if the elements of your tree are Boxed, you need to clone part of the tree whenever you want to mutate part of it, so simply recursing the tree and mutating it in place, as you would in Scala, is not possible without changes (see the sketch below). You could wrap everything in something like Arc<RefCell<...>> or Rc<RefCell<...>>, but this has a higher overhead. You could also generate a whole new tree every iteration, but this is quite a bit slower, especially if you do it once per optimization rule, where the cost can grow a lot!
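   A small sketch of that ownership problem, with a hypothetical Plan type: replacing one leaf forces you to rebuild, and therefore clone, every node on the path down to it.

```rust
// Hypothetical Box-ed plan type, for illustration only.
#[derive(Clone)]
enum Plan {
    Scan(String),
    Filter { input: Box<Plan>, predicate: String },
}

// You cannot mutate the leaf through a shared reference, so the whole
// path to it is rebuilt, cloning the untouched fields along the way.
fn replace_scan(plan: &Plan, new_table: &str) -> Plan {
    match plan {
        Plan::Scan(_) => Plan::Scan(new_table.to_string()),
        Plan::Filter { input, predicate } => Plan::Filter {
            input: Box::new(replace_scan(input, new_table)),
            predicate: predicate.clone(), // cloned despite being unchanged
        },
    }
}
```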
   
   Some points where this first iteration differs from the optimizer in Spark:
    * It uses a tree backed by an arena to efficiently allocate the tree's data and mutate it. This means that if you don't generate new nodes, you don't even allocate; you just point some indexes at different nodes (sketched below). A tree in an arena is also very nice for data locality.
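   A minimal arena sketch (hypothetical types, not the actual Polars implementation): nodes live in a Vec and refer to each other by index, so a rewrite overwrites a slot or redirects an index instead of rebuilding the tree.

```rust
// Node ids are plain indexes into the arena's Vec.
#[derive(Copy, Clone)]
struct NodeId(usize);

enum ANode {
    Literal(i64),
    Add(NodeId, NodeId),
}

struct Arena {
    nodes: Vec<ANode>,
}

impl Arena {
    fn push(&mut self, node: ANode) -> NodeId {
        self.nodes.push(node);
        NodeId(self.nodes.len() - 1)
    }

    // Fold Add(Literal, Literal) in place: no new tree is built, the
    // slot is overwritten and the old children simply become unused.
    fn fold_add(&mut self, id: NodeId) {
        let (l, r) = match self.nodes[id.0] {
            ANode::Add(l, r) => (l, r), // NodeId is Copy, so no clone
            _ => return,
        };
        let folded = match (&self.nodes[l.0], &self.nodes[r.0]) {
            (ANode::Literal(a), ANode::Literal(b)) => Some(a + b),
            _ => None,
        };
        if let Some(v) = folded {
            self.nodes[id.0] = ANode::Literal(v);
        }
    }
}
```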
   
   The arena brings a bit more unsafety, as you work with plain indexes into the arena rather than references, and nothing checks that an index points at the node you intended.
    * It uses manual recursion (with a pre-allocated stack) instead of the call stack to recurse. This is a bit uglier, but if you only write it once it can be worth it for performance (see the sketch below).
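   A sketch of such a traversal, reusing the hypothetical arena types above; the stack is an ordinary Vec that a real optimizer would allocate once and reuse across passes.

```rust
// Depth-first traversal driven by an explicit stack instead of the
// call stack; `on_node` is called for every reachable node.
fn visit_all(arena: &Arena, root: NodeId, mut on_node: impl FnMut(NodeId)) {
    // Pre-allocated; in a real optimizer this would live across passes.
    let mut stack: Vec<NodeId> = Vec::with_capacity(64);
    stack.push(root);
    while let Some(id) = stack.pop() {
        on_node(id);
        if let ANode::Add(l, r) = arena.nodes[id.0] {
            stack.push(l);
            stack.push(r);
        }
    }
}
```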
   
    * In Catalyst, a single optimization rule runs until it reaches a fixed point, and only then does the optimizer move on to the next rule. In the Polars version, all rules run in the inner loop until the whole optimization reaches a fixed point (sketched below). The benefit is that you don't have to make sure the rules run in the right order. It can also produce _more_ optimizations, as e.g. a rule that evaluates some expressions can have an effect on a rule that propagates nulls, which can have an effect on predicate pushdown, etc.
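   A sketch of that inner loop, with a hypothetical Rule trait: each pass applies every rule once, and the loop repeats until no rule changed anything, so rules can freely enable one another.

```rust
// Hypothetical rule interface; `apply` reports whether it changed the plan.
trait Rule {
    fn apply(&self, plan: &mut Plan) -> bool;
}

fn optimize_plan(plan: &mut Plan, rules: &[Box<dyn Rule>]) {
    loop {
        let mut changed = false;
        for rule in rules {
            // Unlike Catalyst, no rule is driven to its own fixed point
            // first; one application each, then the whole set repeats.
            changed |= rule.apply(plan);
        }
        if !changed {
            break; // global fixed point: no rule fired in a full pass
        }
    }
}
```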
   
    * In Catalyst you have to note whether your optimization needs to recurse top-down or bottom-up (bottom-up, for example, is more useful for constant folding, as otherwise you would need lots of iterations to fold a complex constant expression). In this optimizer, both happen automatically, because a node is optimized again right after it changes. This means the optimizer needs fewer iterations in general, and you need to think less about recursion order (see the sketch below).
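   A sketch of that idea on the toy Expr type from earlier: after the children are rewritten bottom-up, the rule is re-applied to the current node until it stops firing, so a rewrite that exposes another rewrite is picked up in the same pass rather than in the next whole-tree iteration.

```rust
// Bottom-up rewrite that immediately re-optimizes a node after it changes.
fn rewrite_until_stable(expr: &Expr, rule: &dyn Fn(&Expr) -> Option<Expr>) -> Expr {
    // First optimize the children (bottom-up)...
    let mut current = match expr {
        Expr::Add(l, r) => Expr::Add(
            Box::new(rewrite_until_stable(l, rule)),
            Box::new(rewrite_until_stable(r, rule)),
        ),
        other => other.clone(),
    };
    // ...then keep applying the rule here until it no longer fires,
    // instead of waiting for the next iteration over the whole tree.
    while let Some(next) = rule(&current) {
        current = next;
    }
    current
}
```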
   
   TODO for the design in Polars:
    * Some optimization rules can be more expensive than others. It might make sense to keep track of each node individually to check whether it changed.
    * Different optimization rules need different input, like the schema/type of a column, etc.
    * Some optimizations need to keep track of state; this is not yet handled in this optimizer.
   
   [1] http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

