peter-toth commented on issue #10426:
URL: https://github.com/apache/datafusion/issues/10426#issuecomment-2105664520

   > Are there any potential issues with simply using the existing `Hash` 
implementation of `Expr` to create `HashSet`s?
   > 
   > Serveral other optimization passes use string names as keys for 
expressions in data structures. I am wondering if any of these could also be 
refactored to simply use `HashSet<Expr>` or `HashSet<&Expr>`
   
   Thanks for these references @erratic-pattern.
   
   ### Background and general thoughts:
   
   I'm only familiar with CSE code and in its case unfortunately non-unique 
stringified expression were used as keys of the map that stores the occurrance 
counts. This bug was introduced in 
https://github.com/apache/datafusion/pull/9871 and reverted in 
https://github.com/apache/datafusion/pull/10396. The issue with these colliding 
string keys are explained here in details: 
https://github.com/apache/datafusion/pull/10333#discussion_r1595242837.
   
   If you think stringified expression can cause similar collision issues at 
other places in DataFusion then we should definitely fix it.
   
   ### Some thouths about CSE:
   
   After https://github.com/apache/datafusion/pull/10396 we still use 
stringified expressions as keys (`Identifier`), but the strings we use encode 
whole expression subtrees. This is far from optimal and this ticket / my work 
in progress change would like to help with that.
   
   In case of CSE we could use `Expr` as keys of the `ExprStats` map, but then 
we would need to clone `Expr`s when we fill up the `ExprStats` during the first 
traversal. This would be particulary costly in CSE because we need to store not 
only the counts for all top level expressions, but the counts of all their 
descendant subexpressions.
   
   We could also use `&Expr` as keys (and so we didn't need to clone the 
expressions), but there is a problem here. The current `TreeNode::apply()` / 
`TreeNode::visit()` APIs aren't capable to fill up such a `HashMap<&Expr, ...>` 
map. This is because of restricted `TreeNode` reference lifetimes used in 
closures / `TreeNodeVisitor` methods.
   I.e. this currently doesn't work:
   ```
   let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
   let mut m = HashMap::new();
   e.apply(|e| {
       *m.entry(e).or_insert(0) += 1;
       Ok(TreeNodeRecursion::Continue)
   });
   
   println!("m: {:#?}", m);
   ```
   This issue can be solved by adding new `TreeNode` APIs or fixing the current 
ones.
   I have a WIP commit here: 
https://github.com/peter-toth/arrow-datafusion/commit/e8447996462ae4710b573c7088bc6d8b1e586cfb
 that adds  `TreeNode::apply_ref()` / `TreeNode::visit_ref()`.
   Using `apply_ref()` in the above example would make it work, but I haven't 
opened a PR yet as there are a few things to consider:
   a. We don't really want to add any more new APIs (especially if their 
puspose is similar to existing ones).
   b. We can't change the lifetimes of references in the current `apply()` / 
`visit()` easily. This is because some `TreeNode` implementations are not 
compatible with that. (E.g. `DynTreeNode` doesn't have a method to get 
references to its children, `LogicalPlan` creates temprorary objects in its 
`map_subqueries()`, ...).
   
   Despite the fact that my WIP commit adds new APIs, I would prefer and lean 
towards option b.. But since I'm only aware of this ticket that requires this 
change to the APIs, I haven't opened the PR yet.
   
   Now there is another thing to consider in case of CSE whether we want use 
`&Expr` as keys of `ExprStats`. The current CSE algorithm, that was added by 
the original author of CSE to DataFusion (and not myself), is very clever and 
does the following:
   In the first traversal it:
   - Creates a mapping for each top level expression (this is called `IdArray`) 
that stores the preorder visit index of a node to an `Identifier` (of a 
subexpression tree).
   - And also creates a map (this is called `ExprStats`) that contains the 
`Identifier` -> count stats gathered for all top level expressions and their 
subexpressions.
   
   This is very nice, because the second rewriting traversal can use the 
preorder visit index again to look up the identifier first and then the count 
from the `ExprStats` map. Providing that an identifier is small, this can be 
much faster then using `&Expr` as keys because:
   - Computing `hash()` of a n`&Expr` (instead of using preorder index in the 
second traversal) is costly if the expression is deep and contains lots of 
indirections (`Box`es).
   - When we generate the identifiers in the first traversal we can use the 
traversal's bottom-up phase to build up identifiers from the current node and 
the identifiers of the node's children very effectively.
   
   In my work in progress change for this issue I would like to finalize the:
   - `TreeNode` API changes required (maybe open a separate PR for it)
   - and replce the current String based identifier to a `(u64, &Expr)` like 
tuple/struct.
     The first item contains a precomputed hash of the identifier. (As I 
mentioned we can use the bottom-up phase of the first traversal to compute that 
effectively as this logic is already implmeneted in the CSE algorithm.) The 
overriden `hash()` of the struct should return this precomputed hash.
     The second item is a `&Expr` that can be used in the struct's `eq()` 
implementation in case of hash collision.
   
   **Back to the original question of using `HashSet<Expr, ...>` or 
`HashSet<&Expr, ...>` in maps:**
   
   I think both are accepable but CSE is special as the maps need to store all 
the descendant subexpressions as well and the impemented CSE algorithm seems to 
offer a way to implement a better identifier than just a simple `&Expr`.
   I don't know the other referenced usecases but if collision of string names 
can happen there then we should definitely fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to