peter-toth commented on issue #10426: URL: https://github.com/apache/datafusion/issues/10426#issuecomment-2105664520
> Are there any potential issues with simply using the existing `Hash` implementation of `Expr` to create `HashSet`s? > > Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use `HashSet<Expr>` or `HashSet<&Expr>` Thanks for these references @erratic-pattern. ### Background and general thoughts: I'm only familiar with CSE code and in its case unfortunately non-unique stringified expression were used as keys of the map that stores the occurrance counts. This bug was introduced in https://github.com/apache/datafusion/pull/9871 and reverted in https://github.com/apache/datafusion/pull/10396. The issue with these colliding string keys are explained here in details: https://github.com/apache/datafusion/pull/10333#discussion_r1595242837. If you think stringified expression can cause similar collision issues at other places in DataFusion then we should definitely fix it. ### Some thouths about CSE: After https://github.com/apache/datafusion/pull/10396 we still use stringified expressions as keys (`Identifier`), but the strings we use encode whole expression subtrees. This is far from optimal and this ticket / my work in progress change would like to help with that. In case of CSE we could use `Expr` as keys of the `ExprStats` map, but then we would need to clone `Expr`s when we fill up the `ExprStats` during the first traversal. This would be particulary costly in CSE because we need to store not only the counts for all top level expressions, but the counts of all their descendant subexpressions. We could also use `&Expr` as keys (and so we didn't need to clone the expressions), but there is a problem here. The current `TreeNode::apply()` / `TreeNode::visit()` APIs aren't capable to fill up such a `HashMap<&Expr, ...>` map. This is because of restricted `TreeNode` reference lifetimes used in closures / `TreeNodeVisitor` methods. I.e. this currently doesn't work: ``` let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c"))); let mut m = HashMap::new(); e.apply(|e| { *m.entry(e).or_insert(0) += 1; Ok(TreeNodeRecursion::Continue) }); println!("m: {:#?}", m); ``` This issue can be solved by adding new `TreeNode` APIs or fixing the current ones. I have a WIP commit here: https://github.com/peter-toth/arrow-datafusion/commit/e8447996462ae4710b573c7088bc6d8b1e586cfb that adds `TreeNode::apply_ref()` / `TreeNode::visit_ref()`. Using `apply_ref()` in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider: a. We don't really want to add any more new APIs (especially if their puspose is similar to existing ones). b. We can't change the lifetimes of references in the current `apply()` / `visit()` easily. This is because some `TreeNode` implementations are not compatible with that. (E.g. `DynTreeNode` doesn't have a method to get references to its children, `LogicalPlan` creates temprorary objects in its `map_subqueries()`, ...). Despite the fact that my WIP commit adds new APIs, I would prefer and lean towards option b.. But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet. Now there is another thing to consider in case of CSE whether we want use `&Expr` as keys of `ExprStats`. The current CSE algorithm, that was added by the original author of CSE to DataFusion (and not myself), is very clever and does the following: In the first traversal it: - Creates a mapping for each top level expression (this is called `IdArray`) that stores the preorder visit index of a node to an `Identifier` (of a subexpression tree). - And also creates a map (this is called `ExprStats`) that contains the `Identifier` -> count stats gathered for all top level expressions and their subexpressions. This is very nice, because the second rewriting traversal can use the preorder visit index again to look up the identifier first and then the count from the `ExprStats` map. Providing that an identifier is small, this can be much faster then using `&Expr` as keys because: - Computing `hash()` of a n`&Expr` (instead of using preorder index in the second traversal) is costly if the expression is deep and contains lots of indirections (`Box`es). - When we generate the identifiers in the first traversal we can use the traversal's bottom-up phase to build up identifiers from the current node and the identifiers of the node's children very effectively. In my work in progress change for this issue I would like to finalize the: - `TreeNode` API changes required (maybe open a separate PR for it) - and replce the current String based identifier to a `(u64, &Expr)` like tuple/struct. The first item contains a precomputed hash of the identifier. (As I mentioned we can use the bottom-up phase of the first traversal to compute that effectively as this logic is already implmeneted in the CSE algorithm.) The overriden `hash()` of the struct should return this precomputed hash. The second item is a `&Expr` that can be used in the struct's `eq()` implementation in case of hash collision. **Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>` in maps:** I think both are accepable but CSE is special as the maps need to store all the descendant subexpressions as well and the impemented CSE algorithm seems to offer a way to implement a better identifier than just a simple `&Expr`. I don't know the other referenced usecases but if collision of string names can happen there then we should definitely fix it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org