adriangb opened a new pull request, #18192: URL: https://github.com/apache/datafusion/pull/18192
This stemmed from wanting to deduplicate `DynamicFilterPhysicalExpr` which is essential for them to have a chance at working. I started by thinking we should add an `id` field to `DynamicFilterPhysicalExpr` specifically. But then I had the thought: what if I used the pointer address of the `Arc` as the id? This has several advantages: - Contains all of the changes to the protobuf ser/de code, no changes necessary to expressions themselves. - Works for all expressions: it's very common to have duplicates in a plan (that's the whole reason they're `Arc`'ed!) and we would currently use multiple times the memory when deserializing an `InList` expression because we make a deep copy for each `Arc`'ed reference. - Is very cheap to compute: I considered using `Hash` but not only is it a lot more code to derive it everywhere, it also requires that everything is hashable and could be very expensive to compute (again, imagine a huge `InList` expression). - Has the potential to reduce serialization cost: I did not make the change to keep the diff small so we still serialize duplicate data that we don't use. In theory we could keep some sort of `seen: HashSet<usize>` on the serialization side and only serialize each expression once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
