Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
findepi commented on issue #16677: URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3167152107 This PR introduces the proposed change for window functions: - https://github.com/apache/datafusion/pull/17081 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
findepi commented on issue #16677: URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3163290232 The error-proneness is very well exemplified by - https://github.com/apache/datafusion/pull/16625 introducing a bug - which is later fixed in https://github.com/apache/datafusion/pull/17065 The code author or the first PR was generally well-aware of existence of equals/hash_value contract on UDFs and so were the reviewers, yet the bug was made. Subtle but dangerous. To me it makes it clear that we really must fix the design and make it much harder to write incorrect code, even at the cost of downstream projects needing an update. cc @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
findepi commented on issue #16677: URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3106598537 I have defined sub tasks in this issue. I want to use `ScalarUDFImpl` as the design driver / place where review happens. The `AggregateUDFImpl` and `WindowUDFImpl` can be worked on independently. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
alamb commented on issue #16677: URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3097906043 I think this idea sounds great -- as long as it is documented (which I think we can given your description) it will be great -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
findepi commented on issue #16677: URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3097012088 https://github.com/apache/datafusion/pull/16842 is a POC how we can convert hand-written equals, hash_value implementations into `PartialEq` and `Hash` traits, which would be good preparatory step for the plan proposed above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
findepi commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3092338265
How we can solve this:
## Add two new traits
```rust
trait UdfHash {
fn udf_hash(&self) -> u64;
}
trait UdfEq {
fn udf_eq(&self, other: &dyn Any) -> bool;
}
```
with blanket implementations
```rust
impl UdfHash for T {
fn udf_hash(&self) -> u64 {
let hasher = &mut DefaultHasher::new();
self.type_id().hash(hasher);
self.hash(hasher);
hasher.finish()
}
}
impl UdfEq for T {
fn udf_eq(&self, other: &dyn Any) -> bool {
let Some(other) = other.downcast_ref::() else {
return false;
};
self == other
}
}
```
## Make `ScalarUDFImpl` depend on these
```rust
trait ScalarUDFImpl: Send + Sync + UdfHash + UdfEq + Any {
...
}
```
## Usage
```rust
#[derive(Debug, Clone, PartialEq, Hash)]
struct MyFunc {
state: SomeEquitableStateType,
}
impl ScalarUDFImpl for MyFunc {
// nothing eq/hash related needed here
}
```
As shown, the end result is that `ScalarUDFImpl` can be hashed and compared
for equality and all the implementor needs is `#[derive(PartialEq, Hash)]` to
make it work.
## Notes
### `UdfEq` accepts any `Any`
`UdfEq` accepts any `Any` so it's possible to call it with instances other
than `&dyn ScalarUDFImpl`.
I don't know how to avoid this.
In particular, this is not allowed in Rust:
```rust
// Error: cycle detected when computing the super predicates of
`ScalarUDFImpl` [E0391]
trait ScalarUDFImpl: PartialEq {
/* ... */
}
```
We can cover this with additional eq/hash methods directly in
`ScalarUDFImpl` trait that will have default implementation and will not need
to be reimplemented.
```rust
trait ScalarUDFImpl: Send + Sync + UdfHash + UdfEq + Any {
/* ... */
fn eq(&self, other: &dyn ScalarUDFImpl) -> bool {
self.udf_eq(other)
}
fn hash(&self) -> u64 {
self.udf_hash()
}
}
```
### Compiler errors don't make it obvious
As with any trait with blanket implementations (such as `Into`), when you
don't implement it yet, the compiler won't tell you how you should implement.
For example, if a function lacks `PartialEq` implementation, the compiler
will complain about `UdfEq` instead.
```rust
// Missing eq implementation
#[derive(Debug, Clone, /*PartialEq,*/ Hash)]
struct MyFunc {
safe: bool,
}
// Error: the trait bound `MyFunc: UdfEq` is not satisfied
impl ScalarUDFImpl for MyFunc {
/* ... */
}
```
It can be mitigated by a good documentation for the `UdfEq`, `UdfHash`
traits informing about blanket implementation and that they should not be
implemented directly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
findepi commented on issue #16677: URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3088759771 Worth noting - #16781 That PR adds missing equality implementation to many functions and improves default implementation to consider type_id and aliases. It well shows how "wrong by default" equality turned out to be a real problem in practice even within this code base. We don't know about other codebases, but for dbt it was a problem as well, leading to hard-to-understand query failures. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]
alamb commented on issue #16677: URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3035199021 🤔 That is a tricky business Maybe ScalarUDFImpl needs some way to have `eq` called on it too 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
