Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-08-08 Thread via GitHub


findepi commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3167152107

   This PR introduces the proposed change for window functions:
   
   - https://github.com/apache/datafusion/pull/17081
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-08-07 Thread via GitHub


findepi commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3163290232

   The error-proneness is very well exemplified by
   
   - https://github.com/apache/datafusion/pull/16625 introducing a bug
 - which is later fixed in https://github.com/apache/datafusion/pull/17065
   
   The code author or the first PR was generally well-aware of existence of 
equals/hash_value contract on UDFs and so were the reviewers, yet the bug was 
made. Subtle but dangerous.
   
   To me it makes it clear that we really must fix the design and make it much 
harder to write incorrect code, even at the cost of downstream projects needing 
an update.
   
   cc @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-07-23 Thread via GitHub


findepi commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3106598537

   I have defined sub tasks in this issue.
   I want to use `ScalarUDFImpl` as the design driver / place where review 
happens. The `AggregateUDFImpl` and `WindowUDFImpl` can be worked on 
independently. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-07-21 Thread via GitHub


alamb commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3097906043

   I think this idea sounds great -- as long as it is documented (which I think 
we can given your description) it will be great


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-07-21 Thread via GitHub


findepi commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3097012088

   https://github.com/apache/datafusion/pull/16842 is a POC how we can convert 
hand-written equals, hash_value implementations into `PartialEq` and `Hash` 
traits, which would be good preparatory step for the plan proposed above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-07-19 Thread via GitHub


findepi commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3092338265

   How we can solve this:
   
   ## Add two new traits
   
   ```rust
   trait UdfHash {
   fn udf_hash(&self) -> u64;
   }
   
   trait UdfEq {
   fn udf_eq(&self, other: &dyn Any) -> bool;
   }
   ```
   
   with blanket implementations
   
   ```rust
   impl UdfHash for T {
   fn udf_hash(&self) -> u64 {
   let hasher = &mut DefaultHasher::new();
   self.type_id().hash(hasher);
   self.hash(hasher);
   hasher.finish()
   }
   }
   
   impl UdfEq for T {
   fn udf_eq(&self, other: &dyn Any) -> bool {
   let Some(other) = other.downcast_ref::() else {
   return false;
   };
   self == other
   }
   }
   ```
   
   ## Make `ScalarUDFImpl` depend on these
   
   ```rust
   trait ScalarUDFImpl: Send + Sync + UdfHash + UdfEq + Any {
...
   }
   ```
   
   ## Usage
   
   ```rust
   #[derive(Debug, Clone, PartialEq, Hash)]
   struct MyFunc {
   state: SomeEquitableStateType,
   }
   
   impl ScalarUDFImpl for MyFunc {
  // nothing eq/hash related needed here
   }
   ```
   
   As shown, the end result is that `ScalarUDFImpl` can be hashed and compared 
for equality and all the implementor needs is `#[derive(PartialEq, Hash)]` to 
make it work.
   
   ## Notes
   
   ### `UdfEq` accepts any `Any`
   
   `UdfEq` accepts any `Any` so it's possible to call it with instances other 
than `&dyn ScalarUDFImpl`.
   I don't know how to avoid this.
   In particular, this is not allowed in Rust:
   ```rust
   // Error: cycle detected when computing the super predicates of 
`ScalarUDFImpl` [E0391]
   trait ScalarUDFImpl: PartialEq {
/* ... */
   }
   ```
   
   We can cover this with additional eq/hash methods directly in 
`ScalarUDFImpl` trait that will have default implementation and will not need 
to be reimplemented.
   
   ```rust
   trait ScalarUDFImpl: Send + Sync + UdfHash + UdfEq + Any {
   /* ... */
   
   fn eq(&self, other: &dyn ScalarUDFImpl) -> bool {
   self.udf_eq(other)
   }
   
   fn hash(&self) -> u64 {
   self.udf_hash()
   }
   }
   ```
   
   
   ### Compiler errors don't make it obvious
   
   As with any trait with blanket implementations (such as `Into`), when you 
don't implement it yet, the compiler won't tell you how you should implement.
   For example, if a function lacks `PartialEq` implementation, the compiler 
will complain about `UdfEq` instead.
   
   ```rust
   // Missing eq implementation
   #[derive(Debug, Clone, /*PartialEq,*/ Hash)]
   struct MyFunc {
   safe: bool,
   }
   
   // Error: the trait bound `MyFunc: UdfEq` is not satisfied
   impl ScalarUDFImpl for MyFunc {
   /* ... */
   }
   ```
   
   It can be mitigated by a good documentation for the `UdfEq`, `UdfHash` 
traits informing about blanket implementation and that they should not be 
implemented directly.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-07-18 Thread via GitHub


findepi commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3088759771

   Worth noting
   
   -  #16781 
   
   That PR adds missing equality implementation to many functions and improves 
default implementation to consider type_id and aliases. It well shows how 
"wrong by default" equality turned out to be a real problem in practice even 
within this code base. We don't know about other codebases, but for dbt it was 
a problem as well, leading to hard-to-understand query failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] ScalarUDFImpl::equals default implementation is error-prone [datafusion]

2025-07-04 Thread via GitHub


alamb commented on issue #16677:
URL: https://github.com/apache/datafusion/issues/16677#issuecomment-3035199021

   🤔 That is a tricky business
   
   Maybe ScalarUDFImpl needs some way to have `eq` called on it too 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]