zanmato1984 commented on issue #43768:
URL: https://github.com/apache/arrow/issues/43768#issuecomment-2301194348

   I guess your major concern is that counting a valid scalar only once (i.e., 
`this->count += scalar.is_valid`), as opposed to the actual batch length times 
(i.e., `this->count += scalar.is_valid * batch.length` as you proposed), would 
make the count less and emit wrong result given how this count is used:
   ```
   class ARROW_EXPORT ScalarAggregateOptions : public FunctionOptions {
     ...
     /// If less than this many non-null values are observed, emit null.
     uint32_t min_count;
   };
   ```
   ```
       if ((!options.skip_nulls && !this->any && this->has_nulls) ||
           this->count < options.min_count) {
         out->value = std::make_shared<BooleanScalar>();
       }
   ```
   
   For example, assume a batch contains 3 rows and the `min_count` is 2. 
According to the current logic, `any(false)` would have count 1 (as opposed to 
the batch length 3) and emit `null` (`count == 1 < min_count == 2`). I think 
this is wrong.
   
   @felipecrv What do you think? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to