zanmato1984 commented on issue #43768:
URL: https://github.com/apache/arrow/issues/43768#issuecomment-2303644889

   > Edit: I've try to write tests here and found it's actually bug-free:
   > 
   > 1. `ExecBatch` will regarded as size == 1 when all input is scalar ( has constant )
   > 2. So, the size is always 1, this is also handled by `PromoteExecSpanScalars`
   > 
   > So, actually it's always 1 here
   
   You are right when only compute kernels are involved. In that case, you can
   assume that when the argument of `any` is a scalar, the batch length must be
   `1`.
   
   However, this might not hold in a more complex context, e.g. Acero. Here is
   a concrete test that reproduces the expected bug (explained at the end):
   ```C++
   TEST(ScalarAggregate, BuggyAny) {
     std::shared_ptr<Schema> in_schema = schema({field("not_used", int32())});
     std::vector<ExecBatch> batches{
         ExecBatchFromJSON({int32()}, "[[42], [42], [42], [42]]")};
   
     std::vector<Aggregate> aggregates = {
         Aggregate("any",
                   std::make_shared<compute::ScalarAggregateOptions>(
                       /*skip_nulls=*/false, /*min_count=*/2),
                   FieldRef("literal_true"))};
   
     Declaration plan = Declaration::Sequence(
         {{"exec_batch_source",
           ExecBatchSourceNodeOptions(in_schema, std::move(batches))},
          {"project", ProjectNodeOptions({literal(true)}, {"literal_true"})},
          {"aggregate", AggregateNodeOptions(aggregates)}});
   
     ASSERT_OK_AND_ASSIGN(BatchesWithCommonSchema out_batches,
                          DeclarationToExecBatches(plan));
   
     std::cout << out_batches.batches[0].values[0].ToString() << std::endl;
   }
   ```
   Output:
   ```
   Scalar(null)
   ```
   Explanation: There is one source node with 1 batch of 4 rows (the contents
   don't matter), followed by a projection node that outputs only the literal
   `true` (also 4 rows). The tricky part is what this projection node emits: a
   batch of logically 4 rows that holds a single scalar column. When this batch
   is eventually ingested by the subsequent aggregation node, which calls `any`
   on this scalar column with `min_count` set to `2`, boom: the result is
   `null`, even though the column logically holds 4 non-null `true` values.
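
   To make the suspected mechanism concrete, here is a minimal standalone
   sketch. It is not Arrow's actual kernel code: `AnyState`, `ScalarColumn`,
   `ConsumeScalarOnce`, `ConsumeScalarPerRow` and `Finalize` are hypothetical
   names, and the assumption is that the aggregate counts a valid scalar as one
   value regardless of the batch length.

   ```cpp
   #include <cstdint>

   // Hypothetical, simplified model of an `any` aggregate state.
   // This is an illustration only, not Arrow's implementation.
   struct AnyState {
     bool any = false;
     int64_t count = 0;  // number of non-null values seen so far
   };

   // A scalar column that logically stands for `batch_length` identical rows.
   struct ScalarColumn {
     bool is_valid;
     bool value;
     int64_t batch_length;
   };

   // Result of the aggregation; `is_null` models `Scalar(null)`.
   struct AnyResult {
     bool is_null;
     bool value;
   };

   // Assumed buggy behavior: the scalar is counted once, which is only
   // correct when the batch length is 1.
   void ConsumeScalarOnce(AnyState& state, const ScalarColumn& col) {
     state.any = state.any || (col.is_valid && col.value);
     state.count += col.is_valid ? 1 : 0;
   }

   // Alternative: the scalar is counted once per logical row of the batch.
   void ConsumeScalarPerRow(AnyState& state, const ScalarColumn& col) {
     state.any = state.any || (col.is_valid && col.value);
     state.count += col.is_valid ? col.batch_length : 0;
   }

   // Yields null when fewer than `min_count` non-null values were seen.
   AnyResult Finalize(const AnyState& state, int64_t min_count) {
     if (state.count < min_count) return AnyResult{true, false};
     return AnyResult{false, state.any};
   }
   ```

   With a valid `true` scalar, a batch length of 4 and `min_count` of 2, the
   first variant finalizes to null (matching the output above) and the second
   finalizes to `true`.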

