[I] Support for observing intermediate aggregation results [datafusion]

via GitHub Mon, 17 Nov 2025 07:20:50 -0800


ahmed-mez opened a new issue, #18773:
URL: https://github.com/apache/datafusion/issues/18773


   ### Is your feature request related to a problem or challenge?
   
   For long-running aggregation queries, users want to observe intermediate 
results as the query progresses rather than waiting for the final answer. This 
enables:
   - Progress monitoring: Show users that queries are making progress and 
haven't stalled
   - Early insights: Start analyzing partial results before the query completes
   - Better UX: Display progressive updates in interactive applications 
(dashboards, notebooks, Jupyter)
   - Approximate query processing: Return "good enough" answers quickly, refine 
as more data is processed
   
   Other mature query engines provide various mechanisms for observing query 
progress or returning partial results.
   
   ### Describe the solution you'd like
   
   We'd like to explore ways to observe intermediate aggregation state for 
queries with unordered input, particularly for grouped aggregations.
   
   Some properties that seem important:
   - Non-destructive: Observing intermediate state shouldn't affect final 
results
   - Periodic: Ability to check state at intervals (time-based or 
batch-count-based)
   - Efficient: Overhead should be proportional to data observed, not 
re-processing
   - Opt-in: Feature should be optional and not impact queries that don't use it
   
   We've done some preliminary exploration and have ideas. We explored 
callback-based peeking with a user-facing API like this:
   
   ```rust
   let peek_config = IntermediatePeekConfig::new(1000, |context: PeekContext| {
       // Receive intermediate results as Arrow RecordBatch
       handle_intermediate_results(&context.intermediate_batch);
       Ok(())
   });
   
   let agg_exec = agg_exec.with_intermediate_peek_config(Some(peek_config));
   ```
   
   And extending existing traits with non-destructive peek methods:
   
   ```rust
   pub trait GroupValues: Send {
       // ... existing methods ...
       
       fn peek(&self, num_groups: usize) -> Result<Vec<ArrayRef>> {
           not_impl_err!("peek not supported")  // Default implementation
       }
   }
   
   pub trait GroupsAccumulator: Send {
       // ... existing methods ...
       
       fn peek_evaluate(&self, num_groups: usize) -> Result<ArrayRef> {
           not_impl_err!("peek_evaluate not supported")  // Default 
implementation
       }
   }
   ```
   
   We're very interested in hearing from the community about:
   - Whether this use case resonates with others
   - Potential approaches we may have missed
   - How this might fit with DataFusion's architecture and future direction
   - Whether there are existing extension points we could leverage
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   We believe this feature would benefit the broader DataFusion community 
beyond our specific use case (interactive analytics, approximate query 
processing, debugging). We're looking for:
   - Design feedback: Which approach aligns best with DataFusion's architecture?
   - API suggestions: Better ways to expose this functionality?
   - Alternative approaches: Are there existing mechanisms we've overlooked?
   - Collaboration: Would DataFusion maintainers/community be interested in 
this feature?
   - Scope guidance: Should we aim for comprehensive support or start with a 
minimal viable feature?
   
   If the community sees value in this feature and we can align on an approach, 
we're willing to invest engineering effort to implement it properly. We'd 
appreciate guidance from maintainers and looking forward to the discussion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Support for observing intermediate aggregation results [datafusion]

Reply via email to