Jimexist commented on a change in pull request #375:
URL: https://github.com/apache/arrow-datafusion/pull/375#discussion_r638359608



##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -509,6 +540,43 @@ pub trait Accumulator: Send + Sync + Debug {
     fn evaluate(&self) -> Result<ScalarValue>;
 }
 
+/// A window accumulator represents a stateful object that lives throughout 
the evaluation of multiple
+/// rows and generically accumulates values.
+///
+/// An accumulator knows how to:
+/// * update its state from inputs via `update`
+/// * convert its internal state to a vector of scalar values
+/// * update its state from multiple accumulators' states via `merge`
+/// * compute the final value from its internal state via `evaluate`
+pub trait WindowAccumulator: Send + Sync + Debug {
+    /// scans the accumulator's state from a vector of scalars, similar to 
Accumulator it also
+    /// optionally generates values.
+    fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>>;

Review comment:
       @alamb good question!
   
   The window word in this trait is purely indicating the fact that window 
functions will use this. it can be of a better name but...
   
   for a design, there are two complications:
   1. multiple window functions, each having different window frames, can be 
scanning batches at the same time, so i'd want each to create its own window 
accumulator, remembering to push_back, and remove front, on their own; this 
trait would not put that into the API, it just _scans_
   2. specifically for window that peeks ahead, because batches arrive in async 
stream, it is not feasible to _peek_, so my plan is to allow them to optionally 
execute a one time shift upon finishing up (e.g. `lead` is just producing the 
same value _in situ_, but with a one time shift at the end)
   
   Due to 1 and 2, a best possible state vector for window accumulator is 
possibly `VecDequeue`. And the name `scan` is because accumulators iterate the 
list and either optionally produce one value at a time (`max` with order by), 
or exactly one accumulated value upon finish scanning (`max` with empty over), 
but not both




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to