[GitHub] [arrow] westonpace commented on issue #35508: [C++][Python] Adding data to tdigest in pyarrow

via GitHub Tue, 09 May 2023 14:45:30 -0700


westonpace commented on issue #35508:
URL: https://github.com/apache/arrow/issues/35508#issuecomment-1540931214


   Yes, internally, most aggregate functions are implemented in an incremental 
map/reduce style.  A lot of the pieces are in place to expose this but not 
everything.
   
   However, C++ changes would be required (I think).
   
   You could choose to do this completely outside of Acero using the function 
registry directly.  You would end up creating something that looks quite a bit 
like the aggregate node so I'd recommend starting by looking at that and 
getting familiar.
   
   On the other hand, I think I'd prefer an approach reusing Acero since it 
already does this.  Acero can take streaming (and potentially infinite inputs). 
 This is not very well exposed to pyarrow today (I think you can use an 
iterator of batches as the input to a scanner).  It might be nice to be able to 
create a push-based pyarrow/acero data source though.
   
   The other problem is that you would need an aggregate node that emitted its 
results every time a batch arrived.  This shouldn't be too hard and could 
probably be done with a flag to the aggregate node.  The only wrinkle I think 
is that the aggregate node, today, has one state per thread.  For this to work 
you would want to synchronize all the threads on a single state (or run without 
threads).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #35508: [C++][Python] Adding data to tdigest in pyarrow

Reply via email to