lidavidm commented on PR #43632:
URL: https://github.com/apache/arrow/pull/43632#issuecomment-2287830710

   > I'm basically modeling this on the current lance file reader. The file 
reader knows (from the metadata) how many tasks there will be. get_next_task is 
a synchronous method that creates a task which is an asynchronous task. 
   
   I think that outside of file readers, we're never going to know how many 
tasks there will be up front, so I'm not sure we want to choose an interface 
that bakes in this assumption.
   
   > This gets more and more complicated the more times this mismatch is 
encountered in a pipeline.
   
   Yes, I think we're going to end up with a lowest-common-denominator 
approach. And overall I feel like the callback-based approach is likely to be 
the most natural across languages. 
   
   For the pull-based stream here: 
https://github.com/apache/arrow/pull/43632#issuecomment-2281940063
   
   - Doesn't this still mean the consumer can block the producer's thread by 
accident (by doing processing inside `wake`)?
   
   For the task-based approach here: 
https://github.com/apache/arrow/pull/43632#issuecomment-2282026639
   
   - Is the reason for a separate task to help optimize cache usage? (Basically 
because the control flow is on the consumer's side, so it is the consumer 
thread that eventually calls get_next and can then immediately do processing 
without having to transfer threads to avoid blocking the producer)
   - Would it be sufficient for that use case if we had a callback approach 
that produced a task instead of directly producing an array?
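
To make the last question concrete, here is a minimal sketch of a callback that receives a task rather than an array. Every name in it (`Task`, `producer`, `consume`, `on_next_task`) is hypothetical and not part of any Arrow interface; it only assumes that the callback returns quickly and that the consumer runs the task later on its own thread. End of stream is signalled with `None`, so the number of tasks does not have to be known up front.

```python
import queue
import threading
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical type: a deferred unit of work that yields one decoded batch
# plus the name of the thread that actually ran it.
Task = Callable[[], Tuple[List[int], str]]


def producer(on_next_task: Callable[[Optional[Task]], None], n_batches: int) -> None:
    """Push one task per batch through the callback, then None for end of stream."""
    for i in range(n_batches):
        def task(i: int = i) -> Tuple[List[int], str]:
            # Stand-in for expensive decode work. It runs on whichever
            # thread invokes the task, not on the thread that created it.
            batch = [i * 10 + j for j in range(3)]
            return batch, threading.current_thread().name

        on_next_task(task)  # the callback only hands the task over
    on_next_task(None)


def consume(n_batches: int) -> Tuple[List[List[int]], Set[str]]:
    """Run the producer on its own thread and execute tasks on the caller's thread."""
    tasks: "queue.Queue[Optional[Task]]" = queue.Queue()
    worker = threading.Thread(
        target=producer, args=(tasks.put, n_batches), name="producer"
    )
    worker.start()

    batches: List[List[int]] = []
    ran_on: Set[str] = set()
    while True:
        task = tasks.get()
        if task is None:  # end of stream
            break
        batch, thread_name = task()  # decode happens here, on the consumer
        batches.append(batch)
        ran_on.add(thread_name)
    worker.join()
    return batches, ran_on
```

In this sketch the callback is just `tasks.put`, so the producer's thread is never held up by consumer-side processing, and the batch is materialized on the thread that goes on to use it. Whether that is enough for the cache-locality use case is exactly the open question above.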

