lidavidm commented on PR #43632: URL: https://github.com/apache/arrow/pull/43632#issuecomment-2287830710
> I'm basically modeling this on the current lance file reader. The file reader knows (from the metadata) how many tasks there will be. get_next_task is a synchronous method that creates a task which is an asynchronous task.

I think outside of file readers, we're never going to know how many tasks there will be up front, so I'm not sure we want to choose an interface that bakes in this assumption.

> This gets more and more complicated the more times this mismatch is encountered in a pipeline.

Yes, I think we're going to have to end up with a lowest-common-denominator style approach. And overall I feel like the callback-based approach is most likely to be the most natural across languages.

For the pull-based stream here: https://github.com/apache/arrow/pull/43632#issuecomment-2281940063

- Doesn't this still mean the consumer can block the producer's thread by accident (by doing processing inside `wake`)?

For the task-based approach here: https://github.com/apache/arrow/pull/43632#issuecomment-2282026639

- Is the reason for a separate task to help optimize cache usage? (Basically, because the control flow is on the consumer's side, it is the consumer thread that eventually calls get_next and can then immediately do processing, without having to transfer threads to avoid blocking the producer.)
- Would it be sufficient for that use case if we had a callback approach that produced a task instead of directly producing an array?
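
To make that last question concrete, here is a rough Python sketch of what I mean by a callback that produces a task rather than an array. This is not the proposed C interface; `produce`, `consume`, `on_next_task`, and `on_done` are hypothetical names, and the list standing in for a decoded array is just a placeholder. The producer thread only hands over a cheap closure, and the consumer runs it later on its own thread, so the decode work never happens on the producer's thread as long as the callback itself returns quickly.

```python
import queue
import threading


def produce(on_next_task, on_done, n_batches=3):
    """Hypothetical producer: delivers cheap task closures, not decoded arrays."""
    for i in range(n_batches):
        def task(i=i):
            # Stand-in for the expensive decode step. It runs on whichever
            # thread calls the task, which is the consumer in this sketch.
            return [i] * 4, threading.current_thread().name

        on_next_task(task)  # the callback is expected to return quickly
    on_done()


def consume():
    """Hypothetical consumer: the callback only enqueues; work happens here."""
    pending = queue.Queue()
    done = object()  # sentinel marking the end of the stream

    producer = threading.Thread(
        target=produce,
        args=(pending.put, lambda: pending.put(done)),
        name="producer",
    )
    producer.start()

    results = []
    while True:
        task = pending.get()
        if task is done:
            break
        # Decode on the consumer's thread, right where the data will be used.
        results.append(task())
    producer.join()
    return results


results = consume()
```

Whether this is enough depends on whether the task-based proposal needs anything beyond "the consumer's thread runs the decode", for example knowing the number of tasks up front.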