Re: [PR] GH-43631: [C][Format] Add ArrowAsyncDeviceStreamHandler interface [arrow]

via GitHub Mon, 12 Aug 2024 09:37:38 -0700


zeroshade commented on PR #43632:
URL: https://github.com/apache/arrow/pull/43632#issuecomment-2284467244


   @westonpace your last example confuses me a bit, particularly because the 
entire purpose of this is to create a *push-based* approach for async handling, 
rather than a pull-based approach.
   
   Below are my questions:
   
   ```c
   struct Waker {
     // Signal to the producer that more data is available, the consumer shall 
release any resources
     // associated with the waker after this method is called.  The producer 
shall not call any other
     // methods after calling this method.
     //
     // The producer must always call this method even if an error is 
encountered (in which case the
     // error will be reported on the following call to `get_next`).
     void wake(Waker* waker); 
     void* private_data;
   };
   ```
   
   I don't understand signalling to the producer that data is available, the 
producer knows when data is available right? it's producing the data in the 
first place. Isn't the point that the producer needs to signal the *consumer* 
that more data is available? `Waker` would need to be consumer created, with 
the producer calling `wake` when data is available. Was your comment just a 
typo, or am I missing something?
   
   ```c
   struct ArrowAsyncDeviceArrayStream {
     // Callback to get the next array task
     // (if no error and the task is released, the stream has ended)
     //
     // Return value: 0 if successful, an `errno`-compatible error code 
otherwise.
     //
     // The consumer is allowed to call this method again even if the tasks 
returned by a previous
     // call have not completed.  However, the consumer shall not call this 
re-entrantly.
     //
     // If successful, the ArrowDeviceArray must be released independently from 
the stream.
     int (*get_next_task)(struct ArrowAsyncDeviceArrayStream* self, struct 
ArrowArrayTask* task);
   
     // The rest is identical to `ArrowDeviceArrayStream`
     ArrowDeviceType device_type;
     const char* (*get_last_error)(struct ArrowAsyncDeviceArrayStream* self);
     void (*release)(struct ArrowAsyncDeviceArrayStream* self);
     void* private_data;
   };
   ```
   
   Why the abstraction of a task in this case? Couldn't we remove the task 
abstraction and just put the `Waker` here? It feels like the `get_next_task` 
abstraction here is just a way to re-invert the direction of the calls. 
Producer calls `get_next_task` which turns around and calls `task->get_next`? 
At that point, it feels like we could bypass the idea of needing 
`ArrowAsyncDeviceArrayStream` to be constructed by the consumer and instead 
build the entire API around the waker?
   
   Also in the case of this being able to handle concurrency, the 
`get_last_error` function becomes mostly meaningless, right? Depending on when 
it is called, you end up with a race condition as far as what the error might 
be.
   
   Finally, If a consumer can call `get_next_task` before the previous one has 
completed, what should happen in the following scenario:
   
   * There is only one record in the stream, it is delayed
   * Consumer calls `get_next_task` 5 times in a loop to get the next 5 tasks
   
   Do all 5 calls from the consumer wait until the record comes in? Does each 
one provide its own task object which then turns around and waits until the 
record comes in to call wake? The producer then needs to manage the order the 
calls came in, along with the order that the records come in for the calls to 
`get_next`. This seems like a lot of added complexity without much benefit.
   
   All in all, i'm not completely sold on the idea of this re-inversion of the 
paradigm. It honestly feels sorta hacky, but that might just be me.
   
   @paleolimbot @lidavidm thoughts on @westonpace's suggestion here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] GH-43631: [C][Format] Add ArrowAsyncDeviceStreamHandler interface [arrow]

Reply via email to