lidavidm edited a comment on pull request #9620:
URL: https://github.com/apache/arrow/pull/9620#issuecomment-812194926
What I pushed is still not quite what I want. Ideally, we'd be able to ask
the read cache for a future that finishes when all I/O for the given row group
has completed, and only then kick off a decoding task. On master, we currently
spawn a bunch of tasks that each block waiting for I/O before proceeding
(wasting threads), and in this PR we resort to hijinks to manually
pre-buffer each row group separately (which defeats the purpose of
pre-buffering).
That is, we should be able to say
```
reader->PreBuffer(row_groups, columns);
...
// I/O generator: the future completes once all I/O for this row group is done
return reader->WhenBuffered({current_row_group}, columns);
// Decoding generator: decode on the CPU pool only after the data is buffered
return cpu_executor_->Transfer(io_generator()).Then(
    []() { return ReadRowGroup(current_row_group); });
```
and this will let us coalesce read ranges across row groups while only
performing work on the CPU pool once the data is actually ready.
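To make "coalesce read ranges across row groups" concrete, here is a minimal sketch of the merging step, assuming ranges are plain (offset, length) pairs. `ReadRange`, `CoalesceRanges`, and `hole_size_limit` are hypothetical names for illustration only; this is not the read cache's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range to read; not Arrow's own type.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges whose gap is at most `hole_size_limit` bytes, so that column
// chunks from neighbouring row groups can be fetched with a single request.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit) {
  assert(hole_size_limit >= 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        // Close enough (or overlapping): extend the previous request.
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

If each row group is pre-buffered separately, this merge only ever sees one row group's ranges, so it cannot join chunks that sit next to each other in the file but belong to different row groups.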
Also, the range cache will have to be swappable for something that just does
normal file I/O in the non-S3 case, so that local file scans still perform
reasonably.
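One way the swappable piece could look, as a rough sketch only: a small interface with a caching implementation for high-latency stores and a pass-through one for local files. Every name here (`RangeSource`, `DirectSource`, `CachedSource`, `MakeSource`) is hypothetical, and a `std::string` stands in for the file.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface; the names are illustrative, not Arrow's API.
class RangeSource {
 public:
  virtual ~RangeSource() = default;
  // Announce an upcoming read; implementations may ignore this.
  virtual void Prefetch(int64_t offset, int64_t length) = 0;
  virtual std::string Read(int64_t offset, int64_t length) = 0;
};

// Local files: no cache, just read directly on demand.
class DirectSource : public RangeSource {
 public:
  explicit DirectSource(std::string file) : file_(std::move(file)) {}
  void Prefetch(int64_t, int64_t) override {}
  std::string Read(int64_t offset, int64_t length) override {
    return file_.substr(offset, length);
  }

 private:
  std::string file_;  // in-memory stand-in for a file
};

// High-latency stores (e.g. S3): fetch on Prefetch, serve Read from the cache.
class CachedSource : public RangeSource {
 public:
  explicit CachedSource(std::string file) : file_(std::move(file)) {}
  void Prefetch(int64_t offset, int64_t length) override {
    cache_[{offset, length}] = file_.substr(offset, length);
  }
  std::string Read(int64_t offset, int64_t length) override {
    auto it = cache_.find({offset, length});
    assert(it != cache_.end() && "range must be prefetched first");
    return it->second;
  }

 private:
  std::string file_;
  std::map<std::pair<int64_t, int64_t>, std::string> cache_;
};

// Pick the implementation once, based on the kind of filesystem.
std::unique_ptr<RangeSource> MakeSource(bool high_latency, std::string file) {
  if (high_latency) return std::make_unique<CachedSource>(std::move(file));
  return std::make_unique<DirectSource>(std::move(file));
}
```

The reader would talk only to `RangeSource`, so the scan code stays the same for S3 and local files while the local path skips the cache entirely.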