[
https://issues.apache.org/jira/browse/PARQUET-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293962#comment-17293962
]
Weston Pace commented on PARQUET-1993:
--------------------------------------
> Is it to have full async Parquet reading?
Yes. Streaming & async Parquet reading with readahead.
You could have...
{code:java}
Future<RecordBatch> ReadNext()
{code}
...but once the reader pre-fetches internally, it becomes difficult for the
caller to figure out how much readahead to use.
Suppose I add 4 calls' worth of readahead, while the reader sees that the
underlying table consists of many small row groups and decides to prefetch
20 record batches at once. Then I end up leaving the I/O idle.
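As a rough illustration of that mismatch, here is a toy model. It is not the
Arrow/Parquet API: `IoRequestsCovered` and its parameters are hypothetical
names, and it assumes the reader serves a fixed number of consecutive
ReadNext() calls from each I/O request. Under that assumption, the number of
I/O requests a caller's readahead window actually covers could be counted
roughly like this:

```cpp
#include <cstdint>

// Toy model (assumption, not the real reader): the reader serves
// `batches_per_io` consecutive ReadNext() calls from a single I/O request.
// Returns how many distinct I/O requests are covered by `readahead`
// outstanding calls, starting at the zero-based call index `first_call`.
inline std::int64_t IoRequestsCovered(std::int64_t first_call,
                                      std::int64_t readahead,
                                      std::int64_t batches_per_io) {
  if (first_call < 0 || readahead <= 0 || batches_per_io <= 0) {
    return 0;  // Nothing outstanding, or invalid input.
  }
  // Index of the I/O request that serves the first outstanding call.
  const std::int64_t first_io = first_call / batches_per_io;
  // Index of the I/O request that serves the last outstanding call.
  const std::int64_t last_io = (first_call + readahead - 1) / batches_per_io;
  return last_io - first_io + 1;
}
```

With the numbers above, `IoRequestsCovered(0, 4, 20)` covers a single I/O
request, whereas with one batch per request, `IoRequestsCovered(0, 4, 1)`
covers four. The same "4 calls" of readahead can therefore mean very
different amounts of I/O in flight, depending on what the reader does
internally.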
Another approach could be to push the readahead into the Parquet reader. I'm
not sure which would be easier.
> [C++] Expose when prefetching completes
> ---------------------------------------
>
> Key: PARQUET-1993
> URL: https://issues.apache.org/jira/browse/PARQUET-1993
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: David Li
> Assignee: David Li
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> As a follow up to PARQUET-1820, we should let an application be notified when
> pre-buffering has completed (e.g. PreBuffer() should return Future<void>).
> This would let an application pre-buffer some amount of data (across multiple
> files and/or row groups) and then decode data as it becomes available instead
> of blocking.
> A more ergonomic API would be to expose Future<RecordBatchReader> or
> something along those lines.