[GitHub] [arrow] rjzamora edited a comment on pull request #10991: ARROW-13572: [C++][Datasets] Add ORC support to Datasets API

GitBox Mon, 01 Nov 2021 07:37:49 -0700


rjzamora edited a comment on pull request #10991:
URL: https://github.com/apache/arrow/pull/10991#issuecomment-956288941



   Thanks for all the great work here @jorisvandenbossche!
   
   To use the Dataset API for `read_orc` in Dask, we will need an 
API to split file-level fragments into stripe-level fragments. For example, 
Parquet file fragments have a `split_by_row_group` method.
   
   We also want to be able to select a subset of stripes from a file fragment 
to produce a new dataset fragment. For example, for Parquet datasets we can do 
`old_frag.format.make_fragment(..., row_groups=selected_row_group_indices)`.
   
   Does it make sense for me to raise separate Jira issues for these features? 
Or, is this functionality already available?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
