[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576926#comment-17576926
 ] 

Ziheng Wang commented on ARROW-17313:
-------------------------------------

There is no physical way to do this with a .csv.gz file: a gzip stream cannot 
be decompressed starting at an arbitrary byte. The interface can still be 
supported, but the implementation has to decompress and scan every byte before 
the requested range, so it will be less efficient than for an uncompressed file.
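
To illustrate the point, here is a minimal Python sketch using only the 
standard library's gzip module. The helper names (read_uncompressed_range, 
can_decompress_from) and the sample data are made up for this example and are 
not part of Arrow. It shows that a byte range of the uncompressed data can be 
served, but only by decompressing from the start of the stream, and that 
starting in the middle of the compressed bytes fails.

```python
import gzip
import io
import zlib


def read_uncompressed_range(gz_bytes: bytes, start: int, end: int) -> bytes:
    """Return uncompressed bytes [start, end) of a gzip payload.

    seek() on a read-mode GzipFile is emulated: the stream is decompressed
    from the beginning and everything before `start` is thrown away.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as f:
        f.seek(start)
        return f.read(end - start)


def can_decompress_from(gz_bytes: bytes, offset: int) -> bool:
    """Try to decompress starting at `offset` in the *compressed* bytes."""
    try:
        gzip.decompress(gz_bytes[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Hypothetical sample data: a small CSV with a header and 1000 rows.
csv_data = b"id,name\n" + b"".join(b"%d,row%d\n" % (i, i) for i in range(1000))
gz = gzip.compress(csv_data)

# Works, but pays for decompressing the first 100 bytes as well.
chunk = read_uncompressed_range(gz, 100, 200)

# Starting at an arbitrary compressed offset is not possible.
mid_ok = can_decompress_from(gz, len(gz) // 2)
```

The cost of read_uncompressed_range grows with `start`, which is the 
inefficiency described above.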

I am in favor of adding a file-format-specific option to the FileFragment class 
to denote how the byte range / row group range is specified. This option could 
take different values for different types of files, and each file type's reader 
can interpret it accordingly. This might avoid the need for multiple file classes.

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to read just a portion of a CSV. The best way to do 
> that is to pass a list of byte ranges to the CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries. The CSV reader should skip anything 
> before the first line break in a byte range and read until the end of the 
> line that contains the end of the range.
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on line boundaries. 
> It will not handle quotes/returns and other edge cases.
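
The unaligned byte-range behaviour described in the issue can be sketched in 
plain Python as follows. This is only an illustration under stated assumptions, 
not the Arrow implementation: read_lines_in_range is a hypothetical helper, 
newlines are assumed to be "\n", and quoted fields containing line breaks are 
not handled. It also assumes one particular convention: a line belongs to the 
range in which its first byte falls, so adjacent ranges return every line 
exactly once.

```python
import io


def read_lines_in_range(f, start: int, end: int) -> list:
    """Return the complete lines whose first byte lies in [start, end).

    `f` is a seekable binary file object. A partial line at the beginning of
    the range is skipped; the last line is read to its end even if it runs
    past `end`.
    """
    if start > 0:
        # Step back one byte: if that byte is a newline, only the newline is
        # consumed and a line starting exactly at `start` is kept. Otherwise
        # the remainder of the partial line is skipped.
        f.seek(start - 1)
        f.readline()
    else:
        f.seek(0)

    lines = []
    while f.tell() < end:
        line = f.readline()
        if not line:  # end of file
            break
        lines.append(line)
    return lines


# Hypothetical sample: four 4-byte lines at offsets 0, 4, 8 and 12.
data = b"a,b\n1,2\n3,4\n5,6\n"
first = read_lines_in_range(io.BytesIO(data), 0, 6)
second = read_lines_in_range(io.BytesIO(data), 6, len(data))
```

Here the split point 6 falls inside the line "1,2", so that line is returned 
by the first range and skipped by the second.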



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
