[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576900#comment-17576900
 ] 

Weston Pace commented on ARROW-17313:
-------------------------------------

Yes.  I think the original Substrait use case was based on Spark's 
implementation (see https://github.com/substrait-io/substrait/pull/102), 
but I didn't ask for many details.

Iceberg has something similar.  Its manifest holds a list of data files, and 
each data file has a list of split offsets: byte positions where the file can 
be split.  That sort of approach could be interesting here.  FileFragment 
isn't persistable today, but we could easily add split offsets when 
discovering Parquet, IPC, and ORC files.  In addition, there could be a 
boolean option to scan CSV files during discovery, find the line breaks 
(probably keeping only about one per block of some size), and record those as 
split offsets.  That would solve the "need to know the right spot to split" 
problem for CSV, at the cost of a more expensive discovery.
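
To make the CSV idea concrete, here is a minimal sketch in plain Python, not 
Arrow code.  The function name discover_split_offsets and the block_size 
parameter are hypothetical, and it assumes that no quoted field contains a 
newline, so every b"\n" really ends a row.

```python
import io


def discover_split_offsets(stream, block_size=1 << 20):
    """Return byte offsets at which a CSV byte stream can be split.

    Every offset is the position of the first byte after a line break, and
    consecutive offsets are at least block_size bytes apart.

    Assumption: no quoted field contains a newline. A real implementation
    would have to track quoting state while scanning.
    """
    offsets = [0]
    pos = 0
    for line in stream:  # binary streams yield chunks ending in b"\n"
        pos += len(line)
        # Only record a boundary once the current split is large enough.
        if line.endswith(b"\n") and pos - offsets[-1] >= block_size:
            offsets.append(pos)
    # Drop a trailing offset at end-of-file; it would start an empty split.
    if len(offsets) > 1 and offsets[-1] == pos:
        offsets.pop()
    return offsets


data = b"a,b\n1,2\n3,4\n5,6\n"
print(discover_split_offsets(io.BytesIO(data), block_size=8))  # [0, 8]
```

Each split then covers the bytes from one offset up to the next one (or to 
the end of the file).  Only the first split contains the header row, so a 
reader of the later splits would need the column names from somewhere else.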

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to read just a portion of a CSV. The best way to do 
> that is to pass a list of byte ranges to the CSV read options, specifying 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries: the CSV reader should skip anything 
> before the first line break in a byte range and read until the end of the 
> line at the end of the range.
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is assumed to be 
> already aligned on line break boundaries. 
> It will not handle quotes/returns or other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
