[ https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576852#comment-17576852 ]

Weston Pace commented on ARROW-17313:
-------------------------------------

I think the {{FileFragment}} would be a good place for this.  For example, I 
could imagine something like...

{code}
import pyarrow.dataset as ds
# Discovery happens here.  After this line we will have a
# FileSystemDataset and each FileFragment will be the
# entire file
my_dataset = ds.dataset("/var/data/my_dataset")
# Splits the dataset into 32 partitions.  Each one is
# still a FileSystemDataset with FileFragments but
# now the FileFragments may have slicing information
my_datasets = my_dataset.partition(32)
{code}

{quote}
We shouldn't ask the file format implementations (for example CSV, Parquet or 
ORC) to accept inaccurate byte ranges.
{quote}

I think there are (at least) two options here.

Approach 1.  The partitioned ranges could be byte ranges created without any 
knowledge of the format.  This is easy to create but means the file format 
would need to be able to map a byte range to some readable range.  For 
example, if a user has 10 parquet files, each 1GiB large, with 10 equal-sized 
row groups, and we want to divide the dataset into 32 partitions, then the 
partitions would look like:

File 0:  Bytes 0 to 335544320
File 0:  Bytes 335544320 to 671088640
...

However, the row group boundaries within each file would be 0, 107374182, 
214748365, 322122547, 429496730, and so on.  So, in this case, the parquet 
file format would adapt bytes 0 - 335544320 to row groups 0, 1, 2, and 3 
(since the first byte of each of those row groups falls in the requested 
range) even though this actually represents a slightly larger than requested 
partition (0 - 429496730).
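
To make that adaptation step concrete, here is a minimal sketch (the helper is 
hypothetical, not an existing pyarrow API) of one way a format could map a 
requested byte range onto row groups by picking every row group whose first 
byte falls inside the range:

{code}
# Hypothetical helper, not part of pyarrow: map a requested byte range onto
# the row groups whose first byte falls inside that range.
def row_groups_for_byte_range(row_group_offsets, start, end):
    # row_group_offsets[i] is the byte offset where row group i begins
    return [
        i for i, offset in enumerate(row_group_offsets)
        if start <= offset < end
    ]

# First few row group offsets from the example above (1GiB file, 10 row groups)
offsets = [0, 107374182, 214748365, 322122547, 429496730]
# Requested partition: bytes 0 - 335544320
print(row_groups_for_byte_range(offsets, 0, 335544320))  # [0, 1, 2, 3]
{code}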

Approach 2.  As an alternative, we could expect the producer to know the 
details of the file format.  In this case the partitions would probably be 
best expressed in terms that make sense for the format.  A "parquet 
partitioner" would specify a list of files with a list of row groups for each 
file.  An ORC partitioner would give a list of stripes.  A CSV partitioner 
would still need to use byte ranges.

With this approach I think you end up needing ParquetFileFragment, 
OrcFileFragment, etc. (although you could maybe get by with just 
RowGroupFileFragment, accepted by the row-group-based formats, and 
ByteRangeFileFragment, accepted by the text formats).
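
As a rough illustration (the class and field names below are made up for this 
sketch; none of this exists in pyarrow today), the format-specific fragment 
descriptions might carry information along these lines:

{code}
# Hypothetical fragment descriptions, one flavor per kind of format.
# These classes do not exist in pyarrow; they only sketch the shape of the
# information a format-aware partitioner would produce.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RowGroupFileFragment:       # Parquet row groups / ORC stripes
    path: str
    row_groups: List[int]         # indices of the row groups (or stripes) to read

@dataclass
class ByteRangeFileFragment:      # CSV and other text formats
    path: str
    byte_range: Tuple[int, int]   # (start, end) byte offsets

parquet_part = RowGroupFileFragment("/var/data/my_dataset/f0.parquet", [0, 1, 2, 3])
csv_part = ByteRangeFileFragment("/var/data/my_dataset/f0.csv", (0, 335544320))
{code}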

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV.  The best way to 
> do that is to pass a list of byte ranges to the CSV ReadOptions specifying 
> where in the CSV you want to read.  These byte ranges don't necessarily have 
> to be aligned on line break boundaries; the CSV reader should just read 
> until the end of the line, and skip anything before the first line break in 
> a byte range.
> Based on discussion, the scope is going to be reduced here.  The first 
> implementation will support a single byte range that is already assumed to 
> be aligned on line break boundaries.  It will not handle quotes/returns and 
> other edge cases.
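
For reference, the line-boundary rule described in the issue text above can be 
sketched as follows (plain Python over an in-memory buffer; a hypothetical 
helper that ignores quoted fields and other edge cases, just as the reduced 
scope does):

{code}
# Minimal sketch of the byte-range rule from the issue description: skip
# anything before the first line break in the range, and read past the end of
# the range until the end of the line.  Quoted newlines and other edge cases
# are ignored, matching the reduced scope above.
def csv_slice_for_byte_range(buf: bytes, start: int, end: int) -> bytes:
    if start > 0:
        newline = buf.find(b"\n", start)
        if newline == -1:
            return b""                    # no complete row starts in this range
        start = newline + 1
    newline = buf.find(b"\n", end)
    end = len(buf) if newline == -1 else newline + 1
    return buf[start:end]

data = b"a,b\n1,2\n3,4\n5,6\n"
print(csv_slice_for_byte_range(data, 0, 6))   # b'a,b\n1,2\n'
print(csv_slice_for_byte_range(data, 6, 12))  # b'3,4\n5,6\n'
{code}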



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
