[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-12-08 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644922#comment-17644922
 ] 

Apache Arrow JIRA Bot commented on ARROW-17313:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-10 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578080#comment-17578080
 ] 

Weston Pace commented on ARROW-17313:
-

There is also ARROW-15589, which I had referenced above, also motivated from 
Spark (but via Substrait).

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-10 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577977#comment-17577977
 ] 

David Li commented on ARROW-17313:
--

ARROW-17159 is a very similar issue, except motivated by Parquet/Spark

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576937#comment-17576937
 ] 

Weston Pace commented on ARROW-17313:
-

That sounds good to me.  Even if we end up later unifying everything in a 
common "byte ranges" or "percentages" API I don't think there is any harm in 
also having format-specific APIs.  Plus, having the format-specific APIs should 
simplify adoption of a common API if we decide to go that route.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Ziheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576926#comment-17576926
 ] 

Ziheng Wang commented on ARROW-17313:
-

There is no physical way you can do this with a .csv.gz file – you can't 
decompress starting at a random byte, so the interface can be supported but the 
implementation won't skip scanning bytes and will be less efficient.

I am in favor of adding a file format specific option in the FIleFragment class 
to denote how the byte range / row group range is specified. This option could 
take different values for different types of files and each file type reader 
can interpret accordingly. This might avoid the need for multiple file classes.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576914#comment-17576914
 ] 

Antoine Pitrou commented on ARROW-17313:


Well, if arbitrary byte ranges need to be supported, then some scanning will be 
required when starting to read the CSV file. And we need to error out if 
{{allow_newlines_in_values}} is true.

An additional complication: what happens with e.g. a {{.csv.gz}} file?

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576900#comment-17576900
 ] 

Weston Pace commented on ARROW-17313:
-

Yes.  I think the original Substrait use case was based on Spark's 
implementation (linking to https://github.com/substrait-io/substrait/pull/102) 
but I didn't ask too many details.

Iceberg has something kind of similar.  In their manifest they have a list of 
data files.  Each data file has a list of split offsets.  These are byte 
indices where the file could be split.  That sort of approach could be 
interesting.  FileFragment isn't persistable today but we could easily add 
split offsets when discovering parquet, IPC, ORC today.  Plus, there could be a 
boolean to scan CSV files during discovery to discover line breaks (probably 
debounced by some block size) and record those as split offsets.  That would 
solve the "need to know the right spot to split" problem for CSV (at the cost 
of a more expensive discovery).

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576889#comment-17576889
 ] 

Antoine Pitrou commented on ARROW-17313:


Hmm... so this is partitioning on the client side by sharding the queries to 
different compute servers?


> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576885#comment-17576885
 ] 

Weston Pace commented on ARROW-17313:
-

I don't think I've been explaining myself well.  Let's imagine a worst-case 
(though not unheard of) scenario where a user has a single 10GiB file, stored 
in S3, that they want to scan using 4 different EC2 containers.  Using the 
current datasets API this would be impossible to do unless that file happens to 
be parquet (since we do have ParquetFileFormat and split_row_groups).  I'd like 
a solution that I can use regardless of the format.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576881#comment-17576881
 ] 

Antoine Pitrou commented on ARROW-17313:


What I strive to understand is why the Substrait producer would be in a better 
position to choose the file fragment granularity rather than the query engine 
implementation (or, in the Arrow C++ case, the dataset subsystem).

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576880#comment-17576880
 ] 

Weston Pace commented on ARROW-17313:
-

Would it help to think of these not as byte ranges but as percentages?  I'm 
pretty sure the goal is just to be able to split a scan specification into 
subtasks.  They could then be divided amongst processes, divided amongst 
servers, or simply run piecemeal so that partial success and retry is simpler 
(I think this might be [~marsupialtail]'s end goal).

"Repartition the data into smaller files" should always work but I don't know 
that this is always an acceptable option.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576858#comment-17576858
 ] 

Antoine Pitrou commented on ARROW-17313:


The intent of datasets has always be that each file format defines its own 
granularity for reading files. I don't understand why the consumer would 
specify byte ranges by hand.

[~bkietz] What is your opinion on this?

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576852#comment-17576852
 ] 

Weston Pace commented on ARROW-17313:
-

I think the {{FileFragment}} would be a good place for this.  For example, I 
could imagine something like...

{code}
import pyarrow.dataset as ds
# Discovery happens here.  After this line we will have a
# FileSystemDataset and each FileFragment will be the
# entire file
my_dataset = ds.dataset("/var/data/my_dataset")
# Splits the dataset into 32 partitions.  Each one is
# still a FileSystemDataset with FileFragments but
# now the FileFragments may have slicing information
my_datasets = my_dataset.partition(32)
{code}

{quote}
We shouldn't ask the file format implementations (for example CSV, Parquet or 
Orc) to accept innacurate byte ranges.
{quote}

I think there are (at least) two options here.  The partitioned ranges could be 
byte ranges without any knowledge of the format.  This is easy to create but 
means the file format would need to be able to map a byte range to some 
readable range.  For example, if a user has 10 parquet files, each 10GiB large, 
with 10 equal sized row groups and we want to divide it into 32 partitions then 
the partitions would look like:

File 0:  Bytes 0 to 33554432
File 0:  Bytes 33554432 to 67108864
...

However, the row group boundaries would be 0, 107374182, 214748365, 322122547, 
429496730.  So, in the case, the parquet file format would adapt bytes 0 - 
33554432 to row groups 0, 1, 2 (since the first byte falls in the requested 
range) even though this actually represents a slightly larger than requested 
partition (0 - 429496730).

Approach 2.  As an alternative approach we could expect the producer to know 
the details of the file format.  In this case the partitions would probably be 
best expressed in terms that make sense for the format.  A "parquet 
partitioner" would specify a list of files with a list of row groups for each 
file.  An Orc partitioner would give a list of stripes.  A CSV partitioner 
would still need to use byte ranges.

With this approach I think you end up needing ParquetFileFragment, 
OrcFileFragment, etc. (although you could maybe get by with just 
RowGroupFileFragment, accepted by the grouped formats and 
ByteRangeFileFragment, accepted by the text formats).

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-06 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576154#comment-17576154
 ] 

Antoine Pitrou commented on ARROW-17313:


Ok, so perhaps byte ranges should actually be provided by the partitioning 
scheme ? Most partitioning schemes would only emit trivial byte ranges (i.e. 
the entire file), but the Substrait partitioning would emit whatever is in the 
Substrait plan.

Of course, it's up to the Substrait producer to ensure that these byte ranges 
actually make sense for the given data format. We shouldn't ask the file format 
implementations (for example CSV, Parquet or Orc) to accept innacurate byte 
ranges.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-05 Thread Ziheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576079#comment-17576079
 ] 

Ziheng Wang commented on ARROW-17313:
-

Ideally we update the Dataset Scanner to be able to take in different byte 
ranges for different fragments. Or is this not required?

A complication would be that fragments right now don't seem to have some sort 
of "ID", so it might be hard for a user to specify which fragments should read 
which byte ranges. The way to do this would be to let the user pass in a dict 
in the ScanOptions that's something like \{file_path1: byte_range1, file_path2: 
byte_range2}. I think this would make sense.

Alternatively if this is not going to be supported, then this option ideally 
should only make sense for a dataset with one fragment. Perhaps I'll just add a 
check in the FragmentsToBatches function or something.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  
> Based on discussion, the scope is going to be reduced here. The first 
> implementation will support a single byte range that is already assumed to be 
> aligned on byte boundaries. 
> Will not handle quotes/returns and other edge cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-05 Thread Ziheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575967#comment-17575967
 ] 

Ziheng Wang commented on ARROW-17313:
-

Also this will not support compressed formats, at least at the start

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-05 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575937#comment-17575937
 ] 

Weston Pace commented on ARROW-17313:
-

> We should reject a partial read if newlines_in_values is false.

Oops, I mean, if {{newlines_in_values}} is true :)

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575928#comment-17575928
 ] 

Antoine Pitrou commented on ARROW-17313:


Nothing. The Substrait producer should produce valid ranges. There is no reason 
to accept random invalid ranges.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-05 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575930#comment-17575930
 ] 

Weston Pace commented on ARROW-17313:
-

> It's not too late to change the Substrait spec, is it?
> Or we can raise NotImplemented if the offset is ever-non-zero.

Raising "not implemented" in this case is fine I'm sure.  If it can't be done 
then it can't be done.  Perhaps we can avoid most of these cases by reading a 
little bit (e.g. 32 bytes) before the beginning of the block as well.

> The sample block starts inside a "quoted" field

I think this is only a problem if we allow newlines in values.  We should 
reject a partial read if {{newlines_in_values}} is false.

> The first char of a block is "\n" but the last char of previous block is an 
> "escape"

Reading a bit early would help here as long as it isn't a really long chain of 
escapes which should be rare and detectable (we could error in this case).

> Sample at middle of "\r\n" may also be confusing

Reading a bit early would help here too.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-05 Thread Ziheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575916#comment-17575916
 ] 

Ziheng Wang commented on ARROW-17313:
-

Ah I meant what we should do about the linbreaks and quotes etc.

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17313) [C++] Add Byte Range to CSV Reader ReadOptions

2022-08-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575902#comment-17575902
 ] 

Antoine Pitrou commented on ARROW-17313:


There's not much to elaborate. {{RandomAccessFile::GetStream}} gives you a 
stream over a subrange of a file.  See API docs:
https://arrow.apache.org/docs/cpp/api/io.html#_CPPv4N5arrow2io16RandomAccessFile9GetStreamENSt10shared_ptrI16RandomAccessFileEE7int64_t7int64_t

> [C++] Add Byte Range to CSV Reader ReadOptions
> --
>
> Key: ARROW-17313
> URL: https://issues.apache.org/jira/browse/ARROW-17313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ziheng Wang
>Assignee: Ziheng Wang
>Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do 
> that is to pass in a list of byte ranges to CSV read options that specify 
> where in the CSV you want to read. These byte ranges don't necessarily have 
> to be aligned on line break boundaries, the CSV reader should just read until 
> the end of the line, and skip anything before the first line break in a byte 
> range.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)