[ https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576852#comment-17576852 ]
Weston Pace commented on ARROW-17313:
-------------------------------------

I think the {{FileFragment}} would be a good place for this. For example, I could imagine something like...

{code}
import pyarrow.dataset as ds

# Discovery happens here.  After this line we will have a
# FileSystemDataset and each FileFragment will be the
# entire file
my_dataset = ds.dataset("/var/data/my_dataset")

# Splits the dataset into 32 partitions.  Each one is
# still a FileSystemDataset with FileFragments but
# now the FileFragments may have slicing information
my_datasets = my_dataset.partition(32)
{code}

{quote}
We shouldn't ask the file format implementations (for example CSV, Parquet or Orc) to accept inaccurate byte ranges.
{quote}

I think there are (at least) two options here.

Approach 1. The partitioned ranges could be byte ranges without any knowledge of the format. This is easy to create but means the file format would need to be able to map a byte range to some readable range. For example, if a user has 10 parquet files, each 10 GiB large, with 10 equal-sized row groups, and we want to divide the dataset into 32 partitions, then each partition would be 3355443200 bytes and the partitions would look like:

File 0: bytes 0 to 3355443200
File 0: bytes 3355443200 to 6710886400
...

However, the row group boundaries within each file would be 0, 1073741824, 2147483648, 3221225472, 4294967296, and so on. So, in this case, the parquet file format would adapt bytes 0 to 3355443200 to row groups 0, 1, 2, and 3 (since each of those row groups' first bytes falls in the requested range) even though this actually represents a slightly larger range than requested (bytes 0 to 4294967296).

Approach 2. As an alternative approach we could expect the producer to know the details of the file format. In this case the partitions would probably be best expressed in terms that make sense for the format. A "parquet partitioner" would specify a list of files with a list of row groups for each file. An Orc partitioner would give a list of stripes. A CSV partitioner would still need to use byte ranges.
With this approach I think you end up needing ParquetFileFragment, OrcFileFragment, etc. (although you could maybe get by with just a RowGroupFileFragment, accepted by the row-group-based formats, and a ByteRangeFileFragment, accepted by the text formats).

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do
> that is to pass in a list of byte ranges to CSV read options that specify
> where in the CSV you want to read. These byte ranges don't necessarily have
> to be aligned on line break boundaries; the CSV reader should just read until
> the end of the line, and skip anything before the first line break in a byte
> range.
> Based on discussion, the scope is going to be reduced here. The first
> implementation will support a single byte range that is already assumed to be
> aligned on line break boundaries.
> Will not handle quotes/returns and other edge cases.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
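The line-boundary rule described in the issue above (skip anything before the first line break in a range; read past the end of the range to finish the last line) matches Hadoop-style input splitting. A plain-Python sketch of those assumed semantics follows; `read_csv_byte_range` is a hypothetical name, not the C++ implementation, and quoted fields containing line breaks are out of scope, as the issue states:

```python
def read_csv_byte_range(path, start, end):
    """Read the lines 'owned' by byte range [start, end): every line
    that starts inside the range, except that a range starting mid-line
    skips that partial line (the previous range reads past its own end
    to finish it)."""
    with open(path, "rb") as f:
        f.seek(start)
        if start > 0:
            # The (possibly partial) line in progress at `start` belongs
            # to the previous range; discard it.
            f.readline()
        out = b""
        # Read whole lines as long as the line starts at or before `end`;
        # the final line may extend past the range end.
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            out += line
        return out
```

Under these semantics, concatenating the results of disjoint ranges that cover the file reproduces every line exactly once, even when a range boundary falls exactly on a line break.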