[ https://issues.apache.org/jira/browse/ARROW-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576885#comment-17576885 ]
Weston Pace commented on ARROW-17313:
-------------------------------------

I don't think I've been explaining myself well. Let's imagine a worst-case (though not unheard of) scenario where a user has a single 10GiB file, stored in S3, that they want to scan using 4 different EC2 containers. Using the current datasets API this would be impossible to do unless that file happens to be parquet (since we do have ParquetFileFormat and split_row_groups). I'd like a solution that I can use regardless of the format.

> [C++] Add Byte Range to CSV Reader ReadOptions
> ----------------------------------------------
>
>                 Key: ARROW-17313
>                 URL: https://issues.apache.org/jira/browse/ARROW-17313
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Ziheng Wang
>            Assignee: Ziheng Wang
>            Priority: Major
>
> Sometimes it's desirable to just read a portion of a CSV. The best way to do
> that is to pass in a list of byte ranges to CSV read options that specify
> where in the CSV you want to read. These byte ranges don't necessarily have
> to be aligned on line break boundaries, the CSV reader should just read until
> the end of the line, and skip anything before the first line break in a byte
> range.
> Based on discussion, the scope is going to be reduced here. The first
> implementation will support a single byte range that is already assumed to be
> aligned on byte boundaries.
> Will not handle quotes/returns and other edge cases.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
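
For illustration only, the unaligned byte-range behaviour proposed in the issue description (skip anything before the first line break in the range, then read through the end of the last line) could be sketched in plain Python as below. This is not Arrow's implementation: the function name `read_lines_in_range` and the rule for lines that start exactly on a range boundary are assumptions made for the sketch, and, like the reduced scope in the issue, it does not handle quoted fields that contain line breaks.

```python
import io


def read_lines_in_range(f, start, end):
    """Return the lines of binary file-like `f` owned by the byte range (start, end).

    Hypothetical helper, not part of Arrow. Assumed ownership rule: a range
    owns every line that begins after `start` and at or before `end`; the
    range starting at 0 also owns the first line.
    """
    f.seek(start)
    if start > 0:
        # Skip anything before the first line break. If `start` happens to
        # sit exactly on a line start, that line is still skipped here,
        # because the previous range reads it (see the loop condition).
        f.readline()
    lines = []
    # Read whole lines while the next line begins at or before `end`, so
    # the last line is read to its end even if it crosses `end`.
    while f.tell() <= end:
        line = f.readline()
        if not line:  # end of file
            break
        lines.append(line)
    return lines


# Usage: split a small CSV into three ranges that are not line-aligned.
data = b"a,b\n1,2\n3,4\n5,6\n"
ranges = [(0, 5), (5, 10), (10, len(data))]
chunks = [read_lines_in_range(io.BytesIO(data), s, e) for s, e in ranges]
```

With this rule every line is read by exactly one range, so independent workers can each take one range of the same file. Only the first range sees the header row, so the other workers would need the column names supplied separately.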