Ziheng Wang created ARROW-17380:
-----------------------------------

             Summary: Tag record batches with start_byte and end_byte information
                 Key: ARROW-17380
                 URL: https://issues.apache.org/jira/browse/ARROW-17380
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Ziheng Wang
            Assignee: Ziheng Wang

It might be desirable for a record batch to carry information about where it came from in the source dataset. This can serve a few purposes:
* Rereading a particular record batch without rereading the entire fragment
* Easily tracking how much of a particular (file) dataset has been consumed

It could also be useful for debugging if a record batch results in an error downstream.

The plan is to add attributes like these (start_byte and end_byte) here: [https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L923], so that the Scanner tags each record batch with them as it is generated.

This is most useful for file-based formats like CSV. In Parquet it is less necessary, since record batches (usually) correspond to row groups and row group IDs can serve this function.
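
To illustrate the two use cases above, here is a minimal sketch in plain Python using only the standard library. It is not Arrow code and does not reflect how the Scanner would implement this. The names `TaggedBatch`, `scan_csv` and `reread` are hypothetical, the choice of an exclusive `end_byte` is an assumption, and the CSV parsing is a naive comma split with no quoting support.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Iterator, List


@dataclass
class TaggedBatch:
    """Hypothetical stand-in for a record batch that knows its source byte range."""
    rows: List[List[str]]
    start_byte: int  # offset of the batch's first byte in the source file
    end_byte: int    # offset one past the batch's last byte (assumed exclusive)


def scan_csv(source: BinaryIO, rows_per_batch: int) -> Iterator[TaggedBatch]:
    """Yield batches of rows, each tagged with the byte range it was read from."""
    source.readline()  # skip the header line
    rows: List[List[str]] = []
    start = source.tell()
    while True:
        line = source.readline()
        if not line:
            break
        # Naive parsing: split on commas, no quoting or escaping.
        rows.append(line.decode("utf-8").rstrip("\r\n").split(","))
        if len(rows) == rows_per_batch:
            end = source.tell()
            yield TaggedBatch(rows, start, end)
            rows, start = [], end
    if rows:  # final, possibly short, batch
        yield TaggedBatch(rows, start, source.tell())


def reread(source: BinaryIO, start_byte: int, end_byte: int) -> List[List[str]]:
    """Re-read one batch by seeking to its range instead of rescanning the file."""
    source.seek(start_byte)
    chunk = source.read(end_byte - start_byte)
    return [line.split(",") for line in chunk.decode("utf-8").splitlines()]


DATA = b"id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n"
batches = list(scan_csv(io.BytesIO(DATA), rows_per_batch=2))

# Use case 1: reread a single batch from its byte range.
second = batches[1]
assert reread(io.BytesIO(DATA), second.start_byte, second.end_byte) == second.rows

# Use case 2: track how much of the file has been consumed so far.
progress = second.end_byte / len(DATA)
```

With tags like these, a consumer that fails on a batch can report the exact byte range that produced it, which covers the debugging case as well.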