jorgecarleitao opened a new pull request #9084: URL: https://github.com/apache/arrow/pull/9084
This PR exposes two new functions:

* one to parse a single array from CSV (from one column of a `[StringRecord]`), and
* one to parse a `RecordBatch`.

The motivation for the first function is that parsing arrays is trivially parallelizable, so people may want to use e.g. `rayon` to iterate over the fields in parallel and build each array. IMO the `arrow` crate should not make any assumptions about how people want to parallelize this; it should only offer the functionality to do it, in the same way we do with kernels.

The motivation for the second function is that parsing (not the IO read) is the slowest operation in reading a CSV, and people may want to iterate over the CSV differently. The main use case here is to split the read of a single CSV file into multiple parts (using `seek`) and return record batches in parallel (DataFusion is the example here). Again, IMO the `arrow` crate should not make assumptions about how to perform this work, and should instead offer the necessary CPU-blocking core functionality for users to build on top of.
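
To illustrate why per-column parsing is trivially parallelizable, here is a minimal standard-library sketch. It is not the API added by this PR: `parse_i64_column` and `parse_columns_parallel` are hypothetical names, `Vec<Vec<String>>` stands in for a `[StringRecord]`, `Vec<Option<i64>>` stands in for an Arrow array with nulls, and scoped threads stand in for whatever parallelism (e.g. `rayon`) a caller would pick.

```rust
use std::thread;

/// Hypothetical helper: parse column `col` of every row into an optional i64.
/// Missing, empty, or unparsable fields become `None` (a stand-in for nulls).
fn parse_i64_column(rows: &[Vec<String>], col: usize) -> Vec<Option<i64>> {
    rows.iter()
        .map(|row| row.get(col).and_then(|field| field.trim().parse::<i64>().ok()))
        .collect()
}

/// Hypothetical helper: parse every column on its own scoped thread.
/// Each column only reads the shared rows, so no synchronization is needed.
fn parse_columns_parallel(rows: &[Vec<String>], n_cols: usize) -> Vec<Vec<Option<i64>>> {
    thread::scope(|s| {
        // Spawn all column parsers first so they run concurrently.
        let handles: Vec<_> = (0..n_cols)
            .map(|col| s.spawn(move || parse_i64_column(rows, col)))
            .collect();
        // Join in column order so the output order matches the input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("column parser panicked"))
            .collect()
    })
}
```

The point of the sketch is the shape of the work: the rows are read once, and each column can then be parsed independently, which is why the choice of how to parallelize can be left to the caller.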