jorgecarleitao opened a new pull request #9084:
URL: https://github.com/apache/arrow/pull/9084


   This PR exposes two new functions:
   
   * to parse a single array from CSV (out of a column from a `[StringRecord]`) 
and
   * to parse a RecordBatch
   
   The motivation for the first function is that parsing arrays is trivially 
parallelizable: each column can be parsed independently of the others. Thus, 
people may want to use e.g. `rayon` to iterate in parallel over fields and 
build each array. IMO the `arrow` crate should not make any assumption about 
how people want to parallelize this, and should only offer the functionality 
to do it, in the same way we do with kernels.
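
   As a rough illustration of the per-column idea (this is not the code in 
this PR), here is a minimal sketch that uses only the standard library. 
`parse_column`, `parse_columns_parallel` and the `Int64Column` alias are 
hypothetical names; a `Vec<Vec<String>>` stands in for `[StringRecord]`, a 
`Vec<Option<i64>>` stands in for an Arrow array, and scoped threads stand in 
for `rayon`.

```rust
use std::thread;

/// Hypothetical stand-in for a parsed Arrow array: one optional value per row.
type Int64Column = Vec<Option<i64>>;

/// Parse column `col` of every row as an integer.
/// Missing, empty or unparsable fields become `None` (an assumption of this sketch).
fn parse_column(rows: &[Vec<String>], col: usize) -> Int64Column {
    rows.iter()
        .map(|row| row.get(col).and_then(|field| field.trim().parse::<i64>().ok()))
        .collect()
}

/// Parse `n_cols` columns, one scoped thread per column.
/// Columns share the rows read-only, so no locking is needed.
fn parse_columns_parallel(rows: &[Vec<String>], n_cols: usize) -> Vec<Int64Column> {
    thread::scope(|s| {
        // Spawn every parser first, then join them in column order.
        let handles: Vec<_> = (0..n_cols)
            .map(|col| s.spawn(move || parse_column(rows, col)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("parser thread panicked"))
            .collect()
    })
}
```

   Calling `parse_columns_parallel(&rows, 2)` returns one vector per column, 
in column order. How the work is scheduled is entirely up to the caller, which 
is the point of exposing the single-array function.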
   
   The motivation for the second function stems from the fact that parsing (not 
the IO reading) is the slowest operation in reading a CSV, and people may want 
to iterate over the CSV differently. The main use-case here is to split the 
read of a single CSV file into multiple parts (using `seek`) and return 
record batches in parallel (DataFusion is the example here). Again, IMO the 
arrow crate should not make assumptions about how to perform this work, and 
should instead offer the necessary CPU-blocking core functionality for users 
to build on top of.
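
   To make the `seek`-based splitting concrete, here is a hedged sketch using 
only the standard library. `line_aligned_ranges` and `read_range` are 
hypothetical helpers, not part of this PR, and the sketch assumes that no 
quoted field contains a newline. Each range could then be handed to a separate 
worker that parses it into a record batch.

```rust
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};

/// Split `src` into at most `parts` byte ranges, each ending on a line boundary.
fn line_aligned_ranges<R: Read + Seek>(src: R, parts: u64) -> io::Result<Vec<(u64, u64)>> {
    let mut reader = BufReader::new(src);
    let len = reader.seek(SeekFrom::End(0))?;
    let parts = parts.max(1);
    // Ceiling division, so that no more than `parts` ranges are produced.
    let step = ((len + parts - 1) / parts).max(1);

    let mut ranges = Vec::new();
    let mut skipped = Vec::new();
    let mut start = 0u64;
    while start < len {
        // Tentative cut point; step back one byte so that a cut that already
        // sits right after a newline is kept as-is.
        let target = (start + step).min(len);
        reader.seek(SeekFrom::Start(target - 1))?;
        skipped.clear();
        let read = reader.read_until(b'\n', &mut skipped)? as u64;
        let end = target - 1 + read;
        ranges.push((start, end));
        start = end;
    }
    Ok(ranges)
}

/// Read the lines of one byte range. In practice each worker would open its
/// own file handle and pass it here.
fn read_range<R: Read + Seek>(mut src: R, (start, end): (u64, u64)) -> io::Result<Vec<String>> {
    src.seek(SeekFrom::Start(start))?;
    BufReader::new(src.take(end - start)).lines().collect()
}
```

   The ranges are computed once up front; after that the workers are 
independent, so the caller is free to choose threads, a thread pool, or async 
tasks to drive the CPU-bound parsing.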



