Dandandan edited a comment on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-753658416


   This looks cool @jorgecarleitao !
   
   Some thoughts on the future of the CSV and other parsers:
   
   * It might be worth exploring whether we can use the `cast` (or a similar) 
kernel of Arrow to parse the data. The benefit would be that we can just load 
the data (as bytes / strings) into arrays and reuse the existing parsing logic 
in Arrow. I think this is interesting because from that point on the code can 
more easily be vectorized (SIMD), parallelized, etc. It would also reduce code 
duplication and create more incentive to improve the `cast` kernels, which 
benefits more than "only" one parser.
   * For further optimization it might be worth moving away from 
`StringRecord`s at some point (and using `csv_core` instead), as they carry 
quite some overhead compared to "just" loading the bytes from the 
file. How would this fit into your suggestion?
   * For a user like DataFusion, it often might not make sense to have 
parallelism on the file level (if there are many files), so I think it makes 
sense not to make the parser slower or more resource-hungry for a single thread. 
This is more of a general thing we should keep in mind.
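
To make the first point a bit more concrete, here is a rough, std-only sketch of the "load strings first, cast per column afterwards" shape. It does not use the Arrow API: `split_columns`, `cast_utf8_to_i64` and `cast_utf8_to_f64` are hypothetical names made up for illustration, the splitting ignores quoting and escaping, every row is assumed to have the same number of fields, and mapping unparsable values to `None` is an assumption about how a cast-style kernel would treat them.

```rust
/// Step 1: split raw CSV text into per-column string buffers.
/// No values are parsed here. Assumes no quoting/escaping and that
/// every row has exactly `num_columns` fields.
fn split_columns(data: &str, num_columns: usize) -> Vec<Vec<&str>> {
    let mut columns = vec![Vec::new(); num_columns];
    for line in data.lines() {
        for (idx, field) in line.split(',').take(num_columns).enumerate() {
            columns[idx].push(field);
        }
    }
    columns
}

/// Step 2: a cast-like pass over one whole column.
/// Values that fail to parse become `None` (standing in for a null).
fn cast_utf8_to_i64(column: &[&str]) -> Vec<Option<i64>> {
    column.iter().map(|v| v.trim().parse::<i64>().ok()).collect()
}

/// Same idea for a float column.
fn cast_utf8_to_f64(column: &[&str]) -> Vec<Option<f64>> {
    column.iter().map(|v| v.trim().parse::<f64>().ok()).collect()
}

fn main() {
    let data = "1,1.5\n2,abc\nx,3.0\n";
    let columns = split_columns(data, 2);

    // Each column is converted in one tight loop, independently of the others.
    let ints = cast_utf8_to_i64(&columns[0]);
    let floats = cast_utf8_to_f64(&columns[1]);

    println!("{:?}", ints);
    println!("{:?}", floats);
}
```

The point of the split is that the reader only has to produce string columns, while all type-specific parsing lives in one place per target type, which is where vectorization or parallelism across columns could later be added.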
   



