[GitHub] [arrow-datafusion] alamb commented on issue #5205: use more than one core/thread when querying CSV

via GitHub Tue, 07 Feb 2023 02:47:11 -0800


alamb commented on issue #5205:
URL: https://github.com/apache/arrow-datafusion/issues/5205#issuecomment-1420567792


   DuckDB does parallel CSV reading, FWIW:
   https://github.com/duckdb/duckdb/pull/5194
   
   It would be great to implement this feature in DataFusion.
   
   Regarding the "CSV escaping means you can't always know when a newline is a record delimiter" concern, I suggest:
   1. Default to parsing on multiple cores using a newline heuristic, and document that it may be incorrect in some cases
   2. Have a config setting that switches back to the slower but "handles newlines correctly" behavior
   
   A small variation of 1 would be to detect when the parallel split did not work well and produce an error that mentions the config setting.
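
   To make the suggestion concrete, here is a rough Python sketch of the idea. It is not DataFusion code. The function names, the `parallel` flag (standing in for the config setting), and the "all rows must have the same width" check are assumptions made for illustration. The width check is only a heuristic and will not catch every bad split, and Python threads are used here only to show the structure, not to get a real speedup.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


class ParallelSplitError(ValueError):
    """Raised when the newline heuristic appears to have split inside a record."""


def split_at_newlines(data: str, parts: int) -> list[str]:
    """Cut `data` into chunks of roughly equal size, each ending just after a newline.

    This is the heuristic: every newline is assumed to end a record,
    which is wrong when a quoted field contains a newline.
    """
    chunks, start = [], 0
    target = max(1, len(data) // parts)
    while start < len(data):
        end = data.find("\n", min(start + target, len(data)) - 1)
        end = len(data) if end == -1 else end + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def parse_chunk(chunk: str) -> list[list[str]]:
    """Parse one chunk with the standard (quote-aware) CSV reader."""
    return list(csv.reader(io.StringIO(chunk, newline="")))


def read_csv(data: str, parts: int = 4, parallel: bool = True) -> list[list[str]]:
    """Parse CSV text; `parallel=False` is the slow but newline-safe path."""
    if not parallel:
        return parse_chunk(data)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(parse_chunk, split_at_newlines(data, parts)))
    rows = [row for chunk_rows in results for row in chunk_rows]
    # Heuristic detection: a split inside a quoted field usually yields
    # rows whose column count differs from the first row.
    width = len(rows[0]) if rows else 0
    if any(len(row) != width for row in rows):
        raise ParallelSplitError(
            "parallel CSV split produced rows of differing width; "
            "retry with parallel=False"
        )
    return rows
```

   With this sketch, a file whose quoted fields contain newlines either parses correctly (if no chunk boundary lands inside a quoted field) or raises an error that points the user at the serial setting, rather than silently returning wrong rows in the cases the width check can see.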


