alamb commented on issue #5205: URL: https://github.com/apache/arrow-datafusion/issues/5205#issuecomment-1420567792
DuckDB does parallel CSV reading, FWIW: https://github.com/duckdb/duckdb/pull/5194

It would be great to implement this feature in DataFusion.

Regarding the "csv escaping means you can't always know when a newline is a record delimiter" concern, I suggest:

1. Default to parsing using multiple cores with a newline heuristic, and document that it may be incorrect in some cases
2. Have a config setting that switches back to the slower but "handles newlines correctly" behavior

A small variation of 1 would be to detect when the parallel split did not work well and produce an error that mentions the config setting.
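To make the two options and the "detect and error" variation concrete, here is a rough sketch in Rust. It is not DataFusion code: `CsvSplitMode`, `split_ranges`, and the setting name `csv.parallel_newline_split` are hypothetical names made up for illustration. It assumes `"` is the quote character and that quotes inside a field are escaped by doubling, so a chunk with an odd number of quotes indicates that a split landed inside a quoted field.

```rust
/// Hypothetical option for this sketch; not an existing DataFusion setting.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CsvSplitMode {
    /// Split on raw newlines so chunks can be parsed on multiple cores
    /// (fast, but wrong if a quoted field contains a newline).
    ParallelNewlineHeuristic,
    /// Treat the whole input as one chunk (slower, handles newlines correctly).
    Sequential,
}

/// Returns byte ranges `[start, end)` that together cover `data`.
/// In heuristic mode every range ends just after a `\n` (or at end of input).
pub fn split_ranges(
    data: &[u8],
    target_chunks: usize,
    mode: CsvSplitMode,
) -> Result<Vec<(usize, usize)>, String> {
    if mode == CsvSplitMode::Sequential || target_chunks <= 1 || data.is_empty() {
        return Ok(vec![(0, data.len())]);
    }

    let approx = (data.len() / target_chunks).max(1);
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < data.len() {
        // Guess a boundary, then advance to just past the next newline.
        let guess = (start + approx).min(data.len());
        let end = match data[guess..].iter().position(|&b| b == b'\n') {
            Some(offset) => guess + offset + 1,
            None => data.len(),
        };
        ranges.push((start, end));
        start = end;
    }

    // The "detect and error" variation: an odd quote count means the chunk
    // starts or ends inside a quoted field (under the assumptions above).
    for &(s, e) in &ranges {
        let quotes = data[s..e].iter().filter(|&&b| b == b'"').count();
        if quotes % 2 != 0 {
            return Err(format!(
                "parallel CSV split at byte {} fell inside a quoted field; \
                 set the (hypothetical) option csv.parallel_newline_split=false",
                e
            ));
        }
    }
    Ok(ranges)
}
```

With an input such as `a,b\n1,"x\ny"\n2,z\n`, the heuristic mode would return the error rather than silently misparse, and `CsvSplitMode::Sequential` would return a single range covering the whole input. A real implementation would work on file offsets rather than an in-memory buffer, which this sketch does not attempt.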
