[GitHub] [arrow-datafusion] kmitchener opened a new issue, #5205: use more than one core/thread when querying CSV

via GitHub Mon, 06 Feb 2023 11:12:32 -0800


kmitchener opened a new issue, #5205:
URL: https://github.com/apache/arrow-datafusion/issues/5205

**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**

When querying a CSV file on local disk, only one thread is used, leaving the
system mostly idle. For example, I have a 9 GB gzip-compressed CSV file that
I'd like to query and convert to Parquet format. However, simple counts or
GROUP BYs on a single column can take 5 minutes or more. 5 minutes per query
isn't too unexpected (uncompressed, it's a 100 GB file: 45 million records x
242 columns), but it's nowhere near utilizing all of the machine's resources.
It seems like there should be a way to move more of the work to other threads
so that querying and conversion to Parquet could be substantially faster.

When profiled, it looks like this (green is the only busy thread):

![image](https://user-images.githubusercontent.com/692497/217054812-d3cf83f3-b596-4685-b727-2f910bfc4ebc.png)

and if you zoom into what that thread is doing, it's following a pattern of:
* read data from disk
* then, for about 10 cycles:
  * decompress a chunk
  * read CSV records (csv_core) from the decompressed chunk
* read again from disk and repeat the cycle

![image](https://user-images.githubusercontent.com/692497/217063170-ef639be8-431b-403b-bbf0-4c6ffc0c3f1c.png)

**Describe the solution you'd like**

Better utilization of the machine, work done in more than one thread and
sustained reads from disk.

**Describe alternatives you've considered**

Is it possible to offload the decompression and the processing of bytes into
CSV records to other threads? I had an idea to implement some kind of
fan-out/fan-in somewhere under CsvExec: one thread would only read from disk
and send the raw bytes to other threads, which would decompress, read the CSV
(the CPU-heavy work is in the CSV reading), and convert to RecordBatches; the
results would then fan back in so that CsvExec streams RecordBatches as
expected. Is there a better way to do it?
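
To make the fan-out/fan-in idea concrete, here is a rough sketch using only the Rust standard library. It is not DataFusion code and does not touch CsvExec. `parse_chunk` and `fan_out_fan_in` are hypothetical names, and `parse_chunk` is only a stand-in for the CPU-heavy decompress/CSV-decode step. The sketch also assumes each chunk can be processed independently and ends on a record boundary. That probably does not hold for a single gzip stream, which generally has to be decompressed sequentially, so decompression might have to stay on the reader side.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the CPU-heavy step (decompress + CSV decode).
/// Here it only counts the comma-separated fields on each line of a chunk.
fn parse_chunk(chunk: &str) -> Vec<usize> {
    chunk
        .lines()
        .filter(|line| !line.is_empty())
        .map(|line| line.split(',').count())
        .collect()
}

/// Fan chunks out to `workers` threads, then fan the results back in,
/// restoring the original chunk order.
fn fan_out_fan_in(chunks: Vec<String>, workers: usize) -> Vec<Vec<usize>> {
    let workers = workers.max(1);
    // Bounded work queue so the reader cannot run far ahead of the workers.
    let (work_tx, work_rx) = mpsc::sync_channel::<(usize, String)>(workers * 2);
    let (result_tx, result_rx) = mpsc::channel::<(usize, Vec<usize>)>();
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let work_rx = Arc::new(Mutex::new(work_rx));

    let mut handles = Vec::new();
    for _ in 0..workers {
        let work_rx = Arc::clone(&work_rx);
        let result_tx = result_tx.clone();
        handles.push(thread::spawn(move || loop {
            // The lock is held only while receiving, not while parsing.
            let msg = work_rx.lock().unwrap().recv();
            match msg {
                Ok((idx, chunk)) => result_tx.send((idx, parse_chunk(&chunk))).unwrap(),
                Err(_) => break, // work channel closed: no more chunks
            }
        }));
    }
    // Drop the original sender so the result channel closes once workers finish.
    drop(result_tx);

    // "Reader" thread: in the real case this would only read bytes from disk.
    let reader = thread::spawn(move || {
        for (idx, chunk) in chunks.into_iter().enumerate() {
            work_tx.send((idx, chunk)).unwrap();
        }
        // work_tx is dropped here, which closes the work channel.
    });

    // Fan-in: results arrive in any order; the index restores input order.
    let mut ordered = BTreeMap::new();
    for (idx, parsed) in result_rx {
        ordered.insert(idx, parsed);
    }

    reader.join().unwrap();
    for handle in handles {
        handle.join().unwrap();
    }
    ordered.into_values().collect()
}

fn main() {
    let chunks = vec![
        "a,b,c\n1,2,3\n".to_string(),
        "4,5,6\n".to_string(),
        "7,8\n9,10,11,12\n".to_string(),
    ];
    let counts = fan_out_fan_in(chunks, 4);
    println!("{:?}", counts); // prints [[3, 3], [3], [2, 4]]
}
```

This version collects every result before returning, which keeps the sketch short. A streaming version would instead emit each batch as soon as all earlier indices have arrived.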

