> 2. Some of the files end up being larger than memory when unzipped. In this
> case I’m using the file size to switch over and use open_csv instead of
> read_csv. Is there any plan for open_csv to be multithreaded in a future
> release (didn’t see anything on Jira, but I’m not great at searching on it)?

There is a PR in progress[1] which will add parallel reads to `open_csv` when
provided with a random access file (i.e. not just an input stream). This is
important when reading from S3 but perhaps not as big of a deal when reading
from a local disk (which usually doesn't support a ton of parallelism). The
streaming CSV reader's parsing is also not very parallel and could be improved
(I would presume it could get pretty close to read_csv performance). However,
I don't know of anyone currently working on this.
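In the meantime, here is a minimal sketch of how the streaming reader can be
used for the larger-than-memory files (the paths and block_size below are
placeholders, not anything from your setup):

    import pyarrow as pa
    import pyarrow.csv as csv
    import pyarrow.parquet as pq

    # Only ~block_size bytes of raw CSV (plus the decoded batch) are held in
    # memory at a time; the parsing itself is single-threaded today.
    read_opts = csv.ReadOptions(block_size=32 << 20)
    reader = csv.open_csv("extracted/big_quarter.csv", read_options=read_opts)

    writer = None
    try:
        for batch in reader:  # the streaming reader yields RecordBatches lazily
            if writer is None:
                writer = pq.ParquetWriter("big_quarter.parquet", reader.schema)
            writer.write_table(pa.Table.from_batches([batch]))
    finally:
        if writer is not None:
            writer.close()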
> If I go with this approach, will the dataset to batches read be
> single-threaded (reading csv format) like open_csv? That is obviously not an
> issue for large files I would have had to use open_csv for anyway, but if
> the eventual dataset API read is single threaded, I might still want to use
> read_csv and process columns post read for smaller datasets.

dataset.to_batches is built on top of the streaming CSV reader (i.e.
open_csv). However, any compute work done by datasets (e.g. dictionary
encoding, joins, casts, etc.) will be done in parallel.
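To make that concrete, a rough sketch of the dataset-to-write_dataset flow
you describe (the column name, types and paths are made-up placeholders; the
dictionary encoding is just one example of a scan-time conversion):

    import pyarrow as pa
    import pyarrow.csv as csv
    import pyarrow.dataset as ds

    # Dictionary-encode a low-cardinality string column at scan time
    # ("loan_status" and the paths are placeholder names).
    csv_format = ds.CsvFileFormat(
        convert_options=csv.ConvertOptions(
            column_types={"loan_status": pa.dictionary(pa.int32(), pa.string())}
        )
    )
    dataset = ds.dataset("extracted/quarter_1.csv", format=csv_format)

    # Either pull batches yourself...
    for batch in dataset.to_batches():
        print(batch.num_rows)

    # ...or stream straight back out; the whole file is never materialized.
    ds.write_dataset(dataset, "parquet_out", format="parquet")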
[1] https://github.com/apache/arrow/pull/14269

On Wed, Nov 9, 2022 at 3:21 PM Ryan Kuhns <[email protected]> wrote:
>
> Hi Everyone,
>
> Adam’s reply got me thinking about using the dataset API to overcome the
> problem I was facing in my third question. It seems like I could use the
> column projection to provide a mapping from strings to integer lookup
> values. Then, similar to the writing-large-amounts-of-data example
> (https://arrow.Apache.org/docs/Python/dataset.html), I can pass the dataset
> to write_dataset and never have everything in memory.
>
> If I go with this approach, will the dataset to batches read be
> single-threaded (reading csv format) like open_csv? That is obviously not
> an issue for large files I would have had to use open_csv for anyway, but
> if the eventual dataset API read is single threaded, I might still want to
> use read_csv and process columns post read for smaller datasets.
>
> Thanks,
>
> Ryan
>
> On Nov 9, 2022, at 4:07 PM, Ryan Kuhns <[email protected]> wrote:
>
> Adam,
>
> Thanks for pointing me to that. The fsspec approach looks like it will be
> helpful and the code snippet gives me a good starting point.
>
> -Ryan
>
> On Nov 9, 2022, at 2:42 PM, Kirby, Adam <[email protected]> wrote:
>
> Hi Ryan,
>
> For your first question about a ZIP of multiple CSVs, I’ve had good luck
> [2] combining fsspec [1] with pyarrow dataset to process ZIPs of multiple
> CSVs. fsspec allows you to manage how much RAM you use on the read side
> with a few different cache configs.
>
> In case it’s helpful, I sent a Python snippet earlier. [3]
>
> [1] https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
>
> [2] The idea was proposed by [email protected] on this list and proved
> very helpful.
>
> [3] https://www.mail-archive.com/[email protected]/msg02176.html
>
> On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <[email protected]> wrote:
>>
>> Hi Everyone,
>>
>> I’m using pyarrow to read, process, store and analyze some large files
>> (~460GB zipped, across 400+ files, updated quarterly).
>>
>> I’ve had a couple of thoughts/questions come up as I’ve worked through
>> the process. The first two questions are mainly informational; I want to
>> confirm what I’ve inferred from existing docs.
>>
>> 1. I know pyarrow has functionality to uncompress a zipped file with a
>> single CSV in it, but in my case I have 3 files in the zip. I’m currently
>> using Python’s zipfile to find and open the file I want in the zip, and
>> then I am reading it with pyarrow.read_csv. I wanted to confirm there
>> isn’t pyarrow functionality that might be able to tell me the files in
>> the zip and let me select the one to unzip and read.
>>
>> 2. Some of the files end up being larger than memory when unzipped. In
>> this case I’m using the file size to switch over and use open_csv instead
>> of read_csv. Is there any plan for open_csv to be multithreaded in a
>> future release (didn’t see anything on Jira, but I’m not great at
>> searching on it)?
>>
>> 3. My data has lots of columns that are dimensions (with low cardinality)
>> with longish string values, and a large number of rows. Since I have
>> files getting close to or above my available memory when unzipped, I need
>> to be as memory efficient as possible. Converting these columns to
>> dictionaries via ConvertOptions helps with the in-memory size. But then I
>> get errors when trying to join tables together later (because the
>> functionality to unify dictionaries is not implemented yet). Is that
>> something that will be added? How about the ability to provide a user
>> dictionary to be used in the encoding (as an optional param, falling back
>> to the current behavior when not provided)? It seems like that would
>> reduce the need to infer the dictionary from the data when encoding. It
>> would be nice to ensure the same dictionary mapping is used for a column
>> across each file I read in; it seems like I can’t guarantee that
>> currently. A related feature that would solve my issue would be a way to
>> easily map a column’s values to other values on read. I’d imagine this
>> would be something in ConvertOptions, where you could specify a column
>> and the mapping to use (a parameter accepting a list of (name, mapping)
>> tuples?). The end result would be the ability to convert a string column
>> to something like int16 on read via the mapping. This would be more space
>> efficient and would also avoid the inability to join on dictionary
>> columns that I am seeing currently.
>>
>> Thanks,
>>
>> Ryan
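For the zip-of-multiple-CSVs question, something along the lines of the
fsspec approach Adam describes above looks roughly like this (the archive
path and member names are invented; Adam's earlier snippet is the better
reference):

    import fsspec
    import pyarrow.dataset as ds

    # Treat the zip archive as a filesystem; members are read on demand
    # rather than extracted up front. Paths below are placeholders.
    zip_fs = fsspec.filesystem("zip", fo="data/2022Q3.zip")
    print(zip_fs.ls("/", detail=False))  # list the CSV members inside

    dataset = ds.dataset(
        ["member_a.csv", "member_b.csv"],  # pick the members you want
        filesystem=zip_fs,                 # pyarrow wraps fsspec filesystems
        format="csv",
    )
    ds.write_dataset(dataset, "parquet_out", format="parquet")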
