Hi everyone,

I'm using pyarrow to read, process, store, and analyze some large files (~460 GB zipped, spread across 400+ files, updated quarterly).
I've had a couple of thoughts/questions come up as I've worked through the process. The first two questions are mainly informational; I want to confirm what I've inferred from the existing docs.

1. I know pyarrow has functionality to decompress a zipped file containing a single CSV, but in my case there are three files in the zip. I'm currently using Python's zipfile to find and open the file I want inside the zip, then reading it with pyarrow.csv.read_csv (see the first sketch below). I wanted to confirm there isn't pyarrow functionality that can list the files in a zip and let me select which one to extract and read.

2. Some of the files end up larger than memory when unzipped. In that case I use the uncompressed file size to switch from read_csv to open_csv (second sketch below). Is there any plan for open_csv to become multithreaded in a future release? (I didn't see anything on Jira, but I'm not great at searching it.)

3. My data has lots of dimension columns (low cardinality, longish string values) and a large number of rows. Since some files get close to or above my available memory when unzipped, I need to be as memory efficient as possible. Converting these columns to dictionaries via ConvertOptions helps with the in-memory size (third sketch below), but I then get errors when joining tables together later, because the functionality to unify dictionaries is not implemented yet. Is that something that will be added?

   Relatedly, how about the ability to supply a user-provided dictionary to use in the encoding (as an optional parameter, falling back to the current behavior when not provided)? That would reduce the need to infer the dictionary from the data when encoding, and it would let me ensure the same dictionary mapping is used for a column across every file I read in; it seems I can't guarantee that currently. A related feature that would also solve my issue is a way to easily map a column's values to other values on read. I'd imagine this living in ConvertOptions, where you could specify a column and the mapping to use (a parameter accepting a list of (name, mapping) tuples?). The end result would be the ability to convert a string column to something like int16 on read via the mapping, which would be more space efficient and would also avoid the inability to join on dictionary columns I'm seeing currently (the fourth sketch below shows a post-read approximation of what I mean).
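Sketch 1, for question 1: a minimal version of what I'm doing today. The archive and member names are placeholders for my real files.

```python
import zipfile
import pyarrow.csv as pv

# Archive and member names are placeholders.
with zipfile.ZipFile("quarterly_update.zip") as zf:
    print(zf.namelist())  # three members; pick out the CSV I actually want
    with zf.open("detail.csv") as f:
        table = pv.read_csv(f)  # read_csv accepts a file-like object
```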
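Sketch 2, for question 2: how I switch between the two readers based on the uncompressed size from the zip's metadata. The threshold and the process() handler are placeholders.

```python
import zipfile
import pyarrow.csv as pv

MEMORY_BUDGET = 32 * 2**30  # placeholder threshold, in bytes

with zipfile.ZipFile("quarterly_update.zip") as zf:
    uncompressed = zf.getinfo("detail.csv").file_size  # size after unzipping
    with zf.open("detail.csv") as f:
        if uncompressed < MEMORY_BUDGET:
            table = pv.read_csv(f)   # multithreaded, loads the whole table
        else:
            reader = pv.open_csv(f)  # single-threaded, streaming
            for batch in reader:     # yields RecordBatch objects
                process(batch)       # hypothetical per-batch handler
```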
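Sketch 3, for question 3: the dictionary encoding I'm applying via ConvertOptions. Column names are made up; auto_dict_encode=True would be the blanket alternative.

```python
import pyarrow as pa
import pyarrow.csv as pv

# "state" and "product_line" stand in for my low-cardinality dimension columns.
convert_options = pv.ConvertOptions(
    column_types={
        "state": pa.dictionary(pa.int32(), pa.string()),
        "product_line": pa.dictionary(pa.int32(), pa.string()),
    }
)
table = pv.read_csv("detail.csv", convert_options=convert_options)
```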
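Sketch 4: a rough post-read approximation of the mapping feature I'm asking for, using pyarrow.compute.index_in against a fixed value set that I would share across all files. Column and value names are made up, and I'm assuming index_in accepts value_set directly as a keyword (older releases may require pc.SetLookupOptions). The point is that doing this in ConvertOptions on read would avoid materializing the string column at all.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Fixed, user-supplied dictionary shared across every quarterly file.
state_values = pa.array(["ALABAMA", "ALASKA", "ARIZONA"])  # placeholder values

def encode_column(table, name, value_set):
    """Replace a string column with int16 codes into value_set."""
    codes = pc.index_in(table[name], value_set=value_set)  # int32 indices
    idx = table.schema.get_field_index(name)
    return table.set_column(idx, name, codes.cast(pa.int16()))

table = encode_column(table, "state", state_values)
```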
Thanks,
Ryan