Hi everyone,

I'm using pyarrow to read, process, store, and analyze some large files (~460 GB zipped, spread across 400+ files, updated quarterly).
I've had a couple of thoughts/questions come up as I've worked through the process. The first two questions are mainly informational; I want to confirm what I've inferred from the existing docs.

1. I know pyarrow has functionality to decompress a zipped file containing a single CSV, but in my case there are three files in the zip. I'm currently using Python's zipfile to find and open the file I want inside the zip, then reading it with pyarrow.csv.read_csv (see the first sketch below). I wanted to confirm there isn't pyarrow functionality that can list the files in a zip and let me select which one to extract and read.

2. Some of the files end up larger than memory when unzipped. In that case I use the uncompressed file size to switch from read_csv to open_csv (second sketch below). Is there any plan for open_csv to become multithreaded in a future release? (I didn't see anything on Jira, but I'm not great at searching it.)

3. My data has lots of dimension columns (low cardinality, longish string values) and a large number of rows. Since some files get close to or above my available memory when unzipped, I need to be as memory efficient as possible. Converting these columns to dictionaries via ConvertOptions helps with the in-memory size (third sketch below), but I then get errors when joining tables together later, because the functionality to unify dictionaries is not implemented yet. Is that something that will be added?

   Relatedly, how about the ability to supply a user-provided dictionary to use in the encoding (as an optional parameter, falling back to the current behavior when not provided)? That would reduce the need to infer the dictionary from the data when encoding, and it would let me ensure the same dictionary mapping is used for a column across every file I read in; it seems I can't guarantee that currently. A related feature that would also solve my issue is a way to easily map a column's values to other values on read. I'd imagine this living in ConvertOptions, where you could specify a column and the mapping to use (a parameter accepting a list of (name, mapping) tuples?). The end result would be the ability to convert a string column to something like int16 on read via the mapping, which would be more space efficient and would also avoid the inability to join on dictionary columns I'm seeing currently (the fourth sketch below shows a post-read approximation of what I mean).
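Sketch 1, for question 1: a minimal version of what I'm doing today. The archive and member names are placeholders for my real files.

```python
import zipfile
import pyarrow.csv as pv

# Archive and member names are placeholders.
with zipfile.ZipFile("quarterly_update.zip") as zf:
    print(zf.namelist())  # three members; pick out the CSV I actually want
    with zf.open("detail.csv") as f:
        table = pv.read_csv(f)  # read_csv accepts a file-like object
```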
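Sketch 2, for question 2: how I switch between the two readers based on the uncompressed size from the zip's metadata. The threshold and the process() handler are placeholders.

```python
import zipfile
import pyarrow.csv as pv

MEMORY_BUDGET = 32 * 2**30  # placeholder threshold, in bytes

with zipfile.ZipFile("quarterly_update.zip") as zf:
    uncompressed = zf.getinfo("detail.csv").file_size  # size after unzipping
    with zf.open("detail.csv") as f:
        if uncompressed < MEMORY_BUDGET:
            table = pv.read_csv(f)   # multithreaded, loads the whole table
        else:
            reader = pv.open_csv(f)  # single-threaded, streaming
            for batch in reader:     # yields RecordBatch objects
                process(batch)       # hypothetical per-batch handler
```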
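Sketch 3, for question 3: the dictionary encoding I'm applying via ConvertOptions. Column names are made up; auto_dict_encode=True would be the blanket alternative.

```python
import pyarrow as pa
import pyarrow.csv as pv

# "state" and "product_line" stand in for my low-cardinality dimension columns.
convert_options = pv.ConvertOptions(
    column_types={
        "state": pa.dictionary(pa.int32(), pa.string()),
        "product_line": pa.dictionary(pa.int32(), pa.string()),
    }
)
table = pv.read_csv("detail.csv", convert_options=convert_options)
```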
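Sketch 4: a rough post-read approximation of the mapping feature I'm asking for, using pyarrow.compute.index_in against a fixed value set that I would share across all files. Column and value names are made up, and I'm assuming index_in accepts value_set directly as a keyword (older releases may require pc.SetLookupOptions). The point is that doing this in ConvertOptions on read would avoid materializing the string column at all.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Fixed, user-supplied dictionary shared across every quarterly file.
state_values = pa.array(["ALABAMA", "ALASKA", "ARIZONA"])  # placeholder values

def encode_column(table, name, value_set):
    """Replace a string column with int16 codes into value_set."""
    codes = pc.index_in(table[name], value_set=value_set)  # int32 indices
    idx = table.schema.get_field_index(name)
    return table.set_column(idx, name, codes.cast(pa.int16()))

table = encode_column(table, "state", state_values)
```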
Thanks,
Ryan