Re: Questions about working with large multi-file zipped CSV data

Ryan Kuhns Wed, 09 Nov 2022 13:08:03 -0800

Adam,

Thanks for pointing me to that. The fsspec approach looks like it will be 
helpful and the code snippet give me a good starting point.


-Ryan

> On Nov 9, 2022, at 2:42 PM, Kirby, Adam <[email protected]> wrote:
> 
> 
> Hi Ryan,
> 
> For your first question of a ZIP of multiple CSVs, I've had good luck [2] 
> combining fsspec [1] with pyarrow dataset to process ZIPs of multiple CSVs. 
> fsspec allows you to manage how much RAM you use on the read side with a few 
> different cache configs.
> 
> In case helpful, I sent a python snippet earlier. [3]
> 
> [1] 
> https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
> 
> [2] The idea was proposed by [email protected] on this list and proved 
> very helpful. 
> 
> [3] https://www.mail-archive.com/[email protected]/msg02176.html
> 
> 
>> On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <[email protected]> wrote:
>> Hi Everyone,
>> 
>> I’m using pyarrow to read, process, store and analyze some large files 
>> (~460GB zipped on 400+ files updated quarterly).
>> 
>> I’ve have a couple thoughts/questions come up as I have worked through the 
>> process. First two questions are mainly informational, wanting to confirm 
>> what I’ve inferred from existing docs.
>> 
>> 1. I know pyarrow has functionality to uncompress a zipped file with a 
>> single CSV in it, but in my case I have 3 files in the zip. I’m currently 
>> using Python’s zipfile to find and open the file I want in the zip and then 
>> I am reading it with pyarrow.read_csv. I wanted to confirm there isn’t 
>> pyarrow functionality that might be able to tell me the files in the zip and 
>> let me select the one to unzip and read.
>> 
>> 2. Some of the files end up being larger than memory when unzipped. In this 
>> case I’m using the file size to switch over and use open_csv instead of 
>> read_csv. Is there any plan for open_csv to be multithreaded in a future 
>> release (didn’t see anything on Jira, but I’m not great at searching on it)?
>> 
>> 3. My data has lots of columns that are dimensions (with low cardinality) 
>> with longish string values and a large number of rows. Since I have files 
>> getting close to or above my available memory when unzipped, I need to be as 
>> memory efficient as possible. Converting these to dictionaries via 
>> ConvertOptions helps with the in-memory size. But then I get errors when 
>> looking to join tables together later (due to functionality to unify 
>> dictionaries not being implemented yet). Is that something that will be 
>> added? How about the ability to provide a user dictionary that should be 
>> used in the encoding (as optional param, fallback to current functionality 
>> when not provided). Seems like that would reduce the need to infer the 
>> dictionary from the data when encoding. It would be nice to ensure the same 
>> dictionary mapping is used for a column across each file I read in. It seems 
>> like I can’t guarantee that currently. A related feature that would solve my 
>> issue would be a way to easily map a columns values to other values on read. 
>> I’d imagine this would be something in ConvertOptions, where you could 
>> specify a column and the mapping to use (parameter accepting list of name, 
>> mapping tuples?). The end result would be the ability to convert a string 
>> column to something like int16 on read via the mapping. This would be more 
>> space efficient and also avoid the inability to join on dictionary columns I 
>> am seeing currently. 
>> 
>> Thanks,
>> 
>> Ryan
>>

Re: Questions about working with large multi-file zipped CSV data

Reply via email to