Re: Questions about working with large multi-file zipped CSV data

Kirby, Adam Wed, 09 Nov 2022 11:42:49 -0800

Hi Ryan,

For your first question, about a ZIP containing multiple CSVs: I've had
good luck [2] combining fsspec [1] with pyarrow's dataset API to process
such archives. fsspec's ZIP filesystem can list the members of the archive
and open only the ones you want, and it lets you manage how much RAM is
used on the read side through a few different cache configurations.


In case it's helpful, I sent a Python snippet earlier. [3]

[1]
https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html

[2] The idea was proposed by [email protected] on this list and proved
very helpful.

[3] https://www.mail-archive.com/[email protected]/msg02176.html


On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <[email protected]> wrote:

> Hi Everyone,
>
> I’m using pyarrow to read, process, store and analyze some large files
> (~460GB zipped on 400+ files updated quarterly).
>
> I’ve had a couple of thoughts/questions come up as I have worked through the
> process. First two questions are mainly informational, wanting to confirm
> what I’ve inferred from existing docs.
>
> 1. I know pyarrow has functionality to uncompress a zipped file with a
> single CSV in it, but in my case I have 3 files in the zip. I’m currently
> using Python’s zipfile to find and open the file I want in the zip and then
> I am reading it with pyarrow.read_csv. I wanted to confirm there isn’t
> pyarrow functionality that might be able to tell me the files in the zip
> and let me select the one to unzip and read.
>
> 2. Some of the files end up being larger than memory when unzipped. In
> this case I’m using the file size to switch over and use open_csv instead
> of read_csv. Is there any plan for open_csv to be multithreaded in a future
> release (didn’t see anything on Jira, but I’m not great at searching on it)?
>
> 3. My data has lots of columns that are dimensions (with low cardinality)
> with longish string values and a large number of rows. Since I have files
> getting close to or above my available memory when unzipped, I need to be
> as memory efficient as possible. Converting these to dictionaries via
> ConvertOptions helps with the in-memory size. But then I get errors when
> looking to join tables together later (due to functionality to unify
> dictionaries not being implemented yet). Is that something that will be
> added? How about the ability to provide a user dictionary that should be
> used in the encoding (as optional param, fallback to current functionality
> when not provided). Seems like that would reduce the need to infer the
> dictionary from the data when encoding. It would be nice to ensure the same
> dictionary mapping is used for a column across each file I read in. It
> seems like I can’t guarantee that currently. A related feature that would
> solve my issue would be a way to easily map a column's values to other
> values on read. I’d imagine this would be something in ConvertOptions,
> where you could specify a column and the mapping to use (parameter
> accepting list of name, mapping tuples?). The end result would be the
> ability to convert a string column to something like int16 on read via the
> mapping. This would be more space efficient and also avoid the inability to
> join on dictionary columns I am seeing currently.
>
> Thanks,
>
> Ryan
>
>
