Hi Arrow community,
I just subscribed to this mailing list. First, let me thank all the
contributors for this great project!
I have a question about which pyarrow API to use for a specific use case. I
need to update/append data to a large partitioned parquet dataset using
pyarrow. I receive the data in small batches that are transformed into small
pandas dataframes. All the new dataframes have the same schema. The data
can be saved locally or on a cloud object store (S3).
When I receive a new batch, I need to update the parquet dataset with the
new rows in the pandas dataframe. Essentially, I need to save additional
xyz.parquet files in the appropriate partition subfolders, without removing
or overwriting pre-existing .parquet files in the same partition folder.
My goal is to end up with a dataset like this:
parquet_dataset/
    partition=1/
        a.parquet
        b.parquet
        c.parquet
    partition=2/
        a.parquet
        b.parquet
where each individual parquet file contains a single batch of data
(actually, a single batch may be split across 2 or more partitions).
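For reference, the manual logic I have in mind is roughly the following
(just a sketch; the helper name, the "partition" column and the uuid-based
file names are purely illustrative):

    import uuid
    from pathlib import Path

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    def append_batch(df: pd.DataFrame, root: str = "parquet_dataset") -> None:
        # split the incoming batch by partition value and write each chunk
        # to a brand new file, so pre-existing files are never touched
        for value, chunk in df.groupby("partition"):
            folder = Path(root) / f"partition={value}"
            folder.mkdir(parents=True, exist_ok=True)
            table = pa.Table.from_pandas(chunk.drop(columns=["partition"]))
            pq.write_table(table, folder / f"{uuid.uuid4().hex}.parquet")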
Is there a preferred pyarrow API to achieve this continuous update?
I can implement all of this logic manually (roughly as sketched above), but,
ideally, I would like to defer to pyarrow the task of splitting the input
dataframe into partitions and saving each chunk in the appropriate subfolder,
generating a file name that will not conflict with existing files. Is this
possible with the current pyarrow?
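To make the question concrete, what I am hoping for is a single call along
these lines (I am not sure whether write_to_dataset is meant for this
incremental use, or whether it guarantees file names that do not clash with
the files already present in the partition folders):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # a small incoming batch (illustrative data and column names)
    new_batch_df = pd.DataFrame({"partition": [1, 1, 2], "value": [10.0, 11.5, 7.2]})

    # hoped-for behaviour: pyarrow splits the batch by the partition column
    # and adds new files next to the existing ones, without overwriting anything
    table = pa.Table.from_pandas(new_batch_df)
    pq.write_to_dataset(table, root_path="parquet_dataset", partition_cols=["partition"])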
PS: I understand that this fragmentation is not ideal for reading/querying,
but it lets me handle the update process quickly. In any case, I
periodically save a consolidated copy of the dataset, with one file per
partition, to improve read performance.
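(For what it's worth, the consolidation step is currently something along
these lines, run per partition folder; the paths are illustrative.)

    from pathlib import Path

    import pyarrow.parquet as pq

    # read all the small files of one partition folder and rewrite them
    # as a single consolidated file
    out_dir = Path("consolidated/partition=1")
    out_dir.mkdir(parents=True, exist_ok=True)
    table = pq.read_table("parquet_dataset/partition=1")
    pq.write_table(table, out_dir / "data.parquet")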
Thanks in advance,
Antonio