Hi Arrow community,
I just subscribed to this mailing list. First, let me thank all the
contributors for this great project!
I have a question about which pyarrow API to use for a specific use case. I
need to update/append data to a large partitioned parquet dataset using
pyarrow. I receive the data in small batches that are transformed into small
pandas dataframes. All the new dataframes have the same schema. The data
can be saved locally or on a cloud object store (S3).
When I receive a new batch, I need to update the parquet dataset with the
new rows in the pandas dataframe. Essentially, I need to save additional
xyz.parquet files in the appropriate partition subfolders, without removing
or overwriting pre-existing .parquet files in the same partition folder.
My goal is to end up with a dataset like this:
parquet_dataset/
    partition=1/
        a.parquet
        b.parquet
        c.parquet
    partition=2/
        a.parquet
        b.parquet
where each individual parquet file contains a single batch of data
(actually, a single batch may be split across 2 or more partitions).
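For reference, the manual logic I have in mind is roughly the following
(just a sketch; the helper name, the "partition" column and the uuid-based
file names are purely illustrative):

    import uuid
    from pathlib import Path

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    def append_batch(df: pd.DataFrame, root: str = "parquet_dataset") -> None:
        # split the incoming batch by partition value and write each chunk
        # to a brand new file, so pre-existing files are never touched
        for value, chunk in df.groupby("partition"):
            folder = Path(root) / f"partition={value}"
            folder.mkdir(parents=True, exist_ok=True)
            table = pa.Table.from_pandas(chunk.drop(columns=["partition"]))
            pq.write_table(table, folder / f"{uuid.uuid4().hex}.parquet")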
Is there a preferred pyarrow API to achieve this continuous update?
I can implement all of this logic manually (roughly as sketched above), but,
ideally, I would like to defer to pyarrow the task of splitting the input
dataframe into partitions and saving each chunk in the appropriate subfolder,
generating a file name that will not conflict with existing files. Is this
possible with the current pyarrow?
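To make the question concrete, what I am hoping for is a single call along
these lines (I am not sure whether write_to_dataset is meant for this
incremental use, or whether it guarantees file names that do not clash with
the files already present in the partition folders):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # a small incoming batch (illustrative data and column names)
    new_batch_df = pd.DataFrame({"partition": [1, 1, 2], "value": [10.0, 11.5, 7.2]})

    # hoped-for behaviour: pyarrow splits the batch by the partition column
    # and adds new files next to the existing ones, without overwriting anything
    table = pa.Table.from_pandas(new_batch_df)
    pq.write_to_dataset(table, root_path="parquet_dataset", partition_cols=["partition"])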
PS: I understand that this fragmentation is not ideal for reading/querying,
but it lets me handle the update process quickly. In any case, I
periodically save a consolidated copy of the dataset, with one file per
partition, to improve read performance.
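(For what it's worth, the consolidation step is currently something along
these lines, run per partition folder; the paths are illustrative.)

    from pathlib import Path

    import pyarrow.parquet as pq

    # read all the small files of one partition folder and rewrite them
    # as a single consolidated file
    out_dir = Path("consolidated/partition=1")
    out_dir.mkdir(parents=True, exist_ok=True)
    table = pq.read_table("parquet_dataset/partition=1")
    pq.write_table(table, out_dir / "data.parquet")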
Thanks in advance,
Antonio