I've started thinking through how to support table formats like
Apache Iceberg and Delta Lake. For Delta Lake at least, there is some
question as to whether we want to do that in the arrow repo or the
delta-rs repo [1].

I haven't looked deeply into Iceberg yet. I just created a Jira for
Iceberg support; we can discuss the details there [2].

I'd like to contribute to those efforts over the next year.

[1]
https://github.com/delta-io/delta-rs/blob/624b1729d3db30ac4acecee5b123cf34cbebe41f/python/deltalake/table.py#L251
[2] https://issues.apache.org/jira/browse/ARROW-15135

On Wed, Dec 15, 2021 at 4:29 PM David Li <[email protected]> wrote:

> There is this issue for Delta Lake support (which has the added
> complication of potentially needing to bind a Rust library):
> https://issues.apache.org/jira/browse/ARROW-14730
>
> I don't see any JIRAs about Iceberg, nor do I recall any recent
> discussions about it; perhaps someone else will chime in.
>
> -David
>
> On Wed, Dec 15, 2021, at 19:24, Micah Kornfield wrote:
>
> As a follow-up, is it on anybody's roadmap to support new table-like
> structures (e.g. Apache Iceberg) for Datasets? This is something I'd like
> to see, and I might have some time in the new year to contribute to it.
>
> On Wed, Dec 15, 2021 at 3:25 PM Antonino Ingargiola <[email protected]>
> wrote:
>
> Hi Weston,
>
> Many thanks for the complete example: it works like a charm!
>
> The function `dataset.write_dataset` has a nice API, but I cannot figure
> out how to use some of its arguments.
>
> For example, it seems I should be able to change some parquet format
> parameters with the `file_options` argument, by passing the object
> returned by `ParquetFileFormat.make_write_options()`.
>
> But I cannot find which **kwargs are accepted by make_write_options(). The
> function `parquet.write_table` has several parquet-specific arguments. Are
> these the same arguments I can pass to
> ParquetFileFormat.make_write_options()?
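>
> Concretely, this is the kind of call I'm attempting (just a sketch; I'm
> assuming make_write_options() accepts the same keyword names as
> parquet.write_table, e.g. compression; please correct me if not):
>
> import tempfile
>
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> # Parquet-specific write options; 'compression' is my assumption that the
> # keywords mirror parquet.write_table.
> parquet_format = ds.ParquetFileFormat()
> write_options = parquet_format.make_write_options(compression='zstd')
>
> tab = pa.Table.from_pydict({'partition': [1, 2], 'value': [10, 20]})
> with tempfile.TemporaryDirectory() as dataset_dir:
>     ds.write_dataset(tab, dataset_dir, format=parquet_format,
>                      file_options=write_options,
>                      partitioning=['partition'],
>                      partitioning_flavor='hive')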
>
> Thanks,
> Antonio
>
>
> On Wed, Dec 15, 2021 at 1:09 AM Weston Pace <[email protected]> wrote:
>
> You may be able to meet this use case using the tabular datasets[1]
> feature of pyarrow.  A few thoughts:
>
> 1. The easiest way to get an "append" workflow with
> pyarrow.dataset.write_dataset is to use a unique basename_template for
> each write_dataset operation.   A uuid is helpful here.
> 2. As you mentioned, if your writes generate a bunch of small files, you
> will want to periodically compact your partitions (a rough sketch of this
> is appended at the end of the example below).
> 3. Reads should not happen at the same time as writes or else you risk
> reading partial / incomplete files.
>
> Example:
>
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> import tempfile
> from glob import glob
> from uuid import uuid4
>
> tab1 = pa.Table.from_pydict({'partition': [1, 1, 2, 2],
>                              'value': [1, 2, 3, 4]})
> tab2 = pa.Table.from_pydict({'partition': [1, 1, 2, 2],
>                              'value': [5, 6, 7, 8]})
>
> with tempfile.TemporaryDirectory() as dataset_dir:
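>     # Each write uses a fresh uuid in basename_template (point 1 above),
>     # so newly appended files never collide with files from earlier writes.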
>     ds.write_dataset(tab1, dataset_dir, format='parquet',
>                      partitioning=['partition'],
>                      partitioning_flavor='hive',
>                      existing_data_behavior='overwrite_or_ignore',
>                      basename_template=f'{uuid4()}-{{i}}')
>     ds.write_dataset(tab2, dataset_dir, format='parquet',
>                      partitioning=['partition'],
>                      partitioning_flavor='hive',
>                      existing_data_behavior='overwrite_or_ignore',
>                      basename_template=f'{uuid4()}-{{i}}')
>
>     print('\n'.join(glob(f'{dataset_dir}/**/*')))
>
>     dataset = ds.dataset(dataset_dir)
>
>     print(dataset.to_table().to_pandas())
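>
>     # Sketch for point 2 (compaction): materialize the small files in
>     # memory, then rewrite one file per partition in place.  This is just
>     # one illustrative approach; 'delete_matching' removes the existing
>     # fragments in each partition directory it rewrites, so only do this
>     # while no other readers or writers are active.
>     compacted = dataset.to_table()
>     ds.write_dataset(compacted, dataset_dir, format='parquet',
>                      partitioning=['partition'],
>                      partitioning_flavor='hive',
>                      existing_data_behavior='delete_matching',
>                      basename_template='part-{i}.parquet')
>
>     print('\n'.join(glob(f'{dataset_dir}/**/*')))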
>
> [1] https://arrow.apache.org/docs/python/dataset.html
>
> On Tue, Dec 14, 2021 at 12:17 AM Antonino Ingargiola <[email protected]>
> wrote:
> >
> > Hi Arrow community,
> >
> > I just subscribed to this mailing list. First, let me thank all the
> contributors for this great project!
> >
> > I have a question about which pyarrow API to use for a specific use
> case. I need to update/append data to a large partitioned parquet dataset
> using pyarrow. I receive the data in small batches that are transformed
> into small pandas dataframes. All the new dataframes have the same schema.
> The data can be saved locally or on a cloud object store (s3).
> >
> > When I receive a new batch, I need to update the parquet dataset with
> the new rows in the pandas dataframe. Essentially, I need to save
> additional xyz.parquet files in the appropriate partition subfolders,
> without removing or overwriting pre-existing .parquet files in the same
> partition folder.
> >
> > My goal is ending up with a dataset like this:
> >
> > parquet_dataset/
> >     partition=1/
> >         a.parquet
> >         b.parquet
> >         c.parquet
> >     partition=2/
> >         a.parquet
> >         b.parquet
> >
> > where each individual parquet file contains a single batch of data
> (though a single batch may be split across 2 or more partitions).
> >
> > Is there a preferred API to achieve this continuous update in pyarrow?
> >
> > I can implement all this logic manually but, ideally, I would like to
> defer to pyarrow the task of splitting the input dataframe into partitions
> and saving each chunk in the appropriate subfolder, generating a filename
> that does not conflict with existing files. Is this possible with the
> current pyarrow?
> >
> > PS: I understand that this fragmentation is not ideal for
> reading/querying, but it lets me handle the update process quickly. In any
> case, I periodically save a consolidated copy of the dataset with one file
> per partition to improve read performance.
> >
> > Thanks in advance,
> > Antonio
> >
> >
> >
>
>
>
