There is this issue for Delta Lake support (which has the added complication of 
potentially needing to bind a Rust library): 
https://issues.apache.org/jira/browse/ARROW-14730

I don't see any JIRAs about Iceberg, nor do I recall any recent discussions 
about it; perhaps someone else will chime in.

-David

On Wed, Dec 15, 2021, at 19:24, Micah Kornfield wrote:
> As a follow-up, is it on anybody's road map to support new Table-like 
> structures (e.g. Apache Iceberg) for Datasets? This is something I'd like to 
> see, and I might have some time in the new year to contribute.
> 
> On Wed, Dec 15, 2021 at 3:25 PM Antonino Ingargiola <[email protected]> 
> wrote:
>> Hi Weston,
>> 
>> many thanks for the complete example: it works like a charm!
>> 
>> The function `dataset.write_dataset` has a nice API, but I cannot figure out 
>> how to use some of its arguments.
>> 
>> For example, it seems I should be able to change some parquet format 
>> parameters with the `file_options` argument, by passing the object returned 
>> by `ParquetFileFormat.make_write_options()`.
>> 
>> But I cannot find which **kwargs are accepted by make_write_options(). The 
>> function `parquet.write_table` has several parquet-specific arguments. Are 
>> these the same arguments I can pass to `ParquetFileFormat.make_write_options()`?
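>> 
>> This is roughly what I am imagining (just a guess on my side that 
>> `compression` is one of the accepted keywords, and the table/path are only 
>> dummies for illustration):
>> 
>> import tempfile
>> 
>> import pyarrow as pa
>> import pyarrow.dataset as ds
>> 
>> table = pa.Table.from_pydict({'partition': [1, 2], 'value': [1, 2]})
>> dataset_dir = tempfile.mkdtemp()
>> 
>> parquet_format = ds.ParquetFileFormat()
>> # Guess: are these the same writer settings as parquet.write_table
>> # (e.g. compression)? I may be wrong about which keywords are accepted.
>> write_options = parquet_format.make_write_options(compression='zstd')
>> 
>> ds.write_dataset(table, dataset_dir, format=parquet_format,
>>                  file_options=write_options,
>>                  partitioning=['partition'],
>>                  partitioning_flavor='hive')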
>> 
>> Thanks,
>> Antonio
>> 
>> 
>> On Wed, Dec 15, 2021 at 1:09 AM Weston Pace <[email protected]> wrote:
>>> You may be able to meet this use case using the tabular datasets[1]
>>> feature of pyarrow.  A few thoughts:
>>> 
>>> 1. The easiest way to get an "append" workflow with
>>> pyarrow.dataset.write_dataset is to use a unique basename_template for
>>> each write_dataset operation.   A uuid is helpful here.
>>> 2. As you mentioned, if your writes generate a bunch of small files,
>>> you will want to periodically compact your partitions (a rough sketch
>>> of this follows the example below).
>>> 3. Reads should not happen at the same time as writes or else you risk
>>> reading partial / incomplete files.
>>> 
>>> Example:
>>> 
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> 
>>> import tempfile
>>> from glob import glob
>>> from uuid import uuid4
>>> 
>>> tab1 = pa.Table.from_pydict({'partition': [1, 1, 2, 2], 'value': [1, 2, 3, 4]})
>>> tab2 = pa.Table.from_pydict({'partition': [1, 1, 2, 2], 'value': [5, 6, 7, 8]})
>>> 
>>> with tempfile.TemporaryDirectory() as dataset_dir:
>>>     ds.write_dataset(tab1, dataset_dir, format='parquet',
>>>                      partitioning=['partition'],
>>>                      partitioning_flavor='hive',
>>>                      existing_data_behavior='overwrite_or_ignore',
>>>                      basename_template=f'{uuid4()}-{{i}}')
>>>     ds.write_dataset(tab2, dataset_dir, format='parquet',
>>>                      partitioning=['partition'],
>>>                      partitioning_flavor='hive',
>>>                      existing_data_behavior='overwrite_or_ignore',
>>>                      basename_template=f'{uuid4()}-{{i}}')
>>> 
>>>     print('\n'.join(glob(f'{dataset_dir}/**/*')))
>>> 
>>>     dataset = ds.dataset(dataset_dir)
>>> 
>>>     print(dataset.to_table().to_pandas())
>>> 
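>>> For point 2, here is a rough compaction sketch, continuing from the
>>> example above (it reuses `dataset_dir`; the `compacted_dir` temp
>>> directory and the rewrite-then-swap approach are just one way to do it):
>>> 
>>> import shutil
>>> import tempfile
>>> 
>>> import pyarrow.dataset as ds
>>> 
>>> compacted_dir = tempfile.mkdtemp()
>>> 
>>> # Re-read the fragmented dataset (hive partitioning so the partition
>>> # column comes back) and rewrite it, which by default produces one
>>> # file per partition directory.
>>> fragmented = ds.dataset(dataset_dir, partitioning='hive')
>>> ds.write_dataset(fragmented, compacted_dir, format='parquet',
>>>                  partitioning=['partition'],
>>>                  partitioning_flavor='hive',
>>>                  existing_data_behavior='overwrite_or_ignore')
>>> 
>>> # Swap the compacted copy in place of the fragmented one.
>>> shutil.rmtree(dataset_dir)
>>> shutil.move(compacted_dir, dataset_dir)
>>> 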
>>> [1] https://arrow.apache.org/docs/python/dataset.html
>>> 
>>> On Tue, Dec 14, 2021 at 12:17 AM Antonino Ingargiola <[email protected]> 
>>> wrote:
>>> >
>>> > Hi arrow community,
>>> >
>>> > I just subscribed to this mailing list. First, let me thank all the 
>>> > contributors for this great project!
>>> >
>>> > I have a question about which pyarrow API to use for a specific use case. I 
>>> > need to update/append data to a large partitioned parquet dataset using 
>>> > pyarrow. I receive the data in small batches that are transformed into 
>>> > small pandas dataframes. All the new dataframes have the same schema. The 
>>> > data can be saved locally or on a cloud object store (s3).
>>> >
>>> > When I receive a new batch, I need to update the parquet dataset with the 
>>> > new rows in the pandas dataframe. Essentially, I need to save additional 
>>> > xyz.parquet files in the appropriate partition subfolders, without 
>>> > removing or overwriting pre-existing .parquet files in the same partition 
>>> > folder.
>>> >
>>> > My goal is to end up with a dataset like this:
>>> >
>>> > parquet_dataset/
>>> >     partition=1/
>>> >         a.parquet
>>> >         b.parquet
>>> >         c.parquet
>>> >     partition=2/
>>> >         a.parquet
>>> >         b.parquet
>>> >
>>> > where each individual parquet file contains a single batch of data 
>>> > (actually, a single batch may be split across two or more partitions).
>>> >
>>> > Is there a preferred API to achieve this continuous update in pyarrow?
>>> >
>>> > I can implement all this logic manually, but, ideally, I would like to 
>>> > defer to pyarrow the task of splitting the input dataframe into partitions 
>>> > and saving each chunk in the appropriate subfolder, generating a filename 
>>> > that will not conflict with existing files. Is this possible with the 
>>> > current pyarrow?
>>> >
>>> > PS: I understand that this fragmentation is not ideal for 
>>> > reading/querying, but it lets me handle the update process quickly. In any 
>>> > case, I periodically save a consolidated copy of the dataset with one 
>>> > file per partition to improve read performance.
>>> >
>>> > Thanks in advance,
>>> > Antonio
>>> >
