> For example, it seems I should be able to change some parquet format
> parameters with the `file_options` argument, by passing the object
> returned by `ParquetFileFormat.make_write_options()`.
> But I cannot find which **kwargs are accepted by make_write_options. The
> function `parquet.write_table` has several arguments for parquet. Are these
> the same arguments I can pass to ParquetFileFormat.make_write_options()?
Sort of... it looks like that is rather undocumented at the moment :) We
should fix that (I just opened [1]).
Most of the options in the Parquet writer can be set, but not all of
them. For example, you can't control row_group_size, as that is managed
by the dataset writer itself. To create an options object that you can
pass to write_dataset as file_options, you need a format object.
Putting it all together looks something like this:
import pyarrow.dataset as ds
import pyarrow as pa

tab = pa.Table.from_pydict({"x": [1, 2, 3]})

# Build Parquet-specific write options from a ParquetFileFormat object
pq_format = ds.ParquetFileFormat()
opts = pq_format.make_write_options(use_dictionary=False)

# Hand the options to write_dataset via file_options
ds.write_dataset(tab, '/tmp/my_dataset', format='parquet',
                 file_options=opts)
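
If you want to double check that an option actually took effect, one quick
(optional) sanity check is to peek at the metadata of one of the written
files, e.g.:

import glob
import pyarrow.parquet as pq

# With use_dictionary=False the reported encodings should not include the
# dictionary encodings (PLAIN_DICTIONARY / RLE_DICTIONARY).
path = glob.glob('/tmp/my_dataset/*')[0]
print(pq.ParquetFile(path).metadata.row_group(0).column(0).encodings)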
The keyword arguments that can be supplied to "make_write_options"
(for Parquet) are:
- use_dictionary
- compression
- version
- write_statistics
- data_page_size
- compression_level
- use_byte_stream_split
- data_page_version
- use_deprecated_int96_timestamps
- coerce_timestamps
- allow_truncated_timestamps
- use_compliant_nested_type
The behavior for these is documented at
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
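
As an illustration, combining a couple of these (the compression choices and
output path below are just an example) with the snippet above would look
something like:

# Example only: zstd compression with an explicit level, reusing `tab` and
# `pq_format` from the snippet above.
zstd_opts = pq_format.make_write_options(compression='zstd',
                                         compression_level=5)
ds.write_dataset(tab, '/tmp/my_dataset_zstd', format='parquet',
                 file_options=zstd_opts)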
[1] https://issues.apache.org/jira/browse/ARROW-15139
On Thu, Dec 16, 2021 at 5:17 AM Will Jones <[email protected]> wrote:
>
> I've been starting to think through how to support table formats like Apache
> Iceberg and Delta Lake. For Delta Lake, at least, there is some question as to
> whether we want to do that in the Arrow repo or the delta-rs repo [1].
>
> I haven't looked deeply into Iceberg yet. I just created a Jira for
> Iceberg support, and we can discuss the details there [2].
>
> I'd like to contribute to those efforts over the next year.
>
> [1]
> https://github.com/delta-io/delta-rs/blob/624b1729d3db30ac4acecee5b123cf34cbebe41f/python/deltalake/table.py#L251
> [2] https://issues.apache.org/jira/browse/ARROW-15135
>
> On Wed, Dec 15, 2021 at 4:29 PM David Li <[email protected]> wrote:
>>
>> There is this issue for Delta Lake support (which has the added complication
>> of potentially needing to bind a Rust library):
>> https://issues.apache.org/jira/browse/ARROW-14730
>>
>> I don't see any JIRAs about Iceberg, nor do I recall any recent discussions
>> about it; perhaps someone else will chime in.
>>
>> -David
>>
>> On Wed, Dec 15, 2021, at 19:24, Micah Kornfield wrote:
>>
>> As a follow-up, is it on anybody's roadmap to support new table-like
>> structures (e.g. Apache Iceberg) for Datasets? This is something I'd like
>> to see, and I might have some time in the new year to contribute to it.
>>
>> On Wed, Dec 15, 2021 at 3:25 PM Antonino Ingargiola <[email protected]>
>> wrote:
>>
>> Hi Weston,
>>
>> Many thanks for the complete example: it works like a charm!
>>
>> The function `dataset.write_dataset` has a nice API, but I cannot figure out
>> how to use some of its arguments.
>>
>> For example, it seems I should be able to change some parquet format
>> parameters with the `file_options` argument, by passing the object
>> returned by `ParquetFileFormat.make_write_options()`.
>>
>> But I cannot find which **kwargs are accepted by make_write_options. The
>> function `parquet.write_table` has several arguments for parquet. Are these
>> the same arguments I can pass to ParquetFileFormat.make_write_options()?
>>
>> Thanks,
>> Antonio
>>
>>
>> On Wed, Dec 15, 2021 at 1:09 AM Weston Pace <[email protected]> wrote:
>>
>> You may be able to meet this use case using the tabular datasets[1]
>> feature of pyarrow. A few thoughts:
>>
>> 1. The easiest way to get an "append" workflow with
>> pyarrow.dataset.write_dataset is to use a unique basename_template for
>> each write_dataset operation. A uuid is helpful here.
>> 2. As you mentioned, if your writes generate a bunch of small files,
>> you will want to periodically compact your partitions (a rough sketch
>> of this follows the example below).
>> 3. Reads should not happen at the same time as writes or else you risk
>> reading partial / incomplete files.
>>
>> Example:
>>
>> import pyarrow as pa
>> import pyarrow.dataset as ds
>>
>> import tempfile
>> from glob import glob
>> from uuid import uuid4
>>
>> tab1 = pa.Table.from_pydict({'partition': [1, 1, 2, 2],
>>                              'value': [1, 2, 3, 4]})
>> tab2 = pa.Table.from_pydict({'partition': [1, 1, 2, 2],
>>                              'value': [5, 6, 7, 8]})
>>
>> with tempfile.TemporaryDirectory() as dataset_dir:
>>     # Each "append" is its own write_dataset call; the uuid in
>>     # basename_template keeps new files from colliding with existing ones.
>>     ds.write_dataset(tab1, dataset_dir, format='parquet',
>>                      partitioning=['partition'],
>>                      partitioning_flavor='hive',
>>                      existing_data_behavior='overwrite_or_ignore',
>>                      basename_template=f'{uuid4()}-{{i}}')
>>     ds.write_dataset(tab2, dataset_dir, format='parquet',
>>                      partitioning=['partition'],
>>                      partitioning_flavor='hive',
>>                      existing_data_behavior='overwrite_or_ignore',
>>                      basename_template=f'{uuid4()}-{{i}}')
>>
>>     # Both batches end up as separate files under the same partition dirs
>>     print('\n'.join(glob(f'{dataset_dir}/**/*')))
>>
>>     dataset = ds.dataset(dataset_dir)
>>     print(dataset.to_table().to_pandas())
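>>
>> For point 2, a compaction pass could be as simple as scanning the small
>> files back and rewriting them into a fresh directory. A rough (untested)
>> sketch, with placeholder paths, to run only while no writers are active:
>>
>> import pyarrow.dataset as ds
>> from uuid import uuid4
>>
>> # Sketch only: read the fragmented dataset (with its hive partitioning)
>> # and let the dataset writer produce fewer, larger files per partition.
>> fragmented = ds.dataset('/path/to/dataset', partitioning='hive')
>> ds.write_dataset(fragmented, '/path/to/dataset_compacted', format='parquet',
>>                  partitioning=['partition'],
>>                  partitioning_flavor='hive',
>>                  basename_template=f'{uuid4()}-{{i}}')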
>>
>> [1] https://arrow.apache.org/docs/python/dataset.html
>>
>> On Tue, Dec 14, 2021 at 12:17 AM Antonino Ingargiola <[email protected]>
>> wrote:
>> >
>> > Hi arrow community,
>> >
>> > I just subscribed to this mailing list. First, let me thank all the
>> > contributors for this great project!
>> >
>> > I have a question about which pyarrow API to use for a specific use case. I
>> > need to update/append data to a large partitioned parquet dataset using
>> > pyarrow. I receive the data in small batches that are transformed into small
>> > pandas dataframes. All the new dataframes have the same schema. The data
>> > can be saved locally or on a cloud object store (S3).
>> >
>> > When I receive a new batch, I need to update the parquet dataset with the
>> > new rows in the pandas dataframe. Essentially, I need to save additional
>> > xyz.parquet files in the appropriate partition subfolders, without
>> > removing or overwriting pre-existing .parquet files in the same partition
>> > folder.
>> >
>> > My goal is ending up with a dataset like this:
>> >
>> > parquet_dataset/
>> >   partition=1/
>> >     a.parquet
>> >     b.parquet
>> >     c.parquet
>> >   partition=2/
>> >     a.parquet
>> >     b.parquet
>> >
>> > where each individual parquet file contains a single batch of data
>> > (actually, a single batch may be split across 2 or more partitions).
>> >
>> > Is there a preferred API to achieve this continuous update in pyarrow?
>> >
>> > I can implement all this logic manually, but ideally I would like to
>> > defer to pyarrow the task of splitting the input dataframe into partitions
>> > and saving each chunk in the appropriate subfolder, generating a filename
>> > that will not conflict with existing files. Is this possible with the
>> > current pyarrow?
>> >
>> > PS: I understand that this fragmentation is not ideal for reading/querying,
>> > but it lets me handle the update process quickly. In any case, I
>> > periodically save a consolidated copy of the dataset with one file per
>> > partition to improve read performance.
>> >
>> > Thanks in advance,
>> > Antonio
>> >
>> >
>> >
>>
>>