> For example, it seems I should be able to change some parquet format
> parameters with the `file_options` argument, by passing the object
> returned by `ParquetFileFormat.make_write_options()`.
> But I cannot find which **kwargs are accepted by make_write_options. The
> function `parquet.write_table` has several arguments for parquet. Are these
> the same arguments I can pass to ParquetFileFormat.make_write_options()?
Sort of... it looks like that is rather undocumented at the moment :) We
should fix that (I just opened [1]).
Most of the options in the Parquet writer can be set, but not all of
them. For example, you can't control row_group_size, as that is managed
by the dataset writer itself. To create an options object that you can
pass to write_dataset as file_options, you need a format object.
Putting it all together looks something like this:
import pyarrow.dataset as ds
import pyarrow as pa

tab = pa.Table.from_pydict({"x": [1, 2, 3]})

# Build Parquet-specific write options from a ParquetFileFormat object
pq_format = ds.ParquetFileFormat()
opts = pq_format.make_write_options(use_dictionary=False)

# Hand the options to write_dataset via file_options
ds.write_dataset(tab, '/tmp/my_dataset', format='parquet',
                 file_options=opts)
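
If you want to double check that an option actually took effect, one quick
(optional) sanity check is to peek at the metadata of one of the written
files, e.g.:

import glob
import pyarrow.parquet as pq

# With use_dictionary=False the reported encodings should not include the
# dictionary encodings (PLAIN_DICTIONARY / RLE_DICTIONARY).
path = glob.glob('/tmp/my_dataset/*')[0]
print(pq.ParquetFile(path).metadata.row_group(0).column(0).encodings)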
The keyword arguments that can be supplied to "make_write_options"
(for Parquet) are:
- use_dictionary
- compression
- version
- write_statistics
- data_page_size
- compression_level
- use_byte_stream_split
- data_page_version
- use_deprecated_int96_timestamps
- coerce_timestamps
- allow_truncated_timestamps
- use_compliant_nested_type
The behavior for these is documented at
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
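
As an illustration, combining a couple of these (the compression choices and
output path below are just an example) with the snippet above would look
something like:

# Example only: zstd compression with an explicit level, reusing `tab` and
# `pq_format` from the snippet above.
zstd_opts = pq_format.make_write_options(compression='zstd',
                                         compression_level=5)
ds.write_dataset(tab, '/tmp/my_dataset_zstd', format='parquet',
                 file_options=zstd_opts)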
[1] https://issues.apache.org/jira/browse/ARROW-15139
On Thu, Dec 16, 2021 at 5:17 AM Will Jones <[email protected]> wrote:
>
> I've been starting to think through how to support table formats like Apache
> Iceberg and Delta Lake. For Delta Lake, at least, there is some question as to
> whether we want to do that in the Arrow repo or the delta-rs repo [1].
>
> I haven't looked deeply into Iceberg yet. I just created a Jira for
> Iceberg support, and we can discuss the details there [2].
>
> I'd like to contribute to those efforts over the next year.
>
> [1]
> https://github.com/delta-io/delta-rs/blob/624b1729d3db30ac4acecee5b123cf34cbebe41f/python/deltalake/table.py#L251
> [2] https://issues.apache.org/jira/browse/ARROW-15135
>
> On Wed, Dec 15, 2021 at 4:29 PM David Li <[email protected]> wrote:
>>
>> There is this issue for Delta Lake support (which has the added complication
>> of potentially needing to bind a Rust library):
>> https://issues.apache.org/jira/browse/ARROW-14730
>>
>> I don't see any JIRAs about Iceberg, nor do I recall any recent discussions
>> about it; perhaps someone else will chime in.
>>
>> -David
>>
>> On Wed, Dec 15, 2021, at 19:24, Micah Kornfield wrote:
>>
>> As a follow-up, is it on anybody's roadmap to support new table-like
>> structures (e.g. Apache Iceberg) for Datasets? This is something I'd like
>> to see, and I might have some time in the new year to contribute to it.
>>
>> On Wed, Dec 15, 2021 at 3:25 PM Antonino Ingargiola <[email protected]>
>> wrote:
>>
>> Hi Weston,
>>
>> Many thanks for the complete example: it works like a charm!
>>
>> The function `dataset.write_dataset` has a nice API, but I cannot figure out
>> how to use some of its arguments.
>>
>> For example, it seems I should be able to change some parquet format
>> parameters with the `file_options` argument, by passing the object
>> returned by `ParquetFileFormat.make_write_options()`.
>>
>> But I cannot find which **kwargs are accepted by make_write_options. The
>> function `parquet.write_table` has several arguments for parquet. Are these
>> the same arguments I can pass to ParquetFileFormat.make_write_options()?
>>
>> Thanks,
>> Antonio
>>
>>
>> On Wed, Dec 15, 2021 at 1:09 AM Weston Pace <[email protected]> wrote:
>>
>> You may be able to meet this use case using the tabular datasets[1]
>> feature of pyarrow. A few thoughts:
>>
>> 1. The easiest way to get an "append" workflow with
>> pyarrow.dataset.write_dataset is to use a unique basename_template for
>> each write_dataset operation. A uuid is helpful here.
>> 2. As you mentioned, if your writes generate a bunch of small files,
>> you will want to periodically compact your partitions (a rough sketch
>> of this follows the example below).
>> 3. Reads should not happen at the same time as writes or else you risk
>> reading partial / incomplete files.
>>
>> Example:
>>
>> import pyarrow as pa
>> import pyarrow.dataset as ds
>>
>> import tempfile
>> from glob import glob
>> from uuid import uuid4
>>
>> tab1 = pa.Table.from_pydict({'partition': [1, 1, 2, 2],
>>                              'value': [1, 2, 3, 4]})
>> tab2 = pa.Table.from_pydict({'partition': [1, 1, 2, 2],
>>                              'value': [5, 6, 7, 8]})
>>
>> with tempfile.TemporaryDirectory() as dataset_dir:
>>     # Each "append" is its own write_dataset call; the uuid in
>>     # basename_template keeps new files from colliding with existing ones.
>>     ds.write_dataset(tab1, dataset_dir, format='parquet',
>>                      partitioning=['partition'],
>>                      partitioning_flavor='hive',
>>                      existing_data_behavior='overwrite_or_ignore',
>>                      basename_template=f'{uuid4()}-{{i}}')
>>     ds.write_dataset(tab2, dataset_dir, format='parquet',
>>                      partitioning=['partition'],
>>                      partitioning_flavor='hive',
>>                      existing_data_behavior='overwrite_or_ignore',
>>                      basename_template=f'{uuid4()}-{{i}}')
>>
>>     # Both batches end up as separate files under the same partition dirs
>>     print('\n'.join(glob(f'{dataset_dir}/**/*')))
>>
>>     dataset = ds.dataset(dataset_dir)
>>     print(dataset.to_table().to_pandas())
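>>
>> For point 2, a compaction pass could be as simple as scanning the small
>> files back and rewriting them into a fresh directory. A rough (untested)
>> sketch, with placeholder paths, to run only while no writers are active:
>>
>> import pyarrow.dataset as ds
>> from uuid import uuid4
>>
>> # Sketch only: read the fragmented dataset (with its hive partitioning)
>> # and let the dataset writer produce fewer, larger files per partition.
>> fragmented = ds.dataset('/path/to/dataset', partitioning='hive')
>> ds.write_dataset(fragmented, '/path/to/dataset_compacted', format='parquet',
>>                  partitioning=['partition'],
>>                  partitioning_flavor='hive',
>>                  basename_template=f'{uuid4()}-{{i}}')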
>>
>> [1] https://arrow.apache.org/docs/python/dataset.html
>>
>> On Tue, Dec 14, 2021 at 12:17 AM Antonino Ingargiola <[email protected]>
>> wrote:
>> >
>> > Hi arrow community,
>> >
>> > I just subscribed to this mailing list. First, let me thank all the
>> > contributors for this great project!
>> >
>> > I have a question about which pyarrow API to use for a specific use case. I
>> > need to update/append data to a large partitioned parquet dataset using
>> > pyarrow. I receive the data in small batches that are transformed into small
>> > pandas dataframes. All the new dataframes have the same schema. The data
>> > can be saved locally or on a cloud object store (S3).
>> >
>> > When I receive a new batch, I need to update the parquet dataset with the
>> > new rows in the pandas dataframe. Essentially, I need to save additional
>> > xyz.parquet files in the appropriate partition subfolders, without
>> > removing or overwriting pre-existing .parquet files in the same partition
>> > folder.
>> >
>> > My goal is ending up with a dataset like this:
>> >
>> > parquet_dataset/
>> >   partition=1/
>> >     a.parquet
>> >     b.parquet
>> >     c.parquet
>> >   partition=2/
>> >     a.parquet
>> >     b.parquet
>> >
>> > where each individual parquet file contains a single batch of data
>> > (actually, a single batch may be split across 2 or more partitions).
>> >
>> > Is there a preferred API to achieve this continuous update in pyarrow?
>> >
>> > I can implement all this logic manually, but ideally I would like to
>> > defer to pyarrow the task of splitting the input dataframe into partitions
>> > and saving each chunk in the appropriate subfolder, generating a filename
>> > that will not conflict with existing files. Is this possible with the
>> > current pyarrow?
>> >
>> > PS: I understand that this fragmentation is not ideal for reading/querying,
>> > but it lets me handle the update process quickly. In any case, I
>> > periodically save a consolidated copy of the dataset with one file per
>> > partition to improve read performance.
>> >
>> > Thanks in advance,
>> > Antonio
>> >
>> >
>> >
>>
>>