I just ran into this myself not too long ago :). I've been adding support for this process to the new write_dataset API, which will eventually (hopefully) obsolete pq.write_to_dataset.
> Happy to take a more manual approach to writing the metadata - as suggested
> in the docs for when write_dataset isn't used, but was wondering if this is,
> - a known issue

Yes. You can find more details in ARROW-13269 [1].

> - for which there is a correct solution (i.e. which of the two schemas
>   should it be?)

The _common_metadata should have all of the columns (including
partitioning). The _metadata should not. Right now this also only works if
all the files in _metadata have the same schema, so let us know if that is
an issue for your use case. (I've put a rough sketch of what I mean at the
bottom of this mail, below your quoted message.)

> - that I could contribute a fix for.

The documentation around this could definitely be improved. There is
precious little non-Arrow documentation around these files, so it is rather
tricky to google for. If you have suggestions on ways this process could be
made easier, that is always welcome too.

[1] https://issues.apache.org/jira/browse/ARROW-13269

On Tue, Jul 20, 2021 at 6:18 AM Joris Peeters <[email protected]> wrote:
>
> The docs on https://arrow.apache.org/docs/python/parquet.html suggest a
> mechanism for collecting and writing metadata when using
> `pq.write_to_dataset` to build the dataset on disk.
>
> Using that mechanism I ran into an issue when employing partitions. Perhaps
> most easily demonstrated with a little script to reproduce, using a table
> with two columns, one of which (`letter`) is the partitioning column.
>
> import tempfile
> import random
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import os
> import pandas as pd
>
> N = 100
> df = pd.DataFrame({
>     'letter': [random.choice(['A', 'B']) for _ in range(0, N)],
>     'number': np.random.rand(N)})
>
> table = pa.Table.from_pandas(df, schema=pa.schema(fields=[
>     pa.field('letter', pa.string(), nullable=False),
>     pa.field('number', pa.float64(), nullable=False)]))
>
> with tempfile.TemporaryDirectory() as root:
>     metadata_collector = []
>     pq.write_to_dataset(
>         table,
>         root_path=root,
>         partition_cols=['letter'],
>         metadata_collector=metadata_collector)
>
>     pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))
>     pq.write_metadata(table.schema, os.path.join(root, '_metadata'),
>                       metadata_collector=metadata_collector)
>
> which gives (on the last line),
>
> RuntimeError: AppendRowGroups requires equal schemas.
>
> on https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2184
>
> It looks like this is because the common metadata's schema (which has all
> the cols) differs from the collected file metadata's schemas - which omit
> the partitioning column, i.e. are of shape:
>
> <pyarrow._parquet.ParquetSchema object at 0x0000024FB1084100>
> required group field_id=0 schema {
>   required double field_id=1 number;
> }
>
> Happy to take a more manual approach to writing the metadata - as suggested
> in the docs for when write_dataset isn't used, but was wondering if this is,
>
> - a known issue
>
> - for which there is a correct solution (i.e. which of the two schemas
>   should it be?)
>
> - that I could contribute a fix for.
>
> -J
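
P.S. Here is the rough, untested sketch I mentioned above. It is your
reproduction script with only the _metadata call changed, so that the schema
passed in matches the schemas collected in metadata_collector. I'm assuming
table.drop(['letter']).schema is exactly the schema write_to_dataset used for
the data files; whether that holds in every case may depend on your pyarrow
version, so treat this as a starting point rather than a guaranteed recipe.

import os
import random
import tempfile

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

N = 100
df = pd.DataFrame({
    'letter': [random.choice(['A', 'B']) for _ in range(0, N)],
    'number': np.random.rand(N)})

table = pa.Table.from_pandas(df, schema=pa.schema(fields=[
    pa.field('letter', pa.string(), nullable=False),
    pa.field('number', pa.float64(), nullable=False)]))

with tempfile.TemporaryDirectory() as root:
    metadata_collector = []
    pq.write_to_dataset(
        table,
        root_path=root,
        partition_cols=['letter'],
        metadata_collector=metadata_collector)

    # _common_metadata: the full schema, partitioning column included.
    pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))

    # _metadata: the schema the data files were actually written with,
    # i.e. without the partitioning column, so it matches the schemas in
    # metadata_collector and AppendRowGroups has nothing to complain about.
    file_schema = table.drop(['letter']).schema
    pq.write_metadata(file_schema, os.path.join(root, '_metadata'),
                      metadata_collector=metadata_collector)

    # Quick sanity check: _metadata should now contain the row groups from
    # both partition directories, and _common_metadata the full schema.
    print(pq.read_metadata(os.path.join(root, '_metadata')))
    print(pq.read_schema(os.path.join(root, '_common_metadata')))

(If you partition on more than one column you would drop all of them when
building file_schema, of course.)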
