The docs on https://arrow.apache.org/docs/python/parquet.html suggest a
mechanism for collecting and writing metadata when using `pq.write_to_dataset`
to build the dataset on disk.
Using that mechanism I ran into an issue when employing partitions. It is
most easily demonstrated with a small reproduction script, using a table
with two columns, one of which (`letter`) is the partitioning column:
import os
import random
import tempfile

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

N = 100
df = pd.DataFrame({
    'letter': [random.choice(['A', 'B']) for _ in range(N)],
    'number': np.random.rand(N)})
table = pa.Table.from_pandas(df, schema=pa.schema(fields=[
    pa.field('letter', pa.string(), nullable=False),
    pa.field('number', pa.float64(), nullable=False)]))

with tempfile.TemporaryDirectory() as root:
    # write the dataset partitioned on 'letter', collecting per-file metadata
    metadata_collector = []
    pq.write_to_dataset(
        table,
        root_path=root,
        partition_cols=['letter'],
        metadata_collector=metadata_collector)
    pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))
    # the next call is the one that raises
    pq.write_metadata(table.schema, os.path.join(root, '_metadata'),
                      metadata_collector=metadata_collector)
which gives (on the last line),
RuntimeError: AppendRowGroups requires equal schemas.
on https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2184
It looks like this is because the schema passed for the common metadata
(which has all the columns) differs from the schemas of the collected
FileMetaData objects, which omit the partitioning column, i.e. are of the shape:
<pyarrow._parquet.ParquetSchema object at 0x0000024FB1084100>
required group field_id=0 schema {
required double field_id=1 number;
}
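
As a workaround I can presumably drop the partitioning column from the schema
before writing `_metadata`, so that it matches the collected file metadata.
Something like the following (an untested sketch, continuing the script above;
I'm not sure this is the intended usage):

# assumption: _metadata should describe the data files, which do not contain
# the partition column, so remove 'letter' from the schema before writing
data_schema = table.schema.remove(table.schema.get_field_index('letter'))
pq.write_metadata(data_schema, os.path.join(root, '_metadata'),
                  metadata_collector=metadata_collector)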
Happy to take a more manual approach to writing the metadata (sketched below),
as suggested in the docs for when `write_to_dataset` isn't used, but I was
wondering if this is:
- a known issue,
- for which there is a correct solution (i.e. which of the two schemas should it be?),
- that I could contribute a fix for.
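
For completeness, the more manual route I have in mind would look roughly like
the following, based on the docs' `write_table`/`set_file_path` example (an
untested sketch continuing from the variables above; the per-partition file
layout and names are just illustrative):

# write one file per 'letter' value by hand, collecting each file's metadata
metadata_collector = []
for letter, group in df.groupby('letter'):
    rel_path = f'letter={letter}/part-0.parquet'
    os.makedirs(os.path.join(root, f'letter={letter}'), exist_ok=True)
    # the data files themselves do not contain the partition column
    part_table = pa.Table.from_pandas(group.drop(columns=['letter']),
                                      preserve_index=False)
    pq.write_table(part_table, os.path.join(root, rel_path),
                   metadata_collector=metadata_collector)
    # record each file's path relative to the dataset root
    metadata_collector[-1].set_file_path(rel_path)

# here the schema passed to write_metadata matches the data files
pq.write_metadata(part_table.schema, os.path.join(root, '_metadata'),
                  metadata_collector=metadata_collector)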
-J