The docs on https://arrow.apache.org/docs/python/parquet.html suggest a
mechanism for collecting and writing metadata when using `pq.write_to_dataset`
to build the dataset on disk.
Using that mechanism I ran into an issue when employing partitions. It is
most easily demonstrated with a small reproduction script, using a table
with two columns, one of which (`letter`) is the partitioning column:
import os
import random
import tempfile

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

N = 100
df = pd.DataFrame({
    'letter': [random.choice(['A', 'B']) for _ in range(N)],
    'number': np.random.rand(N)})
table = pa.Table.from_pandas(df, schema=pa.schema(fields=[
    pa.field('letter', pa.string(), nullable=False),
    pa.field('number', pa.float64(), nullable=False)]))

with tempfile.TemporaryDirectory() as root:
    # write the dataset partitioned on 'letter', collecting per-file metadata
    metadata_collector = []
    pq.write_to_dataset(
        table,
        root_path=root,
        partition_cols=['letter'],
        metadata_collector=metadata_collector)
    pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))
    # the next call is the one that raises
    pq.write_metadata(table.schema, os.path.join(root, '_metadata'),
                      metadata_collector=metadata_collector)
which gives (on the last line),
RuntimeError: AppendRowGroups requires equal schemas.
on https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2184
It looks like this is because the schema passed for the common metadata
(which has all the columns) differs from the schemas of the collected
FileMetaData objects, which omit the partitioning column, i.e. are of the shape:
<pyarrow._parquet.ParquetSchema object at 0x0000024FB1084100>
required group field_id=0 schema {
required double field_id=1 number;
}
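
As a workaround I can presumably drop the partitioning column from the schema
before writing `_metadata`, so that it matches the collected file metadata.
Something like the following (an untested sketch, continuing the script above;
I'm not sure this is the intended usage):

# assumption: _metadata should describe the data files, which do not contain
# the partition column, so remove 'letter' from the schema before writing
data_schema = table.schema.remove(table.schema.get_field_index('letter'))
pq.write_metadata(data_schema, os.path.join(root, '_metadata'),
                  metadata_collector=metadata_collector)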
Happy to take a more manual approach to writing the metadata (sketched below),
as suggested in the docs for when `write_to_dataset` isn't used, but I was
wondering if this is:
- a known issue,
- for which there is a correct solution (i.e. which of the two schemas should it be?),
- that I could contribute a fix for.
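
For completeness, the more manual route I have in mind would look roughly like
the following, based on the docs' `write_table`/`set_file_path` example (an
untested sketch continuing from the variables above; the per-partition file
layout and names are just illustrative):

# write one file per 'letter' value by hand, collecting each file's metadata
metadata_collector = []
for letter, group in df.groupby('letter'):
    rel_path = f'letter={letter}/part-0.parquet'
    os.makedirs(os.path.join(root, f'letter={letter}'), exist_ok=True)
    # the data files themselves do not contain the partition column
    part_table = pa.Table.from_pandas(group.drop(columns=['letter']),
                                      preserve_index=False)
    pq.write_table(part_table, os.path.join(root, rel_path),
                   metadata_collector=metadata_collector)
    # record each file's path relative to the dataset root
    metadata_collector[-1].set_file_path(rel_path)

# here the schema passed to write_metadata matches the data files
pq.write_metadata(part_table.schema, os.path.join(root, '_metadata'),
                  metadata_collector=metadata_collector)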
-J