I just ran into this myself not too long ago :). I've been adding support for this process to the new write_dataset API, which will eventually (hopefully) obsolete pq.write_to_dataset.
> Happy to take a more manual approach to writing the metadata - as suggested
> in the docs for when write_dataset isn't used, but was wondering if this is,
> - a known issue

Yes. You can find more details in ARROW-13269 [1].

> - for which there is a correct solution (i.e. which of the two schemas
>   should it be?)

The _common_metadata should have all of the columns (including
partitioning). The _metadata should not. Right now this also only works if
all the files in _metadata have the same schema, so let us know if that is
an issue for your use case. (I've put a rough sketch of what I mean at the
bottom of this mail, below your quoted message.)

> - that I could contribute a fix for.

The documentation around this could definitely be improved. There is
precious little non-Arrow documentation around these files, so it is rather
tricky to google for. If you have suggestions on ways this process could be
made easier, that is always welcome too.

[1] https://issues.apache.org/jira/browse/ARROW-13269

On Tue, Jul 20, 2021 at 6:18 AM Joris Peeters <[email protected]> wrote:
>
> The docs on https://arrow.apache.org/docs/python/parquet.html suggest a
> mechanism for collecting and writing metadata when using
> `pq.write_to_dataset` to build the dataset on disk.
>
> Using that mechanism I ran into an issue when employing partitions. Perhaps
> most easily demonstrated with a little script to reproduce, using a table
> with two columns, one of which (`letter`) is the partitioning column.
>
> import tempfile
> import random
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import os
> import pandas as pd
>
> N = 100
> df = pd.DataFrame({
>     'letter': [random.choice(['A', 'B']) for _ in range(0, N)],
>     'number': np.random.rand(N)})
>
> table = pa.Table.from_pandas(df, schema=pa.schema(fields=[
>     pa.field('letter', pa.string(), nullable=False),
>     pa.field('number', pa.float64(), nullable=False)]))
>
> with tempfile.TemporaryDirectory() as root:
>     metadata_collector = []
>     pq.write_to_dataset(
>         table,
>         root_path=root,
>         partition_cols=['letter'],
>         metadata_collector=metadata_collector)
>
>     pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))
>     pq.write_metadata(table.schema, os.path.join(root, '_metadata'),
>                       metadata_collector=metadata_collector)
>
> which gives (on the last line),
>
> RuntimeError: AppendRowGroups requires equal schemas.
>
> on https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2184
>
> It looks like this is because the common metadata's schema (which has all
> the cols) differs from the collected file metadata's schemas - which omit
> the partitioning column, i.e. are of shape:
>
> <pyarrow._parquet.ParquetSchema object at 0x0000024FB1084100>
> required group field_id=0 schema {
>   required double field_id=1 number;
> }
>
> Happy to take a more manual approach to writing the metadata - as suggested
> in the docs for when write_dataset isn't used, but was wondering if this is,
>
> - a known issue
>
> - for which there is a correct solution (i.e. which of the two schemas
>   should it be?)
>
> - that I could contribute a fix for.
>
> -J
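
P.S. Here is the rough, untested sketch I mentioned above. It is your
reproduction script with only the _metadata call changed, so that the schema
passed in matches the schemas collected in metadata_collector. I'm assuming
table.drop(['letter']).schema is exactly the schema write_to_dataset used for
the data files; whether that holds in every case may depend on your pyarrow
version, so treat this as a starting point rather than a guaranteed recipe.

import os
import random
import tempfile

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

N = 100
df = pd.DataFrame({
    'letter': [random.choice(['A', 'B']) for _ in range(0, N)],
    'number': np.random.rand(N)})

table = pa.Table.from_pandas(df, schema=pa.schema(fields=[
    pa.field('letter', pa.string(), nullable=False),
    pa.field('number', pa.float64(), nullable=False)]))

with tempfile.TemporaryDirectory() as root:
    metadata_collector = []
    pq.write_to_dataset(
        table,
        root_path=root,
        partition_cols=['letter'],
        metadata_collector=metadata_collector)

    # _common_metadata: the full schema, partitioning column included.
    pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))

    # _metadata: the schema the data files were actually written with,
    # i.e. without the partitioning column, so it matches the schemas in
    # metadata_collector and AppendRowGroups has nothing to complain about.
    file_schema = table.drop(['letter']).schema
    pq.write_metadata(file_schema, os.path.join(root, '_metadata'),
                      metadata_collector=metadata_collector)

    # Quick sanity check: _metadata should now contain the row groups from
    # both partition directories, and _common_metadata the full schema.
    print(pq.read_metadata(os.path.join(root, '_metadata')))
    print(pq.read_schema(os.path.join(root, '_common_metadata')))

(If you partition on more than one column you would drop all of them when
building file_schema, of course.)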
