[ https://issues.apache.org/jira/browse/ARROW-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kyle Barron updated ARROW-16287:
--------------------------------
    Description: 
I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset. But I'm consistently getting an error about non-equal schemas. Here's an MCVE:
{code:java}
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)
{code}
This raises the error
{code:java}
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322     metadata = read_metadata(where)
   2323     for m in metadata_collector:
-> 2324         metadata.append_row_groups(m)
   2325     metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
{code}
But all schemas in the `metadata_collector` list seem to be the same:
{code:java}
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True
{code}

  was:
I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset. But I'm consistently getting an error about non-equal schemas. Here's an MCVE:

```
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)
```

This raises the error

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322     metadata = read_metadata(where)
   2323     for m in metadata_collector:
-> 2324         metadata.append_row_groups(m)
   2325     metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
```

But all schemas in the `metadata_collector` list seem to be the same:

```
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True
```


> PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-16287
>                 URL: https://issues.apache.org/jira/browse/ARROW-16287
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 7.0.0
>         Environment: MacOS. Python 3.8.10.
> pyarrow: '7.0.0'
> pandas: '1.4.2'
> numpy: '1.22.3'
>            Reporter: Kyle Barron
>            Priority: Major
>
> I'm trying to follow the example here: 
> [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files]
> to write an example partitioned dataset. But I'm consistently getting an
> error about non-equal schemas. Here's an MCVE:
> {code:java}
> from pathlib import Path
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> size = 100_000_000
> partition_col = np.random.randint(0, 10, size)
> values = np.random.rand(size)
> table = pa.Table.from_pandas(
>     pd.DataFrame({"partition_col": partition_col, "values": values})
> )
> metadata_collector = []
> root_path = Path("random.parquet")
> pq.write_to_dataset(
>     table,
>     root_path,
>     partition_cols=["partition_col"],
>     metadata_collector=metadata_collector,
> )
> # Write the ``_common_metadata`` parquet file without row groups statistics
> pq.write_metadata(table.schema, root_path / "_common_metadata")
> # Write the ``_metadata`` parquet file with row groups statistics of all files
> pq.write_metadata(
>     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
> )
> {code}
> This raises the error
> {code:java}
> ---------------------------------------------------------------------------
> RuntimeError                              Traceback (most recent call last)
> Input In [92], in <cell line: 1>()
> ----> 1 pq.write_metadata(
>       2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
>       3 )
> File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
>    2322     metadata = read_metadata(where)
>    2323     for m in metadata_collector:
> -> 2324         metadata.append_row_groups(m)
>    2325     metadata.write_metadata_file(where)
> File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()
> RuntimeError: AppendRowGroups requires equal schemas.
> {code}
> But all schemas in the `metadata_collector` list seem to be the same:
> {code:java}
> all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
> # True
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)