[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-7087:
--------------------------------
    Summary: [Python] Table Metadata disappear when we write a partitioned dataset  (was: [Pyarrow] Table Metadata disappear when we write a partitioned dataset)

> [Python] Table Metadata disappear when we write a partitioned dataset
> ---------------------------------------------------------------------
>
>                 Key: ARROW-7087
>                 URL: https://issues.apache.org/jira/browse/ARROW-7087
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: François Blanchard
>            Priority: Major
>
> There is an unexpected behavior with the method
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
> in *pyarrow/parquet.py*.
> When we write a table that contains metadata, the metadata is replaced by
> pandas metadata. This happens only if *partition_cols* is defined.
>
> To be more explicit, here is some example code:
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from columns, attaching custom metadata
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA',
> 'columnB'], metadata={'data': 'test'})
> print(table.schema.metadata)
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from the first parquet file of the dataset
> ds = pq.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print(load_table.schema.metadata)
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library":
> >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns":
> >> [{"metadata": null, "field_name": "columnA", "name": "columnA",
> >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes":
> >> []}')])
> """{code}
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)