Thomas Buhrmann created ARROW-4088:
--------------------------------------

             Summary: Table.from_batches() fails when passed a schema with metadata
                 Key: ARROW-4088
                 URL: https://issues.apache.org/jira/browse/ARROW-4088
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.11.0
            Reporter: Thomas Buhrmann

This seems to be a regression. In 0.10 I used the following function to set column-level and table-level metadata on an existing Table:

{code:python}
import json

import pyarrow as pa


def set_metadata(tbl, col_meta={}, tbl_meta={}):
    # Create updated column fields with new metadata
    if col_meta or tbl_meta:
        fields = []
        for col in tbl.itercolumns():
            if col.name in col_meta:
                # Get updated column metadata
                metadata = col.field.metadata or {}
                for k, v in col_meta[col.name].items():
                    metadata[k] = json.dumps(v).encode('utf-8')
                # Update field with updated metadata
                fields.append(col.field.add_metadata(metadata))
            else:
                fields.append(col.field)

        # Get updated table metadata (may be None if none was set)
        tbl_metadata = tbl.schema.metadata or {}
        for k, v in tbl_meta.items():
            tbl_metadata[k] = json.dumps(v).encode('utf-8')

        # Create new schema with updated metadata
        schema = pa.schema(fields, metadata=tbl_metadata)

        # With updated schema build new table (shouldn't copy data?)
        tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)

    return tbl
{code}

However, in 0.11 this fails with the error:

{noformat}
ArrowInvalid: Schema at index 0 was different:
x: int64
vs
x: int64
...
{noformat}

It works, however, if I replace from_batches() with from_arrays(), like this:

{code}
tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
{code}

It seems that from_batches() compares each existing batch's schema with the new schema and fails as soon as it finds a difference, even when the two differ only in metadata. Note that the error message prints both schemas as "x: int64", so the metadata difference is not visible in it.

A short test would be this:

{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [0, 1, 2]})
tbl = pa.Table.from_pandas(df, preserve_index=False)

field = tbl.schema[0].add_metadata({'test': 'data'})
schema = pa.schema([field])

# tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)

tbl2.schema[0].metadata
{code}
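
For reference, the metadata-merging step of set_metadata() can be isolated from pyarrow and tested on its own. The following is only a minimal sketch, assuming metadata is a plain dict (or None when nothing is set) whose values must be bytes. merge_metadata is a hypothetical helper name, not a pyarrow API, and normalizing keys to bytes is an assumption of this sketch.

```python
import json


def merge_metadata(existing, updates):
    """Return a new dict with JSON-encoded `updates` layered over `existing`.

    `existing` may be None (no metadata set) and is never mutated.
    Keys are normalized to bytes so that 'k' and b'k' do not end up
    as two separate entries (an assumption of this sketch).
    """
    def to_bytes(key):
        return key if isinstance(key, bytes) else str(key).encode('utf-8')

    # Copy the existing entries instead of modifying them in place.
    merged = {to_bytes(k): v for k, v in (existing or {}).items()}
    # Encode each new value the same way set_metadata() does.
    for k, v in updates.items():
        merged[to_bytes(k)] = json.dumps(v).encode('utf-8')
    return merged


# Usage: merge into empty metadata, then over existing metadata.
print(merge_metadata(None, {'unit': 'cm'}))
print(merge_metadata({b'unit': b'"cm"'}, {'scale': 2}))
```

The resulting dict could then be passed as metadata= to pa.schema(), with the table rebuilt through from_arrays() as in the workaround above.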