[ https://issues.apache.org/jira/browse/ARROW-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe L. Korn resolved ARROW-4088.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 3256
[https://github.com/apache/arrow/pull/3256]

> [Python] Table.from_batches() fails when passed a schema with metadata
> ----------------------------------------------------------------------
>
>                 Key: ARROW-4088
>                 URL: https://issues.apache.org/jira/browse/ARROW-4088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.11.0
>            Reporter: Thomas Buhrmann
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: pull-request-available, regression
>             Fix For: 0.12.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This seems to be a regression. In 0.10 I used to have this function to set
> column-level and table-level metadata on an existing Table:
>
> {code:python}
> def set_metadata(tbl, col_meta={}, tbl_meta={}):
>     # Create updated column fields with new metadata
>     if col_meta or tbl_meta:
>         fields = []
>         for col in tbl.itercolumns():
>             if col.name in col_meta:
>                 # Get updated column metadata
>                 metadata = col.field.metadata or {}
>                 for k, v in col_meta[col.name].items():
>                     metadata[k] = json.dumps(v).encode('utf-8')
>                 # Update field with updated metadata
>                 fields.append(col.field.add_metadata(metadata))
>             else:
>                 fields.append(col.field)
>         # Get updated table metadata
>         tbl_metadata = tbl.schema.metadata
>         for k, v in tbl_meta.items():
>             tbl_metadata[k] = json.dumps(v).encode('utf-8')
>         # Create new schema with updated metadata
>         schema = pa.schema(fields, metadata=tbl_metadata)
>         # With updated schema build new table (shouldn't copy data?)
>         tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)
>     return tbl
> {code}
> However, in 0.11 this fails with error:
> {noformat}
> ArrowInvalid: Schema at index 0 was different:
> x: int64
> vs
> x: int64
> ...
> {noformat}
> It works however if I replace from_batches() with from_arrays(), like this:
> {code}
> tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
> {code}
> It seems that from_batches() compares the existing batch's schema with the
> new schema, and upon encountering a difference (in metadata only) fails.
> A short test would be this:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [0,1,2]})
> tbl = pa.Table.from_pandas(df, preserve_index=False)
> field = tbl.schema[0].add_metadata({'test': 'data'})
> schema = pa.schema([field])
> # tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
> tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)
> tbl2.schema[0].metadata
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
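
The reporter's hypothesis (that from_batches() compares each batch's schema with the requested schema and rejects a metadata-only difference) can be illustrated with a small toy model. This is only a sketch in plain Python: `ToyField` and `schemas_equal` are hypothetical names invented for illustration, not pyarrow classes or functions, and the code makes no claim about how Arrow actually implements the check. A strict comparison that includes metadata reports a mismatch even though both schemas would print as "x: int64" once metadata is left out of the printed form, which would be consistent with the error shown in the report.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyField:
    """Hypothetical stand-in for a schema field (not a pyarrow class)."""
    name: str
    type: str
    metadata: Optional[Dict[bytes, bytes]] = None


def schemas_equal(a: List[ToyField], b: List[ToyField],
                  check_metadata: bool = True) -> bool:
    """Compare two toy schemas field by field.

    With check_metadata=True, a difference in metadata alone makes the
    schemas unequal; with check_metadata=False, only names and types count.
    """
    if len(a) != len(b):
        return False
    for fa, fb in zip(a, b):
        if (fa.name, fa.type) != (fb.name, fb.type):
            return False
        # Treat missing metadata (None) the same as empty metadata.
        if check_metadata and (fa.metadata or {}) != (fb.metadata or {}):
            return False
    return True


# Schema carried by the existing batch: no metadata on the field.
batch_schema = [ToyField("x", "int64")]
# Requested schema: same name and type, plus field-level metadata.
new_schema = [ToyField("x", "int64", {b"test": b"data"})]

strict = schemas_equal(batch_schema, new_schema, check_metadata=True)
lenient = schemas_equal(batch_schema, new_schema, check_metadata=False)
print(strict, lenient)
```

In this model the strict comparison rejects the pair and the lenient one accepts it, which mirrors the behaviour difference the reporter observed between from_batches() and from_arrays().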