[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs reassigned ARROW-3065:
--------------------------------------

    Assignee: Krisztian Szucs

> [Python] concat_tables() failing from bad Pandas Metadata
> ---------------------------------------------------------
>
>                 Key: ARROW-3065
>                 URL: https://issues.apache.org/jira/browse/ARROW-3065
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: David Lee
>            Assignee: Krisztian Szucs
>            Priority: Major
>             Fix For: 0.11.0
>
>
> Looks like the major bug from
> https://issues.apache.org/jira/browse/ARROW-1941 is back...
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared..
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>   File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files
> and inspected the parquet files. The parquet schema is identical, but the
> Pandas Metadata is different.
> {code:python}
> for i in range(5):
>     pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as
> float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type":
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> <Column name='HoldingDetail_Id' type=DataType(string)>
> [
>   [
>     "Z4",
>     "SF",
>     "J7",
>     "W6",
>     "L7",
>     "Q9",
>     "NE",
>     "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type":
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> <Column name='HoldingDetail_Id' type=DataType(string)>
> [
>   [
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
> {code}



--
This message was sent by Atlassian JIRA (v7.6.3#76005)
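
The mismatch reported in the issue (same Arrow type, different "numpy_type" inside the b'pandas' schema metadata) can be located without reading the JSON by eye. The following is only a minimal sketch of one way to do that, using the standard library alone. The helper names pandas_column_info, diff_pandas_columns and sample_metadata are hypothetical and not part of pyarrow, and the sample metadata is abridged to the single column quoted in the report rather than copied from the reporter's files.

```python
import json


def pandas_column_info(schema_metadata):
    """Map column name -> column entry from the b'pandas' metadata blob.

    `schema_metadata` is assumed to be a dict of bytes keys to bytes values,
    the shape shown in the schema output quoted in the issue.
    """
    raw = schema_metadata.get(b"pandas")
    if raw is None:
        return {}
    columns = json.loads(raw.decode("utf-8"))["columns"]
    return {col["name"]: col for col in columns}


def diff_pandas_columns(meta_a, meta_b):
    """Return (column, key, value_a, value_b) for every differing entry."""
    cols_a = pandas_column_info(meta_a)
    cols_b = pandas_column_info(meta_b)
    diffs = []
    # key=str keeps the sort stable even if a column name is None.
    for name in sorted(set(cols_a) | set(cols_b), key=str):
        a = cols_a.get(name, {})
        b = cols_b.get(name, {})
        for key in sorted(set(a) | set(b)):
            if a.get(key) != b.get(key):
                diffs.append((name, key, a.get(key), b.get(key)))
    return diffs


def sample_metadata(numpy_type):
    """Hypothetical, abridged stand-in for the metadata quoted in the issue."""
    column = {
        "name": "HoldingDetail_Id",
        "field_name": "HoldingDetail_Id",
        "pandas_type": "unicode",
        "numpy_type": numpy_type,
        "metadata": None,
    }
    blob = {"index_columns": [], "column_indexes": [], "columns": [column]}
    return {b"pandas": json.dumps(blob).encode("utf-8")}


meta1 = sample_metadata("object")   # column with non-empty strings
meta2 = sample_metadata("float64")  # column with only empty strings
diffs = diff_pandas_columns(meta1, meta2)
print(diffs)
# prints [('HoldingDetail_Id', 'numpy_type', 'object', 'float64')]
```

With real tables, the same functions could presumably be fed `table.schema.metadata or {}` for each table in the list, since that attribute is a bytes-to-bytes dict, or None when no metadata is attached.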