[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630255#comment-16630255 ]

Krisztian Szucs commented on ARROW-3065:
----------------------------------------

Thanks [~davlee1...@yahoo.com]! That's an excellent example, now I'm able to reproduce. Fix is arriving!

> [Python] concat_tables() failing from bad Pandas Metadata
> ---------------------------------------------------------
>
>                 Key: ARROW-3065
>                 URL: https://issues.apache.org/jira/browse/ARROW-3065
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: David Lee
>            Priority: Major
>             Fix For: 0.11.0
>
>
> Looks like the major bug from https://issues.apache.org/jira/browse/ARROW-1941 is back...
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared..
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>   File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files and inspected the parquet files. The parquet schema is identical, but the Pandas Metadata is different.
> {code:python}
> for i in range(5):
>     pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
>
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
>
> [
>   [
>     "Z4",
>     "SF",
>     "J7",
>     "W6",
>     "L7",
>     "Q9",
>     "NE",
>     "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
>
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
>
> [
>   [
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626470#comment-16626470 ]

David Lee commented on ARROW-3065:
----------------------------------

In pyarrow 0.9.0 the pandas metadata still says float64, but it works..
{code:java}
>>> tbl1.schema
col1: string
col2: string
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": null}, {"name": "col2", "field_name'
            b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me'
            b'tadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl2.schema
col1: string
col2: string
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
            b'type": "float64", "metadata": null}, {"name": "col2", "field_nam'
            b'e": "col2", "pandas_type": "unicode", "numpy_type": "object", "m'
            b'etadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl3.schema
col1: string
col2: string
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": null}, {"name": "col2", "field_name'
            b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me'
            b'tadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl3[0]
chunk 0: [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' ]
chunk 1: [ '', '', '', '', '', '', '', '' ]
{code}
In the 0.10.0 example above that can't produce the error tbl3[0] comes back with:
{code:java}
>>> tbl3[0]
[
  [ "a", "b", "c", "d", "e", "f", "g", "h" ],
  [ "", "", "", "", "", "", "", "" ]
]
{code}
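The schema dumps in this comment differ only in the `numpy_type` recorded for `col1` inside the `b'pandas'` metadata. As a rough sketch of how that difference could be located programmatically, the snippet below decodes metadata dicts shaped like the ones printed above and reports the first one that disagrees with index 0. The function names and the `make_metadata` helper are hypothetical and only rebuild the JSON shown in this thread; nothing here is part of pyarrow itself.

```python
import json


def numpy_types(schema_metadata):
    """Map column name -> numpy_type from a schema metadata dict.

    Assumes the dict looks like the ones printed in this thread:
    {b'pandas': b'<JSON document>'}.
    """
    pandas_meta = json.loads(schema_metadata[b"pandas"].decode("utf8"))
    return {col["name"]: col["numpy_type"] for col in pandas_meta["columns"]}


def first_mismatch(metadata_list):
    """Return (index, {column: (expected, found)}) for the first metadata
    dict whose numpy types differ from those at index 0, or None."""
    reference = numpy_types(metadata_list[0])
    for index, metadata in enumerate(metadata_list[1:], start=1):
        current = numpy_types(metadata)
        diff = {
            name: (reference.get(name), current.get(name))
            for name in sorted(set(reference) | set(current))
            if reference.get(name) != current.get(name)
        }
        if diff:
            return index, diff
    return None


def make_metadata(col1_numpy_type):
    """Hypothetical helper: rebuild metadata shaped like tbl1/tbl2 above."""
    columns = [
        {"name": "col1", "field_name": "col1", "pandas_type": "unicode",
         "numpy_type": col1_numpy_type, "metadata": None},
        {"name": "col2", "field_name": "col2", "pandas_type": "unicode",
         "numpy_type": "object", "metadata": None},
    ]
    doc = {"index_columns": [], "column_indexes": [], "columns": columns,
           "pandas_version": "0.23.0"}
    return {b"pandas": json.dumps(doc).encode("utf8")}


tbl1_meta = make_metadata("object")    # same shape as tbl1.schema metadata
tbl2_meta = make_metadata("float64")   # same shape as tbl2.schema metadata

print(first_mismatch([tbl1_meta, tbl2_meta]))
# prints (1, {'col1': ('object', 'float64')})
```

With real tables, the same functions could presumably be fed `tbl.schema.metadata` for each table in the list passed to `pa.concat_tables()`, assuming the tables were built with `Table.from_pandas` so that the `b'pandas'` key is present.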
[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336 ]

David Lee commented on ARROW-3065:
----------------------------------

This test fails.. Tested against 0.10.0.. Works in 0.9.0

{{import pandas as pd}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}

{{schema = pa.schema([}}
{{pa.field('col1', pa.string()),}}
{{pa.field('col2', pa.string()),}}
{{])}}
{{df1 = pd.DataFrame([\{"col1": v, "col2": v} for v in list("abcdefgh")])}}
{{df2 = pd.DataFrame([\{"col2": v} for v in list("abcdefgh")])}}

{{df1 = df1.reindex(columns=schema.names)}}
{{df2 = df2.reindex(columns=schema.names)}}

{{tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False)}}
{{tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False)}}

{{tbl3 = pa.concat_tables([tbl1, tbl2])}}

{{Traceback (most recent call last):}}
{{ File "", line 1, in }}
{{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
{{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
{{pyarrow.lib.ArrowInvalid: Schema at index 1 was different:}}
[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625011#comment-16625011 ]

Krisztian Szucs commented on ARROW-3065:
----------------------------------------

I've tried to reproduce it, sadly with no luck. Based on your description, I used the following snippet including the parquet roundtrip:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df1 = pd.DataFrame({"col": list("abcdefgh")})
df2 = pd.DataFrame({"col": [""] * 8})

tbl1 = pa.Table.from_pandas(df1)
tbl2 = pa.Table.from_pandas(df2)

pq.write_table(tbl1, 'tbl1.parquet')
pq.write_table(tbl2, 'tbl2.parquet')

tbl1_ = pq.read_table('tbl1.parquet')
tbl2_ = pq.read_table('tbl2.parquet')

pa.concat_tables([tbl1_, tbl2_])

print(tbl2.schema)
{code}
Also the column which contains empty strings correctly has object numpy_type instead of float64.

Tested with 0.10.0, 0.9.0, HEAD(391516df8ce084c279e854cf52c8beb4a4fc444a)

[~davlee1...@yahoo.com] Could You please provide a more reproducible example?