[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336
 ] 

David Lee edited comment on ARROW-3065 at 9/24/18 7:58 PM:
-----------------------------------------------------------

This test fails on 0.10.0 but works on 0.9.0. In one of the tables the 
column does not exist initially and is added with DataFrame.reindex(). The 
reason is that the original file(s) being converted to Parquet may or 
may not contain all 100+ columns.
{quote}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('col1', pa.string()),
    pa.field('col2', pa.string()),
])

df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema=schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{quote}


> [Python] concat_tables() failing from bad Pandas Metadata
> ---------------------------------------------------------
>
>                 Key: ARROW-3065
>                 URL: https://issues.apache.org/jira/browse/ARROW-3065
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: David Lee
>            Priority: Major
>             Fix For: 0.12.0
>
>
> It looks like the major bug from 
> https://issues.apache.org/jira/browse/ARROW-1941 is back.
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared.
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>   File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files 
> and inspected the parquet files. The parquet schema is identical, but the 
> Pandas Metadata is different.
> {code:python}
> for i in range(4):
>      pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as 
> float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> <Column name='HoldingDetail_Id' type=DataType(string)>
> [
>   [
>     "Z4",
>     "SF",
>     "J7",
>     "W6",
>     "L7",
>     "Q9",
>     "NE",
>     "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> <Column name='HoldingDetail_Id' type=DataType(string)>
> [
>   [
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
>     "",
> {code}
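
The metadata difference described above can be located programmatically rather than by reading the schema dumps by eye. The following is a minimal sketch using only the standard library; the helper names `pandas_column_types` and `diff_pandas_metadata` are hypothetical, and the two sample metadata dicts are hand-built to mirror the (truncated) `test1` and `test2` output in the description. In real use the inputs would presumably be `table.schema.metadata` from each table.

```python
import json


def pandas_column_types(schema_metadata):
    """Map column name -> (pandas_type, numpy_type).

    `schema_metadata` is assumed to look like `table.schema.metadata`:
    a dict with a b'pandas' key holding a JSON document, or None.
    """
    if not schema_metadata or b"pandas" not in schema_metadata:
        return {}
    meta = json.loads(schema_metadata[b"pandas"].decode("utf-8"))
    return {
        col["name"]: (col["pandas_type"], col["numpy_type"])
        for col in meta.get("columns", [])
    }


def diff_pandas_metadata(meta_a, meta_b):
    """Return {column: (types_in_a, types_in_b)} for columns that differ."""
    a = pandas_column_types(meta_a)
    b = pandas_column_types(meta_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b), key=str)
        if a.get(name) != b.get(name)
    }


def _sample_metadata(numpy_type):
    # Hand-built stand-in for table.schema.metadata, modelled on the
    # HoldingDetail_Id column shown in the issue description.
    doc = {
        "index_columns": [],
        "column_indexes": [],
        "columns": [
            {
                "name": "HoldingDetail_Id",
                "field_name": "HoldingDetail_Id",
                "pandas_type": "unicode",
                "numpy_type": numpy_type,
                "metadata": None,
            }
        ],
    }
    return {b"pandas": json.dumps(doc).encode("utf-8")}


meta1 = _sample_metadata("object")   # like test1
meta2 = _sample_metadata("float64")  # like test2

for column, (left, right) in diff_pandas_metadata(meta1, meta2).items():
    print(column, left, right)
```

Only columns whose recorded pandas or NumPy type differs are reported, which keeps the output short when the tables have 100+ columns.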



