[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-27 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630255#comment-16630255
 ] 

Krisztian Szucs commented on ARROW-3065:


Thanks [~davlee1...@yahoo.com]! That's an excellent example; I can now 
reproduce the issue. A fix is on the way!

> [Python] concat_tables() failing from bad Pandas Metadata
> -
>
> Key: ARROW-3065
> URL: https://issues.apache.org/jira/browse/ARROW-3065
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.10.0
> Reporter: David Lee
> Priority: Major
> Fix For: 0.11.0
>
>
> Looks like the major bug from 
> https://issues.apache.org/jira/browse/ARROW-1941 is back.
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared.
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>   File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> To debug this, I saved the first 4 Arrow tables to 4 Parquet files 
> and inspected them. The Parquet schema is identical, but the 
> pandas metadata differs.
> {code:python}
> for i in range(4):  # first 4 tables only
>     pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as 
> float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> 
> [
>   [
> "Z4",
> "SF",
> "J7",
> "W6",
> "L7",
> "Q9",
> "NE",
> "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> 
> [
>   [
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> {code}
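
The comparison described in the report (same Arrow types, different pandas metadata) can be sketched without reading the raw bytes by eye. This is a minimal sketch, not part of the original report: the helper names `numpy_types`, `diff_numpy_types` and `make_metadata` are hypothetical, and the sample metadata is a completed stand-in for the truncated output quoted above.

```python
import json


def numpy_types(schema_metadata):
    """Map column name -> numpy_type from a schema's b'pandas' metadata entry."""
    raw = schema_metadata.get(b"pandas")
    if raw is None:
        return {}
    pandas_meta = json.loads(raw.decode("utf-8"))
    return {col["name"]: col["numpy_type"] for col in pandas_meta["columns"]}


def diff_numpy_types(reference, other):
    """Return {column: (reference_type, other_type)} for columns that disagree."""
    ref_types = numpy_types(reference)
    other_types = numpy_types(other)
    return {
        name: (ref_types.get(name), other_types.get(name))
        for name in sorted(set(ref_types) | set(other_types))
        if ref_types.get(name) != other_types.get(name)
    }


def make_metadata(numpy_type):
    """Build a hypothetical, complete metadata dict shaped like the one in the report."""
    column = {
        "name": "HoldingDetail_Id",
        "field_name": "HoldingDetail_Id",
        "pandas_type": "unicode",
        "numpy_type": numpy_type,
        "metadata": None,
    }
    payload = {"index_columns": [], "column_indexes": [], "columns": [column]}
    return {b"pandas": json.dumps(payload).encode("utf-8")}


test1_metadata = make_metadata("object")    # stands in for test1.schema.metadata
test2_metadata = make_metadata("float64")   # stands in for test2.schema.metadata

mismatches = diff_numpy_types(test1_metadata, test2_metadata)
print(mismatches)  # {'HoldingDetail_Id': ('object', 'float64')}
```

With real tables, the dictionaries would presumably come from `table.schema.metadata`, which holds byte keys and values in the shape printed above.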



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-24 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626470#comment-16626470
 ] 

David Lee commented on ARROW-3065:
--

In pyarrow 0.9.0 the pandas metadata for col1 in tbl2 still says float64, but concat_tables() works.

 
{code:python}
>>> tbl1.schema
col1: string
col2: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": null}, {"name": "col2", "field_name'
b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me'
b'tadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl2.schema
col1: string
col2: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
b'type": "float64", "metadata": null}, {"name": "col2", "field_nam'
b'e": "col2", "pandas_type": "unicode", "numpy_type": "object", "m'
b'etadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl3.schema
col1: string
col2: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": null}, {"name": "col2", "field_name'
b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me'
b'tadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl3[0]

chunk 0: 
[
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h'
]
chunk 1: 
[
'',
'',
'',
'',
'',
'',
'',
''
]

{code}
In the 0.10.0 example above, which does not reproduce the error, tbl3[0] comes 
back with:

 
{code:python}
>>> tbl3[0]

[
[
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h"
],
[
"",
"",
"",
"",
"",
"",
"",
""
]
]


{code}
 



[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-24 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336
 ] 

David Lee commented on ARROW-3065:
--

This test fails against 0.10.0 but works in 0.9.0.


{{import pandas as pd}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}

{{schema = pa.schema([}}
{{    pa.field('col1', pa.string()),}}
{{    pa.field('col2', pa.string()),}}
{{])}}

{{df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])}}
{{df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])}}

{{df1 = df1.reindex(columns=schema.names)}}
{{df2 = df2.reindex(columns=schema.names)}}

{{tbl1 = pa.Table.from_pandas(df1, schema=schema, preserve_index=False)}}
{{tbl2 = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)}}

{{tbl3 = pa.concat_tables([tbl1, tbl2])}}

{{Traceback (most recent call last):}}
{{ File "", line 1, in }}
{{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
{{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
{{pyarrow.lib.ArrowInvalid: Schema at index 1 was different:}}
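
One plausible reading of where float64 comes from (an assumption, not something confirmed in this thread): df2 has no col1, so reindex() adds it as an all-missing column, and pandas stores such a column as float64. The sketch below checks only the pandas side of the test; `column_names` is a stand-in for `schema.names`.

```python
import pandas as pd

# Same column order as schema.names in the test above (stand-in list).
column_names = ["col1", "col2"]

# df2 is built without col1, exactly as in the failing test.
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])
df2 = df2.reindex(columns=column_names)

# The column added by reindex() holds only missing values.
col1_dtype = str(df2["col1"].dtype)
col1_all_missing = bool(df2["col1"].isna().all())
print(col1_dtype, col1_all_missing)
```

If this reading is right, the numpy_type recorded in the pandas metadata reflects the DataFrame column's dtype rather than the string type requested through the schema.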



[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-23 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625011#comment-16625011
 ] 

Krisztian Szucs commented on ARROW-3065:


I've tried to reproduce it, sadly with no luck.
Based on your description, I used the following snippet including the parquet 
roundtrip:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


df1 = pd.DataFrame({"col": list("abcdefgh")})
df2 = pd.DataFrame({"col":  [""] * 8})

tbl1 = pa.Table.from_pandas(df1)
tbl2 = pa.Table.from_pandas(df2)

pq.write_table(tbl1, 'tbl1.parquet')
pq.write_table(tbl2, 'tbl2.parquet')

tbl1_ = pq.read_table('tbl1.parquet')
tbl2_ = pq.read_table('tbl2.parquet')

pa.concat_tables([tbl1_, tbl2_])
print(tbl2.schema)
{code}

Also, the column that contains empty strings correctly has numpy_type object 
rather than float64.
Tested with 0.10.0, 0.9.0, and HEAD (391516df8ce084c279e854cf52c8beb4a4fc444a).

[~davlee1...@yahoo.com] Could you please provide a reproducible example?
