[ 
https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488402#comment-17488402
 ] 

Alenka Frim commented on ARROW-14488:
-------------------------------------

Thank you Joris!

An example of passing an explicit schema to Table.from_pandas would be:
{code:python}
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> import pyarrow as pa
>>> 
>>> schema = pa.schema([
...    pa.field('a', pa.string()),
...    pa.field('b', pa.int64()),
...    pa.field('c', pa.float64())])
>>> 
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
a: string
b: int64
c: double
----
a: [["a"]]
b: [[1]]
c: [[1]]
>>> pa.Table.from_pandas(df.head(0), schema=schema)
pyarrow.Table
a: string
b: int64
c: double
----
a: [[]]
b: [[]]
c: [[]]
{code}

> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14488
>                 URL: https://issues.apache.org/jira/browse/ARROW-14488
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>         Environment: OS: Windows 10, CentOS 7
>            Reporter: Yuan Zhou
>            Priority: Major
>
> We use pandas (with the pyarrow engine) to write Parquet files, and those 
> outputs are consumed by other applications, such as Java apps using 
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty 
> dataframes get an incorrect schema for string columns in those 
> applications. After some investigation, we narrowed the issue down to 
> schema inference in pyarrow:
> {code:python}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
>  Out[4]:
>  a: string
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
>  Out[5]:
>  a: null
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
>  Out[6]: '5.0.0'
> {code}
> As you can see, column 'a', which should be string type, is inferred as 
> null type and is converted to int32 when written to Parquet files.
> Is this expected behavior? Is there a workaround for this issue? 
> Could anyone take a look, please? Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
