[ 
https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487900#comment-17487900
 ] 

Alenka Frim commented on ARROW-14488:
-------------------------------------

Hi [~zijie0], thank you for reporting! And sorry for the late reply.

I think this may be a bug on the Arrow side: when constructing metadata in 
_dataframe_to_types_ (_pandas_compat.py_), the conversion from an empty pandas 
Series to pa.array is wrong in the case of a string (object) dtype. Here is an example:
{code:python}
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> df
   a  b    c
0  a  1  1.0
>>> df["a"]
0    a
Name: a, dtype: object

# Non-empty dataframe
>>> pa.array(df["a"], from_pandas=True) # Works for non-empty dataframe
<pyarrow.lib.StringArray object at 0x12462ed00>
[
  "a"
]
>>> pa.array(df["a"], from_pandas=True).type
DataType(string)

# Empty dataframe
>>> pa.array(df["a"].head(0), from_pandas=True) # Becomes NullArray with no dtype in case of string/object
<pyarrow.lib.NullArray object at 0x12462eac0>
0 nulls
>>> pa.array(df["a"].head(0), from_pandas=True).type
DataType(null)
{code}
but that doesn't happen for integer or double:
{code:python}
>>> df["b"]
0    1
Name: b, dtype: int64

>>> pa.array(df["b"], from_pandas=True)
<pyarrow.lib.Int64Array object at 0x12462eac0>
[
  1
]

>>> pa.array(df["b"].head(0), from_pandas=True)
<pyarrow.lib.Int64Array object at 0x12462ea60>
[]
>>> pa.array(df["b"].head(0), from_pandas=True).type
DataType(int64)
{code}
[~jorisvandenbossche] what do you think?

> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14488
>                 URL: https://issues.apache.org/jira/browse/ARROW-14488
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>         Environment: OS: Windows 10, CentOS 7
>            Reporter: Yuan Zhou
>            Priority: Major
>
> We use pandas (with the pyarrow engine) to write out parquet files, and those 
> outputs are consumed by other applications such as Java apps using 
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty 
> dataframes get an incorrect schema for string columns in those 
> applications. After some investigation, we narrowed the issue down to the 
> schema inference by pyarrow:
> {code:python}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
>  Out[4]:
>  a: string
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
>  Out[5]:
>  a: null
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
>  Out[6]: '5.0.0'
> {code}
> As you can see, the column 'a', which should be string type, is inferred as 
> null type and is converted to int32 while writing to parquet files.
> Is this expected behavior? Or is there any workaround for this issue? 
> Could anyone take a look, please? Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
