Yuan Zhou created ARROW-14488:
---------------------------------

             Summary: [Python] Incorrect inferred schema from pandas dataframe 
with length 0.
                 Key: ARROW-14488
                 URL: https://issues.apache.org/jira/browse/ARROW-14488
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 5.0.0
         Environment: OS: Windows 10, CentOS 7
            Reporter: Yuan Zhou


We use pandas (with the pyarrow engine) to write Parquet files, which are then 
consumed by other applications, such as Java apps using 
org.apache.parquet.hadoop.ParquetFileReader. We found that, for some empty 
DataFrames, those applications see an incorrect schema for string columns. 
After some investigation, we narrowed the issue down to schema inference in 
pyarrow:

{{In [1]: import pandas as pd}}

{{In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])}}

{{In [3]: import pyarrow as pa}}

{{In [4]: pa.Schema.from_pandas(df)}}
{{Out[4]:}}
{{a: string}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562}}

{{In [5]: pa.Schema.from_pandas(df.head(0))}}
{{Out[5]:}}
{{a: null}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560}}

{{In [6]: pa.__version__}}
{{Out[6]: '5.0.0'}}

 

Is this expected behavior? If so, is there a workaround for this issue? 
Could anyone take a look, please? Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
