Yuan Zhou created ARROW-14488:
---------------------------------

Summary: [Python] Incorrect inferred schema from pandas dataframe with length 0.
Key: ARROW-14488
URL: https://issues.apache.org/jira/browse/ARROW-14488
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 5.0.0
Environment: OS: Windows 10, CentOS 7
Reporter: Yuan Zhou

We use pandas (with the pyarrow engine) to write out parquet files, and those outputs are consumed by other applications, such as Java apps using org.apache.parquet.hadoop.ParquetFileReader. We found that some empty dataframes get an incorrect schema for string columns when read in those applications. After some investigation, we narrowed the issue down to the schema inference done by pyarrow: for a dataframe with no rows, the string column a is inferred as null instead of string.

{{In [1]: import pandas as pd}}
{{In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])}}
{{In [3]: import pyarrow as pa}}
{{In [4]: pa.Schema.from_pandas(df)}}
{{Out[4]:}}
{{a: string}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562}}
{{In [5]: pa.Schema.from_pandas(df.head(0))}}
{{Out[5]:}}
{{a: null}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560}}
{{In [6]: pa.__version__}}
{{Out[6]: '5.0.0'}}

Is this expected behavior? If so, is there a workaround for this issue? Could anyone take a look, please?

Thanks!