[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487900#comment-17487900 ]
Alenka Frim commented on ARROW-14488:
-------------------------------------

Hi [~zijie0], thank you for reporting, and sorry for the late reply.

I think this may be a bug on the Arrow side: when constructing metadata in _dataframe_to_types_ ({_}pandas_compat.py{_}), the conversion from an empty pandas Series to pa.array is wrong in the case of a string (object) dtype. Here is an example:

{code:python}
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> df
   a  b    c
0  a  1  1.0
>>> df["a"]
0    a
Name: a, dtype: object

# Non-empty dataframe
>>> pa.array(df["a"], from_pandas=True)  # Works for non-empty dataframe
<pyarrow.lib.StringArray object at 0x12462ed00>
[
  "a"
]
>>> pa.array(df["a"], from_pandas=True).type
DataType(string)

# Empty dataframe
>>> pa.array(df["a"].head(0), from_pandas=True)  # Becomes NullArray with no dtype in case of string/object
<pyarrow.lib.NullArray object at 0x12462eac0>
0 nulls
>>> pa.array(df["a"].head(0), from_pandas=True).type
DataType(null)
{code}

but that doesn't happen for integer or double:

{code:python}
>>> df["b"]
0    1
Name: b, dtype: int64
>>> pa.array(df["b"], from_pandas=True)
<pyarrow.lib.Int64Array object at 0x12462eac0>
[
  1
]
>>> pa.array(df["b"].head(0), from_pandas=True)
<pyarrow.lib.Int64Array object at 0x12462ea60>
[]
>>> pa.array(df["b"].head(0), from_pandas=True).type
DataType(int64)
{code}

[~jorisvandenbossche] what do you think?

> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14488
>                 URL: https://issues.apache.org/jira/browse/ARROW-14488
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>         Environment: OS: Windows 10, CentOS 7
>            Reporter: Yuan Zhou
>            Priority: Major
>
> We use pandas (with the pyarrow engine) to write out parquet files, and those
> outputs are consumed by other applications such as Java apps using
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty
> dataframes get an incorrect schema for string columns in those
> applications. After some investigation, we narrowed the issue down to the
> schema inference by pyarrow:
> {code:python}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
> Out[4]:
> a: string
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
> Out[5]:
> a: null
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
> Out[6]: '5.0.0'
> {code}
> As you can see, column 'a', which should be of string type, is inferred as
> null type and is converted to int32 when writing to parquet files.
> Is this expected behavior? Or is there a workaround for this issue?
> Could anyone take a look, please? Thanks!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)