[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488402#comment-17488402 ]
Alenka Frim commented on ARROW-14488:
-------------------------------------

Thank you Joris! An example would be:

{code:python}
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> import pyarrow as pa
>>>
>>> schema = pa.schema([
...     pa.field('a', pa.string()),
...     pa.field('b', pa.int64()),
...     pa.field('c', pa.float64())])
>>>
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
a: string
b: int64
c: double
----
a: [["a"]]
b: [[1]]
c: [[1]]
>>> pa.Table.from_pandas(df.head(0), schema=schema)
pyarrow.Table
a: string
b: int64
c: double
----
a: [[]]
b: [[]]
c: [[]]
{code}

> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14488
>                 URL: https://issues.apache.org/jira/browse/ARROW-14488
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>         Environment: OS: Windows 10, CentOS 7
>            Reporter: Yuan Zhou
>            Priority: Major
>
> We use pandas (with the pyarrow engine) to write out parquet files, and those
> outputs are consumed by other applications such as Java apps using
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty
> dataframes get an incorrect schema for string columns in those other
> applications. After some investigation, we narrowed the issue down to the
> schema inference by pyarrow:
> {code:java}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
> Out[4]:
> a: string
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
> Out[5]:
> a: null
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
> Out[6]: '5.0.0'
> {code}
> As you can see, column 'a', which should be string type, is inferred as
> null type and is converted to int32 while writing to parquet files.
> Is this expected behavior? Or is there any workaround for this issue?
> Could anyone take a look please. Thanks!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)