[jira] [Commented] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.

2022-02-07 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488402#comment-17488402
 ] 

Alenka Frim commented on ARROW-14488:
-

Thank you Joris!

An example would be:
{code:python}
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> import pyarrow as pa
>>> 
>>> schema = pa.schema([
...     pa.field('a', pa.string()),
...     pa.field('b', pa.int64()),
...     pa.field('c', pa.float64())])
>>> 
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
a: string
b: int64
c: double

a: [["a"]]
b: [[1]]
c: [[1]]
>>> pa.Table.from_pandas(df.head(0), schema=schema)
pyarrow.Table
a: string
b: int64
c: double

a: [[]]
b: [[]]
c: [[]]
{code}

> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> ---
>
> Key: ARROW-14488
> URL: https://issues.apache.org/jira/browse/ARROW-14488
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Environment: OS: Windows 10, CentOS 7
> Reporter: Yuan Zhou
> Priority: Major
>
> We use pandas (with the pyarrow engine) to write out parquet files, and those 
> outputs are consumed by other applications such as Java apps using 
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty 
> dataframes would get an incorrect schema for string columns in other 
> applications. After some investigation, we narrowed the issue down to the 
> schema inference by pyarrow:
> {code:python}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
>  Out[4]:
>  a: string
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
>  Out[5]:
>  a: null
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
>  Out[6]: '5.0.0'
> {code}
>  As you can see, column 'a', which should be string type, is inferred as 
> null type and is converted to int32 when written to parquet files.
> Is this expected behavior? Is there a workaround for this issue? 
> Could anyone take a look, please? Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.

2022-02-07 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488237#comment-17488237
 ] 

Joris Van den Bossche commented on ARROW-14488:
---

bq. the conversion from empty Pandas series to pa.array is wrong in the case of 
a string dtype.

The main problem is that the example code is not using a "string dtype". By 
default, pandas uses the generic "object" dtype to store strings. But this data 
type basically means that the column can hold _any_ Python object, so the values 
are not guaranteed to be strings (e.g. they could also be decimals, bytes, ..., 
or other Python types that pyarrow can infer). 

As long as the array is not empty, the conversion to a pyarrow array will try 
to infer the appropriate type from the values in the input array (e.g. for 
an object dtype array containing strings, it will indeed convert that to a 
{{pa.string()}} type). But if the array is empty, there are no values to infer 
the type from. That is why pyarrow defaults to the generic 
"null" data type for such an array (or column in a DataFrame).

If you know that a certain column holds strings, and you want the 
pandas->pyarrow conversion to work robustly (regardless of empty 
dataframes/arrays), the {{from_pandas}} method has a {{schema}} argument. 
This way you can specify the schema to use, and pyarrow will not try to infer 
the types from the values in the array. In this case you will have to construct 
the schema manually, though. 





[jira] [Commented] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.

2022-02-06 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487900#comment-17487900
 ] 

Alenka Frim commented on ARROW-14488:
-

Hi [~zijie0], thank you for reporting, and sorry for the late reply.

I think this may be a bug on the Arrow side: when constructing metadata in 
_dataframe_to_types_ ({_}pandas_compat.py{_}), the conversion from an empty pandas 
Series to pa.array is wrong in the case of a string dtype. Here is an example:
{code:python}
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> df
   a  b    c
0  a  1  1.0
>>> df["a"]
0    a
Name: a, dtype: object

# Non-empty dataframe
>>> pa.array(df["a"], from_pandas=True) # Works for non-empty dataframe

[
  "a"
]
>>> pa.array(df["a"], from_pandas=True).type
DataType(string)

# Empty dataframe
>>> pa.array(df["a"].head(0), from_pandas=True)  # Becomes a NullArray with no dtype in case of string/object

0 nulls
>>> pa.array(df["a"].head(0), from_pandas=True).type
DataType(null)
{code}
but that doesn't happen for integer or double:
{code:python}
>>> df["b"]
0    1
Name: b, dtype: int64

>>> pa.array(df["b"], from_pandas=True)

[
  1
]

>>> pa.array(df["b"].head(0), from_pandas=True)

[]
>>> pa.array(df["b"].head(0), from_pandas=True).type
DataType(int64)
{code}
[~jorisvandenbossche] what do you think?
