AlenkaF opened a new issue, #36202:
URL: https://github.com/apache/arrow/issues/36202
### Describe the bug, including details regarding any error messages, version, and platform.
There are two cases where the array constructor, when it takes the Python
conversion path (`python_to_arrow.cc`), doesn't handle the `safe` keyword properly.
### The safe keyword is set to True by default and is ignored if passed
- One example where a Python list takes the `python_to_arrow.cc` code path is
Decimals (which are Python objects). Here the conversion from Decimal to int
raises on overflow by default, and one can’t turn that check off with `safe=False`:
```python
>>> from decimal import Decimal
>>> import pyarrow as pa
>>> pa.array([Decimal('1234')]).cast(pa.int8(), safe=False)
<pyarrow.lib.Int8Array object at 0x7efbbd18e4b0>
[
-46
]
>>> pa.array([Decimal('1234')], pa.int8(), safe=False)
Traceback (most recent call last):
...
ArrowInvalid: Value 1234 too large to fit in C integer type
```
- Another example is JSON data with nested data (list type). The conversion
takes the `python_to_arrow.cc` code path, where the `safe` keyword is also
ignored. See the example in https://github.com/apache/arrow/issues/31402.
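To make the `-46` in the Decimal example concrete, here is a minimal pure-Python sketch of the wraparound that an unsafe cast to int8 produces. The helper `wrap_to_int8` is hypothetical and only illustrates the arithmetic; it is not how pyarrow implements the cast.

```python
from decimal import Decimal


def wrap_to_int8(value):
    """Hypothetical helper: keep the low 8 bits and reinterpret them as signed."""
    low_byte = int(value) & 0xFF  # 1234 -> 210
    # Values 128..255 represent the negative half of the int8 range.
    return low_byte - 256 if low_byte >= 128 else low_byte


print(wrap_to_int8(Decimal("1234")))  # -46
```

This matches the result of `.cast(pa.int8(), safe=False)` above, which is what one would expect `pa.array(..., pa.int8(), safe=False)` to return as well.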
### In some cases the safe keyword is ignored and unsafe conversions are performed
- Nested case
```python
>>> import numpy as np
>>> import pyarrow as pa
>>> pa.array(np.array([[1.5], [2.5, 3.5]], dtype=object),
...          type=pa.list_(pa.int64()), safe=True)
<pyarrow.lib.ListArray object at 0x7f004fc74700>
[
[
1
],
[
2,
3
]
]
```
- Primitive array case. A NumPy array and a Python list take different code
paths (`numpy_to_arrow.cc` vs `python_to_arrow.cc`):
```python
>>> import numpy as np
>>> import pyarrow as pa
>>> pa.array(np.array([1.5, 2.5]), type=pa.int64(), safe=True)
...
ArrowInvalid: Float value 1.5 was truncated converting to int64
```
vs
```python
>>> import pyarrow as pa
>>> pa.array([1.5, 2.5], type=pa.int64(), safe=True)
<pyarrow.lib.Int64Array object at 0x7f004fc72c40>
[
1,
2
]
```
- Another example of incorrect handling of the `safe` keyword is nested
data in pandas (an object column, thus taking the `python_to_arrow.cc` code path):
```python
>>> import pandas as pd
>>> import pyarrow as pa
>>> int_dataframe = pd.DataFrame({"array": [[1, 2]]})
>>> float_dataframe = pd.DataFrame({"array": [[1.5, 2.3]]})
>>> int_table = pa.Table.from_pandas(int_dataframe)
>>> table = pa.Table.from_pandas(float_dataframe, schema=int_table.schema)
>>> table
pyarrow.Table
array: list<item: int64>
child 0, item: int64
----
array: [[[1,2]]]
```
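For reference, the behavior I would expect from `safe=True` in the float-to-int cases above can be sketched in plain Python. The function `to_int_checked` is a hypothetical stand-in, not a pyarrow API, and it assumes finite float inputs in (possibly nested) lists.

```python
def to_int_checked(values, safe=True):
    """Hypothetical sketch: convert (nested) lists of floats to ints.

    With safe=True, raise instead of silently dropping a fractional part.
    With safe=False, truncate toward zero.
    """
    result = []
    for v in values:
        if isinstance(v, list):
            # Recurse so nested lists get the same check as flat ones.
            result.append(to_int_checked(v, safe=safe))
        elif safe and v != int(v):
            raise ValueError(
                f"Float value {v} was truncated converting to int64"
            )
        else:
            result.append(int(v))
    return result
```

With this sketch, `to_int_checked([[1.5], [2.5, 3.5]])` raises, while passing `safe=False` returns `[[1], [2, 3]]`. The examples above show the Python conversion path producing the truncated result even with `safe=True`.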
cc @jorisvandenbossche @dane
### Component(s)
Python