brett-patterson-ent opened a new issue, #49392:
URL: https://github.com/apache/arrow/issues/49392
### Describe the bug, including details regarding any error messages, version, and platform.
We originally saw this issue when reading Parquet files into Pandas
DataFrames via PyArrow: a column containing arrays of floats had corrupted
values when a filter was applied at read time. We've narrowed it down to the
example below (no I/O or Pandas involved), where `tbl.filter` returns different
`numbers` values than `tbl.slice` for the same row. Note that the error shown
by the script only happens with large values of `N`. Here are the `N` values
that I tested:
* 1,000 - ok
* 10,000 - ok
* 100,000 - ok
* 250,000 - ok
* 500,000 - fail
* 1,000,000 - fail
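To help with triage, the tested sizes can be set against the common 32-bit limits. The sketch below is only an observation about sizes, not a confirmed diagnosis. It assumes the `numbers` values are stored as 8-byte doubles with `ARRAY_LEN = 2000` entries per row, as in the script; the helper name `footprint` and the two limit constants are made up for illustration.

```python
# Rough size arithmetic for the "numbers" column at each tested N.
# Assumption: each value is an 8-byte double and each row holds ARRAY_LEN values.
ARRAY_LEN = 2000
BYTES_PER_DOUBLE = 8
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit offset can hold
UINT32_RANGE = 2**32    # number of bytes addressable with an unsigned 32-bit value


def footprint(n):
    """Return (element_count, byte_count) for n rows of ARRAY_LEN doubles."""
    elements = n * ARRAY_LEN
    return elements, elements * BYTES_PER_DOUBLE


for n in (1_000, 10_000, 100_000, 250_000, 500_000, 1_000_000):
    elements, nbytes = footprint(n)
    print(
        f"N={n:>9,}  elements={elements:>13,}  bytes={nbytes:>14,}  "
        f"elements<=int32_max={elements <= INT32_MAX}  "
        f"bytes<2**32={nbytes < UINT32_RANGE}"
    )
```

Under these assumptions the element count stays within the int32 range for every tested `N`, while the byte size of the values first exceeds 2**32 somewhere between `N = 250,000` and `N = 500,000`, which is also where the failures start. That may be a coincidence.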
```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
N = 500_000
ARRAY_LEN = 2000
ids = np.arange(N)
texts = [f"Row {i} with data" for i in range(N)]
rng = np.random.default_rng(42)
matrix = rng.random((N, ARRAY_LEN))
matrix[:, 0] = ids
numbers = [matrix[i] for i in range(N)]
tbl = pa.table({"id": ids, "text": texts, "numbers": numbers})
print("PYARROW VERSION:", pa.__version__)
print()
print("ORIGINAL DATA")
print(ids[N - 1])
print(numbers[N - 1].tolist()[:5])
print()
print("SLICED DATA")
print(tbl.slice(N - 1, 1))
print()
print("FILTERED DATA")
print(tbl.filter(pc.field("id") == N - 1))
```
Output (generated on Ubuntu 22.04 x86_64 with `pyarrow==23.0.1`):
```
PYARROW VERSION: 23.0.1
ORIGINAL DATA
499999
[499999.0, 0.2806802660498191, 0.18948458094650322, 0.6611584406407851, 0.340530752637791]
SLICED DATA
pyarrow.Table
id: int64
text: string
numbers: list<item: double>
child 0, item: double
----
id: [[499999]]
text: [["Row 499999 with data"]]
numbers:
[[[499999,0.2806802660498191,0.18948458094650322,0.6611584406407851,0.340530752637791,...,0.19918275933231844,0.42906946186903017,0.49644347191463034,0.3171420306034032,0.13584405454197468]]]
FILTERED DATA
pyarrow.Table
id: int64
text: string
numbers: list<item: double>
child 0, item: double
----
id: [[],[],...,[],[499999]]
text: [[],[],...,[],["Row 499999 with data"]]
numbers:
[[],[],...,[],[[0.31442923271553835,0.6938060356899268,0.6428265846122176,0.45896565050138827,0.5739393526702229,...,0.13894123671983727,0.47783950795209007,0.7710005399634996,0.6678959811701984,0.7366509797101941]]]
```
### Component(s)
Python