brett-patterson-ent opened a new issue, #49392:
URL: https://github.com/apache/arrow/issues/49392

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We originally saw this issue when reading Parquet files into pandas
   DataFrames via PyArrow: a column holding a list of floats came back with
   corrupted values when a filter was applied at read time. We've narrowed it
   down to the example below, which involves no I/O and no pandas. In its
   output, `slice` returns the expected `numbers` list for the last row (first
   element 499999), while `filter` returns the correct `id` and `text` but a
   different `numbers` list. The error only happens with large values of `N`.
   Here are the `N` values that I tested:
   * 1,000 - ok
   * 10,000 - ok
   * 100,000 - ok
   * 250,000 - ok
   * 500,000 - fail
   * 1,000,000 - fail
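
A rough size estimate may help with triage. The sketch below only does arithmetic on the `N` values listed above, assuming `ARRAY_LEN = 2000` and 8-byte doubles as in the script; the helper name `values_buffer_bytes` is made up for illustration. It compares the flattened `numbers` data against a few power-of-two limits. This is an observation about sizes, not a confirmed root cause.

```python
ARRAY_LEN = 2000       # list length per row, as in the script
BYTES_PER_DOUBLE = 8   # float64

# N values and outcomes exactly as reported in the list above
reported = {
    1_000: "ok",
    10_000: "ok",
    100_000: "ok",
    250_000: "ok",
    500_000: "fail",
    1_000_000: "fail",
}


def values_buffer_bytes(n: int) -> int:
    """Approximate byte size of the flattened float64 values for n rows."""
    return n * ARRAY_LEN * BYTES_PER_DOUBLE


for n, outcome in reported.items():
    elements = n * ARRAY_LEN
    size = values_buffer_bytes(n)
    print(
        f"N={n:>9,} elements={elements:>13,} bytes={size:>14,} "
        f"elements>=2**31: {elements >= 2**31} "
        f"bytes>=2**32: {size >= 2**32} reported: {outcome}"
    )
```

Of the limits checked, only the 2**32-byte mark falls between the last passing `N` (250,000, about 4.0e9 bytes) and the first failing one (500,000, 8.0e9 bytes); the element count stays below 2**31 for every tested `N`. That may be a coincidence, so it is best treated as a hint for where to look.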
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.compute as pc
   
   N = 500_000
   ARRAY_LEN = 2000
   
   ids = np.arange(N)
   texts = [f"Row {i} with data" for i in range(N)]
   
   rng = np.random.default_rng(42)
   matrix = rng.random((N, ARRAY_LEN))
   matrix[:, 0] = ids
   numbers = [matrix[i] for i in range(N)]
   
   tbl = pa.table({"id": ids, "text": texts, "numbers": numbers})
   
   print("PYARROW VERSION:", pa.__version__)
   print()
   
   print("ORIGINAL DATA")
   print(ids[N - 1])
   print(numbers[N - 1].tolist()[:5])
   print()
   
   print("SLICED DATA")
   print(tbl.slice(N - 1, 1))
   print()
   
   print("FILTERED DATA")
   print(tbl.filter(pc.field("id") == N - 1))
   ```
   
   Output (generated on Ubuntu 22.04 x86_64 with `pyarrow==23.0.1`):
   ```
   PYARROW VERSION: 23.0.1
   
   ORIGINAL DATA
   499999
   [499999.0, 0.2806802660498191, 0.18948458094650322, 0.6611584406407851, 0.340530752637791]
   
   SLICED DATA
   pyarrow.Table
   id: int64
   text: string
   numbers: list<item: double>
     child 0, item: double
   ----
   id: [[499999]]
   text: [["Row 499999 with data"]]
   numbers: [[[499999,0.2806802660498191,0.18948458094650322,0.6611584406407851,0.340530752637791,...,0.19918275933231844,0.42906946186903017,0.49644347191463034,0.3171420306034032,0.13584405454197468]]]
   
   FILTERED DATA
   pyarrow.Table
   id: int64
   text: string
   numbers: list<item: double>
     child 0, item: double
   ----
   id: [[],[],...,[],[499999]]
   text: [[],[],...,[],["Row 499999 with data"]]
   numbers: [[],[],...,[],[[0.31442923271553835,0.6938060356899268,0.6428265846122176,0.45896565050138827,0.5739393526702229,...,0.13894123671983727,0.47783950795209007,0.7710005399634996,0.6678959811701984,0.7366509797101941]]]
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
