pchintar opened a new issue, #9950:
URL: https://github.com/apache/arrow-rs/issues/9950
### Description
Currently, when reading IPC data with column projection enabled, duplicate
projection indices can produce an invalid `RecordBatch`.
---
### Root Cause
In `arrow-ipc/src/reader.rs`, projected columns are matched using:
```rust
projection.iter().position(|p| p == &idx)
```
However, `position()` returns only the first matching entry, so a column index that appears more than once in `projection` fills just one output slot.
For example, with:
```rust
let projection = vec![1, 1];
```
Only a single column is decoded even though the projected schema contains
two fields.
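The mismatch can be reproduced with plain iterators, independent of Arrow. The sketch below is a hypothetical stand-in for the reader's loop over physical fields (the name `matched_slots` is illustrative and does not come from `arrow-ipc`): for each physical column index it asks `position()` which projection slot that column fills.

```rust
/// Hypothetical stand-in for the reader's per-field lookup: for every physical
/// column index, return the single `(slot, idx)` pair that `position()` finds.
fn matched_slots(projection: &[usize], num_fields: usize) -> Vec<(usize, usize)> {
    (0..num_fields)
        .filter_map(|idx| {
            // `position()` stops at the first match, so later duplicates of
            // `idx` in `projection` are never visited.
            projection
                .iter()
                .position(|p| p == &idx)
                .map(|slot| (slot, idx))
        })
        .collect()
}
```

Calling `matched_slots(&[1, 1], 3)` yields a single pair, `(0, 1)`, while the projected schema expects two columns. That gap is the source of the column-count mismatch.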
`Schema::project` and `RecordBatch::project` both allow duplicate projection
indices, so the IPC reader behavior becomes inconsistent with the rest of Arrow.
---
### Impact
Duplicate projection indices can lead to:
* invalid `RecordBatch` construction
* runtime errors due to schema/column count mismatch
This occurs when:
* projection contains duplicate indices
* reading IPC data through `FileReader` or `StreamReader`
---
### Reproduction
A minimal example:
```rust
use arrow_ipc::reader::FileReader;

// `buf` holds a serialized IPC file with at least two columns.
let projection = vec![1, 1];
let reader =
    FileReader::try_new(std::io::Cursor::new(buf), Some(projection))?;
```
Before the fix, this fails with:
```text
InvalidArgumentError(
"number of columns(1) must match number of fields(2) in schema"
)
```
---
### Proposed Fix
Update projection handling to preserve all matching projection entries while
decoding each physical field only once.
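One way to picture this approach is the minimal sketch below. It is not the actual patch: `project_with_duplicates` and the `decode` callback are hypothetical names, and the generic `T` stands in for whatever column type the reader produces. Each physical column is decoded at most once, and the result is cloned into every projection slot that refers to it.

```rust
/// Sketch only: decode each physical column at most once, then fill every
/// projection slot that refers to it. `decode` is a hypothetical stand-in
/// for the real per-column decoder.
fn project_with_duplicates<T: Clone>(
    projection: &[usize],
    num_fields: usize,
    mut decode: impl FnMut(usize) -> T,
) -> Option<Vec<T>> {
    let mut slots: Vec<Option<T>> = vec![None; projection.len()];
    for idx in 0..num_fields {
        // Collect every slot that asks for this column, not just the first.
        let wanted: Vec<usize> = projection
            .iter()
            .enumerate()
            .filter(|(_, p)| **p == idx)
            .map(|(slot, _)| slot)
            .collect();
        if wanted.is_empty() {
            // Unprojected column: nothing to decode in this sketch.
            continue;
        }
        let column = decode(idx);
        for slot in wanted {
            slots[slot] = Some(column.clone());
        }
    }
    // `None` if any projection index did not match a physical column.
    slots.into_iter().collect()
}
```

With `projection = [1, 1]` and three physical fields, `decode` runs once (for column 1) and the output holds two entries. If `T` is a reference-counted handle such as Arrow's `ArrayRef`, the clone only bumps a reference count rather than copying data.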