bvolpato-dd opened a new issue, #9389:
URL: https://github.com/apache/arrow-rs/issues/9389

   **Describe the bug**
   
   `BatchCoalescer::push_batch` uses a hard `assert_eq!` at 
[coalesce.rs:462](https://github.com/apache/arrow-rs/blob/57.2.0/arrow-select/src/coalesce.rs#L462)
 to check the incoming batch column count against the coalescer's schema:
   
   ```rust
   assert_eq!(arrays.len(), self.in_progress_arrays.len());
   ```
   
   When a batch with a mismatched number of columns is pushed, this panics the 
process instead of returning an `ArrowError`. The function signature is already 
`Result<(), ArrowError>`, so it could just return an error here.
   
   This is inconsistent with the rest of the arrow API — `RecordBatch::try_new` 
returns `Err(ArrowError::InvalidArgumentError)` for the same kind of column 
count mismatch, and other checks in the same struct (lines 475, 533, 534) use 
`debug_assert!` rather than a hard assert.
   
   We hit this in production in a DataFusion-based query service. A connector 
returned batches whose schema didn't match the plan's output schema, and 
instead of getting a query error we got a process-wide panic through the 
repartition operator:
   
   ```
   assertion `left == right` failed
     left: 2
    right: 0
   ```
   
   **To Reproduce**
   
   ```rust
   use std::sync::Arc;
   use arrow_array::{RecordBatch, UInt32Array};
   use arrow_select::coalesce::BatchCoalescer;
   use arrow_schema::{DataType, Field, Schema};
   
   #[test]
   #[should_panic(expected = "assertion `left == right` failed")]
   fn push_batch_panics_on_schema_mismatch() {
       let empty_schema = Arc::new(Schema::empty());
       let mut coalescer = BatchCoalescer::new(empty_schema, 100);
   
       let schema = Arc::new(Schema::new(vec![
           Field::new("c0", DataType::UInt32, false),
           Field::new("c1", DataType::UInt32, false),
       ]));
       let batch = RecordBatch::try_new(
           schema,
           vec![
               Arc::new(UInt32Array::from(vec![1, 2, 3])),
               Arc::new(UInt32Array::from(vec![4, 5, 6])),
           ],
       )
       .unwrap();
   
       // panics instead of returning Err
       let _ = coalescer.push_batch(batch);
   }
   ```
   
   Also reproduces in the reverse direction (coalescer has 2 fields, batch has 
1).
   
   **Expected behavior**
   
   `push_batch` should return `Err(ArrowError::InvalidArgumentError(...))` so 
the caller can handle it gracefully, rather than bringing down the process.
   
   Something like:
   
   ```rust
   if arrays.len() != self.in_progress_arrays.len() {
       return Err(ArrowError::InvalidArgumentError(format!(
           "Batch has {} columns but BatchCoalescer expects {}",
           arrays.len(),
           self.in_progress_arrays.len()
       )));
   }
   ```
   
   **Additional context**
   
   Affects arrow-select 57.2.0 and 57.3.0 (same code in both). Happy to open a 
PR for this if it makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to