rouault opened a new pull request, #41366:
URL: https://github.com/apache/arrow/pull/41366

   ### Rationale for this change
   
   Fixes the crash in TableBatchReader::ReadNext() on a corrupted Parquet 
file, as detailed in #41317.
   
   ### What changes are included in this PR?
   
   Add validation of the chunk index requested in column_data_[i]->chunk() 
and return an error if it is out of bounds.
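
   For readers unfamiliar with the pattern, here is a minimal, self-contained 
sketch of a bounds-checked chunk lookup. It is not the actual patch and does 
not use Arrow's types: `SketchStatus`, `SketchColumn` and `GetChunkChecked` 
are hypothetical names invented for illustration. The error text simply 
mirrors the message shown in the traceback in this description.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <string>
   #include <vector>

   // Hypothetical stand-in for a status type; this is NOT arrow::Status.
   struct SketchStatus {
     bool ok;
     std::string message;
   };

   // Hypothetical stand-in for a chunked column: a list of chunks of values.
   struct SketchColumn {
     std::vector<std::vector<int>> chunks;
   };

   // Hand out chunk `index` only after validating it. On failure, report an
   // error instead of indexing past the end of `chunks`, which would be
   // undefined behavior.
   SketchStatus GetChunkChecked(const SketchColumn& column, std::size_t index,
                                const std::vector<int>** out) {
     assert(out != nullptr);  // a null output pointer is a caller bug
     if (index >= column.chunks.size()) {
       *out = nullptr;
       return {false, "columns do not have the same size"};
     }
     *out = &column.chunks[index];
     return {true, ""};
   }
   ```

   With a two-chunk column, asking for chunk 1 succeeds, while asking for 
chunk 2 returns a non-ok status and leaves the output pointer null rather 
than reading out of range.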
   
   ### Are these changes tested?
   
   I've tested on the reproducer I provided in #41317 that it now triggers a 
clean error:
   ```
   Traceback (most recent call last):
     File "test.py", line 3, in <module>
       [_ for _ in parquet_file.iter_batches()]
     File "test.py", line 3, in <listcomp>
       [_ for _ in parquet_file.iter_batches()]
     File "pyarrow/_parquet.pyx", line 1587, in iter_batches
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: columns do not have the same size
   ```
   I'm not sure whether or how unit tests for corrupted datasets should be added.
   
   ### Are there any user-facing changes?
   
   No
   
   **This PR contains a "Critical Fix".**
   

