ShreyeshArangath opened a new pull request, #49289:
URL: https://github.com/apache/arrow/pull/49289

   ### Rationale for this change
   The ORC dataset integration currently lacks stripe-level subsetting support. 
When scanning ORC files through the Dataset API, there is no way to select 
specific stripes; the entire file is always read. This is a gap compared to 
`ParquetFileFragment`, which provides row-group-level subsetting via `Subset()`, 
`row_groups()`, and `MakeFragment(..., row_groups)`.
   
   ### What changes are included in this PR?
   Modeled after the `ParquetFileFragment` design, we introduce stripe-aware ORC 
fragments so callers can target specific stripes during planning and scanning 
(instead of always reading the full file). This adds a small, consistent 
surface area in C++ (Python is covered by a separate issue):
   
   - An ORC-specific fragment type that can represent either the full file or a 
subset of the file defined by stripe IDs.
   - Fragment subsetting via a `subset(...)`/`Subset(...)` API, analogous to Parquet 
row-group subsetting.
   - Scan behavior that honors stripe selection, so execution reads only the 
requested stripes.
   - Correct row counting for subset fragments, where row counts reflect only the 
selected stripes.
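
To illustrate the intended semantics of the points above, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Arrow: `StripeInfo` and `ToyOrcFragment` are hypothetical stand-ins, and only the names `stripe_ids()` and `Subset()` mirror the API described here. It assumes each stripe's row count is already known (for example, from file metadata).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-stripe metadata of an ORC file.
struct StripeInfo {
  int64_t num_rows;
};

// Hypothetical stand-in for a stripe-aware fragment (not the Arrow class).
class ToyOrcFragment {
 public:
  // Full-file fragment: every stripe is selected.
  explicit ToyOrcFragment(std::vector<StripeInfo> stripes)
      : stripes_(std::move(stripes)) {
    for (std::size_t i = 0; i < stripes_.size(); ++i) {
      stripe_ids_.push_back(static_cast<int>(i));
    }
  }

  // Subset fragment: only the given stripe IDs are selected.
  ToyOrcFragment(std::vector<StripeInfo> stripes, std::vector<int> stripe_ids)
      : stripes_(std::move(stripes)), stripe_ids_(std::move(stripe_ids)) {
    for (int id : stripe_ids_) {
      if (id < 0 || static_cast<std::size_t>(id) >= stripes_.size()) {
        throw std::out_of_range("stripe ID out of range");
      }
    }
  }

  // Stripe IDs this fragment covers.
  const std::vector<int>& stripe_ids() const { return stripe_ids_; }

  // New fragment over the same file, restricted to the given stripes.
  ToyOrcFragment Subset(std::vector<int> stripe_ids) const {
    return ToyOrcFragment(stripes_, std::move(stripe_ids));
  }

  // Row count reflects only the selected stripes.
  int64_t CountRows() const {
    int64_t total = 0;
    for (int id : stripe_ids_) {
      // IDs were validated in the constructor; this documents the invariant.
      assert(id >= 0 && static_cast<std::size_t>(id) < stripes_.size());
      total += stripes_[static_cast<std::size_t>(id)].num_rows;
    }
    return total;
  }

 private:
  std::vector<StripeInfo> stripes_;
  std::vector<int> stripe_ids_;
};
```

In this toy model, a file with stripes of 100, 250, and 50 rows counts 400 rows as a full fragment, while `Subset({0, 2})` yields a fragment that counts 150 rows.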
   
   ### Are these changes tested?
   - Yes, with unit tests.
   
   ### Are there any user-facing changes?
   The C++ API has the following changes:
     - `OrcFileFragment` class with `stripe_ids()` and `Subset()` methods
     - `OrcFileFormat::MakeFragment(source, partition_expression, 
physical_schema, stripe_ids)` overload
    
   There are no breaking changes: existing ORC scanning behavior is 
unchanged when no stripe IDs are specified.

