ShreyeshArangath opened a new pull request, #49289:
URL: https://github.com/apache/arrow/pull/49289
### Rationale for this change
The ORC dataset integration currently lacks stripe-level subsetting support.
When scanning ORC files through the Dataset API, there is no way to select
specific stripes, so the entire file is always read. This is a gap compared to
`ParquetFileFragment`, which provides row-group-level subsetting via `Subset()`,
`row_groups()`, and `MakeFragment(..., row_groups)`.
### What changes are included in this PR?
Modeled after the `ParquetFileFragment` design, we introduce stripe-aware ORC
fragments so callers can target specific stripes during planning and scanning
(instead of always reading the full file). This adds a small, consistent
API surface in C++ (Python support is tracked in a separate issue):
- An ORC-specific fragment type that can represent either the full file or a
  subset of the file defined by stripe IDs.
- Fragment subsetting via a `subset(...)`/`Subset(...)` API, analogous to Parquet
  row-group subsetting.
- Scan behavior that honors stripe selection, so execution reads only the
  requested stripes.
- Correct row counting for subset fragments, where row counts reflect only the
  selected stripes.
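The semantics listed above can be illustrated with a minimal, self-contained sketch. It is a toy model and not the code in this PR: `ToyOrcFile`, `ToyStripeFragment`, and `RunToyExample` are hypothetical names, the file is reduced to a list of per-stripe row counts, and errors are reported with exceptions rather than Arrow's `Result`/`Status` types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Toy stand-in for an ORC file: only the per-stripe row counts are modeled.
struct ToyOrcFile {
  std::vector<int64_t> stripe_row_counts;
};

// Hypothetical stripe-aware fragment (assumption: NOT the Arrow class).
class ToyStripeFragment {
 public:
  // An absent stripe list means "the whole file", the default behavior.
  explicit ToyStripeFragment(
      ToyOrcFile file,
      std::optional<std::vector<int>> stripe_ids = std::nullopt)
      : file_(std::move(file)), stripe_ids_(std::move(stripe_ids)) {
    if (stripe_ids_) {
      for (int id : *stripe_ids_) Validate(id);
    }
  }

  // Stripes this fragment would scan, in selection order.
  std::vector<int> stripe_ids() const {
    if (stripe_ids_) return *stripe_ids_;
    std::vector<int> all(file_.stripe_row_counts.size());
    std::iota(all.begin(), all.end(), 0);
    return all;
  }

  // New fragment over the same file, restricted to the given stripes.
  ToyStripeFragment Subset(const std::vector<int>& wanted) const {
    return ToyStripeFragment(file_, wanted);
  }

  // Row count reflects only the selected stripes.
  int64_t CountRows() const {
    int64_t total = 0;
    for (int id : stripe_ids()) total += file_.stripe_row_counts[id];
    return total;
  }

 private:
  // Reject stripe IDs that do not exist in the file.
  void Validate(int id) const {
    if (id < 0 ||
        static_cast<std::size_t>(id) >= file_.stripe_row_counts.size()) {
      throw std::out_of_range("stripe id out of range");
    }
  }

  ToyOrcFile file_;
  std::optional<std::vector<int>> stripe_ids_;
};

// Small usage check: a three-stripe file, then a subset of stripes 0 and 2.
inline void RunToyExample() {
  ToyOrcFile file{{100, 250, 75}};
  ToyStripeFragment full(file);
  assert(full.CountRows() == 425);  // all three stripes

  ToyStripeFragment subset = full.Subset({0, 2});
  assert(subset.CountRows() == 175);  // stripes 0 and 2 only
}
```

Keeping the stripe list optional, rather than treating an empty list as "everything", lets the toy distinguish a full-file fragment from a fragment that selects no stripes at all. Whether the PR makes the same choice is not stated in the description.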
### Are these changes tested?
- Unit tested
### Are there any user-facing changes?
The C++ API gains the following additions:
- `OrcFileFragment` class with `stripe_ids()` and `Subset()` methods
- `OrcFileFormat::MakeFragment(source, partition_expression, physical_schema, stripe_ids)` overload

There are no breaking changes: existing ORC scanning behavior is
unchanged when no stripe IDs are specified.