[jira] [Updated] (HBASE-29863) Add API to KeyValueScanner to retrieve the set of StoreFiles accessed during a scan

Himanshu Gwalani (Jira) Fri, 30 Jan 2026 05:03:10 -0800


     [ https://issues.apache.org/jira/browse/HBASE-29863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Himanshu Gwalani updated HBASE-29863:
-------------------------------------
    Description: 
*Goal:* Introduce a mechanism to track and expose the specific HFiles involved 
in a scan operation.

*Use-case:* This enables client-side validation that the right set of files 
was scanned (when a source of truth is available, for example the snapshot 
data manifest during snapshot-based scans), debugging of performance issues, 
and analysis of data access patterns.
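
The client-side validation described above comes down to a set comparison 
between the files a scan reports and the files a trusted manifest lists. A 
minimal sketch, assuming both sides are already available as sets of path 
strings; the class {{ScanFileValidation}} and its method names are 
hypothetical and not part of HBase:

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: compares the files a scan touched with an expected manifest. */
final class ScanFileValidation {

    /** Files listed in the manifest that the scan never opened. */
    static Set<String> missing(Set<String> expected, Set<String> scanned) {
        Set<String> diff = new TreeSet<>(expected);
        diff.removeAll(scanned);
        return diff;
    }

    /** Files the scan opened that the manifest does not list. */
    static Set<String> unexpected(Set<String> expected, Set<String> scanned) {
        Set<String> diff = new TreeSet<>(scanned);
        diff.removeAll(expected);
        return diff;
    }

    public static void main(String[] args) {
        // Illustrative paths only; real values would come from the manifest and the scan.
        Set<String> expected = Set.of("cf/hfile-a", "cf/hfile-b");
        Set<String> scanned = Set.of("cf/hfile-a", "cf/hfile-c");
        System.out.println("missing: " + missing(expected, scanned));
        System.out.println("unexpected: " + unexpected(expected, scanned));
    }
}
```

A scan would pass validation when both differences are empty.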

*Proposed API:* Add {{Set<Path> getScannerInitializedFiles()}} to the 
{{KeyValueScanner}} interface.

*Implementation Details*
 * *Capturing the list of files when the scanner is initialized.*
 ** Leaf Scanners
 *** StoreFileScanner: Returns a singleton set containing the path of the 
associated {{HFile}}.
 *** SnapshotSegmentScanner / CollectionBackedScanner / SegmentScanner: Return 
an empty set.
 ** Composite Scanners
 *** StoreScanner & ReversedStoreScanner: Aggregate files from all active 
{{StoreFileScanners}}.
 *** KeyValueHeap & ReversedKeyValueHeap: Aggregate files from their internal 
priority queue of scanners.
 ** Abstract Scanners
 *** NonLazyKeyValueScanner / NonReversedNonLazyKeyValueScanner: Return an 
empty set.
 * *Exposing via RegionScanner & TableSnapshotRecordReader*
 ** RegionScanner: Aggregates files from all underlying StoreScanners.
 ** TableSnapshotRecordReader: Proxies the call through ClientSideRegionScanner 
so that MapReduce jobs can access the file set during snapshot-based scans.
 ** Note: Also 
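
The leaf/composite split above can be sketched on a simplified stand-in 
hierarchy. This is an illustration under stated assumptions, not the actual 
patch: {{FileTrackingScanner}}, {{FileBackedScanner}}, {{MemoryBackedScanner}} 
and {{CompositeScanner}} are hypothetical names, and {{java.nio.file.Path}} 
stands in for Hadoop's {{Path}} so that the example compiles on its own.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for KeyValueScanner; only the proposed method is modeled. */
interface FileTrackingScanner {
    Set<Path> getScannerInitializedFiles();
}

/** Leaf scanner over a single file, in the spirit of StoreFileScanner. */
final class FileBackedScanner implements FileTrackingScanner {
    private final Path file;

    FileBackedScanner(Path file) {
        this.file = file;
    }

    @Override
    public Set<Path> getScannerInitializedFiles() {
        return Collections.singleton(file);
    }
}

/** Leaf scanner over in-memory data, in the spirit of SegmentScanner. */
final class MemoryBackedScanner implements FileTrackingScanner {
    @Override
    public Set<Path> getScannerInitializedFiles() {
        return Collections.emptySet();
    }
}

/** Composite scanner that unions the files reported by its children. */
final class CompositeScanner implements FileTrackingScanner {
    private final List<FileTrackingScanner> children;

    CompositeScanner(List<FileTrackingScanner> children) {
        this.children = List.copyOf(children);
    }

    @Override
    public Set<Path> getScannerInitializedFiles() {
        Set<Path> files = new HashSet<>();
        for (FileTrackingScanner child : children) {
            files.addAll(child.getScannerInitializedFiles());
        }
        return Collections.unmodifiableSet(files);
    }
}

final class ScannerFilesDemo {
    public static void main(String[] args) {
        // Two store-level composites, each mixing file-backed and in-memory leaves.
        FileTrackingScanner fileA = new FileBackedScanner(Paths.get("cf1/hfile-a"));
        FileTrackingScanner fileB = new FileBackedScanner(Paths.get("cf2/hfile-b"));
        FileTrackingScanner memory = new MemoryBackedScanner();
        FileTrackingScanner store1 = new CompositeScanner(List.of(fileA, memory));
        FileTrackingScanner store2 = new CompositeScanner(List.of(fileB));
        // A region-level composite aggregates across the store-level ones.
        FileTrackingScanner region = new CompositeScanner(List.of(store1, store2));
        System.out.println(region.getScannerInitializedFiles());
    }
}
```

Because composites only union what their children report, nesting (region 
over stores over files) needs no extra bookkeeping in this sketch.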

  was:
*Goal:* Introduce a mechanism to track and expose the specific HFiles involved 
in a scan operation.

*Use-case*: This is essential for validations on client side to ensure 
right set of files are scanned (if source of truth is available, for example: 
snapshot data manifest during snapshot based scans), debugging performance 
related issues and analysis on data access patterns.

*Proposed API* Add {{Set<Path> getScannerInitializedFiles()}} to the 
{{KeyValueScanner}} interface.

*Implementation Details*
 * *Capturing list of files when scanner is initialized.*
 ** Leaf Scanners
 *** StoreFileScanner: Returns singleton having the path of the associated 
{{HFile}}.
 *** SnapshotSegmentScanner / CollectionBackedScanner / SegmentScanner: Returns 
empty set.
 ** Composite Scanners
 *** StoreScanner & ReversedStoreScanner: Aggregates files from all active 
{{StoreFileScanners}}
 *** KeyValueHeap & ReversedKeyValueHeap: Aggregates files from its internal 
priority queue of scanners.
 ** Abstract Scanners
 *** NonLazyKeyValueScanner / NonReversedNonLazyKeyValueScanner: Returns empty 
set.
 * *Exposing via RegionScanner & TableSnapshotRecordReader*
 ** RegionScanner: Aggregates files from all underlying StoreScanners
 ** TableSnapshotRecordReader: Proxies the call through ClientSideRegionScanner 
to allow MapReduce jobs to access this for snapshot-based scans.


> Add API to KeyValueScanner to retrieve the set of StoreFiles accessed during 
> a scan
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-29863
>                 URL: https://issues.apache.org/jira/browse/HBASE-29863
>             Project: HBase
>          Issue Type: New Feature
>          Components: API, regionserver, Scanners
>            Reporter: Himanshu Gwalani
>            Assignee: Himanshu Gwalani
>            Priority: Major
>             Fix For: 2.7.0, 3.0.0-beta-2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
