Hi all,

Is there any Scan/Filter API with the following behavior?

Given a time range, I would like the scanner to also include data from HFiles 
outside the range, for row keys that appear in HFiles within the range.
The idea is to scan the in-memory indexes of all HFiles, but read data from 
disk only for row keys that appear in HFiles that are in range.
For example, if HFile1 is in range and HFile2 is out of range, and rowkey1 has 
any data in HFile1, I would like to get all columns of rowkey1 from HFile2 as 
well, as if it were in range.
On the other hand, if rowkey2 is included in HFile2 but not in HFile1, the 
index scanner should just skip to the next row key.
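
To make the semantics I am after concrete, here is a minimal sketch in plain 
Java. It uses no HBase classes: FileModel, its fields, and scan() are 
hypothetical stand-ins I made up for illustration, and cell versions are 
ignored (each column simply keeps the last value merged in).

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class RowExpansionSketch {
    // Hypothetical stand-in for an HFile: min/max cell timestamps plus
    // rowkey -> (column -> value). Not an HBase class.
    static class FileModel {
        final long minTs;
        final long maxTs;
        final Map<String, Map<String, String>> rows;

        FileModel(long minTs, long maxTs, Map<String, Map<String, String>> rows) {
            this.minTs = minTs;
            this.maxTs = maxTs;
            this.rows = rows;
        }

        // True if the file's timestamps overlap the half-open range [start, end).
        boolean overlaps(long start, long end) {
            return minTs < end && maxTs >= start;
        }
    }

    static SortedMap<String, Map<String, String>> scan(List<FileModel> files, long start, long end) {
        // Step 1: collect row keys only from files that are in range.
        SortedSet<String> wanted = new TreeSet<>();
        for (FileModel f : files) {
            if (f.overlaps(start, end)) {
                wanted.addAll(f.rows.keySet());
            }
        }
        // Step 2: for those row keys, merge columns from every file,
        // including files that are out of range.
        SortedMap<String, Map<String, String>> out = new TreeMap<>();
        for (FileModel f : files) {
            for (String key : wanted) {
                Map<String, String> cols = f.rows.get(key);
                if (cols != null) {
                    out.computeIfAbsent(key, k -> new TreeMap<>()).putAll(cols);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // HFile1 is in range and holds rowkey1; HFile2 is out of range and
        // holds rowkey1 and rowkey2.
        FileModel hfile1 = new FileModel(100, 200,
                Map.of("rowkey1", Map.of("cf:a", "new")));
        FileModel hfile2 = new FileModel(10, 20,
                Map.of("rowkey1", Map.of("cf:b", "old"),
                       "rowkey2", Map.of("cf:c", "old")));
        // rowkey1 comes back with both columns; rowkey2 is skipped.
        System.out.println(scan(List.of(hfile1, hfile2), 100, 300));
    }
}
```

In this model, step 1 corresponds to the index-only pass and step 2 to the 
disk reads, which only ever touch row keys found in step 1.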

The use case is to load entire rows that were modified (even in just one 
column) within the last T units of time, while avoiding a full scan or any 
disk read of redundant data.
This is going to be integrated into Spark/MR applications, probably based on 
TableSnapshotInputFormat, so I guess I could ship some custom code for HStore 
or whatever, if it comes to this.

Thank you very much,
Shay.

