[DISCUSS] Inconsistent query results with timerange filter HBASE-29460

Daniel Roudnitsky (BLOOMBERG/ 919 3RD A) Mon, 21 Jul 2025 06:53:06 -0700

Hello all,

I recently created HBASE-29460 - "Inconsistent query results with timerange 
filter" - which I would like to start discussion around. I have marked it as 
critical because the inconsistent query behavior described is indistinguishable 
from data loss from the user perspective - one can have a timerange query which 
returns data, and without any PUTS/DELETES/major compaction happening, the same 
query suddenly stops returning data - and there are more subtle inconsistencies 
that are also possible. I paste the issue description here for convenience:


JIRA link: https://issues.apache.org/jira/browse/HBASE-29460

At my company a team reported that a query with a timerange filter which was 
previously returning a non-empty result began returning an empty result, with 
no deletions or major compactions having occurred between the time the query 
returned data and when it stopped returning data. Upon investigating we found 
that the behavior of GET/SCAN with a timerange filter when there are multiple 
versions of the same column lying around is inconsistent.

The server accumulates excess versions until flush/major compaction, so by
design there will be long periods of time where we have cells that physically
exist but have logically versioned out and should not be visible/queryable by
the user. The issue looks to boil down to the store scanner being able to
return cells that have logically versioned out when:
1) A timerange filter is specified AND
2) The number of cells that fall in the specified timerange which have not
logically versioned out is less than both the number of VERSIONS configured on
the column family and the number of versions specified by the query

Take the example of a user updating the same column over time with new versions 
and occasionally running queries to get the past version of the column that 
existed at a specific point in time. This user will very organically run into 
this scenario where a cell falling in the timerange of interest physically 
exists but has logically versioned out. Whether this user’s timerange query 
returns the matching but logically versioned out cell and how long it continues 
to do so varies depending on:
* How many younger versions exist in the specified timerange (either in
memstore or hfile)
* How the cell got flushed - if the cell was flushed in the same batch as
younger versions of the same column the query may return data before the flush
and stop returning data after the flush
* If the cell survived the flush process, then the query may continue to return
data until major compaction, after which it is physically versioned out and the
query stops returning data

More concretely, take the base case with default VERSIONS=>1 where we do two 
PUTS to the same column with PUT2 timestamp > PUT1 timestamp, and the two cells 
are flushed independently to different hfiles. We observe a few interesting
things (HBase shell code is in a JIRA comment):
1) A query with a timerange filter including only PUT1 timestamp returns PUT1
if executed before major compaction - we return a cell that has logically
versioned out
2) A query to get all versions, without any timerange, only returns PUT2 - we
respect logical versioning here and do not return the PUT1 cell
3) A query to get all versions, with a timerange filter which includes both
PUT1 and PUT2 timestamps, only returns PUT2 - we respect logical versioning here
4) A query to get all versions, with a narrower timerange that includes only
PUT1 timestamp, returns PUT1. This is odd behavior from the user perspective:
this query is identical to query 3 but with a timerange that is a subinterval
of the one in query 3. One would reasonably expect the result of the
subinterval query to be a subset of the results when querying on the larger
interval, but the results are completely disjoint in this case. To give a SQL
example, one would not expect a SELECT * WHERE TIME < 10 to return anything
that would not appear in SELECT * WHERE TIME < 20, which is what happens in our
case
5) After we major compact, PUT1 has physically versioned out and query 1 will
stop returning a result
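
The five observations above are consistent with a simple toy model in which
the timerange check is applied before versions are counted. The Python sketch
below is an illustration only, not HBase code: `Cell`, `query`, and
`major_compact` are hypothetical names, the timestamps 10 and 20 are made up,
and the half-open [min, max) range is assumed to mirror the HBase TimeRange
convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    ts: int


def query(cells, family_versions, max_versions=1, timerange=None):
    """Toy read path: drop cells outside the timerange, THEN count versions."""
    newest_first = sorted(cells, key=lambda c: c.ts, reverse=True)
    if timerange is not None:
        lo, hi = timerange  # half-open: lo <= ts < hi
        newest_first = [c for c in newest_first if lo <= c.ts < hi]
    limit = min(family_versions, max_versions)
    return [c.name for c in newest_first[:limit]]


def major_compact(cells, family_versions):
    """Toy major compaction: physically keep only the newest VERSIONS cells."""
    newest_first = sorted(cells, key=lambda c: c.ts, reverse=True)
    return newest_first[:family_versions]


VERSIONS = 1          # column family setting
ALL = 2**31           # stand-in for "all versions" on the query
PUT1, PUT2 = Cell("PUT1", 10), Cell("PUT2", 20)
store = [PUT1, PUT2]  # two cells of the same column, in separate hfiles

q1 = query(store, VERSIONS, timerange=(10, 11))                    # PUT1 only
q2 = query(store, VERSIONS, max_versions=ALL)                      # no timerange
q3 = query(store, VERSIONS, max_versions=ALL, timerange=(10, 21))  # both puts
q4 = query(store, VERSIONS, max_versions=ALL, timerange=(10, 11))  # PUT1 only
q5 = query(major_compact(store, VERSIONS), VERSIONS, timerange=(10, 11))

print(q1, q2, q3, q4, q5)
```

In this model q3 and q4 are disjoint even though the q4 timerange is a
subinterval of the q3 timerange, and q1 goes from non-empty to empty (q5) with
no write in between.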

For the default VERSIONS=>1 case these version visibility semantics are 
especially strange. A user with VERSIONS=>1 may very reasonably expect that 
only the latest version of a column can ever be returned by a query, regardless 
of filter, but the reality is that the same query with a different timerange 
filter can return an arbitrary number of different versions of the same column 
(up until major compaction). For a user with VERSIONS=>1 who does rely on the
existing semantics, there is still the oddity that they cannot query for all
versions of a column that exist. Since we return at most 1 version for a given
query, they can only slide the timerange around to get at most one version
falling in the timerange (queries 3/4 in example above).

We have additional query indeterminism when we have multiple versions in 
memstore. We keep all (recent) versions in memstore until flushing, and one can 
have a timerange query return logically versioned out cells while they are in 
memstore. At flush time we will flush at most VERSIONS number of cells - we do 
some “opportunistic” version pruning if we had more versions in memstore than 
needed - but this means that before the flush one can have a timerange query 
which returns data, and after the flush the same query no longer returns data, 
and the behavior is dependent on the number of versions that were in memstore 
at the time of flush.
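
The flush dependence can be sketched with a similar toy model. This is a
simplified illustration under stated assumptions, not the HBase flush path:
`flush` and `timerange_query` are hypothetical helpers, each version of the
column is represented by its timestamp alone, and the timestamps are made up.

```python
def timerange_query(cells, versions, lo, hi):
    """Toy read: keep timestamps in [lo, hi), then return the newest `versions`."""
    in_range = sorted((ts for ts in cells if lo <= ts < hi), reverse=True)
    return in_range[:versions]


def flush(memstore, versions):
    """Toy flush: write at most `versions` newest cells to a new hfile."""
    return sorted(memstore, reverse=True)[:versions]


VERSIONS = 1
PUT1_TS, PUT2_TS = 10, 20

# Case A: both versions sit in memstore when the flush happens.
memstore = [PUT1_TS, PUT2_TS]
before_flush = timerange_query(memstore, VERSIONS, 10, 11)
hfile = flush(memstore, VERSIONS)          # PUT1 is pruned here
after_flush = timerange_query(hfile, VERSIONS, 10, 11)

# Case B: each version is flushed on its own, so nothing is pruned.
hfiles = flush([PUT1_TS], VERSIONS) + flush([PUT2_TS], VERSIONS)
separate_flushes = timerange_query(hfiles, VERSIONS, 10, 11)

print(before_flush, after_flush, separate_flushes)
```

The same two writes and the same query give different answers depending only
on whether a flush happened between the writes.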

With NEW_VERSION_BEHAVIOR enabled (HBASE-15968) the query behavior when 
versions are in memstore changes - a timerange query where all versions are in 
memstore won't return logically versioned out cells, but if the versioned out 
cell was written out to an hfile then it is queryable. I have not tested
NEW_VERSION_BEHAVIOR thoroughly, but from my initial testing it does not
resolve the issues here, although it does change some of the query behavior in
question.

I am of the opinion that we should not return logically versioned out cells by 
default regardless of filter so that query behavior is consistent/predictable 
and users can reason about how things will behave without deep diving HBase 
internals and understanding the corner cases involved here. Timerange queries 
look to have behaved this way for a long time (HBASE-10102) so this would be an 
incompatible change to version visibility semantics. If we want to continue to 
support querying data that has been logically versioned out we could have a new 
API/flag that allows one to do so if explicitly enabled, very similar to the 
raw scan option which allows one to read tombstoned data that is still hanging 
around.

Where we need to preserve the existing version visibility semantics for
compatibility reasons, I am of the opinion that those semantics should behave
more predictably. I propose that we do not do version pruning at flush time, so
that the timing of PUTS/flushes cannot change a query result, and that we
update the docs to make it clear that major compaction can change a timerange
query result.
