Hello all,

I recently created HBASE-29460 - "Inconsistent query results with timerange filter" - and would like to start a discussion around it. I have marked it as critical because the inconsistent query behavior described is indistinguishable from data loss from the user's perspective: one can have a timerange query which returns data and, without any PUTS/DELETES/major compaction happening, the same query suddenly stops returning data. More subtle inconsistencies are also possible.

I paste the issue description here for convenience:
JIRA link: https://issues.apache.org/jira/browse/HBASE-29460

At my company a team reported that a query with a timerange filter which was previously returning a non-empty result began returning an empty result, with no deletions or major compactions having occurred between the time the query returned data and when it stopped returning data. Upon investigating, we found that the behavior of GET/SCAN with a timerange filter is inconsistent when there are multiple versions of the same column lying around.

The server accumulates excess versions until flush/major compaction, so by design there will be long periods of time where we have cells that physically exist but have logically versioned out and should not be visible/queryable by the user. The issue looks to boil down to the store scanner being able to return cells that have logically versioned out when:

1) A timerange filter is specified AND
2) The number of cells that fall in the specified timerange which have not logically versioned out is less than both the number of VERSIONS configured on the column family and the number of versions specified by the query

Take the example of a user updating the same column over time with new versions and occasionally running queries to get the past version of the column that existed at a specific point in time. This user will very organically run into this scenario, where a cell falling in the timerange of interest physically exists but has logically versioned out.
Whether this user’s timerange query returns the matching but logically versioned out cell, and how long it continues to do so, varies depending on:

* How many younger versions exist in the specified timerange (either in memstore or hfile)
* How the cell got flushed - if the cell was flushed in the same batch as younger versions of the same column, the query may return data before the flush and stop returning data after the flush
* Whether the cell survived the flush process - if it did, the query may continue to return data until major compaction, after which the cell is physically versioned out and the query stops returning data

More concretely, take the base case with default VERSIONS=>1 where we do two PUTS to the same column with PUT2 timestamp > PUT1 timestamp, and the two cells are flushed independently to different hfiles. We observe a few interesting things (hbase shell code in jira comment):

1) A query with a timerange filter including only PUT1 timestamp returns PUT1 if executed before major compaction - we return a cell that has logically versioned out
2) A query to get all versions, without any timerange, only returns PUT2 - we respect logical versioning here and do not return the PUT1 cell
3) A query to get all versions, with a timerange filter which includes both PUT1 and PUT2 timestamps, only returns PUT2 - we respect logical versioning here
4) A query to get all versions, with a narrower timerange that includes only PUT1 timestamp, returns PUT1. This is odd behavior from the user's perspective: this query is identical to query 3 but with a timerange that is a subinterval of the one in query 3. One would reasonably expect the result of the subinterval query to be a subset of the result on the larger interval, but the results are completely disjoint in this case.
   To give a SQL example, one would not expect a SELECT * WHERE TIME < 10 to return anything that would not appear in SELECT * WHERE TIME < 20, which is what happens in our case
5) After we major compact, PUT1 has physically versioned out and query 1 will stop returning a result

For the default VERSIONS=>1 case these version visibility semantics are especially strange. A user with VERSIONS=>1 may very reasonably expect that only the latest version of a column can ever be returned by a query, regardless of filter, but the reality is that the same query with a different timerange filter can return an arbitrary number of different versions of the same column (up until major compaction). For a user with VERSIONS=>1 who does rely on the existing semantics, there is still the oddity that they cannot query for all versions of a column that exist: since we return at most 1 version for a given query, they can only slide the timerange around to get at most one version falling in the timerange (queries 3/4 in the example above).

We have additional query nondeterminism when we have multiple versions in memstore. We keep all (recent) versions in memstore until flushing, and one can have a timerange query return logically versioned out cells while they are in memstore. At flush time we will flush at most VERSIONS number of cells - we do some “opportunistic” version pruning if we had more versions in memstore than needed. This means that before the flush one can have a timerange query which returns data, and after the flush the same query no longer returns data, and the behavior is dependent on the number of versions that were in memstore at the time of flush.

With NEW_VERSION_BEHAVIOR enabled (HBASE-15968) the query behavior when versions are in memstore changes - a timerange query where all versions are in memstore won't return logically versioned out cells, but if the versioned out cell was written out to an hfile then it is queryable.
I have not tested NEW_VERSION_BEHAVIOR thoroughly, but from my initial testing it changes some of the query behavior in question without resolving the issues described here.

I am of the opinion that we should not return logically versioned out cells by default, regardless of filter, so that query behavior is consistent/predictable and users can reason about how things will behave without deep diving HBase internals and understanding the corner cases involved here. Timerange queries look to have behaved this way for a long time (HBASE-10102), so this would be an incompatible change to version visibility semantics. If we want to continue to support querying data that has been logically versioned out, we could have a new API/flag that allows one to do so if explicitly enabled, very similar to the raw scan option which allows one to read tombstoned data that is still hanging around.

Where we need to preserve the existing version visibility semantics for compatibility reasons, I am of the opinion that those semantics should behave more predictably. I propose that we do not do version pruning at flush time, so that the timing of PUTS/flushes cannot change a query result, and that we update the docs to make it clear that major compaction can change a timerange query result.
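
For anyone who wants to reason about the base case without a cluster, here is a minimal sketch of a toy model in Python. It is not HBase code. It assumes that the current behavior is equivalent to dropping cells outside the timerange before versions are counted, which is consistent with queries 1-4 above but is an assumption about the store scanner, not a copy of it. All names (Cell, scan_filter_first, scan_versions_first, major_compact) are hypothetical, and the timerange is treated as [min, max) like the client TimeRange. The second function models the proposed semantics, where logical versioning is applied before the timerange.

```python
from dataclasses import dataclass

ALL_VERSIONS = 2**31 - 1  # stand-in for "get all versions"


@dataclass(frozen=True)
class Cell:
    ts: int
    value: str


def _newest_first(cells):
    return sorted(cells, key=lambda c: c.ts, reverse=True)


def _in_range(cell, timerange):
    lo, hi = timerange  # [lo, hi): min inclusive, max exclusive
    return lo <= cell.ts < hi


def scan_filter_first(cells, cf_versions, query_versions=1, timerange=None):
    """Assumed model of today's behavior: timerange first, then count versions."""
    candidates = _newest_first(cells)
    if timerange is not None:
        candidates = [c for c in candidates if _in_range(c, timerange)]
    return candidates[:min(cf_versions, query_versions)]


def scan_versions_first(cells, cf_versions, query_versions=1, timerange=None):
    """Model of the proposal: never return logically versioned out cells."""
    live = _newest_first(cells)[:cf_versions]
    if timerange is not None:
        live = [c for c in live if _in_range(c, timerange)]
    return live[:query_versions]


def major_compact(cells, cf_versions):
    """Toy major compaction: physically keep only the newest cf_versions cells."""
    return _newest_first(cells)[:cf_versions]


# Base case: VERSIONS => 1, two PUTS flushed to different hfiles.
PUT1, PUT2 = Cell(10, "v1"), Cell(20, "v2")
cells = [PUT1, PUT2]

q1 = scan_filter_first(cells, 1, timerange=(10, 11))                # [PUT1]
q2 = scan_filter_first(cells, 1, ALL_VERSIONS)                      # [PUT2]
q3 = scan_filter_first(cells, 1, ALL_VERSIONS, timerange=(10, 21))  # [PUT2]
q4 = scan_filter_first(cells, 1, ALL_VERSIONS, timerange=(10, 11))  # [PUT1]

# Query 5: the same query as q1, after major compaction.
q5 = scan_filter_first(major_compact(cells, 1), 1, timerange=(10, 11))  # []

# Under the proposed semantics the subinterval result is a subset again.
p3 = scan_versions_first(cells, 1, ALL_VERSIONS, timerange=(10, 21))  # [PUT2]
p4 = scan_versions_first(cells, 1, ALL_VERSIONS, timerange=(10, 11))  # []
```

In this model q4 and q3 are disjoint even though the q4 timerange is a subinterval of the q3 timerange, and q1 changes from [PUT1] to [] purely because of major compaction. With versions counted first, p4 is a subset of p3 and the result no longer depends on whether the compaction has run.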