Daniel Roudnitsky created HBASE-29460:
-----------------------------------------
             Summary: Inconsistent query results with timerange filter when there are multiple column versions
                 Key: HBASE-29460
                 URL: https://issues.apache.org/jira/browse/HBASE-29460
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.5.12, 3.0.0-beta-1
            Reporter: Daniel Roudnitsky
            Assignee: Daniel Roudnitsky


A team at $dayjob reported that a query with a timerange filter which had previously returned a non-empty result began returning an empty result, with no deletions or major compactions having occurred between the time the query returned data and the time it stopped returning data. Upon investigating, we found that the behavior of GET/SCAN with a timerange filter is inconsistent when there are multiple versions of the same column lying around.

The server accumulates excess versions until flush/major compaction, so by design there will be long periods of time where we have cells that physically exist but have logically versioned out and should not be visible/queryable by the user (at least that seems to have been the intention?). The issue looks to boil down to the store scanner being able to return cells that have logically versioned out when:
# A timerange filter is specified AND
# The number of cells that fall in the specified timerange which have not logically versioned out does not exceed the maximum number of VERSIONS configured on the column family.

Take the example of a user updating the same column over time with new versions and occasionally running queries to get the past version of the column that existed at a specific point in time. This user will very organically run into this scenario where a cell falling in the timerange of interest physically exists but has logically versioned out.
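
To make the scenario above easier to reason about, here is a toy model in Python. It is only a sketch and not HBase code: it assumes the reported behavior is equivalent to applying the timerange filter before counting versions, and every name in it (`newest_versions`, `query_filter_then_count`, `query_count_then_filter`) is hypothetical. Cells are reduced to bare timestamps, and the time range is treated as half-open, [min, max).

```python
# Toy model only: a column is a list of cell timestamps. Nothing here is HBase API.

def newest_versions(timestamps, max_versions):
    """Return the newest `max_versions` timestamps, newest first."""
    return sorted(timestamps, reverse=True)[:max_versions]


def query_filter_then_count(timestamps, max_versions, time_range):
    """Assumed ordering behind the reported behavior:
    apply the timerange first, then count versions among the survivors."""
    lo, hi = time_range  # half-open interval [lo, hi)
    in_range = [ts for ts in timestamps if lo <= ts < hi]
    return newest_versions(in_range, max_versions)


def query_count_then_filter(timestamps, max_versions, time_range):
    """Alternative ordering: drop logically versioned out cells first,
    then apply the timerange to what is left."""
    lo, hi = time_range
    live = newest_versions(timestamps, max_versions)
    return [ts for ts in live if lo <= ts < hi]


# A user updates the same column three times with VERSIONS => 1,
# so only the cell at ts=300 is logically live.
cells = [100, 200, 300]
past = (0, 150)  # "what did the column look like around ts=100?"

print(query_filter_then_count(cells, 1, past))  # [100]
print(query_count_then_filter(cells, 1, past))  # []
```

Under the first ordering the old cell at ts=100 is returned for as long as it physically exists; under the second it is never returned. Which of the two a real cluster exhibits at a given moment is exactly what depends on flush and compaction timing.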

Whether this user’s timerange query returns the matching but logically versioned out cell, and how long it continues to do so, varies depending on:
* How many younger versions exist in the specified timerange (either in memstore or hfile)
* How the cell got flushed - if the cell was flushed in the same batch as younger versions of the same column, the query may return data before the flush and stop returning data after the flush
* If the cell survived the flush process in (2), then the query may continue to return data until major compaction, after which it is physically versioned out and the query stops returning data

More concretely, take the base case with default VERSIONS=>1 where we do two PUTS to the same column with PUT2 timestamp > PUT1 timestamp, and the two cells are flushed independently to different hfiles. We observe a few interesting things (hbase shell code in jira comment):
# A query with a timerange filter including only PUT1 timestamp returns PUT1 if executed before major compaction - we return a cell that has logically versioned out
# A query to get all versions, without any timerange, only returns PUT2 - we respect logical versioning here and do not return the PUT1 cell
# A query to get all versions, with a timerange filter which includes both PUT1 and PUT2 timestamps, only returns PUT2 - we respect logical versioning here
# A query to get all versions, with a narrower timerange that includes only PUT1 timestamp, returns PUT1. This is odd behavior from the user's perspective: this query is identical to query 3 but with a time range that is a subinterval of the one in query 3. One would reasonably expect the result of the subinterval query to be a subset of the results when querying on the larger interval, but the results are completely disjoint in this case.
To give a SQL example, one would not expect a SELECT * WHERE TIME < 10 to return anything that would not appear in SELECT * WHERE TIME < 20, which is what happens in our case
# After we major compact, PUT1 has physically versioned out and query 1 will stop returning a result

We have additional query nondeterminism when we have multiple versions in memstore. We keep all (recent) versions in memstore until flushing, and one can have a timerange query return logically versioned out cells while they are in memstore. At flush time we will flush at most VERSIONS number of cells - we do some “opportunistic” version pruning if we had more versions in memstore than needed - but this means that before the flush one can have a timerange query which returns data, and after the flush the same query no longer returns data, and the behavior is dependent on the number of versions that were in memstore at the time of flush.

I am of the (possibly naive) opinion that we should not return logically versioned out cells by default, so that query behavior is consistent/predictable and users can reason about how things will behave without deep diving into HBase internals and understanding the corner cases involved here. I am not sure how long timerange queries have behaved this way, probably a long time. If we really want to preserve this behavior, then I think at the very least it should behave predictably - timing of PUTS/flushes should not change the query result, and we should be clear in the docs that major compaction can change the query result (even if you do not do any deletes).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)