Daniel Roudnitsky created HBASE-29460:
-----------------------------------------
Summary: Inconsistent query results with timerange filter when
there are multiple column versions
Key: HBASE-29460
URL: https://issues.apache.org/jira/browse/HBASE-29460
Project: HBase
Issue Type: Bug
Affects Versions: 2.5.12, 3.0.0-beta-1
Reporter: Daniel Roudnitsky
Assignee: Daniel Roudnitsky
A team at $dayjob reported that a query with a timerange filter which was
previously returning a non-empty result began returning an empty result, with
no deletions or major compactions having occurred between the time the query
returned data and when it stopped returning data. Upon investigating we found
that the behavior of GET/SCAN with a timerange filter when there are multiple
versions of the same column lying around is inconsistent.
The server accumulates excess versions until flush/major compaction, so by
design there will be long periods of time where we have cells that physically
exist but have logically versioned out and should not be visible/queryable by
the user (at least that seems to have been the intention?). The issue appears
to boil down to the store scanner being able to return cells that have
logically versioned out when:
# A timerange filter is specified AND
# The number of cells that fall in the specified timerange which have not
logically versioned out is less than the maximum number of VERSIONS
configured on the column family.
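A toy model in Python makes the two conditions easier to see. This is a sketch, not HBase code: it assumes the time range is applied before versions are counted, which is consistent with the behavior described in this issue but is not taken from the store scanner source. The names `returned` and `leaked` are made up for the illustration.

```python
def returned(timestamps, max_versions, time_range=None):
    """Timestamps a query returns in this toy model, newest first.

    Assumption (not from HBase source): cells outside the half-open
    time range [lo, hi) are dropped before versions are counted.
    """
    ordered = sorted(timestamps, reverse=True)
    if time_range is not None:
        lo, hi = time_range
        ordered = [ts for ts in ordered if lo <= ts < hi]
    return ordered[:max_versions]


def leaked(timestamps, max_versions, time_range):
    """Cells a time-range query returns although they have logically versioned out."""
    live = set(returned(timestamps, max_versions))  # what an unfiltered read sees
    hits = returned(timestamps, max_versions, time_range)
    return [ts for ts in hits if ts not in live]


cells = [100, 200, 300]             # three physical versions, VERSIONS => 1
print(leaked(cells, 1, (0, 150)))   # [100]: no live cell in range, old cell shows up
print(leaked(cells, 1, (0, 400)))   # []: the live cell in range fills the version slot
```

In this model the old cell only surfaces when the live cells inside the range leave a version slot unfilled.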
Take the example of a user updating the same column over time with new versions
and occasionally running queries to get the past version of the column that
existed at a specific point in time. This user will very organically run into
this scenario where a cell falling in the timerange of interest physically
exists but has logically versioned out. Whether this user’s timerange query
returns the matching but logically versioned out cell and how long it continues
to do so varies depending on:
* How many younger versions exist in the specified timerange (either in
memstore or hfile)
* How the cell got flushed - if the cell was flushed in the same batch as
younger versions of the same column the query may return data before the flush
and stop returning data after the flush
* If the cell survived the flush process in the previous point, then the query
may continue to return data until major compaction, after which it is
physically versioned out and the query stops returning data
More concretely, take the base case with default VERSIONS=>1 where we do two
PUTs to the same column with PUT2 timestamp > PUT1 timestamp, and the two cells
are flushed independently to different hfiles. We observe a few interesting
things (HBase shell code in Jira comment):
# A query with a timerange filter including only PUT1 timestamp returns PUT1 if
executed before major compaction - we return a cell that has logically
versioned out
# A query to get all versions, without any timerange, only returns PUT2 - we
respect logical versioning here and do not return the PUT1 cell
# A query to get all versions, with a timerange filter which includes both PUT1
and PUT2 timestamps, only returns PUT2 - we respect logical versioning here
# A query to get all versions, with a narrower timerange that includes only
PUT1 timestamp, returns PUT1. This is odd behavior from the user's perspective:
the query is identical to query 3 except that its time range is a subinterval
of the one in query 3. One would reasonably expect the result of the
subinterval query to be a subset of the result for the larger interval, but
the results are completely disjoint in this case. To give a SQL example, one
would not expect SELECT * WHERE TIME < 10 to return anything that does not
appear in SELECT * WHERE TIME < 20, yet that is what happens in our case
# After we major compact, PUT1 has physically versioned out and query 1 will
stop returning a result
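The observations above can be replayed against a small Python model. It is only a sketch under stated assumptions: cells outside the time range are dropped before versions are counted, a "get all versions" request is capped by the column family's VERSIONS (so queries 1 and 4 collapse into the same call here), and major compaction keeps only the newest VERSIONS cells. The functions `query` and `major_compact` and the timestamps 10 and 20 are hypothetical, not HBase APIs or values from the report.

```python
def query(hfiles, cf_versions, time_range=None):
    """Toy read over a list of hfiles, each a list of cell timestamps.

    Assumption (not from HBase source): cells outside the half-open
    range [lo, hi) are dropped before versions are counted.
    """
    cells = sorted((ts for hfile in hfiles for ts in hfile), reverse=True)
    if time_range is not None:
        lo, hi = time_range
        cells = [ts for ts in cells if lo <= ts < hi]
    return cells[:cf_versions]


def major_compact(hfiles, cf_versions):
    """Merge all hfiles into one, keeping only the newest cf_versions cells."""
    return [query(hfiles, cf_versions)]


PUT1, PUT2 = 10, 20
hfiles = [[PUT1], [PUT2]]          # flushed independently, VERSIONS => 1

q1 = query(hfiles, 1, (0, 11))     # range covers PUT1 only (queries 1 and 4)
q2 = query(hfiles, 1)              # no time range (query 2)
q3 = query(hfiles, 1, (0, 21))     # range covers PUT1 and PUT2 (query 3)
print(q1, q2, q3)                  # [10] [20] [20]
print(set(q1) <= set(q3))          # False: narrower range, disjoint result

hfiles = major_compact(hfiles, 1)
print(query(hfiles, 1, (0, 11)))   # []: PUT1 is physically gone
```

The `False` line is the subset violation from observation 4: narrowing the range produces a cell that the wider range never returned.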
We have additional query nondeterminism when we have multiple versions in
memstore. We keep all (recent) versions in memstore until flushing, and one can
have a timerange query return logically versioned out cells while they are in
memstore. At flush time we will flush at most VERSIONS number of cells - we do
some “opportunistic” version pruning if we had more versions in memstore than
needed - but this means that before the flush one can have a timerange query
which returns data, and after the flush the same query no longer returns data,
and the behavior is dependent on the number of versions that were in memstore
at the time of flush.
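The flush dependence can be sketched the same way. Again this is a toy Python model rather than HBase's flush path: it assumes a flush writes at most VERSIONS of the newest cells per column and that reads apply the time range before counting versions. `read`, `flush` and the timestamps are illustrative names and values only.

```python
def read(cells, cf_versions, time_range):
    """Toy time-range read: filter by [lo, hi) first, then cap at cf_versions."""
    lo, hi = time_range
    ordered = sorted(cells, reverse=True)
    return [ts for ts in ordered if lo <= ts < hi][:cf_versions]


def flush(memstore, cf_versions):
    """Assumed flush behavior: write at most cf_versions of the newest cells."""
    return sorted(memstore, reverse=True)[:cf_versions]


PUT1, PUT2 = 10, 20
only_put1 = (0, 11)                    # time range that covers PUT1 only

# Case 1: both versions sit in memstore when the flush happens.
memstore = [PUT1, PUT2]
before = read(memstore, 1, only_put1)  # [10]
hfile = flush(memstore, 1)             # [20], PUT1 pruned at flush
after = read(hfile, 1, only_put1)      # []
print(before, after)                   # [10] []

# Case 2: PUT1 is flushed on its own before PUT2 is written.
hfile_a = flush([PUT1], 1)
hfile_b = flush([PUT2], 1)
print(read(hfile_a + hfile_b, 1, only_put1))   # [10]
```

The same two PUTs and the same query give different answers depending only on whether a flush happened between the writes.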
I am of the (possibly naive) opinion that we should not return logically
versioned out cells by default so that query behavior is consistent/predictable
and users can reason about how things will behave without deep diving HBase
internals and understanding the corner cases involved here. I am not sure how
long timerange queries have behaved this way, probably a long time. If we
really want to preserve this behavior, then I think at the very least it should
behave predictably: timing of PUTs/flushes should not change the query result,
and we should be clear in the docs that major compaction can change the query
result (even if you do not do any deletes).