Daniel Roudnitsky created HBASE-29460:
-----------------------------------------

             Summary: Inconsistent query results with timerange filter when 
there are multiple column versions
                 Key: HBASE-29460
                 URL: https://issues.apache.org/jira/browse/HBASE-29460
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.5.12, 3.0.0-beta-1
            Reporter: Daniel Roudnitsky
            Assignee: Daniel Roudnitsky


A team at $dayjob reported that a query with a timerange filter which was 
previously returning a non-empty result began returning an empty result, with 
no deletions or major compactions having occurred between the time the query 
returned data and when it stopped returning data. Upon investigating we found 
that the behavior of GET/SCAN with a timerange filter when there are multiple 
versions of the same column lying around is inconsistent. 

The server accumulates excess versions until flush/major compaction, so by 
design there will be long periods of time where we have cells that physically 
exist but have logically versioned out and should not be visible/queryable by 
the user (at least that seems to have been the intention?). The issue looks to 
boil down to the store scanner being able to return cells that have logically 
versioned out when:
# A timerange filter is specified AND
# The number of cells that fall in the specified timerange which have not 
logically versioned out does not exceed the maximum number of VERSIONS 
configured on the column family. 
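
To make the two conditions above concrete, here is a toy Python model (not HBase code). It assumes the effect comes from the order of two steps: applying the timerange before counting versions versus counting versions first. The function names, the (timestamp, value) cell shape and the timestamps 100/200 are all made up for illustration; the only HBase fact relied on is that a timerange is a half-open interval [min, max).

```python
from typing import List, Tuple

Cell = Tuple[int, str]  # (timestamp, value) for one row/column; hypothetical shape


def timerange_then_versions(cells: List[Cell], min_ts: int, max_ts: int,
                            max_versions: int) -> List[Cell]:
    """Assumed 'leaky' order: drop cells outside [min_ts, max_ts) first,
    then keep the newest max_versions of whatever is left."""
    in_range = [c for c in cells if min_ts <= c[0] < max_ts]
    newest_first = sorted(in_range, key=lambda c: c[0], reverse=True)
    return newest_first[:max_versions]


def versions_then_timerange(cells: List[Cell], min_ts: int, max_ts: int,
                            max_versions: int) -> List[Cell]:
    """Assumed 'strict' order: keep the newest max_versions overall,
    then drop the survivors that fall outside [min_ts, max_ts)."""
    newest_first = sorted(cells, key=lambda c: c[0], reverse=True)
    live = newest_first[:max_versions]
    return [c for c in live if min_ts <= c[0] < max_ts]


# Two versions of the same column, VERSIONS => 1, timerange covering only ts=100.
cells = [(100, "v1"), (200, "v2")]
leaky = timerange_then_versions(cells, 100, 101, 1)
strict = versions_then_timerange(cells, 100, 101, 1)
print("timerange first:", leaky)
print("versions first: ", strict)
```

In this sketch the first ordering returns the older cell even though a younger version exists, while the second returns nothing. Which ordering the store scanner actually applies is exactly what this issue is questioning, so treat the model as a way to reason about the symptom rather than a description of the implementation.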

Take the example of a user updating the same column over time with new versions 
and occasionally running queries to get the past version of the column that 
existed at a specific point in time. This user will very organically run into 
this scenario where a cell falling in the timerange of interest physically 
exists but has logically versioned out. Whether this user’s timerange query 
returns the matching but logically versioned out cell and how long it continues 
to do so varies depending on 
* How many younger versions exist in the specified timerange (either in 
memstore or hfile)
* How the cell got flushed - if the cell was flushed in the same batch as 
younger versions of the same column the query may return data before the flush 
and stop returning data after the flush 
* If the cell survived the flush described in the previous bullet, then the 
query may continue to return data until major compaction, after which it is 
physically versioned out and the query stops returning data

More concretely, take the base case with default VERSIONS=>1 where we do two 
PUTS to the same column with PUT2 timestamp > PUT1 timestamp, and the two cells 
are flushed independently to different hfiles. We observe a few interesting 
things (hbase shell code in jira comment):
# A query with a timerange filter including only PUT1 timestamp returns PUT1 if 
executed before major compaction - we return a cell that has logically 
versioned out
# A query to get all versions, without any timerange, only returns PUT2 - we 
respect logical versioning here and do not return the PUT1 cell
# A query to get all versions, with a timerange filter which includes both PUT1 
and PUT2 timestamps, only returns PUT2 - we respect logical versioning here 
# A query to get all versions, with a narrower timerange that includes only 
PUT1 timestamp, returns PUT1. This is odd behavior from the user's perspective: 
the query is identical to query 3 except that its timerange is a subinterval of 
the one in query 3. One would reasonably expect the result of the subinterval 
query to be a subset of the result for the larger interval, but the results are 
completely disjoint in this case. To give a SQL example, one would not expect 
SELECT * WHERE TIME < 10 to return anything that does not appear in 
SELECT * WHERE TIME < 20, which is what happens in our case
# After we major compact, PUT1 has physically versioned out and query 1 will 
stop returning a result
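
The five observations above can be replayed in a small Python simulation. This is only a sketch and not the hbase shell code from the jira comment. It assumes the scanner filters by timerange before counting versions, that the number of versions returned is capped by the column family VERSIONS even when all versions are requested, and that major compaction keeps only the newest VERSIONS cells. The names scan and major_compact and the timestamps 100/200 are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, str]  # (timestamp, value) for a single row/column

VERSIONS = 1  # column family setting from the example


def scan(hfiles: List[List[Cell]], min_ts: float = 0,
         max_ts: float = float("inf")) -> List[Cell]:
    """Toy read path: merge all files, apply the half-open timerange
    [min_ts, max_ts), then keep the newest VERSIONS cells (assumed order)."""
    merged = [c for f in hfiles for c in f]
    in_range = [c for c in merged if min_ts <= c[0] < max_ts]
    return sorted(in_range, key=lambda c: c[0], reverse=True)[:VERSIONS]


def major_compact(hfiles: List[List[Cell]]) -> List[List[Cell]]:
    """Toy major compaction: rewrite everything into one file that keeps
    only the newest VERSIONS cells."""
    merged = [c for f in hfiles for c in f]
    return [sorted(merged, key=lambda c: c[0], reverse=True)[:VERSIONS]]


PUT1, PUT2 = (100, "v1"), (200, "v2")
hfiles = [[PUT1], [PUT2]]  # each PUT flushed to its own hfile

q1 = scan(hfiles, 100, 101)   # timerange covering only PUT1
q2 = scan(hfiles)             # all versions, no timerange
q3 = scan(hfiles, 100, 201)   # all versions, timerange covering both PUTs
q4 = scan(hfiles, 100, 101)   # all versions, timerange covering only PUT1
subset_holds = set(q4) <= set(q3)
q1_after = scan(major_compact(hfiles), 100, 101)  # query 1 after compaction

for name, result in [("q1", q1), ("q2", q2), ("q3", q3), ("q4", q4),
                     ("q1 after major compaction", q1_after)]:
    print(name, result)
print("q4 is a subset of q3:", subset_holds)
```

Under these assumptions the model returns PUT1 for queries 1 and 4, PUT2 for queries 2 and 3, and nothing for query 1 after compaction, so the subinterval result is not a subset of the wider-interval result.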

We have additional query nondeterminism when we have multiple versions in 
memstore. We keep all (recent) versions in memstore until flushing, and one can 
have a timerange query return logically versioned out cells while they are in 
memstore. At flush time we will flush at most VERSIONS number of cells - we do 
some “opportunistic” version pruning if we had more versions in memstore than 
needed - but this means that before the flush one can have a timerange query 
which returns data, and after the flush the same query no longer returns data, 
and the behavior is dependent on the number of versions that were in memstore 
at the time of flush. 
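
A toy Python model of the flush dependence described above (again not HBase code). It assumes that a flush writes only the newest VERSIONS cells of a column from memstore and that reads apply the timerange before counting versions. The helper names and the timestamps 100/200 are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, str]  # (timestamp, value) for a single row/column

VERSIONS = 1  # column family setting


def flush(memstore: List[Cell]) -> List[Cell]:
    """Toy flush: write out at most VERSIONS cells, newest first
    (the 'opportunistic' pruning, as assumed here)."""
    return sorted(memstore, key=lambda c: c[0], reverse=True)[:VERSIONS]


def timerange_get(cells: List[Cell], min_ts: int, max_ts: int) -> List[Cell]:
    """Toy read: filter by [min_ts, max_ts) first, then cap at VERSIONS."""
    in_range = [c for c in cells if min_ts <= c[0] < max_ts]
    return sorted(in_range, key=lambda c: c[0], reverse=True)[:VERSIONS]


# Scenario A: both PUTs sit in memstore together and are flushed in one batch.
memstore = [(100, "v1"), (200, "v2")]
before_a = timerange_get(memstore, 100, 101)
after_a = timerange_get(flush(memstore), 100, 101)

# Scenario B: same two PUTs, but a flush happens between them.
hfiles = flush([(100, "v1")]) + flush([(200, "v2")])
after_b = timerange_get(hfiles, 100, 101)

print("A before flush:", before_a)
print("A after flush: ", after_a)
print("B after flushes:", after_b)
```

In this model the same two writes and the same query give a non-empty result before the flush and an empty one after it in scenario A, while scenario B keeps returning the older cell, which is the dependence on flush timing that the paragraph above describes.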

I am of the (possibly naive) opinion that we should not return logically 
versioned out cells by default so that query behavior is consistent/predictable 
and users can reason about how things will behave without deep diving HBase 
internals and understanding the corner cases involved here. I am not sure how 
long timerange queries have behaved this way, probably a long time. If we 
really want to preserve this behavior, then I think at the very least it should 
behave predictably - timing of PUTS/flushes should not change the query result, 
and we should be clear in the docs that major compaction can change the query 
result (even if you do not do any deletes).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
