[
https://issues.apache.org/jira/browse/HBASE-29460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008538#comment-18008538
]
Daniel Roudnitsky commented on HBASE-29460:
-------------------------------------------
Code for the query examples in the issue description
{code:java}
> create 'test_versions', {NAME => 'cf1'}
> put 'test_versions', 'row1', 'cf1:col1', 'put1', 10
> flush 'test_versions'
> put 'test_versions', 'row1', 'cf1:col1', 'put2', 20
> flush 'test_versions'{code}
{code:java}
# QUERY 1: We return put1, which has logically versioned out
> get 'test_versions', 'row1', {COLUMN => 'cf1:col1', TIMERANGE=>[0, 15]}
COLUMN CELL
cf1:col1 timestamp=1969-12-31T19:00:00.010, value=put1
1 row(s)
# QUERY 2: Only put2 appears in the scan for "all" versions
> scan 'test_versions', {VERSIONS => 100}
ROW COLUMN+CELL
row1 column=cf1:col1, timestamp=1969-12-31T19:00:00.020, value=put2
1 row(s)
# QUERY 3: Only put2 appears in the scan for "all" versions with timerange
> scan 'test_versions', {VERSIONS => 100, TIMERANGE=>[0, 30]}
ROW COLUMN+CELL
row1 column=cf1:col1, timestamp=1969-12-31T19:00:00.020, value=put2
1 row(s)
# QUERY 4: Same as query 3 with subinterval timerange excluding put2, put1 now appears
> scan 'test_versions', {VERSIONS => 100, TIMERANGE=>[0, 15]}
ROW COLUMN+CELL
row1 column=cf1:col1, timestamp=1969-12-31T19:00:00.010, value=put1
1 row(s)
# QUERY 5: We major compact and run query 1, which previously returned data, and see that it no longer returns data
> major_compact 'test_versions'
> get 'test_versions', 'row1', {COLUMN => 'cf1:col1', TIMERANGE=>[0, 15]}
COLUMN CELL
0 row(s)
{code}
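For anyone without a cluster handy, the five results above can be reproduced with a small toy model in Python. This is only a sketch. It assumes the read path applies the timerange first and counts versions afterwards, capped at the column family's VERSIONS. It is not HBase's store scanner code. The names Cell, read, flush and major_compact are hypothetical and exist only for this illustration, and the model covers a single column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    ts: int
    value: str


def read(stores, cf_versions, requested_versions=1, timerange=None):
    """Toy read path (assumption): filter by timerange, then count versions."""
    cells = [c for store in stores for c in store]
    if timerange is not None:
        lo, hi = timerange  # half-open [lo, hi), as with TIMERANGE in the shell
        cells = [c for c in cells if lo <= c.ts < hi]
    cells.sort(key=lambda c: c.ts, reverse=True)  # newest first
    # Assumption: a query never gets more versions than the column family keeps.
    limit = min(cf_versions, requested_versions)
    return [c.value for c in cells[:limit]]


def flush(memstore, cf_versions):
    """Toy flush: write out only the newest cf_versions cells."""
    newest_first = sorted(memstore, key=lambda c: c.ts, reverse=True)
    return newest_first[:cf_versions]


def major_compact(stores, cf_versions):
    """Toy major compaction: merge all stores, keep the newest cf_versions cells."""
    return [flush([c for store in stores for c in store], cf_versions)]


# put1 and put2 flushed independently into two hfiles, VERSIONS => 1 on the family
hfiles = [[Cell(10, "put1")], [Cell(20, "put2")]]

query1 = read(hfiles, 1, timerange=(0, 15))
query2 = read(hfiles, 1, requested_versions=100)
query3 = read(hfiles, 1, requested_versions=100, timerange=(0, 30))
query4 = read(hfiles, 1, requested_versions=100, timerange=(0, 15))
query5 = read(major_compact(hfiles, 1), 1, timerange=(0, 15))

# Both puts still in one memstore: same timerange query before and after a flush
memstore = [Cell(10, "put1"), Cell(20, "put2")]
before_flush = read([memstore], 1, timerange=(0, 15))
after_flush = read([flush(memstore, 1)], 1, timerange=(0, 15))
```

In this model queries 1 and 4 return put1, queries 2 and 3 return put2, and query 5 returns nothing, matching the shell output above. The last two reads show the flush dependence from the issue description: the [0, 15) query returns put1 while both cells sit in the memstore and returns nothing once they are flushed together.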
> Inconsistent query results with timerange filter when there are multiple
> column versions
> ----------------------------------------------------------------------------------------
>
> Key: HBASE-29460
> URL: https://issues.apache.org/jira/browse/HBASE-29460
> Project: HBase
> Issue Type: Bug
> Affects Versions: 3.0.0-beta-1, 2.5.12
> Reporter: Daniel Roudnitsky
> Assignee: Daniel Roudnitsky
> Priority: Critical
>
> A team at $dayjob reported that a query with a timerange filter which was
> previously returning a non-empty result began returning an empty result, with
> no deletions or major compactions having occurred between the time the query
> returned data and when it stopped returning data. Upon investigating we found
> that the behavior of GET/SCAN with a timerange filter when there are multiple
> versions of the same column lying around is inconsistent.
> The server accumulates excess versions until flush/major compaction, so by
> design there will be long periods of time where we have cells that physically
> exist but have logically versioned out and should not be visible/queryable by
> the user (at least that seems to have been the intention?). The issue looks to
> boil down to the store scanner being able to return cells that have logically
> versioned out when:
> # A timerange filter is specified AND
> # The number of cells that fall in the specified timerange which have not
> logically versioned out does not exceed the maximum number of VERSIONS
> configured on the column family.
> Take the example of a user updating the same column over time with new
> versions and occasionally running queries to get the past version of the
> column that existed at a specific point in time. This user will very
> organically run into this scenario where a cell falling in the timerange of
> interest physically exists but has logically versioned out. Whether this
> user’s timerange query returns the matching but logically versioned out cell
> and how long it continues to do so varies depending on
> * How many younger versions exist in the specified timerange (either in
> memstore or hfile)
> * How the cell got flushed - if the cell was flushed in the same batch as
> younger versions of the same column the query may return data before the
> flush and stop returning data after the flush
> * If the cell survived the flush described in the previous point, then the
> query may continue to return data until major compaction, after which it has
> physically versioned out and the query stops returning data
> More concretely, take the base case with default VERSIONS=>1 where we do two
> PUTS to the same column with PUT2 timestamp > PUT1 timestamp, and the two
> cells are flushed independently to different hfiles. We observe a few
> interesting things (hbase shell code in jira comment):
> # A query with a timerange filter including only PUT1 timestamp returns PUT1
> if executed before major compaction - we return a cell that has logically
> versioned out
> # A query to get all versions, without any timerange, only returns PUT2 - we
> respect logical versioning here and do not return the PUT1 cell
> # A query to get all versions, with a timerange filter which includes both
> PUT1 and PUT2 timestamps, only returns PUT2 - we respect logical versioning
> here
> # A query to get all versions, with a narrower timerange that includes only
> PUT1 timestamp, returns PUT1. This is odd behavior from the user's
> perspective: this query is identical to query 3 but with a time range that is
> a subinterval of the one in query 3. One would reasonably expect the result of
> the subinterval query to be a subset of the results when querying on the
> larger interval, but the results are completely disjoint in this case. To
> give a SQL example, one would not expect a SELECT * WHERE TIME < 10 to return
> anything that would not appear in SELECT * WHERE TIME < 20, which is what
> happens in our case
> # After we major compact, PUT1 has physically versioned out and query 1 will
> stop returning a result
> We have additional query nondeterminism when we have multiple versions in
> memstore. We keep all (recent) versions in memstore until flushing, and one
> can have a timerange query return logically versioned out cells while they
> are in memstore. At flush time we will flush at most VERSIONS number of cells
> - we do some “opportunistic” version pruning if we had more versions in
> memstore than needed - but this means that before the flush one can have a
> timerange query which returns data, and after the flush the same query no
> longer returns data, and the behavior is dependent on the number of versions
> that were in memstore at the time of flush.
> I am of the (possibly naive) opinion that we should not return logically
> versioned out cells by default so that query behavior is
> consistent/predictable and users can reason about how things will behave
> without deep diving HBase internals and understanding the corner cases
> involved here. I am not sure how long timerange queries have behaved this
> way, probably a long time. If we really want to preserve this behavior, then I
> think at the very least it should behave predictably: timing of PUTS/flushes
> should not change query results, and we should be clear in the docs that major
> compaction can change query results (even if you do not do any deletes).