Tej Meka created HBASE-27691:
--------------------------------
Summary: Fake Cell is being passed to Filters and comparators
during StoreFileScans
Key: HBASE-27691
URL: https://issues.apache.org/jira/browse/HBASE-27691
Project: HBase
Issue Type: Bug
Components: scan, Scanners
Affects Versions: 2.2.7
Reporter: Tej Meka
Attachments: image-2023-03-07-15-46-01-182.png,
image-2023-03-07-15-50-59-696.png
I am trying to upgrade HBase (client and server) from 1.2.0 to 2.2.6 and have
started seeing some unexpected behavior around the discovery of an ambiguous
row in a filter during StoreFileScans.
*Is it a valid case that filters and comparators might see a fake cell passed
to them when that row is set as the (inclusive by default) start row used to
skip preceding rows during store file scans, during client side execution?*
When rows were persisted or updated on a table through bulkload, it looks like
a scan with a specific column triggers this behavior, while a scan without
columns does not.
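Roughly, the two scan variants I am comparing look like the sketch below, assuming an already-open Connection named connection; the table, family and qualifier names and the start row are placeholders, not anything from HBase itself:

{code:java}
// Sketch of the two scan variants: only the addColumn() call differs.
// Table/family/qualifier names and the start row are placeholders.
byte[] startRow = Bytes.toBytes("1");   // start row is inclusive by default

Scan columnScan = new Scan()
    .withStartRow(startRow)
    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"))   // triggers the behavior
    .setFilter(new KeyOnlyFilter());

Scan fullRowScan = new Scan()
    .withStartRow(startRow)
    .setFilter(new KeyOnlyFilter());                          // does not trigger it

try (Table table = connection.getTable(TableName.valueOf("test_table"));
     ResultScanner rs = table.getScanner(columnScan)) {
  for (Result r : rs) {
    System.out.println("client saw row: " + Bytes.toStringBinary(r.getRow()));
  }
}
{code}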
From what I have troubleshot so far, it looks like this is triggered during the
[lazy scan|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L251-L256]
inside
[StoreScanner|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L406-L410]
with the
[StoreFileScanner|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java#L388-L448]
implementation, where it eventually returns a fake cell as the current row on
the store heap
([StoreFileScanner|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java#L437-L444]).
That fake cell is passed to the filter, but it is actually filtered out later
and not returned to the client.
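For illustration only, a fake key like this can be built with the public KeyValueUtil helpers; this is not necessarily the exact factory used at the linked StoreFileScanner lines, but it shows why a filter would see a cell that carries only a row key, with empty family, qualifier and value:

{code:java}
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValueUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class FakeCellDemo {
  public static void main(String[] args) {
    // Build a synthetic "first on row" cell for an arbitrary row key.
    // "1" is a placeholder - in my test data only rows 0, 2 and 3 exist.
    Cell fake = KeyValueUtil.createFirstOnRow(Bytes.toBytes("1"));

    // The fake cell carries only the row key; family, qualifier and value
    // are all empty, which is how the extra row keys surfacing in the
    // filter differ from real cells.
    System.out.println("row       = " + Bytes.toStringBinary(CellUtil.cloneRow(fake)));
    System.out.println("qualifier = " + Bytes.toStringBinary(CellUtil.cloneQualifier(fake)));
    System.out.println("value len = " + fake.getValueLength());
  }
}
{code}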
This was not the case with HBase 1.7.2. I have created a couple of simple tests
using HBase 1.7.2 and HBase 2.2.6 that bulkload some sample rows to a table and
create a column-specific Scan to reproduce the behavior described above.
I have simply copied KeyOnlyFilter, added a few loggers to capture the row keys
being passed to the filter, and added a few loggers to capture the row keys
returned as a result on the client side.
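Roughly, the instrumentation looks like the sketch below; the class name and log format here are placeholders, and the actual test copies KeyOnlyFilter so that the protobuf toByteArray/parseFrom plumbing needed to ship the filter to the region server is already in place (omitted here):

{code:java}
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of a FilterBase subclass that logs every row key it is asked to
// evaluate on the region server. The fake cell's row key shows up here even
// though the corresponding row is never returned to the client.
public class LoggingKeyOnlyFilter extends FilterBase {

  @Override
  public ReturnCode filterCell(Cell cell) {
    System.out.println("filter saw row: "
        + Bytes.toStringBinary(CellUtil.cloneRow(cell)));
    return ReturnCode.INCLUDE;
  }
}
{code}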
Here is my working repo that demonstrates this divergent behavior:
[hbase-scans|https://github.com/tejkiran/hbase-scans]
I have a mapper that creates Puts with row keys 0, 2, 3 and bulkloads those
rows to the table. When a scan is issued with the 2.2.6 HBase API, it passes
that start row on the Scan to the filter during server side execution.
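For reference, a condensed sketch of that mapper; the family, qualifier and value below are placeholders, and its output would then feed an HFileOutputFormat2-style bulk load (the exact job wiring is in the hbase-scans repo):

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper emitting Puts for row keys such as "0", "2" and "3".
// Family/qualifier/value are placeholders; the real mapper and job wiring
// live in the hbase-scans repo.
public class SampleRowMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    byte[] rowKey = Bytes.toBytes(value.toString().trim());   // e.g. "0", "2", "3"
    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("v"));
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}
{code}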
Screenshot of the row keys discovered in the filter during server side execution:
!image-2023-03-07-15-46-01-182.png!
Screenshot of the row keys discovered in the filter with HBase 1.7.2:
!image-2023-03-07-15-50-59-696.png!