Tej Meka created HBASE-27691:
--------------------------------

             Summary: Fake Cell is being passed to Filters and comparators during StoreFileScans
                 Key: HBASE-27691
                 URL: https://issues.apache.org/jira/browse/HBASE-27691
             Project: HBase
          Issue Type: Bug
          Components: scan, Scanners
    Affects Versions: 2.2.7
            Reporter: Tej Meka
         Attachments: image-2023-03-07-15-46-01-182.png, image-2023-03-07-15-50-59-696.png
I am trying to upgrade HBase (client and server) from 1.2.0 to 2.2.6 and started seeing some unexpected behavior: an ambiguous row is discovered by the filter during StoreFileScans. *Is it a valid case that filters and comparators may see a fake cell when a row is set as the (by default inclusive) start row to skip preceding rows during store file scans issued by the client?*

When rows were persisted or updated on a table through bulkload, a scan that specifies a column triggers this behavior, while a scan without columns does not.

From what I have troubleshot so far, this appears to be triggered during the [lazy seek|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L251-L256] inside [StoreScanner|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L406-L410] with the [StoreFileScanner|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java#L388-L448] implementation, where a fake cell is eventually returned as the current row on the store heap ([StoreFileScanner|https://github.com/apache/hbase/blob/rel/2.2.6/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java#L437-L444]) and is thus passed to the filter, even though it is filtered out later and not returned to the client. This was not the case with HBase 1.7.2.

I have created a couple of simple tests, using HBase 1.7.2 and HBase 2.2.6, that bulkload some sample rows to a table and create a column-specific Scan to reproduce the behavior described above. I simply copied KeyOnlyFilter, added a few loggers to catch the row keys being passed to the filter, and added a few loggers to catch the row keys returned as results on the client side.
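For context on why such a fake cell can flow through the heap without breaking ordering, here is a minimal, self-contained sketch in plain Java. It is *not* the real HBase `Cell`/`CellComparator` code; it just models the idea behind the lazy-seek optimization: a synthetic "last on row/column" key reuses the row and qualifier of the seek target but takes the oldest possible timestamp, so it sorts after every real version of that column yet before any later row, while carrying no real value bytes. All class and method names below are illustrative only.

```java
import java.util.Comparator;

public class LazySeekSketch {
    // Simplified key: row, qualifier, timestamp. Real HBase keys also carry
    // family, type byte, and sequence id.
    record Key(String row, String qualifier, long timestamp) {}

    // Rows ascending, qualifiers ascending, timestamps descending --
    // the same ordering idea HBase's cell comparator uses.
    static final Comparator<Key> ORDER = Comparator
            .comparing(Key::row)
            .thenComparing(Key::qualifier)
            .thenComparing(Comparator.comparingLong(Key::timestamp).reversed());

    // A fake "last on row/column" marker: same row and qualifier, but the
    // oldest possible timestamp, so it sorts after every real version of
    // that column. It has no value bytes -- a filter inspecting it would
    // see a row key that was never actually returned to the client.
    static Key fakeLastOnRowCol(String row, String qualifier) {
        return new Key(row, qualifier, Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        Key real = new Key("row2", "q1", 1_000L);
        Key fake = fakeLastOnRowCol("row2", "q1");

        // The fake key sorts after the real cell of the same row/column,
        // so the heap can defer the real disk seek...
        System.out.println(ORDER.compare(fake, real) > 0);
        // ...but it still sorts before any later row, keeping heap order valid.
        System.out.println(ORDER.compare(fake, new Key("row3", "q1", 1_000L)) < 0);
    }
}
```

Under this model, the fake key is a correct placeholder for heap ordering purposes, but any component that receives it (such as a filter) is looking at a cell that does not correspond to data actually returned to the client, which matches the behavior described above.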
Here is my working repo that demonstrates this diverged behavior: [hbase-scans|https://github.com/tejkiran/hbase-scans]. I have a mapper that creates Puts with row keys 0, 2, 3 and bulkloads those rows to a table. When a scan is issued with the 2.2.6 HBase API, the start row on the Scan is passed to the filter during server-side execution.

Screenshot of the row keys discovered in the filter during server-side execution with 2.2.6:

!image-2023-03-07-15-46-01-182.png!

Screenshot of the row keys discovered in the filter with HBase 1.7.2:

!image-2023-03-07-15-50-59-696.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)