junegunn commented on code in PR #6666:
URL: https://github.com/apache/hbase/pull/6666#discussion_r1944032580
##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/RowCounter.java:
##########
@@ -311,24 +314,40 @@ private static List<MultiRowRangeFilter.RowRange>
parseRowRangeParameter(String
/**
* Sets filter {@link FilterBase} to the {@link Scan} instance. If provided
rowRangeList contains
- * more than one element, method sets filter which is instance of {@link
MultiRowRangeFilter}.
- * Otherwise, method sets filter which is instance of {@link
FirstKeyOnlyFilter}. If rowRangeList
- * contains exactly one element, startRow and stopRow are set to the scan.
+ * more than one element, method sets filter which is instance of {@link
MultiRowRangeFilter}. If
+ * rowRangeList contains exactly one element, startRow and stopRow are set
to the scan. Also,
+ * method may apply {@link FirstKeyOnlyFilter} an {@link KeyOnlyFilter} for
better performance.
*/
private static void setScanFilter(Scan scan,
List<MultiRowRangeFilter.RowRange> rowRangeList,
boolean countDeleteMarkers) {
- final int size = rowRangeList == null ? 0 : rowRangeList.size();
- // all cells will be needed if --countDeleteMarkers flag is set, hence,
skipping filter
- if (size <= 1 && !countDeleteMarkers) {
- scan.setFilter(new FirstKeyOnlyFilter());
+ List<Filter> filters = new ArrayList<>();
+
+ // Apply filters for better performance.
+ if (!countDeleteMarkers) {
+ // We only need one cell per row, unless --countDeleteMarkers flag is
set.
+ filters.add(new FirstKeyOnlyFilter());
+
+ // We're not interested in values. Use KeyOnlyFilter.
+ // NOTE: Logically, KeyOnlyFilter should be okay for
--countDeleteMarkers, because it should
+ // only empty the values, but currently, having a filter changes the
behavior of a raw scan.
+ // So we don't use it. See {@link
UserScanQueryMatcher#mergeFilterResponse} for more details.
Review Comment:
This was a little strange to me. But well, it is how it is.
```ruby
create 't', 'd'
put 't', 'r', 'd:foo', 'bar'
now = Time.now.to_i * 1000
delete 't', 'r', 'd:foo'
delete 't', 'r', 'd:foo', now
deleteall 't', 'r'
deleteall 't', 'r', 'd:foo'
deleteall 't', 'r', 'd:foo', now
scan 't'
# ROW COLUMN+CELL
# 0 row(s)
scan 't', RAW => true
# ROW COLUMN+CELL
# r column=d:, timestamp=2025-02-06T12:02:23.136, type=DeleteFamily
# r column=d:foo, timestamp=2025-02-06T12:02:23.176, type=DeleteColumn
# r column=d:foo, timestamp=2025-02-06T12:02:23, type=DeleteColumn
# r column=d:foo, timestamp=2025-02-06T12:02:23, type=Delete
# r column=d:foo, timestamp=2025-02-06T12:02:21.837, type=Delete
# r column=d:foo, timestamp=2025-02-06T12:02:21.837, value=bar
# 1 row(s)
scan 't', RAW => true, FILTER => 'KeyOnlyFilter()'
# ROW COLUMN+CELL
# r column=d:, timestamp=2025-02-06T12:02:23.136, type=DeleteFamily
# r column=d:foo, timestamp=2025-02-06T12:02:23.176, type=DeleteColumn
# 1 row(s)
scan 't', RAW => true, FILTER => 'KeyOnlyFilter()', VERSIONS => 100
# ROW COLUMN+CELL
# r column=d:, timestamp=2025-02-06T12:02:23.136, type=DeleteFamily
# r column=d:foo, timestamp=2025-02-06T12:02:23.176, type=DeleteColumn
# r column=d:foo, timestamp=2025-02-06T12:02:23, type=DeleteColumn
# r column=d:foo, timestamp=2025-02-06T12:02:23, type=Delete
# r column=d:foo, timestamp=2025-02-06T12:02:21.837, type=Delete
# r column=d:foo, timestamp=2025-02-06T12:02:21.837, value=
# 1 row(s)
```
The thing is, `UserScanQueryMatcher#mergeFilterResponse` alters the
MatchCode. `INCLUDE` + `INCLUDE from filter` becomes `SEEK_NEXT_COL` as shown
below.
<img width="607" alt="image"
src="https://github.com/user-attachments/assets/68547274-ee44-4ed3-ac4a-7432e1dfcda7"
/>
https://github.com/apache/hbase/blob/85a8b54213ddd3220df59b6ecdf295c9f891c46f/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/querymatcher/UserScanQueryMatcher.java#L164-L166
This probably doesn't matter. Typical use cases shouldn't be affected. It's
just something odd I happened to notice.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]