Copilot commented on code in PR #8134:
URL: https://github.com/apache/hbase/pull/8134#discussion_r3302733255
##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableRecordReaderImpl.java:
##########
@@ -141,9 +144,67 @@ public void initialize(InputSplit inputsplit,
TaskAttemptContext context)
if (context != null) {
this.context = context;
}
+ initProgressBounds();
restart(scan.getStartRow());
}
+ /**
+ * Resolve the start/stop row keys used for progress estimation. The
TableInputFormat splitter
+ * sets start and stop row keys from region boundaries, so they are only
empty for the table's
+ * very first region (empty start) or last region (empty stop). In those
cases, probe the table to
+ * discover the actual first or last row key as an approximation.
+ */
+ private void initProgressBounds() {
+ if (context == null) {
+ return;
+ }
+ byte[] startRow = scan.getStartRow();
+ byte[] stopRow = scan.getStopRow();
+ if (startRow == null || startRow.length == 0) {
+ startRow = probeFirstRow();
+ }
+ if (stopRow == null || stopRow.length == 0) {
+ stopRow = probeLastRow();
+ }
+ Configuration conf = context.getConfiguration();
+ Class<? extends RowKeyProgress> progressClass =
conf.getClass(RowKeyProgress.PROGRESS_CLASS_KEY,
+ UniformRowKeyProgress.class, RowKeyProgress.class);
+ rowKeyProgress = ReflectionUtils.newInstance(progressClass, conf);
+ rowKeyProgress.setStartStopRows(startRow, stopRow);
+ }
+
+ private byte[] probeFirstRow() {
+ try {
+ Scan probeScan = new Scan(scan);
+ probeScan.setOneRowLimit();
+ try (ResultScanner probeScanner = htable.getScanner(probeScan)) {
+ Result result = probeScanner.next();
+ return result != null ? result.getRow() : null;
+ }
+ } catch (IOException e) {
+ LOG.warn("Failed to probe first row for progress estimation", e);
+ return null;
+ }
+ }
+
+ private byte[] probeLastRow() {
+ try {
+ Scan probeScan = new Scan(scan);
+ // Only called for the last region, so swap row bounds for the reversed
scan.
+ probeScan.withStartRow(HConstants.EMPTY_START_ROW);
+ probeScan.withStopRow(scan.getStartRow(), scan.includeStartRow());
+ probeScan.setReversed(true);
+ probeScan.setOneRowLimit();
+ try (ResultScanner probeScanner = htable.getScanner(probeScan)) {
+ Result result = probeScanner.next();
+ return result != null ? result.getRow() : null;
+ }
Review Comment:
probeFirstRow/probeLastRow create the probe Scan by copying the full user
scan (families/qualifiers, result size, caching, cacheBlocks, filters, etc.).
Even with setOneRowLimit(), this can still fetch an entire (potentially huge)
row just to learn the row key, which is avoidable overhead during task init.
Consider configuring the probe scan to be as lightweight as possible (e.g.,
caching=1, cacheBlocks=false, maxResultSize very small, and a
KeyOnly/FirstKeyOnly filter) so progress estimation does not materially
increase load or risk OOM on wide rows.
##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/UniformRowKeyProgress.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.mapreduce;
+
+import org.apache.yetus.audience.InterfaceAudience;
+
+/**
+ * {@link RowKeyProgress} implementation that treats row keys as raw byte
sequences. Converts the
+ * leading bytes to a big-endian unsigned numeric value and computes progress
as a linear fraction
+ * of the key space.
+ */
[email protected]
+public class UniformRowKeyProgress implements RowKeyProgress {
+ private static final int BYTES_FOR_PROGRESS = Double.BYTES;
+
+ private double start;
+ private double stop;
+
+ @Override
+ public void setStartStopRows(byte[] startRow, byte[] stopRow) {
+ this.start = rowKeyToDouble(startRow);
+ this.stop = rowKeyToDouble(stopRow);
+ }
+
+ @Override
+ public float getProgress(byte[] currentRow) {
+ if (currentRow == null || stop <= start) {
+ return 0.0f;
+ }
+ double current = rowKeyToDouble(currentRow);
+ float progress = (float) ((current - start) / (stop - start));
+ return Math.min(1.0f, Math.max(0.0f, progress));
Review Comment:
UniformRowKeyProgress converts the first 8 bytes into a double. A double
only has ~53 bits of integer precision, so many distinct 8-byte prefixes (and
all bytes beyond the first ~6–7 significant bytes) will collapse to the same
numeric value. This can make `stop <= start` even when the actual start/stop
row keys differ (e.g., when keys share the same leading bytes and vary later),
causing progress to be stuck at 0 for an entire split. Consider limiting the
interpreted prefix to what a double can represent reliably (<= 6–7 bytes)
and/or using a common-prefix+padding approach (similar to
HexStringRowKeyProgress) so start/stop can still be distinguished when possible.
##########
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestUniformRowKeyProgress.java:
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.mapreduce;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import org.apache.hadoop.hbase.testclassification.MapReduceTests;
+import org.apache.hadoop.hbase.testclassification.SmallTests;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+
+@Tag(MapReduceTests.TAG)
+@Tag(SmallTests.TAG)
+public class TestUniformRowKeyProgress {
+ private static RowKeyProgress create(byte[] start, byte[] stop) {
+ RowKeyProgress p = new UniformRowKeyProgress();
+ p.setStartStopRows(start, stop);
+ return p;
+ }
+
+ @Test
+ public void testNullCurrentRow() {
+ assertEquals(0.0f, create(Bytes.toBytes("a"),
Bytes.toBytes("z")).getProgress(null));
+ }
+
+ @Test
+ public void testNullStopRow() {
+ assertEquals(0.0f, create(Bytes.toBytes("a"),
null).getProgress(Bytes.toBytes("m")));
+ }
+
+ @Test
+ public void testEmptyStopRow() {
+ assertEquals(0.0f, create(Bytes.toBytes("a"), new
byte[0]).getProgress(Bytes.toBytes("m")));
+ }
+
+ @Test
+ public void testMidpoint() {
+ assertEquals(0.5f, create(new byte[] { 0x00 }, new byte[] { (byte) 0xFF })
+ .getProgress(new byte[] { (byte) 0x80 }), 0.01f);
+ }
+
+ @Test
+ public void testAtStart() {
+ assertEquals(0.0f,
+ create(Bytes.toBytes("aaa"),
Bytes.toBytes("zzz")).getProgress(Bytes.toBytes("aaa")), 0.001f);
+ }
+
+ @Test
+ public void testNearMid() {
+ assertEquals(0.48f,
+ create(Bytes.toBytes("aaa"),
Bytes.toBytes("zzz")).getProgress(Bytes.toBytes("mmm")), 0.01f);
+ }
+
+ @Test
+ public void testNearEnd() {
+ assertEquals(1.0f,
+ create(Bytes.toBytes("aaa"),
Bytes.toBytes("zzz")).getProgress(Bytes.toBytes("zzy")), 0.01f);
+ }
+
+ @Test
+ public void testEmptyStartRow() {
+ assertEquals(0.5f,
+ create(new byte[0], new byte[] { (byte) 0xFF }).getProgress(new byte[] {
(byte) 0x80 }),
+ 0.01f);
+ }
+
+ @Test
+ public void testNullStartRow() {
+ assertEquals(0.5f,
+ create(null, new byte[] { (byte) 0xFF }).getProgress(new byte[] { (byte)
0x80 }), 0.01f);
+ }
+
+ @Test
+ public void testProgressNeverExceedsOne() {
+ assertEquals(1.0f,
+ create(Bytes.toBytes("aaa"),
Bytes.toBytes("mmm")).getProgress(Bytes.toBytes("zzz")));
+ }
+
+ @Test
+ public void testProgressNeverBelowZero() {
+ assertEquals(0.0f,
+ create(Bytes.toBytes("mmm"),
Bytes.toBytes("zzz")).getProgress(Bytes.toBytes("aaa")));
+ }
Review Comment:
The current test suite does not cover cases where start/stop/current row
keys share a long common prefix (e.g., identical first 8 bytes) and differ only
in later bytes. With the current numeric conversion approach, those cases can
collapse to the same value and make progress un-estimatable. Adding a
regression test with >8-byte keys that differ after the shared prefix would
help prevent progress being stuck at 0 for such common row-key designs.
##########
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HexStringRowKeyProgress.java:
##########
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.mapreduce;
+
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.yetus.audience.InterfaceAudience;
+
+/**
+ * {@link RowKeyProgress} implementation for hex-encoded row keys (e.g.
MD5/SHA prefixes). Non-hex
+ * bytes contribute zero.
+ */
[email protected]
+public class HexStringRowKeyProgress implements RowKeyProgress {
+ /**
+ * Cap on hex characters interpreted. A {@code double} mantissa carries ~53
bits (~13 hex chars);
+ * reading more adds no information and risks precision loss.
+ */
+ private static final int MAX_PREFIX_LENGTH = 13;
+
+ /**
+ * Hex characters past the start/stop divergence point to include for
resolution. 4 hex chars =
+ * 65,536 buckets, finer than any progress bar can display.
+ */
+ private static final int RESOLUTION_PADDING = 4;
+
+ private int prefixLength;
+ private double start;
+ private double stop;
+
+ @Override
+ public void setStartStopRows(byte[] startRow, byte[] stopRow) {
+ int common = commonPrefixLength(startRow, stopRow);
+ this.prefixLength = Math.min(common + RESOLUTION_PADDING,
MAX_PREFIX_LENGTH);
+ this.start = hexPrefixToDouble(startRow);
+ this.stop = hexPrefixToDouble(stopRow);
+ }
+
+ @Override
+ public float getProgress(byte[] currentRow) {
+ if (currentRow == null || stop <= start) {
+ return 0.0f;
+ }
+ double current = hexPrefixToDouble(currentRow);
+ float progress = (float) ((current - start) / (stop - start));
+ return Math.min(1.0f, Math.max(0.0f, progress));
+ }
+
+ private static int commonPrefixLength(byte[] a, byte[] b) {
+ if (a == null || b == null) {
+ return 0;
+ }
+ return Bytes.findCommonPrefix(a, b, a.length, b.length, 0, 0);
+ }
+
+ private double hexPrefixToDouble(byte[] row) {
+ if (row == null) {
+ return 0;
+ }
+ int len = Math.min(prefixLength, row.length);
+ double d = 0;
+ for (int i = 0; i < prefixLength; i++) {
+ d *= 16;
+ if (i < len) {
+ d += hexCharToInt(row[i]);
+ }
+ }
+ return d;
Review Comment:
hexPrefixToDouble keeps consuming bytes even after encountering a non-hex
separator (non-hex chars are treated as 0, but the method continues and later
bytes that happen to be valid hex (a-f/A-F/0-9) will still affect the computed
value). For common key patterns like `80:<suffix>`, this can make progress
depend on the suffix content, which is surprising and contradicts the
expectation that only the hex prefix drives ordering. Consider stopping parsing
once a non-hex byte is seen (or explicitly stopping at a delimiter such as ':')
so the suffix cannot influence progress.
##########
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestHexStringRowKeyProgress.java:
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.mapreduce;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import org.apache.hadoop.hbase.testclassification.MapReduceTests;
+import org.apache.hadoop.hbase.testclassification.SmallTests;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+
+@Tag(MapReduceTests.TAG)
+@Tag(SmallTests.TAG)
+public class TestHexStringRowKeyProgress {
+ private static RowKeyProgress create(byte[] start, byte[] stop) {
+ HexStringRowKeyProgress p = new HexStringRowKeyProgress();
+ p.setStartStopRows(start, stop);
+ return p;
+ }
+
+ @Test
+ public void testNullCurrentRow() {
+ assertEquals(0.0f, create(Bytes.toBytes("00"),
Bytes.toBytes("ff")).getProgress(null));
+ }
+
+ @Test
+ public void testMidpoint() {
+ assertEquals(0.5f,
+ create(Bytes.toBytes("0000"),
Bytes.toBytes("ffff")).getProgress(Bytes.toBytes("8000")),
+ 0.01f);
+ }
+
+ @Test
+ public void testQuarterPoint() {
+ assertEquals(0.25f,
+ create(Bytes.toBytes("0000"),
Bytes.toBytes("ffff")).getProgress(Bytes.toBytes("4000")),
+ 0.01f);
+ }
+
+ @Test
+ public void testAcross9ToAGap() {
+ RowKeyProgress p = create(Bytes.toBytes("00"), Bytes.toBytes("ff"));
+ float at0f = p.getProgress(Bytes.toBytes("0f"));
+ float at10 = p.getProgress(Bytes.toBytes("10"));
+ assertEquals(1.0f / 255, at10 - at0f, 0.001f);
+ }
+
+ @Test
+ public void testProgressNeverExceedsOne() {
+ assertEquals(1.0f,
+ create(Bytes.toBytes("0000"),
Bytes.toBytes("8000")).getProgress(Bytes.toBytes("ffff")));
+ }
+
+ @Test
+ public void testProgressNeverBelowZero() {
+ assertEquals(0.0f,
+ create(Bytes.toBytes("8000"),
Bytes.toBytes("ffff")).getProgress(Bytes.toBytes("0000")));
+ }
+
+ @Test
+ public void testNonHexSuffixIgnored() {
+ RowKeyProgress p = create(Bytes.toBytes("00"), Bytes.toBytes("ff"));
+ float progressA = p.getProgress(Bytes.toBytes("80:dataA"));
+ float progressB = p.getProgress(Bytes.toBytes("80:dataZ"));
Review Comment:
testNonHexSuffixIgnored does not actually demonstrate that suffix content is
ignored: with the current prefixLength calculation (4 for start=00/stop=ff),
the 3rd/4th characters of `80:data*` (":d") are still part of what gets parsed,
and both inputs share those characters. If the intent is that anything after a
non-hex separator must not affect progress, update this test to use suffixes
whose early characters are valid hex digits (e.g. `80:0...` vs `80:f...`) and
assert they produce the same progress, or rename the test to match the actual
behavior being validated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]