JinHyuk Kim created HBASE-30115:
-----------------------------------

             Summary: Introduce approximate progress estimation for 
TableRecordReader based on row key position
                 Key: HBASE-30115
                 URL: https://issues.apache.org/jira/browse/HBASE-30115
             Project: HBase
          Issue Type: Task
          Components: mapreduce
            Reporter: JinHyuk Kim
            Assignee: JinHyuk Kim
         Attachments: mapreduce-progress-0.png, mapreduce-progress-after.png

h1. Background

Currently, {{TableRecordReaderImpl.getProgress()}} always returns {*}0{*}, 
providing no progress feedback to the MapReduce framework. This makes it 
impossible for users to monitor scan progress during long-running jobs.

!mapreduce-progress-0.png|width=1095,height=236!

 
h1. Suggestion

This patch estimates progress by converting row keys to numeric values and 
computing the fraction of the key space covered so far: 
{{(current - start) / (stop - start)}}.
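
As an illustration of the formula above, here is a minimal Java sketch. It is not the code from the patch: the class name {{ByteProgressSketch}} and its methods are hypothetical. It assumes keys are read as unsigned big-endian numbers after right-padding shorter keys with zero bytes, and that the result is clamped to the range 0 to 1.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.util.Arrays;

/** Hypothetical sketch of byte-based progress; names are illustrative only. */
final class ByteProgressSketch {

  /** Reads the first {@code width} bytes of a key as an unsigned big-endian integer. */
  static BigInteger toNumber(byte[] key, int width) {
    // copyOf right-pads short keys with 0x00, which preserves byte-wise ordering.
    byte[] padded = Arrays.copyOf(key, width);
    return new BigInteger(1, padded); // signum 1 => bytes are treated as unsigned
  }

  /** Returns (current - start) / (stop - start), clamped to [0, 1]. */
  static float progress(byte[] start, byte[] stop, byte[] current) {
    int width = Math.max(start.length, Math.max(stop.length, current.length));
    BigInteger s = toNumber(start, width);
    BigInteger e = toNumber(stop, width);
    BigInteger c = toNumber(current, width);

    BigInteger range = e.subtract(s);
    if (range.signum() <= 0) {
      return 0f; // empty or inverted range: assumption, report no progress
    }
    double fraction = new BigDecimal(c.subtract(s))
        .divide(new BigDecimal(range), MathContext.DECIMAL64)
        .doubleValue();
    return (float) Math.max(0.0, Math.min(1.0, fraction));
  }
}
```

With single-byte keys {{a}}, {{c}} and a current key of {{b}}, this sketch computes (98 - 97) / (99 - 97) = 0.5.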

Since the {{TableInputFormat}} splitter sets start/stop row keys from region 
boundaries, a boundary is empty only for the table's very first region (empty 
start) or last region (empty stop). In those cases, we *probe* the table with a 
forward scan (for an empty start) or a reverse scan (for an empty stop), 
limited to one row, to discover the actual boundary row key.
                                                                                
The implementation is pluggable via the 
{{hbase.mapreduce.rowkey.progress.class}} configuration property:
 * {{ByteBasedRowKeyProgress}} (default) : treats row keys as raw bytes. Works 
well for most key designs.
 * {{HexPrefixRowKeyProgress}} : interprets leading bytes as hex characters 
([0-9a-f]). Gives accurate linear progress for tables using hex-encoded hash 
prefixes (e.g. MD5). The raw byte approach is inaccurate for hex keys because 
there are large byte gaps between '9'→'a' (0x39→0x61) and between "0f"→"10" 
(0x3066→0x3130) that don't correspond to actual key distance. The prefix length 
is configurable via {{hbase.mapreduce.rowkey.progress.hex.prefix.length}} 
(default 4). Bytes beyond the prefix are ignored, so non-hex suffixes do not 
affect progress.
 * Users can implement the {{RowKeyProgress}} interface for custom key encoding 
strategies.
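
To make the hex-gap argument above concrete, the following Java sketch compares a hex-aware estimate with a raw-byte estimate over the same prefix. It is an illustration under stated assumptions, not the patch's {{HexPrefixRowKeyProgress}}: the class {{HexPrefixProgressSketch}} and all of its methods are hypothetical, missing prefix bytes are assumed to count as zero, and a non-hex character inside the prefix is assumed to be an error.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical comparison of hex-prefix vs. raw-byte progress; illustrative only. */
final class HexPrefixProgressSketch {

  /** Parses the first prefixLength bytes as hex digits (prefixLength <= 15 to fit a long). */
  static long hexPrefixValue(byte[] key, int prefixLength) {
    long value = 0;
    for (int i = 0; i < prefixLength; i++) {
      int digit = 0; // assumption: bytes missing from a short key count as '0'
      if (i < key.length) {
        digit = Character.digit((char) (key[i] & 0xff), 16);
        if (digit < 0) {
          throw new IllegalArgumentException("non-hex byte at index " + i);
        }
      }
      value = value * 16 + digit;
    }
    return value;
  }

  /** Reads the first prefixLength bytes as a base-256 number (prefixLength <= 7). */
  static long rawPrefixValue(byte[] key, int prefixLength) {
    long value = 0;
    for (int i = 0; i < prefixLength; i++) {
      value = value * 256 + (i < key.length ? (key[i] & 0xff) : 0);
    }
    return value;
  }

  /** (current - start) / (stop - start), clamped to [0, 1]. */
  static float fraction(long start, long stop, long current) {
    if (stop <= start) {
      return 0f;
    }
    double f = (double) (current - start) / (double) (stop - start);
    return (float) Math.max(0.0, Math.min(1.0, f));
  }

  static float hexProgress(byte[] start, byte[] stop, byte[] current, int prefixLength) {
    return fraction(hexPrefixValue(start, prefixLength),
        hexPrefixValue(stop, prefixLength),
        hexPrefixValue(current, prefixLength));
  }

  static float rawProgress(byte[] start, byte[] stop, byte[] current, int prefixLength) {
    return fraction(rawPrefixValue(start, prefixLength),
        rawPrefixValue(stop, prefixLength),
        rawPrefixValue(current, prefixLength));
  }

  public static void main(String[] args) {
    byte[] start = "0000".getBytes(StandardCharsets.US_ASCII);
    byte[] stop = "1000".getBytes(StandardCharsets.US_ASCII);
    byte[] current = "0800".getBytes(StandardCharsets.US_ASCII);
    System.out.println("hex: " + hexProgress(start, stop, current, 4));
    System.out.println("raw: " + rawProgress(start, stop, current, 4));
  }
}
```

For a split covering {{0000}} to {{1000}} with the scan at {{0800}}, the hex-aware estimate is 0x0800 / 0x1000 = 0.5, while the raw-byte estimate over the same four bytes is 0x00080000 / 0x01000000 = 0.03125, which shows how far the byte gaps can skew the reported progress.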

After this change, job progress is visible while the scan runs, as shown below.
 
!mapreduce-progress-after.png|width=1792,height=119!
 


