Charles Connell created HBASE-28905:
---------------------------------------
Summary: Skip excessive evaluations of LINK_NAME_PATTERN and
REF_NAME_PATTERN regular expressions
Key: HBASE-28905
URL: https://issues.apache.org/jira/browse/HBASE-28905
Project: HBase
Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Charles Connell
Assignee: Charles Connell
To test if a file is a link file, HBase checks if its file name matches the
regex
{code:java}
^(?:((?:[_\p{Digit}\p{IsAlphabetic}]+))(?:\=))?((?:[_\p{Digit}\p{IsAlphabetic}][-_.\p{Digit}\p{IsAlphabetic}]*))=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$
{code}
To test if an HFile has a "reference name," HBase checks if its file name
matches the regex
{code:java}
^([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?|^(?:((?:[_\p{Digit}\p{IsAlphabetic}]+))(?:\=))?((?:[_\p{Digit}\p{IsAlphabetic}][-_.\p{Digit}\p{IsAlphabetic}]*))=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$)\.(.+)$
{code}
Matching against these big regexes is computationally expensive. HBASE-27474
introduced (in 2.6.0) code in a hot path in HFileReaderImpl that checks whether
an HFile is a link or reference file while deciding whether to cache blocks
from that file. In flamegraphs taken at my company during performance tests,
these regex evaluations took 2-3% of the CPU time on a busy RegionServer.
The hot-path invocation of the regexes was subsequently removed by HBASE-28596
in branch-2 and later, but not in branch-2.6, so only the 2.6.x series suffers
the performance regression. Nonetheless, all invocations of these regexes are
still unnecessarily expensive and can easily be made to fail fast.
The link name pattern contains a literal "=", so any string that does not
contain a "=" cannot match the regex. The reference name pattern contains a
literal ".", so any string that does not contain a "." cannot match the regex.
Checking for these characters before evaluating the regex is mostly helpful in
2.6.x, but is valid in all branches.
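A minimal sketch of the fast-fail check described above. The class and method names (FastFailNameCheck, isLinkName, isReferenceName) are hypothetical and are not the actual HBase methods; the two patterns are the ones quoted earlier in this issue, composed here from shared pieces for readability.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the actual HBase code.
final class FastFailNameCheck {
  // HFile name piece shared by both regexes quoted in this issue.
  static final String HFILE_NAME = "[0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?";

  // Link name regex without anchors: [namespace=]table=region-hfile
  static final String LINK_NAME =
      "(?:((?:[_\\p{Digit}\\p{IsAlphabetic}]+))(?:\\=))?"
          + "((?:[_\\p{Digit}\\p{IsAlphabetic}][-_.\\p{Digit}\\p{IsAlphabetic}]*))"
          + "=((?:[a-f0-9]+))-(" + HFILE_NAME + ")";

  static final Pattern LINK_NAME_PATTERN = Pattern.compile("^" + LINK_NAME + "$");

  static final Pattern REF_NAME_PATTERN =
      Pattern.compile("^(" + HFILE_NAME + "|^" + LINK_NAME + "$)\\.(.+)$");

  private FastFailNameCheck() {
  }

  static boolean isLinkName(String fileName) {
    // The pattern requires a literal '=', so a name without one cannot match.
    if (fileName.indexOf('=') < 0) {
      return false;
    }
    return LINK_NAME_PATTERN.matcher(fileName).matches();
  }

  static boolean isReferenceName(String fileName) {
    // The pattern requires a literal '.', so a name without one cannot match.
    if (fileName.indexOf('.') < 0) {
      return false;
    }
    return REF_NAME_PATTERN.matcher(fileName).matches();
  }

  public static void main(String[] args) {
    // Example file names are made up for illustration.
    String[] names = {
      "0123abcd", "mytable=abcdef0123456789-0123abcd", "0123abcd.9f8e7d6c"
    };
    for (String name : names) {
      System.out.println(name + " link=" + isLinkName(name)
          + " reference=" + isReferenceName(name));
    }
  }
}
```

Plain HFile names, which contain neither "=" nor ".", return from both checks after a single indexOf scan and never reach the regex engine. Names that do contain the character are still validated by the full regex, so the results are unchanged.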
In performance tests, this optimization removed the regex evaluations from my
flamegraphs entirely and reduced query latency by 15%. Some charts are
attached.