Nick Dimiduk created HBASE-30092:
------------------------------------
Summary: MonitoredRPCHandlerImpl.generateCallInfoMap() can OOM the
RegionServer heartbeat path when an in-flight RPC request protobuf is large
Key: HBASE-30092
URL: https://issues.apache.org/jira/browse/HBASE-30092
Project: HBase
Issue Type: Bug
Components: monitoring, regionserver
Affects Versions: 2.5.14, 2.6.5
Reporter: Nick Dimiduk
A burst of large {{MultiRequest}} / {{ScanRequest}} RPCs on a RegionServer with
a tight heap crashed the JVM with {{OutOfMemoryError}} from the periodic
RegionServer-to-HMaster heartbeat path. The offending allocation is inside
{{MonitoredRPCHandlerImpl.generateCallInfoMap()}}, which calls
{{AbstractMessage.toString()}} on the live in-flight RPC request protobuf.
{{TextFormat$Printer}} recursively appends to a {{StringBuilder}} that has to
grow past available heap.
{noformat}
java.lang.OutOfMemoryError
at java.util.Arrays.copyOf
at java.lang.AbstractStringBuilder.ensureCapacityInternal
at java.lang.StringBuilder.append (CharSequence)
at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$TextGenerator.eol
at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$Printer.printSingleField
at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$Printer.printField
at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$Printer.printMessage
... [recursive printMessage / printField for nested protobuf fields] ...
at org.apache.hbase.thirdparty.com.google.protobuf.AbstractMessage.toString (AbstractMessage.java:91)
at org.apache.hadoop.hbase.monitoring.MonitoredRPCHandlerImpl.generateCallInfoMap (MonitoredRPCHandlerImpl.java:253)
at org.apache.hadoop.hbase.monitoring.MonitoredRPCHandlerImpl.clone (MonitoredRPCHandlerImpl.java:57)
at org.apache.hadoop.hbase.monitoring.TaskMonitor.processTasks (TaskMonitor.java:229)
at org.apache.hadoop.hbase.monitoring.TaskMonitor.getTasks (TaskMonitor.java:183)
at org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad (HRegionServer.java:1566)
at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport
at org.apache.hadoop.hbase.regionserver.HRegionServer.run
{noformat}
{{tryRegionServerReport}} runs on the main RegionServer thread at
{{hbase.regionserver.msginterval}} (default 3 s). An OOM there terminates the
JVM under {{-XX:OnOutOfMemoryError}}, or aborts the heartbeat otherwise. On a
heap sized close to the working set, a brief write slowdown that causes
handlers to back up is enough to grow per-handler scanner retention and push
{{generateCallInfoMap()}} over the edge. Auto-replacement by an external pool
manager doesn't break the loop — each replacement RS inherits the same client
traffic and crashes again within 15–60 seconds of registering.
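To illustrate the failure mode described above, here is a minimal sketch of rendering text into a size-capped {{Appendable}} instead of an unbounded {{StringBuilder}}. It is not taken from HBase and is not a proposed patch: {{BoundedAppendable}}, {{Renderer}}, {{LimitReachedException}} and the truncation marker are all hypothetical names chosen for this example.

```java
import java.io.IOException;

/** Hypothetical helper (not part of HBase): an Appendable that refuses to grow past a cap. */
final class BoundedAppendable implements Appendable {

  /** Signals that the cap was hit so the caller can stop rendering early. */
  public static final class LimitReachedException extends IOException {
    LimitReachedException(int limit) {
      super("text rendering stopped at " + limit + " chars");
    }
  }

  /** Stand-in for any printer that writes to an Appendable. */
  public interface Renderer {
    void renderTo(Appendable out) throws IOException;
  }

  private final StringBuilder sb;
  private final int limit;

  BoundedAppendable(int limit) {
    this.limit = limit;
    this.sb = new StringBuilder(Math.min(limit, 256));
  }

  @Override
  public Appendable append(CharSequence csq) throws IOException {
    CharSequence s = (csq == null) ? "null" : csq;
    return append(s, 0, s.length());
  }

  @Override
  public Appendable append(CharSequence csq, int start, int end) throws IOException {
    CharSequence s = (csq == null) ? "null" : csq;
    int room = limit - sb.length();
    if (end - start > room) {
      sb.append(s, start, start + room); // keep only the prefix that still fits
      throw new LimitReachedException(limit);
    }
    sb.append(s, start, end);
    return this;
  }

  @Override
  public Appendable append(char c) throws IOException {
    if (sb.length() >= limit) {
      throw new LimitReachedException(limit);
    }
    sb.append(c);
    return this;
  }

  @Override
  public String toString() {
    return sb.toString();
  }

  /** Runs the renderer and returns at most {@code limit} chars, plus a marker if truncated. */
  static String render(Renderer r, int limit) {
    BoundedAppendable out = new BoundedAppendable(limit);
    try {
      r.renderTo(out);
      return out.toString();
    } catch (LimitReachedException e) {
      return out + "...[truncated]";
    } catch (IOException e) {
      return "<render failed: " + e.getMessage() + ">";
    }
  }
}
```

A caller would pass a lambda as the {{Renderer}}, for example {{BoundedAppendable.render(out -> out.append(text), 1024)}}. Whether this fits {{generateCallInfoMap()}} depends on an assumption that should be checked against the shaded protobuf version in use: that its text printer offers an overload writing to an {{Appendable}}, and that it stops cleanly when that {{Appendable}} throws {{IOException}}.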