Nick Dimiduk created HBASE-30092:
------------------------------------

             Summary: MonitoredRPCHandlerImpl.generateCallInfoMap() can OOM the 
RegionServer heartbeat path when an in-flight RPC request protobuf is large
                 Key: HBASE-30092
                 URL: https://issues.apache.org/jira/browse/HBASE-30092
             Project: HBase
          Issue Type: Bug
          Components: monitoring, regionserver
    Affects Versions: 2.5.14, 2.6.5
            Reporter: Nick Dimiduk


A burst of large {{MultiRequest}} / {{ScanRequest}} RPCs on a RegionServer 
with little heap headroom crashed the JVM with {{OutOfMemoryError}} on the 
periodic RegionServer-to-HMaster heartbeat path. The offending allocation is 
inside {{MonitoredRPCHandlerImpl.generateCallInfoMap()}}, which calls 
{{AbstractMessage.toString()}} on the live in-flight RPC request protobuf. 
{{TextFormat$Printer}} recursively appends to a {{StringBuilder}} whose 
backing array must be copied into a larger one each time it grows, until the 
next copy no longer fits in the available heap.

{noformat}
java.lang.OutOfMemoryError
  at java.util.Arrays.copyOf
  at java.lang.AbstractStringBuilder.ensureCapacityInternal
  at java.lang.StringBuilder.append (CharSequence)
  at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$TextGenerator.eol
  at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$Printer.printSingleField
  at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$Printer.printField
  at org.apache.hbase.thirdparty.com.google.protobuf.TextFormat$Printer.printMessage
  ... [recursive printMessage / printField for nested protobuf fields] ...
  at org.apache.hbase.thirdparty.com.google.protobuf.AbstractMessage.toString (AbstractMessage.java:91)
  at org.apache.hadoop.hbase.monitoring.MonitoredRPCHandlerImpl.generateCallInfoMap (MonitoredRPCHandlerImpl.java:253)
  at org.apache.hadoop.hbase.monitoring.MonitoredRPCHandlerImpl.clone (MonitoredRPCHandlerImpl.java:57)
  at org.apache.hadoop.hbase.monitoring.TaskMonitor.processTasks (TaskMonitor.java:229)
  at org.apache.hadoop.hbase.monitoring.TaskMonitor.getTasks (TaskMonitor.java:183)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad (HRegionServer.java:1566)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport
  at org.apache.hadoop.hbase.regionserver.HRegionServer.run
{noformat}

{{tryRegionServerReport}} runs on the main RegionServer thread every 
{{hbase.regionserver.msginterval}} (default 3 s). An OOM there terminates the 
JVM when {{-XX:OnOutOfMemoryError}} is set to kill the process, and otherwise 
aborts that heartbeat. On a heap sized close to the working set, a brief write 
slowdown that causes handlers to back up is enough to grow per-handler scanner 
retention and push {{generateCallInfoMap()}} over the edge. Auto-replacement by 
an external pool manager doesn't break the loop: each replacement RS inherits 
the same client traffic and crashes again within 15–60 seconds of registering.
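
One possible direction, sketched here only as an illustration and not as the 
actual fix, is to render the request into a size-capped {{Appendable}} instead 
of an unbounded {{toString()}}. Every name below ({{BoundedAppendable}}, 
{{LimitReached}}, {{BoundedRenderSketch}}, {{Renderer}}, and the 
{{...<truncated>}} suffix) is hypothetical and not part of HBase. The sketch 
uses only the JDK; wiring it to the shaded protobuf printer would assume that 
the printer can write to an {{Appendable}} and propagates its {{IOException}}, 
which needs to be verified against the shaded version in use.

```java
import java.io.IOException;

/** Hypothetical helper: keeps at most maxChars characters, then aborts the producer. */
final class BoundedAppendable implements Appendable {
  /** Thrown once the cap is reached so the producer stops rendering early. */
  static final class LimitReached extends IOException {
    LimitReached() {
      super("render limit reached");
    }
  }

  private final StringBuilder buf;
  private final int maxChars;

  BoundedAppendable(int maxChars) {
    if (maxChars < 0) {
      throw new IllegalArgumentException("maxChars must be >= 0");
    }
    this.maxChars = maxChars;
    // Small initial capacity; the buffer never grows beyond maxChars characters.
    this.buf = new StringBuilder(Math.min(maxChars, 256));
  }

  @Override
  public Appendable append(CharSequence csq) throws IOException {
    CharSequence s = (csq == null) ? "null" : csq;
    return append(s, 0, s.length());
  }

  @Override
  public Appendable append(CharSequence csq, int start, int end) throws IOException {
    CharSequence s = (csq == null) ? "null" : csq;
    int room = maxChars - buf.length();
    if (end - start > room) {
      // Keep the part that still fits, then signal the producer to stop.
      buf.append(s, start, start + room);
      throw new LimitReached();
    }
    buf.append(s, start, end);
    return this;
  }

  @Override
  public Appendable append(char c) throws IOException {
    if (buf.length() >= maxChars) {
      throw new LimitReached();
    }
    buf.append(c);
    return this;
  }

  String result() {
    return buf.toString();
  }
}

/** Hypothetical caller: renders through the capped buffer and marks truncation. */
final class BoundedRenderSketch {
  /** Stand-in for anything that can print itself to an Appendable. */
  interface Renderer {
    void renderTo(Appendable out) throws IOException;
  }

  static String render(Renderer renderer, int maxChars) {
    BoundedAppendable out = new BoundedAppendable(maxChars);
    try {
      renderer.renderTo(out);
      return out.result();
    } catch (BoundedAppendable.LimitReached e) {
      return out.result() + "...<truncated>";
    } catch (IOException e) {
      return out.result() + "...<render failed>";
    }
  }

  public static void main(String[] args) {
    // Simulates a very large request; rendering stops after 64 characters.
    String s = render(out -> {
      for (int i = 0; i < 1_000_000; i++) {
        out.append("row_").append(Integer.toString(i)).append('\n');
      }
    }, 64);
    System.out.println(s);
  }
}
```

With this shape the retained text is bounded by the configured cap plus the 
suffix, regardless of how large the in-flight request is, and the producer 
stops as soon as the cap is hit rather than rendering the whole message first.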



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
