    [ https://issues.apache.org/jira/browse/RATIS-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633276#comment-17633276 ]

Song Ziyang commented on RATIS-1743:
------------------------------------

It seems that references to SegmentedRaftLogWorker are being leaked. As a 
workaround, I can manually set sharedBuffer to null in the close() hook, so 
that a worker retained after close() no longer holds on to the buffer. 
What do you think? [~adoroszlai] [~szetszwo] 
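
To make the proposed workaround concrete, here is a minimal sketch. It does 
not use the actual Ratis code: the class WorkerSketch, its memoize() helper 
and the holdsBuffer() method are hypothetical names made up for illustration, 
and only the idea (drop the buffer reference in close()) comes from the 
comment above. It assumes the buffer is allocated lazily through a memoized 
supplier, as the stack trace in the issue suggests.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a log worker that owns a large, lazily allocated buffer. */
final class WorkerSketch implements AutoCloseable {
  // Deliberately not final, so that close() can drop the reference.
  private volatile Supplier<byte[]> sharedBuffer;

  WorkerSketch(int bufferSize) {
    // The buffer is only allocated on first use.
    this.sharedBuffer = memoize(() -> new byte[bufferSize]);
  }

  /** Simplified memoizing supplier (illustration only, not the Ratis MemoizedSupplier). */
  static <T> Supplier<T> memoize(Supplier<T> initializer) {
    return new Supplier<T>() {
      private T value;

      @Override
      public synchronized T get() {
        if (value == null) {
          value = initializer.get();
        }
        return value;
      }
    };
  }

  /** Copies data into the shared buffer and returns the number of bytes copied. */
  int write(byte[] data) {
    final Supplier<byte[]> supplier = sharedBuffer;
    if (supplier == null) {
      throw new IllegalStateException("worker is closed");
    }
    final byte[] buffer = supplier.get();
    final int n = Math.min(data.length, buffer.length);
    System.arraycopy(data, 0, buffer, 0, n);
    return n;
  }

  /** True while this worker still references its buffer (or its supplier). */
  boolean holdsBuffer() {
    return sharedBuffer != null;
  }

  @Override
  public void close() {
    // Workaround: even if something else (for example a metrics registry)
    // still references this worker, the buffer itself becomes collectable.
    sharedBuffer = null;
  }
}
```

With this shape, a worker that stays reachable after close() costs only the 
size of the worker object itself, not the size of the buffer. It does not 
remove the underlying reference leak.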

> Memory leak in SegmentedRaftLogWorker due to metrics
> ----------------------------------------------------
>
>                 Key: RATIS-1743
>                 URL: https://issues.apache.org/jira/browse/RATIS-1743
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.0.0, 2.4.1
>            Reporter: Attila Doroszlai
>            Priority: Blocker
>         Attachments: Screenshot from 2022-11-12 22-17-11.png
>
>
> OOME happens in Ozone integration tests.  Currently Xmx=2g, but increasing it 
> does not help.
> {code:title=https://github.com/adoroszlai/hadoop-ozone/actions/runs/3450185096/jobs/5761108630#step:5:3155}
> [INFO] Running org.apache.hadoop.ozone.scm.TestStorageContainerManagerHA
> Error:  java.lang.OutOfMemoryError: Java heap space
> Error:  Tests run: 8, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 426.774 s <<< FAILURE! - in org.apache.hadoop.ozone.scm.TestStorageContainerManagerHA
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.lambda$new$4(SegmentedRaftLogWorker.java:223)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$$Lambda$603/1771708635.get(Unknown Source)
>       at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.write(SegmentedRaftLogOutputStream.java:101)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.execute(SegmentedRaftLogWorker.java:568)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:320)
>       at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$$Lambda$595/1626598428.run(Unknown Source)
>       at java.lang.Thread.run(Thread.java:750)
> {code}
> Ozone registers a JMX reporter (this is not new):
> {code:title=https://github.com/apache/ozone/blob/a13c62b60556cd003ee2149179f72029d9e35756/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/RatisDropwizardExports.java#L51-L53}
>     MetricRegistries.global()
>         .addReporterRegistration(MetricsReporting.jmxReporter(),
>             MetricsReporting.stopJmxReporter());
> {code}
> Based on the heap dump and test log, {{SegmentedRaftLogWorker}} instances are 
> retained by JmxMBeanServer after {{close()}}.
> The problem is probably not new, but its effect is much worse now, because 
> {{SegmentedRaftLogWorker}} recently got a shared buffer (RATIS-1717).
> {code:title=config in Ozone}
> raft.server.log.appender.buffer.byte-limit = 33554432 (custom)
> {code}
> See screenshot for GC root.
> CC [~szetszwo], [~William Song]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
