[ https://issues.apache.org/jira/browse/HDFS-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liang Xie updated HDFS-5461:
----------------------------
    Attachment: HDFS-5461.txt

bq. It's because each open stream holds a buffer, and we have hundreds of open streams?

I am not 100% sure, but I agree with you: this OOM is easy to reproduce when there are lots of open store files to read (e.g. when compaction can't keep up).

Oh, I see; it seems the fallback is only meaningful for a configuration like mine: a big Xmx and a small MaxDirectMemorySize :)

I attached a patch with more logging about the in-use/pooled direct buffer sizes. In my opinion, that could be useful when resetting the log level to "trace" online while the OOM is occurring. The patch also adds a simple try/catch fallback for OOM without introducing any new config value; to me, that seems the more reasonable approach :)

> fallback to non-ssr (local short-circuit reads) while OOM detected
> ------------------------------------------------------------------
>
>                 Key: HDFS-5461
>                 URL: https://issues.apache.org/jira/browse/HDFS-5461
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Liang Xie
>         Attachments: HDFS-5461.txt
>
>
> Currently, the DirectBufferPool used by the SSR feature does not have an
> upper-bound limit other than the JVM's direct-memory option, so there is a
> risk of hitting a direct-memory OOM; see HBASE-8143 for an example.
> IMHO, we could improve it a bit:
> 1) detect an OOM (or a configured upper limit being reached) in the caller,
> then fall back to non-SSR;
> 2) add a new metric for the currently consumed raw direct memory size.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
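The try/catch fallback discussed above could be sketched roughly as below. This is a minimal illustration, not the actual HDFS-5461 patch: the class and method names (`BufferAllocator`, `allocateWithFallback`) are hypothetical, and the real change would live in the short-circuit read path around DirectBufferPool rather than in a standalone helper.

```java
import java.nio.ByteBuffer;
import java.util.function.Supplier;

public class BufferAllocator {
    // Generic form of the fallback: attempt the primary (direct) allocation
    // and, if the JVM throws OutOfMemoryError, use the fallback path instead
    // of failing the read. No new config value is needed.
    static ByteBuffer allocate(Supplier<ByteBuffer> direct,
                               Supplier<ByteBuffer> heapFallback) {
        try {
            return direct.get();
        } catch (OutOfMemoryError e) {
            // Direct memory is exhausted (e.g. -XX:MaxDirectMemorySize was
            // reached); degrade gracefully to the fallback allocation.
            return heapFallback.get();
        }
    }

    // Concrete use: prefer a direct buffer (as the short-circuit read path
    // does), falling back to an ordinary heap buffer on OOM, which mirrors
    // falling back from SSR to the normal read path.
    static ByteBuffer allocateWithFallback(int size) {
        return allocate(() -> ByteBuffer.allocateDirect(size),
                        () -> ByteBuffer.allocate(size));
    }
}
```

A caller sees the same ByteBuffer API either way; only `isDirect()` reveals which path was taken, so the degradation is transparent to the read logic.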