[
https://issues.apache.org/jira/browse/HDFS-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902547#comment-13902547
]
Chris Nauroth commented on HDFS-5957:
-------------------------------------
Here are some additional details on the scenario that prompted filing this
issue. Thanks to [~gopalv] for sharing them.
Gopal has a YARN application that performs strictly sequential reads of HDFS
files. The application may iterate rapidly through a large number of blocks,
because each block starts with a small metadata header, and based on that
metadata the application can often decide that there is nothing relevant in
the rest of the block. When that happens, the application seeks all the way
past that block. Gopal estimates that this code could feasibly scan through
~100 HDFS blocks in ~10 seconds.
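For concreteness, here is roughly the shape of that read loop using the zero-copy read API ({{FSDataInputStream#read(ByteBufferPool, int, EnumSet)}} and {{releaseBuffer}}). {{HEADER_LEN}} and {{isRelevant}} are stand-ins for the application's own metadata handling; this is only a sketch of the access pattern, not Gopal's actual code:
{code:java}
import java.nio.ByteBuffer;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ByteBufferPool;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class SequentialHeaderScan {
  // Placeholder for the application's own header size; not an HDFS constant.
  private static final int HEADER_LEN = 512;

  // Placeholder for the application's own relevance test on the header bytes.
  private static boolean isRelevant(ByteBuffer header) {
    return header.remaining() > 0;  // stand-in logic only
  }

  public static void scan(Configuration conf, Path path) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FileStatus stat = fs.getFileStatus(path);
    long blockSize = stat.getBlockSize();
    ByteBufferPool pool = new ElasticByteBufferPool();
    FSDataInputStream in = fs.open(path);
    try {
      for (long start = 0; start < stat.getLen(); start += blockSize) {
        in.seek(start);
        // Zero-copy read of the per-block header. On a local, short-circuit
        // replica this returns a slice of an mmap'ed region that the client's
        // ShortCircuitCache keeps around after release.
        ByteBuffer header = in.read(pool, HEADER_LEN,
            EnumSet.of(ReadOption.SKIP_CHECKSUMS));
        try {
          if (header == null || !isRelevant(header)) {
            continue;  // nothing useful in this block; the loop seeks past it
          }
          // ... read and process the rest of the block ...
        } finally {
          if (header != null) {
            // Drops this reader's reference, but the mapping itself stays
            // cached until the background CacheCleaner expires it.
            in.releaseBuffer(header);
          }
        }
      }
    } finally {
      in.close();
    }
  }
}
{code}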
This usage pattern in combination with zero-copy read causes retention of a
large number of memory-mapped regions in the {{ShortCircuitCache}}.
Eventually, YARN's resource check kills the container process for exceeding the
enforced physical memory bounds. The asynchronous nature of our {{munmap}}
calls was surprising to Gopal, who had carefully calculated his memory usage
to stay under YARN's resource checks.
As a workaround, I advised Gopal to tune down
{{dfs.client.mmap.cache.timeout.ms}} so that the {{munmap}} happens more
quickly. A better solution would be to provide support in the HDFS client for
a caching policy that fits this usage pattern. Two possibilities are:
# LRU bounded by a client-specified maximum memory size. (Note this is a
maximum memory size, not a number of files or blocks, because block counts and
block sizes can differ.) A rough sketch of this idea follows the list.
# Do not cache at all. Effectively, there is only one memory-mapped region
alive at a time. The sequential read usage pattern described above always
results in a cache miss anyway, so a cache adds no value.
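To make possibility #1 slightly more concrete: the idea is that eviction would be driven by a byte budget rather than a timer. The sketch below is illustrative only; it is not a patch against {{ShortCircuitCache}}, it leaves the cache key generic, and it glosses over the reference counting that keeps in-use mappings alive:
{code:java}
import java.nio.MappedByteBuffer;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only -- not the real ShortCircuitCache internals.
// The point is simply that eviction is driven by the sum of mapped bytes
// rather than by a timer or an entry count.
public class MmapLruByBytes<K> {
  private final long maxMappedBytes;
  private long mappedBytes = 0;
  // Access-ordered LinkedHashMap gives least-recently-used iteration order.
  private final LinkedHashMap<K, MappedByteBuffer> cache =
      new LinkedHashMap<K, MappedByteBuffer>(16, 0.75f, true);

  public MmapLruByBytes(long maxMappedBytes) {
    this.maxMappedBytes = maxMappedBytes;
  }

  public synchronized MappedByteBuffer get(K key) {
    return cache.get(key);  // also marks the entry as recently used
  }

  public synchronized void put(K key, MappedByteBuffer mmap) {
    MappedByteBuffer prev = cache.put(key, mmap);
    if (prev != null) {
      mappedBytes -= prev.capacity();  // replaced an existing mapping
    }
    mappedBytes += mmap.capacity();
    // Evict least-recently-used mappings until total mapped bytes fit the
    // budget, but never evict the entry we just inserted.
    Iterator<Map.Entry<K, MappedByteBuffer>> it = cache.entrySet().iterator();
    while (mappedBytes > maxMappedBytes && cache.size() > 1 && it.hasNext()) {
      Map.Entry<K, MappedByteBuffer> eldest = it.next();
      mappedBytes -= eldest.getValue().capacity();
      it.remove();
      // In the real cache, this is where the evicted region would be
      // munmap'ed, subject to reference counting on in-use mappings.
    }
  }
}
{code}
The byte budget itself would presumably come from a new client-side configuration property.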
I don't propose removing the current time-triggered threshold, because I think
that's valid for other use cases. I only propose adding support for new
policies.
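As an aside, to make the earlier workaround concrete: it is a single client-side setting, with no code changes needed. Something along these lines, where the 10-second value is only an example:
{code:java}
// Workaround sketch: shorten the mmap cache timeout so the background
// CacheCleaner unmaps idle regions sooner. The 10-second value is only an
// example, not a recommendation.
Configuration conf = new Configuration();
conf.setLong("dfs.client.mmap.cache.timeout.ms", 10000L);
FileSystem fs = FileSystem.get(conf);
{code}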
In addition to the caching policy itself, I want to propose a way to make the
{{munmap}} calls run synchronously with the caller instead of on a background
thread. This would be a better fit for clients that want deterministic resource
cleanup. Right now, we have no way to guarantee that the OS will schedule the
{{CacheCleaner}} thread ahead of YARN's resource check thread. This isn't a
proposal to remove support for the background thread, only to add support for
synchronous {{munmap}}.
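To illustrate what synchronous cleanup might look like from the client's side, here is a sketch. The configuration key below is purely hypothetical (it does not exist today), {{path}} is the file being scanned, and imports are as in the first sketch; the point is only that dropping the last reference would trigger the {{munmap}} on the caller's thread:
{code:java}
// Hypothetical sketch only: the property below does not exist today.
Configuration conf = new Configuration();
conf.setBoolean("dfs.client.mmap.cache.synchronous.munmap", true);  // hypothetical key
FileSystem fs = FileSystem.get(conf);
ByteBufferPool pool = new ElasticByteBufferPool();

FSDataInputStream in = fs.open(path);
try {
  ByteBuffer buf = in.read(pool, 4 * 1024 * 1024,
      EnumSet.of(ReadOption.SKIP_CHECKSUMS));
  try {
    // ... use buf ...
  } finally {
    if (buf != null) {
      // With synchronous cleanup, dropping the last reference would munmap
      // the region right here on the caller's thread, so the process RSS
      // falls before YARN's next resource check instead of whenever the
      // CacheCleaner thread happens to run.
      in.releaseBuffer(buf);
    }
  }
} finally {
  in.close();
}
{code}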
I think you could also make an argument that YARN shouldn't count these
memory-mapped regions towards the container process's RSS. It's really the
DataNode process that owns that memory, and clients who {{mmap}} the same
region shouldn't get penalized. Let's address that part separately, though.
> Provide support for different mmap cache retention policies in
> ShortCircuitCache.
> ---------------------------------------------------------------------------------
>
> Key: HDFS-5957
> URL: https://issues.apache.org/jira/browse/HDFS-5957
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Affects Versions: 2.3.0
> Reporter: Chris Nauroth
>
> Currently, the {{ShortCircuitCache}} retains {{mmap}} regions for reuse by
> multiple reads of the same block or by multiple threads. The eventual
> {{munmap}} executes on a background thread after an expiration period. Some
> client usage patterns would prefer strict bounds on this cache and
> deterministic cleanup by calling {{munmap}}. This issue proposes additional
> support for different caching policies that better fit these usage patterns.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)