[ https://issues.apache.org/jira/browse/HDFS-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902547#comment-13902547 ]

Chris Nauroth commented on HDFS-5957:
-------------------------------------

Here are some additional details on the scenario that prompted filing this 
issue.  Thanks to [~gopalv] for sharing the details.

Gopal has a YARN application that performs strictly sequential reads of HDFS 
files.  The application may rapidly iterate through a large number of blocks, 
because each block contains a small metadata header, and based on that 
metadata, the application can often decide that there is nothing relevant in 
the rest of the block.  When that happens, the application seeks all the way 
past the block.  Gopal estimates that this code could feasibly scan through 
~100 HDFS blocks in ~10 seconds.
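
Roughly, the access pattern looks like the sketch below.  It assumes the 
standard zero-copy read API ({{FSDataInputStream#read(ByteBufferPool, int, 
EnumSet)}} and {{releaseBuffer}}); the header length and the relevance check 
are made up for illustration.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.EnumSet;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class SequentialHeaderScan {
  // Hypothetical header size; the real application's metadata format differs.
  private static final int HEADER_LEN = 4096;

  public static void scan(FileSystem fs, Path path) throws IOException {
    FileStatus status = fs.getFileStatus(path);
    long blockSize = status.getBlockSize();
    long fileLen = status.getLen();
    ElasticByteBufferPool pool = new ElasticByteBufferPool();
    FSDataInputStream in = fs.open(path);
    try {
      for (long blockStart = 0; blockStart < fileLen; blockStart += blockSize) {
        in.seek(blockStart);
        // Zero-copy read of the block's metadata header.  Each call can mmap
        // a new block, and the mapping lingers in the ShortCircuitCache after
        // releaseBuffer() until the background CacheCleaner unmaps it.
        ByteBuffer header = in.read(pool, HEADER_LEN,
            EnumSet.of(ReadOption.SKIP_CHECKSUMS));
        try {
          if (header == null || !blockIsRelevant(header)) {
            continue;  // skip the rest of this block; the loop seeks past it
          }
          // ... read and process the rest of the block ...
        } finally {
          if (header != null) {
            in.releaseBuffer(header);
          }
        }
      }
    } finally {
      in.close();
    }
  }

  // Placeholder for the application's metadata check.
  private static boolean blockIsRelevant(ByteBuffer header) {
    return false;
  }
}
{code}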

This usage pattern, in combination with zero-copy read, causes the 
{{ShortCircuitCache}} to retain a large number of memory-mapped regions.  
Eventually, YARN's resource check kills the container process for exceeding 
the enforced physical memory bounds.  The asynchronous nature of our 
{{munmap}} calls surprised Gopal, who had carefully calculated his memory 
usage to stay within YARN's resource limits.

As a workaround, I advised Gopal to lower {{dfs.client.mmap.cache.timeout.ms}} 
so that the {{munmap}} happens more quickly.  A better solution would be to 
provide support in the HDFS client for a caching policy that fits this usage 
pattern.  Two possibilities are:

# LRU bounded by a client-specified maximum memory size.  (Note this bound is 
on memory size, not on the number of files or blocks, because block counts and 
block sizes can vary.)  A rough sketch of this follows the list.
# Do not cache at all.  Effectively, only one memory-mapped region is alive at 
a time.  The sequential read usage pattern described above always results in a 
cache miss anyway, so a cache adds no value.
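
To make option 1 concrete, here is a rough sketch of an LRU cache bounded by 
total mapped bytes rather than entry count.  The names ({{MmapLruCache}}, 
{{Region}}) are invented for illustration; this is not a proposed patch to 
{{ShortCircuitCache}}.

{code:java}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: LRU bounded by total mapped bytes, not entry count. */
public class MmapLruCache<K> {
  /** Stand-in for an mmap'd region; its length varies with block size. */
  public interface Region {
    long length();
    void unmap();
  }

  private final long maxBytes;
  private long curBytes = 0;
  // accessOrder=true iterates least-recently-used entries first.
  private final LinkedHashMap<K, Region> regions =
      new LinkedHashMap<K, Region>(16, 0.75f, true);

  public MmapLruCache(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  public synchronized Region get(K key) {
    return regions.get(key);
  }

  public synchronized void put(K key, Region region) {
    Region prev = regions.put(key, region);
    if (prev != null) {
      curBytes -= prev.length();
      prev.unmap();
    }
    curBytes += region.length();
    // Evict least-recently-used regions until total mapped bytes fit the bound.
    Iterator<Map.Entry<K, Region>> it = regions.entrySet().iterator();
    while (curBytes > maxBytes && it.hasNext()) {
      Map.Entry<K, Region> eldest = it.next();
      if (eldest.getValue() == region) {
        continue;  // never evict the region that was just inserted
      }
      curBytes -= eldest.getValue().length();
      eldest.getValue().unmap();
      it.remove();
    }
  }
}
{code}

In the real cache, eviction would also need to skip regions still referenced 
by an in-flight read, so the byte bound would be a target rather than a hard 
guarantee.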

I don't propose removing the current time-triggered threshold, because I think 
that's valid for other use cases.  I only propose adding support for new 
policies.
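
For completeness, the time-triggered threshold is the same knob used in the 
workaround above.  A minimal sketch of lowering it from client code, using the 
standard {{Configuration}} API (the 5-second value is arbitrary):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class MmapTimeoutWorkaround {
  public static FileSystem openFileSystemWithShortMmapTimeout()
      throws IOException {
    Configuration conf = new Configuration();
    // Unmap cached regions sooner so the container's RSS drops well before
    // YARN's next resource check.  5000 ms is an arbitrary example value.
    conf.setLong("dfs.client.mmap.cache.timeout.ms", 5000L);
    return FileSystem.get(conf);
  }
}
{code}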

In addition to the caching policy itself, I want to propose a way to make the 
{{munmap}} calls run synchronously with the caller instead of on a background 
thread.  This would be a better fit for clients who want deterministic 
resource cleanup.  Right now, we have no way to guarantee that the OS will 
schedule the {{CacheCleaner}} thread ahead of YARN's resource check thread.  
This isn't a proposal to remove support for the background thread, only to add 
support for synchronous {{munmap}}.
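
To illustrate the distinction, here is a toy sketch, not a proposed API; 
{{MunmapPolicy}}, {{MappedRegion}}, and {{releaseRegion}} are invented names.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy sketch contrasting background and caller-synchronous munmap. */
public class MunmapPolicySketch {
  enum MunmapPolicy { BACKGROUND, SYNCHRONOUS }

  /** Stand-in for an mmap'd block region held by the client. */
  static class MappedRegion {
    void unmap() {
      // In the real client, this is where the munmap would happen.
      System.out.println("unmapped on thread " + Thread.currentThread().getName());
    }
  }

  private final ScheduledExecutorService cacheCleaner =
      Executors.newSingleThreadScheduledExecutor();

  void releaseRegion(final MappedRegion region, MunmapPolicy policy,
      long delayMs) {
    if (policy == MunmapPolicy.SYNCHRONOUS) {
      // Deterministic: the mapping is gone before this call returns, so a
      // later resource check cannot observe it.
      region.unmap();
    } else {
      // Current behavior: a background thread unmaps later, with no ordering
      // guarantee relative to YARN's resource check thread.
      cacheCleaner.schedule(new Runnable() {
        @Override
        public void run() {
          region.unmap();
        }
      }, delayMs, TimeUnit.MILLISECONDS);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    MunmapPolicySketch sketch = new MunmapPolicySketch();
    sketch.releaseRegion(new MappedRegion(), MunmapPolicy.SYNCHRONOUS, 0);
    sketch.releaseRegion(new MappedRegion(), MunmapPolicy.BACKGROUND, 100);
    Thread.sleep(200);
    sketch.cacheCleaner.shutdown();
  }
}
{code}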

I think you could also make an argument that YARN shouldn't count these 
memory-mapped regions towards the container process's RSS.  It's really the 
DataNode process that owns that memory, and clients who {{mmap}} the same 
region shouldn't get penalized.  Let's address that part separately though.


> Provide support for different mmap cache retention policies in 
> ShortCircuitCache.
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-5957
>                 URL: https://issues.apache.org/jira/browse/HDFS-5957
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.3.0
>            Reporter: Chris Nauroth
>
> Currently, the {{ShortCircuitCache}} retains {{mmap}} regions for reuse by 
> multiple reads of the same block or by multiple threads.  The eventual 
> {{munmap}} executes on a background thread after an expiration period.  Some 
> client usage patterns would prefer strict bounds on this cache and 
> deterministic cleanup by calling {{munmap}}.  This issue proposes additional 
> support for different caching policies that better fit these usage patterns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
