[jira] [Commented] (HDFS-4817) make HDFS advisory caching configurable on a per-file basis

Hari Mankude (JIRA) Fri, 17 May 2013 06:47:25 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660724#comment-13660724
 ]


Hari Mankude commented on HDFS-4817:
------------------------------------

Colin,

Can this feature be extended to determine where data is stored on the DN? 
For example, a DN might have both SSDs and SATA/SAS drives. Depending on hints 
provided by the user about access patterns (random reads vs. long sequential 
reads), it might be useful to place the data on SSD rather than SATA. I 
understand that the NN has to be involved to make this information persistent 
across block relocation. 

The goal would be to make the DN smarter than it is right now (or able to 
learn with minimal involvement from the NN), given that nodes can have storage 
devices with vastly different characteristics. Another option is to use 
observed access patterns to move data across the various storage devices in a 
DN (a sort of HSM).
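
To make the idea concrete, here is a minimal sketch of hint-driven placement. 
Everything in it is hypothetical: the names AccessHint, StorageTier and 
choose_tier are invented for illustration and are not part of HDFS or of the 
attached patch. It only assumes that the DN knows how much SSD space is free 
and that the client passes an access-pattern hint.

```python
from enum import Enum


class AccessHint(Enum):
    """Hypothetical client-supplied access-pattern hints."""
    RANDOM_READ = "random_read"
    SEQUENTIAL_READ = "sequential_read"


class StorageTier(Enum):
    """Hypothetical storage classes available on one DataNode."""
    SSD = "ssd"
    DISK = "disk"


def choose_tier(hint, ssd_free_bytes, block_size):
    """Pick a tier for a new block replica (illustrative policy only).

    Random-read data goes to SSD when there is room for the block;
    everything else, including long sequential reads, goes to
    rotational disk.
    """
    if hint is AccessHint.RANDOM_READ and ssd_free_bytes >= block_size:
        return StorageTier.SSD
    return StorageTier.DISK
```

A real policy would also have to survive block relocation, which is where the 
NN comes in.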

It looks like the current patch is mainly meant to manage the OS page cache. 
                
> make HDFS advisory caching configurable on a per-file basis
> -----------------------------------------------------------
>
>                 Key: HDFS-4817
>                 URL: https://issues.apache.org/jira/browse/HDFS-4817
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>         Attachments: HDFS-4817.001.patch
>
>
> HADOOP-7753 and related JIRAs introduced some performance optimizations for 
> the DataNode.  One of them was readahead.  When readahead is enabled, the 
> DataNode starts reading the next bytes it thinks it will need in the block 
> file, before the client requests them.  This helps hide the latency of 
> rotational media and send larger reads down to the device.  Another 
> optimization was "drop-behind."  With this optimization, we can drop file 
> data from the Linux page cache once it is no longer needed.
> Using {{dfs.datanode.drop.cache.behind.writes}} and 
> {{dfs.datanode.drop.cache.behind.reads}} can improve performance  
> substantially on many MapReduce jobs.  In our internal benchmarks, we have 
> seen speedups of 40% on certain workloads.  The reason is that, if we know 
> the block data will not be read again any time soon, keeping it out of memory 
> allows more memory to be used by the other processes on the system.  See 
> HADOOP-7714 for more benchmarks.
> We would like to turn on these configurations on a per-file or per-client 
> basis, rather than on the DataNode as a whole.  This will allow more users to 
> actually make use of them.  It would also be good to add unit tests for the 
> drop-cache code path, to ensure that it is functioning as we expect.
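
For readers unfamiliar with the mechanism, the readahead and drop-behind 
behaviour described above can be sketched with posix_fadvise on Linux. This is 
an illustration of the general technique under stated assumptions, not the 
DataNode's actual code: the function name read_with_drop_behind and the chunk 
and readahead sizes are made up for the example, and the advice calls are 
skipped on platforms where os.posix_fadvise is unavailable.

```python
import os

CHUNK = 64 * 1024  # illustrative read size, not an HDFS default


def _advise(fd, offset, length, advice_name):
    """Issue a posix_fadvise hint if the platform supports it."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, offset, length, getattr(os, advice_name))


def read_with_drop_behind(path, chunk=CHUNK, readahead=4 * CHUNK):
    """Read a file sequentially and return the number of bytes read.

    Before each read, hint that the following bytes will be needed
    (readahead). After each read, hint that the bytes just consumed
    are no longer needed (drop-behind), so the kernel may evict them
    from the page cache.
    """
    total = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            _advise(fd, offset + chunk, readahead, "POSIX_FADV_WILLNEED")
            data = os.pread(fd, chunk, offset)
            if not data:
                break
            total += len(data)
            _advise(fd, offset, len(data), "POSIX_FADV_DONTNEED")
            offset += len(data)
    finally:
        os.close(fd)
    return total
```

Both hints are advisory: the kernel is free to ignore them, which is why the 
JIRA title speaks of "advisory caching".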

