[ https://issues.apache.org/jira/browse/HDFS-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699650#comment-13699650 ]

Aaron T. Myers commented on HDFS-4817:
--------------------------------------

Patch looks pretty good to me, Colin. A few small comments.

# The patch no longer applies cleanly to trunk and will need to be rebased. The 
conflicts appear pretty minor.
# I think the changes in BlockLocation are unnecessary, and potentially 
incompatible as well.
# I'm not in love with the interface name "CanDropCacheBehindWrites", though I 
don't have much of a better suggestion. "CacheDroppableBehindWrites" ? 
"WriteCacheDroppable" ? Feel free to ignore this comment.
# Why the asymmetry in naming between FSDataInputStream#setDropBehind and 
FSDataOutputStream#setDropCacheBehindWrites? i.e. why not 
FSDataInputStream#setDropCacheBehindReads? Or just setDropBehind in both cases? 
I personally prefer setDropBehind in both cases - each of those streams is 
only for reading or writing, respectively.
# In FSDataInputStream#setDropBehind and 
FSDataOutputStream#setDropCacheBehindWrites, recommend you add a LOG.debug(...) 
(maybe even a LOG.warn?) to log the ClassCastException if it happens (see the 
sketch after this list).
# Recommend moving the default no-op versions of setReadahead and setDropBehind 
into FSInputStream, so that you don't have to add no-op versions to 
FSInputChecker, RawLocalFileSystem, FTPInputStream, S3InputStream, 
NativeS3FileSystem, ByteRangeInputStream, and TestStreamFile (also sketched 
below).
# What's the reason for having CachingStrategy#dropBehind and 
CachingStrategy#readahead public? They both have a getter and a setter.
# Is the reason that CachingStrategy doesn't separate dropBehind for reads from 
dropBehind for writes because a CachingStrategy will only ever be used by 
either an input or an output stream? If so, might want to add a comment to 
CachingStrategy to make that more obvious. In general CachingStrategy could 
stand to have a class comment.
# Is there any good reason why we only have a single client-side config for 
drop behind, i.e. why not have separate read and write drop behind configs?
# Minor point throughout the patch: the asymmetry in camel casing of 
"readahead" vs. "dropBehind" kind of bothers me, but that's a bit of a matter 
of taste. I don't think "readahead" is really one word, but if it is, then 
shouldn't "dropBehind" be "dropbehind" as well?

Tests look really good, btw.

+1 once these are addressed.
                
> make HDFS advisory caching configurable on a per-file basis
> -----------------------------------------------------------
>
>                 Key: HDFS-4817
>                 URL: https://issues.apache.org/jira/browse/HDFS-4817
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>         Attachments: HDFS-4817.001.patch, HDFS-4817.002.patch, 
> HDFS-4817.004.patch
>
>
> HADOOP-7753 and related JIRAs introduced some performance optimizations for 
> the DataNode.  One of them was readahead.  When readahead is enabled, the 
> DataNode starts reading the next bytes it thinks it will need in the block 
> file, before the client requests them.  This helps hide the latency of 
> rotational media and send larger reads down to the device.  Another 
> optimization was "drop-behind."  Using this optimization, we could remove 
> files from the Linux page cache after they were no longer needed.
> Using {{dfs.datanode.drop.cache.behind.writes}} and 
> {{dfs.datanode.drop.cache.behind.reads}} can improve performance  
> substantially on many MapReduce jobs.  In our internal benchmarks, we have 
> seen speedups of 40% on certain workloads.  The reason is that if we know 
> the block data will not be read again any time soon, keeping it out of memory 
> allows more memory to be used by the other processes on the system.  See 
> HADOOP-7714 for more benchmarks.
> We would like to turn on these configurations on a per-file or per-client 
> basis, rather than on the DataNode as a whole.  This will allow more users to 
> actually make use of them.  It would also be good to add unit tests for the 
> drop-cache code path, to ensure that it is functioning as we expect.
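
As a concrete illustration of the per-file goal described above, here is a 
hypothetical client-side sketch. The method names (setDropBehind, setReadahead) 
follow the review comments earlier in this thread, but the exact signatures, 
units, and defaults are assumptions, not the committed API, and the path is 
made up for illustration:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileCachingExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    try (FSDataInputStream in = fs.open(new Path("/data/scan-once.dat"))) {
      // This file will be streamed through exactly once, so ask that its
      // pages be dropped from the OS cache behind the reader...
      in.setDropBehind(true);
      // ...and that the block be read ahead aggressively while we scan it.
      in.setReadahead(4L * 1024 * 1024);

      byte[] buf = new byte[64 * 1024];
      while (in.read(buf) != -1) {
        // process the data
      }
    }
  }
}
{code}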

