[ https://issues.apache.org/jira/browse/HDFS-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662555#comment-13662555 ]

Colin Patrick McCabe commented on HDFS-4817:
--------------------------------------------

[~andrew.wang] thanks for the terasort benchmarks; they're very encouraging! 
It's great to know that we can get this kind of performance with this patch 
just by changing a few MR client settings.

bq. The FileSystem cache is annoying here, since it means all threads share the 
same DFSClient and thus the same default CachingStrategy

Agree.  I'm going to get rid of the get/set methods on the {{FileSystem}}, for 
the reasons you and Todd brought up.  It just doesn't make sense, given how we 
cache those FileSystem objects.
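
For context on why the cache makes per-{{FileSystem}} setters untenable, a 
quick illustration ({{fsA}}/{{fsB}} are just illustrative names):

{code}
// FileSystem.get() hands out cached, shared instances by default, so a
// caching setting applied through one reference would silently affect
// every other thread holding the same cache entry.
Configuration conf = new Configuration();
FileSystem fsA = FileSystem.get(conf);  // "thread A"
FileSystem fsB = FileSystem.get(conf);  // "thread B": the same cached object
assert fsA == fsB;  // holds unless fs.<scheme>.impl.disable.cache is set
{code}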

bq. Cache management behavior in BlockReceiver is a bit different. Whereas 
before we would sync or dropbehind in window increments, now we do it up to the 
current position. Any comments on why this was changed? The difference seems 
minor (4MB vs. 4MB+64KB packet).

That's not quite true.  The general idea is, we {{sync_file_range}} the last 
few bytes we wrote, and we call {{FADV_DONTNEED}} from the beginning of the 
file up to a position slightly behind the current one.  The way this is done 
changed slightly, but not the concept.

There isn't much point in calling {{sync_file_range}} twice on the same 
offsets, since the sync process has presumably already begun.  On the other 
hand, calling {{fadvise(FADV_DONTNEED)}} again and again will tend to purge 
more and more bytes from the cache.  The reason is that dirty pages (those 
containing data not yet written out) cannot be purged using 
{{FADV_DONTNEED}}, and we can't know exactly when the pages we wrote will be 
flushed to disk.  But we do know that calling {{FADV_DONTNEED}} on very 
recently written bytes is a waste, since they will almost certainly not have 
been written out to disk yet.  That is why we purge from 0 to 
{{lastCacheManagementOffset - CACHE_WINDOW_SIZE}}, rather than simply from 0 
to the current position.
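
To make the windowing concrete, here is a rough sketch of the write-path 
cache management step.  The names ({{outFd}}, {{lastCacheManagementOffset}}, 
{{CACHE_WINDOW_SIZE}}) follow the discussion above and approximate the patch 
rather than quoting it; the {{NativeIO}} wrappers are the ones HADOOP-7753 
introduced:

{code}
// Rough sketch, not the patch verbatim: write-path cache management.
private void manageWriterOsCache(long offsetInBlock) {
  try {
    // 1. Start writeback on just the bytes written since the last call.
    //    Re-syncing offsets that are already being written back buys nothing.
    NativeIO.syncFileRangeIfPossible(outFd, lastCacheManagementOffset,
        offsetInBlock - lastCacheManagementOffset,
        NativeIO.SYNC_FILE_RANGE_WRITE);

    // 2. Purge the page cache from the start of the block file up to one
    //    window behind the last sync point.  Anything newer is probably
    //    still dirty, and FADV_DONTNEED cannot drop dirty pages anyway.
    long dropEnd = lastCacheManagementOffset - CACHE_WINDOW_SIZE;
    if (dropEnd > 0) {
      NativeIO.posixFadviseIfPossible(outFd, 0, dropEnd,
          NativeIO.POSIX_FADV_DONTNEED);
    }
    lastCacheManagementOffset = offsetInBlock;
  } catch (Throwable t) {
    LOG.warn("Error managing OS cache behind writes", t);
  }
}
{code}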

Fixed typo in BlockReceiver comment.

bq. NativeIO.java: @VisibleForTesting on LONG_READ_THRESHOLD_BYTES too. Any 
reason why these aren't configuration parameters? Then they could be final, and 
still modified for tests.

I think what we should do is keep {{LONG_READ_THRESHOLD_BYTES}} as-is, but 
only use it if no explicit cache management policy was set.  That way, users 
who want control get it at the file level rather than the DataNode level, 
which is, after all, the whole point here.
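
In other words, the read path would do something along these lines (a 
sketch; {{readLength}} and the {{getDropBehind}} accessor are placeholders 
for whatever the patch actually names them):

{code}
// Sketch: an explicit per-file policy, when present, overrides the
// LONG_READ_THRESHOLD_BYTES heuristic.
boolean dropBehind;
if (cachingStrategy.getDropBehind() != null) {
  dropBehind = cachingStrategy.getDropBehind();  // explicit file-level choice
} else {
  // No explicit policy set: fall back to the long-read heuristic.
  dropBehind = readLength >= NativeIO.LONG_READ_THRESHOLD_BYTES;
}
{code}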

bq. [API comments]

I changed the API to:
* be per-stream.
* use boxed {{Boolean}}s and {{Long}}s, where {{null}} represents using the 
(server) defaults.  I think this makes things a lot clearer than using magic 
numbers; see the sketch just below.
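
Concretely, the strategy ends up being just a pair of nullable fields (a 
sketch; accessor names may differ from the patch):

{code}
// Sketch of the per-stream strategy object: null means "use the server
// default", which avoids magic sentinel values like -1.
public class CachingStrategy {
  private final Boolean dropBehind;  // null = DataNode default
  private final Long readahead;      // null = DataNode default

  public CachingStrategy(Boolean dropBehind, Long readahead) {
    this.dropBehind = dropBehind;
    this.readahead = readahead;
  }
  public Boolean getDropBehind() { return dropBehind; }
  public Long getReadahead() { return readahead; }
}
{code}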

CopyBlock, fsck: yep.  We should set dropBehind in those places.

bq. Having classes named CachingStrategy and CacheStrategy is a little 
confusing. Rename CacheStrategy to CachingStrategyProto per normal convention?

ok

bq. BlockPoolSliceScanner#verifyBlock, can use CachingStrategy#newDropBehind 
convenience method rather than writing it out.

ok

[~saint....@gmail.com]  You can now change the caching strategy on an open 
file, something I believe you requested earlier.  I also reduced the long 
read threshold from 256 KB to 16 KB, so that HBase can take better advantage 
of these settings.  We will have to benchmark more to find the right 
balance, but this should at least be a starting point.
                
> make HDFS advisory caching configurable on a per-file basis
> -----------------------------------------------------------
>
>                 Key: HDFS-4817
>                 URL: https://issues.apache.org/jira/browse/HDFS-4817
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>         Attachments: HDFS-4817.001.patch
>
>
> HADOOP-7753 and related JIRAs introduced some performance optimizations for 
> the DataNode.  One of them was readahead.  When readahead is enabled, the 
> DataNode starts reading the next bytes it thinks it will need in the block 
> file, before the client requests them.  This helps hide the latency of 
> rotational media and send larger reads down to the device.  Another 
> optimization was "drop-behind."  Using this optimization, we could remove 
> files from the Linux page cache after they were no longer needed.
>
> Using {{dfs.datanode.drop.cache.behind.writes}} and 
> {{dfs.datanode.drop.cache.behind.reads}} can improve performance 
> substantially on many MapReduce jobs.  In our internal benchmarks, we have 
> seen speedups of 40% on certain workloads.  The reason is that if we know 
> the block data will not be read again any time soon, keeping it out of 
> memory frees more memory for the other processes on the system.  See 
> HADOOP-7714 for more benchmarks.
>
> We would like to turn on these configurations on a per-file or per-client 
> basis, rather than on the DataNode as a whole.  This will allow more users to 
> actually make use of them.  It would also be good to add unit tests for the 
> drop-cache code path, to ensure that it is functioning as we expect.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
