Hi Demai,
The centralized cache required 'explicit' configuration, so by default, there is
no HDFS-managed cache?
YES, only explicit centralized caching is supported. The volume of data stored
in HDFS is typically far larger than memory, and when multiple clients are
accessing a DataNode the cache hit ratio would be very low, so there is no
point in having an implicit HDFS-managed cache.
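For completeness, the explicit configuration is done through the hdfs cacheadmin CLI: you create a cache pool and then add a directive for each path you want pinned in DataNode memory. The pool and path names below are only examples:

```
hdfs cacheadmin -addPool demo-pool
hdfs cacheadmin -addDirective -path /user/demai/data.csv -pool demo-pool -replication 1
hdfs cacheadmin -listDirectives
```

The -replication flag controls how many cached replicas are kept, independently of the on-disk replication factor.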

Will the cache occur at local filesystem level like Linux?
Please refer to dfs.datanode.drop.cache.behind.reads and
dfs.datanode.drop.cache.behind.writes in
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
for more details on how the DataNode interacts with the OS buffer cache.
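For example, both properties default to false; setting them to true in hdfs-site.xml advises the OS to purge data from the buffer cache once it has been read or written, which helps only for workloads that will not re-read the same blocks soon:

```xml
<property>
  <name>dfs.datanode.drop.cache.behind.reads</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.drop.cache.behind.writes</name>
  <value>true</value>
</property>
```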

the client has 10 processes repeatedly read the same HDFS file. will HDFS 
client API be able to cache the file content at Client side?
Each individual process will have its own HDFS client, so this caching needs to
be done at the application layer.

or every READ will have to move the whole file through network, and no sharing  
between processes?
Yes, every READ will have to move the whole file through the network; there is
no sharing between multiple clients/processes within a given node.
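As an application-layer workaround, the 10 processes could share a single local copy: whichever process reads first copies the file out of HDFS onto local disk, and the rest read that copy. A minimal sketch, where fetch_from_hdfs is a hypothetical stand-in for whatever HDFS client API the application actually uses:

```python
import os
import tempfile

def fetch_from_hdfs(hdfs_path):
    """Hypothetical stand-in for a real HDFS read (e.g. WebHDFS or the
    Java client). Each call represents one full transfer over the network."""
    return b"file contents from " + hdfs_path.encode()

def read_cached(hdfs_path, cache_dir):
    """Return the file's bytes, fetching from HDFS only on the first call.
    Later calls, from any process on this host, hit the local copy."""
    local = os.path.join(cache_dir, hdfs_path.strip("/").replace("/", "_"))
    if not os.path.exists(local):
        data = fetch_from_hdfs(hdfs_path)
        # Write to a temp file and rename, so a concurrent process
        # never observes a partially written cache file.
        fd, tmp = tempfile.mkstemp(dir=cache_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, local)
    with open(local, "rb") as f:
        return f.read()
```

Each process still holds its own copy in its address space, but the OS page cache deduplicates the disk reads, so the network transfer happens only once per host.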

+ Naga
________________________________
From: Demai Ni [nid...@gmail.com]
Sent: Wednesday, August 12, 2015 02:05
To: user@hadoop.apache.org
Subject: Re: hadoop/hdfs cache question, do client processes share cache?

Ritesh,

many thanks for your response. I just read through the centralized cache
document; thanks for the pointer. A couple of follow-up questions.

First, the centralized cache required 'explicit' configuration, so by default, 
there is no HDFS-managed cache? Will the cache occur at local filesystem level 
like Linux?

The 2nd question: the centralized cache is among the DataNodes of HDFS. Let's
say the client is a stand-alone Linux machine (not part of the cluster) which
connects to an HDFS cluster with the centralized cache configured, so on the
HDFS cluster the file is cached. In that scenario, the client has 10 processes
repeatedly reading the same HDFS file. Will the HDFS client API be able to
cache the file content at the client side? Or will every READ have to move the
whole file through the network, with no sharing between processes?

Demai


On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh 
<riteshoneinamill...@gmail.com<mailto:riteshoneinamill...@gmail.com>> wrote:
Let's assume that hdfs maintains 3 replicas of the 256MB block; then each of
these 3 datanodes will keep just one copy of the block in its respective memory
cache, avoiding the repeated i/o reads. This follows the centralized cache
management policy of hdfs, which also gives you the option to pin only 2 of
these 3 replicas in cache and save the remaining 256MB of cache space. Here's a
link<https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html>
 on the same.

Hope that helps.

Ritesh
