Edward,

Interesting concept. I imagine that implementing a "CachedInputFormat" on
top of something like memcached would be the most straightforward
approach. You could store the 64 MB chunks in memcached and try to
retrieve them from there, falling back to the filesystem on a miss. One
obvious drawback is that a memcached cluster might store those blocks on
different servers than the file chunks themselves, leading to more
network transfers during the map phase. I don't know whether it's
possible to "pin" objects in memcached to particular nodes; you'd want
that for mapper locality.
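
Just to make the fallback concrete, here's a rough sketch of the read
path (not a full InputFormat). The class name, key scheme, cache host,
and expiry below are all made up for illustration, and it assumes the
spymemcached client. Note that memcached's default item size limit is
1 MB, so whole 64 MB blocks would have to be chunked or the limit raised:

import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachedBlockReader {
  private final MemcachedClient cache;
  private final FileSystem fs;

  public CachedBlockReader(Configuration conf) throws IOException {
    // Hypothetical cache host; a real deployment would list every server.
    this.cache = new MemcachedClient(new InetSocketAddress("cache-host", 11211));
    this.fs = FileSystem.get(conf);
  }

  // Hypothetical key scheme: file path plus block offset.
  private static String blockKey(Path file, long offset) {
    return file.toUri().getPath() + "@" + offset;
  }

  public byte[] readBlock(Path file, long offset, int len) throws IOException {
    byte[] block = (byte[]) cache.get(blockKey(file, offset));
    if (block != null) {
      return block; // cache hit: no HDFS read at all
    }
    // Cache miss: fall back to HDFS, then populate the cache.
    byte[] buf = new byte[len];
    FSDataInputStream in = fs.open(file);
    try {
      in.readFully(offset, buf);
    } finally {
      in.close();
    }
    cache.set(blockKey(file, offset), 3600, buf); // expiry is arbitrary
    return buf;
  }
}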

I would say, though, that carving 1 GB out of 8 GB on a datanode is
somewhat ambitious. It's been my observation that people tend to write
memory-hungry mappers. If you've got 8 cores in a node and 1 GB each has
already gone to the OS, the datanode, and the tasktracker, that leaves
only 5 GB for task processes, and running 6 or 8 map tasks concurrently
can easily gobble that up. On a 16 GB datanode with 8 cores, though, you
might have that much wiggle room.
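
For what it's worth, the same arithmetic in mapred-site.xml terms looks
something like this; the slot count and heap size are just illustrative
numbers for the 8 GB / 8-core node above:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <!-- 6 tasks x 800 MB = 4.8 GB, nearly all of the 5 GB that's left -->
  <value>-Xmx800m</value>
</property>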

- Aaron


On Tue, Oct 6, 2009 at 8:16 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> After looking at the HBaseRegionServer and its functionality, I began
> wondering if there is a more general use case for memory caching of
> HDFS blocks/files. In many use cases people wish to store data on
> Hadoop indefinitely; however, the data from the last day, week, or
> month is probably the most actively used. For some Hadoop clusters the
> amount of raw new data could be less than the total RAM in the
> cluster.
>
> Also, some data will be used repeatedly: the same source data may be
> used to generate multiple result sets, and those results may in turn
> be the input to other processes.
>
> I am thinking an answer could be to dedicate an amount of physical
> memory on each DataNode, or on several dedicated nodes, to a
> distributed memcache-like layer. Managing this cache should be
> straightforward, since Hadoop blocks are pretty much static. (So, say,
> for a DataNode with 8 GB of memory, dedicate 1 GB to a
> HadoopCacheServer.) If you had 1,000 nodes, that cache would be quite
> large.
>
> Additionally, we could create a new filesystem type, cachedhdfs,
> implemented as a facade, or possibly implement a CachedInputFormat or
> CachedOutputFormat.
>
> I know that the underlying filesystems have their own caches, but I
> think Hadoop writing intermediate data is going to evict some of the
> data which "should be" semi-permanent.
>
> So has anyone looked into something like this? This was the closest
> thing I found.
>
> http://issues.apache.org/jira/browse/HADOOP-288
>
> My goal here is to keep recent data in memory so that tools like Hive
> can get a big boost on queries for new data.
>
> Does anyone have any ideas?
>
