On Wed, Mar 19, 2014 at 1:36 PM, Ravikumar Govindarajan <
[email protected]> wrote:

> I was looking at block-cache code and trying to understand why we need it.
>

Though this writeup is for the v1 block cache, the reasons to have a
block cache in Blur are still valid:

http://incubator.apache.org/blur/docs/0.2.0/cluster-setup.html#block-cache


>
> We divide the file into blocks of 8KB and write to hadoop. While reading,
> we only read in batches of 8KB and store in block-cache
>

If you are looking at v1, then yes.  In v2 that is the default block size,
but it is configurable.
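
To make that concrete, here is a minimal sketch (not Blur's actual code) of
how a logical file position maps onto a block id and an offset within that
block, given a configurable block size.  The class and method names are just
placeholders for illustration:

// Minimal sketch, not Blur's implementation: mapping a logical file position
// onto a cache block, assuming a configurable block size.
public class BlockAddress {

  // Which block of the file the position falls in.
  public static long blockId(long position, int blockSize) {
    return position / blockSize;
  }

  // Offset of the position inside that block.
  public static int blockOffset(long position, int blockSize) {
    return (int) (position % blockSize);
  }

  public static void main(String[] args) {
    int blockSize = 8 * 1024; // the 8KB default mentioned above
    long position = 100000L;
    System.out.println("block=" + blockId(position, blockSize)
        + " offset=" + blockOffset(position, blockSize));
    // prints: block=12 offset=1696
  }
}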


>
> This is a form of read-ahead caching on the client-side[shard-server]. Am I
> correct in understanding?
>

It's not a read-ahead; it's just a simple LRU cache.
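
For illustration only, that kind of cache can be sketched with a
LinkedHashMap in access order.  The real block cache manages off-heap slabs
and concurrency, so the class name and key/value types here are just
placeholders:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a simple LRU cache of file blocks keyed by "file/blockId".
// Blur's real block cache is more involved (off-heap slabs, concurrency), but
// the eviction behavior is this kind of least-recently-used policy.
public class SimpleLruBlockCache extends LinkedHashMap<String, byte[]> {

  private final int maxBlocks;

  public SimpleLruBlockCache(int maxBlocks) {
    super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
    this.maxBlocks = maxBlocks;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
    // Evict the least recently used block once the cache is full.
    return size() > maxBlocks;
  }
}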


>
> Recent releases of hadoop have a notion of read-ahead caching in data-node
> itself. The default value is 4MB but I believe it can also be configured to
> whatever is needed.
>

Yes.  I think this is in the native packages, and it informs the filesystem
of the intent to read the entire file.  But I may be wrong here; I haven't
really dug into the newer data-node read-ahead stuff.
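
For reference, and based on my reading of the Hadoop 2.x configuration docs
(so please double check), the knob is dfs.datanode.readahead.bytes, a
data-node side setting that normally lives in hdfs-site.xml, defaults to
4 MB, and only takes effect when the native libhadoop libraries are
installed.  In Java terms the assumption looks like this:

import org.apache.hadoop.conf.Configuration;

// Assumption from the Hadoop 2.x docs, not from Blur code:
// dfs.datanode.readahead.bytes controls the data-node read-ahead window
// (4 MB by default) and needs the native libhadoop libraries to take effect.
// It is a data-node setting, so in practice it goes in hdfs-site.xml on the
// data-nodes; this snippet only shows the key and a sample value.
public class ReadaheadConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.datanode.readahead.bytes", 8L * 1024 * 1024); // 8 MB
    System.out.println(conf.get("dfs.datanode.readahead.bytes"));
  }
}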


>
> What are the advantages of a block-cache vis-a-vis data-node read-ahead
> cache?
>

The block-cache is inside the shard server process and therefore a lot
faster to access than going across the network to the data-node.
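
A rough sketch of why that matters, reusing the hypothetical
SimpleLruBlockCache from above and a plain Hadoop FSDataInputStream (the
real Blur Directory code is more involved than this):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;

// Rough sketch of the read path: a cache hit is served out of shard-server
// memory, and only a miss pays the network round trip to the data-node.
// Ignores the short final block of a file for brevity.
public class CachedBlockReader {

  private final SimpleLruBlockCache cache;
  private final FSDataInputStream input;
  private final String fileName;
  private final int blockSize;

  public CachedBlockReader(SimpleLruBlockCache cache, FSDataInputStream input,
      String fileName, int blockSize) {
    this.cache = cache;
    this.input = input;
    this.fileName = fileName;
    this.blockSize = blockSize;
  }

  public byte[] readBlock(long blockId) throws IOException {
    String key = fileName + "/" + blockId;
    byte[] block = cache.get(key);
    if (block == null) {
      // Cache miss: go across the network to the data-node, then keep the block.
      block = new byte[blockSize];
      input.readFully(blockId * blockSize, block);
      cache.put(key, block);
    }
    return block;
  }
}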


>
> I also am not familiar with hadoop IO sub-system as to whether it's correct
> and performant to do read-aheads in data-nodes for a use-case like lucene.
>

Not really.  Lucene's access pattern is mostly random, except when merges
are occurring.

Aaron


>
> Can someone help me?
>
> --
> Ravi
>
