On Wed, Mar 19, 2014 at 1:36 PM, Ravikumar Govindarajan <[email protected]> wrote:
> I was looking at block-cache code and trying to understand why we need it.

Though this writeup is for the v1 block cache, the reasons to have a block cache in Blur are still valid:
http://incubator.apache.org/blur/docs/0.2.0/cluster-setup.html#block-cache

> We divide the file into blocks of 8KB and write to hadoop. While reading,
> we only read in batches of 8KB and store in block-cache

If you are looking at v1 then yes. In v2 that is the default size, but it is configurable.

> This is a form of read-ahead caching on the client-side [shard-server]. Am I
> correct in understanding?

It's not a read-ahead, it's just a simple LRU cache.

> Recent releases of hadoop have a notion of read-ahead caching in the data-node
> itself. The default value is 4MB but I believe it can also be configured to
> whatever is needed.

Yes. I think this is in the native packages and it informs the filesystem of the intent to read the entire file. But I may be wrong here, I haven't really dug into the newer data-node read-ahead stuff.

> What are the advantages of a block-cache vis-a-vis the data-node read-ahead
> cache?

The block-cache is inside the shard server process and therefore a lot faster to access than going across the network to the data-node.

> I am also not familiar with the hadoop IO subsystem, as to whether it's correct
> and performant to do read-aheads in data-nodes for a use-case like lucene.

Not really. Lucene is mostly a random access pattern, except when merges are occurring.

Aaron

> Can someone help me?
>
> --
> Ravi
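To make "simple LRU cache" a little more concrete, here is a rough Java sketch of the idea, purely illustrative and not the actual Blur code: the 8KB block size, the (file name, block index) key, and the class/method names are assumptions for the example. Fixed-size blocks are looked up by file and block index, and the least recently used block is evicted once the cache is full.

import java.util.LinkedHashMap;
import java.util.Map;

// Purely illustrative LRU block cache, not the actual Blur implementation.
// Fixed-size blocks (assumed 8KB here) are keyed by file name and block index;
// the least recently used block is evicted once the cache holds maxBlocks entries.
public class SimpleBlockCache {

  static final int BLOCK_SIZE = 8 * 1024; // assumed block size

  private final Map<String, byte[]> cache;

  public SimpleBlockCache(final int maxBlocks) {
    // accessOrder=true makes LinkedHashMap track access order, i.e. behave as an LRU structure
    this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks;
      }
    };
  }

  private static String key(String file, long blockIndex) {
    return file + "/" + blockIndex;
  }

  public synchronized byte[] get(String file, long blockIndex) {
    return cache.get(key(file, blockIndex)); // null means a miss, go to the data-node
  }

  public synchronized void put(String file, long blockIndex, byte[] block) {
    cache.put(key(file, blockIndex), block); // block expected to be BLOCK_SIZE bytes
  }
}

In this sketch, a shard server reading position pos of a file would first check get(file, pos / BLOCK_SIZE) and only go over the network to the data-node on a miss, which is why a cache hit is so much cheaper than relying on read-ahead in the data-node.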
