> > No. Typically the hit-to-miss ratio is very high; it's a metric that is
> > recorded in Blur
This is such a handy feature. Thanks for providing such detailed metrics.

Just to add to the benefits of the block cache: I just found out that
readFully and sync (seek+read) in FSDataInputStream happen entirely inside a
synchronized method in Hadoop, which can limit throughput/QPS when multiple
IndexInputs are open for the same Lucene file. The block cache should shine
in such scenarios.

Thanks a lot for your inputs.

--
Ravi

On Thu, Mar 20, 2014 at 5:45 PM, Aaron McCurry <[email protected]> wrote:

> On Wed, Mar 19, 2014 at 1:57 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > One obvious case is a cache-hit scenario, where instead of using the
> > block cache, there is a fairly heavy round trip to the data node. It is
> > also highly likely that the data node might have evicted the hot pages
> > due to other active reads.
>
> Or writes. The normal behavior of the Linux filesystem cache is to cache
> newly written data and evict the oldest data from memory. So during merges
> (or any other writes from other Hadoop processes) the Linux filesystem will
> unload pages that you might be using.
>
> > How much of a cache hit happens in Blur? Will I be correct in saying
> > that only repeated terms occurring in a search will benefit from the
> > block cache?
>
> No. Typically the hit-to-miss ratio is very high; it's a metric that is
> recorded in Blur (you can access it via the Blur shell by running the top
> command). It's not unusual to see hits in the 5000-10000/s range with a
> block size of 64KB while misses occur at the same time at 10-20/s. This
> has a lot to do with how Lucene stores its indexes: they are highly
> compressed files (although not compressed with a generic compression
> scheme).
>
> Let me know if you have any other questions.
>
> Aaron
>
> > --
> > Ravi
> >
> > On Wed, Mar 19, 2014 at 11:06 PM, Ravikumar Govindarajan <
> > [email protected]> wrote:
> >
> > > I was looking at the block-cache code and trying to understand why we
> > > need it.
> > >
> > > We divide the file into blocks of 8KB and write to Hadoop. While
> > > reading, we only read in batches of 8KB and store them in the block
> > > cache.
> > >
> > > This is a form of read-ahead caching on the client side [shard
> > > server]. Am I correct in understanding?
> > >
> > > Recent releases of Hadoop have a notion of read-ahead caching in the
> > > data node itself. The default value is 4MB, but I believe it can also
> > > be configured to whatever is needed.
> > >
> > > What are the advantages of a block cache vis-a-vis the data-node
> > > read-ahead cache?
> > >
> > > I also am not familiar with the Hadoop IO subsystem, as to whether it
> > > is correct and performant to do read-aheads in data nodes for a
> > > use case like Lucene.
> > >
> > > Can someone help me?
> > >
> > > --
> > > Ravi
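To make the block-cache mechanics discussed in this thread concrete, below is a minimal read-through block-cache sketch in Java, assuming the 8KB block size mentioned above. `BlockCacheSketch` and `BlockSource` are hypothetical names for illustration only; `BlockSource` stands in for the underlying (synchronized) positioned read on FSDataInputStream, and this is not Blur's actual implementation, which would also need eviction and careful memory management. The point is simply that only misses pay the cost of the underlying round trip, which is why a high hit-to-miss ratio matters.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a read-through block cache over a positioned read.
// Hypothetical types, not Blur's or Hadoop's API.
public class BlockCacheSketch {
    static final int BLOCK_SIZE = 8 * 1024; // 8KB blocks, as in the thread

    interface BlockSource {
        byte[] readBlock(long blockId); // stands in for the synchronized round trip
    }

    final ConcurrentHashMap<Long, byte[]> cache = new ConcurrentHashMap<>();
    final AtomicLong hits = new AtomicLong();
    final AtomicLong misses = new AtomicLong();
    final BlockSource source;

    BlockCacheSketch(BlockSource source) { this.source = source; }

    // Read `len` bytes starting at `pos`, serving whole blocks from the cache.
    byte[] read(long pos, int len) {
        byte[] out = new byte[len];
        int copied = 0;
        while (copied < len) {
            long blockId = (pos + copied) / BLOCK_SIZE;
            int offset = (int) ((pos + copied) % BLOCK_SIZE);
            byte[] block = cache.get(blockId);
            if (block != null) {
                hits.incrementAndGet();
            } else {
                misses.incrementAndGet();
                block = source.readBlock(blockId); // only misses pay the sync cost
                cache.putIfAbsent(blockId, block);
            }
            int n = Math.min(len - copied, BLOCK_SIZE - offset);
            System.arraycopy(block, offset, out, copied, n);
            copied += n;
        }
        return out;
    }

    public static void main(String[] args) {
        // Fake source: each block's bytes are just its block id.
        BlockCacheSketch c = new BlockCacheSketch(id -> {
            byte[] b = new byte[BLOCK_SIZE];
            java.util.Arrays.fill(b, (byte) id);
            return b;
        });
        c.read(0, 100);   // cold read: one miss
        c.read(50, 100);  // hot read: served entirely from cache
        c.read(8190, 10); // spans blocks 0 and 1: one hit, one miss
        System.out.println("hits=" + c.hits.get() + " misses=" + c.misses.get());
    }
}
```

Concurrent readers hitting cached blocks never touch the underlying stream at all, which sidesteps the contention on the synchronized positioned read described above.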
