Aaron, I have another doubt regarding block-cache V1 & V2. V1 uses a slabs/blocks approach, while V2 uses a file-extension -> byte[] mapping.
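
To make sure I am reading the code right, here is a rough sketch of how I
understand the two designs. The class and method names below are mine,
just for illustration; this is not the actual Blur code:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class CacheSketch {

      // V1-style: memory is allocated once as large slabs, carved into
      // fixed-size blocks that are reused rather than re-allocated.
      static class SlabCacheV1 {
        static final int BLOCK_SIZE = 8 * 1024;  // 8KB blocks
        final byte[][] slabs;                    // pre-allocated up front
        SlabCacheV1(int numSlabs, int blocksPerSlab) {
          slabs = new byte[numSlabs][blocksPerSlab * BLOCK_SIZE];
        }
        void storeBlock(int slabId, int blockId, byte[] data) {
          System.arraycopy(data, 0, slabs[slabId],
              blockId * BLOCK_SIZE, data.length);
        }
      }

      // V2-style: each file extension (fdt, tim, ...) maps to its own
      // byte[], sized for that file's share of the cache.
      static class ExtensionCacheV2 {
        final ConcurrentMap<String, byte[]> cacheByExtension =
            new ConcurrentHashMap<String, byte[]>();
        byte[] cacheFor(String extension, int sizeInBytes) {
          byte[] existing = cacheByExtension.get(extension);
          if (existing != null) {
            return existing;
          }
          byte[] fresh = new byte[sizeInBytes];
          existing = cacheByExtension.putIfAbsent(extension, fresh);
          return existing != null ? existing : fresh;
        }
      }
    }

If that mental model is wrong, please correct me.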
I looked at CacheIndexInput, where if the byte[] is fully filled, we
de-alloc it and alloc it again [the releaseCache()/fillCache() methods].
This de-alloc/alloc will be pretty heavy when the configured cache size
becomes large, right? Ex: 300 MB of cache for an FDT file getting filled,
then a de-alloc/alloc of 300 MB again... I am not sure I understand the
cache logic correctly, so I need your help here...

--
Ravi

On Sun, Mar 23, 2014 at 8:24 PM, Ravikumar Govindarajan
<[email protected]> wrote:

>> No. Typically the hit to miss ratio is very high, it's a metric that is
>> recorded in Blur
>
> This is such a handy feature. Thanks for providing such detailed metrics.
>
> Just to add to the benefits of the block-cache, I just found out that
> readFully or a sync (seek+read) in FSDataInputStream occurs entirely in
> a synchronized method in Hadoop, which could limit throughput/QPS when
> multiple IndexInputs are open for the same Lucene file.
>
> Block-cache should shine in such scenarios...
>
> Thanks a lot for your inputs.
>
> --
> Ravi
>
> On Thu, Mar 20, 2014 at 5:45 PM, Aaron McCurry <[email protected]> wrote:
>
>> On Wed, Mar 19, 2014 at 1:57 PM, Ravikumar Govindarajan
>> <[email protected]> wrote:
>>
>> > One obvious case is a cache-hit scenario, where instead of using the
>> > block-cache, there is a fairly heavy round-trip to the data-node. It
>> > is also highly likely that the data-node might have evicted the hot
>> > pages due to other active reads.
>>
>> Or writes. The normal behavior of the Linux filesystem cache is to
>> cache newly written data and evict the oldest data from memory. So
>> during merges (or any other writes from other Hadoop processes) the
>> Linux filesystem will unload pages that you might be using.
>>
>> > How much of a cache-hit rate happens in Blur? Will I be correct in
>> > saying that only repeated terms occurring in searches will benefit
>> > from the block-cache?
>>
>> No. Typically the hit to miss ratio is very high; it's a metric that is
>> recorded in Blur (you can access it via the Blur shell by running the
>> top command). It's not unusual to see hits in the 5000-10000/s range
>> with a block size of 64KB and misses occurring at the same time between
>> 10-20/s. This has a lot to do with how Lucene stores its indexes: they
>> are highly compressed files (although not compressed with a generic
>> compression scheme).
>>
>> Let me know if you have any other questions.
>>
>> Aaron
>>
>> > --
>> > Ravi
>> >
>> > On Wed, Mar 19, 2014 at 11:06 PM, Ravikumar Govindarajan
>> > <[email protected]> wrote:
>> >
>> > > I was looking at the block-cache code and trying to understand why
>> > > we need it.
>> > >
>> > > We divide the file into blocks of 8KB and write them to Hadoop.
>> > > While reading, we only read in batches of 8KB and store them in the
>> > > block-cache.
>> > >
>> > > This is a form of read-ahead caching on the client side
>> > > [shard-server]. Am I correct in this understanding?
>> > >
>> > > Recent releases of Hadoop have a notion of read-ahead caching in
>> > > the data-node itself. The default value is 4MB, but I believe it
>> > > can also be configured to whatever is needed.
>> > >
>> > > What are the advantages of a block-cache vis-a-vis the data-node
>> > > read-ahead cache?
>> > >
>> > > I am also not familiar with the Hadoop IO sub-system, as to whether
>> > > it's correct and performant to do read-aheads in data-nodes for a
>> > > use-case like Lucene.
>> > >
>> > > Can someone help me?
>> > >
>> > > --
>> > > Ravi
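
P.S. For anyone else reading the thread later, the 8KB block arithmetic
discussed above works out roughly like this. This is only my sketch of
the math, not the actual Blur implementation (the block size appears to
be configurable; Aaron mentions 64KB deployments):

    // Sketch of block-aligned reads against an 8KB block-cache.
    public class BlockReadSketch {
      static final int BLOCK_SIZE = 8 * 1024; // 8KB, per the description above

      public static void main(String[] args) {
        long filePointer = 1234567L;                        // arbitrary read position
        long blockId = filePointer / BLOCK_SIZE;            // which 8KB block holds it
        int blockOffset = (int) (filePointer % BLOCK_SIZE); // offset inside that block
        System.out.println("block=" + blockId + ", offset=" + blockOffset);
        // On a cache miss the shard-server would read the whole 8KB block
        // from HDFS and cache it, so nearby reads hit memory instead of
        // making another round-trip to the data-node.
      }
    }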
