As you pointed out, there will be some penalty for this cache, especially
when the number of rowids increases. Interacting with this cache during
IndexReader open/close is going to have some overhead.

Instead, can we decouple this and make it a "write-through-cache"?

Ex: Map<SegName, Ref-Counted-PrimeDocBitSet>

Codec will publish new data to this cache on flush [new-segment-creation].

Every access can be ref-counted, and during segment removal [merges]
obsolete entries can be queued and removed from the cache once the
ref-count drops to zero.

Typically I feel that this cache should be free of IndexReader open/close,
but rather live till BlurNRTIndex.close() is called. Then the overhead is
really minimal.
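To make the idea concrete, here is a minimal Java sketch of such a
write-through, ref-counted cache. All names here (PrimeDocCache,
publish/acquire/release/retire, RefCountedBitSet) are purely illustrative,
not existing Blur or Lucene APIs:

```java
import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical write-through prime-doc cache keyed by segment name.
// The codec publishes on flush; merges retire; readers acquire/release.
public class PrimeDocCache {

    // A bitset whose lifetime is governed by a reference count.
    public static final class RefCountedBitSet {
        private final BitSet bits;
        // The cache itself holds one reference until the entry is retired.
        private final AtomicInteger refs = new AtomicInteger(1);
        private volatile boolean retired = false;

        RefCountedBitSet(BitSet bits) { this.bits = bits; }

        public BitSet bits() { return bits; }
        void incRef() { refs.incrementAndGet(); }
        boolean decRef() { return refs.decrementAndGet() == 0; }
    }

    private final ConcurrentHashMap<String, RefCountedBitSet> cache =
        new ConcurrentHashMap<>();

    // Codec publishes new prime-doc data when a segment is flushed.
    public void publish(String segName, BitSet primeDocs) {
        cache.put(segName, new RefCountedBitSet(primeDocs));
    }

    // Readers acquire a ref-counted entry; null if the segment is unknown.
    public RefCountedBitSet acquire(String segName) {
        RefCountedBitSet e = cache.get(segName);
        if (e != null) e.incRef();
        return e;
    }

    // Readers release when done; a retired entry is evicted once unreferenced.
    public void release(String segName, RefCountedBitSet e) {
        if (e.decRef() && e.retired) cache.remove(segName, e);
    }

    // Merges retire obsolete segments; eviction is deferred until the
    // ref-count drops to zero, so in-flight readers are never broken.
    public void retire(String segName) {
        RefCountedBitSet e = cache.get(segName);
        if (e == null) return;
        e.retired = true;
        if (e.decRef()) cache.remove(segName, e); // drop the cache's own ref
    }

    public int size() { return cache.size(); }
}
```

The key point is that retirement (on merge) and eviction (on last release)
are decoupled, so the cache never needs to touch IndexReader open/close.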

What do you think?

--
Ravi




On Sat, Nov 9, 2013 at 9:52 AM, Aaron McCurry <[email protected]> wrote:

> On Fri, Nov 8, 2013 at 2:22 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Wow, this saving of filters in a custom-codec is super-cool.
> >
> > Let me describe the problem I was thinking about.
> >
> > Assuming we have the RAMDir and Disk swap approach, I was just starting
> > to deliberate on the Read path.
> >
> > PrimeDocCache looks like a challenge for this approach, as the same row
> > will now be present across multiple segments. Each segment will have a
> > "PrimeDoc" field per-row, but during merge this info gets duplicated for
> > each row.
> >
> > I was thinking of recording the "start-doc" of each row to a separate
> > file, via a custom codec, like you have done for FilterCache.
> >
> > During warm-up, it can read the entire file containing "start-docs" and
> > populate the PrimeDocCache.
> >
>
> I like the idea.  I tend to prototype to figure out how hard and how
> performant a solution will be.  :-)  Let's see if we can make it work.
>
> Aaron
>
>
> >
> > --
> > Ravi
> >
> >
> >
> >
> > On Fri, Nov 8, 2013 at 5:04 AM, Aaron McCurry <[email protected]>
> wrote:
> >
> > > So filter cache is really just a place holder for keeping Lucene
> > > Filters around between queries.  The DefaultFilterCache class does
> > > nothing, however I have implemented one that I make use of regularly.
> > >
> > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-core/src/main/java/org/apache/blur/manager/AliasBlurFilterCache.java;h=92491d0ceb3e7ce09902110e3bac5fa485959dab;hb=apache-blur-0.2
> > >
> > > If you write your own and you want to build a logical bitset cache
> > > for the filter (so it's faster) take a look at the
> > > "org.apache.blur.filter.FilterCache" class.  It wraps an existing
> > > filter, loads it into the block cache and writes it to disk (via the
> > > Directory).  The filters live with the segment so if the segment gets
> > > removed so will the on-disk "filter" and the in-memory cache of it.
> > >
> > > On Thu, Nov 7, 2013 at 8:08 AM, Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > > > Great. In such a case, it will help me in building a "rowid"
> > > > filter-cache.
> > > >
> > > > I saw Blur having a DefaultFilterCache class. Is this the class that
> > > > needs to be customized? Will NRT re-opens [reader-close/open, with
> > > > applyAllDeletes] take care of auto-invalidating such a cache?
> > > >
> > >
> > > Filtering is a query operation so for each new segment (NRT re-opens)
> > > the Lucene Filter API handles creating a new filter for that segment.
> > > The delete operations are up to how you code the Filter.  But that's
> > > all Lucene code.
> > >
> > > The DefaultFilterCache just allows you to cache the filter objects
> > > themselves and it provides callbacks when table/shards are opened and
> > > closed.
> > >
> > > Aaron
> > >
> > >
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > > > On Thu, Nov 7, 2013 at 5:44 PM, Aaron McCurry <[email protected]>
> > > wrote:
> > > >
> > > > > Yes.  But I believe the "rowId" needs to be "rowid".
> > > > >
> > > > > Aaron
> > > > >
> > > > >
> > > > > On Thu, Nov 7, 2013 at 5:16 AM, Ravikumar Govindarajan <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Does Blur permit queries with rowId?
> > > > > >
> > > > > > Ex:
> > > > > > docs.body:hello AND rowId:123
> > > > > >
> > > > > > Is it possible to optimize such queries with filter-caching
> etc...?
> > > > > >
> > > > > > --
> > > > > > Ravi
> > > > > >
> > > > >
> > > >
> > >
> >
>