IndexReader Cache - a different angle

Shai Erera Sat, 11 Sep 2010 11:08:48 -0700

Hi

Lucene's Caches have been heavily discussed before (e.g., LUCENE-831,
LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
many proposals to attack this problem, w/ no developed solution.


I'd like to explore a different, IMO much simpler, angle to attack this
problem. Instead of having Lucene manage the Cache itself, we let the
application manage it, while Lucene provides the necessary hooks
in IndexReader to allow it. The hooks I have in mind are:

(1) IndexReader's current API for TermDocs, TermEnum, TermPositions etc. --
already exists.

(2) When reopen() is called, Lucene will take care to call
Cache.load(IndexReader), so that the application can pull whatever
information it needs from the passed-in IndexReader.

So to be more concrete on my proposal, I'd like to support caching in
the following way (and while I've spent some time thinking about it, I'm
sure there are great suggestions to improve it):

* Application provides a CacheFactory to IndexReader.open/reopen, which
exposes some very simple API, such as createCache, or
initCache(IndexReader) etc. Something which returns a Cache object,
which does not have very strict/concrete API.

* IndexReader, most probably at the SegmentReader level, uses
CacheFactory to create a new Cache instance and calls its
load(IndexReader) method, so that the Cache can initialize itself.

* The application can use CacheFactory to obtain the Cache object per
IndexReader (for example, during Collector.setNextReader), or we can
have IndexReader offer a getCache() method.

* One of the Cache API methods would be getCache(TYPE), where TYPE is a
String or Object, or an interface CacheType w/ no methods, just to be a
marker one, and the application is free to impl it however it wants.
That's a loose API, I know, but completely in the application's hands,
which makes Lucene code simpler.

* We can introduce a TermsCache, TermEnumCache and TermVectorCache to
provide the user w/ an IndexReader-like API, only more efficient than
say TermDocs -- something w/ random access to the docs inside, perhaps
even an OpenBitSet. Lucene can take advantage of it if, say, we create a
CachingSegmentReader which makes use of the cache and checks, every time
termDocs() is called, whether the required Term is cached or not etc. I
admit I may be thinking too far ahead.
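
To make the shape of the API in the bullets above a bit more tangible, here
is a rough Java sketch. It is only an illustration under assumptions, not
actual Lucene code: CacheFactory, Cache, CacheType, createCache, load and
getCache are the names proposed above, while SegmentView (a stand-in for a
per-segment IndexReader), MyCacheTypes and DocCountCache are hypothetical
names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-segment IndexReader; NOT a Lucene type.
interface SegmentView {
    String segmentName();
    int maxDoc();
}

// Marker interface w/ no methods; the application defines its own types.
interface CacheType {}

// Loosely typed cache attached to a single reader.
interface Cache {
    // Called by the reader once, so the cache can pull what it needs.
    void load(SegmentView reader);

    // The application decides what object each type maps to.
    Object getCache(CacheType type);
}

// Supplied by the application when opening/reopening a reader.
interface CacheFactory {
    Cache createCache();
}

// --- application side (all names below are hypothetical) ---

enum MyCacheTypes implements CacheType { DOC_COUNT }

final class DocCountCache implements Cache {
    private final Map<CacheType, Object> entries = new HashMap<>();

    @Override
    public void load(SegmentView reader) {
        // A real cache would read postings, payloads etc. here.
        entries.put(MyCacheTypes.DOC_COUNT, reader.maxDoc());
    }

    @Override
    public Object getCache(CacheType type) {
        return entries.get(type);
    }
}
```

The application would then cast whatever getCache returns to the type it
stored, which is the price of keeping the API this loose.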

That's more or less what I've been thinking. I'm sure there are many
details to iron out, but I hope I've managed to get the general
proposal across.

What I'm after first is to allow applications to deal w/ postings caching
more natively. For example, if you have a posting w/ payloads you'd like to
read into memory, or if you would like a term's TermDocs to be cached
(to be used as a Filter) etc. -- instead of writing something that can
work at the top IndexReader level, you'd be able to take advantage of
Lucene internals, i.e. refresh the Cache only for the new segments ...
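
As a rough illustration of "refresh the Cache only for the new segments",
the following Java sketch keeps one cache per segment and, on each reopen,
loads caches only for segments it has not seen before. It is a sketch under
assumptions: PerSegmentCaches and its methods are hypothetical names, and
segments are identified by plain String names rather than by real
SegmentReader instances.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper; not part of Lucene.
final class PerSegmentCaches<C> {
    private final Function<String, C> loader;           // builds a cache for one segment
    private Map<String, C> current = new LinkedHashMap<>();
    private final List<String> loadedLog = new ArrayList<>();

    PerSegmentCaches(Function<String, C> loader) {
        this.loader = loader;
    }

    // Call w/ the segment names visible after an open or reopen.
    void refresh(List<String> liveSegments) {
        Map<String, C> next = new LinkedHashMap<>();
        for (String seg : liveSegments) {
            C cached = current.get(seg);
            if (cached == null) {
                // Only segments not seen before pay the loading cost.
                cached = loader.apply(seg);
                loadedLog.add(seg);
            }
            next.put(seg, cached);
        }
        // Caches of segments that were merged away are dropped here.
        current = next;
    }

    C get(String segment) {
        return current.get(segment);
    }

    // Segments whose cache was actually loaded, in load order.
    List<String> loadedSegments() {
        return loadedLog;
    }
}
```

W/ hooks at the SegmentReader level this bookkeeping would live inside
Lucene rather than in application code, which is the point of the proposal.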

I'm sure that once this is in place, we can refactor FieldCache to
work w/ that API, perhaps as a specific Cache implementation. But I
leave that for later.

I'd appreciate your comments. Before I set out to implement it, I'd like to
know if the idea has any chance of making it to a commit :).

Shai
