Hi Shai,

On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <ser...@gmail.com> wrote:
> Hi
>
> Lucene's Caches have been heavily discussed before (e.g., LUCENE-831,
> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
> many proposals to attack this problem, w/ no developed solution.

I didn't go through those issues, so forgive me if something I bring up
has already been discussed. I have a couple of questions about your
proposal - please find them inline...

>
> I'd like to explore a different, IMO much simpler, angle to attack this
> problem. Instead of having Lucene manage the Cache itself, we let the
> application manage it, however Lucene will provide the necessary hooks
> in IndexReader to allow it. The hooks I have in mind are:
>
> (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
> already exists.
>
> (2) When reopen() is called, Lucene will take care to call a
> Cache.load(IndexReader), so that the application can pull whatever
> information it needs from the passed-in IndexReader.

Would that do anything other than pass the new reader (if reopened) to the
cache's load method? I wonder if this is more than

  if (newReader != oldReader)
    Cache.load(newReader)

If so, something like that should be done on a segment reader anyway,
right? From my perspective this isn't more than a callback or visitor that
should walk through the subreaders and be called for each reopened
sub-reader. A cache-warming visitor / callback would then be trivial and
the API would be more general.

> So to be more concrete on my proposal, I'd like to support caching in
> the following way (and while I've spent some time thinking about it, I'm
> sure there are great suggestions to improve it):
>
> * Application provides a CacheFactory to IndexReader.open/reopen, which
> exposes some very simple API, such as createCache, or
> initCache(IndexReader) etc. Something which returns a Cache object,
> which does not have very strict/concrete API.

My first question would be why the reader should know about Cache if there
is no strict / concrete API. I can follow you with the CacheFactory to
create cache objects, but why would the reader have to know about /
"receive" this object?
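
To make the reopen callback idea above a bit more concrete, here is a minimal sketch. All names (SegmentOpenListener, ReaderSnapshot) are hypothetical and are not existing Lucene API; segments are modeled as plain strings instead of real segment readers, and reopen() here only mimics the "same reader if nothing changed" behaviour.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical callback; not an existing Lucene interface.
interface SegmentOpenListener {
    void onSegmentOpened(String segmentName);
}

// Stand-in for a composite reader: just an ordered set of segment names.
final class ReaderSnapshot {
    final Set<String> segments;

    ReaderSnapshot(List<String> segments) {
        this.segments = new LinkedHashSet<>(segments);
    }

    // Mimics reopen(): the listener is called only for segments
    // that the old snapshot did not already contain.
    ReaderSnapshot reopen(List<String> currentSegments, SegmentOpenListener listener) {
        ReaderSnapshot next = new ReaderSnapshot(currentSegments);
        if (next.segments.equals(this.segments)) {
            return this; // nothing changed: same reader, no callbacks
        }
        for (String name : next.segments) {
            if (!this.segments.contains(name)) {
                listener.onSegmentOpened(name);
            }
        }
        return next;
    }
}
```

A cache warmer would then just be one implementation of the listener, and nothing in the reader would need to know that caching is what the application does with it.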

Maybe this is answered further down the path, but I don't see why the
notion of a "cache" must exist within open/reopen, or whether that could
be implemented in a more general, "cache"-agnostic way.

>
> * IndexReader, most probably at the SegmentReader level uses
> CacheFactory to create a new Cache instance and calls its
> load(IndexReader) method, so that the Cache would initialize itself.

That is what I was thinking above - yet is that more than a callback for
each reopened or opened segment reader?

>
> * The application can use CacheFactory to obtain the Cache object per
> IndexReader (for example, during Collector.setNextReader), or we can
> have IndexReader offer a getCache() method.

:) Until here the cache is only used by the application itself, not by any
Lucene API, right? I can think of a lot of application-specific data that
could be useful to associate with an IR beyond the caching use case -
again, this could be a more general API solving that problem.

>
> * One of Cache API would be getCache(TYPE), where TYPE is a String or
> Object, or an interface CacheType w/ no methods, just to be a marker
> one, and the application is free to impl it however it wants. That's a
> loose API, I know, but completely at the application hands, which makes
> Lucene code simpler.

I like the idea together with the metadata-associating functionality from
above, something like public T IndexReader#get(Type<T> type). Hmm, that
looks quite similar to Attributes, doesn't it?! :) However, this could be
done in many ways, but again "cache"-agnostic.

>
> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
> provide the user w/ IndexReader-similar API, only more efficient than
> say TermDocs -- something w/ random access to the docs inside, perhaps
> even an OpenBitSet. Lucene can take advantage of it if, say, we create a
> CachingSegmentReader which makes use of the cache, and checks every time
> termDocs() is called if the required Term is cached or not etc.
> I admit I may be thinking too much ahead.

I see what you are trying to do here. I also see how this could be useful,
but I guess coming up with a stable API which serves lots of applications
would be quite hard. A CachingSegmentReader could be a very simple
decorator which would not require touching the IR interface. Something
like that could be part of Lucene, but I'm not sure it necessarily belongs
in Lucene core.

> That's more or less what I've been thinking. I'm sure there are many
> details to iron out, but I hope I've managed to pass the general
> proposal through to you.

Absolutely, this is how it works, isn't it!

>
> What I'm after first, is to allow applications deal w/ postings caching more
> natively. For example, if you have a posting w/ payloads you'd like to
> read into memory, or if you would like a term's TermDocs to be cached
> (to be used as a Filter) etc. -- instead of writing something that can
> work at the top IndexReader level, you'd be able to take advantage of
> Lucene internals, i.e. refresh the Cache only for the new segments ...

I wonder if a custom codec would be the right place to implement caching /
memory-resident structures for postings with payloads etc. You could do
that on a higher level too, but a codec seems to be the way to go here,
right? To utilize per-segment capabilities, a callback for (re)opened
segment readers would be sufficient, or am I missing something?

simon

>
> I'm sure that after this will be in place, we can refactor FieldCache to
> work w/ that API, perhaps as a Cache specific implementation. But I
> leave that for later.
>
> I'd appreciate your comments. Before I set to implement it, I'd like to
> know if the idea has any chances of making it to a commit :).
>
> Shai
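
For reference, the decorator idea mentioned above for CachingSegmentReader could look roughly like the sketch below. PostingsSource and CachingPostingsSource are hypothetical names, not Lucene classes; the interface is only a stand-in for whatever part of a segment reader hands out the doc ids for a term, and the cache is a plain unsynchronized HashMap, so this assumes single-threaded use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the part of a segment reader
// that returns the doc ids for a term.
interface PostingsSource {
    int[] docs(String term);
}

// Decorator: exposes the same interface, caches results per term,
// and leaves the wrapped reader untouched.
final class CachingPostingsSource implements PostingsSource {
    private final PostingsSource delegate;
    private final Map<String, int[]> cache = new HashMap<>();

    CachingPostingsSource(PostingsSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public int[] docs(String term) {
        int[] cached = cache.get(term);
        if (cached == null) {
            // First request for this term: ask the wrapped source once.
            cached = delegate.docs(term);
            cache.put(term, cached);
        }
        return cached;
    }
}
```

Since one such wrapper would exist per segment, a reopen would only need to create wrappers for new segments, and the reader interface itself would stay unchanged.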