Hi,

Lucene's caches have been heavily discussed before (e.g., LUCENE-831, LUCENE-2133 and LUCENE-2394), and from what I can tell there have been many proposals to attack this problem, w/ no developed solution.
I'd like to explore a different, IMO much simpler, angle to attack this problem. Instead of having Lucene manage the cache itself, we let the application manage it, while Lucene provides the necessary hooks in IndexReader to allow it. The hooks I have in mind are:

(1) IndexReader's current API for TermDocs, TermEnum, TermPositions etc. -- already exists.
(2) When reopen() is called, Lucene will take care to call Cache.load(IndexReader), so that the application can pull whatever information it needs from the passed-in IndexReader.

To be more concrete about my proposal, I'd like to support caching in the following way (and while I've spent some time thinking about it, I'm sure there are great suggestions to improve it):

* The application provides a CacheFactory to IndexReader.open/reopen, which exposes a very simple API, such as createCache, or initCache(IndexReader) etc. Something which returns a Cache object, which itself does not have a very strict/concrete API.
* IndexReader, most probably at the SegmentReader level, uses the CacheFactory to create a new Cache instance and calls its load(IndexReader) method, so that the Cache can initialize itself.
* The application can use the CacheFactory to obtain the Cache object per IndexReader (for example, during Collector.setNextReader), or we can have IndexReader offer a getCache() method.
* One of the Cache APIs would be getCache(TYPE), where TYPE is a String or Object, or an interface CacheType w/ no methods, just a marker, and the application is free to implement it however it wants. That's a loose API, I know, but it is completely in the application's hands, which makes the Lucene code simpler.
* We can introduce a TermsCache, TermEnumCache and TermVectorCache to provide the user w/ an IndexReader-like API, only more efficient than, say, TermDocs -- something w/ random access to the docs inside, perhaps even an OpenBitSet.
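A rough sketch of how the hooks listed above could fit together, in plain Java. Everything in it is hypothetical and only illustrates the shape of the idea: none of these types exist in Lucene, SegmentView is a stand-in for a per-segment IndexReader so that the snippet compiles on its own, java.util.BitSet stands in for OpenBitSet, and SegmentCaches mimics what open/reopen would do internally (create and load a Cache only for segments that have not been seen before).

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a per-segment IndexReader (assumption, NOT a Lucene type).
interface SegmentView {
    String name();
    List<Integer> docsFor(String term); // stand-in for iterating TermDocs
}

// Marker interface w/ no methods, as in the getCache(TYPE) bullet.
interface CacheType {}

// Hypothetical loose Cache contract: load once per segment, look up by type.
interface Cache {
    void load(SegmentView reader);
    Object getCache(CacheType type);
}

// Hypothetical factory the application would pass to open/reopen.
interface CacheFactory {
    Cache createCache();
}

// Application-defined cache type: "the docs of this term".
final class TermDocsType implements CacheType {
    final String term;
    TermDocsType(String term) { this.term = term; }
    @Override public boolean equals(Object o) {
        return o instanceof TermDocsType && ((TermDocsType) o).term.equals(term);
    }
    @Override public int hashCode() { return term.hashCode(); }
}

// Application-defined cache: one random-access bit set per configured term.
final class TermDocsCache implements Cache {
    private final List<String> terms;
    private final Map<CacheType, BitSet> bits = new HashMap<>();
    TermDocsCache(List<String> terms) { this.terms = terms; }
    @Override public void load(SegmentView reader) {
        for (String term : terms) {
            BitSet set = new BitSet();
            for (int doc : reader.docsFor(term)) set.set(doc);
            bits.put(new TermDocsType(term), set);
        }
    }
    @Override public Object getCache(CacheType type) { return bits.get(type); }
}

// Mimics the reader side: a Cache is created and loaded only for new segments.
final class SegmentCaches {
    private final CacheFactory factory;
    private final Map<String, Cache> bySegment = new HashMap<>();
    int loads = 0; // number of load() calls, exposed for illustration only
    SegmentCaches(CacheFactory factory) { this.factory = factory; }
    Cache cacheFor(SegmentView segment) {
        return bySegment.computeIfAbsent(segment.name(), n -> {
            Cache cache = factory.createCache();
            cache.load(segment);
            loads++;
            return cache;
        });
    }
}

// Tiny in-memory segment so the sketch can be exercised without an index.
final class InMemorySegment implements SegmentView {
    private final String name;
    private final Map<String, List<Integer>> postings;
    InMemorySegment(String name, Map<String, List<Integer>> postings) {
        this.name = name;
        this.postings = postings;
    }
    @Override public String name() { return name; }
    @Override public List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptyList());
    }
}
```

In this sketch an application would call cacheFor(segment) from something like Collector.setNextReader and cast the result of getCache(new TermDocsType("someTerm")) to a BitSet. Asking again for an already-seen segment returns the existing Cache without another load(), which is the "refresh only the new segments" behavior the proposal is after.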
Lucene itself can take advantage of such a cache if, say, we create a CachingSegmentReader which makes use of it and checks, every time termDocs() is called, whether the required Term is cached or not etc. I admit I may be thinking too far ahead.

That's more or less what I've been thinking. I'm sure there are many details to iron out, but I hope I've managed to get the general proposal across.

What I'm after first is to allow applications to deal w/ postings caching more natively. For example, if you have a posting w/ payloads you'd like to read into memory, or if you would like a term's TermDocs to be cached (to be used as a Filter) etc. -- instead of writing something that can work only at the top IndexReader level, you'd be able to take advantage of Lucene internals, i.e. refresh the Cache only for the new segments ...

I'm sure that once this is in place, we can refactor FieldCache to work w/ that API, perhaps as a specific Cache implementation. But I leave that for later.

I'd appreciate your comments. Before I set out to implement it, I'd like to know if the idea has any chance of making it to a commit :).

Shai