And it would be nice to have hooks in Lucene so I can avoid managing references to the IndexReader on reopen() and close() myself.
Oh... and to complicate things, my index is near-real-time using IndexWriter.getReader(), so it's not just IndexReader we need to change -- IndexWriter should also provide a reader that has a proper FieldCache implementation. And I'm a bit uncomfortable digging that deep :)

On Mon, Sep 13, 2010 at 17:51, Danil ŢORIN <torin...@gmail.com> wrote:
> I'd second that....
>
> In my use case we need to search, sometimes with sort, on a pretty big index...
>
> So in the worst-case scenario we get an OOM while loading the FieldCache as it tries to create a huge array. You can increase -Xmx, go to a bigger host, but in the end there WILL be an index big enough to crash you.
>
> My idea would be to use something like EhCache with a few elements in memory and overflow to disk, so that if there are only a few unique terms, it would be almost as fast as an array. Otherwise, in Collector/Sort/SortField/FieldComparator I would hit the EhCache on disk (yes, it would be a huge performance hit) but I won't get OOMs and the results STILL will be sorted.
>
> Right now SegmentReader/FieldCacheImpl are pretty hardcoded on FieldCache.DEFAULT.
>
> And yes, I'm on 3.x...
>
> On Mon, Sep 13, 2010 at 16:05, Tim Smith <tsm...@attivio.com> wrote:
>> I created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago proposing pretty much what seems to be discussed here.
>>
>> -- Tim
>>
>> On 09/12/10 10:18, Simon Willnauer wrote:
>>> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>> Having hooks to enable an app to manage its own "external, private stuff associated w/ each segment reader" would be useful and it's been asked for in the past. However, since we've now opened up SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app already do this w/o core API changes?
>>>
>>> The visitor approach would simply be little more than syntactic sugar where only new SubReader instances are passed to the callback. You can do the same with the already existing API like gatherSubReaders or getSequentialSubReaders. Every API I was talking about would just be a simplification anyway and would be possible to build without changing the core.
>>>
>>>> I know Earwin has built a whole system like this on top of Lucene -- Earwin how did you do that...? Did you make core changes to Lucene...?
>>>>
>>>> A custom Codec should be an excellent way to handle the specific use case (caching certain postings) -- by doing it as a Codec, any time anything in Lucene needs to tap into that posting (query scorers, filters, merging, applying deletes, etc.), it hits this cache. You could model it like PulsingCodec, which wraps any other Codec but handles the low-freq ones itself. If you do it externally, how would core use of postings hit it? (Or was that not the intention?)
>>>>
>>>> I don't understand the filter use-case... the CachingWrapperFilter already caches per-segment, so that reopen is efficient? How would an external cache (built on these hooks) be different?
>>>
>>> Man, you are right - never mind :)
>>>
>>> simon
>>>
>>>> For faster filters we have to apply them like we do deleted docs if the filter is "random access" (such as being cached), LUCENE-1536 -- flex actually makes this relatively easy now, since the postings API no longer implicitly filters deleted docs (i.e. you provide your own skipDocs) -- but these hooks won't fix that, right?
>>>>
>>>> Mike
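To make Danil's overflow-to-disk idea above a bit more concrete: on 3.x it would roughly amount to a FieldComparator that pulls per-document values from an external store instead of a FieldCache array. The sketch below is only an illustration -- ValueStore is a made-up interface standing in for EhCache or any other disk-backed map, and ExternalStoreStringComparator is not an existing Lucene class:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldComparator;

/** Hypothetical off-heap lookup; EhCache or any disk-backed map could sit behind it. */
interface ValueStore {
  String get(Object segmentCoreKey, int docID) throws IOException;
}

/** Sorts by a per-document String value fetched from an external store instead of FieldCache. */
class ExternalStoreStringComparator extends FieldComparator {

  private final ValueStore store;   // assumed external store, not a Lucene API
  private final String[] slots;     // one value per competitive hit slot
  private Object segmentKey;        // identifies the segment currently being collected
  private String bottom;

  ExternalStoreStringComparator(ValueStore store, int numHits) {
    this.store = store;
    this.slots = new String[numHits];
  }

  @Override
  public int compare(int slot1, int slot2) {
    return cmp(slots[slot1], slots[slot2]);
  }

  @Override
  public int compareBottom(int doc) throws IOException {
    return cmp(bottom, store.get(segmentKey, doc));  // possibly a disk hit per competitive doc
  }

  @Override
  public void copy(int slot, int doc) throws IOException {
    slots[slot] = store.get(segmentKey, doc);
  }

  @Override
  public void setBottom(int slot) {
    bottom = slots[slot];
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) {
    segmentKey = reader.getCoreCacheKey();  // stable per-segment identity, suitable as a cache key
  }

  @Override
  public Comparable<?> value(int slot) {
    return slots[slot];
  }

  private static int cmp(String a, String b) {
    if (a == null) return b == null ? 0 : -1;
    if (b == null) return 1;
    return a.compareTo(b);
  }
}

Every copy()/compareBottom() call may hit the store, which is the performance hit Danil expects, but nothing ever allocates a maxDoc-sized array on the heap; wiring it into a Sort would go through a custom FieldComparatorSource and SortField as usual.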
>>>>
>>>> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer <simon.willna...@googlemail.com> wrote:
>>>>>
>>>>> Hey Shai,
>>>>>
>>>>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>
>>>>>> Hey Simon,
>>>>>>
>>>>>> You're right that the application can develop a Caching mechanism outside Lucene, and when reopen() is called, if it changed, iterate on the sub-readers and init the Cache w/ the new ones.
>>>>>
>>>>> Alright, then we are on the same track I guess!
>>>>>
>>>>>> However, by building something like that inside Lucene, the application will get more native support, and thus better performance, in some cases. For example, consider a field fileType with 10 possible values, and for the sake of simplicity, let's say that the index is divided evenly across them. Your users always add such a term constraint to the query (e.g. they want to get results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not others). You have basically two ways of supporting this:
>>>>>>
>>>>>> (1) Add such a term to the query / a clause to a BooleanQuery w/ an AND relation -- the con is that this term's posting is read for every query.
>>>>>
>>>>> Oh, I wasn't saying that a cache framework would be obsolete and shouldn't be part of Lucene. My intention was rather to generalize this functionality so that we can make the API change more easily and at the same time bring the infrastructure you are proposing into place.
>>>>>
>>>>> Regarding your example above, filters are a very good example where something like that could help to improve performance, and we should provide it with Lucene core, but I would again prefer the least intrusive way to do so. If we can make that happen with nothing more than a cache-agnostic API, we should do it. We really should try to sketch out a simple API which gives us access to the opened segReaders and see if that would be sufficient for our use cases. Specialization will always be possible, but I doubt that it is needed.
>>>>>>
>>>>>> (2) Write a Filter which works at the top IR level, that is refreshed whenever the index is refreshed. This is better than (1), however it has some disadvantages:
>>>>>>
>>>>>> (2.1) As Mike already proved (on some issue whose subject/number I don't remember at the moment), if we could get Filter down to the lower-level components of Lucene's search, so that e.g. it is used as the deleted docs OBS, we can get better performance w/ Filters.
>>>>>>
>>>>>> (2.2) The Filter is refreshed for the entire IR, and not just the changed segments. The reason is, outside Collector, you have no way of telling IndexSearcher "use Filter F1 for segment S1 and F2 for segment S2". Loading/refreshing the Filter may be expensive, and definitely won't perform well w/ NRT, where by definition you'd like to get small changes searchable very fast.
>>>>>
>>>>> No doubt you are right about the above. A PerSegmentCachingFilterWrapper would be something we could easily do at the application level with the infrastructure we are talking about in place. While I don't exactly know how I feel about it -- my sense is that this particular problem should rather be addressed internally, and I'm not sure the high-level Cache mechanism is the right way to do it -- this is just a gut feeling. When I think about it twice, though, it might well be sufficient to do it that way....
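Roughly what such a per-segment caching filter could look like on the application side with today's 3.x API, using Shai's fileType example. This is only a sketch (CachingWrapperFilter already does the generic version of this, as Mike notes above), and nothing here is a proposed Lucene API:

import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

/**
 * Caches the matching docs of a single term per segment, keyed on the
 * segment's core cache key, so a reopen only loads the new segments.
 */
class CachedTermFilter extends Filter {

  private final Term term;
  // weak keys: entries go away once the segment's core is closed
  private final Map<Object, DocIdSet> cache =
      Collections.synchronizedMap(new WeakHashMap<Object, DocIdSet>());

  CachedTermFilter(Term term) {
    this.term = term;
  }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    // Since 2.9 the searcher calls this once per segment reader.
    Object key = reader.getCoreCacheKey();
    DocIdSet cached = cache.get(key);
    if (cached != null) {
      return cached;
    }
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermDocs td = reader.termDocs(term);
    try {
      while (td.next()) {
        bits.set(td.doc());   // random-access bit set over the term's postings
      }
    } finally {
      td.close();
    }
    cache.put(key, bits);
    return bits;
  }
}

Because getDocIdSet() is invoked per segment, a reopen only pays the TermDocs scan for segments it has not seen yet; usage would be something like new CachedTermFilter(new Term("fileType", "pdf")) passed to IndexSearcher.search(query, filter, n).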
>>>>>>
>>>>>> Therefore I think that if we could provide the necessary hooks in Lucene -- let's call it a Cache plug-in for now -- we can incrementally improve the search process. I don't want to go too far into the design of a generic plug-ins mechanism, but you're right (again :)) -- we could offer a reopen(PluginProvider) which is entirely not about Cache, and Cache would become one of the Plugins the PluginProvider provides. I just try to learn from past experience -- when the discussion is focused, there's a better chance of getting to a resolution. However, if you think that in this case a more generic API, such as PluginProvider, would get us to a resolution faster, I don't mind spending some time thinking about it. But for all practical purposes, we should IMO start w/ a Cache plug-in that is called like that, and if it catches on, generify it ...
>>>>>
>>>>> I absolutely agree the API might be more generic, but our current use case / PoC should be caching. I don't like the name Plugin, but that's a personal thing, since you are not plugging anything in. Something like SubreaderCallback or ReaderVisitor might be more accurate, but let's argue about the details later. Why not sketch something out for the filter problem and follow on from there? The more iterations the better. And back to your question whether this would be something committable: if it works stand-alone / is not too tightly coupled, I would absolutely say yes.
>>>>>>
>>>>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x), so I can't comment on how feasible that solution is. I'll take your word for it that it's doable :). But this doesn't give us a 3x solution ... the Caching framework on trunk can be developed w/ Codecs.
>>>>>
>>>>> I guess nobody really has, except Mike and maybe one or two others, but from what I have done so far regarding codecs, I would say that is the place to solve this particular problem. Maybe even lower than that, on a Directory level. Anyhow, let's focus on application-level caches for now. We are not aiming to provide a whole full-fledged Cache API, but the infrastructure to make it easier to build those on an app basis, which would be a valuable improvement. We should also look at Solr's cache implementations and how they could benefit from these efforts; since Solr uses app-level caching, we can learn from it API-design-wise.
>>>>>
>>>>> simon
>>>>>>
>>>>>> Shai
>>>>>>
>>>>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer <simon.willna...@googlemail.com> wrote:
>>>>>>>
>>>>>>> Hi Shai,
>>>>>>>
>>>>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Lucene's Caches have been heavily discussed before (e.g., LUCENE-831, LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been many proposals to attack this problem, w/ no developed solution.
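For what it's worth, the SubreaderCallback / ReaderVisitor shape Simon mentions above probably amounts to no more than the following; the name and method are purely illustrative, nothing like this exists in Lucene:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

/** Illustrative only: invoked once for every segment reader that appeared since the last (re)open. */
interface SubReaderCallback {
  void newSegment(IndexReader segmentReader) throws IOException;
}

A reopen(PluginProvider)-style API would essentially just be Lucene driving such a callback itself instead of the application; a sketch of the application-side driver follows further down.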
>>>>>>>
>>>>>>> I didn't go through those issues, so forgive me if something I bring up has already been discussed. I have a couple of questions about your proposal - please find them inline...
>>>>>>>
>>>>>>>> I'd like to explore a different, IMO much simpler, angle to attack this problem. Instead of having Lucene manage the Cache itself, we let the application manage it; however, Lucene will provide the necessary hooks in IndexReader to allow it. The hooks I have in mind are:
>>>>>>>>
>>>>>>>> (1) IndexReader's current API for TermDocs, TermEnum, TermPositions, etc. -- already exists.
>>>>>>>>
>>>>>>>> (2) When reopen() is called, Lucene will take care to call Cache.load(IndexReader), so that the application can pull whatever information it needs from the passed-in IndexReader.
>>>>>>>
>>>>>>> Would that do anything other than pass the new reader (if reopened) to the cache's load method? I wonder if this is more than
>>>>>>> if (newReader != oldReader) Cache.load(newReader)
>>>>>>> If so, something like that should be done on a segment reader anyway, right? From my perspective this isn't more than a callback or visitor that walks through the subreaders and is called for each reopened sub-reader. A cache-warming visitor / callback would then be trivial and the API would be more general.
>>>>>>>
>>>>>>>> So to be more concrete on my proposal, I'd like to support caching in the following way (and while I've spent some time thinking about it, I'm sure there are great suggestions to improve it):
>>>>>>>>
>>>>>>>> * The application provides a CacheFactory to IndexReader.open/reopen, which exposes some very simple API, such as createCache or initCache(IndexReader), etc. -- something which returns a Cache object, which does not have a very strict/concrete API.
>>>>>>>
>>>>>>> My first question would be why the reader should know about Cache if there is no strict / concrete API? I can follow you with the CacheFactory to create cache objects, but why would the reader have to know / "receive" this object? Maybe this is answered further down the path, but I don't see the reason why the notion of a "cache" must exist within open/reopen, or whether that could be implemented in a more general, "cache"-agnostic way.
>>>>>>>>
>>>>>>>> * IndexReader, most probably at the SegmentReader level, uses CacheFactory to create a new Cache instance and calls its load(IndexReader) method, so that the Cache would initialize itself.
>>>>>>>
>>>>>>> That is what I was thinking above - yet is that more than a callback for each reopened or opened segment reader?
>>>>>>>
>>>>>>>> * The application can use CacheFactory to obtain the Cache object per IndexReader (for example, during Collector.setNextReader), or we can have IndexReader offer a getCache() method.
>>>>>>>
>>>>>>> :) Until here the cache is only used by the application itself, not by any Lucene API, right? I can think of a lot of application-specific data that could be useful to associate with an IR beyond the caching use case - again, this could be a more general API solving that problem.
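The "callback or visitor that walks through the subreaders" can already be driven from the application with the existing 3.x API. A minimal sketch, reusing the illustrative SubReaderCallback shape from above (SegmentTracker is likewise a made-up name, not a Lucene class):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.IndexReader;

/** Tracks which segment cores have been seen and fires the callback only for new ones. */
class SegmentTracker {

  private final Set<Object> knownSegments = new HashSet<Object>();

  /** Call after every open()/reopen() with the current top-level reader. */
  synchronized void readerOpened(IndexReader topReader, SubReaderCallback callback) throws IOException {
    List<IndexReader> leaves = new ArrayList<IndexReader>();
    gather(topReader, leaves);                  // same flattening that ReaderUtil.gatherSubReaders performs
    Set<Object> current = new HashSet<Object>();
    for (IndexReader segment : leaves) {
      Object key = segment.getCoreCacheKey();
      current.add(key);
      if (knownSegments.add(key)) {
        callback.newSegment(segment);           // e.g. Cache.load(segment): warm only what is new
      }
    }
    knownSegments.retainAll(current);           // forget segments that were merged away or closed
  }

  private static void gather(IndexReader reader, List<IndexReader> out) {
    IndexReader[] subs = reader.getSequentialSubReaders();
    if (subs == null) {
      out.add(reader);                          // a leaf, e.g. a SegmentReader
    } else {
      for (IndexReader sub : subs) {
        gather(sub, out);
      }
    }
  }
}

Shai's Cache.load(IndexReader) would then simply be the body of newSegment(), and because it only runs for segments that actually appeared since the last reopen, it stays cheap under NRT where most segments are unchanged.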
>>>>>>>>
>>>>>>>> * One part of the Cache API would be getCache(TYPE), where TYPE is a String or Object, or an interface CacheType w/ no methods, just to be a marker, and the application is free to impl it however it wants. That's a loose API, I know, but it is completely in the application's hands, which makes Lucene code simpler.
>>>>>>>
>>>>>>> I like the idea, together with the metadata-associating functionality from above -- something like public T IndexReader#get(Type<T> type). Hmm, that looks quite similar to Attributes, doesn't it?! :) However, this could be done in many ways, but again "cache"-agnostic.
>>>>>>>>
>>>>>>>> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to provide the user w/ an IndexReader-similar API, only more efficient than say TermDocs -- something w/ random access to the docs inside, perhaps even an OpenBitSet. Lucene can take advantage of it if, say, we create a CachingSegmentReader which makes use of the cache, and checks every time termDocs() is called whether the required Term is cached or not, etc. I admit I may be thinking too far ahead.
>>>>>>>
>>>>>>> I see what you are trying to do here. I also see how this could be useful, but I guess coming up with a stable API which serves lots of applications would be quite hard. A CachingSegmentReader could be a very simple decorator which would not require touching the IR interface. Something like that could be part of Lucene, but I'm not sure it necessarily belongs in Lucene core.
>>>>>>>>
>>>>>>>> That's more or less what I've been thinking. I'm sure there are many details to iron out, but I hope I've managed to pass the general proposal through to you.
>>>>>>>
>>>>>>> Absolutely, this is how it works, isn't it!
>>>>>>>
>>>>>>>> What I'm after first is to allow applications to deal w/ postings caching more natively. For example, if you have a posting w/ payloads you'd like to read into memory, or if you would like a term's TermDocs to be cached (to be used as a Filter) etc. -- instead of writing something that can work at the top IndexReader level, you'd be able to take advantage of Lucene internals, i.e. refresh the Cache only for the new segments ...
>>>>>>>
>>>>>>> I wonder if a custom codec would be the right place to implement caching / memory-resident structures for postings with payloads etc. You could do that on a higher level too, but a codec seems to be the way to go here, right? To utilize per-segment capabilities, a callback for (re)opened segment readers would be sufficient, or do I miss something?
>>>>>>>
>>>>>>> simon
>>>>>>>
>>>>>>>> I'm sure that after this is in place, we can refactor FieldCache to work w/ that API, perhaps as a Cache-specific implementation. But I leave that for later.
>>>>>>>>
>>>>>>>> I'd appreciate your comments. Before I set out to implement it, I'd like to know if the idea has any chance of making it to a commit :).
>>>>>>>>
>>>>>>>> Shai
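A sketch of what the getCache(TYPE) idea, or Simon's public T IndexReader#get(Type<T> type), could look like when kept entirely on the application side today. Type and PerReaderData are made-up names; the weak keying per reader core is similar in spirit to how FieldCache keys its entries:

import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;

/** Typed marker key, in the spirit of the proposed getCache(TYPE) / get(Type<T>). */
final class Type<T> {
  private final String name;
  Type(String name) { this.name = name; }
  @Override public String toString() { return name; }
}

/** Associates typed values with a (segment) reader core; entries disappear once the core is closed. */
final class PerReaderData {

  private final Map<Object, Map<Type<?>, Object>> byCore =
      new WeakHashMap<Object, Map<Type<?>, Object>>();

  synchronized <T> void put(IndexReader reader, Type<T> type, T value) {
    Object key = reader.getCoreCacheKey();
    Map<Type<?>, Object> values = byCore.get(key);
    if (values == null) {
      values = new HashMap<Type<?>, Object>();
      byCore.put(key, values);
    }
    values.put(type, value);
  }

  @SuppressWarnings("unchecked")
  synchronized <T> T get(IndexReader reader, Type<T> type) {
    Map<Type<?>, Object> values = byCore.get(reader.getCoreCacheKey());
    return values == null ? null : (T) values.get(type);
  }
}

A Collector could then, in setNextReader(reader, docBase), look up e.g. a previously loaded OpenBitSet for the current segment with data.get(reader, MY_TYPE), without any change to IndexReader itself; the open question in the thread is only whether Lucene should ship such a holder or leave it to the application.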