Could there be another implementation of sorting? With very large indexes and small total result spaces, it would make sense to maintain a partial list of sorted ids per field. Every search that finds new ids adds them to the master list. There could even be a cache eviction policy.
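A rough sketch of that idea, with hypothetical names (none of this is an existing Lucene API): a per-field sorted map that collectors feed as they encounter new docs, bounded by a crude size-based eviction.

  // Hypothetical sketch only; class and method names are illustrative, not Lucene APIs.
  import java.util.Map;
  import java.util.NavigableMap;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentSkipListMap;

  /** Keeps a partial, sorted view of (field value -> doc id) per field,
   *  fed lazily by searches that encounter ids not yet cached. */
  public class PartialSortCache {
    private final int maxEntriesPerField;  // crude eviction bound
    private final Map<String, ConcurrentSkipListMap<String, Integer>> byField =
        new ConcurrentHashMap<>();

    public PartialSortCache(int maxEntriesPerField) {
      this.maxEntriesPerField = maxEntriesPerField;
    }

    /** Called per hit (e.g. from a Collector): record the field value seen for this doc. */
    public void record(String field, String value, int docId) {
      ConcurrentSkipListMap<String, Integer> sorted =
          byField.computeIfAbsent(field, f -> new ConcurrentSkipListMap<>());
      sorted.putIfAbsent(value, docId);
      // Simplistic eviction policy: drop the largest keys once over budget.
      while (sorted.size() > maxEntriesPerField) {
        sorted.pollLastEntry();
      }
    }

    /** Returns the doc ids cached so far for a field, already in sorted value order
     *  (the list may be partial). */
    public int[] sortedDocs(String field) {
      NavigableMap<String, Integer> sorted =
          byField.getOrDefault(field, new ConcurrentSkipListMap<>());
      return sorted.values().stream().mapToInt(Integer::intValue).toArray();
    }
  }

A real implementation would also have to key the cached ids per segment and invalidate them on reopen; this only shows the shape of the idea.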
Lance

On Mon, Sep 13, 2010 at 8:01 AM, Danil ŢORIN <torin...@gmail.com> wrote:
> And it would be nice to have hooks in Lucene and avoid managing refs
> to indexReader on reopen() and close() by myself.
>
> Oh... and to complicate things, my index is near-realtime, using
> IndexWriter.getReader(), so it's not just IndexReader we need to
> change; IndexWriter should also provide a reader that has a proper
> FieldCache implementation.
>
> And I'm a bit uncomfortable digging that deep :)
>
> On Mon, Sep 13, 2010 at 17:51, Danil ŢORIN <torin...@gmail.com> wrote:
>> I'd second that...
>>
>> In my use case we need to search, sometimes with sort, on a pretty big index...
>>
>> So in the worst-case scenario we get an OOM while loading the FieldCache, as it
>> tries to create a huge array.
>> You can increase -Xmx or go to a bigger host, but in the end there WILL be
>> an index big enough to crash you.
>>
>> My idea would be to use something like EhCache with a few elements in
>> memory and overflow to disk, so that if there are few unique terms, it
>> would be almost as fast as an array.
>> Otherwise, in Collector/Sort/SortField/FieldComparator I would hit the
>> EhCache on disk (yes, it would be a huge performance hit), but I won't
>> get OOMs and the results will STILL be sorted.
>>
>> Right now SegmentReader/FieldCacheImpl are pretty much hardcoded to
>> FieldCache.DEFAULT.
>>
>> And yes, I'm on 3.x...
>>
>>
>> On Mon, Sep 13, 2010 at 16:05, Tim Smith <tsm...@attivio.com> wrote:
>>> I created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago
>>> proposing pretty much what seems to be discussed here.
>>>
>>>
>>> -- Tim
>>>
>>> On 09/12/10 10:18, Simon Willnauer wrote:
>>>>
>>>> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
>>>> <luc...@mikemccandless.com> wrote:
>>>>>
>>>>> Having hooks to enable an app to manage its own "external, private
>>>>> stuff associated w/ each segment reader" would be useful, and it's been
>>>>> asked for in the past. However, since we've now opened up
>>>>> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
>>>>> already do this w/o core API changes?
>>>>
>>>> The visitor approach would simply be little more than syntactic
>>>> sugar, where only new SubReader instances are passed to the callback.
>>>> You can do the same with the already existing API like
>>>> gatherSubReaders or getSequentialSubReaders. Every API I was talking
>>>> about would just be a simplification anyway and would be possible to
>>>> build without changing the core.
>>>>>
>>>>> I know Earwin has built a whole system like this on top of Lucene --
>>>>> Earwin, how did you do that...? Did you make core changes to
>>>>> Lucene...?
>>>>>
>>>>> A custom Codec should be an excellent way to handle the specific use
>>>>> case (caching certain postings) -- by doing it as a Codec, any time
>>>>> anything in Lucene needs to tap into that posting (query scorers,
>>>>> filters, merging, applying deletes, etc.), it hits this cache. You
>>>>> could model it like PulsingCodec, which wraps any other Codec but
>>>>> handles the low-freq terms itself. If you do it externally, how would
>>>>> core use of postings hit it? (Or was that not the intention?)
>>>>>
>>>>> I don't understand the filter use case... the CachingWrapperFilter
>>>>> already caches per segment, so that reopen is efficient? How would an
>>>>> external cache (built on these hooks) be different?
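As a point of reference, the per-segment pattern Mike mentions can be illustrated roughly like this (a simplified sketch, not the actual CachingWrapperFilter source; SegmentFilter and the Object reader key are stand-ins): entries are keyed by segment reader, so after a reopen only segments that were not seen before are recomputed.

  // Simplified illustration of per-segment filter caching; not real Lucene code.
  import java.util.BitSet;
  import java.util.Map;
  import java.util.WeakHashMap;

  public class PerSegmentCachingFilter {

    /** Stand-in for Filter.getDocIdSet(reader): computes the matching docs for one segment. */
    public interface SegmentFilter {
      BitSet compute(Object segmentReader);
    }

    private final SegmentFilter delegate;
    // Weak keys: cached bit sets go away once a segment reader is no longer reachable.
    private final Map<Object, BitSet> cache = new WeakHashMap<>();

    public PerSegmentCachingFilter(SegmentFilter delegate) {
      this.delegate = delegate;
    }

    /** Unchanged segments hit the cache on reopen; only new segments pay the compute cost. */
    public synchronized BitSet getDocIdSet(Object segmentReader) {
      BitSet cached = cache.get(segmentReader);
      if (cached == null) {
        cached = delegate.compute(segmentReader);
        cache.put(segmentReader, cached);
      }
      return cached;
    }
  }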
>>>>
>>>> Man, you are right - never mind :)
>>>>
>>>> simon
>>>>>
>>>>> For faster filters we have to apply them like we do deleted docs if
>>>>> the filter is "random access" (such as being cached), LUCENE-1536 --
>>>>> flex actually makes this relatively easy now, since the postings API
>>>>> no longer implicitly filters deleted docs (i.e. you provide your own
>>>>> skipDocs) -- but these hooks won't fix that, right?
>>>>>
>>>>> Mike
>>>>>
>>>>> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
>>>>> <simon.willna...@googlemail.com> wrote:
>>>>>>
>>>>>> Hey Shai,
>>>>>>
>>>>>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hey Simon,
>>>>>>>
>>>>>>> You're right that the application can develop a caching mechanism outside
>>>>>>> Lucene, and when reopen() is called, if anything changed, iterate over the
>>>>>>> sub-readers and init the Cache w/ the new ones.
>>>>>>
>>>>>> Alright, then we are on the same track I guess!
>>>>>>
>>>>>>> However, by building something like that inside Lucene, the application will
>>>>>>> get more native support, and thus better performance, in some cases. For
>>>>>>> example, consider a field fileType with 10 possible values, and for the sake
>>>>>>> of simplicity, let's say that the index is divided evenly across them. Your
>>>>>>> users always add such a term constraint to the query (e.g. they want to get
>>>>>>> results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not
>>>>>>> others). You basically have two ways of supporting this:
>>>>>>> (1) Add such a term to the query / a clause to a BooleanQuery w/ an AND
>>>>>>> relation -- the con is that this term / posting is read for every query.
>>>>>>
>>>>>> Oh, I wasn't saying that a cache framework would be obsolete and
>>>>>> shouldn't be part of Lucene. My intention was rather to generalize
>>>>>> this functionality so that we can make the API change more easily and
>>>>>> at the same time bring the infrastructure you are proposing into
>>>>>> place.
>>>>>>
>>>>>> Regarding your example above, filters are a very good example where
>>>>>> something like that could help to improve performance, and we should
>>>>>> provide it with Lucene core, but I would again prefer the least
>>>>>> intrusive way to do so. If we can make that happen without adding any
>>>>>> cache-agnostic API, we should do it. We really should try to sketch out
>>>>>> a simple API which gives us access to the opened segReaders and see if
>>>>>> that would be sufficient for our use cases. Specialization will always
>>>>>> be possible, but I doubt that it is needed.
>>>>>>>
>>>>>>> (2) Write a Filter which works at the top IR level, that is refreshed
>>>>>>> whenever the index is refreshed. This is better than (1), however it has
>>>>>>> some disadvantages:
>>>>>>>
>>>>>>> (2.1) As Mike already proved (on some issue whose subject/number I don't
>>>>>>> remember at the moment), if we could get Filter down to the lower-level
>>>>>>> components of Lucene's search, so that e.g. it is used as the deleted docs
>>>>>>> OBS, we could get better performance w/ Filters.
>>>>>>>
>>>>>>> (2.2) The Filter is refreshed for the entire IR, and not just the changed
>>>>>>> segments. The reason is, outside Collector, you have no way of telling
>>>>>>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment S2".
>>>>>>> Loading/refreshing the Filter may be expensive, and definitely won't
>>>>>>> perform well w/ NRT, where by definition you'd like to get small changes
>>>>>>> searchable very fast.
>>>>>>
>>>>>> No doubt you are right about the above. A
>>>>>> PerSegmentCachingFilterWrapper would be something we could easily do on
>>>>>> an application-level basis with the infrastructure we are talking
>>>>>> about in place. While I don't exactly know how, I feel that this
>>>>>> particular problem should rather be addressed internally, and I'm not
>>>>>> sure the high-level Cache mechanism is the right way to do it -- but
>>>>>> this is just a gut feeling. When I think about it twice, it might well
>>>>>> be sufficient to do it that way...
>>>>>>>
>>>>>>> Therefore I think that if we could provide the necessary hooks in Lucene,
>>>>>>> let's call it a Cache plug-in for now, we can incrementally improve the
>>>>>>> search process. I don't want to go too far into the design of a generic
>>>>>>> plug-in mechanism, but you're right (again :)) -- we could offer a
>>>>>>> reopen(PluginProvider) which is entirely not about Cache, and Cache would
>>>>>>> become one of the Plugins the PluginProvider provides. I just try to learn
>>>>>>> from past experience -- when the discussion is focused, there's a better
>>>>>>> chance of getting to a resolution. However, if you think that in this case a
>>>>>>> more generic API, such as PluginProvider, would get us to a resolution faster,
>>>>>>> I don't mind spending some time thinking about it. But for all practical
>>>>>>> purposes, we should IMO start w/ a Cache plug-in, that is called like that,
>>>>>>> and if it catches on, generify it...
>>>>>>
>>>>>> I absolutely agree the API might be more generic, but our current
>>>>>> use case / PoC should be caching. I don't like the name Plugin, but
>>>>>> that's a personal thing, since you are not plugging anything in.
>>>>>> Something like SubreaderCallback or ReaderVisitor might be more
>>>>>> accurate, but let's argue about the details later. Why not sketch
>>>>>> something out for the filter problem and follow on from there? The
>>>>>> more iterations the better, and back to your question whether this could
>>>>>> make it to something committable: if it works stand-alone / is not too
>>>>>> tightly coupled, I would absolutely say yes.
>>>>>>>
>>>>>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x),
>>>>>>> so I can't comment on how feasible that solution is. I'll take your word for
>>>>>>> it that it's doable :). But this doesn't give us a 3x solution ... the
>>>>>>> Caching framework on trunk can be developed w/ Codecs.
>>>>>>
>>>>>> I guess nobody really has, except for Mike and maybe one or two others,
>>>>>> but from what I have done so far regarding codecs I would say that is the
>>>>>> place to solve this particular problem. Maybe even lower than that, on
>>>>>> a Directory level. Anyhow, let's focus on application-level caches for
>>>>>> now. We are not aiming to provide a whole full-fledged Cache API, but
>>>>>> the infrastructure to make it easier to build those on an app basis,
>>>>>> which would be a valuable improvement. We should also look at Solr's
>>>>>> cache implementations and how they could benefit from these efforts;
>>>>>> since Solr uses app-level caching, we can learn from it API-design-wise.
>>>>>>
>>>>>> simon
>>>>>>>
>>>>>>> Shai
>>>>>>>
>>>>>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>>>>>> <simon.willna...@googlemail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Shai,
>>>>>>>>
>>>>>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> Lucene's caches have been heavily discussed before (e.g., LUCENE-831,
>>>>>>>>> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
>>>>>>>>> many proposals to attack this problem, w/ no developed solution.
>>>>>>>>
>>>>>>>> I didn't go through those issues, so forgive me if something I bring up
>>>>>>>> has already been discussed.
>>>>>>>> I have a couple of questions about your proposal - please find them
>>>>>>>> inline...
>>>>>>>>
>>>>>>>>> I'd like to explore a different, IMO much simpler, angle to attack this
>>>>>>>>> problem. Instead of having Lucene manage the Cache itself, we let the
>>>>>>>>> application manage it; however, Lucene will provide the necessary hooks
>>>>>>>>> in IndexReader to allow it. The hooks I have in mind are:
>>>>>>>>>
>>>>>>>>> (1) IndexReader's current API for TermDocs, TermEnum, TermPositions etc. --
>>>>>>>>> already exists.
>>>>>>>>>
>>>>>>>>> (2) When reopen() is called, Lucene will take care to call
>>>>>>>>> Cache.load(IndexReader), so that the application can pull whatever
>>>>>>>>> information it needs from the passed-in IndexReader.
>>>>>>>>
>>>>>>>> Would that do anything more than pass the new reader (if reopened)
>>>>>>>> to the cache's load method? I wonder if this is more than
>>>>>>>> if (newReader != oldReader)
>>>>>>>>   Cache.load(newReader)
>>>>>>>>
>>>>>>>> If so, something like that should be done on a segment reader anyway,
>>>>>>>> right? From my perspective this isn't more than a callback or visitor
>>>>>>>> that should walk through the subreaders and be called for each reopened
>>>>>>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>>>>>>> and the API would be more general.
>>>>>>>>
>>>>>>>>> So to be more concrete on my proposal, I'd like to support caching in
>>>>>>>>> the following way (and while I've spent some time thinking about it, I'm
>>>>>>>>> sure there are great suggestions to improve it):
>>>>>>>>>
>>>>>>>>> * Application provides a CacheFactory to IndexReader.open/reopen, which
>>>>>>>>> exposes some very simple API, such as createCache, or
>>>>>>>>> initCache(IndexReader) etc. Something which returns a Cache object,
>>>>>>>>> which does not have a very strict/concrete API.
>>>>>>>>
>>>>>>>> My first question would be why the reader should know about Cache if
>>>>>>>> there is no strict / concrete API?
>>>>>>>> I can follow you with the CacheFactory to create cache objects, but why
>>>>>>>> would the reader have to know / "receive" this object? Maybe this is
>>>>>>>> answered further down the path, but I don't see the reason why the
>>>>>>>> notion of a "cache" must exist within open/reopen, or whether that could
>>>>>>>> be implemented in a more general, "cache"-agnostic way.
>>>>>>>>>
>>>>>>>>> * IndexReader, most probably at the SegmentReader level, uses
>>>>>>>>> CacheFactory to create a new Cache instance and calls its
>>>>>>>>> load(IndexReader) method, so that the Cache would initialize itself.
>>>>>>>>
>>>>>>>> That is what I was thinking above - yet is that more than a callback
>>>>>>>> for each reopened or opened segment reader?
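A rough sketch of the cache-warming callback being discussed here (purely illustrative; none of these types exist in Lucene): the visitor tracks which segment readers it has already seen and calls load() only for newly opened or reopened ones.

  // Illustrative only; none of these interfaces exist in Lucene.
  import java.util.Collections;
  import java.util.IdentityHashMap;
  import java.util.Set;

  public class CacheWarmingVisitor {

    /** Stand-in for the application's cache: pulls whatever it needs from one segment reader. */
    public interface SegmentCache {
      void load(Object segmentReader);
    }

    private final SegmentCache cache;
    // Identity-based set of segment readers we have already warmed.
    private final Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());

    public CacheWarmingVisitor(SegmentCache cache) {
      this.cache = cache;
    }

    /** Called after open()/reopen() with the current sub-readers of the top-level reader;
     *  only segments not seen before are loaded. */
    public synchronized void onReopen(Object[] subReaders) {
      for (Object segment : subReaders) {
        if (seen.add(segment)) {
          cache.load(segment);
        }
      }
    }
  }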
>>>>>>>>
>>>>>>>>> * The application can use CacheFactory to obtain the Cache object per
>>>>>>>>> IndexReader (for example, during Collector.setNextReader), or we can
>>>>>>>>> have IndexReader offer a getCache() method.
>>>>>>>>
>>>>>>>> :) Up to here the cache is only used by the application itself, not by
>>>>>>>> any Lucene API, right? I can think of a lot of application-specific data
>>>>>>>> that could usefully be associated with an IR beyond the caching
>>>>>>>> use case - again, this could be a more general API solving that
>>>>>>>> problem.
>>>>>>>>>
>>>>>>>>> * One part of the Cache API would be getCache(TYPE), where TYPE is a String
>>>>>>>>> or Object, or an interface CacheType w/ no methods, just to be a marker
>>>>>>>>> one, and the application is free to impl it however it wants. That's a
>>>>>>>>> loose API, I know, but completely in the application's hands, which makes
>>>>>>>>> Lucene code simpler.
>>>>>>>>
>>>>>>>> I like the idea together with the metadata-associating functionality
>>>>>>>> from above -- something like public T IndexReader#get(Type<T> type).
>>>>>>>> Hmm, that looks quite similar to Attributes, doesn't it?! :) However, this
>>>>>>>> could be done in many ways, but again "cache"-agnostic.
>>>>>>>>>
>>>>>>>>> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
>>>>>>>>> provide the user w/ an IndexReader-like API, only more efficient than,
>>>>>>>>> say, TermDocs -- something w/ random access to the docs inside, perhaps
>>>>>>>>> even an OpenBitSet. Lucene can take advantage of it if, say, we create a
>>>>>>>>> CachingSegmentReader which makes use of the cache and checks, every time
>>>>>>>>> termDocs() is called, whether the required Term is cached or not, etc. I
>>>>>>>>> admit I may be thinking too far ahead.
>>>>>>>>
>>>>>>>> I see what you are trying to do here. I also see how this could be
>>>>>>>> useful, but I guess coming up with a stable API which serves lots of
>>>>>>>> applications would be quite hard. A CachingSegmentReader could be a
>>>>>>>> very simple decorator which would not require touching the IR
>>>>>>>> interface. Something like that could be part of Lucene, but I'm not
>>>>>>>> sure it necessarily belongs in Lucene core.
>>>>>>>>
>>>>>>>>> That's more or less what I've been thinking. I'm sure there are many
>>>>>>>>> details to iron out, but I hope I've managed to get the general
>>>>>>>>> proposal across to you.
>>>>>>>>
>>>>>>>> Absolutely, this is how it works, isn't it!
>>>>>>>>
>>>>>>>>> What I'm after first is to allow applications to deal w/ postings caching
>>>>>>>>> more natively. For example, if you have a posting w/ payloads you'd like to
>>>>>>>>> read into memory, or if you would like a term's TermDocs to be cached
>>>>>>>>> (to be used as a Filter) etc. -- instead of writing something that can
>>>>>>>>> work at the top IndexReader level, you'd be able to take advantage of
>>>>>>>>> Lucene internals, i.e. refresh the Cache only for the new segments...
>>>>>>>>
>>>>>>>> I wonder if a custom codec would be the right place to implement
>>>>>>>> caching / memory-resident structures for postings with payloads etc. You
>>>>>>>> could do that on a higher level too, but a codec seems to be the way to go
>>>>>>>> here, right?
>>>>>>>> To utilize per-segment capabilities, a callback for (re)opened segment
>>>>>>>> readers would be sufficient -- or do I miss something?
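The type-keyed lookup discussed above (getCache(TYPE) / IndexReader#get(Type<T>)) could be sketched roughly like this; it only illustrates the shape of such an API and is not an existing Lucene class.

  // Illustration of the API shape only; not an existing Lucene class.
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class TypedCacheRegistry {

    /** Marker key: one instance per kind of cached data; the type parameter
     *  documents and enforces the value type at the call site. */
    public static final class Type<T> {
    }

    private final Map<Type<?>, Object> entries = new ConcurrentHashMap<>();

    public <T> void put(Type<T> type, T value) {
      entries.put(type, value);
    }

    @SuppressWarnings("unchecked")
    public <T> T get(Type<T> type) {
      // The cast is safe because put() pairs each key with a value of its type.
      return (T) entries.get(type);
    }
  }

An application could, for example, define a Type<int[]> key for per-segment ordinals and keep one registry per segment reader; the unchecked cast stays hidden inside the registry.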
>>>>>>>>
>>>>>>>> simon
>>>>>>>>>
>>>>>>>>> I'm sure that after this is in place, we can refactor FieldCache to
>>>>>>>>> work w/ that API, perhaps as a Cache-specific implementation. But I'll
>>>>>>>>> leave that for later.
>>>>>>>>>
>>>>>>>>> I'd appreciate your comments. Before I set out to implement it, I'd like
>>>>>>>>> to know if the idea has any chance of making it to a commit :).
>>>>>>>>>
>>>>>>>>> Shai

--
Lance Norskog
goks...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org