And it would be nice to have hooks in Lucene so I could avoid managing
refs to IndexReader on reopen() and close() myself.

Oh... and to complicate things, my index is near-real-time, using
IndexWriter.getReader(), so it's not just IndexReader we need to
change; IndexWriter should also provide a reader that has a proper
FieldCache implementation.

And I'm a bit uncomfortable digging that deep :)

On Mon, Sep 13, 2010 at 17:51, Danil ŢORIN <torin...@gmail.com> wrote:
> I'd second that....
>
> In my use case we need to search, sometimes with sort, on a pretty big index...
>
> So in the worst-case scenario we get an OOM while loading the
> FieldCache as it tries to create a huge array.
> You can increase -Xmx or move to a bigger host, but in the end there
> WILL be an index big enough to crash you.
>
> My idea would be to use something like EhCache with a few elements in
> memory and overflow to disk, so that if there are only a few unique
> terms it would be almost as fast as an array.
> Otherwise, in Collector/Sort/SortField/FieldComparator I would hit the
> EhCache on disk (yes, it would be a huge performance hit), but I won't
> get OOMs and the results will STILL be sorted.
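>
> Something along these lines is roughly what I have in mind (an untested
> sketch; the class name, field handling and cache settings are made up,
> and real code would need per-segment handling and a proper lifecycle):
>
> import java.io.IOException;
> import net.sf.ehcache.Cache;
> import net.sf.ehcache.CacheManager;
> import net.sf.ehcache.Element;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermDocs;
> import org.apache.lucene.index.TermEnum;
>
> public class DiskBackedSortValues {
>   // keep only a few entries in memory, overflow the rest to disk
>   // (assumes the CacheManager is configured with a disk store)
>   private final Cache cache;
>
>   public DiskBackedSortValues(CacheManager manager, String field) {
>     this.cache = new Cache("sort-" + field, 10000, true /*overflowToDisk*/,
>                            true /*eternal*/, 0, 0);
>     manager.addCache(cache);
>   }
>
>   /** Walk the postings of one field and map docId -> term text. */
>   public void load(IndexReader reader, String field) throws IOException {
>     TermEnum terms = reader.terms(new Term(field, ""));
>     TermDocs docs = reader.termDocs();
>     try {
>       do {
>         Term t = terms.term();
>         if (t == null || !field.equals(t.field())) break;
>         docs.seek(terms);
>         while (docs.next()) {
>           cache.put(new Element(docs.doc(), t.text()));
>         }
>       } while (terms.next());
>     } finally {
>       docs.close();
>       terms.close();
>     }
>   }
>
>   /** What a FieldComparator would call instead of indexing into an array. */
>   public String value(int doc) {
>     Element e = cache.get(doc);
>     return e == null ? null : (String) e.getObjectValue();
>   }
> }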
>
> Right now SegmentReader/FieldCacheImpl are pretty much hardcoded to
> FieldCache.DEFAULT.
>
> And yes, I'm on 3.x...
>
>
> On Mon, Sep 13, 2010 at 16:05, Tim Smith <tsm...@attivio.com> wrote:
>> I created https://issues.apache.org/jira/browse/LUCENE-2345 some time ago,
>> proposing pretty much what is being discussed here.
>>
>>
>>  -- Tim
>>
>> On 09/12/10 10:18, Simon Willnauer wrote:
>>>
>>> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
>>> <luc...@mikemccandless.com>  wrote:
>>>>
>>>> Having hooks to enable an app to manage its own "external, private
>>>> stuff associated w/ each segment reader" would be useful and it's been
>>>> asked for in the past.  However, since we've now opened up
>>>> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
>>>> already do this w/o core API changes?
>>>
>>> The visitor approach would be little more than syntactic sugar,
>>> where only new SubReader instances are passed to the callback.
>>> You can do the same with the already existing API, like
>>> gatherSubReaders or getSequentialSubReaders. Every API I was talking
>>> about would just be a simplification anyway, and could be built
>>> without changing the core.
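>>>
>>> Untested sketch of what I mean, using only the existing 3.x API (here
>>> "cache" is whatever per-segment structure the application maintains,
>>> and oldReader/newReader are assumed to be top-level DirectoryReaders):
>>>
>>> void warmNewSegments(IndexReader oldReader, IndexReader newReader)
>>>     throws IOException {
>>>   if (newReader == oldReader) return;   // nothing changed
>>>   Set<IndexReader> known = new HashSet<IndexReader>(
>>>       Arrays.asList(oldReader.getSequentialSubReaders()));
>>>   for (IndexReader sub : newReader.getSequentialSubReaders()) {
>>>     if (!known.contains(sub)) {
>>>       cache.load(sub);                  // warm only the new segments
>>>     }
>>>   }
>>> }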
>>>>
>>>> I know Earwin has built a whole system like this on top of Lucene --
>>>> Earwin how did you do that...?  Did you make core changes to
>>>> Lucene...?
>>>>
>>>> A custom Codec should be an excellent way to handle the specific use
>>>> case (caching certain postings) -- by doing it as a Codec, any time
>>>> anything in Lucene needs to tap into that posting (query scorers,
>>>> filters, merging, applying deletes, etc), it hits this cache.  You
>>>> could model it like PulsingCodec, which wraps any other Codec but
>>>> handles the low-freq ones itself.  If you do it externally how would
>>>> core use of postings hit it?  (Or was that not the intention?)
>>>>
>>>> I don't understand the filter use-case... the CachingWrapperFilter
>>>> already caches per-segment, so that reopen is efficient?  How would an
>>>> external cache (built on these hooks) be different?
>>>
>>> Man you are right - never mind :)
>>>
>>> simon
>>>>
>>>> For faster filters we have to apply them like we do deleted docs if
>>>> the filter is "random access" (such as being cached), LUCENE-1536 --
>>>> flex actually makes this relatively easy now, since the postings API
>>>> no longer implicitly filters deleted docs (ie you provide your own
>>>> skipDocs) -- but these hooks won't fix that right?
>>>>
>>>> Mike
>>>>
>>>> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
>>>> <simon.willna...@googlemail.com>  wrote:
>>>>>
>>>>> Hey Shai,
>>>>>
>>>>>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>
>>>>>> Hey Simon,
>>>>>>
>>>>>> You're right that the application can develop a caching mechanism
>>>>>> outside Lucene, and when reopen() is called, if the reader changed,
>>>>>> iterate over the sub-readers and init the Cache w/ the new ones.
>>>>>
>>>>> Alright, then we are on the same track I guess!
>>>>>
>>>>>> However, by building something like that inside Lucene, the
>>>>>> application will get more native support, and thus better
>>>>>> performance, in some cases. For example, consider a field fileType
>>>>>> with 10 possible values and, for the sake of simplicity, let's say
>>>>>> that the index is divided evenly across them. Your users always add
>>>>>> such a term constraint to the query (e.g. they want to get results of
>>>>>> fileType:pdf or fileType:odt, and perhaps sometimes both, but not
>>>>>> others). You have basically two ways of supporting this:
>>>>>> (1) Add such a term to the query / such a clause to a BooleanQuery
>>>>>> w/ an AND relation -- the con is that this term / posting is read
>>>>>> for every query.
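>>>>>>
>>>>>> Just to make (1) concrete, this is the plain 3.x way (userQuery
>>>>>> being whatever the user typed):
>>>>>>
>>>>>> BooleanQuery q = new BooleanQuery();
>>>>>> q.add(userQuery, BooleanClause.Occur.MUST);
>>>>>> q.add(new TermQuery(new Term("fileType", "pdf")), BooleanClause.Occur.MUST);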
>>>>>
>>>>> Oh, I wasn't saying that a cache framework would be superfluous and
>>>>> shouldn't be part of Lucene. My intention was rather to generalize
>>>>> this functionality so that we can make the API change more easily
>>>>> while at the same time bringing the infrastructure you are proposing
>>>>> into place.
>>>>>
>>>>> Regarding your example above, filters are a very good case where
>>>>> something like that could help improve performance, and we should
>>>>> provide it with Lucene core, but I would again prefer the least
>>>>> intrusive way to do so. If we can make that happen without adding any
>>>>> cache-specific API, we should do it. We really should try to sketch
>>>>> out a simple API which gives us access to the opened SegmentReaders
>>>>> and see if that would be sufficient for our use cases. Specialization
>>>>> will always be possible, but I doubt that it is needed.
>>>>>>
>>>>>> (2) Write a Filter which works at the top IR level and is refreshed
>>>>>> whenever the index is refreshed. This is better than (1), but it has
>>>>>> some disadvantages:
>>>>>>
>>>>>> (2.1) As Mike already showed (on some issue whose subject/number I
>>>>>> don't remember at the moment), if we could push Filter down to the
>>>>>> lower-level components of Lucene's search, so that e.g. it is used
>>>>>> like the deleted-docs OBS, we could get better performance w/ Filters.
>>>>>>
>>>>>> (2.2) The Filter is refreshed for the entire IR, and not just the
>>>>>> changed segments. The reason is that, outside Collector, you have no
>>>>>> way of telling IndexSearcher "use Filter F1 for segment S1 and F2 for
>>>>>> segment S2". Loading/refreshing the Filter may be expensive, and it
>>>>>> definitely won't perform well w/ NRT, where by definition you'd like
>>>>>> to get small changes searchable very fast.
>>>>>
>>>>> No doubt you are right about the above. A
>>>>> PerSegmentCachingFilterWrapper would be something we could easily do
>>>>> at the application level with the infrastructure we are talking about
>>>>> in place. I have a gut feeling that this particular problem should
>>>>> rather be addressed internally, and I'm not sure the high-level Cache
>>>>> mechanism is the right way to do it. But when I think about it twice,
>>>>> it might well be sufficient to do it that way...
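>>>>>
>>>>> Concretely, something like this is all I mean (untested sketch, and
>>>>> essentially what CachingWrapperFilter already does, since
>>>>> getDocIdSet() is called per segment as of 2.9):
>>>>>
>>>>> import java.io.IOException;
>>>>> import java.util.Collections;
>>>>> import java.util.Map;
>>>>> import java.util.WeakHashMap;
>>>>> import org.apache.lucene.index.IndexReader;
>>>>> import org.apache.lucene.search.DocIdSet;
>>>>> import org.apache.lucene.search.Filter;
>>>>>
>>>>> public class PerSegmentCachingFilterWrapper extends Filter {
>>>>>   private final Filter filter;
>>>>>   private final Map<IndexReader, DocIdSet> cache =
>>>>>       Collections.synchronizedMap(new WeakHashMap<IndexReader, DocIdSet>());
>>>>>
>>>>>   public PerSegmentCachingFilterWrapper(Filter filter) {
>>>>>     this.filter = filter;
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
>>>>>     // reader is the per-segment reader, so after reopen only the
>>>>>     // new/changed segments get recomputed
>>>>>     DocIdSet cached = cache.get(reader);
>>>>>     if (cached == null) {
>>>>>       cached = filter.getDocIdSet(reader);
>>>>>       cache.put(reader, cached);
>>>>>     }
>>>>>     return cached;
>>>>>   }
>>>>> }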
>>>>>>
>>>>>> Therefore I think that if we could provide the necessary hooks in
>>>>>> Lucene -- let's call it a Cache plug-in for now -- we could
>>>>>> incrementally improve the search process. I don't want to go too far
>>>>>> into the design of a generic plug-in mechanism, but you're right
>>>>>> (again :)) -- we could offer a reopen(PluginProvider) which is not
>>>>>> about Cache at all, and Cache would become one of the plugins the
>>>>>> PluginProvider provides. I'm just trying to learn from past
>>>>>> experience -- when the discussion is focused, there's a better chance
>>>>>> of getting to a resolution. However, if you think that in this case a
>>>>>> more generic API, such as PluginProvider, would get us to a
>>>>>> resolution faster, I don't mind spending some time thinking about it.
>>>>>> But for all practical purposes, we should IMO start w/ a Cache
>>>>>> plug-in, called just that, and if it catches on, generalize it ...
>>>>>
>>>>> I absolutely agree the API might be more generic, but our current
>>>>> use case / PoC should be caching. I don't like the name Plugin, but
>>>>> that's a personal thing, since you are not plugging anything in.
>>>>> Something like SubReaderCallback or ReaderVisitor might be more
>>>>> accurate, but let's argue about the details later. Why not sketch
>>>>> something out for the filter problem and follow on from there? The
>>>>> more iterations the better. And back to your question of whether this
>>>>> could end up being committable: if it works stand-alone and is not
>>>>> too tightly coupled, I would absolutely say yes.
>>>>>>
>>>>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still
>>>>>> on 3x), so I can't comment on how feasible that solution is. I'll
>>>>>> take your word for it that it's doable :). But this doesn't give us a
>>>>>> 3x solution ... the caching framework on trunk can be developed w/
>>>>>> Codecs.
>>>>>
>>>>> I guess nobody really has, except Mike and maybe one or two others,
>>>>> but from what I have done so far regarding codecs I would say that is
>>>>> the place to solve this particular problem. Maybe even lower than
>>>>> that, at the Directory level. Anyhow, let's focus on application-level
>>>>> caches for now. We are not aiming to provide a whole full-fledged
>>>>> Cache API, but rather the infrastructure to make it easier to build
>>>>> those on an app basis, which would be a valuable improvement. We
>>>>> should also look at Solr's cache implementations and how they could
>>>>> benefit from this effort; since Solr uses app-level caching, we can
>>>>> learn from it API-design-wise.
>>>>>
>>>>> simon
>>>>>>
>>>>>> Shai
>>>>>>
>>>>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>>>>> <simon.willna...@googlemail.com>  wrote:
>>>>>>>
>>>>>>> Hi Shai,
>>>>>>>
>>>>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Lucene's caches have been heavily discussed before (e.g.,
>>>>>>>> LUCENE-831, LUCENE-2133 and LUCENE-2394) and from what I can tell,
>>>>>>>> there have been many proposals to attack this problem, w/ no
>>>>>>>> developed solution.
>>>>>>>
>>>>>>> I didn't go through those issues, so forgive me if something I
>>>>>>> bring up has already been discussed.
>>>>>>> I have a couple of questions about your proposal - please find them
>>>>>>> inline...
>>>>>>>
>>>>>>>> I'd like to explore a different, IMO much simpler, angle of attack
>>>>>>>> on this problem. Instead of having Lucene manage the Cache itself,
>>>>>>>> we let the application manage it, but Lucene provides the necessary
>>>>>>>> hooks in IndexReader to allow it. The hooks I have in mind are:
>>>>>>>>
>>>>>>>> (1) IndexReader's current API for TermDocs, TermEnum,
>>>>>>>> TermPositions, etc. -- this already exists.
>>>>>>>>
>>>>>>>> (2) When reopen() is called, Lucene will take care to call
>>>>>>>> Cache.load(IndexReader), so that the application can pull whatever
>>>>>>>> information it needs from the passed-in IndexReader.
>>>>>>>
>>>>>>> Would that do anything other than passing the new reader (if
>>>>>>> reopened) to the cache's load method? I wonder if this is more than:
>>>>>>>
>>>>>>> if (newReader != oldReader)
>>>>>>>   cache.load(newReader);
>>>>>>>
>>>>>>> If so, something like that should be done per segment reader anyway,
>>>>>>> right? From my perspective this is no more than a callback or
>>>>>>> visitor that walks through the subreaders and is called for each
>>>>>>> reopened sub-reader. A cache-warming visitor / callback would then
>>>>>>> be trivial, and the API would be more general.
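>>>>>>>
>>>>>>> As a strawman (the names are made up, nothing like this exists in
>>>>>>> core today):
>>>>>>>
>>>>>>> public interface SubReaderCallback {
>>>>>>>   /** Called once for each segment reader that is new after (re)open. */
>>>>>>>   void newSubReader(IndexReader segmentReader) throws IOException;
>>>>>>> }
>>>>>>>
>>>>>>> // a cache-warming visitor would then be trivial:
>>>>>>> SubReaderCallback warmer = new SubReaderCallback() {
>>>>>>>   public void newSubReader(IndexReader segmentReader) throws IOException {
>>>>>>>     cache.load(segmentReader);
>>>>>>>   }
>>>>>>> };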
>>>>>>>
>>>>>>>
>>>>>>>> So to be more concrete about my proposal, I'd like to support
>>>>>>>> caching in the following way (and while I've spent some time
>>>>>>>> thinking about it, I'm sure there are great suggestions to improve
>>>>>>>> it):
>>>>>>>>
>>>>>>>> * The application provides a CacheFactory to IndexReader.open/reopen,
>>>>>>>> which exposes some very simple API, such as createCache or
>>>>>>>> initCache(IndexReader), etc. -- something which returns a Cache
>>>>>>>> object, which does not have a very strict/concrete API.
>>>>>>>
>>>>>>> My first question would be why the reader should know about the
>>>>>>> Cache if there is no strict / concrete API.
>>>>>>> I can follow you on the CacheFactory to create cache objects, but
>>>>>>> why would the reader have to know about / "receive" this object?
>>>>>>> Maybe this is answered further down, but I don't see why the notion
>>>>>>> of a "cache" must exist within open/reopen, or whether it could be
>>>>>>> implemented in a more general, cache-agnostic way.
>>>>>>>>
>>>>>>>> * IndexReader, most probably at the SegmentReader level, uses the
>>>>>>>> CacheFactory to create a new Cache instance and calls its
>>>>>>>> load(IndexReader) method, so that the Cache can initialize itself.
>>>>>>>
>>>>>>> That is what I was thinking above - yet is that more than a
>>>>>>> callback for each reopened or newly opened segment reader?
>>>>>>>
>>>>>>>> * The application can use the CacheFactory to obtain the Cache
>>>>>>>> object per IndexReader (for example, during Collector.setNextReader),
>>>>>>>> or we can have IndexReader offer a getCache() method.
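>>>>>>>>
>>>>>>>> For example (a sketch only -- CacheFactory and Cache are the
>>>>>>>> hypothetical classes from this proposal, not existing API):
>>>>>>>>
>>>>>>>> public class CacheAwareCollector extends Collector {
>>>>>>>>   private final CacheFactory cacheFactory;
>>>>>>>>   private Cache segmentCache;
>>>>>>>>   private int docBase;
>>>>>>>>
>>>>>>>>   public CacheAwareCollector(CacheFactory cacheFactory) {
>>>>>>>>     this.cacheFactory = cacheFactory;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public void setNextReader(IndexReader reader, int docBase) throws IOException {
>>>>>>>>     this.docBase = docBase;
>>>>>>>>     this.segmentCache = cacheFactory.getCache(reader); // per-segment cache
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public void collect(int doc) throws IOException {
>>>>>>>>     // consult segmentCache for doc, then record docBase + doc as usual
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public void setScorer(Scorer scorer) {}
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public boolean acceptsDocsOutOfOrder() { return true; }
>>>>>>>> }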
>>>>>>>
>>>>>>> :) Up to here, the cache is only used by the application itself,
>>>>>>> not by any Lucene API, right? I can think of a lot of
>>>>>>> application-specific data that would be useful to associate with an
>>>>>>> IR beyond the caching use case - again, this could be a more general
>>>>>>> API solving that problem.
>>>>>>>>
>>>>>>>> * One of the Cache APIs would be getCache(TYPE), where TYPE is a
>>>>>>>> String or Object, or an interface CacheType w/ no methods, just a
>>>>>>>> marker, and the application is free to impl it however it wants.
>>>>>>>> That's a loose API, I know, but it's completely in the application's
>>>>>>>> hands, which makes Lucene's code simpler.
>>>>>>>
>>>>>>> I like the idea, together with the metadata-association
>>>>>>> functionality from above: something like
>>>>>>> public <T> T IndexReader#get(Type<T> type).
>>>>>>> Hmm, that looks quite similar to Attributes, doesn't it?! :) However,
>>>>>>> this could be done in many ways, but again cache-agnostic.
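>>>>>>>
>>>>>>> Something like this, cache-agnostic and type-safe (all names made up):
>>>>>>>
>>>>>>> final class Type<T> {}  // marker / type token, like your CacheType
>>>>>>>
>>>>>>> public final class ReaderExtensions {
>>>>>>>   private final Map<Type<?>, Object> data = new HashMap<Type<?>, Object>();
>>>>>>>
>>>>>>>   public <T> void put(Type<T> type, T value) { data.put(type, value); }
>>>>>>>
>>>>>>>   @SuppressWarnings("unchecked")
>>>>>>>   public <T> T get(Type<T> type) { return (T) data.get(type); }
>>>>>>> }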
>>>>>>>>
>>>>>>>> * We can introduce a TermsCache, TermEnumCache and TermVectorCache
>>>>>>>> to provide the user w/ an IndexReader-like API, only more efficient
>>>>>>>> than, say, TermDocs -- something w/ random access to the docs
>>>>>>>> inside, perhaps even an OpenBitSet. Lucene can take advantage of it
>>>>>>>> if, say, we create a CachingSegmentReader which makes use of the
>>>>>>>> cache and checks, every time termDocs() is called, whether the
>>>>>>>> required Term is cached or not, etc. I admit I may be thinking too
>>>>>>>> far ahead.
>>>>>>>
>>>>>>> I see what you are trying to do here. I also see how this could be
>>>>>>> useful, but I guess coming up with a stable API which serves lots of
>>>>>>> applications would be quite hard. A CachingSegmentReader could be a
>>>>>>> very simple decorator which would not require touching the IR
>>>>>>> interface. Something like that could be part of Lucene, but I'm not
>>>>>>> sure it necessarily belongs in Lucene core.
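>>>>>>>
>>>>>>> Roughly what I mean by a decorator (untested sketch; the cache
>>>>>>> layout is made up, and freqs are not cached here to keep it short):
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.util.Map;
>>>>>>> import org.apache.lucene.index.*;
>>>>>>>
>>>>>>> public class CachingSegmentReader extends FilterIndexReader {
>>>>>>>   private final Map<Term, int[]> cachedPostings;
>>>>>>>
>>>>>>>   public CachingSegmentReader(IndexReader in, Map<Term, int[]> cachedPostings) {
>>>>>>>     super(in);
>>>>>>>     this.cachedPostings = cachedPostings;
>>>>>>>   }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public TermDocs termDocs(Term term) throws IOException {
>>>>>>>     final int[] docs = cachedPostings.get(term);
>>>>>>>     if (docs == null) {
>>>>>>>       return in.termDocs(term);       // fall through to the real postings
>>>>>>>     }
>>>>>>>     return new TermDocs() {           // replay the cached doc ids
>>>>>>>       private int i = -1;
>>>>>>>       public boolean next() { return ++i < docs.length; }
>>>>>>>       public int doc() { return docs[i]; }
>>>>>>>       public int freq() { return 1; } // freqs not cached in this sketch
>>>>>>>       public boolean skipTo(int target) {
>>>>>>>         while (++i < docs.length) {
>>>>>>>           if (docs[i] >= target) return true;
>>>>>>>         }
>>>>>>>         return false;
>>>>>>>       }
>>>>>>>       public int read(int[] d, int[] f) {
>>>>>>>         int n = 0;
>>>>>>>         while (n < d.length && next()) { d[n] = doc(); f[n] = 1; n++; }
>>>>>>>         return n;
>>>>>>>       }
>>>>>>>       public void seek(Term t) { throw new UnsupportedOperationException(); }
>>>>>>>       public void seek(TermEnum te) { throw new UnsupportedOperationException(); }
>>>>>>>       public void close() {}
>>>>>>>     };
>>>>>>>   }
>>>>>>> }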
>>>>>>>
>>>>>>>> That's more or less what I've been thinking. I'm sure there are
>>>>>>>> many details to iron out, but I hope I've managed to get the
>>>>>>>> general proposal across to you.
>>>>>>>
>>>>>>> Absolutely, that's how it works, isn't it!
>>>>>>>
>>>>>>>> What I'm after first is to allow applications to deal w/ postings
>>>>>>>> caching more natively. For example, if you have a posting w/
>>>>>>>> payloads you'd like to read into memory, or if you would like a
>>>>>>>> term's TermDocs to be cached (to be used as a Filter), etc. --
>>>>>>>> instead of writing something that works at the top IndexReader
>>>>>>>> level, you'd be able to take advantage of Lucene internals, i.e.
>>>>>>>> refresh the Cache only for the new segments ...
>>>>>>>
>>>>>>> I wonder if a custom codec would be the right place to implement
>>>>>>> caching / memory-resident structures for postings with payloads,
>>>>>>> etc. You could do that at a higher level too, but a codec seems to
>>>>>>> be the way to go here, right?
>>>>>>> To utilize per-segment capabilities, a callback for (re)opened
>>>>>>> segment readers would be sufficient -- or am I missing something?
>>>>>>>
>>>>>>> simon
>>>>>>>>
>>>>>>>> I'm sure that once this is in place, we can refactor FieldCache to
>>>>>>>> work w/ that API, perhaps as a Cache-specific implementation. But
>>>>>>>> I'll leave that for later.
>>>>>>>>
>>>>>>>> I'd appreciate your comments. Before I set out to implement it,
>>>>>>>> I'd like to know if the idea has any chance of making it to a
>>>>>>>> commit :).
>>>>>>>>
>>>>>>>> Shai
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
