[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654064#action_12654064 ]
Mark Miller commented on LUCENE-831: ------------------------------------ Ah, the dirty secret of 831 - there is plenty more to do :) I've been pushing it down the path, but I've expected radical changes to be needed before it goes in. bq. But I'm concerned, because this change continues the "materialize massive array for entire index" approach, which is the major remaining cost when (re)opening readers. EG, isMergable()/mergeData() methods build up the whole array from sub readers. Originally, 3.0 wasn't so close, so there was more concern with back compatibility than there might be now. I think the method call will be a slight slowdown no matter what as well...even with an iterator approach. Perhaps other "wins" will make up for it though. Its certainly cleaner to support 'one' mode. bq. What would it take to never require materializing the full array for the index, for Lucene's internal purposes (external users may continue to do so if they want)? Ie, leave the array bound to the "leaf" IndexReader (ie, SegmentReader). I'm not sure I fully understand yet. If you use the ObjectArray mode, this is what happens right? Each sub array is bound to the IndexReader and MultiReader will distribute the requests to the right subreader. Only if you use the primitive arrays and merging do you get the full arrays (when not using USE_OA_SORT). bq. I realize this is a big change, but I think we need to get there eventually. Sounds good to me. bq. (why do we have get2?) Because a StringIndex needs to access both the array of Strings and a second array indexing into that. None of the other types need to access two arrays. bq. Couldn't we expose eg an IntData class (and all other types) that has int get(docID) abstract method, that delegate to child readers? Yeah, I think this would be possible. If casting does indeed cost so much, this may bring things closer to the primitive array speed. bq. I'm also generally confused by why we have the per-atomic-type switching happening in CacheKey subclasses and not CacheData. >From Hoss' original design. What are your concerns here? The right key gets >you the right data :) I've actually mulled this over some, buts its too early >in the morning to remember I suppose. I'll look at it some more. bq. If we could fix field-sorting like that (and I'm hazy on exactly how to do so), I think Lucene internally would then never need the full array? That would be cool. Again, I'll try to explore in this direction. It doesn't need the full array when using the ObjectArray stuff now though (well, it kind of does, just split up over the readers). bq. This change also adds USE_OA_SORT, which is scary to me because Object overhead per doc can be exceptionally costly. Why do we need to even offer that? All this does at the moment (and I hate system properties, but for the moment, thats whats working) is switch between using the primitive arrays and merging or using the distributed ObjectArray for internal sorting. It defaults to using the primitive arrays and merging because its 5-10% faster than using the ObjectArrays. The ObjectArray approach is just an ObjectArray backed by an array for each Reader - a MultiReader distributes a requests for a doc field to the right Readers ObjectArray. To your second comment...I'm gong to have to spend some more time :) No worries though, this is very much a work in progress. I'd love to have it in by 3.0 though. Glad to see someone else taking more of an interest - very hard for me to find the time to dig into it all that often. I'll work with the code some as I can, thinking more about your comments, and perhaps I can come up with some better responses/ideas. > Complete overhaul of FieldCache API/Implementation > -------------------------------------------------- > > Key: LUCENE-831 > URL: https://issues.apache.org/jira/browse/LUCENE-831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Reporter: Hoss Man > Fix For: 3.0 > > Attachments: fieldcache-overhaul.032208.diff, > fieldcache-overhaul.diff, fieldcache-overhaul.diff, > LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, > LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, > LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch > > > Motivation: > 1) Complete overhaul the API/implementation of "FieldCache" type things... > a) eliminate global static map keyed on IndexReader (thus > eliminating synch block between completley independent IndexReaders) > b) allow more customization of cache management (ie: use > expiration/replacement strategies, disk backed caches, etc) > c) allow people to define custom cache data logic (ie: custom > parsers, complex datatypes, etc... anything tied to a reader) > d) allow people to inspect what's in a cache (list of CacheKeys) for > an IndexReader so a new IndexReader can be likewise warmed. > e) Lend support for smarter cache management if/when > IndexReader.reopen is added (merging of cached data from subReaders). > 2) Provide backwards compatibility to support existing FieldCache API with > the new implementation, so there is no redundent caching as client code > migrades to new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]