[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

Mark Miller (JIRA) Sat, 06 Dec 2008 05:16:38 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654064#action_12654064
 ]


Mark Miller commented on LUCENE-831:
------------------------------------

Ah, the dirty secret of 831 - there is plenty more to do :) I've been pushing 
it down the path, but I've expected radical changes to be needed before it goes 
in.

bq. But I'm concerned, because this change continues the "materialize massive 
array for entire index" approach, which is the major remaining cost when 
(re)opening readers. EG, isMergable()/mergeData() methods build up the whole 
array from sub readers.

Originally, 3.0 wasn't so close, so there was more concern with back 
compatibility than there might be now. I think the method call will be a slight 
slowdown no matter what as well...even with an iterator approach. Perhaps other 
"wins" will make up for it though. It's certainly cleaner to support one mode.

bq. What would it take to never require materializing the full array for the 
index, for Lucene's internal purposes (external users may continue to do so if 
they want)? Ie, leave the array bound to the "leaf" IndexReader (ie, 
SegmentReader). 

I'm not sure I fully understand yet. If you use the ObjectArray mode, this is 
what happens, right? Each sub array is bound to the IndexReader and MultiReader 
will distribute the requests to the right subreader. Only if you use the 
primitive arrays and merging do you get the full arrays (when not using 
USE_OA_SORT).

bq. I realize this is a big change, but I think we need to get there eventually.

Sounds good to me.

bq. (why do we have get2?) 

Because a StringIndex needs to access both the array of Strings and a second 
array indexing into that. None of the other types need to access two arrays.
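
To make the two-array point concrete, here is a minimal sketch of a 
StringIndex-style structure. The field names lookup and order follow the 
existing FieldCache.StringIndex; the class name, the fromDocValues factory and 
the helper methods are made up for illustration and are not part of the patch. 
Treating slot 0 as "no value" is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Illustrative only: a StringIndex-like pair of arrays. */
final class StringIndexSketch {
    final String[] lookup; // sorted unique terms; slot 0 means "no value" in this sketch
    final int[] order;     // order[docID] = index into lookup for that document

    StringIndexSketch(String[] lookup, int[] order) {
        this.lookup = lookup;
        this.order = order;
    }

    /** Hypothetical factory: build both arrays from one value per document. */
    static StringIndexSketch fromDocValues(String[] perDoc) {
        TreeSet<String> unique = new TreeSet<>();
        for (String s : perDoc) {
            if (s != null) unique.add(s);
        }
        String[] lookup = new String[unique.size() + 1]; // lookup[0] stays null
        int next = 1;
        for (String s : unique) lookup[next++] = s;

        int[] order = new int[perDoc.length];
        for (int doc = 0; doc < perDoc.length; doc++) {
            order[doc] = perDoc[doc] == null
                    ? 0
                    : Arrays.binarySearch(lookup, 1, lookup.length, perDoc[doc]);
        }
        return new StringIndexSketch(lookup, order);
    }

    /** Sorting only needs the second array: compare ordinals, not Strings. */
    int compareDocs(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }

    /** Reading the value needs both arrays. */
    String value(int docID) {
        return lookup[order[docID]];
    }
}
```

A single get(docID) accessor can return either the ordinal or the String, but 
not both, which is why a second accessor is needed for this one type.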

bq. Couldn't we expose eg an IntData class (and all other types) that has int 
get(docID) abstract method, that delegate to child readers?

Yeah, I think this would be possible. If casting does indeed cost so much, this 
may bring things closer to the primitive array speed.
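
A rough sketch of what such an IntData might look like, assuming each leaf 
reader keeps its own int[] and a composite only routes requests. Apart from the 
IntData name and the int get(docID) method suggested in the quote, every name 
here (SegmentIntData, MultiIntData, maxDoc, starts) is hypothetical and not 
taken from the patch.

```java
/** Hypothetical typed accessor: no Object, no cast, one virtual call per lookup. */
abstract class IntData {
    abstract int get(int docID);
    abstract int maxDoc();
}

/** Leaf: the array stays bound to a single segment. */
final class SegmentIntData extends IntData {
    private final int[] values;

    SegmentIntData(int[] values) { this.values = values; }

    @Override int get(int docID) { return values[docID]; }
    @Override int maxDoc() { return values.length; }
}

/** Composite: delegates to the child that owns the docID; never builds a full array. */
final class MultiIntData extends IntData {
    private final IntData[] children;
    private final int[] starts; // starts[i] = first top-level docID of children[i]

    MultiIntData(IntData... children) {
        this.children = children;
        this.starts = new int[children.length + 1];
        for (int i = 0; i < children.length; i++) {
            starts[i + 1] = starts[i] + children[i].maxDoc();
        }
    }

    @Override int maxDoc() { return starts[children.length]; }

    @Override int get(int docID) {
        if (docID < 0 || docID >= maxDoc()) {
            throw new IndexOutOfBoundsException("docID " + docID);
        }
        // Linear scan keeps the sketch short; a binary search over starts would scale better.
        int i = children.length - 1;
        while (starts[i] > docID) i--;
        return children[i].get(docID - starts[i]);
    }
}
```

The cost per lookup is the routing step plus one virtual call, so whether this 
gets close to raw primitive array speed would still have to be measured.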

bq. I'm also generally confused by why we have the per-atomic-type switching 
happening in CacheKey subclasses and not CacheData.

From Hoss' original design. What are your concerns here? The right key gets 
you the right data :) I've actually mulled this over some, but it's too early 
in the morning to remember, I suppose. I'll look at it some more.

bq. If we could fix field-sorting like that (and I'm hazy on exactly how to do 
so), I think Lucene internally would then never need the full array?

That would be cool. Again, I'll try to explore in this direction. It doesn't 
need the full array when using the ObjectArray stuff now though (well, it kind 
of does, just split up over the readers).
 
bq. This change also adds USE_OA_SORT, which is scary to me because Object 
overhead per doc can be exceptionally costly. Why do we need to even offer that?

All this does at the moment (and I hate system properties, but for the moment, 
that's what's working) is switch between using the primitive arrays and merging 
or using the distributed ObjectArray for internal sorting. It defaults to using 
the primitive arrays and merging because it's 5-10% faster than using the 
ObjectArrays. The ObjectArray approach is just an ObjectArray backed by an 
array for each Reader - a MultiReader distributes a request for a doc field to 
the right Reader's ObjectArray.
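
For illustration, the "primitive arrays and merging" path amounts to something 
like the sketch below: every reader's array is copied into one array sized for 
the whole index. The class name, the merge helper and the property name are all 
hypothetical; the real switch is USE_OA_SORT, and its property name is not 
reproduced here.

```java
/** Illustrative sketch of the two sort sources; not the patch's code. */
final class SortSourceSketch {
    // Hypothetical property name, standing in for whatever USE_OA_SORT reads.
    static final String PROP = "sketch.useObjectArraySort";

    /** False unless the property is set to "true", so merging is the default. */
    static boolean useObjectArrays() {
        return Boolean.getBoolean(PROP);
    }

    /** Merging path: materialize one array for the entire index. */
    static int[] merge(int[][] perReader) {
        int total = 0;
        for (int[] values : perReader) total += values.length;

        int[] merged = new int[total];
        int offset = 0;
        for (int[] values : perReader) {
            System.arraycopy(values, 0, merged, offset, values.length);
            offset += values.length;
        }
        return merged;
    }
}
```

The copy is what makes the merged path cheap per lookup and expensive per 
reopen; the distributed path skips the copy and pays a routing step on every 
lookup instead.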

To your second comment...I'm going to have to spend some more time :)

No worries though, this is very much a work in progress. I'd love to have it in 
by 3.0 though. Glad to see someone else taking more of an interest - very hard 
for me to find the time to dig into it all that often. I'll work with the code 
some as I can, thinking more about your comments, and perhaps I can come up 
with some better responses/ideas. 



> Complete overhaul of FieldCache API/Implementation
> --------------------------------------------------
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>             Fix For: 3.0
>
>         Attachments: fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
>     a) eliminate global static map keyed on IndexReader (thus
>         eliminating synch block between completely independent IndexReaders)
>     b) allow more customization of cache management (ie: use 
>         expiration/replacement strategies, disk backed caches, etc)
>     c) allow people to define custom cache data logic (ie: custom
>         parsers, complex datatypes, etc... anything tied to a reader)
>     d) allow people to inspect what's in a cache (list of CacheKeys) for
>         an IndexReader so a new IndexReader can be likewise warmed. 
>     e) Lend support for smarter cache management if/when
>         IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
>     the new implementation, so there is no redundant caching as client code
>     migrates to new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

