[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746643#action_12746643 ]
Tim Smith commented on LUCENE-1821:
-----------------------------------

Lots of new comments to respond to :) will try to cover them all.

bq. decent comparator (StringOrdValComparator) that operates per segment.

Still, the StringOrdValComparator will have to break down and call String.equals() whenever it compares docs in different IndexReaders. It also has to do more maintenance in general than a plain string-ord comparator whose cache spans all IndexReaders would need. While the StringOrdValComparator may be faster in 2.9 than string sorting was in 2.4, it's not as fast as it could be if the cache were created at the IndexSearcher level. I looked at the new string sorting code last week, and it looks pretty smart about reducing the number of String.equals() calls needed, but this adds extra complexity and still falls back to String.equals() calls, which translates to slower sorting than would otherwise be possible.

bq. one option might be to subclass DirectoryReader

Is the idea here to disable per-segment searching? I don't actually want to do that. I want to use the per-segment searching functionality to take advantage of per-segment caches where possible, and map docs to the IndexSearcher context where I can't cache per segment.

bq. Could you compute the top-level ords, but then break it up per-segment?

I think I see what you're getting at here, and I've already considered it as a potential solution. The cache will always need to be created at the top-most level, but it would be pre-broken out into a per-segment cache whose context is the top-level IndexSearcher/MultiReader. The biggest problem here is the complexity of actually creating such a cache, which I'm sure will translate to slower cache loading (hard to say how much slower without implementing it). I do plan to try this approach, but I expect it to be at least a week or two out from now.

For now, I've updated my code to work per segment by adding the docBase when performing the lookup into this cache (which is per-IndexSearcher). I did this using the getIndexReaderBase() function I added to my subclass of IndexSearcher, called at Scorer construction time. (I can live with this, but I would like to see getIndexReaderBase() added to IndexSearcher, and the IndexSearcher passed to Weight.scorer(), so I don't need to hold onto my IndexSearcher subclass in my Weight implementation.)

bq. just return the "virtual" per-segment DocIdSet.

That's what I'm doing now. I use the docid base for the IndexReader, along with its maxDoc, to have the Scorer represent a virtual slice covering just the segment in question. The only real problem here is that during Scorer initialization I have to call fullDocIdSetIter.advance(docBase) in the Scorer constructor. If advance(int) for the DocIdSet in question is O(N), this adds an extra per-segment penalty that did not exist before.
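For illustration, here is a rough sketch of such a virtual slice (untested; the class and field names are made up, but it follows the approach just described, including the advance(docBase) call in the constructor):

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch only: exposes the slice [docBase, docBase + maxDoc) of a
// searcher-level DocIdSetIterator as a segment-local iterator.
public class SliceDocIdSetIterator extends DocIdSetIterator {
  private final DocIdSetIterator full; // iterator over the whole index
  private final int docBase;           // this segment's base in the top-level doc id space
  private final int end;               // docBase + maxDoc (exclusive)
  private int doc = -1;                // current segment-local doc id
  private boolean first = true;        // constructor already positioned the iterator

  public SliceDocIdSetIterator(DocIdSetIterator full, int docBase, int maxDoc)
      throws IOException {
    this.full = full;
    this.docBase = docBase;
    this.end = docBase + maxDoc;
    // Position the underlying iterator at the start of this segment.
    // If advance() is O(N), this is the per-segment penalty described above.
    full.advance(docBase);
  }

  public int docID() {
    return doc;
  }

  public int nextDoc() throws IOException {
    // The constructor already advanced to the first candidate doc.
    int d = first ? full.docID() : full.nextDoc();
    first = false;
    if (d >= end) { // also true for NO_MORE_DOCS (Integer.MAX_VALUE)
      return doc = NO_MORE_DOCS;
    }
    return doc = d - docBase; // rebase to a segment-local id
  }

  public int advance(int target) throws IOException {
    first = false;
    int t = docBase + target; // rebase the target to the top-level doc id space
    // The constructor may already have positioned the underlying iterator at
    // or past the target, and advance() is not allowed to move backwards.
    int d = (full.docID() >= t) ? full.docID() : full.advance(t);
    if (d >= end) {
      return doc = NO_MORE_DOCS;
    }
    return doc = d - docBase;
  }
}
{code}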
bq. This isn't a long-term solution, since the order in which Lucene visits the readers isn't in general guaranteed

That's where IndexSearcher.getIndexReaderBase(IndexReader) comes into play. If you call this in your Scorer to get the docBase, it doesn't matter what order the segments are searched in, since it will always return the proper base (in the context of that IndexSearcher).

Here's another potential thought (very rough, I haven't consulted the code to see how feasible it is): what if Similarity had a method called getDocIdBase(IndexReader)? The searcher implementation could then wrap the provided Similarity to supply the proper base calculation. Similarity is already passed through the whole chain of Weight creation, and it is passed into the Scorer. Obviously, a Query implementation can completely ignore the Searcher's Similarity and drop in its own (but that would mean it doesn't care about getting these docid bases). I think this approach could potentially resolve all the MultiSearcher difficulties.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per-segment basis, there is no way for a Scorer to know the "actual" doc id of the documents it matches (only the relative doc offset into the segment).
> If you use caches in your Scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer, because the Scorer is not passed the offset needed to calculate the "real" docid.
> Suggest having the Weight.scorer() method also take an integer for the doc offset.
> The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset.
> All Weights that have "sub" weights must pass this offset down to the "sub" weights they create.
> Details on the workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add an "int getIndexReaderBase(IndexReader)" method to your subclass
> * During Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass)
> * During Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * The Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: a more efficient implementation is possible if you cache the
> // result of gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getIndexReader()) {
>     return 0;
>   }
>   List readers = new ArrayList();
>   gatherSubReaders(readers, getIndexReader());
>   int maxDoc = 0;
>   for (Iterator iter = readers.iterator(); iter.hasNext();) {
>     IndexReader r = (IndexReader) iter.next();
>     if (r == reader) {
>       return maxDoc; // docBase of this sub-reader
>     }
>     maxDoc += r.maxDoc();
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation
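To illustrate the last workaround step above, here is a rough sketch of a Scorer that rebases its docids using that offset. This is illustrative only and not part of the attached patch; the class name, the float[] cache, and the wrapping arrangement are all made up:

{code}
import java.io.IOException;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Similarity;

// Sketch only: wraps a per-segment Scorer and uses the docBase obtained
// from YourSearcher.getIndexReaderBase(reader) to index a searcher-level
// cache. The float[] cache is a stand-in for any top-level per-doc data.
public class RebasingScorer extends Scorer {
  private final Scorer in;     // the real per-segment scorer
  private final int docBase;   // result of getIndexReaderBase(reader)
  private final float[] cache; // indexed by top-level (searcher-wide) docid

  public RebasingScorer(Similarity similarity, Scorer in, int docBase, float[] cache) {
    super(similarity);
    this.in = in;
    this.docBase = docBase;
    this.cache = cache;
  }

  public int docID() { return in.docID(); }

  public int nextDoc() throws IOException { return in.nextDoc(); }

  public int advance(int target) throws IOException { return in.advance(target); }

  public float score() throws IOException {
    // docBase + segment-local id == searcher-wide id, regardless of the
    // order in which the segments happen to be searched
    return in.score() * cache[docBase + in.docID()];
  }
}
{code}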