[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Mark Miller (JIRA) Thu, 04 Sep 2008 06:04:42 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628331#action_12628331 ]


Mark Miller commented on LUCENE-1372:
-------------------------------------

Hey Paul,

I agree that your patch is more intuitive than the current behavior, but I'm 
not sure how far that gets us. If the sort worked on multiple tokens, you 
would expect it to compare lexicographically across every word, and even with 
your patch it won't: it will just use the first word rather than the last, 
right? In other words, I see it as a half fix.
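
To make the first-versus-last distinction concrete, here is a minimal sketch 
in plain Java. It is not Lucene code: the class FirstVsLastTermSort, the 
fill() helper, and the keepFirst flag are hypothetical names, and the 
postings map is a toy stand-in for walking a TermEnum in term order. It uses 
the "fruit" example from the issue, with doc ids 0..3 standing for docs 1..4.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class FirstVsLastTermSort {

    // Toy stand-in for the field-cache fill loop: visit terms in sorted
    // order and keep exactly one term per document. Not Lucene code.
    static String[] fill(SortedMap<String, int[]> postings, int maxDoc, boolean keepFirst) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (keepFirst && retArray[doc] != null) {
                    continue; // first-term policy: an earlier term already claimed this doc
                }
                // Without keepFirst, a later term overwrites an earlier one.
                retArray[doc] = entry.getKey();
            }
        }
        return retArray;
    }

    // The "fruit" example from the issue; doc ids 0..3 stand for docs 1..4.
    static SortedMap<String, int[]> fruitPostings() {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {0, 2, 3});
        postings.put("banana", new int[] {1, 2});
        postings.put("zebra", new int[] {3});
        return postings;
    }

    public static void main(String[] args) {
        // Last term wins: [apple, banana, banana, zebra]
        System.out.println(Arrays.toString(fill(fruitPostings(), 4, false)));
        // First term wins: [apple, banana, apple, apple]
        System.out.println(Arrays.toString(fill(fruitPostings(), 4, true)));
    }
}
```

Under the first-term policy, docs 1, 3 and 4 all collapse to the key "apple", 
so the sort cannot tell {"apple", "banana"} from {"apple", "zebra"}. That tie 
is the sense in which keeping the first term is only a half fix.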

So while it's low overhead, I wonder if any overhead is worth paying without 
getting the full fix? Currently the solution has been that you should only 
sort on single-token fields. In fact, there is a check for this (that just 
isn't very good at checking <g>) that will possibly throw an exception if you 
sort on a field with multiple tokens, so it's just not safe unless that check 
is taken out (FieldCacheImpl string sorting).

It appears that to do this right, we need to pay a cost in the general case, 
and sorting across multiple tokens may not be worth that, as you can already 
get around the limitation by using multiple fields etc. Personally though, if 
a patch were to be accepted, I think it would have to fully support the 
correct sorting and disable that check I mentioned (again, I doubt people want 
to pay that perf cost though). Finally, even if the committers decide this is 
a good way to go, the check needs to come out at a minimum.
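
As a rough illustration of what "fully correct" sorting might involve, the 
sketch below compares documents by their whole sorted value lists. It is 
plain modern Java rather than anything from FieldCacheImpl, and the names 
MultiValueSort, compareValues, sortDocs and fruitDocs are hypothetical. 
Treating a list that is a prefix of another as the smaller one is also an 
assumption; the issue does not specify a tie-breaking rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MultiValueSort {

    // Compare two already-sorted value lists element by element.
    // Assumption: if one list is a prefix of the other, the shorter sorts first.
    static int compareValues(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.size(), b.size());
    }

    // Return doc ids ordered by their full value lists.
    static List<Integer> sortDocs(List<List<String>> docValues) {
        List<Integer> order = new ArrayList<>();
        for (int doc = 0; doc < docValues.size(); doc++) {
            order.add(doc);
        }
        order.sort((x, y) -> compareValues(docValues.get(x), docValues.get(y)));
        return order;
    }

    // The "fruit" example from the issue; doc ids 0..3 stand for docs 1..4.
    static List<List<String>> fruitDocs() {
        return Arrays.asList(
            Arrays.asList("apple"),
            Arrays.asList("banana"),
            Arrays.asList("apple", "banana"),
            Arrays.asList("apple", "zebra"));
    }

    public static void main(String[] args) {
        // Prints [0, 2, 3, 1], i.e. docs 1, 3, 4, 2 in the issue's numbering.
        System.out.println(sortDocs(fruitDocs()));
    }
}
```

The cost shows up in the data structure: every document needs its complete 
list of values held somewhere, instead of the single array slot per document 
that a one-token field requires, and every comparison may walk several values.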




> Proposal: introduce more sensible sorting when a doc has multiple values for 
> a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting 
> on a field for which multiple values exist for one document. For example, 
> imagine a field "fruit" which is added to a document multiple times, with the 
> values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in 
> FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other 
> methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we look over the terms in their natural order and, on each 
> one, overwrite retArray[doc] with the value for each document with that term. 
> Effectively, this overwriting means that a string sort in this circumstance 
> will sort by the LAST term lexicographically, so the docs above will 
> effectively be sorted as if they had the single values ("apple", "banana", 
> "banana", "zebra") which is nonintuitive. To change this to sort on the first 
> term in the TermEnum seems relatively trivial and low-overhead; while it's 
> not perfect (it's not locale-aware, for example) the behaviour seems much more 
> sensible to me. Interested to see what people think.
> Patch to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


