[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Cowan (JIRA) Wed, 04 Mar 2009 01:14:20 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678650#action_12678650 ]


Paul Cowan commented on LUCENE-1372:
------------------------------------

Yes, sorry, I might have been unclear. When I referred to the 'first term' I 
meant the first term lexicographically (at least insofar as binary order 
counts as 'lexicographic'), i.e. the 'lowest' term.

I like the idea of the pluggable behaviour, even if it's a simple boolean:

{code}
boolean sortByLowestTerm = ...

if (retArray[termDocs.doc()] == null || !sortByLowestTerm) {
   retArray[termDocs.doc()] = termval;
}
{code}
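
To make the effect of the flag concrete, here is a minimal, self-contained sketch of that loop. It is not Lucene code: the class name LowestTermFlagSketch, the fillCache method and the TreeMap standing in for TermEnum/TermDocs are all assumptions made for illustration, and the data is the "fruit" example from the issue description.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the FieldCacheImpl loop; not part of Lucene.
final class LowestTermFlagSketch {

  // postings maps each term to the docs containing it. A TreeMap iterates
  // its keys in sorted order, which models walking a TermEnum.
  static String[] fillCache(TreeMap<String, int[]> postings, int maxDoc,
                            boolean sortByLowestTerm) {
    String[] retArray = new String[maxDoc];
    for (Map.Entry<String, int[]> entry : postings.entrySet()) {
      String termval = entry.getKey();
      for (int doc : entry.getValue()) {
        // Same condition as in the snippet above: keep the first (lowest)
        // term seen for a doc, or always overwrite when the flag is off.
        if (retArray[doc] == null || !sortByLowestTerm) {
          retArray[doc] = termval;
        }
      }
    }
    return retArray;
  }

  public static void main(String[] args) {
    TreeMap<String, int[]> postings = new TreeMap<>();
    postings.put("apple", new int[] {1, 3, 4});
    postings.put("banana", new int[] {2, 3});
    postings.put("zebra", new int[] {4});
    // Index 0 is unused because the example docs are numbered 1 to 4.
    System.out.println(Arrays.toString(fillCache(postings, 5, true)));
    // prints [null, apple, banana, apple, apple]
    System.out.println(Arrays.toString(fillCache(postings, 5, false)));
    // prints [null, apple, banana, banana, zebra]
  }
}
```

With the flag off, the result matches the current behaviour described in the issue (docs sorted as "apple", "banana", "banana", "zebra").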

We could replace this with a pluggable 'TermSelectionPolicy' or somesuch (as 
suggested by Earwin on java-dev@).... perhaps something like

{code}
interface SortTermCollector {
  void addTermText(String text);
  Comparable toSortValue();
}
{code}

and then use a SortTermCollector[maxDoc] in the field cache, then iterate over 
the array at the end to convert the SortTermCollectors into Comparables (or 
make them directly comparable). Implementation of addTermText would be trivial 
for the first and last behaviour ("if (sortValue == null) sortValue = text" and 
"sortValue = text" respectively), but we could also use it for our 'full 
alphabetical ordering', it could perform functions on the terms as Earwin 
mentions, etc. This may or may not be overkill.
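
As a rough sketch of how that could look (assuming Java 8 or later for the Supplier): only the SortTermCollector interface comes from the proposal above. The wrapper class, the two collector implementations, buildSortValues and the SortedMap standing in for TermEnum/TermDocs are hypothetical names invented for this example, and toSortValue is typed as Comparable<?> rather than the raw Comparable.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Supplier;

// Hypothetical sketch; none of these classes exist in Lucene.
final class SortTermCollectorSketch {

  interface SortTermCollector {
    void addTermText(String text);
    Comparable<?> toSortValue();
  }

  // "First" behaviour: terms arrive in sorted order, so keep only the first.
  static final class LowestTermCollector implements SortTermCollector {
    private String sortValue;
    public void addTermText(String text) {
      if (sortValue == null) sortValue = text;
    }
    public String toSortValue() { return sortValue; }
  }

  // "Last" behaviour: every later term overwrites the earlier one.
  static final class HighestTermCollector implements SortTermCollector {
    private String sortValue;
    public void addTermText(String text) { sortValue = text; }
    public String toSortValue() { return sortValue; }
  }

  // Feeds each doc's terms to its collector, then converts the collectors
  // to sort values in a final pass over the array.
  static Comparable<?>[] buildSortValues(SortedMap<String, int[]> postings,
                                         int maxDoc,
                                         Supplier<SortTermCollector> factory) {
    SortTermCollector[] collectors = new SortTermCollector[maxDoc];
    for (Map.Entry<String, int[]> entry : postings.entrySet()) {
      for (int doc : entry.getValue()) {
        if (collectors[doc] == null) collectors[doc] = factory.get();
        collectors[doc].addTermText(entry.getKey());
      }
    }
    Comparable<?>[] sortValues = new Comparable<?>[maxDoc];
    for (int doc = 0; doc < maxDoc; doc++) {
      if (collectors[doc] != null) sortValues[doc] = collectors[doc].toSortValue();
    }
    return sortValues;
  }

  // The "fruit" example from the issue description, docs numbered 1 to 4.
  static SortedMap<String, int[]> fruitPostings() {
    SortedMap<String, int[]> postings = new TreeMap<>();
    postings.put("apple", new int[] {1, 3, 4});
    postings.put("banana", new int[] {2, 3});
    postings.put("zebra", new int[] {4});
    return postings;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(
        buildSortValues(fruitPostings(), 5, LowestTermCollector::new)));
    System.out.println(Arrays.toString(
        buildSortValues(fruitPostings(), 5, HighestTermCollector::new)));
  }
}
```

One cost visible even in this sketch is an extra object per document plus a second pass over the array, which is part of why this may be overkill compared with the boolean.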

I'm happy to try and get the changes you'd like for TrieRange, because they're 
an almost-but-not-quite-acceptable compromise for us (we're using a patched 
version of Lucene that does this now), but I'm content to use our own class 
internally, happy if we can expose the DEFAULT_PARSER implementations (and 
anything else -- my class sits in the same package so rebasing it may expose 
other things that need to be made protected etc) -- and anything beyond that 
(landing it in contrib or core) would be brilliant.

My two proposals certainly aren't mutually exclusive; they don't really touch 
each other.


> Proposal: introduce more sensible sorting when a doc has multiple values for 
> a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1372-MultiValueSorters.patch, 
> lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting 
> on a field for which multiple values exist for one document. For example, 
> imagine a field "fruit" which is added to a document multiple times, with the 
> values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in 
> FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other 
> methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we look over the terms in their natural order and, on each 
> one, overwrite retArray[doc] with the value for each document with that term. 
> Effectively, this overwriting means that a string sort in this circumstance 
> will sort by the LAST term lexicographically, so the docs above will 
> effectively be sorted as if they had the single values ("apple", "banana", 
> "banana", "zebra") which is nonintuitive. Changing this to sort on the first 
> term in the TermEnum seems relatively trivial and low-overhead; while it's 
> not perfect (it's not locale-aware, for example) the behaviour seems much more 
> sensible to me. Interested to see what people think.
> Patch to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
