And I think you've illustrated my viewpoint by saying
"two obvious choices".

I may prefer the first, and you may prefer the second. Neither is
necessarily more "correct" IMO; it depends on the problem
space. Choosing either one will be unpopular with anyone
who prefers the other....

And I suspect that 99 times out of 100, someone wanting to sort on
fields with multiple tokens hasn't thought the problem through
carefully. So I favor making the person whose use case actually
_wants_ this behavior do the work of implementing it, rather than
making everyone else deal with "surprising" orderings.

And duplicate entries in the result set get ugly. Say a user sorts
on a field containing 10,000 tokens. Now one doc is repeated
10,000 times in the result set. What should numFound report?
How should faceting and grouping count that doc?
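
To make the bookkeeping problem concrete, here is a rough Python
sketch of what "one entry per value" does to the hit count. This is
not Solr code; the toy docs dict and the expand_per_value name are
made up purely for illustration.

```python
# Hypothetical toy index: doc ID -> values of a multiValued field.
docs = {
    "doc1": ["aardvark", "emu"],
    "doc2": ["camel", "zebra"],
}

def expand_per_value(docs):
    """Emit one (value, doc_id) row per field value, ordered by value."""
    rows = [(value, doc_id)
            for doc_id, values in docs.items()
            for value in values]
    return sorted(rows)

rows = expand_per_value(docs)
for value, doc_id in rows:
    print(doc_id, value)

# Two docs matched, but the result list has four rows. Which of the
# two numbers a hit count should report is exactly the open question.
print(len(docs), "matching docs,", len(rows), "result rows")
```

With 10,000 values in a single doc, the same expansion yields 10,000
rows for that one doc.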

I think your first option is at least easy to explain, but I don't see
it as compelling enough to put the work into it, although I confess
I don't know the internals well enough to say how much work it would
take to find the first (and last, don't forget sorting desc) token
for each doc....
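
For illustration only, the first/last-token rule could look like the
Python sketch below. It assumes every doc has at least one value, and
the docs dict and function names are hypothetical, not anything in
Solr or Lucene.

```python
# Hypothetical toy index: doc ID -> values of a multiValued field.
docs = {
    "doc1": ["aardvark", "zebra"],
    "doc2": ["camel", "emu"],
}

def sort_key(values, descending=False):
    # Ascending sorts compare on the smallest value,
    # descending sorts compare on the largest one.
    return max(values) if descending else min(values)

def sort_docs(docs, descending=False):
    return sorted(
        docs,
        key=lambda doc_id: sort_key(docs[doc_id], descending),
        reverse=descending,
    )

ascending = sort_docs(docs)
descending = sort_docs(docs, descending=True)
print(ascending)
print(descending)
```

With these values doc1 comes first in both directions (aardvark wins
ascending, zebra wins descending), so a descending sort is not simply
the ascending sort reversed.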

Anyway, that's my story and I'm sticking to it <G>...

Best
Erick

On Wed, Sep 5, 2012 at 12:54 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> On Fri, 2012-08-31 at 13:35 +0200, Erick Erickson wrote:
>> Imagine you have two entries, aardvark and emu in your
>> multiValued field. How should that document sort relative to
>> another doc with camel and zebra? Any heuristic
>> you apply will be wrong for someone else....
>
> I see two obvious choices here:
>
> 1) Sort by the value that is ordered first by the comparator function.
> Doc1: aardvark, (emu)
> Doc2: camel, (zebra)
> This is what Uwe wants to do and it is normally done by preprocessing
> and collapsing to a single value.
> It could be implemented with an ordered multi-valued field cache by
> comparing on the first (or last, in the case of reverse sort) entry for
> each matching document.
>
> 2) Make duplicate entries in the result set, one for each value.
> Doc1: aardvark, (emu)
> Doc2: camel, (zebra)
> Doc1: (aardvark), emu
> Doc2: (camel), zebra
> I have a hard time coming up with a real world use case for this.
> It could be implemented by using a multi-valued field cache as above and
> putting the same document ID into the sliding window sorter once for
> each field value.
>
> Collapsing this into a single algorithm:
> Step through all IDs. For each ID, give access to the list of field
> values and provide a callback for adding one or more (value, ID)-pairs
> to the sliding window sorter.
>
>
> Are there some other realistic heuristics that I have missed?
>
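
A rough Python sketch of the collapsed algorithm quoted above, in
case it helps the discussion. Everything here is an assumption made
for illustration: the function names are invented, and
heapq.nsmallest merely stands in for the sliding window sorter. It
is not how Solr or Lucene collect hits.

```python
import heapq

# Hypothetical toy index: doc ID -> values of a multiValued field.
docs = {
    "doc1": ["aardvark", "emu"],
    "doc2": ["camel", "zebra"],
}

def first_value_only(doc_id, values, add):
    # Option 1: one pair per doc, keyed on its smallest value.
    add((min(values), doc_id))

def every_value(doc_id, values, add):
    # Option 2: one pair per field value, so docs can repeat.
    for value in values:
        add((value, doc_id))

def top_hits(docs, strategy, top_n):
    """Step through all IDs and let the strategy add (value, ID) pairs."""
    pairs = []
    for doc_id, values in docs.items():
        strategy(doc_id, values, pairs.append)
    # Keep only the top_n smallest pairs, as a bounded sorter would.
    return heapq.nsmallest(top_n, pairs)

option1 = top_hits(docs, first_value_only, top_n=10)
option2 = top_hits(docs, every_value, top_n=10)
print(option1)
print(option2)
```

The only thing that differs between the two options is the callback,
which is the appeal of the single-algorithm framing.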
