Re: Sorting on multiValued fields via function query

2011-09-15 Thread boneill42


Was there a solution here?  Is there a ticket related to the sort=max(FIELD)
solution?

-brian

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-on-multiValued-fields-via-function-query-tp2681833p3340145.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sorting on multiValued fields via function query

2011-03-18 Thread Erick Erickson
+1 for both Chris's and Yonik's comments.

On Thu, Mar 17, 2011 at 3:19 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Thu, Mar 17, 2011 at 2:12 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 As the code stands now: we fail fast and let the person building the index
 make a decision.

 Indexing two fields when one could work is unfortunate though.
 I think what we should support (eventually) is a max() function that will also
 work on a multi-valued field and select the maximum value (i.e. it will
 simply bypass the check for multi-valued fields).

 Then one can utilize sort-by-function to do
 sort=max(author) asc

 -Yonik
 http://lucidimagination.com



Re: Sorting on multiValued fields via function query

2011-03-17 Thread Yonik Seeley
On Wed, Mar 16, 2011 at 6:08 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Also... if lucene is already capable of sorting on a multi-valued field by
 choosing the largest value, then largest vs. smallest is presumably just
 arbitrary there; there is presumably no performance implication to choosing
 the smallest instead of the largest. It just chooses the largest, according
 to Yonik.

It's a little more complicated than that.
It's not so much an explicit feature in lucene, but just what
naturally happens when building the field cache via uninverting an
indexed field.

It's pretty much this:

for every term in the field:
  for every document that matches that term:
    value[document] = term

And since terms are iterated from smallest to largest (and no, you
can't reverse this)
larger values end up overwriting smaller values.
There's no simple patch to pick the smallest rather than the largest.
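
The loop above can be sketched in a few lines of Python. This is only a toy model of uninverting, not Lucene's FieldCache code; the `uninvert` helper and the sample postings are made up for illustration.

```python
def uninvert(postings, num_docs):
    """Build a one-slot-per-document array from a term -> doc ids mapping.

    Toy model only: `postings` stands in for an indexed field, and terms
    are visited in ascending order, as described above.
    """
    value = [None] * num_docs
    for term in sorted(postings):      # terms iterate smallest to largest
        for doc in postings[term]:     # every document containing the term
            value[doc] = term          # a later (larger) term overwrites
    return value


# Hypothetical data: doc 0 has two authors, doc 1 has one.
postings = {"adams": [0], "brown": [1], "young": [0]}
values = uninvert(postings, num_docs=2)
```

Doc 0 ends up with "young", its largest term, simply because that assignment happens last.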

In the past, lucene used to try and detect this multi-valued case by
checking the number of values set in the whole array.  This was
unreliable though and the check was discarded.

-Yonik
http://lucidimagination.com


Re: Sorting on multiValued fields via function query

2011-03-17 Thread Bill Bell
Here is a workaround: stick the high value and the low value into other 
fields, and use those fields for sorting.
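
A minimal sketch of that workaround, assuming documents are plain dicts prepared before being sent to the indexer. The `add_sort_fields` helper and the `<field>_min` / `<field>_max` naming are hypothetical, not a Solr convention.

```python
def add_sort_fields(doc, field):
    """Return a copy of `doc` with single-valued min/max companions of `field`.

    The `<field>_min` / `<field>_max` names are an assumption of this sketch.
    """
    values = doc.get(field) or []
    enriched = dict(doc)
    if values:
        enriched[field + "_min"] = min(values)
        enriched[field + "_max"] = max(values)
    return enriched


doc = {"id": "1", "author": ["young", "adams", "miller"]}
indexed = add_sort_fields(doc, "author")
```

The companion fields would be declared single-valued in the schema, so sorting on them does not involve the multiValued field at all.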

Bill Bell
Sent from mobile




Re: Sorting on multiValued fields via function query

2011-03-17 Thread Bill Bell
By the way, this could be done automatically by Solr or Lucene behind the 
scenes. 

Bill Bell
Sent from mobile




Re: Sorting on multiValued fields via function query

2011-03-17 Thread Jonathan Rochkind

Aha, oh well, not quite as good/flexible as I hoped.

Still, if lucene is now behaving somewhat more predictably/rationally 
when sorting on multi-valued fields, then I think, in response to your 
other email on a similar thread, perhaps SOLR-2339  is now a mistake.


When lucene was returning completely unpredictable results -- and even 
sometimes crashing entirely -- when sorting on a multi-valued field, 
I think it made a lot of sense for Solr to prevent you from doing that, 
which is I think what SOLR-2339 does?  So I don't think it was 
necessarily a mistake in that context.


But if lucene now can sort a multi-valued field without crashing when 
there are 'too many' unique values, and with easily described and 
predictable semantics (use the minimal value in the multi-valued field 
as sort key) -- then it probably makes more sense for Solr to let you do 
that if you really want to, give you enough rope to hang yourself.


Jonathan




Re: Sorting on multiValued fields via function query

2011-03-17 Thread Chris Hostetter

: But if lucene now can sort a multi-valued field without crashing when there
: are 'too many' unique values, and with easily described and predictable
: semantics (use the minimal value in the multi-valued field as sort key) --
: then it probably makes more sense for Solr to let you do that if you really
: want to, give you enough rope to hang yourself.

(Clarification: it's the *maximal* value that gets used by lucene in 
that situation) 

I disagree.  

If we do what you describe we'd be relying on users to recognize when the 
sort logic is silently doing something tricky under the covers and make 
a conscious decision as to whether that is what they want, and if not then 
change their indexing to account for it.  

That seems like a recipe for confusion and unexpected behavior.

With SOLR-2339 in place, we tell users explicitly and up front that what 
they are attempting to do can not work as specified, and we force them to 
decide in advance how they want to deal with it -- by either indexing the 
lowest value or the highest value (or both in distinct fields).

As the code stands now: we fail fast and let the person building the index 
make a decision.  If we silently sort on the maximal value, we leave a nasty 
headache for people who don't realize they are misusing a multiValued 
field and then wonder why some sorts don't do what they expect in some 
situations.

Bottom line: from day 1, we have always documented that sorting on 
multiValued fields (or fields that produced more than one term per 
document) didn't work.  If people didn't notice that documentation, they 
aren't likely to notice any documentation that says it will sort on the 
maximal value either -- SOLR-2339 may introduce a pain point for people 
upgrading, but it introduces it early and loudly, not quietly at some 
arbitrary moment in the future when they're beating their heads against a 
desk wondering why some sort isn't working the way they expect it to 
because they added some more values to a few documents.
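
To make the "fail fast" idea concrete, here is a rough sketch of an up-front check. It only imitates the idea behind SOLR-2339 and is not Solr's actual code; the `schema` dict, `check_sortable`, and the exception name are all hypothetical.

```python
class MultiValuedSortError(ValueError):
    """Raised when a sort is requested on a multiValued field."""


def check_sortable(schema, field):
    """Reject the sort up front instead of silently picking one value.

    `schema` is a hypothetical dict of field name -> {"multiValued": bool}.
    """
    if schema[field].get("multiValued", False):
        raise MultiValuedSortError(
            "can not sort on multiValued field: " + field
        )
    return field


schema = {"author": {"multiValued": True}, "title": {"multiValued": False}}
```

With this shape, `check_sortable(schema, "author")` raises before any query runs, while `check_sortable(schema, "title")` passes through.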




-Hoss


Re: Sorting on multiValued fields via function query

2011-03-17 Thread Yonik Seeley
On Thu, Mar 17, 2011 at 2:12 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 As the code stands now: we fail fast and let the person building the index
 make a decision.

Indexing two fields when one could work is unfortunate though.
I think what we should support (eventually) is a max() function that will also
work on a multi-valued field and select the maximum value (i.e. it will
simply bypass the check for multi-valued fields).

Then one can utilize sort-by-function to do
sort=max(author) asc
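
As a rough illustration of the semantics being proposed (not an existing Solr function at the time of this thread), sorting by the maximum of a multi-valued field could look like the sketch below. The `sort_by_max` helper and the sample documents are hypothetical, and placing documents without the field first is an assumption of this sketch.

```python
def sort_by_max(docs, field, descending=False):
    """Order documents by the largest value of a multi-valued field.

    Models the proposed `sort=max(field) asc|desc` semantics in plain Python.
    """
    def key(doc):
        values = doc.get(field) or []
        # (0, "") sorts before any (1, value) pair, so empty docs come first.
        return (1, max(values)) if values else (0, "")
    return sorted(docs, key=key, reverse=descending)


docs = [
    {"id": "a", "author": ["adams", "young"]},
    {"id": "b", "author": ["brown"]},
    {"id": "c", "author": ["clark", "davis"]},
]
order = [d["id"] for d in sort_by_max(docs, "author")]
```

Here doc "a" sorts last in ascending order because its largest author, "young", is the sort key, even though it also contains "adams".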

-Yonik
http://lucidimagination.com


Re: Sorting on multiValued fields via function query

2011-03-16 Thread harish.agarwal
Hi David,

It did seem to work correctly for me - we had it running on our production
indexes for some time and we never noticed any strange sorting behavior. 
However, many of our multiValued fields are single valued for the majority
of documents in our index so we may not have noticed the incorrect sorting
behaviors.

Regardless, I understand the reasoning behind the restriction.  I'm
interested in getting around it by using a functionQuery to reduce
multiValued fields to a single value.  It sounds like this isn't possible;
is that correct?  Ideally I'd like to sort by the maximum value on
descending sorts and the minimum value on ascending sorts.  Is there any
movement towards implementing this sort of behavior?

Best,
-Harish



Re: Sorting on multiValued fields via function query

2011-03-16 Thread Smiley, David W.
Heh heh, you say "it worked correctly for me" yet you didn't actually have 
multi-valued data ;-)  Funny.

The only solution right now is to store the max and min into indexed 
single-valued fields at index time.  This is pretty straightforward to do.  
Even if/when Solr supports sorting on a multi-valued field, I doubt it would 
perform as well as what I suggest.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/



Re: Sorting on multiValued fields via function query

2011-03-16 Thread Chris Hostetter

: However, many of our multiValued fields are single valued for the majority
: of documents in our index so we may not have noticed the incorrect sorting
: behaviors.

that would make sense ... if you use a multiValued field as if it were 
single valued, you would never encounter a problem.  if you had *some* 
multivalued fields your results would be sorted extremely arbitrarily for 
those docs that did have multiple values, unless you had more distinct 
values than you had documents -- at which point you would get a hard crash 
at query time.
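
The crash described here can be modelled as a guard that fires only when the field has more distinct terms than the index has documents. The sketch below is a toy reconstruction from that description, not Lucene's implementation; `build_term_index` and the data are hypothetical. It shows how such a guard can stay silent even though a document is multi-valued.

```python
def build_term_index(postings, num_docs):
    """Toy version of a check that trips only on 'too many' distinct terms.

    `postings` maps term -> list of doc ids.
    """
    if len(postings) > num_docs:
        raise RuntimeError("more distinct terms than documents")
    value = [None] * num_docs
    for term in sorted(postings):
        for doc in postings[term]:
            value[doc] = term
    return value


# Doc 0 is multi-valued, yet the guard stays silent: 3 terms, 3 documents.
quiet = build_term_index({"a": [0], "b": [0, 1], "c": [2]}, num_docs=3)
```

Doc 0 silently keeps only one of its two values, while a field with two terms in a one-document index would raise.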

: Regardless, I understand the reasoning behind the restriction, I'm
: interested in getting around it by using a functionQuery to reduce
: multiValued fields to a single value.  It sounds like this isn't possible,

I don't think we have any functions that do that -- functions are composed 
of value sources which may be composed of other value sources but 
ultimately the data comes from somewhere, and in every case i can think of 
(except for constant values) that data comes from the FieldCache -- the 
same FieldCache used for sorting.

I don't think there are any value sources that will let you specify a 
multiValued field, and then pick one of those values based on a 
rule/function ... even the PolyFields used for spatial search work by 
using multiple field names under the covers (N distinct field names for an 
N-dimensional space)

: is that correct?  Ideally I'd like to sort by the maximum value on
: descending sorts and the minimum value on ascending sorts.  Is there any
: movement towards implementing this sort of behavior?

this is a fairly classic use case of just having multiple fields.  even if 
the logic was implemented to support this at query time, it could never be 
faster than sorting on a single valued field that you populate with the 
min/max at indexing time -- the mantra of fast I/R is that if you can 
precompute it independently of the individual search criteria, you should 
(it's the whole foundation for why the inverted index exists)
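
Assuming min/max companion fields were populated at index time, the query side only has to pick the right one for the requested direction. A small sketch follows; the `sort_param` helper and the `_min` / `_max` field names are hypothetical, and only the `field direction` form of the `sort` parameter is taken from Solr.

```python
def sort_param(field, direction):
    """Build a Solr `sort` value that targets a precomputed companion field.

    Assumes `<field>_min` and `<field>_max` exist as single-valued fields.
    """
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    suffix = "_min" if direction == "asc" else "_max"
    return field + suffix + " " + direction


params = {"q": "*:*", "sort": sort_param("price", "desc")}
```

This gives the behavior asked for in the thread: the minimum value drives ascending sorts and the maximum drives descending sorts, with no per-query computation.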


-Hoss


Re: Sorting on multiValued fields via function query

2011-03-16 Thread Yonik Seeley
On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : However, many of our multiValued fields are single valued for the majority
 : of documents in our index so we may not have noticed the incorrect sorting
 : behaviors.

 that would make sense ... if you use a multiValued field as if it were
 single valued, you would never encounter a problem.  if you had *some*
 multivalued fields your results would be sorted extremely arbitrarily for
 those docs that did have multiple values, unless you had more distinct
 values than you had documents -- at which point you would get a hard crash
 at query time.

AFAIK, not any more.  Since that behavior was very unreliable, it has
been removed and you can reliably sort by any multi-valued field in
lucene (with the sort order being defined by the largest value if
there are multiple).

-Yonik
http://lucidimagination.com


Re: Sorting on multiValued fields via function query

2011-03-16 Thread Jonathan Rochkind
Huh, so lucene is actually doing what has been commonly described as 
impossible in Solr?


But is Solr trunk, as the OP seemed to report, still not aware of 
this and raising an error on a sort on a multi-valued field, instead of 
just saying, okay, we'll just pass it to lucene anyway and go with lucene's 
approach to sorting on a multi-valued field (that is, apparently, using 
the largest value)?


If so... that kind of sounds like a bug/misfeature, yes, no?

Also... if lucene is already capable of sorting on a multi-valued field by 
choosing the largest value, then largest vs. smallest is presumably just 
arbitrary there; there is presumably no performance implication to 
choosing the smallest instead of the largest. It just chooses the 
largest, according to Yonik.


So... if someone patched lucene, so whether it chose the largest or 
smallest in that case was a parameter passed in -- probably not a large 
patch since lucene, says Yonik, already has been enhanced to choose 
largest always -- and then patched Solr to take a param and pass it to 
Lucene for this purpose, which presumably also wouldn't be a large patch 
if lucene supported it -- then we'd have the feature OP asked for.


Based on Yonik's description (assuming I understand correctly and he's 
correct), it doesn't sound like a lot of code. But it's still beyond my 
unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I 
have the interest for my own app needs at the moment. But if OP or 
someone else has both, it sounds like a plausible feature?





Re: Sorting on multiValued fields via function query

2011-03-16 Thread Bill Bell
I agree with this and it is even needed for function sorting for multivalued 
fields. See geohash patch for one way to deal with multivalued fields on 
distance. Not ideal but it works efficiently.

Bill Bell
Sent from mobile




Re: Sorting on multiValued fields via function query

2011-03-15 Thread David Smiley (@MITRE.org)
Hi Harish. 
Did sorting on multiValued fields actually work correctly for you before?
I'd be surprised if so.  I could be wrong but I think you previously always
got the sorting effects of whatever was the last indexed value. It is indeed
the case that the FieldCache only supports up to one indexed value per
document for a field. Recently Hoss added sanity checks that you are seeing the results of: 
https://issues.apache.org/jira/browse/SOLR-2339   You might want to comment
on that issue with proof (e.g. a simple test) that it worked before but not
now.

~ David

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book