Re: strange sorting results: each word in field is sorted

2009-08-19 Thread Erik Hatcher


On Aug 19, 2009, at 3:50 PM, Paul Rosen wrote:
I'm surprised you're not seeing an exception when trying to sort on
title given this configuration.  Sorting must be done on single-valued
indexed fields that have at most a single term indexed per
document.  I recommend you use copyField to copy title to
title_sort and configure title_sort as a "string" field or as a
field type that analyzes to only a single term (for example, keyword
tokenizing -> lower case filter).

   Erik


I want to double check this (since you probably remember how long it  
takes to recreate the indexes). I think you're saying to add these  
two lines, then re-index:

[two schema.xml lines stripped by the archive]

For the simplest case, yes.  You do have to be careful the sort field  
is not multiValued - and I believe the NINES model allowed for  
multiple titles.  So it might be necessary for your indexing client to  
specify the single sort field value instead of leveraging copyField.
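
A minimal sketch of what "the indexing client specifies the single sort field value" could look like. The record layout, the function name `build_solr_doc`, and the choice of the first title as the sort value are assumptions made for illustration; only the field names `title` and `title_sort` come from this thread.

```python
def build_solr_doc(record):
    """Build a dict of Solr field values from a source record.

    `record` is a hypothetical dict with a "titles" list.
    """
    titles = [t.strip() for t in record.get("titles", []) if t and t.strip()]
    doc = {"title": titles}  # the multiValued field keeps every title
    if titles:
        # Assumption: the first title listed is the one to sort by.
        doc["title_sort"] = titles[0]
    return doc


doc = build_solr_doc({"titles": ["The Raven", "Raven, The (1845)"]})
print(doc["title_sort"])
```

Because the client sends exactly one `title_sort` value per document, the sort field stays single-valued even when `title` has several values.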


Now, this is case-sensitive, right? So would this make it
case-insensitive?


Yes, the above would be case sensitive.

[schema.xml snippet stripped by the archive; surviving fragments: sortMissingLast="true"> ... stored="true"/>]

That definition isn't quite right - you must have at least
a tokenizer.  The KeywordTokenizer "tokenizes" the entire string into  
a single token, though.  In Solr's example schema there is a field  
type like this:


[fieldType definition stripped by the archive; surviving fragment: sortMissingLast="true" omitNorms="true">]


Also, I'm guessing from seeing the current results that this  
wouldn't collate the characters with diacritical marks correctly. Is  
there a way to indicate that, for instance, A-grave would sort next  
to A?


Yes, you can incorporate a diacritic-normalizing filter into the
analyzer definition above: ASCIIFoldingFilter or the ISO Latin-1 one
(ISOLatin1AccentFilter).
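
To make the effect of lowercasing plus accent folding concrete, here is a rough Python analogue. It is not Solr's implementation: it uses Unicode NFKD decomposition, which handles accented letters such as A-grave but, unlike ASCIIFoldingFilter, leaves characters such as "ø" untouched. The function name `fold_sort_key` and the sample titles are made up for the example.

```python
import unicodedata


def fold_sort_key(text):
    """Lowercase `text` and strip combining accents (rough analogue only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower().strip()


titles = ["Zebra", "À la carte", "apple", "Able"]
print(sorted(titles))                     # raw code-point order
print(sorted(titles, key=fold_sort_key))  # folded, case-insensitive order
```

With the folded key, "À la carte" sorts among the other titles starting with "a" instead of after "Zebra".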


And, while I'm on the subject, I have to do the same thing with the  
Author field, but unfortunately, that is sometimes "First Last" and  
sometimes "Last, First". Is there any way to sort those by last  
name, or do I just have to encourage the index people to be more  
consistent?


Good luck with getting consistency in your domain!  :)

But it certainly makes sense to request that from the data providers,  
in at least some form that can be turned into the sortable value.


I can think of a fairly simple algorithm, but am not sure where to  
implement it:


- if the word "and" or "&" appears, just look at the left side of
the field (in other words, sort by the first author listed.)
- if there is a comma, but it is part of ", jr." or some other
common suffix like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it
is "jr", "sr", "III", etc., in which case sort by the word before that.
- otherwise, sort by the first word.


Probably best to implement that in the indexing client code, but  
simple transformations could be implemented using the  
PatternReplaceFilter like above.
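
As a sketch of how the rules listed above might look in indexing client code: the function name `author_sort_key`, the suffix list, and the lowercasing of the result are assumptions, and the rules are Paul's heuristics rather than anything built into Solr.

```python
import re

# Assumed suffix list; extend it to match the data.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def _is_suffix(word):
    return word.rstrip(".").lower() in SUFFIXES


def author_sort_key(author):
    """Return a lowercase sort key for an author string."""
    # Rule 1: with "and" or "&", only look at the left-hand author.
    first = re.split(r"\s+and\s+|\s*&\s*", author.strip(),
                     maxsplit=1, flags=re.IGNORECASE)[0]
    # Rule 2: a comma that only introduces a suffix (", Jr.") is ignored.
    parts = [p.strip() for p in first.split(",") if p.strip()]
    parts = [p for p in parts if not _is_suffix(p)]
    if not parts:
        return ""
    if len(parts) > 1:
        # "Last, First": sort by the first word.
        return parts[0].split()[0].lower()
    # No comma: sort by the last word, skipping trailing suffixes.
    words = [w for w in parts[0].split() if not _is_suffix(w)]
    return words[-1].lower() if words else ""


for name in ["Smith, John", "John Smith III", "Martin Luther King, Jr."]:
    print(name, "->", author_sort_key(name))
```

The client would send the result as the value of a separate single-valued sort field, leaving the displayed author field unchanged. Multi-word surnames such as "Van Dyke, Henry" sort by their first word under these rules.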


Erik



Re: strange sorting results: each word in field is sorted

2009-08-19 Thread Paul Rosen

Erik Hatcher wrote:


On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:
You can see the problem here (at least until it's fixed!): 
http://nines.performantsoftware.com/search/saved?user=paul&name=poem


Hi Paul - that project looks familiar!  :)


Hi Erik! I should hope so! And I've gone a year without having to delve
into Solr much since it has just plain worked.


Thanks for the speedy reply.

I'm surprised you're not seeing an exception when trying to sort on
title given this configuration.  Sorting must be done on single-valued
indexed fields that have at most a single term indexed per document.  I
recommend you use copyField to copy title to title_sort and configure
title_sort as a "string" field or as a field type that analyzes to only a
single term (for example, keyword tokenizing -> lower case filter).


Erik


I want to double check this (since you probably remember how long it 
takes to recreate the indexes). I think you're saying to add these two 
lines, then re-index:

[two schema.xml lines stripped by the archive]

Now, this is case-sensitive, right? So would this make it case-insensitive?

[schema.xml snippet stripped by the archive]


Also, I'm guessing from seeing the current results that this wouldn't 
collate the characters with diacritical marks correctly. Is there a way 
to indicate that, for instance, A-grave would sort next to A?


And, while I'm on the subject, I have to do the same thing with the 
Author field, but unfortunately, that is sometimes "First Last" and 
sometimes "Last, First". Is there any way to sort those by last name, or 
do I just have to encourage the index people to be more consistent?


I can think of a fairly simple algorithm, but am not sure where to 
implement it:


- if the word "and" or "&" appears, just look at the left side of the
field (in other words, sort by the first author listed.)
- if there is a comma, but it is part of ", jr." or some other common
suffix like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it is
"jr", "sr", "III", etc., in which case sort by the word before that.
- otherwise, sort by the first word.

That would get most of the cases.

Thanks,
Paul


Re: strange sorting results: each word in field is sorted

2009-08-19 Thread Erik Hatcher


On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:

You can see the problem here (at least until it's fixed!): 
http://nines.performantsoftware.com/search/saved?user=paul&name=poem


Hi Paul - that project looks familiar!  :)

If you sort by Title/Ascending, you get partially sorted results,  
but it seems to be using a random word to sort on instead of sorting  
on the entire title.


I'm not sure what info would be useful to help debug. In my  
schema.xml file, I've clipped what seems to be the relevant part:


[schema.xml excerpt stripped by the archive; surviving fragments: positionIncrementGap="100"> ... multiValued="true"/>]


I'm surprised you're not seeing an exception when trying to sort on
title given this configuration.  Sorting must be done on single-valued
indexed fields that have at most a single term indexed per document.
I recommend you use copyField to copy title to title_sort and
configure title_sort as a "string" field or as a field type that
analyzes to only a single term (for example, keyword tokenizing ->
lower case filter).


Erik
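
A toy model of the symptom described in this message, for readers who want to see why a tokenized field appears to sort on "a random word". It assumes that a sort on a tokenized field ends up keyed on just one token per document. Which token that is depends on Lucene internals not described in the thread, so taking the greatest token here is an arbitrary assumption, and the titles are made-up sample data.

```python
def tokenize(title):
    """Very rough stand-in for a whitespace tokenizer plus lowercasing."""
    return title.lower().split()


titles = [
    "Ode to a Nightingale",
    "The Raven",
    "A Dream Within a Dream",
]

# Toy model (assumption): each document is sorted by a single token,
# here arbitrarily its greatest one.
by_single_token = sorted(titles, key=lambda t: max(tokenize(t)))

# A single-term sort field sorts on the whole lowercased title instead.
by_whole_title = sorted(titles, key=str.lower)

print(by_single_token)
print(by_whole_title)
```

The first ordering looks only partially sorted from a reader's point of view, while the second is the ordering a `title_sort` field analyzed to a single term would give.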