Re: strange sorting results: each word in field is sorted
On Aug 19, 2009, at 3:50 PM, Paul Rosen wrote:

>> I'm surprised you're not seeing an exception when trying to sort on title given this configuration. Sorting must be done on single-valued indexed fields that have at most a single term indexed per document. I recommend you use copyField to copy title to title_sort and configure a title_sort field as a "string" or a field type that analyzes only to a single term (like simply keyword tokenizing -> lower case filter).
>>
>> Erik
>
> I want to double check this (since you probably remember how long it takes to recreate the indexes). I think you're saying to add these two lines, then re-index:
>
> [schema.xml lines not preserved in the archive]

For the simplest case, yes. You do have to be careful that the sort field is not multiValued - and I believe the NINES model allowed for multiple titles. So it might be necessary for your indexing client to specify the single sort field value instead of leveraging copyField.

> Now, this is case-sensitive, right? So would this make it case-insensitive?

Yes, the above would be case sensitive.

> [schema.xml lines mostly lost in the archive; surviving fragments: sortMissingLast="true"> ... stored="true"/>]

That definition isn't quite right - you must have at least a tokenizer. The KeywordTokenizer "tokenizes" the entire string into a single token, though. In Solr's example schema there is a field type like this:

[schema.xml lines mostly lost in the archive; surviving fragment: sortMissingLast="true" omitNorms="true">]

> Also, I'm guessing from seeing the current results that this wouldn't collate the characters with diacritical marks correctly. Is there a way to indicate that, for instance, A-grave would sort next to A?

Yes, you can incorporate a diacritic-normalizing filter into the analyzer definition above: AsciiFoldingFilter or the ISO Latin1 one.

> And, while I'm on the subject, I have to do the same thing with the Author field, but unfortunately, that is sometimes "First Last" and sometimes "Last, First". Is there any way to sort those by last name, or do I just have to encourage the index people to be more consistent?

Good luck with getting consistency in your domain! :) But it certainly makes sense to request that from the data providers, in at least some form that can be turned into the sortable value.

> I can think of a fairly simple algorithm, but am not sure where to implement it:
>
> - if the word "and" or "&" appears, just look at the left side of the field (in other words, sort by the first name that appears).
> - if there is a comma, but it is part of ", jr." or some other common suffix like that, ignore it.
> - otherwise, if there is no comma, sort by the last word, unless it is "jr", "sr", "III", etc., then sort by the word before that.
> - otherwise, sort by the first word.

Probably best to implement that in the indexing client code, but simple transformations could be implemented using the PatternReplaceFilter like above.

Erik
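
For readers following along without the stripped snippets, here is a minimal sketch of a case-insensitive, diacritic-folding sort type of the kind discussed in this message. It is not the poster's actual configuration: the type name sortable_string and all attribute values are assumptions, and it presumes a Solr version that ships solr.ASCIIFoldingFilterFactory.

```xml
<!-- Hypothetical field type; the name "sortable_string" is an assumption. -->
<fieldType name="sortable_string" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Emit the whole field value as one token, so each document
         has a single term to sort on. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Fold accented characters to ASCII (for example A-grave to A).
         solr.ISOLatin1AccentFilterFactory is the older alternative. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Lower-case the token so the sort ignores case. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove leading and trailing whitespace. -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Single-valued sort field using the type above. -->
<field name="title_sort" type="sortable_string"
       indexed="true" stored="false" multiValued="false"/>
```

Because the analyzer only affects the indexed term, the original title field can stay as it is for searching and display; only the sort parameter would point at title_sort.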
Re: strange sorting results: each word in field is sorted
Erik Hatcher wrote:

> On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:
>
>> You can see the problem here (at least until it's fixed!):
>> http://nines.performantsoftware.com/search/saved?user=paul&name=poem
>
> Hi Paul - that project looks familiar! :)

Hi Erik! I should hope so! And I've gone a year without having to delve into Solr much since it has just plain worked.

Thanks for the speedy reply.

> I'm surprised you're not seeing an exception when trying to sort on title given this configuration. Sorting must be done on single-valued indexed fields that have at most a single term indexed per document. I recommend you use copyField to copy title to title_sort and configure a title_sort field as a "string" or a field type that analyzes only to a single term (like simply keyword tokenizing -> lower case filter).
>
> Erik

I want to double check this (since you probably remember how long it takes to recreate the indexes). I think you're saying to add these two lines, then re-index:

[schema.xml lines not preserved in the archive]

Now, this is case-sensitive, right? So would this make it case-insensitive?

[schema.xml lines not preserved in the archive]

Also, I'm guessing from seeing the current results that this wouldn't collate the characters with diacritical marks correctly. Is there a way to indicate that, for instance, A-grave would sort next to A?

And, while I'm on the subject, I have to do the same thing with the Author field, but unfortunately, that is sometimes "First Last" and sometimes "Last, First". Is there any way to sort those by last name, or do I just have to encourage the index people to be more consistent?

I can think of a fairly simple algorithm, but am not sure where to implement it:

- if the word "and" or "&" appears, just look at the left side of the field (in other words, sort by the first name that appears).
- if there is a comma, but it is part of ", jr." or some other common suffix like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it is "jr", "sr", "III", etc., then sort by the word before that.
- otherwise, sort by the first word.

That would get most of the cases.

Thanks,
Paul
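
Erik's reply suggests doing most of this in the indexing client and leaving only simple transformations to PatternReplaceFilter. As a rough, untested sketch of that simpler part, the fragment below handles only the first two bullets (keeping the first of several names and dropping a trailing suffix). The type name author_sort_key and both regular expressions are assumptions made for illustration, and the suffix list is deliberately short.

```xml
<!-- Hypothetical field type; the name and the regular expressions
     are assumptions, not configuration from the thread. -->
<fieldType name="author_sort_key" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keep the whole author string as one token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- When "and" or "&" joins several names, keep only the first name.
         The ampersand must be written as &amp; inside an XML attribute. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+(and|&amp;)\s+.*$" replacement="" replace="first"/>
    <!-- Drop a trailing suffix such as ", jr." or " iii". -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern=",?\s+(jr|sr|ii|iii|iv)\.?$" replacement="" replace="first"/>
  </analyzer>
</fieldType>
```

This sketch does not reorder "First Last" into "Last, First"; that step, and any longer suffix list, would still fall to the indexing client or the data providers.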
Re: strange sorting results: each word in field is sorted
On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:

> You can see the problem here (at least until it's fixed!):
> http://nines.performantsoftware.com/search/saved?user=paul&name=poem

Hi Paul - that project looks familiar! :)

> If you sort by Title/Ascending, you get partially sorted results, but it seems to be using a random word to sort on instead of sorting on the entire title.
>
> I'm not sure what info would be useful to help debug. In my schema.xml file, I've clipped what seems to be the relevant part:
>
> [schema.xml lines mostly lost in the archive; surviving fragments: positionIncrementGap="100"> ... multiValued="true"/>]

I'm surprised you're not seeing an exception when trying to sort on title given this configuration. Sorting must be done on single-valued indexed fields that have at most a single term indexed per document. I recommend you use copyField to copy title to title_sort and configure a title_sort field as a "string" or a field type that analyzes only to a single term (like simply keyword tokenizing -> lower case filter).

Erik
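
A minimal sketch of the copyField approach recommended here, assuming the source field is named title and that schema.xml defines the stock "string" type (solr.StrField) as Solr's example schema does. The title_sort name comes from the message; the attribute values are assumptions rather than the poster's configuration.

```xml
<!-- Hypothetical schema.xml additions; attribute values are assumptions. -->

<!-- Single-valued, untokenized field used only for sorting. -->
<field name="title_sort" type="string"
       indexed="true" stored="false" multiValued="false"/>

<!-- Copy the raw title into the sort field at index time. -->
<copyField source="title" dest="title_sort"/>
```

After re-indexing, queries would sort on title_sort (for example sort=title_sort asc) rather than on title. If title can hold several values per document, a single-valued sort field cannot take all of them, so the indexing client would need to supply one sort value itself.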