Localized alphabetical order
As someone who's new to Solr/Lucene, I'm having trouble finding information on sorting results in localized alphabetical order. I've ineffectively searched the wiki and the mail archives. I'm thinking for example about Hawai'ian, where mīka (with an i-macron) comes after mika (i without the macron) but before miki (also without the macron), or about Welsh, where the digraphs (ch, dd, etc.) are treated as single letters, or about Ojibwe, where the apostrophe ' is a letter which sorts between h and i. How do non-English languages typically handle this? -Ben
Re: Localized alphabetical order
Please see http://wiki.apache.org/solr/UnicodeCollation. In general, the idea is similar to how this is handled in databases: you index collation keys into a sort field at analysis time, then do a standard Solr sort on that field. However, I am not sure whether your JRE provides a haw Locale for the Hawaiian language. Because of this, it's probably better to use the ICU collation integration (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUCollationKeyFilterFactory), because ICU definitely supports this locale and has collation rules for it.
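To make the collation-key idea concrete, here is a minimal sketch in plain Java. It is not what Solr runs internally: it assumes the JDK's own java.text.Collator with the root locale as a stand-in for an ICU Hawaiian collator, and the class and method names are made up for illustration. It only shows that sorting by collation keys, rather than by raw code points, should place mīka between mika and miki.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollationKeySortSketch {

    // Hypothetical helper: sorts words by collation key instead of by code point.
    static List<String> sortByCollationKey(List<String> words) {
        // Assumption: the JDK root collator stands in for a "haw" collator from ICU.
        Collator collator = Collator.getInstance(Locale.ROOT);
        // SECONDARY mirrors strength=secondary: accents matter, case does not.
        collator.setStrength(Collator.SECONDARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        // A collation key is a precomputed, directly comparable form of the string;
        // this is the kind of value that gets indexed into a dedicated sort field.
        List<CollationKey> keys = new ArrayList<>();
        for (String word : words) {
            keys.add(collator.getCollationKey(word));
        }
        Collections.sort(keys); // CollationKey implements Comparable<CollationKey>

        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("miki");
        words.add("m\u012Bka"); // mīka, with i-macron
        words.add("mika");

        List<String> byCodePoint = new ArrayList<>(words);
        Collections.sort(byCodePoint); // plain String order, for comparison

        System.out.println("code point order:    " + byCodePoint);
        System.out.println("collation key order: " + sortByCollationKey(words));
    }
}
```

In Solr the key computation happens at index time in the analysis chain, so the query-time sort is an ordinary sort on the key field.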
Re: Localized alphabetical order
Thank you. This looks like the right direction. I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of ICUCollationField. So ... I'd implement a subclass of ICUCollationField, and use that as the fieldtype in schema.xml. And this means - what? - that I'd also implement a custom SortField to be returned by MyCollationField.getSortField(...), which would also require me to write a custom FieldComparator? Am I on the right track? Do you know an example of another language which has already done this sort of thing? Really, thanks for your help. -Ben
Re: Localized alphabetical order
No, you don't have to write any code in either case.

Solr 3.1:

    <fieldType name="sort_haw" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ICUCollationKeyFilterFactory" locale="haw" strength="secondary"/>
      </analyzer>
    </fieldType>

Solr 4.0:

    <fieldtype name="sort_haw" class="solr.ICUCollationField" locale="haw" strength="secondary"/>

Then just copyField or whatever to get your data in there.
Re: Localized alphabetical order
What if there is no standard localization already? The case I'm specifically interested in is Ojibwe. So should I really be researching how the JRE does localization instead of Solr?
Re: Localized alphabetical order
This is the standard approach: to sort a field with a specific locale, you have to tell it which locale you want. If you use the ICU implementation you get support for more locales; it's just that simple. The JRE has fewer available locales because its internationalization and localization support lags behind ICU. ICU, on the other hand, keeps current with both the Unicode Standard and the locale data in CLDR (http://unicode.org/cldr), which is why it supports more. I noticed there is no locale for your language in CLDR, and it does not appear to be under development either (http://unicode.org/cldr/apps/survey). So if your language (Ojibwe) has special sort rules, I recommend writing the collation rules yourself and using a custom collator as described here: http://wiki.apache.org/solr/UnicodeCollation#Sorting_text_with_custom_rules. For your base collator you just need the root locale (e.g. new Locale(""), since Locale has no no-argument constructor), and your rules will be a delta from that. Separately, if these sort rules are well-defined/standardized for this language, and you get them working, you might then want to consider contributing them to CLDR.
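As a rough illustration of "rules as a delta from a base collator", here is a small sketch. It makes several assumptions: it uses the JDK's java.text.RuleBasedCollator rather than the ICU class the wiki page describes, the class and method names are hypothetical, and the single rule shown is the Welsh-style digraph case from the original question (ch as one letter after c), not a real Ojibwe rule set. Actual Ojibwe rules would have to come from the language experts.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CustomRulesSketch {

    // Hypothetical tailoring: the digraph "ch" sorts as a single letter after "c".
    static final String EXTRA_RULES = "& c < ch";

    // Builds a collator whose rules are the base (root) rules plus the delta above.
    static Collator tailoredCollator() {
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.ROOT);
        try {
            return new RuleBasedCollator(base.getRules() + EXTRA_RULES);
        } catch (ParseException e) {
            // The rule string is a constant, so a parse failure is a programming error.
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    // Returns a sorted copy of the input using the given collator.
    static List<String> sortWith(Collator collator, List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("chwarae");
        words.add("dafad");
        words.add("cwm");

        System.out.println("base rules:     " + sortWith(Collator.getInstance(Locale.ROOT), words));
        System.out.println("tailored rules: " + sortWith(tailoredCollator(), words));
    }
}
```

With the base rules "chwarae" is compared letter by letter (c, h, ...), so it lands before "cwm"; with the tailoring, "ch" is one unit that follows "c", so "cwm" should come first.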
Re: Localized alphabetical order
Thanks. I get it now. I meet with our language experts again on Monday. I'll ask them about submitting localization info to the CLDR. Thanks again. -Ben