Localized alphabetical order

2011-04-22 Thread Ben Preece
As someone who's new to Solr/Lucene, I'm having trouble finding 
information on sorting results in localized alphabetical order. I've 
searched the wiki and the mail archives without success.


I'm thinking for example about Hawaiian, where mīka (with an i-macron) 
comes after mika (i without the macron) but before miki (also without 
the macron), or about Welsh, where the digraphs (ch, dd, etc.) are 
treated as single letters, or about Ojibwe, where the apostrophe ' is a 
letter which sorts between h and i.


How do non-English languages typically handle this?

-Ben


Re: Localized alphabetical order

2011-04-22 Thread Robert Muir
Please see http://wiki.apache.org/solr/UnicodeCollation

In general the idea is similar to how this is handled in databases:
you index collation keys into a sort field at analysis time, then
you just do a standard Solr sort on that field.

However, I am not sure if your JRE provides a haw Locale for the
Hawaiian language.

Because of this, it's probably better to use the ICU collation
integration 
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUCollationKeyFilterFactory),
because ICU definitely supports this locale and has collation rules
for it.
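
To make the collation-key idea concrete, here is a minimal sketch using the JDK's own java.text.Collator rather than Solr or ICU. The class and method names are made up for illustration, and Locale.ROOT is only a placeholder for whatever locale you actually need. Solr does the equivalent at index time and keeps the key in the sort field.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Solr: sorts strings by their collation keys.
class CollationKeySortSketch {

    static List<String> sortByCollationKey(List<String> values, Collator collator) {
        // Compute one key per value up front; an index-time filter does this step once per document.
        List<CollationKey> keys = new ArrayList<>();
        for (String value : values) {
            keys.add(collator.getCollationKey(value));
        }
        // CollationKey is Comparable; comparing keys agrees with collator.compare on the source strings.
        Collections.sort(keys);
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Assumption: the root locale stands in for a real language-specific locale.
        Collator collator = Collator.getInstance(Locale.ROOT);
        List<String> input = Arrays.asList("cherry", "Banana", "apple");
        System.out.println(sortByCollationKey(input, collator));
    }
}
```

The point of precomputing keys is that the expensive, locale-aware work happens once per value, and every later comparison is a cheap key comparison.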



Re: Localized alphabetical order

2011-04-22 Thread Peter Keegan
On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece preec...@umn.edu wrote:

 As someone who's new to Solr/Lucene, I'm having trouble finding information
 on sorting results in localized alphabetical order. I've ineffectively
 searched the wiki and the mail archives.

 I'm thinking for example about Hawai'ian, where mīka (with an i-macron)
 comes after mika (i without the macron) but before miki (also without the
 macron), or about Welsh, where the digraphs (ch, dd, etc.) are treated as
 single letters, or about Ojibwe, where the apostrophe ' is a letter which
 sorts between h and i.

 How do non-English languages typically handle this?

 -Ben



Re: Localized alphabetical order

2011-04-22 Thread Bently Preece
Thank you.  This looks like the right direction.

I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of
ICUCollationField.  So ... I'd implement a subclass of ICUCollationField,
and use that as the fieldtype in schema.xml.  And this means - what? - that
I'd also implement a custom SortField to be returned by
MyCollationField.getSortField(...), which would also require me to write a
custom FieldComparator?  Am I on the right track?

Do you know an example of another language which has already done this sort
of thing?

Really, thanks for your help.

-Ben




Re: Localized alphabetical order

2011-04-22 Thread Robert Muir
On Fri, Apr 22, 2011 at 2:37 PM, Bently Preece preec...@umn.edu wrote:
 Thank you.  This looks like the right direction.

 I see the docs say ICUCollationKeyFilterFactory is deprecated in favor of
 ICUCollationField.  So ... I'd implement a subclass of ICUCollationField,
 and use that as the fieldtype in schema.xml.  And this means - what? - that
 I'd also implement a custom SortField to be returned by
 MyCollationField.getSortField(...), which would also require me to write a
 custom FieldComparator?  Am I on the right track?

no, you don't have to write any code in either case:

solr 3.1:

<fieldType name="sort_haw" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUCollationKeyFilterFactory" locale="haw" strength="secondary"/>
  </analyzer>
</fieldType>

solr 4.0:

<fieldtype name="sort_haw" class="solr.ICUCollationField" locale="haw" strength="secondary"/>

then just copyField or whatever to get your data in there.
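
For anyone wondering what the strength setting in those field types controls, the following is a rough sketch with the JDK's java.text.Collator, not with ICU or Solr. The class and method names are hypothetical, and the root locale is only an assumption for the demo. ICU uses the same notion of comparison levels, though its details for a given locale may differ.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class: shows which differences each collation strength ignores.
class StrengthSketch {

    // Returns true when the two strings compare as equal at the given strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT); // assumption: root locale
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "a";
        String accented = "\u00e4"; // a with diaeresis, written as an escape to avoid encoding issues
        String upper = "A";

        // PRIMARY looks at base letters only.
        System.out.println("primary, accent ignored:   " + sameAt(Collator.PRIMARY, plain, accented));
        // SECONDARY also distinguishes accents but still ignores case.
        System.out.println("secondary, accent ignored: " + sameAt(Collator.SECONDARY, plain, accented));
        System.out.println("secondary, case ignored:   " + sameAt(Collator.SECONDARY, plain, upper));
        // TERTIARY distinguishes case as well.
        System.out.println("tertiary, case ignored:    " + sameAt(Collator.TERTIARY, plain, upper));
    }
}
```

Read this way, secondary strength is a plausible choice for a sort field where accented and unaccented forms should order differently but upper and lower case should not.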


Re: Localized alphabetical order

2011-04-22 Thread Bently Preece
What if there is no standard localization already?  The case I'm
specifically interested in is Ojibwe.

So should I really be researching how the JRE does localization instead of
Solr?





Re: Localized alphabetical order

2011-04-22 Thread Robert Muir
On Fri, Apr 22, 2011 at 3:09 PM, Bently Preece preec...@umn.edu wrote:
 What if there is no standard localization already?  The case I'm
 specifically interested in is Ojibwe.


this is standard? To sort a field with a specific locale, you have to
tell it the locale you want. If you use the ICU implementation you get
support for more locales; it's just that simple. The JRE has fewer
available locales because its internationalization and localization
support lags behind ICU.

ICU, on the other hand, keeps current with both the Unicode standard and
the locale data in CLDR (http://unicode.org/cldr), which is why it
supports more.

I noticed there is no locale for your language in CLDR, not even under
development it appears (http://unicode.org/cldr/apps/survey).

So if your language (Ojibwe) has special sort rules, I recommend
making the collation rules and using a custom collator as specified
here: 
http://wiki.apache.org/solr/UnicodeCollation#Sorting_text_with_custom_rules

For your base collator you just need to use new Locale("") (the root
locale), and your rules will be a delta from that.

Separately, if these sort rules are well-defined/standardized for this
language, and you get them working, you might want to then consider
contributing them to CLDR.
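
As a rough illustration of the "delta from a base collator" idea, here is a sketch using the JDK's java.text.RuleBasedCollator instead of the ICU class the wiki page uses. The class and method names are hypothetical, the single rule covers only the apostrophe case mentioned in this thread, and a real Ojibwe tailoring would need rules worked out with language experts. ICU's rule syntax is similar but not identical, so treat the rule string as an assumption that applies to the JDK only.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch: tailors the root collator so the apostrophe sorts as a letter after h.
class ApostropheLetterSketch {

    static RuleBasedCollator apostropheAfterH() {
        // Base collator for the root locale (Locale.ROOT is the same locale as new Locale("")).
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.ROOT);
        // Delta rule: reset at uppercase H so both h and H stay before the new letter,
        // then add the apostrophe as a primary difference. In java.text rule syntax,
        // a quoted literal apostrophe is written as three single quotes.
        String delta = "& H < '''";
        try {
            return new RuleBasedCollator(base.getRules() + delta);
        } catch (ParseException e) {
            throw new IllegalStateException("could not parse tailored rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = apostropheAfterH();
        String[] letters = {"i", "'", "h"};
        Arrays.sort(letters, collator); // Collator implements Comparator
        System.out.println(Arrays.toString(letters));
    }
}
```

Appending the delta to the base rules keeps everything else in the default order, so only the tailored character moves.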


Re: Localized alphabetical order

2011-04-22 Thread Bently Preece
Thanks.  I get it now.

I meet with our language experts again on Monday.  I'll ask them about
submitting localization info to the CLDR.

Thanks again.

-Ben
