On Mon, Sep 27, 2010 at 2:59 AM, Noah Diewald <noah.diew...@gmail.com> wrote:
> On Sun, Sep 26, 2010 at 7:43 PM, Paul Davis <paul.joseph.da...@gmail.com>
> wrote:
>> On Sun, Sep 26, 2010 at 8:37 PM, Noah Diewald <noah.diew...@gmail.com> wrote:
>>> On Sat, Sep 25, 2010 at 6:38 PM, Paul Davis <paul.joseph.da...@gmail.com>
>>> wrote:
>>>> On Sat, Sep 25, 2010 at 7:21 PM, Chris Anderson <jch...@apache.org> wrote:
>>>>> On Sat, Sep 18, 2010 at 4:47 PM, Noah Diewald <noah.diew...@gmail.com>
>>>>> wrote:
>>>>>> I was wondering if there were any plans to make use of more of the ICU
>>>>>> collation API in CouchDB.
>>>>>>
>>>>>> I'm using CouchDB to make natural language documentation software and
>>>>>> it seems like a shame that I might have to use ICU for creating sort
>>>>>> keys to get sort orders right for view keys in certain languages when
>>>>>> ICU is already used internally by CouchDB. It kind of looks like
>>>>>> something could be added in at about the same place as the option for
>>>>>> case or no case collations in couch_icu_driver.c but I feel
>>>>>> underqualified to play around with it. I think that having an option in
>>>>>> the view to specify collation customization would be really great and it
>>>>>> must be something that even people working with less obscure languages
>>>>>> than I am could benefit from.
>>>>>>
>>>>>
>>>>> We definitely plan to make this configurable, just a matter of writing
>>>>> code. For now there might be a way to set it on a per-server-instance
>>>>> basis with environment variables. I am no expert on the topic, but I
>>>>> vaguely recall someone mentioning this possibility.
>>>>>
>>>>> Chris
>>>>>
>>>>>> --
>>>>>> Noah Diewald
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chris Anderson
>>>>> http://jchrisa.net
>>>>> http://couch.io
>>>>>
>>>>
>>>> I'm pretty sure that Chris is right that there's a server-wide
>>>> environment setting that affects ICU collation, but I can't say with
>>>> any certainty.
>>>>
>>>> It's always been on the to-do list to provide the ability to have
>>>> language-based sorts that are defined at the view or database level,
>>>> but as Chris points out, no one's gotten around to doing that.
>>>> Currently the major issues would revolve around recoding the
>>>> icu_driver to have smarts in how it's created, as well as refactoring
>>>> how we access the driver.
>>>>
>>>> If we bumped our minimum Erlang VM version to R13, writing this as a
>>>> NIF would probably be orders of magnitude easier because of resource
>>>> types and what not.
>>>>
>>>> Once those hard parts are figured out, exposing it to the outside
>>>> world should be as easy as going through the bikeshedding motions on
>>>> what the _design/doc syntax would look like.
>>>>
>>>> HTH,
>>>> Paul Davis
>>>>
>>>
>>> It is great to know that this type of thing is on the todo list. If
>>> custom rules were supported and not just predefined locales, some of
>>> the questionable NIFs I'm writing to make sort keys in my application
>>> layer could be removed some day and life would be simpler.
>>>
>>> I don't think that the environment variables help me personally with
>>> supporting multiple languages with different sort orders, especially
>>> since the collation customizations for two of the languages that I'm
>>> focusing on require custom rules. It would be really awesome if
>>> CouchDB supported ICU custom collation rules in views right out of the
>>> box. It might go a long way to making CouchDB a favorite with
>>> linguists. (CouchDB should be a favorite with linguists anyway because
>>> it is such a pleasure to use but this could make it extra favorite.)
>>>
>>> Thank you both for the replies.
>>>
>>> --
>>> Noah Diewald
>>>
>>
>> I'm not sure what you mean by custom rules. I'm not extremely familiar
>> with the collation API, but as I recall it had a thing that allowed a
>> user to pass a string-based config to it that it would use to affect
>> the collation algorithm. Are you needing something beyond that?
>>
>> Paul Davis
>>
>
> I don't think I'm needing anything more if we're talking about the
> same thing but maybe we're not.
>
> Sorry about the "customization rule" stuff. Now that I look back, the
> ICU documentation consistently calls them tailoring rules, sorry to be
> unclear. I'm just learning this stuff.
>
> Here is my understanding of instantiating ICU collators just to see if
> we are on the same page.
>
> There are two ways of instantiating collators. The predefined
> collators are instantiated with locale strings like "en_US". Custom
> collators are instantiated using tailoring rules.[1]
>
> The ICU User Guide says that a tailoring rule "overrides the default
> order of code points and the values of the ICU Collation Service
> attributes",[2] which seems like a strange definition because
> tailoring allows one to specify complex base letters that consist of
> more than one code point. UTS 10 says "Tailoring is any well-defined
> syntax that takes the Default Unicode Collation Element Table and
> produces another well-formed Unicode Collation Element Table."[3] In
> ICU a tailoring rule is a string that looks like this:
>
> "& C < č <<< Č < ć <<< Ć"
>
> So a string is used for configuration in both cases of collator
> instantiation but a different API function is used to instantiate the
> collator depending on whether one is using a predefined collator or a
> tailoring rule. Any way of instantiating an ICU collator other than
> passing in an empty string or "root" as the locale may or may not
> result in a custom UCET derived from the DUCET so it was not a good
> idea to just talk about customization since that is vague.
>
> I'm dealing with languages that require tailoring and it is likely
> that most people wouldn't need tailoring just to be able to use a
> specific language for a specific view and that specifying a locale
> would be just fine. On the other hand, tailoring is very powerful and
> could be used to customize collation for reasons other than matching
> the alphabet of a rare language.
>
> Another aspect of what I need is that I specifically need different
> collation algorithms for different views. In one case I'll want to
> sort by English, in another I'll want to sort by Potawatomi or
> Menominee or something else.
>
> 1. http://userguide.icu-project.org/collation/api
> 2. http://userguide.icu-project.org/collation/customization
> 3. http://www.unicode.org/reports/tr10/
>
> --
> Noah Diewald
>
Cool. I'm not concerned about a small API difference in which function
gets selected. I was just concerned for a bit that you were doing things
like passing function pointers to an API, which would increase the
overhead by a couple of orders of magnitude. My earlier characterization
of the level of difficulty still seems about right.

Paul Davis
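
For anyone following the thread, here is a rough illustration of what the
tailoring rule quoted above ("& C < č <<< Č < ć <<< Ć") asks a collator to
do. It is a toy in plain Python, not ICU and not CouchDB code. In ICU's C
API the two construction paths Noah describes are ucol_open (locale string)
and ucol_openRules (tailoring rule); the sketch below instead hard-codes that
one rule into a lookup table. The names build_table and sort_key are made up
for the example, only unaccented a-z plus č and ć are handled, and only two
comparison levels are modelled: letter order first, case as a tie-breaker.

```python
import string

C_CARON = "\u010d"  # č
C_ACUTE = "\u0107"  # ć


def build_table():
    """Map each letter to a (primary, tertiary) weight pair.

    Hypothetical helper: it hard-codes the effect of the rule
    "& C < č <<< Č < ć <<< Ć" instead of parsing rule syntax.
    """
    letters = list(string.ascii_lowercase)
    # "<" is a primary difference: č and ć become letters of their
    # own, placed right after c and before d.
    after_c = letters.index("c") + 1
    letters[after_c:after_c] = [C_CARON, C_ACUTE]

    table = {}
    for primary, lower in enumerate(letters):
        # "<<<" is a tertiary difference: upper case shares the primary
        # weight of lower case and only breaks ties after it.
        table[lower] = (primary, 0)
        table[lower.upper()] = (primary, 1)
    return table


TABLE = build_table()


def sort_key(word, table=TABLE):
    """Build a two-level key: all primary weights, then all tertiary ones.

    Raises KeyError for characters outside the toy alphabet.
    """
    weights = [table[ch] for ch in word]
    primaries = tuple(p for p, _ in weights)
    tertiaries = tuple(t for _, t in weights)
    # Comparing primaries for the whole word before any tertiary weight
    # means a case difference never outranks a letter difference.
    return (primaries, tertiaries)


if __name__ == "__main__":
    words = ["\u0107ao", "cub", "\u010cas", "dan", "\u010das", "cat"]
    print(sorted(words))                # plain code point order
    print(sorted(words, key=sort_key))  # order under the toy tailoring
```

The keys here are Python tuples. A real ICU sort key (from ucol_getSortKey)
is a byte string that compares correctly with a plain bytewise comparison,
which is the property that makes precomputed sort keys usable as view keys
in the application-layer workaround described earlier in the thread.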