Re: Turkish 'I's can't D either

Michel Fortin Tue, 25 Aug 2009 04:45:18 -0700

On 2009-08-25 00:23:25 -0400, Ali Cehreli <acehr...@yahoo.com> said:

You may be aware of the problems related to the consistency of the two separate letter 'I's in the Turkish alphabet (and the alphabets that are based on the Turkish alphabet).
Lowercase and uppercase versions of the two are consistent in whether they have a dot or not:
  http://en.wikipedia.org/wiki/Turkish_I
Turkish alphabet being in a position so close to the western alphabets, but not close enough, puts it in a strange position. (Strangely, the same applies geographically, politically, socially, etc. as well... ;))
Computer systems *almost* work for Turkish, but not for those two letters.
I love the fact that D allows Unicode letters in the source code and that it natively supports Unicode. I cannot stress enough how important this is. That is the single biggest reason why I decided to finally write a programming tutorial. Thank you to all who proposed and implemented those features!
Back to the Turquois 'I's... What is a programmer to do who is writing programs that deal with Turkish letters?
a) Accept that Phobos too has this age-old behavior that is a result of premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 'A'))
b) Accept that the problem is unsolvable because the letter I has two minuscules, and the letter i has two majuscules anyway, and that the intent is not always clear
c) Accept Turkish alphabet as being pathological (merely for being in the minority!), and use a Turkish version of Phobos or some other library
d) Solve the problem with locale support
Is option d possible with today's systems? Whose responsibility is this anyway? OS? Language? Program? Something else?
Since alphanumerical ordering is also of interest, I think this has something to do with locales.
Is there a way for a program to work with Turkish letters and ensure that the following program produces the expected output of 'dotless i', 'I with dot', and 0?
import std.stdio;
import std.string;
import std.c.locale;
import std.uni;

void main()
{
    const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
    assert(result);

    writeln(toUniLower('I'));
    writeln(toUniUpper('i'));
    writeln(indexOf("I",
                    '\u0131',               // dotless i
                    (CaseSensitive).no));
}
This is a practical question. I really want to be able to work with Turkish... :)

Perhaps this could be of some inspiration. In Cocoa you can pass a locale argument to many string methods (unfortunately, not lowercaseString or uppercaseString) to get the desired result. For instance, the "rangeOfString:options:range:locale:" method can search for substrings case-insensitively, and its documentation specifically discusses the Turkish “ı” character under the locale parameter.


http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/rangeOfString:options:range:locale:

It's also interesting to see that when you search for ß in a web page using Safari, it also matches every instance of SS (whatever your locale). ß is a German character that becomes SS in uppercase.


- - -

What I'd like to see is a base class representing a locale. Then you can instantiate the locale you want (from a config file, by coding it directly, through bindings to system APIs, or a mix of all these) and use the locale. Something like:


        class Locale
        {
        immutable:
                string lowercase(string s);
                string uppercase(string s);

                int compare(string a, string b);

                // number & date formatting, etc.
        }

        immutable(Locale) systemLocale();            // get default system locale
        immutable(Locale) locale(string localeName); // get best matching locale

        void main()
        {
                // auto: the factory functions return immutable(Locale)
                auto turkish = locale("tr-TR");
                writeln(turkish.lowercase("I")); // writes "ı"
                writeln(turkish.uppercase("i")); // writes "İ"

                auto english = locale("en-US");
                writeln(english.lowercase("I")); // writes "i"
                writeln(english.uppercase("i")); // writes "I"

                writeln(systemLocale.lowercase("I")); // depends on user settings
                writeln(systemLocale.uppercase("i")); // depends on user settings
        }

This way you can work with many locales at once. And there's noreliance on a global state.
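
To make the idea a bit more concrete, here is a minimal sketch of such a class in present-day D. It is only an illustration under stated assumptions: std.uni's toLower/toUpper serve as the locale-neutral default, only the two Turkish 'I' pairs are special-cased, and the TurkishLocale subclass and the hard-coded name lookup in locale() are hypothetical stand-ins for real locale data. The immutable qualifiers, compare() and systemLocale() are left out to keep it short.

```d
import std.array : replace;
import std.stdio : writeln;
import std.uni : toLower, toUpper;

// Hypothetical base class: locale-neutral Unicode case mapping.
class Locale
{
    string lowercase(string s) const { return toLower(s); }
    string uppercase(string s) const { return toUpper(s); }
}

// Hypothetical Turkish locale: remap the two 'I' pairs first,
// then let the default mapping handle every other character.
class TurkishLocale : Locale
{
    override string lowercase(string s) const
    {
        // 'İ' (U+0130) -> 'i', then 'I' -> 'ı' (U+0131)
        return super.lowercase(s.replace("İ", "i").replace("I", "ı"));
    }

    override string uppercase(string s) const
    {
        // 'ı' (U+0131) -> 'I', then 'i' -> 'İ' (U+0130)
        return super.uppercase(s.replace("ı", "I").replace("i", "İ"));
    }
}

// Hypothetical lookup: a real one would consult a config file or system APIs.
Locale locale(string localeName)
{
    if (localeName == "tr-TR")
        return new TurkishLocale;
    return new Locale;
}

void main()
{
    auto turkish = locale("tr-TR");
    writeln(turkish.lowercase("I")); // writes "ı"
    writeln(turkish.uppercase("i")); // writes "İ"

    auto english = locale("en-US");
    writeln(english.lowercase("I")); // writes "i"
    writeln(english.uppercase("i")); // writes "I"
}
```

Both locale objects live side by side here, and nothing like setlocale() is called, which is the point about not relying on global state.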



--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
