On 2009-08-25 00:23:25 -0400, Ali Cehreli <acehr...@yahoo.com> said:

You may be aware of the problems related to the consistency of the two separate letter 'I's in the Turkish alphabet (and the alphabets that are based on the Turkish alphabet).

Lowercase and uppercase versions of the two are consistent in whether they have a dot or not:

  http://en.wikipedia.org/wiki/Turkish_I

The Turkish alphabet is close to the western alphabets, but not close enough, which puts it in a strange position. (Strangely, the same applies geographically, politically, socially, etc. as well... ;))

Computer systems *almost* work for Turkish, but not for those two letters.

I love the fact that D allows Unicode letters in the source code and that it natively supports Unicode. I cannot stress enough how important this is. That is the single biggest reason why I decided to finally write a programming tutorial. Thank you to all who proposed and implemented those features!

Back to the Turquois 'I's... What is a programmer to do when writing programs that deal with Turkish letters?

a) Accept that Phobos too has this age-old behavior that is a result of premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 'A'))

b) Accept that the problem is unsolvable because the letter I has two minuscules, and the letter i has two majuscules anyway, and that the intent is not always clear

c) Accept the Turkish alphabet as being pathological (merely for being in the minority!), and use a Turkish version of Phobos or some other library

d) Solve the problem with locale support
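
To make option (a) concrete, here is a minimal sketch of the ASCII-offset idiom. The asciiLower helper is hypothetical and only mirrors the expression quoted in (a); it is not the actual Phobos source. It shows that plain offset arithmetic can only ever pair 'I' with the dotted 'i', whatever the locale:

```d
import std.stdio : writeln;

// Hypothetical helper mirroring the ASCII-offset idiom from option (a);
// an illustration only, not the actual Phobos implementation.
dchar asciiLower(dchar c)
{
    if (c >= 'A' && c <= 'Z')
        return cast(dchar)(c + ('a' - 'A'));
    return c; // everything outside A..Z is returned unchanged
}

void main()
{
    // 'I' (U+0049) always becomes dotted 'i' (U+0069), never 'ı' (U+0131).
    assert(asciiLower('I') == 'i');
    assert(asciiLower('I') != '\u0131');

    // 'İ' (U+0130) is outside A..Z, so it is not lowercased at all.
    assert(asciiLower('\u0130') == '\u0130');

    writeln("ok");
}
```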

Is option d possible with today's systems? Whose responsibility is this anyway? OS? Language? Program? Something else?

Since alphanumerical ordering is also of interest, I think this has something to do with locales.

Is there a way for a program to work with Turkish letters and ensure that the following program produces the expected output of 'dotless i', 'I with dot', and 0?

import std.stdio;
import std.string;
import std.c.locale;
import std.uni;

void main()
{
    const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
    assert(result);

    writeln(toUniLower('I'));
    writeln(toUniUpper('i'));
    writeln(indexOf("I",
                    '\u0131',               // dotless i
                    (CaseSensitive).no));
}

This is a practical question. I really want to be able to work with Turkish... :)

Perhaps this could be of some inspiration. In Cocoa you can pass a locale argument to many string methods (unfortunately, not lowercaseString or uppercaseString) to get the desired result. For instance, the "rangeOfString:options:range:locale:" method can search for substrings case-insensitively, and its documentation specifically discusses the Turkish “ı” character under the locale parameter.

http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/rangeOfString:options:range:locale:

It's also interesting to see that when you search for ß in a webpage using Safari, it also matches every instance of SS (whatever your locale). ß is a German character that becomes SS in uppercase.

- - -

What I'd like to see is a base class representing a locale. Then you can instantiate the locale you want (from a config file, by coding it directly, having bindings to system APIs, or a mix of all this) and use the locale. Something like:

        class Locale
        {
        immutable:
                string lowercase(string s);
                string uppercase(string s);

                int compare(string a, string b);

                // number & date formatting, etc.
        }

        immutable(Locale) systemLocale();            // get default system locale
        immutable(Locale) locale(string localeName); // get best matching locale

        void main()
        {
                // auto, since locale() returns immutable(Locale)
                auto turkish = locale("tr-TR");
                writeln(turkish.lowercase("I")); // writes "ı"
                writeln(turkish.uppercase("i")); // writes "İ"

                auto english = locale("en-US");
                writeln(english.lowercase("I")); // writes "i"
                writeln(english.uppercase("i")); // writes "I"

                writeln(systemLocale.lowercase("I")); // depends on user settings
                writeln(systemLocale.uppercase("i")); // depends on user settings
        }

This way you can work with many locales at once. And there's no reliance on a global state.
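
As a rough illustration, here is a minimal sketch of how the case-mapping part might be filled in. It assumes simple one-character-to-one-character mapping through std.uni's toLower and toUpper, so it does not handle expansions such as ß to SS. TurkishLocale, localeFor, lowerChar and upperChar are hypothetical names chosen for this example, and the locale lookup is reduced to a single string comparison:

```d
import std.array : appender;
import std.stdio : writeln;
import std.uni : toLower, toUpper;

// Hypothetical sketch: each locale object carries its own case rules,
// so nothing depends on a global setlocale() state.
class Locale
{
    string lowercase(string s) const
    {
        auto result = appender!string();
        foreach (dchar c; s)          // decode UTF-8 one code point at a time
            result.put(lowerChar(c)); // re-encoded as UTF-8 by the appender
        return result.data;
    }

    string uppercase(string s) const
    {
        auto result = appender!string();
        foreach (dchar c; s)
            result.put(upperChar(c));
        return result.data;
    }

    // Default rules: locale-independent Unicode simple case mapping.
    protected dchar lowerChar(dchar c) const { return toLower(c); }
    protected dchar upperChar(dchar c) const { return toUpper(c); }
}

// Turkish pairs I with ı (U+0131) and İ (U+0130) with i.
class TurkishLocale : Locale
{
    protected override dchar lowerChar(dchar c) const
    {
        if (c == 'I') return '\u0131';
        if (c == '\u0130') return 'i';
        return super.lowerChar(c);
    }

    protected override dchar upperChar(dchar c) const
    {
        if (c == 'i') return '\u0130';
        if (c == '\u0131') return 'I';
        return super.upperChar(c);
    }
}

// Hypothetical lookup; a real one would consult system APIs or config files.
const(Locale) localeFor(string name)
{
    if (name == "tr-TR")
        return new TurkishLocale;
    return new Locale;
}

void main()
{
    auto turkish = localeFor("tr-TR");
    auto english = localeFor("en-US");

    assert(turkish.lowercase("I") == "\u0131");
    assert(turkish.uppercase("i") == "\u0130");
    assert(english.lowercase("I") == "i");
    assert(english.uppercase("i") == "I");

    writeln(turkish.uppercase("istanbul")); // İSTANBUL
}
```

Both locale objects live side by side in the same program, which is the point: the Turkish rules apply only where the Turkish locale is explicitly used.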


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
