Re: ICU incorporation and string changes heads-up

Leopold Toetsch Sat, 10 Apr 2004 04:01:11 -0700

Jeff Clites <[EMAIL PROTECTED]> wrote:
> On Apr 10, 2004, at 1:12 AM, Leopold Toetsch wrote:


>>    use German;
>>    print uc("i");
>>    use Turkish;
>>    print uc("i");

> Perfect example. The string "i" is the same in each case. What you've
> done is implicitly supplied a locale argument to the uc()
> operation--it's just a hidden form of:

>       uc(string, locale);

Ok. Now when the identical string "i" (but originating from different
locale environments) goes through a sequence of string operations later,
how do you track the locale down to the final C<uc> where it's needed?

e.g.

    use German;
    my $gi = "i";
    use Turkish;
    my $ti = "i";

    my $s = $gi x 10;
    ...
    print uc($s);       # locale is what?

Where do you track the locale, if not in the string itself?
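
To make the question concrete, here is a minimal Python sketch of the "locale is a parameter" model. Python is only a stand-in; the helper `uc`, the `TURKIC` set and the locale codes are assumptions for illustration, not Parrot or ICU API. The two "i" strings compare equal, so whatever locale applies has to be supplied by the caller at the final uc.

```python
# Hypothetical helper: the locale is an explicit argument of the operation,
# not an attribute stored in the string.
TURKIC = {"tr", "az"}  # assumed locale codes with dotted/dotless i casing


def uc(s, locale):
    if locale in TURKIC:
        # Turkish casing: small dotted i maps to capital dotted I (U+0130).
        s = s.replace("i", "\u0130")
    return s.upper()


gi = "i"      # created under "use German"
ti = "i"      # created under "use Turkish"
s = gi * 10   # later string operations carry no locale along

print(gi == ti)      # the strings themselves are indistinguishable
print(uc(s, "de"))   # caller decides: German rules
print(uc(s, "tr"))   # caller decides: Turkish rules
```

In this model nothing in `s` records where it came from; the locale in effect at the call site is the only one available.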

> The important thing is that the locale is a parameter to the operation,
> not an attribute of the string.

If that works ...

> Hmm? The point is that if you have a list of strings, for instance some
> in English, some in Greek, and some in Japanese, and you want to sort
> them, then you have to pick a sort ordering.

Ok. I want to uppercase the strings - no sorting (yet). I have an array of
Vienna's kebab booths. At least half of these have Turkish names; the
rest are a mixture of other languages. I'd like to uppercase this array
of names. How do I do it?

> one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
> two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";

> one eq two  //false--they're different strings
> normalizeFormD(one) eq normalizeFormD(two)  //true

Sure. But if I want to compare "letters": one eq two. I think this is
the normal case the user of Unicode wants or expects. On the surface it
doesn't matter if the internal representation is different.

OTOH normalizing all strings on input is not possible - what if they
should go into a file in unnormalized form?
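
The two comparisons quoted above can be tried out with Python's standard `unicodedata` module. This is a sketch only: `same_letters` is a hypothetical helper name. It normalizes just for the comparison, so the original strings stay untouched and could still be written out unnormalized.

```python
import unicodedata

one = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
two = "\u00C1"    # LATIN CAPITAL LETTER A WITH ACUTE


def same_letters(a, b):
    # Compare in normalization form D without modifying the inputs.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(one == two)              # code point comparison: different strings
print(same_letters(one, two))  # comparison after normalization
print(len(one), len(two))      # originals keep their own representation
```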

> This is quite analogous to:

> three = "abc"
> four = "ABC"

No.

>> ,--[ Larry Wall ]--------------------------------------------------------
>> |  level 0   byte == character, "use bytes" basically
>> |  level 1   codepoint == character, what we seem to be aiming for, vaguely
>> |  level 2   grapheme == character, what the user usually wants
>> |  level 3   letter == character, what the "current language" wants
>> `------------------------------------------------------------------------

> Yes, and I'm boldly arguing that this is the wrong way to go, and I
> guarantee you that you can't find any other string or encoding library
> out there which takes an approach like that, or anyone asking for one.
> I'm eager for Larry to comment.

The design gods may speak up, yes.
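
For reference, levels 0 to 2 of the quoted table give different lengths for the same string. The Python sketch below is an illustration under assumptions: UTF-8 is chosen arbitrarily for the byte level, and the grapheme count is only a rough approximation that skips combining marks (real grapheme segmentation follows Unicode's text segmentation rules and is more involved). Level 3 depends on the current language and is left out.

```python
import unicodedata

s = "A\u0301"   # A followed by COMBINING ACUTE ACCENT

# level 0: bytes (assuming UTF-8 as the storage encoding)
level0 = len(s.encode("utf-8"))

# level 1: code points
level1 = len(s)

# level 2: graphemes, approximated by counting non-combining code points
level2 = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(level0, level1, level2)
```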

>> I can't imagine that. I've an ASCII string and want to convert it to
>> UTF8 and UTF16 and write it into a file. How do I do that?

> That's the mindset shift. You don't have an ASCII string. You have a
> string, which may have come from a file or a buffer representing a
> string using the ASCII encoding. It's the example from above, again:

> inputBuffer = read(inputHandle);
> string = string_make(inputBuffer, "ASCII");
> outputBuffer = encode(string, "UTF-16");
> write(outputHandle, outputBuffer);

Ok. I should have asked: how do I do that in PASM, of course.
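
For comparison only, the four quoted steps look like this in Python. This is not PASM and says nothing about which opcodes Parrot would offer; the in-memory buffers standing in for the file handles are assumptions.

```python
import io

# Stand-ins for inputHandle / outputHandle (assumed, in-memory for the sketch).
input_handle = io.BytesIO(b"hello")
output_handle = io.BytesIO()

input_buffer = input_handle.read()          # raw bytes, no encoding attached
string = input_buffer.decode("ascii")       # bytes + encoding name -> string
utf8_buffer = string.encode("utf-8")        # string -> bytes in UTF-8
utf16_buffer = string.encode("utf-16")      # string -> bytes in UTF-16, with BOM

output_handle.write(utf8_buffer)
output_handle.write(utf16_buffer)

print(len(utf8_buffer), len(utf16_buffer))
```

The string in the middle carries no "ASCII" label; the encoding names appear only at the decode and encode steps.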

> JEff

leo
