Re: Plans for string processing

Aaron Sherman Wed, 14 Apr 2004 12:21:46 -0700

On Tue, 2004-04-13 at 18:23, Leopold Toetsch wrote:
> Aaron Sherman <[EMAIL PROTECTED]> wrote:
> > For example, in Perl5/Ponie:
> 
> >         @names=<NAMES>;
> >         print "Phone Book: ", sort(@names), "\n";
> 
> > In this example, I don't see why I would care that NAMES might be a
> > pseudo-handle that iterates over several databases, and returns strings
> > in the 7 different languages
> 
> I already did show an example where uc("i") isn't "I". Collating is still
> more complex than a »simple« uc().


Correct. I agree, and I don't think anything I said contradicted that,
did it?

> Well, we don't know what the caller expects. The caller has to decide.
> There are basically at least two ways: Treat all strings language
> independent (from their origin) or append more information to each
> string.

Hmmm... or the third, and far more common approach in all languages that
I've seen that deal with these issues: deal with the comparison
according to the rules set out by the language in which the comparison
is being done. Why is that option being passed over? Is it considered to
be, in some way, identical to ignoring language distinctions? How?
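To make that third option concrete, here is a minimal sketch in which the strings carry no language tag and the comparing context supplies the rules. The `collation_key` and `sort_in_context` names, the "es-traditional" label and the tiny digraph table are all assumptions made up for illustration; none of this is Parrot's or Perl's API.

```python
# Toy collation table (hypothetical, for illustration only).
# Traditional Spanish ordering treats "ch" and "ll" as single letters
# that sort after "c" and "l" respectively.
SPANISH_DIGRAPHS = {"ch": "c\uffff", "ll": "l\uffff"}


def collation_key(text, language):
    """Return a sort key for `text` under the rules of `language`."""
    text = text.lower()
    if language == "es-traditional":
        key = []
        i = 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in SPANISH_DIGRAPHS:
                # The digraph becomes one collation element.
                key.append(SPANISH_DIGRAPHS[pair])
                i += 2
            else:
                key.append(text[i])
                i += 1
        return key
    # Fallback: plain code point order, one element per character.
    return list(text)


def sort_in_context(strings, language):
    """Sort untagged strings by the rules of the comparing context."""
    return sorted(strings, key=lambda s: collation_key(s, language))


names = ["luz", "llama", "lima"]
print(sort_in_context(names, "en"))              # ['lima', 'llama', 'luz']
print(sort_in_context(names, "es-traditional"))  # ['lima', 'luz', 'llama']
```

The same list sorts two ways, and which way is decided entirely by the caller, not by anything attached to the strings.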

> >> *) Provides language-sensitive character overrides ('ll' treated as a
> >> single character, for example, in Spanish if that's still desired)
> >> *) Provides language-sensitive grouping overrides.
> 
> > Ah, and here we come to my biggest point of confusion.
> 
> Another example:
> 
>  "my dog Fiffi" eq "my dog Fi\x{fb03}"
> 
> When my program is doing typographical computations, above equation is
> true. And useful. The characters "f", "f", "i" are goin' to be printed.
> But the ligature "ffi" takes less space when printed as such.
> This is the same character string, though, when I'm a reader of this dog
> newspaper.

Ok, so here you essentially say, "in the typographical context this
statement has one result, in a string data context it has another."
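As a rough illustration of "one comparison, two contexts", the sketch below uses Python's standard `unicodedata` module. The `strings_equal` helper and its "codepoint" and "folded" context names are assumptions invented here; the only real library call is `unicodedata.normalize`, whose NFKC form replaces U+FB03 with the three letters "ffi".

```python
import unicodedata


def strings_equal(a, b, context):
    """Compare two strings under an explicit, caller-chosen context."""
    if context == "folded":
        # NFKC folds compatibility characters such as U+FB03 into "ffi".
        a = unicodedata.normalize("NFKC", a)
        b = unicodedata.normalize("NFKC", b)
    elif context != "codepoint":
        raise ValueError(f"unknown comparison context: {context!r}")
    return a == b


plain = "my dog Fiffi"
ligature = "my dog Fi\ufb03"
print(strings_equal(plain, ligature, "codepoint"))  # False
print(strings_equal(plain, ligature, "folded"))     # True
```

Neither string needs to know anything about itself here; the answer changes only because the operation was asked a different question.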

So, why is that:

        "my dog Fiffi":language("blah") eq "my dog Fi\x{fb03}":language("blah")

and not

        use language "blah";
        "my dog Fiffi" eq "my dog Fi\x{fb03}"

and what in Parrot's name does

        "james":language("blah") eq "jim":language("bletch")

mean? Should "blah"'s language rules (in which "james" and "jim" are the
same name) or "bletch"'s language rules (in which they are not) take
priority? The comparison of two different languages would have to be
done in a third context of "culture" (e.g. "culture foo holds that
blah's rules for names are used and bletch's rules for everything else
are used except when a word in bletch was derived from a word used in
blah during the third invasion and swap meet of 1233").

Then, of course, we can get into how I feel about my program telling me
(in any context) that "ffi" and "\x{fb03}" are the same for any number
of reasons, not the least of which is that I consider such
representations to be markup, not text... but that's just me, and
perhaps I'll just have to put "use language 'ignorant American geek'" at
the start of all of my programs ;)

> > b) Strings will have different languages and behave according to their
> > "sources" regardless of the native rules of the user.

Again, I have never seen any source of information that suggests that
there is a universally known way to implement the above. Don't get me
started on the impact of going to southeast Asia and suggesting that
"ok, one of your languages' rules has to win when comparing characters of
differing languages"... ha! IMHO, the only thing that CAN be done at
such a low level as Parrot is to do the work according to the language
rules that govern the rest of this execution of the program, and if a
string makes no sense in that context, an exception is thrown.

But otherwise, how do you sort \x{6728} in Japanese vs Mandarin Chinese?
The two languages have different answers, and you HAVE to pick one.
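A toy sketch of that choice, with the exception behaviour included: the reading tables below are hypothetical and drastically simplified (toneless pinyin for Mandarin, one kun reading in kana for Japanese), and `sort_by_context` and `NoCollationRule` are names made up for this example. A real collator would rely on far larger data.

```python
# Hypothetical reading tables for three characters:
# U+6728 (tree), U+706B (fire), U+6C34 (water).
READINGS = {
    # Toneless pinyin: mu, huo, shui.
    "zh-pinyin": {"\u6728": "mu", "\u706b": "huo", "\u6c34": "shui"},
    # Kun readings in hiragana: ki, hi, mizu.
    "ja-kun": {
        "\u6728": "\u304d",
        "\u706b": "\u3072",
        "\u6c34": "\u307f\u305a",
    },
}


class NoCollationRule(Exception):
    """Raised when a string makes no sense in the active context."""


def sort_by_context(chars, context):
    """Sort characters by the reading the given context assigns them."""
    table = READINGS[context]

    def key(ch):
        if ch not in table:
            raise NoCollationRule(f"{ch!r} has no rule in {context}")
        return table[ch]

    return sorted(chars, key=key)


chars = ["\u6728", "\u706b", "\u6c34"]
print(sort_by_context(chars, "zh-pinyin"))  # fire, tree, water
print(sort_by_context(chars, "ja-kun"))     # tree, fire, water
```

U+6728 lands in a different position depending on which context does the sorting, and a character the context has no rule for raises instead of silently borrowing another language's answer.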

> >> IW: Mush together (either concatenate or substr replacement) two
> >> strings of different languages but same charset
> 
> > According to whose rules?
> 
> User level - what do you want to achieve. At codepoint level the
> operation is fine. It doesn't make sense above that, though.

So, you seem to be suggesting that a single language (that of the user,
not the 2+ involved if you tag every string) should decide? If so, why
have strings tagged with language?

> > This means that someone's rules must become dominant,
> 
> It doesn't make much sense to do
> 
>    bors S0, S1   # stringwise bit or
> 
> to anything that isn't singlebyte encoded. It depends.

Sorry, you lost me. Did I bring that up? I was asking if:

        $a cmp $b

would have a result in which $b was considered with respect to $a's
language or vice versa. Most commonly (always?) there is an incomplete
intersection of rules between the two, so someone's rules will have to
"win". So you have choices:

      * If you go with LHS vs. RHS, then sort gets borked because sort
        will reverse the "sides" repeatedly as it executes. The
        comparison is then no longer consistent, so the result can
        depend on input order, and some sort implementations may
        never terminate.
      * If you come up with a list of languages in descending order of
        dominance then there will be at least as many camps that
        disagree with you as the length of the list minus one.
      * If you try to architect a universal set of rules for applying
        all language rules to strings involving all other language
        rules, you will finish right about the time Esperanto takes over
        the world, and you will have conducted several world wars in the
        process.
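The first of these choices can be demonstrated with a small sketch. Everything here is an assumption for illustration: strings are modelled as hypothetical (text, language) pairs, `key_for` holds a single toy rule (traditional Spanish sorts "ll" after every other "l" combination), and `cmp_lhs_wins` lets the left operand's language decide.

```python
from functools import cmp_to_key


def key_for(text, language):
    # Toy rule (hypothetical): traditional Spanish sorts "ll" after "lz".
    if language == "es-traditional":
        return text.replace("ll", "l\uffff")
    return text


def cmp_lhs_wins(a, b):
    """Compare (text, language) pairs using the LEFT operand's rules."""
    (a_text, a_lang), (b_text, _) = a, b
    ka, kb = key_for(a_text, a_lang), key_for(b_text, a_lang)
    return (ka > kb) - (ka < kb)


llama = ("llama", "es-traditional")
luz = ("luz", "en")

print(cmp_lhs_wins(llama, luz))  # 1: "llama" sorts after "luz"
print(cmp_lhs_wins(luz, llama))  # 1: "luz" sorts after "llama"

# Sorting with such a comparator gives an order that depends on how the
# sort algorithm happens to pair up its operands.
print(sorted([llama, luz], key=cmp_to_key(cmp_lhs_wins)))
print(sorted([luz, llama], key=cmp_to_key(cmp_lhs_wins)))
```

Each string claims to sort after the other, so no total order exists. Python's built-in sort still terminates on such input, but what a given sort does with an inconsistent comparator is implementation-dependent.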

If you think I'm over-stating, then please go to Iraq and ask the Sunni
how they feel about the Shia rules being dominant in some situations
when comparing Arabic strings....

And so, I return to my premise: why is this information associated with
strings, rather than being some sort of context that is associated with
an operation (e.g. "compare these two Arabic strings WRT the rules of
Sunni Bedouins speaking the Najdi Arabic dialect")? And why, again in
response to Dan's comments, is that monoculturalist of me?

So far, the responses have sounded like "that's complex, you wouldn't
understand". I hope that I'm making it clear that I understand the
complexities just fine, and I don't see associating language with a
string as resolving so much as introducing complexity to an already
intractable problem.

-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback
