On Tue, 2004-04-13 at 18:23, Leopold Toetsch wrote:
> Aaron Sherman <[EMAIL PROTECTED]> wrote:
> > For example, in Perl5/Ponie:
>
> >   @names=<NAMES>;
> >   print "Phone Book: ", sort(@names), "\n";
>
> > In this example, I don't see why I would care that NAMES might be a
> > pseudo-handle that iterates over several databases, and returns strings
> > in the 7 different languages
>
> I already did show an example where uc("i") isn't "I". Collating is still
> more complex than a »simple« uc().

Correct. I agree, and I don't think anything I said contradicted that,
did it?

> Well, we don't know what the caller expects. The caller has to decide.
> There are basically at least two ways: Treat all strings language
> independent (from their origin) or append more information to each
> string.

Hmmm... or the third, and far more common approach in all languages that
I've seen that deal with these issues: deal with the comparison
according to the rules set out by the language in which the comparison
is being done.

Why is that option being passed over? Is it considered to be, in some
way, identical to ignoring language distinctions? How?

> >> *) Provides language-sensitive character overrides ('ll' treated as a
> >> single character, for example, in Spanish if that's still desired)
> >> *) Provides language-sensitive grouping overrides.
>
> > Ah, and here we come to my biggest point of confusion.
>
> Another example:
>
>   "my dog Fiffi" eq "my dog Fi\x{fb03}"
>
> When my program is doing typographical computations, above equation is
> true. And useful. The characters "f", "f", "i" are goin' to be printed.
> But the ligature "ffi" takes less space when printed as such.
> This is the same character string, though, when I'm a reader of this dog
> news paper.

Ok, so here you essentially say, "in the typographical context this
statement has one result, in a string data context it has another."

So, why is that:

    "my dog Fiffi":language("blah") eq "my dog Fi\x{fb03}":language("blah")

and not

    use language "blah";
    "my dog Fiffi" eq "my dog Fi\x{fb03}"

and what in Parrot's name does

    "james":language("blah") eq "jim":language("bletch")

mean? Should "blah"'s language rules (in which "james" and "jim" are the
same name) or "bletch"'s language rules (in which they are not) take
priority? The comparison of two different languages would have to be
done in a third context of "culture" (e.g. "culture foo holds that
blah's rules for names are used and bletch's rules for everything else
are used except when a word in bletch was derived from a word used in
blah during the third invasion and swap meet of 1233").

Then, of course, we can get into how I feel about my program telling me
(in any context) that "ffi" and "\x{fb03}" are the same. I object for
any number of reasons, not the least of which is that I consider such
representations to be markup, not text... but that's just me, and
perhaps I'll just have to put "use language 'ignorant American geek'" at
the start of all of my programs ;)

> > b) Strings will have different languages and behave according to their
> > "sources" regardless of the native rules of the user.

Again, I have never seen any source of information that suggests that
there is a universally known way to implement the above. Don't get me
started on the impact of going to southeast Asia and suggesting that
"ok, one of your languages' rules has to win when comparing characters
of differing languages"... ha!

IMHO, the only thing that CAN be done at such a low level as Parrot is
to do the work according to the language rules that govern the rest of
this execution of the program, and if a string makes no sense in that
context, an exception is thrown. But otherwise, how do you sort \x{6728}
in Japanese vs Mandarin Chinese? The two languages have different
answers, and you HAVE to pick one.

> >> IW: Mush together (either concatenate or substr replacement) two
> >> strings of different languages but same charset
>
> > According to whose rules?
>
> User level - what do you want to achieve. At codepoint level the
> operation is fine. It doesn't make sense above that, though.

So, you seem to be suggesting that a single language (that of the user,
not the 2+ involved if you tag every string) should decide? If so, why
have strings tagged with language?

> > This means that someone's rules must become dominant,
>
> It doesn't make much sense to do
>
>   bors S0, S1 # stringwise bit not
>
> to anything that isn't singlebyte encoded. It depends.

Sorry, you lost me. Did I bring that up?

I was asking if:

    $a cmp $b

would have a result in which $b was considered with respect to $a's
language or vice versa. Most commonly (always?) there is an incomplete
intersection of rules between the two, so someone's rules will have to
"win". So you have choices:

* If you go with LHS vs. RHS, then sort gets borked because sort will
  reverse the "sides" repeatedly as it executes. This can and would
  result in infinite sort times.

* If you come up with a list of languages in descending order of
  dominance then there will be at least as many camps that disagree with
  you as the length of the list minus one.

* If you try to architect a universal set of rules for applying all
  language rules to strings involving all other language rules, you will
  finish right about the time Esperanto takes over the world, and you
  will have conducted several world wars in the process. If you think
  I'm over-stating, then please go to Iraq and ask the Sunni how they
  feel about the Shia rules being dominant in some situations when
  comparing Arabic strings....

And so, I return to my premise: why is this information associated with
strings, rather than being some sort of context that is associated with
an operation (e.g. "compare these two Arabic strings WRT the rules of
Sunni Bedouins speaking the Najdi Arabic dialect")? And why, again in
response to Dan's comments, is that monoculturistic of me?

So far, the responses have sounded like "that's complex, you wouldn't
understand". I hope that I'm making it clear that I understand the
complexities just fine, and I don't see associating language with a
string as resolving so much as introducing complexity to an already
intractable problem.

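The LHS-vs-RHS problem can be sketched in a few lines of Python. This is
a toy under loudly stated assumptions: "blah" and "bletch" are the
placeholder language names used above, the two collation rules in RULES
are invented for illustration, and cmp_lhs and sort_in_context are
hypothetical helpers, not anything in Parrot. Whether a particular sort
implementation would actually loop forever depends on its algorithm; what
the sketch does show is that a comparator driven by the left operand's
rules stops being a consistent ordering.

```python
from functools import cmp_to_key

# Invented collation rules for two made-up languages (illustration only).
RULES = {
    "blah": lambda s: s.lower(),   # case-insensitive ordering
    "bletch": lambda s: s,         # plain code point ordering
}

def cmp_lhs(x, y):
    """Compare (text, language) pairs using the LEFT operand's rules."""
    key = RULES[x[1]]
    kx, ky = key(x[0]), key(y[0])
    return (kx > ky) - (kx < ky)

def sort_in_context(items, language):
    """Sort the same pairs under one set of rules chosen by the caller."""
    key = RULES[language]
    return sorted(items, key=lambda item: key(item[0]))

a = ("a", "blah")
b = ("B", "bletch")

# Each operand claims to sort before the other: no consistent order exists.
print(cmp_lhs(a, b), cmp_lhs(b, a))

# An LHS-driven sort gives results that depend on the input order...
print(sorted([a, b], key=cmp_to_key(cmp_lhs)))
print(sorted([b, a], key=cmp_to_key(cmp_lhs)))

# ...while a context-driven sort does not.
print(sort_in_context([a, b], "blah") == sort_in_context([b, a], "blah"))
```

With the rules attached to the operation, the tags on the strings are
never consulted, which is the point of the question above.
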
-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback