In-Reply-To: Message from Darren Duncan <[EMAIL PROTECTED]>
of "Wed, 26 Nov 2008 19:34:09 PST." <[EMAIL PROTECTED]>
Tom Christiansen wrote:
I believe database folks have been doing the same with character data, but
I'm not up-to-date on the DB world, so maybe we have some metainfo about
the locale to draw on there. Tim?
AFAIK, modern databases are all strongly typed at least to the point
that the values you store in and fetch from them are each explicitly
character data or binary data or numbers or what-have-you; and so,
when you are dealing with a DBMS in terms of character data, it is
explicitly specified somewhere (either locally for the data or
globally/hardcoded for the DBMS) that each value of character data
belongs to a particular character repertoire and text encoding, and so
the DBMS knows what encoding etc the character data is in, or at least
it treats it consistently based on what the user said it was when it
input the data.
Oh, good then. That's what I'd heard was happening, but wasn't sure since
I've steered clear of such beasties since before it was true.
I wish our filesystems worked that way. But Andrew said something to me
last week about Ken and Dennis writing quite pointedly that while you
*could* use the f/s as a database, that you *shouldn't*. I didn't know
the reference he was thinking of, so just nodded pensively (=thoughtfully).
There is ABSOLUTELY NO WAY I've found to tell whether these UTF-8
strings should test equal, and when, nor how to order them, without
knowing the locale:
    "RESUME"
    "Resume"
    "resume"
    "Resum\x{e9}"
    "r\x{E9}sum\x{E9}"
    "r\x{E9}sume\x{301}"
    "Re\x{301}sume\x{301}"
Case insensitively, in Spanish they should be identical in all regards.
In French, they should be identical but for ties, in which case you
work your way right to left on the diacriticals.
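
To put numbers on how much the answer depends on the rule chosen, here is a
rough sketch (in Python rather than Perl, purely for illustration) that counts
how many distinct names the seven strings above collapse into under four
locale-free rules. The helper names nfc, nfc_fold, stripped_fold and
count_classes are made up for this example, and none of the rules is claimed
to be the right one for any particular language.

```python
import unicodedata

# The seven strings from the list above, written with explicit escapes.
NAMES = [
    "RESUME",
    "Resume",
    "resume",
    "Resum\u00e9",
    "r\u00e9sum\u00e9",
    "r\u00e9sume\u0301",
    "Re\u0301sume\u0301",
]

def nfc(s):
    # Canonical composition: e + U+0301 becomes U+00E9.
    return unicodedata.normalize("NFC", s)

def nfc_fold(s):
    # Locale-free Unicode case folding applied to the composed form.
    return nfc(s).casefold()

def stripped_fold(s):
    # Decompose, drop every combining mark, then fold case.
    decomposed = unicodedata.normalize("NFD", s)
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return bare.casefold()

def count_classes(key):
    # Number of distinct names once each string is mapped through `key`.
    return len({key(s) for s in NAMES})

for label, key in [
    ("raw code points", str),
    ("NFC", nfc),
    ("NFC + casefold", nfc_fold),
    ("marks stripped + casefold", stripped_fold),
]:
    print(label, count_classes(key))
```

The counts come out as 7, 6, 3 and 1. Which of those is "correct" is exactly
the part that a locale-free rule cannot decide.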
This leads me to talk about my main point about sensitivity etc.
I believe that the most important issues here, those having to do with
identity, can be discussed and solved without unduly worrying about
matters of collation;
It's funny you should say that, as I could nearly swear that I just showed
that identity cannot be determined in the examples above without knowing
about locales. To wit, while all of those sort somewhat differently, even
case-insensitively, no matter whether you're thinking of a French or a
Spanish ordering (and what is English's, anyway?), you have a more
fundamental = vs != scenario which is entirely locale-dependent.
If I can make a "RESUME" file, ought I be able to make a distinct
"r\x{E9}sum\x{E9}" or "re\x{301}sume\x{301}" file in a case-ignorant
filesystem? There is no good answer, because we might think it
reasonable to test
    lc(strip_marks($old_fn)) eq lc(strip_marks($new_fn))
The problem is that what is or is not a "mark" varies by locale:
* Castilian doesn't think ~ is a mark; Portuguese does, and
so if you strip marks, in Castilian you count as the same
two letters that it deems distinct, but in Portuguese you
incur no lasting harm.
* Catalan doesn't think ¸ is a mark; French does, and so if you strip
marks, in Catalan you count as the same two letters that it deems
distinct, but in French or Portuguese you incur no lasting harm.
* Modern English (usually) decomposes æ into a+e, but OE/AS and
Icelandic do not.
* Moreover, Icelandic deems é and e to be completely
different letters altogether. If you strip marks, you
count as the same letters that that language does not.
Similarly with ö, which is at the end of their alphabet,
(like ø in some), and nowhere near o or ó. BTW, those
are three separate letters, not variants.
* And in OE/AS you could have a long mark on an asc (say "ash" for the
atomic *letter* æ). If split into a and e and stripped of marks, it
wouldn't make any sense at all.
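
As a rough sketch of why a single strip_marks() cannot serve every locale,
here is one possible version in Python (not Perl, and not anybody's actual
implementation). ATOMIC_LETTERS is a hypothetical, hand-written and very
incomplete table invented for this example; real data would have to come
from a locale database. strip_marks_for and same_name are likewise made-up
names.

```python
import unicodedata

def strip_marks(text):
    """Drop every combining mark after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Hypothetical table (an assumption for illustration only): letters that a
# locale treats as letters in their own right, not as base letter plus mark.
ATOMIC_LETTERS = {
    "es": {"\u00f1"},                      # n with tilde
    "ca": {"\u00e7"},                      # c with cedilla
    "is": {"\u00e9", "\u00f3", "\u00f6"},  # e acute, o acute, o diaeresis
}

def strip_marks_for(text, locale=None):
    """Strip marks except on letters the given locale considers atomic."""
    keep = ATOMIC_LETTERS.get(locale, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        out.append(ch if ch.lower() in keep else strip_marks(ch))
    return "".join(out)

def same_name(a, b, locale=None):
    """Mark-insensitive, case-insensitive comparison under one locale."""
    return (strip_marks_for(a, locale).casefold()
            == strip_marks_for(b, locale).casefold())

print(same_name("a\u00f1o", "ANO"))        # no locale: marks all stripped
print(same_name("a\u00f1o", "ANO", "es"))  # Castilian: the tilde letter is kept
print(strip_marks("\u01e3") == "\u00e6")   # ae-with-macron loses only the macron
```

Note that the plain strip_marks leaves æ alone, since it has no canonical
decomposition, so the a+e question is a separate one from mark stripping.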
Case in point: Ælene Frisch, whom many of you doubtless know, insists her
name be spelt as I have written it. She does not want Aelene Frisch, for
she considers her forename to have 5 letters in it, not 6. But Unicode
doesn't give us a title case version of that (did AS?), suggesting it a
ligature not a digraph.
But if we have a file called "ÆLENE", may we assume it the same in a case-
insensitive sense to both "aelene" and "ælene"?
I can only go on code-points, because I don't want to deal with ß and SS
and Ss. Case-folding file systems are just begging for trouble, and I just
don't know what to do. Think of the 3 Greek sigmata.
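
For what the default, locale-free Unicode case folding does with these cases,
here is a small Python sketch (Python only because it is easy to check; the
helper names same_by_lower and same_by_fold are invented for this example).
It shows behaviour, not a recommendation for what a filesystem should do.

```python
def same_by_lower(a, b):
    # Naive comparison: lowercase both sides.
    return a.lower() == b.lower()

def same_by_fold(a, b):
    # Comparison using Unicode full case folding.
    return a.casefold() == b.casefold()

# Sharp s against SS: lowercasing keeps them apart, folding merges them.
print(same_by_lower("Stra\u00dfe", "STRASSE"))
print(same_by_fold("Stra\u00dfe", "STRASSE"))

# Capital sigma, small sigma, final sigma all fold to one code point.
sigmas = ["\u03a3", "\u03c3", "\u03c2"]
print(len({s.casefold() for s in sigmas}))

# The ash: folding matches its own lowercase form but never a + e.
print(same_by_fold("\u00c6LENE", "\u00e6lene"))
print(same_by_fold("\u00c6LENE", "aelene"))
```

So under plain folding "ÆLENE" matches "ælene" but not "aelene", while the
sharp s does spill into two letters.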
identity is a lot more important than collation, as well as a
precondition for collation, and collation is a lot more difficult and can
be put off.
I agree with everything save "and can be put off". I would like
you to be right. I should truly wish to be mistaken. And I don't know
what we have for prior (cough) art.
With respect to dealing with a file system, generally it is just identity that
matters and collation is a concern that can typically be just tacked on
after identity is solved.
That is, with a file system you need to know whether or not a file name
you hold will or won't match a file in the system, and matching or not-
matching is the main function of an identity.
But you can't match without knowing locales. It's NOT just collation. I'll
leave Icelandic out of it, but look at the trouble with 0xDF spilling from
one each to two chars and two bytes in the perl5 regex engine. Then look
at 0xFF spilling from one char to one char and three bytes there. It's
just plain horripilating.
Collation criteria is something that can be naturally applied externally
to a file system, such as by a user program, and only identity criteria
needs to be built-in to the file system.
I don't think you can do identity (case-wise) correctly without regard to
digraphs and a world of weirdnesses we really wish we didn't. But you know
what else I wonder: what existing art *IS* there? It's so hard a problem
that I wonder if anyone has done a good job at it.
Talking to the standards geeks at Usenix, including Andrew, brought no joy.
They basically just threw up their hands, and lunch. I really wish I
could talk to Rob Pike and Udi Manber, my old theory and regex prof, but I
think they've both drunk the Googlaide now. I know Google strips accents
willy-nilly and does case-insensitive compares, but I don't know if that's a
global solution.
So collation doesn't need to be considered in Perl's file-system
interface, while identity does; collation can be a layer on top of the
core interface that just cares about identity.
That seems a simplified version of reality. Identity isn't what monoglots
think it is.
If you *know* that the 7 strings are all UTF-8, then locale doesn't have
to be considered for equality; just your unicode abstraction level
matters, such as if you're defining the values in terms of graphemes vs
codepoints vs bytes.
That's not true. é is not the same letter as e in Icelandic.
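
To make the "abstraction level" distinction concrete, here is a minimal
Python sketch that measures the same word at three levels. The function name
levels is invented for this example, and the grapheme count is only an
approximation (it counts non-combining characters after decomposition, not
full grapheme clusters).

```python
import unicodedata

def levels(s):
    """Return (UTF-8 bytes, code points, approximate graphemes) for s."""
    utf8_bytes = len(s.encode("utf-8"))
    code_points = len(s)
    # Approximation: count base characters only, ignoring combining marks.
    approx_graphemes = sum(
        1 for c in unicodedata.normalize("NFD", s)
        if not unicodedata.combining(c)
    )
    return utf8_bytes, code_points, approx_graphemes

composed = "r\u00e9sum\u00e9"        # precomposed e-acute
decomposed = "re\u0301sume\u0301"    # e followed by combining acute

print(levels(composed))
print(levels(decomposed))
```

The two spellings differ in bytes (8 vs 10) and code points (6 vs 8) and
agree only at the grapheme level (6). What none of the three levels settles
is whether é and e count as the same letter; that stays a locale question.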
See what a mess it's going into? Larry, can you think of something
simple? I haven't been able to. Unicode solves so few of the problems
people think it does. We've still so much to do, and I don't just
mean perlers.
AFAIK, Unicode does have an answer for the most important problems.
Darren>> To summarize, what we really want is something more generic
Darren>> than case-sensitivity, which is text normalization and text
Darren>> folding in general, as well as distinctly dealing with
Darren>> distinctness for representation versus distinctness for mutual
Darren>> exclusivity.
I think that you might have to use a Unicode::Collate object, since
it uses the standard DUCET. It doesn't help much for actual locales, but it
does take care of some of the things you're concerned with.
Makes sense.
Yes, I think so too. But it is very expensive in performance. Play with
my program. Makes you want to cheat.
Darren>> [This] implies that sensitivity is special whereas sensitivity
Darren>> should be considered normal, and rather insensitivity should
Darren>> be considered special.
I think Darren may be right, because even case-sensitivity is a real
problem.
It sure is.
No kidding. :-(
--tom