Tom Christiansen wrote:
I believe database folks have been doing the same with character data, but
I'm not up-to-date on the DB world, so maybe we have some metainfo about
the locale to draw on there.  Tim?

AFAIK, modern databases are all strongly typed, at least to the point that the values you store in and fetch from them are each explicitly character data, binary data, numbers, or what-have-you. So when you are dealing with a DBMS in terms of character data, it is explicitly specified somewhere (either locally for the data or globally/hardcoded for the DBMS) that each character value belongs to a particular character repertoire and text encoding. The DBMS therefore knows what encoding the character data is in, or at least treats it consistently based on what the user said it was on input. The only time this information isn't remembered is when the data is supplied as binary data.

Maybe some older or unusual DBMSs aren't this way, and of course technically a filesystem etc. *is* a database ... I think the example mentioned earlier, about filename storage being locale-dependent, probably meant that at the actual filesystem level the names were just being dealt with as binary data.

There is ABSOLUTELY NO WAY I've found to tell whether these utf-8
strings should test equal, and when, nor how to order them, without
knowing the locale:

    "RESUME",
    "Resume"
    "resume"
    "Resum\x{e9}"
    "r\x{E9}sum\x{E9}"
    "r\x{E9}sume\x{301}"
    "Re\x{301}sume\x{301}"

Case-insensitively, in Spanish they should be identical in all
regards. In French, they should be identical but for ties, in which case you work your way right to left on the diacriticals.

This leads me to my main point about sensitivity and the like.

I believe that the most important issues here, those having to do with identity, can be discussed and solved without unduly worrying about matters of collation. Identity is more important than collation, and a precondition for it; collation is also a lot more difficult, and can be put off. With respect to a file system in particular, it is generally identity that matters, and collation is a concern that can typically be tacked on after identity is solved.

That is, with a file system you need to know whether or not a file name you hold will match a file in the system, and matching or not matching is the main function of an identity. Similarly, the file system has to ensure that no 2 distinct files in it have the same file name, that is, the same public identity. In contrast, the order in which you sort a list of files by name usually isn't so important; all work with a file system requires working with identities, but most work does not need collation. In practice, several parties can agree on a single means of identifying files while still having their own favorite collations, so the same list can be ordered in different ways.

Collation criteria are something that can naturally be applied externally to a file system, such as by a user program; only identity criteria need to be built into the file system.

So collation doesn't need to be considered in Perl's file-system interface, while identity does; collation can be a layer on top of the core interface that just cares about identity.
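To make that layering concrete, here is a minimal sketch in Perl. The directory path is hypothetical, the Unicode::Collate usage is just one possible external layer, and this glosses over decoding the raw filename bytes, which is its own can of worms:

    use v5.14;
    use Unicode::Collate;

    # Identity is the file system's concern: an exact-match lookup.
    opendir my $dh, '/tmp/docs' or die "can't opendir: $!";  # hypothetical path
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    my %seen = map { $_ => 1 } @names;   # identity test: plain hash-key equality
    say exists $seen{'resume.txt'} ? 'found' : 'not found';

    # Collation is the caller's concern, layered on afterwards.
    my $coll = Unicode::Collate->new;
    say for $coll->sort(@names);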

One maxim I apply in my database work, and that I believe applies to this discussion, is "any logical difference is a big difference". If you have 2 distinct value literals, and you consider the difference in their spellings significant, such that you can't substitute one literal for the other in all use cases, then the 2 literals denote 2 distinct values; conversely, if you can always substitute one for the other harmlessly, then they denote the same value. The concepts of 'value' and 'identity' are the same, and any value is its own identity.

And so, with your 7 'resume' literals, I would say that if there is a reason for any of the spellings to exist that couldn't be handled by one of the other spellings, then all 7 literals are distinct/non-identical taken as-is.

If you *know* that the 7 strings are all UTF-8, then locale doesn't have to be considered for equality; just your Unicode abstraction level matters, such as whether you're defining the values in terms of graphemes vs codepoints vs bytes.
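For illustration, here is a small Perl sketch of how those three abstraction levels count the same string differently (the counts in the comments are for this particular string):

    use v5.14;
    use Encode qw(encode);

    my $s = "re\x{301}sume\x{301}";   # 'e' + COMBINING ACUTE ACCENT, twice
    say length $s;                         # 8 codepoints
    say length encode('UTF-8', $s);        # 10 bytes once encoded as UTF-8
    say scalar( () = $s =~ /\X/g );        # 6 graphemes; marks attach to their base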

When talking about identity, there is no such thing as case-insensitivity or accent-insensitivity or whitespace-insensitivity or what have you. If you have any reason not to replace every "E" with an "e" (or vice-versa) in your character string, then you consider those 2 non-identical, and so they wouldn't match. By contrast, true case-insensitivity means you can replace every "e" with an "E" (for example) and forget that an "e" ever existed; the actual equality test is then the same as ever, since all comparands would only contain "E".

And so, as I brought up before, the generalization of case-concerning matters is normalization and folding. Where the normal, everything-sensitive comparison of character data is just "$foo eq $bar", case-insensitive comparison is really "lc($foo) eq lc($bar)", accent-insensitive is "strip_accents($foo) eq strip_accents($bar)", and whitespace-insensitive is "strip_ws($foo) eq strip_ws($bar)". In every case the actual "eq" is everything-sensitive, but since its arguments have been normalized, the actual domain of characters being compared is smaller. In each normalized case you aren't comparing $foo and $bar at all; you are comparing 2 other values.
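Here is one way those helpers could look in modern Perl. strip_accents and strip_ws are the hypothetical names from above, and these bodies are just plausible sketches; note also that fc() is generally a better case-fold than lc() for comparison purposes:

    use v5.16;    # for fc()
    use Unicode::Normalize qw(NFD NFC);

    sub strip_accents {
        my ($s) = @_;
        my $d = NFD($s);        # decompose, so marks become separate codepoints
        $d =~ s/\p{Mn}//g;      # drop combining (nonspacing) marks
        return NFC($d);
    }

    sub strip_ws {
        my ($s) = @_;
        $s =~ s/\s+//g;
        return $s;
    }

    my ($foo, $bar) = ("r\x{E9}sum\x{E9}", "RESUME");
    say fc($foo) eq fc($bar) ? 'same' : 'differ';    # differ: accents remain
    say fc(strip_accents($foo)) eq fc(strip_accents($bar))
        ? 'same' : 'differ';                         # same: folded and stripped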

Now, normalization can be arbitrarily complex or varied, but ultimately it's just a functional mapping or functional dependency (the normalized version is the dependent and the non-normalized one is the determinant), and at the end of the day the actual comparison or identity tests are the same and simple.

As for collation: if the collation is deterministic and totally ordered, then no 2 distinct characters compare as 'same', and one will always sort before the other. If 2 characters do sort as 'same', then either the collation is only partially ordered, in which case the 2 characters would order arbitrarily relative to each other, or the 2 characters are in fact the same character and all occurrences of one can safely be replaced by the other. No matter what your collation is, no 2 characters considered non-identical will compare as 'same' unless the collation is only partially ordered.
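Unicode::Collate shows both behaviors directly. The level settings below are standard UCA strengths, and the particular strings are just for illustration:

    use v5.14;
    use Unicode::Collate;

    # Primary strength only: a partial order where distinct strings tie.
    my $primary = Unicode::Collate->new(level => 1);
    say $primary->eq("resume", "R\x{C9}SUM\x{C9}") ? 'same' : 'differ';   # same

    # All 4 levels (the default): the distinct strings no longer tie.
    my $full = Unicode::Collate->new;
    say $full->eq("resume", "R\x{C9}SUM\x{C9}") ? 'same' : 'differ';      # differ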

See what a mess it's turning into? Larry, can you think of something
simple? I haven't been able to. Unicode solves so few of the problems
people think it does. We've still so much to do, and I don't just
mean perlers.

AFAIK, Unicode does have an answer for the most important problems.

Darren>> To summarize, what we really want is something more generic
Darren>> than case-sensitivity, which is text normalization and text
Darren>> folding in general, as well as distinctly dealing with
Darren>> distinctness for representation versus distinctness for mutual
Darren>> exclusivity.

I think that you might have to use a Unicode::Collate object, with
the standard DUCET.  It doesn't help much for actual locales, but it
does take care of some of the things you're concerned with.

Makes sense.
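For actual locales, though, the Unicode::Collate::Locale subclass layers CLDR locale tailorings on top of the DUCET. A small sketch, with the locale code and strings as assumptions:

    use v5.14;
    use Unicode::Collate::Locale;

    # The 'fr_CA' tailoring compares accents right-to-left, per French rules.
    my $fr = Unicode::Collate::Locale->new(locale => 'fr_CA');
    say for $fr->sort("Resume", "r\x{E9}sum\x{E9}", "RESUME", "resume");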

  Two issues:

  **MAJOR** This is the opposite of small, fast, svelte.
    minor   You had better use the canonical forms, since you don't want

              "e\N{COMBINING DOWN TACK BELOW}\N{COMBINING TILDE}\N{LATIN SMALL LETTER N WITH LEFT HOOK}je\N{COMBINING DOWN TACK BELOW}"
              "e\N{COMBINING TILDE}\N{COMBINING DOWN TACK BELOW}\N{LATIN SMALL LETTER N WITH LEFT HOOK}je\N{COMBINING DOWN TACK BELOW}"

            to be different; nor, case-insensitively, for these to differ:

              "EN\N{COMBINING TILDE}E"
              "e\N{LATIN CAPITAL LETTER N WITH TILDE}e"

This depends on your abstraction level. If you're working in terms of graphemes, then AFAIK those are considered identical; if in terms of codepoints, then not. But still, I agree that using canonical forms is very helpful and ideal, since then you don't need the grapheme level and the codepoint level would do what we want. Technically, though, this is an example of what I was saying about normalization: if Perl's "eq" were always codepoint-oriented, then people would have to say e.g. "NFC($foo) eq NFC($bar)" to get grapheme-insensitive comparisons. But the grapheme abstraction level is generally what you want anyway, since character data is for humans, and humans don't consider the various Unicode normal forms as distinct characters; they *display* with exactly the same glyphs.
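Taking a shortened version of the pair above, here's what that normalization buys (a sketch; NFD would do equally well for the equality test):

    use v5.14;
    use Unicode::Normalize qw(NFC);
    use charnames ':full';

    my $a = "e\N{COMBINING DOWN TACK BELOW}\N{COMBINING TILDE}";
    my $b = "e\N{COMBINING TILDE}\N{COMBINING DOWN TACK BELOW}";

    say $a eq $b           ? 'same' : 'differ';   # differ: raw codepoint order differs
    say NFC($a) eq NFC($b) ? 'same' : 'differ';   # same: canonical reordering fixes it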

Darren>> [This] implies that sensitivity is special whereas sensitivity
Darren>> should be considered normal, and rather insensitivity should be
Darren>> considered special.

I think Darren may be right, because even case-sensitivity is a real problem.

It sure is.

-- Darren Duncan
