Tom Christiansen wrote:
I believe database folks have been doing the same with character data, but
I'm not up-to-date on the DB world, so maybe we have some metainfo about
the locale to draw on there.  Tim?

AFAIK, modern databases are all strongly typed, at least to the point that the values you store in and fetch from them are each explicitly character data, binary data, numbers, or what-have-you. So when you are dealing with a DBMS in terms of character data, it is explicitly specified somewhere (either locally for the data or globally/hardcoded for the DBMS) that each character value belongs to a particular character repertoire and text encoding. The DBMS therefore knows what encoding the character data is in, or at least treats it consistently based on what the user said it was on input. The only time this information isn't remembered is when the data is supplied as binary data.

Maybe some older or unusual DBMSs aren't this way, and of course technically a filesystem etc. *is* a database ... I think the example mentioned earlier, about filename storage being locale-dependent, probably meant that at the actual filesystem level the names were just being dealt with as binary data.

There is ABSOLUTELY NO WAY I've found to tell whether these utf-8
strings should test equal, and when, nor how to order them, without
knowing the locale:

    "RESUME",
    "Resume"
    "resume"
    "Resum\x{e9}"
    "r\x{E9}sum\x{E9}"
    "r\x{E9}sume\x{301}"
    "Re\x{301}sume\x{301}"

Case-insensitively, in Spanish they should be identical in all
regards. In French, they should be identical but for ties, in which case you work your way right to left on the diacriticals.

This leads me to my main point about sensitivity and the like.

I believe that the most important issues here, those having to do with identity, can be discussed and solved without unduly worrying about matters of collation. Identity is more important than collation, and a precondition for it; collation is also a lot more difficult, and can be put off. With respect to a file system in particular, it is generally identity that matters, and collation is a concern that can typically be tacked on after identity is solved.

That is, with a file system you need to know whether or not a file name you hold will match a file in the system, and matching or not matching is the main function of an identity. Similarly, the file system has to ensure that no 2 distinct files in it have the same file name, that is, the same public identity. In contrast, the order in which you sort a list of files by name usually isn't so important; all work with a file system requires working with identities, but most work does not need collation. In practice, several parties can agree on a single means of identifying files while still having their own favorite collations, so the same list can be ordered in different ways.

Collation criteria are something that can naturally be applied externally to a file system, such as by a user program; only identity criteria need to be built into the file system.

So collation doesn't need to be considered in Perl's file-system interface, while identity does; collation can be a layer on top of the core interface that just cares about identity.
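To make that layering concrete, here is a minimal sketch in Perl. The directory path is hypothetical, the Unicode::Collate usage is just one possible external layer, and this glosses over decoding the raw filename bytes, which is its own can of worms:

    use v5.14;
    use Unicode::Collate;

    # Identity is the file system's concern: an exact-match lookup.
    opendir my $dh, '/tmp/docs' or die "can't opendir: $!";  # hypothetical path
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    my %seen = map { $_ => 1 } @names;   # identity test: plain hash-key equality
    say exists $seen{'resume.txt'} ? 'found' : 'not found';

    # Collation is the caller's concern, layered on afterwards.
    my $coll = Unicode::Collate->new;
    say for $coll->sort(@names);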

One maxim I apply in my database work, and that I believe applies to this discussion, is "any logical difference is a big difference". If you have 2 distinct value literals, and you consider the difference in their spellings significant, such that you can't substitute one literal for the other in all use cases, then the 2 literals denote 2 distinct values; conversely, if you can always substitute one for the other harmlessly, then they denote the same value. The concepts of 'value' and 'identity' are the same, and any value is its own identity.

And so, with your 7 'resume' literals, I would say that if there is a reason for any of the spellings to exist that couldn't be handled by one of the other spellings, then all 7 literals are distinct/non-identical taken as-is.

If you *know* that the 7 strings are all UTF-8, then locale doesn't have to be considered for equality; just your Unicode abstraction level matters, such as whether you're defining the values in terms of graphemes vs codepoints vs bytes.
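For illustration, here is a small Perl sketch of how those three abstraction levels count the same string differently (the counts in the comments are for this particular string):

    use v5.14;
    use Encode qw(encode);

    my $s = "re\x{301}sume\x{301}";   # 'e' + COMBINING ACUTE ACCENT, twice
    say length $s;                         # 8 codepoints
    say length encode('UTF-8', $s);        # 10 bytes once encoded as UTF-8
    say scalar( () = $s =~ /\X/g );        # 6 graphemes; marks attach to their base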

When talking about identity, there is no such thing as case-insensitivity or accent-insensitivity or whitespace-insensitivity or what have you. If you have any reason not to replace every "E" with an "e" (or vice-versa) in your character string, then you consider those 2 non-identical, and so they wouldn't match. By contrast, true case-insensitivity means you can replace every "e" with an "E" (for example) and forget that an "e" ever existed; the actual equality test is then the same as ever, since all comparands would only contain "E".

And so, as I brought up before, the generalization of case-concerning matters is normalization and folding. Where the normal, everything-sensitive comparison of character data is just "$foo eq $bar", case-insensitive comparison is really "lc($foo) eq lc($bar)", accent-insensitive is "strip_accents($foo) eq strip_accents($bar)", and whitespace-insensitive is "strip_ws($foo) eq strip_ws($bar)". In every case the actual "eq" is everything-sensitive, but since its arguments have been normalized, the actual domain of characters being compared is smaller. In each normalized case you aren't comparing $foo and $bar at all; you are comparing 2 other values.
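Here is one way those helpers could look in modern Perl. strip_accents and strip_ws are the hypothetical names from above, and these bodies are just plausible sketches; note also that fc() is generally a better case-fold than lc() for comparison purposes:

    use v5.16;    # for fc()
    use Unicode::Normalize qw(NFD NFC);

    sub strip_accents {
        my ($s) = @_;
        my $d = NFD($s);        # decompose, so marks become separate codepoints
        $d =~ s/\p{Mn}//g;      # drop combining (nonspacing) marks
        return NFC($d);
    }

    sub strip_ws {
        my ($s) = @_;
        $s =~ s/\s+//g;
        return $s;
    }

    my ($foo, $bar) = ("r\x{E9}sum\x{E9}", "RESUME");
    say fc($foo) eq fc($bar) ? 'same' : 'differ';    # differ: accents remain
    say fc(strip_accents($foo)) eq fc(strip_accents($bar))
        ? 'same' : 'differ';                         # same: folded and stripped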

Now, normalization can be arbitrarily complex or varied, but ultimately it's just a functional mapping or functional dependency (the normalized version is the dependent and the non-normalized one is the determinant), and at the end of the day the actual comparison or identity tests are the same and simple.

As for collation: if the collation is deterministic and totally ordered, then no 2 distinct characters compare as 'same', and one will always sort before the other. If 2 characters do sort as 'same', then either the collation is only partially ordered, in which case the 2 characters would order arbitrarily relative to each other, or the 2 characters are in fact the same character and all occurrences of one can safely be replaced by the other. No matter what your collation is, no 2 characters considered non-identical will compare as 'same' unless the collation is only partially ordered.
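Unicode::Collate shows both behaviors directly. The level settings below are standard UCA strengths, and the particular strings are just for illustration:

    use v5.14;
    use Unicode::Collate;

    # Primary strength only: a partial order where distinct strings tie.
    my $primary = Unicode::Collate->new(level => 1);
    say $primary->eq("resume", "R\x{C9}SUM\x{C9}") ? 'same' : 'differ';   # same

    # All 4 levels (the default): the distinct strings no longer tie.
    my $full = Unicode::Collate->new;
    say $full->eq("resume", "R\x{C9}SUM\x{C9}") ? 'same' : 'differ';      # differ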

See what a mess it's turning into? Larry, can you think of something
simple? I haven't been able to. Unicode solves so few of the problems
people think it does. We've still so much to do, and I don't just
mean perlers.

AFAIK, Unicode does have an answer for the most important problems.

Darren>> To summarize, what we really want is something more generic
Darren>> than case-sensitivity, which is text normalization and text
Darren>> folding in general, as well as distinctly dealing with
Darren>> distinctness for representation versus distinctness for mutual
Darren>> exclusivity.

I think that you might have to use a Unicode::Collate object, with
the standard DUCET.  It doesn't help much for actual locales, but it
does take care of some of the things you're concerned with.

Makes sense.
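For actual locales, though, the Unicode::Collate::Locale subclass layers CLDR locale tailorings on top of the DUCET. A small sketch, with the locale code and strings as assumptions:

    use v5.14;
    use Unicode::Collate::Locale;

    # The 'fr_CA' tailoring compares accents right-to-left, per French rules.
    my $fr = Unicode::Collate::Locale->new(locale => 'fr_CA');
    say for $fr->sort("Resume", "r\x{E9}sum\x{E9}", "RESUME", "resume");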

  Two issues:

  **MAJOR** This is the opposite of small, fast, svelte.
    minor   You had better use the canonical forms, since you don't want

              "e\N{COMBINING DOWN TACK BELOW}\N{COMBINING TILDE}\N{LATIN SMALL LETTER N WITH LEFT HOOK}je\N{COMBINING DOWN TACK BELOW}"
              "e\N{COMBINING TILDE}\N{COMBINING DOWN TACK BELOW}\N{LATIN SMALL LETTER N WITH LEFT HOOK}je\N{COMBINING DOWN TACK BELOW}"

            to be different; nor, case-insensitively, for these to differ:

              "EN\N{COMBINING TILDE}E"
              "e\N{LATIN CAPITAL LETTER N WITH TILDE}e"

This depends on your abstraction level. If you're working in terms of graphemes, then AFAIK those are considered identical; if in terms of codepoints, then not. But still, I agree that using canonical forms is very helpful and ideal, since then you don't need the grapheme level and the codepoint level would do what we want. Technically, though, this is an example of what I was saying about normalization: if Perl's "eq" were always codepoint-oriented, then people would have to say e.g. "NFC($foo) eq NFC($bar)" to get grapheme-insensitive comparisons. But the grapheme abstraction level is generally what you want anyway, since character data is for humans, and humans don't consider the various Unicode normal forms as distinct characters; they *display* with exactly the same glyphs.
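Taking a shortened version of the pair above, here's what that normalization buys (a sketch; NFD would do equally well for the equality test):

    use v5.14;
    use Unicode::Normalize qw(NFC);
    use charnames ':full';

    my $a = "e\N{COMBINING DOWN TACK BELOW}\N{COMBINING TILDE}";
    my $b = "e\N{COMBINING TILDE}\N{COMBINING DOWN TACK BELOW}";

    say $a eq $b           ? 'same' : 'differ';   # differ: raw codepoint order differs
    say NFC($a) eq NFC($b) ? 'same' : 'differ';   # same: canonical reordering fixes it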

Darren>> [This] implies that sensitivity is special whereas sensitivity
Darren>> should be considered normal, and rather insensitivity should be
Darren>> considered special.

I think Darren may be right, because even case-sensitivity is a real problem.

It sure is.

-- Darren Duncan
