Re: encoding of file names

Ken Thomases Tue, 24 May 2011 22:14:42 -0700

On May 24, 2011, at 11:09 PM, Quincey Morris wrote:

> On May 24, 2011, at 17:33, Ken Thomases wrote:
> 
>>> I am sure this becomes more difficult with Arabic, Hebrew and Thai and 
>>> other writing systems that have highly composed forms. (not sure if that's 
>>> the right term)
>> 
>> Not really. 
> 
> There *is* another level, described briefly here:
> 
> http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html
> 
> As I understand things, there are at least 3 levels, informally at least:
> 
> 1. Codepoints. Each Unicode codepoint is represented by 1, 2 or more 8, 16 or 
> 32 bit values (UTF-8, UTF-16, etc). I don't know if the individual 8, 16 or 
> 32 bit components have an official name. I call them "components".
> 
> 2. Characters. Some Unicode characters consist of a base codepoint and one or 
> more combining marks (accents). Some characters may be representable as 
> either a single codepoint (precomposed) or multiple codepoints (decomposed), 
> and there are various normalization rule sets that specify the order and 
> composition for various contexts.
> 
> 3. Grapheme clusters. Some written units in some languages (such as Arabic, 
> Hebrew and Thai) are made up of multiple characters.
> 
> This means that, in general, a single grapheme cluster may consist of a 
> variable number of characters, which may each consist of a variable number of 
> Unicode codepoints, which may each consist of a variable number of components.


This is all correct, but seems to me to introduce stuff that's irrelevant to 
the current discussion, which was about comparing strings and, in particular, 
file paths.  Grapheme clusters and surrogate pairs really only come into play 
when one is splitting strings or identifying indexes or sub-ranges 
corresponding to what users think of as characters.  They don't affect 
comparison of strings for equality, although they may affect comparison for 
sorting for display.

Also, I wouldn't say that codepoints "may each consist of a variable number of 
components".  They may be _encoded_ to a variable number of components, but 
don't "consist" of them.
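
As a rough illustration of "encoded to" versus "consist of" (in Python rather
than Cocoa, purely for brevity; the helper name code_unit_counts is made up
for this sketch), the same single codepoint takes a different number of units
depending on the encoding form:

```python
def code_unit_counts(ch):
    """Return (UTF-8, UTF-16, UTF-32) unit counts for a one-codepoint string."""
    assert len(ch) == 1  # Python's str is indexed by codepoint
    return (
        len(ch.encode("utf-8")),           # 8-bit units
        len(ch.encode("utf-16-le")) // 2,  # 16-bit units (no BOM)
        len(ch.encode("utf-32-le")) // 4,  # 32-bit units (no BOM)
    )

print(code_unit_counts("A"))           # U+0041
print(code_unit_counts("\u00D6"))      # U+00D6, precomposed O with diaeresis
print(code_unit_counts("\U0001D11E"))  # U+1D11E, outside the BMP
```

The codepoint is the same abstract value in every case; only its encoded
length changes.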


> Within Cocoa, the "native" string capabilities happen to be implemented in 
> terms of 16 bit components whose type is 'unichar'. (Specifically, 'unichar' 
> is *not* a Unicode character type, nor even a Unicode codepoint type. It's a 
> raw component value. This is in spite of the fact that NSString methods that 
> access these components refer to them, incorrectly, as "characters".)
> 
> In class NSString, though, except when you specifically access individual 
> components or use methods and options specifically relating to composition, 
> strings are treated as *character* sequences, meaning that composition and 
> normalization are generally handled transparently.

This last bit is not true.  For the most part, NSString deals with sequences of 
UTF-16 units.  It is the exception, not the norm, for NSString to transparently 
ignore differences in composition -- i.e. to treat its contents as characters.

> AFAIK the file system operates at level 2, which means that composition and 
> normalization are *not* significant in file name comparisons, though file 
> names *are* stored with a canonical composition and normalization.
> 
> Ken, is that a correct statement of how it works?

This is basically correct.  The file system APIs normalize all paths they 
receive to Apple's variant of NFD.  So, the caller does not have to normalize 
it themselves in order to match a file path if, for example, they're trying to 
open an existing file.
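
A minimal sketch of why that matching works, assuming standard NFD as a
stand-in for Apple's variant (which differs in some details) and using a
hypothetical helper, same_file_name, that never touches a real file system.
Once both names are normalized to the same form, the precomposed and
decomposed spellings become identical:

```python
import unicodedata

def same_file_name(a, b):
    """Compare two names after normalizing both to standard NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00D6l.txt"  # U+00D6 followed by "l.txt"
decomposed = "O\u0308l.txt"  # U+004F U+0308 followed by "l.txt"

print(precomposed == decomposed)                # literal comparison
print(same_file_name(precomposed, decomposed))  # comparison after NFD
```

The literal comparison reports the two names as different; the normalized
comparison reports them as the same.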

Still, if you need to supply a file path to a C API, you should use one of the 
file-system-representation methods or functions.  Likewise, if you receive a 
file path from a C API, you should use the appropriate 
file-system-representation-taking methods or functions to obtain an NSString or 
CFString object from it.


>> You just need to be aware of the semantics of the operations you're 
>> performing so you can pick the right one -- i.e. isEqual: and 
>> isEqualToString: perform literal comparison, while -compare: does not, and 
>> the -compare:options:... methods let you choose that as well as 
>> case-sensitivity, diacritic-sensitivity, and width-sensitivity.
> 
> And "literal" means component by component. The NSString class reference 
> describes 'NSLiteralSearch' like this:
> 
>> Exact character-by-character equivalence.
> 
> I've always understood this to mean unichar by unichar, i.e. component by 
> component, since the NSString documentation generally refers to components as 
> "characters".

Well, "components" as you define them are the result of a particular encoding 
(e.g. UTF-16).  "Literal" means codepoint-by-codepoint equivalence.  Since the 
encodings are one-to-one, that implies component-by-component equivalence, too.

Remember that UTF-16 determines the _interface_ of NSString, not necessarily 
its storage.  Internally, it may be storing some strings in UTF-8.  So, when 
you consider a comparison between two NSStrings, the important thing isn't the 
components, but the codepoints that they encode.
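
A sketch of the same idea outside Cocoa, in Python, where a round trip through
a byte encoding stands in for the backing store (an analogy only; it says
nothing about how NSString actually stores anything, and the helper name via
is made up):

```python
composed = "\u00D6"   # one codepoint: U+00D6
sequence = "O\u0308"  # two codepoints: U+004F U+0308

def via(encoding, s):
    """Round-trip s through a byte encoding, standing in for a backing store."""
    return s.encode(encoding).decode(encoding)

# The same codepoints compare equal whatever bytes they were stored as.
same = via("utf-8", composed) == via("utf-16-le", composed)

# Different codepoints compare unequal, again regardless of storage.
different = via("utf-8", composed) != via("utf-16-le", sequence)

print(same, different)
```

Both results are true: the outcome of the comparison follows the codepoints,
not the bytes that happened to hold them.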

> Here's what the NSString class reference says about 'isEqualToString:':
> 
>> The comparison uses the canonical representation of strings, which for a 
>> particular string is the length of the string plus the Unicode characters 
>> that make up the string. When this method compares two strings, if the 
>> individual Unicodes are the same, then the strings are equal, regardless of 
>> the backing store. “Literal” when applied to string comparison means that 
>> various Unicode decomposition rules are not applied and Unicode characters 
>> are individually compared. So, for instance, “Ö” represented as the composed 
>> character sequence “O” and umlaut would not compare equal to “Ö” represented 
>> as one Unicode character.
> 
> This makes absolutely no sense unless the word "character" is here understood 
> to mean "component".

Well, I would say "codepoint" is more proper.  "O", the combining umlaut 
(diaeresis), and "Ö" are all distinct codepoints.  They are not components, 
although they can be represented by components.

> Under this interpretation, NSString has no real codepoint by codepoint 
> comparison. However, I believe that each codepoint is represented by a 
> *unique* UTF-16 component sequence, so a literal comparison amounts to the 
> same thing as a codepoint by codepoint comparison.
> 
> Am I still on track here?

Well, you're correct that a component-by-component comparison is equivalent to 
a codepoint-by-codepoint comparison.  I disagree that NSString doesn't have the 
latter.  Because of the equivalence, I suppose it may be a matter of 
perspective.
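
The equivalence can be checked with a short Python sketch (the helper name
utf16_units is made up, and the sample strings are arbitrary). For every pair
of samples, comparing the codepoint sequences and comparing the UTF-16 unit
sequences give the same answer:

```python
def utf16_units(s):
    """Return the list of 16-bit UTF-16 code units that encode s."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

samples = ["A", "\u00D6", "O\u0308", "\U0001D11E"]

# Equality by codepoints agrees with equality by UTF-16 units for every pair.
agree = all(
    (a == b) == (utf16_units(a) == utf16_units(b))
    for a in samples
    for b in samples
)
print(agree)
```

Note that "\u00D6" and "O\u0308" differ under both comparisons, which is
exactly the "literal" behavior described above.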

Regards,
Ken

_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
