On May 26, 2011, at 22:56, Andrew Thompson wrote:

> I believe this stems from a period in history when the Unicode group believed 
> they'd be able to fit all practical scripts into 65,536 code points, which 
> meant you could get away with all kinds of assumptions, like 16-bit types 
> and UCS-2. 
> 
> As it became clear that wasn't going to be enough code points, the additional 
> planes were defined and UCS-2 fell out of favor, replaced by UTF-16, which 
> can encode the higher planes. 

That would explain the parting of the ways between "code unit" and "code 
point", but not really the distinction between "code point" and "[Unicode] 
character". My memory of the days when Unicode first started to get a foothold 
(the early 90s IIRC) is very hazy, but I think there were actually two things 
going on:

-- The belief, exactly as you describe, that 65536 was enough.

-- A vagueness (or perhaps a deliberate lack of definition) about what should 
be called a "character".

This seems to have been resolved now, and we have this hierarchy, at least in 
Unicode/Apple terms:

        code unit -> code point -> character -> grapheme -> (whatever the 
        grouping is called upon which transformations like upper and lower 
        case are performed)

It's not ultimately so hard, just a bit perilous for the unwary. That's the 
reason I've been going on about this ad nauseam. If we shine some light on it, 
we may help demystify it.
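
For concreteness, here's a minimal sketch of the first few levels in Foundation 
terms. The test string (e + combining acute, plus one character outside the BMP) 
is just an example I've picked; the point is that -length counts UTF-16 code 
units, a UTF-32 conversion counts code points, and 
-rangeOfComposedCharacterSequenceAtIndex: walks grapheme clusters:

#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // "é" as e + combining acute (two code points, one grapheme),
        // followed by U+1F600, which lies outside the BMP.
        NSString *s = @"e\u0301\U0001F600";

        // UTF-16 code units: -length counts these, so the non-BMP
        // character contributes a surrogate pair (2 units).
        NSLog(@"code units:  %lu", (unsigned long)[s length]);

        // Code points: encode as UTF-32 (little-endian, so no BOM)
        // and divide the byte count by 4.
        NSUInteger codePoints =
            [s lengthOfBytesUsingEncoding:NSUTF32LittleEndianStringEncoding] / 4;
        NSLog(@"code points: %lu", (unsigned long)codePoints);

        // Grapheme clusters ("characters" as the user sees them):
        // walk the composed character sequences.
        NSUInteger graphemes = 0;
        for (NSUInteger i = 0; i < [s length]; ) {
            NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:i];
            graphemes++;
            i = NSMaxRange(r);
        }
        NSLog(@"graphemes:   %lu", (unsigned long)graphemes);
    }
    return 0;
}

On that string the three counts come out 4, 3 and 2 respectively, which is 
exactly the sort of divergence that trips up the unwary.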
