At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
>Unicode is a character set. Period. 


Well, maybe. But in a much broader sense than the character sets it subsumes in its 
listings. Each character has numerous properties in Unicode, whereas characters in 
legacy character sets generally don't.
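
To make "numerous properties" concrete, here is a minimal sketch using Python's 
standard unicodedata module. The choice of Python and the helper name describe() are 
assumptions made for illustration; neither comes from this thread. It only looks up a 
handful of the properties the Unicode Character Database records for a character.

```python
import unicodedata


def describe(ch):
    """Return a few Unicode Character Database properties for one character.

    describe() is a hypothetical helper written for this illustration.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),         # general category, e.g. Lu
        "bidi_class": unicodedata.bidirectional(ch),  # e.g. L, R, ON
        "combining": unicodedata.combining(ch),       # canonical combining class
        "mirrored": bool(unicodedata.mirrored(ch)),   # mirrored in bidi text?
    }


if __name__ == "__main__":
    # A Latin letter, a Hebrew letter, and a bracket that mirrors in RTL text.
    for ch in ("A", "\u05d0", "("):
        print(repr(ch), describe(ch))
```

A legacy 8-bit character set would typically give you only the code value for each of 
these; the rest of the information above has no counterpart there.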

Maybe Unicode is more of a shared set of rules that apply to the low-level data 
structures surrounding text and its algorithms than a character set.

>The Unicode consortium very wisely keeps its focus narrow. It provides
>a mechanism for specifying characters. Not for manipulating them, not
>for describing them, not for making them twinkle.

All true, except for some special cases (BOM, bidi issues and algorithms, vertical 
variants, etc.). I'm not saying those shouldn't be in there, just that they are useful 
only in the context of algorithms that are either explicit (bidi) or assumed (upper 
case/lower case, vertical/horizontal).
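
Two of those special cases can be sketched in a few lines. This is an illustration 
only, again assuming Python and its standard codecs module (not something the thread 
mentions); decode_without_bom() is a hypothetical helper name.

```python
import codecs


def decode_without_bom(data):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present.

    Hypothetical helper for illustration; it relies on Python's
    'utf-8-sig' codec, which skips the BOM when decoding.
    """
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    raw = codecs.BOM_UTF8 + "text".encode("utf-8")

    # Plain UTF-8 decoding keeps the BOM as a U+FEFF character in the string.
    print(repr(raw.decode("utf-8")))
    # The BOM-aware codec treats it as a signature and removes it.
    print(repr(decode_without_bom(raw)))

    # Case conversion is an algorithm, not a one-to-one table lookup:
    # the single character U+00DF uppercases to two characters.
    print("stra\u00dfe".upper())
```

In both cases the character only makes sense together with the algorithm that 
interprets it.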

In many cases, these algorithms are not well known, even amongst the cognoscenti, nor 
generally available in nice libraries. Anyone for an open source Japanese word 
splitting library? (I know that not taking a look at ICU before I press send is going 
to come back to haunt me on this, but if it is in there, then substitute something 
that isn't :)

Barry Caplan
www.i18n.com

