In re CJK, that's already a FAQ: http://www.unicode.org/faq/han_cjk.html#16.
The short version is: if all you want to do is to draw something, then yes,
making up new hanzi on the fly is a solvable problem. If you want to do
anything that deals with the *content* (lexical analysis, sorting,
text-to-speech), it's an incredibly difficult problem.
And, actually, there's already a way to insert nonstandard hanzi into text
(well, two, if you count the Ideographic Variation Indicator), namely
Ideographic Description Sequences. They're clumsy and awkward, but they do
make it possible to exchange text with unencoded hanzi in a vaguely standard
fashion.
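To make the shape of an Ideographic Description Sequence concrete, here is a rough Python sketch of a well-formedness check. It assumes only the twelve original description characters U+2FF0..U+2FFB (later additions are ignored) and treats any other character as a component. The names IDC_ARITY, parse_ids and is_well_formed_ids are made up for this illustration and are not part of any library.

```python
# Ideographic Description Characters U+2FF0..U+2FFB and their operand counts.
# Assumption: only the original twelve IDCs are handled here.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # left to middle and right
IDC_ARITY["\u2FF3"] = 3  # above to middle and below


def parse_ids(text, pos=0):
    """Parse one description starting at text[pos]; return (tree, next_pos)."""
    if pos >= len(text):
        raise ValueError("description ends before all operands are supplied")
    ch = text[pos]
    pos += 1
    if ch not in IDC_ARITY:
        return ch, pos  # a plain component, e.g. an encoded ideograph
    operands = []
    for _ in range(IDC_ARITY[ch]):
        node, pos = parse_ids(text, pos)
        operands.append(node)
    return (ch, *operands), pos


def is_well_formed_ids(text):
    """True if text is exactly one complete description sequence."""
    try:
        _, end = parse_ids(text)
    except ValueError:
        return False
    return end == len(text)


# U+2FF0 (left to right) followed by U+6728 twice describes two "tree"
# components side by side.
two_trees = "\u2FF0\u6728\u6728"
ok = is_well_formed_ids(two_trees)
```

A check like this only says the sequence is structurally complete. It says nothing about what the described character means or how it sorts, which is the hard part mentioned above.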
And yes, Unicode is very complicated, but that's because of the problem it's
intended to solve. If all you're interested in is drawing text in a couple of
common scripts, such as Latin and Japanese, then you really don't need Unicode
with all of its complexity. Unicode is trying to provide a basis for handling
all aspects of plain text processing for all the languages of the world in a
single application.
Just go to Wikipedia and look down the long list of different languages that a
popular subject has articles in. *That* is what Unicode is trying to provide.
It's very tough to implement, but fortunately on all the major platforms, there
are libraries that make it unnecessary for you to do all the work yourself.
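As one small illustration of leaning on a library, the sketch below uses Python's standard unicodedata module to put combining marks into canonical order, so that two differently typed sequences compare equal. The sample strings and variable names are my own, chosen only for the example.

```python
import unicodedata

# "a" with two combining marks, typed in two different orders:
# U+0301 COMBINING ACUTE ACCENT (combining class 230, above the base)
# U+0323 COMBINING DOT BELOW    (combining class 220, below the base)
typed = "a\u0301\u0323"
other = "a\u0323\u0301"

# NFD applies canonical ordering: marks are sorted by ascending combining
# class, so both inputs end up as a, U+0323, U+0301.
nfd_typed = unicodedata.normalize("NFD", typed)
nfd_other = unicodedata.normalize("NFD", other)

same = nfd_typed == nfd_other
```

The application never has to know the ordering rules itself; it normalizes both strings and compares the results.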
On 20 Sep 2011, at 9:01 PM, QSJN 4 UKR wrote:
Yes, I wrote 'Egyptian hieroglyphs', but how about plain old CJK? We
still have no way to insert a nonstandard ideogram into text. Isn't it
a simple task? There are just 20 basic strokes :) OK, 500 basic
symbols. Or 20? Yet we can't combine them :( !
Unicode is too complex a standard. I don't even know how many properties
a single character has (did you know about Unicode-coloured characters?
That was my thread somewhere on this list), so how can I know how my
application has to render 'plain' text with bidi, non-canonically ordered
diacritics, and Korean script? Right, I don't know that. So my
application renders it one way and someone else's another (a_a / aa_ for a
double combining character; surely you have seen that), so in effect we
have no standard at all.
Of course, I can learn this complex standard, but what for? Most of
it I will never use.
There must be a simpler system that does not need so much a priori data to work.
2011/9/13, John H. Jenkins jenk...@apple.com:
On 12 Sep 2011, at 9:06 PM, QSJN 4 UKR wrote:
I know it is a sacred cow, but let me just ask what you people think.
Is it good or bad that the codepoint says everything about a character: what,
where, how... (see subject)? Maybe if we had separate graphic control
codes, we wouldn't have so many problems, from the banal ltr (( rtl instead
of ltr (rtl) to placing one tilde above 3, 4, or more letters, or Egyptian
hieroglyphs in rows and columns. Conceptually, I mean! Each letter in the
text would be at least two codepoints (what and where) in the file. Is that
stupid? To render the text we must generate this data anyway.
It's not really a sacred cow per se, but it is a fundamental architectural
decision which would be pretty much impossible to revisit now.
Almost all writing is done using a small set of script-specific rules which
are pretty straightforward. English, for example, is laid out in horizontal
lines running left-to-right and arranged top-to-bottom of the writing
surface. East Asian languages were traditionally laid out in vertical lines
running from top-to-bottom and arranged right-to-left on the writing
surface.
Because some scripts are right-to-left and ltr and rtl text can be freely
intermingled on a single line, Unicode provides plain-text directionality
controls. The preference, however, is to use higher-level protocols where
possible.
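For illustration, here is a small Python sketch of the plain-text directionality controls in use: a right-to-left run is wrapped in U+202B RIGHT-TO-LEFT EMBEDDING and U+202C POP DIRECTIONAL FORMATTING, and the standard unicodedata module reports the bidi class of each character. The helper embed_rtl and the sample sentence are invented for the example; later versions of Unicode also added isolate controls, which are not shown here.

```python
import unicodedata

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed_rtl(run):
    """Wrap a run of text in an explicit right-to-left embedding."""
    return f"{RLE}{run}{PDF}"


hebrew = "\u05E9\u05DC\u05D5\u05DD"  # a Hebrew word, stored in logical order
line = "The word " + embed_rtl(hebrew) + " means peace."

# Bidi classes that a rendering library consults when ordering the line:
# Latin letter, Hebrew letter, and the two controls.
classes = [unicodedata.bidirectional(c) for c in ("a", hebrew[0], RLE, PDF)]
```

The text stays in logical (typing) order in memory; the controls only tell the display layer how to resolve direction, which is why markup or another higher-level protocol can do the same job.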
As for the scripts which are inherently two-dimensional (such as
hieroglyphics, mathematics, and music), it's almost impossible to provide
plain text support for them. There is too much dependence on additional
information such as the specifics of font and point size. Because of this,
the UTC decided long ago that layout for such scripts absolutely must be
done using a higher-level protocol to handle all the details.
There are occasionally suggestions that positioning controls be added to
plain text in Unicode, but so far the UTC has felt that the benefits are too
marginal to overcome its reasons for having left them out in the first
place.
=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com