Danilo Segan wrote:

srintuar wrote:


1) For printf("%s\n", "Schöne Grüße");


...
Given that UTF-8 is sort of an endpoint in the evolution of encodings,
I also consider option 1 to be perfectly valid.


I would be careful with such statements. We don't know what the successor
of UTF-8 might look like, nor when it will appear (in 6 years? 10 years?
15 years?). But predictions like "A personal computer will never need more
than 640 KB of RAM" have too frequently turned out to be wrong.



I'd second that. Unicode itself has quite a few quirks, and UTF-8, as a "Transformation Format" of it, is no better.

I'd be very disappointed if Unicode/UTF-8 were really the endpoint in
the evolution of encodings (for instance, I'd prefer to be able to
encode Serbian-language content with a single set of codepoints, since
it uses either Latin or Cyrillic script, and there's an [almost]
bijective mapping between them).  Those making heavy use of "CJK"
glyphs would surely have more objections, and undoubtedly, many others.



Unlike the famous Gates quote, it is reasonable to state that certain
things represent endpoints. For example, a 64-bit counter of seconds
will probably be enough.
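Back of the envelope, just to make that concrete (plain C, nothing
assumed beyond <stdint.h>):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* A signed 64-bit count of seconds since the epoch. */
    int64_t max_seconds = INT64_MAX;

    /* About 31.6 million seconds per year (365.25 days). */
    double years = (double)max_seconds / (365.25 * 24 * 60 * 60);

    printf("a 64-bit seconds counter overflows in %.2e years\n", years);
    return 0;
}

That prints roughly 2.9e+11 years, which should outlast any plausible
successor debate.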



Language itself will change over time, though not as fast as it has in
the past.  But I imagine we can at least avoid having a mixed bag of
incompatible encodings for the foreseeable future. It's not a bad place
to be either, imo. Encoding reform is mostly a sad tale of broken
apps, corrupted files, garbled displays, immiscible scripts, etc.  A
world with 20 years of solid support for Unicode/UTF-8 would be free to
work on more interesting problems.




As for Serbian, I don't think that really has much to do with Unicode
itself. You could apply a special folding algorithm when doing searches
in a Serbian context (a sketch follows below), but I don't think you
would want to make the script ambiguous.

Defining a "Serbian Unified" code range would probably be futile: even
after such a range was defined, little could stop someone from
representing Serbian in Cyrillic or Latin script, and then you would
have three strings to search for instead of one or two. If you have two
official scripts, you have two official scripts.
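To sketch what such a fold might look like (hypothetical code: the
fold_serbian() helper and its table are mine, only a handful of
letters are covered, and the source file itself has to be saved as
UTF-8 for the Cyrillic literals to work):

#include <stdio.h>
#include <string.h>

struct map { const char *cyr; const char *lat; };

/* Partial Cyrillic-to-Latin table; a real one would cover the whole
 * alphabet, including the digraph letters Lj, Nj, Dž. */
static const struct map tab[] = {
    { "А", "a" }, { "а", "a" }, { "Б", "b" }, { "б", "b" },
    { "Г", "g" }, { "г", "g" }, { "Д", "d" }, { "д", "d" },
    { "Е", "e" }, { "е", "e" }, { "О", "o" }, { "о", "o" },
    { "Р", "r" }, { "р", "r" }, { "Љ", "lj" }, { "љ", "lj" },
    /* ...remaining letters elided... */
};

/* Write a Latin-folded copy of UTF-8 input 'src' into 'dst'. */
static void fold_serbian(const char *src, char *dst, size_t dstlen)
{
    size_t out = 0;
    while (*src && out + 3 < dstlen) {
        size_t i, n = sizeof tab / sizeof tab[0];
        for (i = 0; i < n; i++) {
            size_t len = strlen(tab[i].cyr);
            if (strncmp(src, tab[i].cyr, len) == 0) {
                out += (size_t)snprintf(dst + out, dstlen - out, "%s",
                                        tab[i].lat);
                src += len;
                break;
            }
        }
        if (i == n)             /* not in the table: copy the byte */
            dst[out++] = *src++;
    }
    dst[out] = '\0';
}

int main(void)
{
    char buf[128];
    fold_serbian("Београд", buf, sizeof buf);
    printf("%s\n", buf);        /* prints "beograd" */
    return 0;
}

The point being that the fold lives in the search code, where it can
be tuned per language, not in the encoding.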



A similar issue arises with Vietnamese: native speakers often leave
out the diacritical marks in informal communication, or leave them
out only partially. A normalization fold down to ASCII may give you
too many false hits, while searching only for properly spelled text
may miss too many. That doesn't mean anything is wrong with Unicode,
though.
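For the Vietnamese case you don't even need your own table: glibc's
iconv has a //TRANSLIT extension that folds toward the target charset
(glibc-specific, and the result depends on the locale's
transliteration tables; characters with no mapping come out as '?'):

#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <iconv.h>

int main(void)
{
    setlocale(LC_ALL, "");   /* transliteration is locale-dependent */

    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    char src[] = "Tiếng Việt";  /* source file must be UTF-8 */
    char dst[64];
    char *in = src, *out = dst;
    size_t inleft = strlen(src), outleft = sizeof dst - 1;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
        perror("iconv");
    *out = '\0';

    printf("%s\n", dst);        /* ideally "Tieng Viet" */
    iconv_close(cd);
    return 0;
}

Whether that kind of fold helps recall or just adds false hits is a
search-quality question, though, not an encoding one.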

As for CJK un-unification, I find that highly unlikely. Even if
certain codepoints were disambiguated for some reason, that would
still fit within a backwards-compatible modification of Unicode, not
a complete scrapping of it. Speaking of political issues: if you
advocate un-unification, I'd be interested to hear which language's
script you would place closest to zero.



Changes to Unicode don't require any change to UTF-8 itself.
Sacrificing native-script comments and string literals isn't worth it
for something that may never change, imo.
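It's easy to see why: the whole encoder fits on one screen. A minimal
sketch (my code, covering the full scalar range and rejecting
surrogates):

#include <stdio.h>
#include <stdint.h>

/* Encode one code point (U+0000..U+10FFFF, minus surrogates) into
 * buf; returns the byte count, or 0 if the value is not encodable. */
static int utf8_encode(uint32_t cp, unsigned char buf[4])
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not valid scalars */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x10FFFF, buf);  /* highest scalar value */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    printf("\n");                           /* F4 8F BF BF */
    return 0;
}

Nothing in there cares whether a given code point is assigned yet, so
adding characters to Unicode leaves every UTF-8 encoder and decoder
untouched.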






