On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
> I remember those bad ole days of gratuitously-incompatible encodings. I wish those days will never ever return again. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you get lucky. Not only so, the same language often has multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R).
This is an argument for UCS, not UTF-8.
> Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road.
This problem already exists for UTF-8, breaking ASCII compatibility in the process:
http://en.wikipedia.org/wiki/Byte_order_mark

Well, at the very least it's adding garbage bytes at the front, just as my header would. ;)
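
To make that concrete, here's a quick D snippet (the sample strings are mine):

import std.stdio;

void main()
{
    string plain  = "hello";
    string bommed = "\uFEFF" ~ plain; // U+FEFF encodes as EF BB BF in UTF-8

    // The payload is pure ASCII, yet the stream no longer starts with an
    // ASCII byte, so naive byte-level tools no longer see what they expect.
    writefln("first byte: 0x%02X", cast(ubyte) bommed[0]); // 0xEF, not 'h'
    assert(bommed.length == plain.length + 3);             // three bytes of overhead
    assert(bommed[0 .. plain.length] != plain);            // prefix check now fails
}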
> For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José!
No, I'm suggesting going back to one element of that "old world," single-byte encodings, but using UCS or some other standardized character set to avoid all those incompatible code pages you had to deal with (a rough sketch follows).
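
I haven't spelled out a byte-level format, so take this only as a sketch: the struct, the names, and the exact layout are all made up, not a real or proposed standard. The idea is a tiny header naming a 128-character UCS window, with ASCII in the lower half and window offsets in the upper half:

// Hypothetical single-byte format: the header names a 128-entry window
// into UCS, and each payload byte above 0x7F indexes into that window.
struct SingleByteText
{
    dchar windowBase;            // e.g. 0x0400 for the Cyrillic block
    immutable(ubyte)[] payload;  // exactly one byte per character

    dchar decode(size_t i) const
    {
        immutable b = payload[i];
        if (b < 0x80)
            return b;                                // lower half: plain ASCII
        return cast(dchar)(windowBase + (b - 0x80)); // upper half: window offset
    }
}

void main()
{
    import std.stdio : write, writeln;
    // "Да": 'Д' = U+0414 -> byte 0x94, 'а' = U+0430 -> byte 0xB0
    auto t = SingleByteText(0x0400, [0x94, 0xB0]);
    foreach (i; 0 .. t.payload.length)
        write(t.decode(i));
    writeln(); // prints: Да
}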
> If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside BMP, which is very rare.
True, but you're still doubling your string size with UTF-16 and non-ASCII text. My concerns are the following, in order of importance (a quick measurement follows the list):

1. Lost programmer productivity due to these dumb variable-length encodings. That is the biggest loss from UTF-8's complexity.
2. Lost speed and memory, from using an unnecessarily complex variable-length encoding, or from translating everything to 32-bit UTF-32 to get back to constant width.
3. Lost bandwidth from using a fatter encoding.
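
To put rough numbers on #2 and #3, here's a quick D measurement (the Russian sample word is arbitrary, and KOI8-R is just one example of a single-byte alternative):

import std.stdio;

void main()
{
    string  u8  = "Привет";   // UTF-8
    wstring u16 = "Привет"w;  // UTF-16
    dstring u32 = "Привет"d;  // UTF-32

    writefln("characters: %s", u32.length);                  // 6
    writefln("UTF-8 : %s bytes", u8.length);                 // 12 (2 bytes per char)
    writefln("UTF-16: %s bytes", u16.length * wchar.sizeof); // 12
    writefln("UTF-32: %s bytes", u32.length * dchar.sizeof); // 24
    // A single-byte encoding such as KOI8-R would need 6 bytes.
}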
> As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operates directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.
There is no way this can ever be as performant as a constant-width single-byte encoding.
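
The reason is plain if you write out what "operating directly on UTF-8" has to do. Here's my own sketch of the idea, not Dmitry's actual std.uni API: fast-path the single-byte range and decode only when forced.

import std.ascii : isASCIIAlpha = isAlpha;
import std.uni   : isAlpha;
import std.utf   : decode;

// Count alphabetic code points in UTF-8 text, taking a single-byte
// fast path for ASCII and decoding only multi-byte sequences.
size_t countAlpha(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        if (s[i] < 0x80)                      // ASCII: no decoding needed
        {
            if (isASCIIAlpha(s[i])) ++n;
            ++i;
        }
        else
        {
            immutable dchar c = decode(s, i); // decodes one code point, advances i
            if (isAlpha(c)) ++n;
        }
    }
    return n;
}

void main()
{
    assert(countAlpha("abc Привет 123") == 9);
}

Every non-ASCII character pays for the branch and the decode; a constant-width single-byte encoding pays for neither.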
> +1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along.
>
> In the bad ole days, HTML could be served in any random number of encodings, often out-of-sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but are actually fundamentally b0rken. Sometimes webpages would show up mostly-intact, but with a few characters mangled, because of deviations / variations on codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on.
I disagree. This is not an indictment of multiple encodings; it is an indictment of multiple unspecified or _broken_ encodings. Given how difficult UTF-8 is to get right, all you've likely done is replace multiple broken encodings with a single encoding that has multiple broken implementations.
> UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this.
Have you ever used this self-synchronizing feature of UTF-8? Have you ever heard of anyone using it? There is no reason this kind of limited data-integrity checking should be rolled into the encoding. Maybe it made sense two decades ago, when everyone had plans to stream text or something, but nobody does that nowadays. Just put a checksum in your header and you're good to go.
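
For reference, both halves of this trade-off are tiny in D (crc32Of comes from std.digest.crc; the resync loop is just UTF-8's continuation-byte bit pattern):

import std.digest.crc : crc32Of;

// UTF-8 self-synchronization: continuation bytes all look like 10xxxxxx,
// so after corruption you can skip to the next lead (or ASCII) byte and
// resume decoding from there.
size_t resync(const(ubyte)[] data, size_t i)
{
    while (i < data.length && (data[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

// The header-checksum alternative: verify the whole payload up front
// instead of trying to recover mid-stream.
bool verify(const(ubyte)[] payload, ubyte[4] storedCrc)
{
    return crc32Of(payload) == storedCrc;
}

The resync loop only guesses where the next code point starts after corruption; the checksum tells you outright whether the data survived.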
Unicode is still a "codepage-based encoding"; nothing has changed in that regard. All UCS did was standardize a bunch of pre-existing code pages, so that some of the redundancy was taken out. Unfortunately, the UTF-8 encoding then bloated the transmission format and tempted devs to use this unnecessarily complex format for processing too.