On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
> I remember those bad ole days of gratuitously-incompatible encodings. I hope those days never return. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you got lucky.
> Not only that, the same language often had multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they were if you were lucky, or, if you were unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R).
This is an argument for UCS, not UTF-8.

> Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road.
This problem already exists for UTF-8, breaking ASCII compatibility in the process:

http://en.wikipedia.org/wiki/Byte_order_mark

Well, at the very least it adds garbage data to the front of ASCII text, just as my header would do. ;)
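
To make that concrete, here's a tiny D sketch (the stripping logic is just mine, for illustration): the BOM is nothing but three non-ASCII bytes glued to the front, which tools then have to detect and strip by hand.

import std.stdio;

void main()
{
    // The UTF-8 BOM is the byte sequence EF BB BF prepended to the text.
    immutable ubyte[] bom = [0xEF, 0xBB, 0xBF];
    string withBom = "\uFEFFhello";   // U+FEFF, encoded in UTF-8

    // An ASCII-only consumer just sees three non-ASCII bytes up front:
    foreach (b; cast(const(ubyte)[]) withBom[0 .. 3])
        writef("%02X ", b);           // EF BB BF
    writeln();

    // So tools end up detecting and stripping it by hand:
    auto bytes = cast(const(ubyte)[]) withBom;
    string text = bytes.length >= 3 && bytes[0 .. 3] == bom
        ? withBom[3 .. $]
        : withBom;
    writeln(text);                    // hello
}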

> For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José!
No, I'm suggesting going back to one element of that "old world": single-byte encodings, but using UCS or some other standardized character set to avoid all those incompatible code pages you had to deal with.

> If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside BMP, which is very rare.
True, but you're still doubling your string size with UTF-16 and non-ASCII text. My concerns are the following, in order of importance:

1. Lost programmer productivity due to these dumb variable-length encodings. That is the biggest loss from UTF-8's complexity.

2. Lost speed and memory, either from processing an unnecessarily complex variable-length encoding or from translating everything to 32-bit UTF-32 to get back to constant width (see the sketch after this list).

3. Lost bandwidth from using a fatter encoding.
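
To put points 1 and 2 in concrete terms, a small sketch using today's Phobos (the example string is arbitrary):

import std.range : walkLength;
import std.stdio;
import std.utf : toUTF32;

void main()
{
    string s = "naïve";      // 5 code points, 6 UTF-8 code units

    // Point 1: byte length and character count silently disagree.
    writeln(s.length);       // 6 -- bytes, not characters
    writeln(s.walkLength);   // 5 -- but this has to decode the whole string

    // Slicing by byte offset can cut a multi-byte sequence in half:
    // auto bad = s[0 .. 3]; // ends mid-'ï', leaving invalid UTF-8

    // Point 2: the usual escape hatch is transcoding to constant-width
    // UTF-32, which quadruples the storage for mostly-ASCII text.
    dstring wide = s.toUTF32;
    writeln(wide.length);    // 5, at 4 bytes per element
}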

> As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operates directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer implicitly convert to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.
There is no way this can ever be as performant as a constant-width single-byte encoding.
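
A rough sketch of why, using nothing but std.utf.decode (the helper function is mine, purely for illustration):

import std.stdio;
import std.utf : decode;

// In UTF-8, reaching the n-th code point means decoding everything before it.
dchar nthCodePoint(string s, size_t n)
{
    size_t idx = 0;
    dchar c;
    foreach (i; 0 .. n + 1)
        c = decode(s, idx);  // advances idx past one code point per call
    return c;
}

void main()
{
    string utf8 = "päivää";
    writeln(nthCodePoint(utf8, 4));   // 'ä', after scanning everything before it

    // With a constant-width single-byte encoding, the same lookup is one index.
    ubyte[] latin1 = [0x70, 0xE4, 0x69, 0x76, 0xE4, 0xE4]; // "päivää" in ISO-8859-1
    writeln(latin1[4]);               // 228 (0xE4), no decoding needed
}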

> +1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along.

> In the bad ole days, HTML could be served in any number of random encodings, often out of sync with what the server claimed the encoding was, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but were actually fundamentally b0rken. Sometimes webpages would show up mostly intact, but with a few characters mangled, because of variations in codepage interpretation or non-standard characters being used in a particular encoding. It was a total, utter mess that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on.
I disagree. This is not an indictment of multiple encodings, it is one of multiple unspecified or _broken_ encodings. Given how difficult UTF-8 is to get right, all you've likely done is replace multiple broken encodings with a single encoding with multiple broken implementations.
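
One example of the kind of bug I mean (the check below is my own sketch, not anything from Phobos): a conforming UTF-8 decoder has to reject "overlong" sequences, and plenty of hand-rolled decoders historically didn't.

import std.stdio;

// Overlong sequences encode a code point in more bytes than necessary
// (e.g. C0 80 for NUL). A conforming decoder must reject them; many
// hand-rolled decoders historically did not.
bool hasOverlongTwoByteLead(const(ubyte)[] bytes)
{
    foreach (b; bytes)
        if (b == 0xC0 || b == 0xC1)   // these lead bytes only start overlong forms
            return true;
    return false;
}

void main()
{
    immutable ubyte[] overlongNul = [0xC0, 0x80];
    writeln(hasOverlongTwoByteLead(overlongNul));                      // true
    writeln(hasOverlongTwoByteLead(cast(immutable(ubyte)[]) "héllo")); // false
}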

> UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this.
Have you ever used this self-synchronizing feature of UTF-8? Have you ever heard of anyone using it? There is no reason this kind of limited data-integrity checking should be rolled into the encoding. Maybe it made sense two decades ago, when everyone had plans to stream text or something, but nobody does that nowadays. Just put a checksum in your header and you're good to go.
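
Something like this, using Phobos's std.digest.crc (the 4-byte-CRC-then-payload layout is just an illustration, not any real format):

import std.digest.crc : crc32Of;
import std.stdio;

void main()
{
    // Hypothetical wire format: a 4-byte CRC32, then the payload, rather
    // than relying on the text encoding itself to flag corruption.
    auto payload = cast(ubyte[]) "some single-byte text".dup;
    ubyte[4] crc = crc32Of(payload);
    ubyte[] message = crc[] ~ payload;

    // The receiver recomputes the checksum and compares before trusting the data.
    ubyte[4] check = crc32Of(message[4 .. $]);
    bool intact = check[] == message[0 .. 4];
    writeln(intact); // true
}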

Unicode is still a "codepage-based encoding": nothing has changed in that regard. All UCS did was standardize a bunch of pre-existing code pages so that some of the redundancy was taken out. Unfortunately, the UTF-8 encoding then bloated the transmission format and tempted devs to use this unnecessarily complex format for processing too.
