On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
> I remember those bad ole days of gratuitously-incompatible encodings. I wish those days will never ever return again. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you get lucky. Not only so, the same language often has multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R).
This is an argument for UCS, not UTF-8.
> Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road.
This problem already exists for UTF-8, breaking ASCII compatibility in the process:
http://en.wikipedia.org/wiki/Byte_order_mark

Well, at the very least it's adding garbage bytes at the front, just as my header would. ;)
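
To make that concrete, here's a quick D snippet (the sample strings are mine):

import std.stdio;

void main()
{
    string plain  = "hello";
    string bommed = "\uFEFF" ~ plain; // U+FEFF encodes as EF BB BF in UTF-8

    // The payload is pure ASCII, yet the stream no longer starts with an
    // ASCII byte, so naive byte-level tools no longer see what they expect.
    writefln("first byte: 0x%02X", cast(ubyte) bommed[0]); // 0xEF, not 'h'
    assert(bommed.length == plain.length + 3);             // three bytes of overhead
    assert(bommed[0 .. plain.length] != plain);            // prefix check now fails
}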
> For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José!
No, I'm suggesting going back to one element of that "old world," single-byte encodings, but using UCS or some other standardized character set to avoid all those incompatible code pages you had to deal with (a rough sketch follows).
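
I haven't spelled out a byte-level format, so take this only as a sketch: the struct, the names, and the exact layout are all made up, not a real or proposed standard. The idea is a tiny header naming a 128-character UCS window, with ASCII in the lower half and window offsets in the upper half:

// Hypothetical single-byte format: the header names a 128-entry window
// into UCS, and each payload byte above 0x7F indexes into that window.
struct SingleByteText
{
    dchar windowBase;            // e.g. 0x0400 for the Cyrillic block
    immutable(ubyte)[] payload;  // exactly one byte per character

    dchar decode(size_t i) const
    {
        immutable b = payload[i];
        if (b < 0x80)
            return b;                                // lower half: plain ASCII
        return cast(dchar)(windowBase + (b - 0x80)); // upper half: window offset
    }
}

void main()
{
    import std.stdio : write, writeln;
    // "Да": 'Д' = U+0414 -> byte 0x94, 'а' = U+0430 -> byte 0xB0
    auto t = SingleByteText(0x0400, [0x94, 0xB0]);
    foreach (i; 0 .. t.payload.length)
        write(t.decode(i));
    writeln(); // prints: Да
}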
> If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside BMP, which is very rare.
True, but you're still doubling your string size with UTF-16 and non-ASCII text. My concerns are the following, in order of importance (a quick measurement follows the list):

1. Lost programmer productivity due to these dumb variable-length encodings. That is the biggest loss from UTF-8's complexity.
2. Lost speed and memory, from using an unnecessarily complex variable-length encoding, or from translating everything to 32-bit UTF-32 to get back to constant width.
3. Lost bandwidth from using a fatter encoding.
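
To put rough numbers on #2 and #3, here's a quick D measurement (the Russian sample word is arbitrary, and KOI8-R is just one example of a single-byte alternative):

import std.stdio;

void main()
{
    string  u8  = "Привет";   // UTF-8
    wstring u16 = "Привет"w;  // UTF-16
    dstring u32 = "Привет"d;  // UTF-32

    writefln("characters: %s", u32.length);                  // 6
    writefln("UTF-8 : %s bytes", u8.length);                 // 12 (2 bytes per char)
    writefln("UTF-16: %s bytes", u16.length * wchar.sizeof); // 12
    writefln("UTF-32: %s bytes", u32.length * dchar.sizeof); // 24
    // A single-byte encoding such as KOI8-R would need 6 bytes.
}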
> As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operates directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.
There is no way this can ever be as performant as a constant-width single-byte encoding.
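
The reason is plain if you write out what "operating directly on UTF-8" has to do. Here's my own sketch of the idea, not Dmitry's actual std.uni API: fast-path the single-byte range and decode only when forced.

import std.ascii : isASCIIAlpha = isAlpha;
import std.uni   : isAlpha;
import std.utf   : decode;

// Count alphabetic code points in UTF-8 text, taking a single-byte
// fast path for ASCII and decoding only multi-byte sequences.
size_t countAlpha(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        if (s[i] < 0x80)                      // ASCII: no decoding needed
        {
            if (isASCIIAlpha(s[i])) ++n;
            ++i;
        }
        else
        {
            immutable dchar c = decode(s, i); // decodes one code point, advances i
            if (isAlpha(c)) ++n;
        }
    }
    return n;
}

void main()
{
    assert(countAlpha("abc Привет 123") == 9);
}

Every non-ASCII character pays for the branch and the decode; a constant-width single-byte encoding pays for neither.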
> +1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along.
>
> In the bad ole days, HTML could be served in any random number of encodings, often out-of-sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but are actually fundamentally b0rken. Sometimes webpages would show up mostly-intact, but with a few characters mangled, because of deviations / variations on codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on.
I disagree. This is not an indictment of multiple encodings; it is an indictment of multiple unspecified or _broken_ encodings. Given how difficult UTF-8 is to get right, all you've likely done is replace multiple broken encodings with a single encoding that has multiple broken implementations.
> UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this.
Have you ever used this self-synchronizing feature of UTF-8? Have you ever heard of anyone using it? There is no reason this kind of limited data-integrity checking should be rolled into the encoding. Maybe it made sense two decades ago, when everyone had plans to stream text or something, but nobody does that nowadays. Just put a checksum in your header and you're good to go.
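
For reference, both halves of this trade-off are tiny in D (crc32Of comes from std.digest.crc; the resync loop is just UTF-8's continuation-byte bit pattern):

import std.digest.crc : crc32Of;

// UTF-8 self-synchronization: continuation bytes all look like 10xxxxxx,
// so after corruption you can skip to the next lead (or ASCII) byte and
// resume decoding from there.
size_t resync(const(ubyte)[] data, size_t i)
{
    while (i < data.length && (data[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

// The header-checksum alternative: verify the whole payload up front
// instead of trying to recover mid-stream.
bool verify(const(ubyte)[] payload, ubyte[4] storedCrc)
{
    return crc32Of(payload) == storedCrc;
}

The resync loop only guesses where the next code point starts after corruption; the checksum tells you outright whether the data survived.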
Unicode is still a "codepage-based encoding"; nothing has changed in that regard. All UCS did was standardize a bunch of pre-existing code pages, so that some of the redundancy was taken out. Unfortunately, the UTF-8 encoding then bloated the transmission format and tempted devs to use this unnecessarily complex format for processing too.