On 24-May-2013 21:05, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/toLower cannot be done in place if they are to handle all of Unicode.
Some characters change their length when converted to or from
uppercase. Examples are the German ß and some of the Turkish I variants.
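(A quick illustration of that length change, assuming Phobos's std.uni.toUpper applies the full Unicode case mapping, which expands ß to "SS":)

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : toUpper;

void main()
{
    string s = "straße";
    string u = s.toUpper();     // 'ß' expands to "SS" under the full case mapping
    writeln(u);                 // STRASSE
    writeln(s.walkLength, " -> ", u.walkLength);   // 6 -> 7 code points
}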

This triggered a long-standing bugbear of mine: why are we using these
variable-length encodings at all?  Does anybody really care about UTF-8
being "self-synchronizing," i.e. does anybody actually use that in this
day and age?  Sure, it's backwards-compatible with ASCII, and the vast
majority of usage is probably just ASCII, but that means the other
languages don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on everyone else.
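(For what it's worth, self-synchronization is used every time code lands at an arbitrary byte offset and has to resynchronize without decoding from the start: seeking, truncation, substring search. A minimal sketch relying only on the UTF-8 byte layout; the helper name is mine:)

import std.stdio : writeln;

// Back up from an arbitrary byte offset to the start of the enclosing
// code point: UTF-8 continuation bytes all look like 0b10xxxxxx.
size_t snapToCodePointStart(const(char)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "naïve";                    // 'ï' takes 2 bytes in UTF-8
    auto i = snapToCodePointStart(s, 3);   // offset 3 lands inside 'ï'
    writeln(i);                            // 2 -- the first byte of 'ï'
    writeln(s[i .. $]);                    // "ïve": a valid slice, no header needed
}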

I'd just use a single-byte header to signify the language and then put
the vast majority of languages in a single byte encoding, with the few
exceptional languages with more than 256 characters encoded in two
bytes.

You seem to think not only that UTF-8 is a bad encoding, but also that a single unified encoding (code space) is bad(?).

Separate code spaces were exactly what we had before Unicode (and UTF-8). The problem is not only that, without the header, the text is meaningless (no easy slicing), but also that the encoding of the data after the header depends on a whole variety of factors - a list of encodings, in effect. Everybody would have to keep a (code) page per language just to know whether it's 2 bytes per char, 1 byte per char, or whatever. And that still assumes there are no combining marks or region-specific quirks :)
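To make the slicing point concrete, here is a purely hypothetical sketch of such a header-plus-payload format (the struct, language ids and width table are all made up for illustration): a bare slice of the payload is useless without dragging the header and a per-language width table along.

// Hypothetical single-byte-header format, roughly as proposed above.
struct TaggedText
{
    ubyte lang;                    // made-up ids: 0 = ASCII, 1 = Greek, 2 = CJK
    immutable(ubyte)[] payload;
}

// Every consumer has to carry a table like this just to know the code-unit width.
immutable size_t[3] unitWidth = [1, 1, 2];

void main()
{
    immutable(ubyte)[] bytes = [0x4E, 0x2D, 0x65, 0x87]; // two made-up 2-byte chars
    auto t = TaggedText(2, bytes);

    // t.payload[1 .. 3] on its own is meaningless: without t.lang and the
    // width table you cannot tell where a character starts or ends, let
    // alone render it. A UTF-8 slice carries that information in-band.
    assert(t.payload.length % unitWidth[t.lang] == 0);
}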

In fact it was even "better": nobody ever talked about a header, they just assumed a codepage from some global setting. Imagine yourself creating a font rendering system under those conditions - a hell of an exercise in frustration (okay, how do I render 0x88? Hmm, if that is in codepage XYZ then ...).

OK, that doesn't cover multi-language strings, but that is what,
.000001% of usage?

This just shows you don't care about multilingual text at all. Imagine any language tutor/translator/dictionary on the Web. Most languages need to intersperse ASCII (also keep HTML markup in mind). Books often feature citations in the original language (or, say, Latin) alongside translations.

Now also take into account math symbols, currency symbols and beyond. These days cultures mix in wild combinations, so you might need to display text even if you can't read it. Unicode is not just about "encoding the characters of all languages"; it has to address the universal representation of the symbols used in writing systems at large.
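A single innocuous string already mixes markup, scripts, math and currency; a tiny sketch of why byte counts and character counts cannot be conflated:

import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // English + HTML markup + Cyrillic + math + currency, all in one string.
    string s = "<b>Привет</b> world: π ≈ 3.14159, price €9.99";
    writeln(s.length);       // UTF-8 code units (bytes)
    writeln(s.walkLength);   // Unicode code points -- a different number
}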

Make your header a little longer and you could handle
those also.  Yes, it wouldn't be strictly backwards-compatible with
ASCII, but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant by
tuomov, author of one of the first tiling window managers for Linux:

We want the monoculture! That is, to understand each other without all these "par-le-vu-france?" moments and codepages of varying complexity (insanity).

Want it small? Use a compression scheme - they work perfectly well and get you to the precious 1 byte per code point with exceptional speed:
http://www.unicode.org/reports/tr6/
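That report describes SCSU. Even without a tailored scheme, plain general-purpose deflate via std.zlib already shrinks multi-byte UTF-8 text well below its raw size on anything non-trivial. A rough sketch (not SCSU, not a benchmark; the repeated sample is artificial):

import std.array : replicate;
import std.stdio : writeln;
import std.zlib : compress;

void main()
{
    // Cyrillic text costs 2 bytes per code point in UTF-8.
    string text = replicate("Широка страна моя родная. ", 100);
    auto packed = compress(text);   // general-purpose deflate, not SCSU
    writeln(text.length, " raw bytes -> ", packed.length, " compressed");
}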

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

And borrowing an argument from that very rant: locale is borked when it comes to encodings. Locales should be used for tweaking presentation - number formats, date display and so on.

--
Dmitry Olshansky
