On 24-May-2013 21:05, Joakim wrote:
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/toLower cannot be done in place if they are to handle all of Unicode.
Some characters change their length when converted to or from
uppercase. Examples are the German ß and some of the Turkish I variants.
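(A quick illustration of that length change, assuming Phobos's std.uni.toUpper applies the full Unicode case mapping, which expands ß to "SS":)

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : toUpper;

void main()
{
    string s = "straße";
    string u = s.toUpper();     // 'ß' expands to "SS" under the full case mapping
    writeln(u);                 // STRASSE
    writeln(s.walkLength, " -> ", u.walkLength);   // 6 -> 7 code points
}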

This triggered a long-standing bugbear of mine: why are we using these
variable-length encodings at all?  Does anybody really care about UTF-8
being "self-synchronizing," i.e. does anybody actually use that in this
day and age?  Sure, it's backwards-compatible with ASCII, and the vast
majority of usage is probably just ASCII, but that means the other
languages don't matter anyway.  Not to mention taking the valuable 8-bit
real estate for English and dumping the longer encodings on everyone else.
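(For what it's worth, self-synchronization is used every time code lands at an arbitrary byte offset and has to resynchronize without decoding from the start: seeking, truncation, substring search. A minimal sketch relying only on the UTF-8 byte layout; the helper name is mine:)

import std.stdio : writeln;

// Back up from an arbitrary byte offset to the start of the enclosing
// code point: UTF-8 continuation bytes all look like 0b10xxxxxx.
size_t snapToCodePointStart(const(char)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "naïve";                    // 'ï' takes 2 bytes in UTF-8
    auto i = snapToCodePointStart(s, 3);   // offset 3 lands inside 'ï'
    writeln(i);                            // 2 -- the first byte of 'ï'
    writeln(s[i .. $]);                    // "ïve": a valid slice, no header needed
}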

I'd just use a single-byte header to signify the language and then put
the vast majority of languages in a single byte encoding, with the few
exceptional languages with more than 256 characters encoded in two
bytes.

You seem to think not only that UTF-8 is a bad encoding, but also that a single unified encoding (code space) is bad(?).

Separate code spaces were exactly what we had before Unicode (and UTF-8). The problem is not only that, without the header, the text is meaningless (no easy slicing), but also that the encoding of the data after the header depends on a whole variety of factors - a list of encodings, in effect. Everybody would have to keep a (code) page per language just to know whether it's 2 bytes per char, 1 byte per char, or whatever. And that still assumes there are no combining marks or region-specific quirks :)
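To make the slicing point concrete, here is a purely hypothetical sketch of such a header-plus-payload format (the struct, language ids and width table are all made up for illustration): a bare slice of the payload is useless without dragging the header and a per-language width table along.

// Hypothetical single-byte-header format, roughly as proposed above.
struct TaggedText
{
    ubyte lang;                    // made-up ids: 0 = ASCII, 1 = Greek, 2 = CJK
    immutable(ubyte)[] payload;
}

// Every consumer has to carry a table like this just to know the code-unit width.
immutable size_t[3] unitWidth = [1, 1, 2];

void main()
{
    immutable(ubyte)[] bytes = [0x4E, 0x2D, 0x65, 0x87]; // two made-up 2-byte chars
    auto t = TaggedText(2, bytes);

    // t.payload[1 .. 3] on its own is meaningless: without t.lang and the
    // width table you cannot tell where a character starts or ends, let
    // alone render it. A UTF-8 slice carries that information in-band.
    assert(t.payload.length % unitWidth[t.lang] == 0);
}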

In fact it was even "better": nobody ever talked about a header, they just assumed a codepage from some global setting. Imagine yourself creating a font rendering system under those conditions - a hell of an exercise in frustration (okay, how do I render 0x88? Hmm, if that is in codepage XYZ then ...).

OK, that doesn't cover multi-language strings, but that is what,
.000001% of usage?

This just shows you don't care about multilingual text at all. Imagine any language tutor/translator/dictionary on the Web. Most languages need to intersperse ASCII (also keep HTML markup in mind). Books often feature citations in the original language (or, say, Latin) alongside translations.

Now also take into account math symbols, currency symbols and beyond. These days cultures mix in wild combinations, so you might need to display text even if you can't read it. Unicode is not just about "encoding the characters of all languages"; it has to address the universal representation of the symbols used in writing systems at large.
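A single innocuous string already mixes markup, scripts, math and currency; a tiny sketch of why byte counts and character counts cannot be conflated:

import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // English + HTML markup + Cyrillic + math + currency, all in one string.
    string s = "<b>Привет</b> world: π ≈ 3.14159, price €9.99";
    writeln(s.length);       // UTF-8 code units (bytes)
    writeln(s.walkLength);   // Unicode code points -- a different number
}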

Make your header a little longer and you could handle
those also.  Yes, it wouldn't be strictly backwards-compatible with
ASCII, but it would be so much easier to internationalize.  Of course,
there's also the monoculture we're creating; love this UTF-8 rant by
tuomov, author of one of the first tiling window managers for Linux:

We want the monoculture! That is, to understand each other without all these "par-le-vu-france?" moments and codepages of varying complexity (insanity).

Want it small? Use a compression scheme - they work perfectly well and get you to the precious 1 byte per code point with exceptional speed:
http://www.unicode.org/reports/tr6/
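That report describes SCSU. Even without a tailored scheme, plain general-purpose deflate via std.zlib already shrinks multi-byte UTF-8 text well below its raw size on anything non-trivial. A rough sketch (not SCSU, not a benchmark; the repeated sample is artificial):

import std.array : replicate;
import std.stdio : writeln;
import std.zlib : compress;

void main()
{
    // Cyrillic text costs 2 bytes per code point in UTF-8.
    string text = replicate("Широка страна моя родная. ", 100);
    auto packed = compress(text);   // general-purpose deflate, not SCSU
    writeln(text.length, " raw bytes -> ", packed.length, " compressed");
}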

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?

And borrowing an argument from that very rant: locale is borked when it comes to encodings. Locales should be used for tweaking presentation - number formats, date display and so on.

--
Dmitry Olshansky
