25-May-2013 10:44, Joakim writes:
On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
You seem to think that not only is UTF-8 a bad encoding but also that one
unified encoding (code space) is bad(?).
Yes on the encoding, if it's a variable-length encoding like UTF-8; no
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also; I just haven't delved into it enough to know.  My problem
is with these dumb variable-length encodings, so I was precise in the
title.


UCS is dead and gone. Next in line after "640K is enough for everyone".
Simply put, Unicode decided to take into account the whole diversity of languages instead of ~80% of them. Hard to add anything else. No offense meant, but it feels like you actually live in a universe that is 5-7 years behind the current state of things. UTF-16 (the successor to UCS-2) is no random-access either, and it's shitty beyond measure; UTF-8 is a shining gem in comparison.
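To make the "no random access" point concrete, here is a minimal D sketch (the characters are just an illustration): any code point outside the Basic Multilingual Plane occupies two UTF-16 code units, so indexing by code unit does not land you on the n-th character.

import std.stdio : writeln;

void main()
{
    wstring w = "a\U0001F600b"w;  // 'a', U+1F600 (outside the BMP), 'b'
    writeln(w.length);            // 4 -- the middle character is a surrogate pair
    string  s = "a\U0001F600b";   // the same text in UTF-8
    writeln(s.length);            // 6 -- 1 + 4 + 1 bytes
}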

Separate code spaces were the case before Unicode (and UTF-8). The
problem is not only that without a header the text is meaningless (no easy
slicing) but that the encoding of the data after the header strongly
depends on a variety of factors - a list of encodings, actually. Now
everybody has to keep a (code) page per language to at least know if
it's 2 bytes per char or 1 byte per char or whatever. And you still
work on the assumption that there are no combining marks and no
region-specific stuff :)
Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs indicating that a few years from now you might never encounter a legacy encoding anymore, only UTF-8/UTF-16.

 Does
UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
char or whatever?"

It's coherent in its scheme for determining that: the lead byte of every sequence tells you its length, so you don't need extra information synced to the text, unlike the header stuff.
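A rough sketch of what that means, in D (my own toy helper, not a Phobos function): the lead byte alone tells a scanner how long each sequence is, with no per-text header required.

import std.stdio : writeln;

// Sketch: the length of a UTF-8 sequence is determined by its lead byte alone.
size_t sequenceLength(ubyte lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // continuation byte or invalid lead
}

void main()
{
    writeln(sequenceLength('A'));  // 1
    writeln(sequenceLength(0xC3)); // 2 -- e.g. the first byte of 'é'
    writeln(sequenceLength(0xE2)); // 3 -- e.g. the first byte of '€'
    writeln(sequenceLength(0xF0)); // 4 -- e.g. the first byte of many emoji
}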

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong; that
was what I read on the newsgroup some time back.

Indeed you are - searching for a UTF-8 substring in a UTF-8 string doesn't do any decoding, and it does return you a slice of the remainder of the original.
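For illustration, a minimal D sketch (a toy helper of mine, not the actual Phobos implementation): since UTF-8 lead bytes and continuation bytes never overlap, a plain code-unit search can only match on character boundaries, so no decoding is needed and the result is simply a slice of the original buffer.

import std.stdio : writeln;

// Hypothetical helper: byte-wise substring search over UTF-8 data.
string findUtf8(string haystack, string needle)
{
    if (needle.length == 0)
        return haystack;
    if (needle.length > haystack.length)
        return haystack[$ .. $];
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[i .. i + needle.length] == needle)
            return haystack[i .. $];    // a slice of the remainder, no copying
    }
    return haystack[$ .. $];            // not found: empty slice
}

void main()
{
    auto tail = findUtf8("naïve café au lait", "café");
    writeln(tail);  // "café au lait" -- still the original memory, just sliced
}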


In fact it was even "better": nobody ever talked about a header, they just
assumed a codepage from some global setting. Imagine yourself creating
a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? Hmm, if that is in codepage XYZ
then ...).
I understand that people were frustrated with all the code pages out
there before UCS standardized them, but that is a completely different
argument from my problem with UTF-8 and variable-length encodings.  My
proposed simple, header-based, constant-width encoding could be
implemented with UCS and there go all your arguments about random code
pages.

No they don't - have you ever seen native Korean or Chinese codepages? The problems with your header-based approach are self-evident, in the sense that there is no single sane way to deal with it on a cross-locale basis (which you simply ignore, as noted below).

This just shows you don't care for multilingual stuff at all. Imagine
any language tutor/translator/dictionary on the Web. For instance, most
languages need ASCII interspersed (also keep in mind e.g. HTML
markup). Books often feature citations in the native language (or e.g.
Latin) along with translations.
This is a small segment of use and it would be handled fine by an
alternate encoding.

??? That simply makes no sense. There is no intersection between some of the legacy encodings as of now. Or do you want to add N*(N-1) cross-encodings for every combination of 2? What about 3 in one string?

Now also take into account math symbols, currency symbols and beyond.
Also, these days cultures are mixing in wild combinations, so you might
need to see the text even if you can't read it. Unicode is not only
about encoding characters from all languages; it needs to address the
universal representation of the symbols used in writing systems at large.
I take your point that it isn't just languages, but symbols also.  I see
no reason why UTF-8 is a better encoding for that purpose than the kind
of simple encoding I've suggested.

We want monoculture! That is, to understand each other without all these
"par-le-vu-france?" and codepages of various complexity (insanity).
I hate monoculture, but then I haven't had to decipher some screwed-up
codepage in the middle of the night. ;)

So you've never had trouble with internationalization? What languages do you use (read/speak/etc.)?

That said, you could standardize
on UCS for your code space without using a bad encoding like UTF-8, as I
said above.

UCS has been a myth for ~5 years now. Early adopters of Unicode fell into that trap (Java, Windows NT). You shouldn't.

Want small? Use compression schemes, which are perfectly fine, and get
to the precious 1 byte per codepoint with exceptional speed.
http://www.unicode.org/reports/tr6/
Correct me if I'm wrong, but it seems like that compression scheme
simply adds a header and then uses a single-byte encoding, exactly what
I suggested! :)

This is it, but it's far more flexible in the sense that it handles multilingual strings just fine, and lone full-width Unicode codepoints as well.

But I get the impression that it's only for sending over
the wire, i.e. transmission, so all the processing issues that UTF-8
introduces would still be there.

Use a mime-type etc. Standards are always a bit unwieldy and suboptimal; their acceptance rate is one of the chief advantages they have. Unicode has horrifically large momentum now, and not a single organization aside from them is trying to do this dirty work (i.e. i18n).

And borrowing the arguments from that rant: locale is borked shit
when it comes to encodings. Locales should be used for tweaking visuals
like number formatting, date display and so on.
Is that worse than every API simply assuming UTF-8, as he says? Broken
locale support in the past, which you and others complain about, doesn't
invalidate the concept.

It's a combinatorial blowup, and it has some stone walls to run into. Consider adding another encoding, for "Tuva" for instance. Now you have to add 2*n conversion routines to match it against the other codepages/locales.

Beyond that, there are many things to consider in internationalization, and you would have to special-case them all by codepage.

If they're screwing up something so simple,
imagine how much worse everyone is screwing up something complex like
UTF-8?

UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a sequence of octets. It does that pretty well and stays compatible with ASCII; even the little rant you posted acknowledged that. Now, are you against Unicode as a whole, or what?
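For concreteness, a minimal sketch of that mapping in D (a toy function of mine, with surrogate and range checks omitted): ASCII passes through as a single octet, and everything else becomes two to four octets whose lead byte announces the length.

import std.stdio : writeln;

// Sketch: map a single code point in [0 .. 0x10FFFF] to its UTF-8 octets.
ubyte[] encodeUtf8(dchar c)
{
    if (c < 0x80)       // one octet: plain ASCII, unchanged
        return [cast(ubyte) c];
    if (c < 0x800)      // two octets: 110xxxxx 10xxxxxx
        return [cast(ubyte)(0xC0 | (c >> 6)),
                cast(ubyte)(0x80 | (c & 0x3F))];
    if (c < 0x10000)    // three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return [cast(ubyte)(0xE0 | (c >> 12)),
                cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
                cast(ubyte)(0x80 | (c & 0x3F))];
    // four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [cast(ubyte)(0xF0 | (c >> 18)),
            cast(ubyte)(0x80 | ((c >> 12) & 0x3F)),
            cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
            cast(ubyte)(0x80 | (c & 0x3F))];
}

void main()
{
    writeln(encodeUtf8('A'));          // [65]                  i.e. 0x41, plain ASCII
    writeln(encodeUtf8('\u20AC'));     // [226, 130, 172]       i.e. E2 82 AC, '€'
    writeln(encodeUtf8('\U0001F600')); // [240, 159, 152, 128]  four octets for an emoji
}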

--
Dmitry Olshansky
