25-May-2013 22:26, Joakim writes:
On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
25-May-2013 10:44, Joakim writes:
Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
on the code space.  I was originally going to title my post, "Why
Unicode?" but I have no real problem with UCS, which merely standardized
a bunch of pre-existing code pages.  Perhaps there are a lot of problems
with UCS also, I just haven't delved into it enough to know.

UCS is dead and gone. Next in line to "640K is enough for everyone".
I think you are confused.  UCS refers to the Universal Character Set,
which is the backbone of Unicode:

http://en.wikipedia.org/wiki/Universal_Character_Set

You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which
I have never referred to.

Yeah got confused. So sorry about that.


Separate code spaces were the case before Unicode (and UTF-8). The
problem is not only that without the header the text is meaningless
(no easy slicing), but that the encoding of the data after the header
depends on a variety of factors - a list of encodings, actually. Now
everybody has to keep a (code) page per language just to know whether
it's 2 bytes per char, 1 byte per char, or whatever. And that still
assumes there are no combining marks or region-specific quirks :)
Everybody is still keeping code pages; UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. There are graphs that indicate that
a few years from now you might never encounter a legacy encoding
anymore, only UTF-8/UTF-16.
I didn't mean that people are literally keeping code pages.  I meant
that there's not much of a difference between code pages with 2 bytes
per char and the language character sets in UCS.

You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right, you propose to define a string as a header that denotes a set of windows in the code space? I still fail to see how that would scale; see below.
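
For what it's worth, the "map a codepage to a subset of UCS" step is
just a 256-entry table per page. A minimal sketch in D, with an
illustrative (not real) table:

    // A minimal sketch of mapping a single-byte code page onto UCS code
    // points via a 256-entry table. The table is illustrative, not a
    // real code page: only ASCII plus one Cyrillic entry are filled in.
    dchar[256] makeToyTable()
    {
        dchar[256] t = '\uFFFD';                    // unmapped bytes -> replacement char
        foreach (i; 0 .. 128) t[i] = cast(dchar) i; // ASCII maps to itself
        t[0xE0] = '\u0430';                         // e.g. 0xE0 -> CYRILLIC SMALL LETTER A
        return t;
    }

    dchar[] decodeSingleByte(const(ubyte)[] bytes, ref const dchar[256] table)
    {
        auto result = new dchar[](bytes.length);    // one code point per byte, always
        foreach (i, b; bytes)
            result[i] = table[b];
        return result;
    }

Decoding is a fixed-width lookup per byte; the trouble starts once one
string needs more than one page, which is where the header/window
question comes in.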

It has to do that also. Everyone keeps talking about
"easy slicing" as though UTF-8 provides it, but it doesn't. Phobos
turns UTF-8 into UTF-32 internally for all that ease of use, at least
doubling your string size in the process.  Correct me if I'm wrong, that
was what I read on the newsgroup sometime back.

Indeed you are - searching for a UTF-8 substring in a UTF-8 string
doesn't do any decoding, and it does return you a slice of the
remainder of the original.
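
A small illustration of the point, using Phobos' std.string.indexOf
(the index it returns is usable for slicing): because UTF-8 is
self-synchronizing, a match of the needle's bytes can only start on a
code point boundary, and the result slices the original storage.

    // Substring search over UTF-8 can work on raw code units; the
    // result is a slice of the original string, no re-encoding.
    import std.string : indexOf;

    void main()
    {
        string haystack = "Привет, мир";       // stored as UTF-8
        string needle   = "мир";
        auto i = haystack.indexOf(needle);     // code-unit (byte) index
        assert(i >= 0);
        assert(haystack[i .. $] == "мир");     // slice of the original bytes
    }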
Perhaps substring search doesn't strictly require decoding but you have
changed the subject: slicing does require decoding and that's the use
case you brought up to begin with.  I haven't looked into it, but I
suspect substring search not requiring decoding is the exception for
UTF-8 algorithms, not the rule.
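
To make the slicing case concrete: taking "the first N characters" of
a UTF-8 string cannot be a plain byte slice; the string has to be
walked from the left. A sketch using Phobos' std.utf.stride:

    // Byte index of the n-th code point: UTF-8 has to be walked one
    // code point at a time before you can slice at a character count.
    import std.utf : stride;

    size_t byteIndexOfCodePoint(string s, size_t n)
    {
        size_t i = 0;
        foreach (_; 0 .. n)
            i += stride(s, i);   // width in bytes of the code point at i
        return i;
    }

    void main()
    {
        string s = "приветhello";
        auto head = s[0 .. byteIndexOfCodePoint(s, 6)]; // first 6 code points
        assert(head == "привет");                       // 6 chars, 12 bytes
    }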

Mm... strictly speaking (let's turn that argument backwards) - what are the algorithms that require slicing, say [5..$], of a string without ever looking at it left to right first, searching, etc.?

??? Simply makes no sense. There is no intersection between some
legacy encodings as of now. Or do you want to add N*(N-1)
cross-encodings for any combination of 2? What about 3 in one string?
I sketched two possible encodings above, none of which would require
"cross-encodings."

We want monoculture! That is, to understand each other without all
this "par-le-vu-france?" and codepages of varying complexity (insanity).
I hate monoculture, but then I haven't had to decipher some screwed-up
codepage in the middle of the night. ;)

So you never had trouble with internationalization? What languages do
you use (read/speak/etc.)?
This was meant as a point in your favor, conceding that I haven't had to
code with the terrible code pages system from the past.  I can read and
speak multiple languages, but I don't use anything other than English text.

Okay then.

That said, you could standardize
on UCS for your code space without using a bad encoding like UTF-8, as I
said above.

UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
that trap (Java, Windows NT). You shouldn't.
UCS, the character set, as noted above.  If that's a myth, Unicode is a
myth. :)

Yeah, that was a mishap on my part. I think I've seen your 2-byte argument way too often and it got concatenated with UCS, forming UCS-2 :)


This is it, but it's far more flexible in the sense that it allows
multilingual strings just fine, and lone full-width Unicode code
points as well.
That's only because it uses a more complex header than a single byte for
the language, which I noted could be done with my scheme, by adding a
more complex header,

What would it look like? Or how would the processing go?

long before you mentioned this unicode compression
scheme.

It does inline headers, or rather tags, that hop between fixed char windows. It's not random-access, nor does it claim to be.
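
A toy sketch of that kind of inline-tag windowing (the tag value and
the 128-code-point window width here are made up; the real scheme is
more involved): a tag byte switches the current window, every other
byte is an offset into it, and decoding is strictly left to right.

    // Toy decoder for an inline-tag window scheme. Assumes well-formed
    // input; no random access is possible, you hop from tag to tag.
    enum ubyte WindowTag = 0xFF;              // hypothetical tag marker

    dchar[] decodeWindowed(const(ubyte)[] data)
    {
        dchar[] result;
        dchar base = 0;                       // start of the current 128-char window
        for (size_t i = 0; i < data.length; ++i)
        {
            if (data[i] == WindowTag)
                base = cast(dchar)(data[++i] * 128);  // next byte selects the window
            else
                result ~= cast(dchar)(base + data[i]);
        }
        return result;
    }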


But I get the impression that it's only for sending over
the wire, i.e. transmission, so all the processing issues that UTF-8
introduces would still be there.

Use mime-type etc. Standards are always a bit stringy and suboptimal;
their acceptance rate is one of the chief advantages they have. Unicode
has horrifically large momentum now and not a single organization
aside from them tries to do this dirty work (=i18n).
You misunderstand.  I was saying that this unicode compression scheme
doesn't help you with string processing, it is only for transmission and
is probably fine for that, precisely because it seems to implement some
version of my single-byte encoding scheme!  You do raise a good point:
the only reason why we're likely using such a bad encoding in UTF-8 is
that nobody else wants to tackle this hairy problem.

Yup, where have you been, say, almost 10 years ago? :)

Consider adding another encoding for "Tuva", for instance. Now you have
to add 2*n conversion routines to match it to other codepages/locales.
Not sure what you're referring to here.

If you adopt the "map to UCS" policy, then nothing.

Beyond that - there are many things to consider in
internationalization, and you would have to special-case them all by
codepage.
Not necessarily.  But that is actually one of the advantages of
single-byte encodings, as I have noted above.  toUpper is a NOP for a
single-byte encoded string with an Asian script; you can't do that with
a UTF-8 string.

But you have to check what encoding it's in, and given that not all codepages are that simple to uppercase, some generic algorithm is required.
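
A sketch of the dispatch that implies, with hypothetical names: under
a tagged single-byte scheme even toUpper has to branch on the page tag
and carry a per-page case table; the caseless scripts are the only
truly free ones.

    // Hypothetical sketch: uppercasing a tagged single-byte string
    // means a per-page 256-entry case table, selected by the page tag.
    alias CaseTable = ubyte[256];

    ubyte[] toUpperTagged(ubyte page, const(ubyte)[] text,
                          const CaseTable[ubyte] tables)  // one table per cased page
    {
        auto t = page in tables;
        if (t is null)
            return text.dup;           // caseless script: genuinely a no-op
        auto result = new ubyte[](text.length);
        foreach (i, b; text)
            result[i] = (*t)[b];       // per-page table lookup per byte
        return result;
    }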

If they're screwing up something so simple,
imagine how much worse everyone is screwing up something complex like
UTF-8?

UTF-8 is pretty darn simple. BTW, all it does is map [0..10FFFF] to a
sequence of octets. It does it pretty well and is compatible with
ASCII; even the little rant you posted acknowledged that. Now, are you
against Unicode as a whole, or what?
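
For concreteness, that mapping written out as a minimal encoder
(Phobos already provides it as std.utf.encode; surrogate and range
checks are omitted here):

    // The whole of UTF-8 in one function: [0 .. 0x10FFFF] -> 1 to 4
    // octets, with ASCII passing through unchanged.
    ubyte[] encodeUtf8(dchar c)
    {
        if (c < 0x80)    return [cast(ubyte) c];
        if (c < 0x800)   return [cast(ubyte)(0xC0 | (c >> 6)),
                                 cast(ubyte)(0x80 | (c & 0x3F))];
        if (c < 0x10000) return [cast(ubyte)(0xE0 | (c >> 12)),
                                 cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
                                 cast(ubyte)(0x80 | (c & 0x3F))];
        return [cast(ubyte)(0xF0 | (c >> 18)),
                cast(ubyte)(0x80 | ((c >> 12) & 0x3F)),
                cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
                cast(ubyte)(0x80 | (c & 0x3F))];
    }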
The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.

There are two parts to Unicode.  I don't know enough about UCS, the
character set, ;) to be for it or against it, but I acknowledge that a
standardized character set may make sense.  I am dead set against the
UTF-8 variable-width encoding, for all the reasons listed above.

Okay, we are getting somewhere, now that I understand your position and see where I got myself confused midway through.

On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
25-May-2013 13:05, Joakim writes:
Nobody is talking about going back to code pages.  I'm talking about
going to single-byte encodings, which do not imply the problems that you
had with code pages way back when.

The problem is that what you outline is isomorphic to code pages.
Hence the grief of accumulated experience against them.
They may seem superficially similar but they're not.  For example, from
the beginning, I have suggested a more complex header that can enable
multi-language strings, as one possible solution.  I don't think code
pages provided that.

The problem is: how would you define an uppercase algorithm for a multilingual string with 3 distinct 256-entry codespaces (windows)? I bet it won't be pretty.

Well, if somebody got a quest to redefine UTF-8 they *might* come up
with something that is a bit faster to decode but shares the same
properties. Hardly a lifesaver anyway.
Perhaps not, but I suspect programmers will flock to a constant-width
encoding that is much simpler and more efficient than UTF-8.  Programmer
productivity is the biggest loss from the complexity of UTF-8, as I've
noted before.

I still don't see how your solution scales beyond 256 different codepoints per string (= multiple pages/parts of UCS ;) ).

--
Dmitry Olshansky
