On 5/25/2013 1:03 PM, Joakim wrote:
> On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
>> On the other hand, Joakim even admits his single byte encoding is
>> variable length, as otherwise he simply dismisses the rarely used (!)
>> Chinese, Japanese, and Korean languages, as well as any text that
>> contains words from more than one language.
> I have noted from the beginning that these large alphabets have to be
> encoded to two bytes, so it is not a true constant-width encoding if you
> are mixing one of those languages into a single-byte encoded string. But
> this "variable length" encoding is so much simpler than UTF-8, there's
> no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler"; there's actually more work in writing code that adapts to one- or two-byte encodings.
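To make that concrete, here's a minimal sketch in D of decoding such a scheme. The byte layout is an assumption for illustration (the proposal never pins one down): lead bytes below 0x80 are single characters, anything else starts a two-byte pair. Note the length branch is unavoidable:

// Hypothetical one-or-two-byte scheme (layout assumed for illustration):
// bytes 0x00-0x7F stand alone; 0x80-0xFF are lead bytes of a pair.
dchar decodeHypothetical(const(ubyte)[] s, ref size_t i)
{
    immutable ubyte b = s[i++];
    if (b < 0x80)                           // one-byte character
        return b;
    return cast(dchar)((b << 8) | s[i++]);  // two-byte character
}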


>> I suspect he's trolling us, and quite successfully.
> Ha, I wondered who would pull out this insult; quite surprised to see
> it's Walter. It seems to be the trend on the internet to accuse anybody
> you disagree with of trolling, and I am honestly surprised to see Walter
> stoop so low. Considering I'm the only one making any cogent arguments
> here, perhaps I should wonder if you're all trolling me. ;)

> On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
>> I suspect the Chinese, Koreans, and Japanese would take exception to
>> being called irrelevant.
> Irrelevant only because they are a small subset of the UCS. I have noted
> that they would also be handled by a two-byte encoding.
>
>> Good luck with your scheme that can't handle languages written by
>> billions of people!
> So let's see: first you say that my scheme has to be variable length
> because I am using two bytes to handle these languages,

Well, it *is* variable length, or you have to disregard Chinese. You cannot have it both ways. Code that deals with two bytes is significantly different from code that deals with one. That means you've got a conditional in your generic code - and that isn't going to be faster than the conditional for UTF-8.
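For comparison, here's a sketch of the branch UTF-8 decoding already has, using Phobos's std.utf.decode for the slow path. The hot path has the same shape as the hypothetical decoder above, so there's no branch saved:

import std.utf : decode;

// UTF-8 decoding has the same structure: a one-byte ASCII fast path
// and a branch to a slower multi-byte path.
dchar nextChar(string s, ref size_t i)
{
    if (s[i] < 0x80)       // ASCII: one byte, no further work
        return s[i++];
    return decode(s, i);   // multi-byte sequence; advances i past it
}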


> then you claim I don't handle these languages. This kind of blatant
> contradiction within two posts can only be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.
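To illustrate what that handwaving glosses over: suppose (purely as an assumption, since the proposal never specifies it) that embedded foreign words are marked ISO-2022-style, with an in-band shift byte naming the next code page. Even counting characters then requires a scan, because shift bytes sit in the middle of the string and s[n] no longer names the n-th character - the "constant width" property is gone the moment languages mix:

enum ubyte SHIFT = 0x1B;  // hypothetical "switch code page" marker

// Counting characters in mixed-language text under the assumed scheme.
size_t countChars(const(ubyte)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; ++i)
    {
        if (s[i] == SHIFT) { ++i; continue; } // skip the page selector
        ++n;  // one byte per character on the current single-byte page
    }
    return n;
}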

Worse, there are going to be more than 256 of these encodings - you can't even fit the code page selector in a byte. Remember, Unicode already assigns well over 100,000 characters, out of a code space of 1,114,112 code points. How many code pages is that?
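The arithmetic, as a quick sketch (the assigned-character count is the approximate figure for Unicode 6.x):

import std.stdio;

void main()
{
    enum assignedChars = 110_000;  // roughly, as of Unicode 6.x (2013)
    enum charsPerPage  = 256;      // one single-byte code page
    enum pages = (assignedChars + charsPerPage - 1) / charsPerPage;
    writeln(pages);                // 430 - far more than fit in a byte
}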

I was being kind saying you were trolling, as otherwise I'd be saying your scheme was, to be blunt, absurd.

---------------------------------------

I'll be the first to admit that a lot of great ideas have initially been dismissed by the experts as absurd. If you really believe in this, I recommend that you write it up as a real article, taking care to fill in all the handwaving with something specific, and include some benchmarks to prove your performance claims. Post your article on Reddit, Stack Overflow, Hacker News, etc., and look for fertile ground for it. I'm sorry you're not finding any here (so far, nobody has agreed with any of your points), and this is the wrong place for such proposals anyway, as D is simply not going to switch over to it.

Remember, extraordinary claims require extraordinary evidence, not handwaving and assumptions disguised as bold assertions.
