On 5/25/2013 1:03 PM, Joakim wrote:
> On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
>> On the other hand, Joakim even admits his single byte encoding is
>> variable length, as otherwise he simply dismisses the rarely used (!)
>> Chinese, Japanese, and Korean languages, as well as any text that
>> contains words from more than one language.
> I have noted from the beginning that these large alphabets have to be
> encoded to two bytes, so it is not a true constant-width encoding if you
> are mixing one of those languages into a single-byte encoded string. But
> this "variable length" encoding is so much simpler than UTF-8, there's
> no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler"; there's actually more work in writing code that adapts to one- or two-byte encodings.
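To make that concrete, here's a minimal sketch in D of decoding such a scheme. The byte layout is an assumption for illustration (the proposal never pins one down): lead bytes below 0x80 are single characters, anything else starts a two-byte pair. Note the length branch is unavoidable:

// Hypothetical one-or-two-byte scheme (layout assumed for illustration):
// bytes 0x00-0x7F stand alone; 0x80-0xFF are lead bytes of a pair.
dchar decodeHypothetical(const(ubyte)[] s, ref size_t i)
{
    immutable ubyte b = s[i++];
    if (b < 0x80)                           // one-byte character
        return b;
    return cast(dchar)((b << 8) | s[i++]);  // two-byte character
}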


>> I suspect he's trolling us, and quite successfully.
> Ha, I wondered who would pull out this insult; quite surprised to see
> it's Walter. It seems to be the trend on the internet to accuse anybody
> you disagree with of trolling, and I am honestly surprised to see Walter
> stoop so low. Considering I'm the only one making any cogent arguments
> here, perhaps I should wonder if you're all trolling me. ;)

> On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
>> I suspect the Chinese, Koreans, and Japanese would take exception to
>> being called irrelevant.
> Irrelevant only because they are a small subset of the UCS. I have noted
> that they would also be handled by a two-byte encoding.
>
>> Good luck with your scheme that can't handle languages written by
>> billions of people!
> So let's see: first you say that my scheme has to be variable length
> because I am using two bytes to handle these languages,

Well, it *is* variable length, or you have to disregard Chinese. You cannot have it both ways. Code that deals with two bytes is significantly different from code that deals with one. That means you've got a conditional in your generic code - and that isn't going to be faster than the conditional for UTF-8.
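For comparison, here's a sketch of the branch UTF-8 decoding already has, using Phobos's std.utf.decode for the slow path. The hot path has the same shape as the hypothetical decoder above, so there's no branch saved:

import std.utf : decode;

// UTF-8 decoding has the same structure: a one-byte ASCII fast path
// and a branch to a slower multi-byte path.
dchar nextChar(string s, ref size_t i)
{
    if (s[i] < 0x80)       // ASCII: one byte, no further work
        return s[i++];
    return decode(s, i);   // multi-byte sequence; advances i past it
}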


> then you claim I don't handle these languages. This kind of blatant
> contradiction within two posts can only be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.
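To illustrate what that handwaving glosses over: suppose (purely as an assumption, since the proposal never specifies it) that embedded foreign words are marked ISO-2022-style, with an in-band shift byte naming the next code page. Even counting characters then requires a scan, because shift bytes sit in the middle of the string and s[n] no longer names the n-th character - the "constant width" property is gone the moment languages mix:

enum ubyte SHIFT = 0x1B;  // hypothetical "switch code page" marker

// Counting characters in mixed-language text under the assumed scheme.
size_t countChars(const(ubyte)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; ++i)
    {
        if (s[i] == SHIFT) { ++i; continue; } // skip the page selector
        ++n;  // one byte per character on the current single-byte page
    }
    return n;
}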

Worse, there are going to be more than 256 of these encodings - you can't even fit the code page selector in a byte. Remember, Unicode already assigns well over 100,000 characters, out of a code space of 1,114,112 code points. How many code pages is that?
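The arithmetic, as a quick sketch (the assigned-character count is the approximate figure for Unicode 6.x):

import std.stdio;

void main()
{
    enum assignedChars = 110_000;  // roughly, as of Unicode 6.x (2013)
    enum charsPerPage  = 256;      // one single-byte code page
    enum pages = (assignedChars + charsPerPage - 1) / charsPerPage;
    writeln(pages);                // 430 - far more than fit in a byte
}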

I was being kind saying you were trolling, as otherwise I'd be saying your scheme was, to be blunt, absurd.

---------------------------------------

I'll be the first to admit that a lot of great ideas have initially been dismissed by the experts as absurd. If you really believe in this, I recommend that you write it up as a real article, taking care to fill in all the handwaving with something specific, and include some benchmarks to prove your performance claims. Post your article on Reddit, Stack Overflow, Hacker News, etc., and look for fertile ground for it. I'm sorry you're not finding any here (so far, nobody has agreed with any of your points), and this is the wrong place for such proposals anyway, as D is simply not going to switch over to it.

Remember, extraordinary claims require extraordinary evidence, not handwaving and assumptions disguised as bold assertions.
