On Saturday, 25 May 2013 at 19:58:25 UTC, Dmitry Olshansky wrote:
> Runs away in horror :) It's a mess even before you've got to the
> details.
Perhaps it's fatally flawed, but I don't see an argument for why,
so I'll assume you can't find such a flaw. It is still _much
less_ messy than UTF-8; that is the critical distinction.
> Another point about sometimes using a 2-byte encoding: welcome
> to the nice world of big-endian/little-endian, i.e. the very
> trap UTF-16 has stepped into.
I don't think this is a sizable obstacle. It takes some
coordination, but it is a minor issue.
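For context on the endianness trap being referenced: the same UTF-16 code unit serializes to different byte orders on different platforms, which is why the standard defines a byte-order mark (BOM). A minimal sketch of the coordination involved (my illustration, not anything proposed in the thread):

```python
def detect_utf16_endianness(data: bytes) -> str:
    """Return 'big' or 'little' based on a leading BOM.

    The BOM is code point U+FEFF; its byte order reveals the
    encoder's endianness. Per the Unicode standard, big-endian
    is the default when no BOM is present.
    """
    if data.startswith(b'\xfe\xff'):
        return 'big'      # U+FEFF serialized big-endian
    if data.startswith(b'\xff\xfe'):
        return 'little'   # U+FEFF serialized little-endian
    return 'big'          # spec default: no BOM means big-endian

# The same text yields different bytes depending on endianness:
be = "hi".encode("utf-16-be")   # b'\x00h\x00i'
le = "hi".encode("utf-16-le")   # b'h\x00i\x00'
assert be != le
assert detect_utf16_endianness(b'\xff\xfe' + le) == 'little'
```

This is the coordination cost: every producer and consumer of the byte stream must agree on (or signal) byte order, a problem pure byte-oriented encodings like UTF-8 avoid entirely.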
On Saturday, 25 May 2013 at 20:20:11 UTC, Juan Manuel Cabo wrote:
> You obviously are not thinking it through. Such an encoding would
> have O(n^2) complexity for appending a character/symbol in a
> different language to the string, since you would have to
> update the beginning of the string and move the contents
> forward to make room. Not to mention that it wouldn't be
> backwards compatible with ASCII routines, and the complexity of
> such a header would have to be carried all the way to font
> rendering routines in the OS.
You obviously have not read the rest of the thread, both your
non-font-related assertions have been addressed earlier. I see
no reason why a single-byte encoding of UCS would have to be
carried to "font rendering routines" but UTF-8 wouldn't be.
> Multiple languages/symbols in one string is a blessing of
> modern humane computing. It is the norm more than the exception
> in most of the world.
I disagree, but in any case, most of this thread refers to
multi-language strings. The argument is about how best to encode
them.
On Saturday, 25 May 2013 at 20:47:25 UTC, Peter Alexander wrote:
> On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
>>> I suggest you read up on UTF-8. You really don't understand
>>> it. There is no need to decode; you just treat the UTF-8
>>> string as if it is an ASCII string.
>> Not being aware of this shortcut doesn't mean not
>> understanding UTF-8.
> It's not just a shortcut, it is absolutely fundamental to the
> design of UTF-8. It's like saying you understand Lisp without
> being aware that everything is a list.
It is an accidental shortcut because of the encoding scheme
chosen for UTF-8 and, as I've noted, still less efficient than
similarly searching a single-byte encoding. The fact that you
keep trumpeting this silly detail as somehow "fundamental"
suggests you have no idea what you're talking about.
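For readers following the dispute: the "shortcut" both sides are arguing about is UTF-8's self-synchronizing design. Every byte of a multi-byte sequence has its high bit set, so a plain byte-level search for an ASCII pattern can never match inside a multi-byte character. A brief sketch (my illustration, not from either poster):

```python
def find_ascii(haystack: bytes, needle: bytes) -> int:
    """Byte-level search, safe for ASCII needles in UTF-8 haystacks.

    UTF-8 lead and continuation bytes of multi-byte sequences are
    all >= 0x80, so they can never equal an ASCII byte (0x00-0x7F);
    a raw byte search cannot produce a false match mid-character.
    """
    assert all(b < 0x80 for b in needle), "needle must be pure ASCII"
    return haystack.find(needle)  # no decoding step needed

text = "naïve café?".encode("utf-8")
# 'ï' and 'é' each occupy two bytes, yet the byte search still
# finds the ASCII substring correctly (at byte offset 7 here).
assert find_ascii(text, b"caf") == 7
```

Note the result is a byte offset, not a code-point index; the two diverge as soon as any multi-byte character precedes the match, which is part of what the efficiency argument in this thread turns on.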
> Also, you keep stating disadvantages to UTF-8 that
> are completely false, like "slicing does require decoding".
> Again, completely missing the point of UTF-8. I cannot conceive
> how you can claim to understand how UTF-8 works yet repeatedly
> demonstrate that you do not.
Slicing on code points requires decoding, I'm not sure how you
don't know that. If you mean slicing by byte, that is not only
useless, but _every_ encoding can do that. I cannot conceive how
you claim to defend UTF-8, yet keep making such stupid points
that you don't even bother backing up.
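To make the claim concrete: finding the byte offset for a code-point index in UTF-8 requires scanning from the start, reading each lead byte to learn the sequence length, i.e. O(n), whereas a fixed-width encoding computes the offset directly in O(1). A sketch of that scan (mine, for illustration only):

```python
def codepoint_slice(s: bytes, start: int, end: int) -> bytes:
    """Slice a UTF-8 byte string by code-point indices.

    The byte offset of code point k is unknown until we have
    scanned past the k preceding (variable-width) sequences.
    """
    def advance(pos: int, count: int) -> int:
        while count and pos < len(s):
            b = s[pos]
            # The lead byte encodes the sequence length.
            if b < 0x80:
                step = 1          # ASCII
            elif b >> 5 == 0b110:
                step = 2          # 2-byte sequence
            elif b >> 4 == 0b1110:
                step = 3          # 3-byte sequence
            else:
                step = 4          # 4-byte sequence
            pos += step
            count -= 1
        return pos

    lo = advance(0, start)
    hi = advance(lo, end - start)
    return s[lo:hi]

data = "héllo".encode("utf-8")   # 6 bytes, 5 code points
assert codepoint_slice(data, 1, 3).decode("utf-8") == "él"
```

With a fixed single-byte encoding the same slice would be plain pointer arithmetic, `s[start:end]`, which is the efficiency difference the preceding posts are arguing over.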
> You are either ignorant or a successful troll. In either case,
> I'm done here.
Must be nice to just insult someone who has demolished your
arguments and leave. Good riddance, you weren't adding anything.