On Mon, Apr 25, 2016 at 3:31 AM, Simon Slavin <slavins at bigfraud.org> wrote:

> > These are different concerns, and they don't really pose any
> > difficulty.  Given an encoding, a column of N characters can take up to
> > x * N bytes.  Back in the day, "x" was 1.  Now it's something else.  No
> > big deal.
>
> No.  Unicode uses different numbers of bytes to store different
> characters.  You cannot tell from the number of bytes in a string how many
> characters it encodes, and the programming required to work out the string
> length is complicated.  The combination of six bytes
>

Don't confuse Unicode encodings in general with UTF-8, Simon. What you say is
true of UTF-8, and of UTF-16. But there's also UTF-32, where you *can* tell. --DD
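
To make that concrete, here is a rough Python 3 sketch (illustrative only,
nothing SQLite-specific):

    s = "naïve"                    # 5 code points
    utf8  = s.encode("utf-8")      # 6 bytes: "ï" takes 2 bytes
    utf16 = s.encode("utf-16-le")  # 10 bytes: 2 per code point here,
                                   # 4 for supplementary characters
    utf32 = s.encode("utf-32-le")  # 20 bytes: always 4 per code point
    print(len(utf8), len(utf16), len(utf32))  # 6 10 20
    print(len(utf32) // 4 == len(s))          # True: fixed width, so the
                                              # byte count gives the
                                              # code-point count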

PS: Well, kinda, since you can still have several code points that combine
to make a single accented character, even though most accented characters
have their own precomposed code point. Granted, UTF-32 is almost never used
for storage and only sometimes used in memory, for efficient processing in
some algorithms, but still: Unicode code points != variable-length encoded
sequences.
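
To illustrate that caveat, another small Python 3 sketch (illustrative only):

    import unicodedata
    nfc = "\u00e9"     # "é" as one precomposed code point
    nfd = "e\u0301"    # "é" as "e" plus a combining acute accent
    print(len(nfc), len(nfd))   # 1 2: code-point counts differ
    print(nfc == nfd)           # False: different code-point sequences
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True: same character

So even in UTF-32, counting code units only counts code points, not
user-perceived characters.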
