On Mon, Apr 25, 2016 at 1:08 AM, Dominique Devienne <ddevienne at gmail.com> wrote:
> On Mon, Apr 25, 2016 at 3:31 AM, Simon Slavin <slavins at bigfraud.org>
> wrote:
>
> > > These are different concerns, and they don't really pose any
> > > difficulty. Given an encoding, a column of N characters can take up
> > > to x * N bytes. Back in the day, "x" was 1. Now it's something else.
> > > No big deal.
> >
> > No. Unicode uses different numbers of bytes to store different
> > characters. You cannot tell from the number of bytes in a string how
> > many characters it encodes, and the programming required to work out
> > the string length is complicated. The combination of six bytes
>
> Don't confuse Unicode encodings and UTF-8, Simon. What you say is true
> of UTF-8. And UTF-16. But there's also UTF-32, where you *can* tell. --DD
>
> PS: Well, kinda, since you can still have several code points that
> combine to make a single accented character, despite most accented
> characters having their own code point. Granted, UTF-32 is almost never
> used for storage, and only sometimes used in-memory for efficient
> processing in some algorithms, but still: Unicode code points !=
> variable-length encoded sequences.

Even with UTF-32 there is no one-to-one correspondence between
"characters" and "code points": a single character can be built from
multiple code points. Unicode processing is far more complex, in any
UTF, than in simple single-byte character sets like ASCII.

--
Scott Robison
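To make that concrete, here is a minimal sketch in Python 3 (illustrative
only, not from anyone's code in this thread): the same user-perceived
character "e-acute" can be one code point (U+00E9) or two (U+0065 plus
combining U+0301), so neither the byte count in any UTF nor the
code-point count equals the character count.

    # Illustrative sketch (assumes Python 3): one user-perceived
    # character, encoded as one code point vs. two.
    import unicodedata

    precomposed = "\u00e9"   # e-acute as a single code point, U+00E9
    combining = "e\u0301"    # "e" + U+0301 COMBINING ACUTE ACCENT

    for label, s in [("precomposed", precomposed),
                     ("combining", combining)]:
        print(label,
              "code points:", len(s),
              "UTF-8 bytes:", len(s.encode("utf-8")),
              "UTF-16 bytes:", len(s.encode("utf-16-le")),
              "UTF-32 bytes:", len(s.encode("utf-32-le")))

    # NFC normalization maps the two-code-point form to U+00E9 here, but
    # not every combining sequence has a precomposed equivalent, so even
    # counting code points does not count "characters" in general.
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True

Running it prints 1 code point / 2, 2, 4 bytes for the precomposed form
and 2 code points / 3, 4, 8 bytes for the combining form: four bytes per
code point in UTF-32, yet still not one "character" per four bytes.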