On Mon, Apr 25, 2016 at 8:38 AM, James K. Lowden <jklowden at schemamania.org>
wrote:

> On Mon, 25 Apr 2016 02:31:25 +0100
> Simon Slavin <slavins at bigfraud.org> wrote:
>
> > > These are different concerns, and they don't really pose any
> > > difficulty.  Given an encoding, a column of N characters can take
> > > up to x * N bytes.  Back in the day, "x" was 1.  Now it's something
> > > else.  No big deal.
> >
> > No.  Unicode uses different numbers of bytes to store different
> > characters.  You cannot tell from the number of bytes in a string how
> > many characters it encodes, and the programming required to work out
> > the string length is complicated.
>
> "up to", I said.  You're right that you can't know the byte-offset for a
> letter in a UTF-8 string.  What I'm saying is that given an encoding
> and a string, you *do* know the maximum number of bytes required.
> From the DBMS's point of view, a string of known size and encoding can
> be managed with a fixed length buffer.
>

It depends on what you call a character. If you consider a "character" the
same way most people do (one typographical unit), then you have to deal
with varying numbers of code points per character, even in a "fixed width"
encoding like UTF-32. There is no hard limit on how many combining marks
can be appended to a base code point.
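
As a throwaway illustration (a Python sketch, nothing SQLite-specific),
here is one base letter with a few combining marks stacked on it:

  # one visible "character": e + acute + diaeresis + dot below
  combined = "e\u0301\u0308\u0323"

  print(len(combined))                           # 4 code points
  print(len(combined.encode("utf-8")))           # 7 bytes in UTF-8
  print(len(combined.encode("utf-32-le")) // 4)  # still 4 "fixed width" UTF-32 units

It displays as a single character, but the storage cost grows with every
combining mark you pile on, so "N characters" by itself doesn't bound the
byte count.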

See
http://stackoverflow.com/questions/10414864/whats-up-with-these-unicode-combining-characters-and-how-can-we-filter-them
for a stupid / extreme example.

-- 
Scott Robison
