On Mon, Apr 25, 2016 at 8:38 AM, James K. Lowden <jklowden at schemamania.org> wrote:
> On Mon, 25 Apr 2016 02:31:25 +0100
> Simon Slavin <slavins at bigfraud.org> wrote:
>
> > > These are different concerns, and they don't really pose any
> > > difficulty.  Given an encoding, a column of N characters can take
> > > up to x * N bytes.  Back in the day, "x" was 1.  Now it's something
> > > else.  No big deal.
> >
> > No.  Unicode uses different numbers of bytes to store different
> > characters.  You cannot tell from the number of bytes in a string how
> > many characters it encodes, and the programming required to work out
> > the string length is complicated.
>
> "up to", I said.  You're right that you can't know the byte-offset for a
> letter in a UTF-8 string.  What I'm saying is that given an encoding
> and a string, you *do* know the maximum number of bytes required.
> From the DBMS's point of view, a string of known size and encoding can
> be managed with a fixed length buffer.

It depends on what you call a character. If you consider a "character"
the same way most people do (one typographical unit), then you have to
deal with varying numbers of code points per character, even in a
"fixed width" encoding like UTF-32. There is no hard limit on how many
combining marks can be appended to a base code point. See
http://stackoverflow.com/questions/10414864/whats-up-with-these-unicode-combining-characters-and-how-can-we-filter-them
for a stupid / extreme example.

-- 
Scott Robison
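For illustration, a minimal Python sketch of the combining-mark point
(the particular marks chosen here are arbitrary, not from the thread):

    # Why one visible "character" has no fixed number of code points
    # or bytes, even though each individual code point does.
    import unicodedata

    precomposed = "\u00e9"    # e-acute as a single precomposed code point
    decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

    # Same visible character, different code point and byte counts in UTF-8.
    print(len(precomposed), len(precomposed.encode("utf-8")))  # 1 code point, 2 bytes
    print(len(decomposed), len(decomposed.encode("utf-8")))    # 2 code points, 3 bytes
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

    # Combining marks can be stacked without any hard limit, so the byte
    # count per "character" is unbounded.
    stacked = "e" + "\u0301\u0302\u0303\u0304\u0305"
    print(len(stacked), len(stacked.encode("utf-8")))          # 6 code points, 11 bytes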