On 25 Apr 2016, at 1:42am, James K. Lowden <jklowden at schemamania.org> wrote:
> Simon Slavin <slavins at bigfraud.org> wrote:
>
>> Another reason is that we use Unicode not ASCII/SIXBIT/EBCDIC, and in
>> Unicode different characters take different numbers of bytes.  So
>> even if you're storing a fixed number of bytes the convenience of
>> always knowing exactly how many characters to display no longer
>> exists.
>
> These are different concerns, and they don't really pose any
> difficulty.  Given an encoding, a column of N characters can take up to
> x * N bytes.  Back in the day, "x" was 1.  Now it's something else.  No
> big deal.

No.  Unicode uses different numbers of bytes to store different
characters.  You cannot tell from the number of bytes in a string how
many characters it encodes, and the programming required to work out
the string length is complicated.

The six-byte UTF-8 combination

    U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
    U+0327 COMBINING CEDILLA
    U+0308 COMBINING DIAERESIS

is rendered as a capital Z with a diaeresis above it, a stroke through
the middle of it and a cedilla below it.  That is all one character,
taking up the same horizontal space as a simple capital Z, which
requires just one byte in UTF-8 (or two in UTF-16).
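You can watch the mismatch happen in SQLite itself.  A quick sketch,
assuming a database with the default UTF-8 encoding:

    SELECT length(char(0x01B5, 0x0327, 0x0308));
    -- 3: length() counts code points in a text value

    SELECT length(CAST(char(0x01B5, 0x0327, 0x0308) AS BLOB));
    -- 6: the same value cast to a blob is counted in bytes

Three code points, six bytes, and just one character on the screen.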
A consequence is that in implementations of SQL which support
character limits, a definition like VARCHAR(100) is tricky to
understand.  It could mean that the field can take up to 100 bytes of
storage.  But it might mean 200 bytes of storage for a UTF-16 string,
or even 100 Unicode code points, which at up to four bytes each could
take 400 bytes, and if "100 characters" means 100 things like the Z
above, more still.
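For what it's worth, SQLite's answer is "none of the above": the
number in the declaration is ignored entirely, since the type name
only sets the column's affinity.  A sketch:

    CREATE TABLE t (name VARCHAR(10));
    INSERT INTO t VALUES ('considerably longer than ten characters');
    SELECT length(name) FROM t;
    -- 39: nothing was truncated

Other engines do enforce the limit, each with its own meaning, which
is exactly why it matters which one your engine picks.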
I would definitely be reading the documentation for the SQL engine I
was using.

Simon.