On 25 Apr 2016, at 1:42am, James K. Lowden <jklowden at schemamania.org> wrote:

> Simon Slavin <slavins at bigfraud.org> wrote:
> 
>> Another reason is that we use Unicode not ASCII/SIXBIT/EBCDIC, and in
>> Unicode different characters take different numbers of bytes.  So
>> even if you're storing a fixed number of bytes the convenience of
>> always knowing exactly how many characters to display no longer
>> exists.
> 
> These are different concerns, and they don't really pose any
> difficulty.  Given an encoding, a column of N characters can take up to 
> x * N bytes.  Back in the day, "x" was 1.  Now it's something else.  No
> big deal.  

No.  Unicode uses different numbers of bytes to store different characters.  
You cannot tell from the number of bytes in a string how many characters it 
encodes, and the programming required to work out the string length is 
complicated.  The combination of three code points (six bytes in either 
UTF-8 or UTF-16)

U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
U+0327 COMBINING CEDILLA
U+0308 COMBINING DIAERESIS

is rendered as a capital Z with a diaeresis above it, a stroke through the 
middle of it and a cedilla below it.  To the reader that is one character, 
taking up the same horizontal space as a plain capital Z, which needs just 
one byte in UTF-8 (two in UTF-16).
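
To make the numbers concrete, here is a quick Python sketch (Python chosen 
purely for illustration):

    # Z with stroke + combining cedilla + combining diaeresis
    s = "\u01b5\u0327\u0308"

    print(len(s))                      # 3 code points
    print(len(s.encode("utf-8")))      # 6 bytes in UTF-8
    print(len(s.encode("utf-16-le")))  # 6 bytes in UTF-16
    print(len("Z".encode("utf-8")))    # 1 byte for a plain Z in UTF-8

    # Counting what the user actually sees as "one character" (a grapheme
    # cluster) needs extra logic, e.g. the third-party regex module:
    #   import regex
    #   len(regex.findall(r"\X", s))   # 1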

A consequence is that in SQL implementations which enforce character limits, 
a declaration like VARCHAR(100) is tricky to interpret.  It could mean that 
the field can take up to 100 bytes of storage.  But it might mean 200 bytes 
of storage for a UTF-16 string, or 100 Unicode code points, which could take 
up to 400 bytes (and far more than that if "character" is taken to mean a 
grapheme cluster with combining marks).  I would definitely be reading the 
documentation for the SQL engine I was using.
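
Again a rough Python sketch, with the column width of 100 assumed just for 
illustration, showing the two extremes:

    # Two extremes for a notional 100-character value
    wide  = "\U0010FFFD" * 100   # 100 code points near the top of Unicode
    plain = "Z" * 100            # 100 plain ASCII letters

    print(len(wide), len(wide.encode("utf-8")), len(wide.encode("utf-16-le")))
    # -> 100 code points, 400 bytes in UTF-8, 400 bytes in UTF-16

    print(len(plain), len(plain.encode("utf-8")), len(plain.encode("utf-16-le")))
    # -> 100 code points, 100 bytes in UTF-8, 200 bytes in UTF-16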

Simon.
