*Background:* UTF-16 is an encoding in which most characters are encoded
in a single 16-bit code unit. Characters outside the basic multilingual
plane (i.e. code points between 0x10000 and 0x10FFFF) require two code
units: a high surrogate between 0xD800 and 0xDBFF, followed by a low
surrogate between 0xDC00 and 0xDFFF. Strings that contain unpaired
surrogates are invalid UTF-16.

*Problem:* SQLite silently accepts invalid UTF-16 strings as arguments to
functions like sqlite3_bind_text16(), but corrupts them when converting
them to UTF-8 and back. A common case where this happens implicitly is
when a database with UTF-8 as its native text encoding is used from a
programming language that represents strings in memory as UTF-16.

Specifically, what happens depends on where the unpaired surrogate
character occurs:

   - *In the middle of the string*: the surrogate is consumed together with
   the following character, so that "fooXbar" may be transformed into "fooYar"
   (note the missing 'b'), where Y is some seemingly random character outside
   the BMP. When read back, it is not obvious that corruption has occurred.
   - *At the end of the string*: the surrogate is consumed and encoded in
   UTF-8, which is technically invalid (UTF-8 is not supposed to encode
   surrogate characters). How this reads back depends on whether the value is
   accessed through sqlite3_column_text() or sqlite3_column_text16(): the
   latter uses READ_UTF8() internally, which detects the invalid encoding and
   substitutes the replacement character 0xFFFD. So the storage is in a
   logically inconsistent state at this point.

I've created a small proof-of-concept that reproduces some of these issues,
here:
https://gist.github.com/maksverver/2b225637186d64878d3e635ef0a4fd18

The problem is caused by the implementation of READ_UTF16 in utf.c
<https://sqlite.org/src/file/src/utf.c>, which blindly assumes that the
input string is valid UTF-16 and doesn't check that surrogates pair up as
required. Although arguably the problem originates with the caller that
passed bad string data to SQLite, it would be better if SQLite detected
and corrected invalid UTF-16 strings during conversion, by replacing
unpaired surrogates with the Unicode replacement character 0xFFFD. This
would be consistent with the behavior of READ_UTF8, and would make SQLite
more robust.

As a concrete suggestion, the macro READ_UTF16LE, which currently looks
like this:

#define READ_UTF16LE(zIn, TERM, c){                                  \
  c = (*zIn++);                                                      \
  c += ((*zIn++)<<8);                                                \
  if( c>=0xD800 && c<0xE000 && TERM ){                               \
    int c2 = (*zIn++);                                               \
    c2 += ((*zIn++)<<8);                                             \
    c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);  \
  }                                                                  \
}

could be changed to something like this:

#define READ_UTF16LE(zIn, TERM, c){                                  \
  c = (*zIn++);                                                      \
  c += ((*zIn++)<<8);                                                \
  if( c>=0xD800 && c<0xE000 ){                                       \
    int c2 = c<0xDC00 && TERM ? (zIn[0] | (zIn[1]<<8)) : 0;          \
    if( c2>=0xDC00 && c2<0xE000 ){                                   \
      zIn += 2;                                                      \
      c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);\
    }else{                                                           \
      c = 0xFFFD;                                                    \
    }                                                                \
  }                                                                  \
}

(And similarly for READ_UTF16BE.)

Kind regards,
Maks Verver.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users