I want to second that. This leads to the situation that data is accepted by the database but there is no way to read that data back or more precisely I get the wrong (i.e. different) data back. I didn't check the suggested patch, but I don't believe it will work in all cases. I'd rather prefer rejecting such strings or implicitly convert them to a BLOB which at least provides a way to get the data back.
Thanks, Detlef. -----Ursprüngliche Nachricht----- Von: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> Im Auftrag von Dennis Snell Gesendet: Montag, 13. Januar 2020 21:57 An: sqlite-users@mailinglists.sqlite.org Betreff: Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data I’d like to raise this issue again and give my support for what Maks Verver recommended in https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg110107.html Independently I came to this bug while working on an issue in Simplenote’s Android app where our data was being corrupted when saved to sqlite inside the Android SDK. We received some invalid UTF-16 sequences and instead of rejecting them or decoding it properly sqlite is further mangling them and introducing more corruption. Example: We have a JSON document like this which we store in a table. {“content”: “\ud83c\udd70\ud83c(null)\udd71”,”tags":[]} The JSON is well-formed but the sequence of UTF-16 code points is invalid. We have fixed our side of the equation which prevents creating this content, but we still receive from time to time the invalid sequence from older client libraries. When sqlite reads this data two types of further corruption occur: reading beyond a code unit subsequence; and conflating high and low surrogates. Reading beyond a code unit subsequence: When the `TERM` was introduced[1] and updated[2] it appears to have been designed to assume that a string ends mid-surrogate but it does not attempt to address unpaired surrogates in the middle of an input text. In our case the `READ_UTF16BE` macro accepts the second `\ud83c` code unit and then consumes the following `\u0028` which is the separate and well-formed “(“. In turn this produces the more corrupted value of `\ud83c\udc28`, code point U+1F028, plus “null)” without the leading “(“. Conflating high and low surrogates: The `READ_UTF16__` macros both attempt to start processing surrogate pairs based on the `0xD800 <= c <= 0xE000` value of the input code unit. Because of this they will pick up on unpaired low surrogates, consume the next character, and then create a more corrupted Unicode string. In our case, once we reach the `\udd71` the macro consumes the following quote, which in the JSON document closes the string, and puts them together as `\udd71\u0022` producing the invalid code point U+6C422. Moreover, because it consumed the string-ending quote it also corrupted the entire JSON document, as the new output resembles the following: {“content”: “\ud83c\udd70\ud83c\udc28ull)\ud971\udc22,”tags”:[]} That is, we write this invalid Unicode sequence but valid JSON document into sqlite and read back an invalid Unicode sequence _and_ invalid JSON (see the missing quote before “tags”). Supporting Unicode spec: The Unicode specification[3] sections 3.2 and 3.9 speak to this situation and provides a specific comparable example: When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. Furthermore, such a process must not treat any adjacent well-formed code unit sequences as being part of those ill-formed code unit sequences. For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because either of those outputs would be the result of interpreting a well-formed subsequence as being part of the ill-formed subsequence. The expected return value for such a process would instead be <U+FFFD, U+0041, U+0042>. Supporing Maks’ suggestion to use the replacement character on error section 23.8[4] provides the guidance: It [U+FFFD] can be substituted for any “unknown” character in another encoding that cannot be mapped in terms of known Unicode characters. It can also be used as one means of indicating a conversion error, when encountering an ill-formed sequence in a conversion between Unicode encoding forms. Patching: The `READ_UTF16__` macros thus should do not only what Maks proposed, which is to verify that the character following a surrogate half is also a surrogate half, but also to verify that we don’t start interpreting a surrogate sequence when encountering an unpaired low surrogate. I propose this change instead: #define READ_UTF16LE(zIn, TERM, c){ c = (*zIn++); c += ((*zIn++)<<8); if( c>=0xDC00 && c<=0xE000 && TERM ) { c = 0xFFFD } else if( c>=0xD800 && TERM ){ int c2 = (zIn[0] | (zIn[1] << 8)); if ( c2>=0xDC00 && c2<0xE000) { zIn += 2; c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10); } else { c = 0xFFFD; } } } This will solve both problem of reading past an ill-formed surrogate sequence and of interpreting an ill-formed surrogate sequence. I love sqlite and I really appreciated how the code is laid out which made it so easy to find this macro in the source and identify the problem. Dennis Snell Automattic, Inc. [1]: https://sqlite.org/src/info/19064d7cea [2]: https://sqlite.org/src/info/a96b4e8c01d167d3 [3]: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf [4]: https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users