Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

Richard Hipp Tue, 14 Jan 2020 10:52:41 -0800

On 1/13/20, Dennis Snell <den...@dennisandmandi.com> wrote:
> We have a JSON document like this which we store in a table.
>
>     {“content”: “\ud83c\udd70\ud83c(null)\udd71”,”tags":[]}
>
>
> The JSON is well-formed but the sequence of UTF-16 code points is invalid.
>
> When sqlite reads this data two types of further corruption


I'm having trouble reproducing this.  The following test script (one
of many) illustrates:

CREATE TABLE t1(j TEXT);
INSERT INTO t1(j) VALUES
    ('{"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}');
SELECT length(json_extract(j,'$.content')) FROM t1;
WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<9)
SELECT x, printf('%x',unicode(substr(json_extract(j,'$.content'),x)))
  FROM t1, c;

The column T1.J is loaded up with your original JSON with the invalid
code points. Then I run json_extract() to pull out the invalid string,
but SQLite says that the length is 9, which I think is the correct
answer.  The second SELECT with the CTE in it loops over each
character and prints out the HEX value for that character.  Here is
what I see:

1|1f170
2|fffd
3|28
4|6e
5|75
6|6c
7|6c
8|29
9|fffd

So the initial surrogate pair was rendered correctly as 0x1f170.  The
\ud83c without the following high surrogate was converted into 0xfffd
(which is the right thing to do, is it not).  Then the 6 ASCII
characters follow.  Finally, the last isolated high-surrogate is
(correctly?) converted into 0xfffd.

What behavior were you expecting?

Is there something that I can be doing differently to make it misbehave?

-- 
D. Richard Hipp
d...@sqlite.org
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

Reply via email to