On 1/13/20, Dennis Snell <den...@dennisandmandi.com> wrote: > We have a JSON document like this which we store in a table. > > {“content”: “\ud83c\udd70\ud83c(null)\udd71”,”tags":[]} > > > The JSON is well-formed but the sequence of UTF-16 code points is invalid. > > When sqlite reads this data two types of further corruption
I'm having trouble reproducing this. The following test script (one of many) illustrates: CREATE TABLE t1(j TEXT); INSERT INTO t1(j) VALUES ('{"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}'); SELECT length(json_extract(j,'$.content')) FROM t1; WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<9) SELECT x, printf('%x',unicode(substr(json_extract(j,'$.content'),x))) FROM t1, c; The column T1.J is loaded up with your original JSON with the invalid code points. Then I run json_extract() to pull out the invalid string, but SQLite says that the length is 9, which I think is the correct answer. The second SELECT with the CTE in it loops over each character and prints out the HEX value for that character. Here is what I see: 1|1f170 2|fffd 3|28 4|6e 5|75 6|6c 7|6c 8|29 9|fffd So the initial surrogate pair was rendered correctly as 0x1f170. The \ud83c without the following high surrogate was converted into 0xfffd (which is the right thing to do, is it not). Then the 6 ASCII characters follow. Finally, the last isolated high-surrogate is (correctly?) converted into 0xfffd. What behavior were you expecting? Is there something that I can be doing differently to make it misbehave? -- D. Richard Hipp d...@sqlite.org _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users