Hi List,
Due to a bug in our own code we inserted a string with embedded nul character
in the database. This caused strange behavior when we used the built-in
function instr() on this data. As part of the analysis of this issue I
investigated how the builtin functions handle strings with embedded nul
characters. I want to share my findings.
Although I don't find in the sqlite documentation if a nul character is
special, and ends a string, the specification and implementation of some string
related functions seem to suggest this. However, the behaviour of string
functions is not always consistent.
For the tests below I assumed that the intention is that a string ends at the
first embedded nul character.
Summary:
- The instr() function returns incorrect results.
- The rtrim() and replace() functions return incorrect results, but this is
usually not noticeable.
- builtin functions not consistent in truncation after embedded nul character
Please document that functions that operate on string arguments process the
data till the first embedded nul character, if any, and (may) truncate excess
data (or otherwise that string data may/should not contain embedded nul
character).
Please adapt the implementation of instr() with string arguments to stop
searching after an embedded nul character. No strong opinion about desired
behaviour of rtrim() and replace().
Details:
Below my test queries (sqlite version 3.11.1), executed on TEXT data and BLOB
data. Most queries produce also HEX-output to show the exact output. Annotated
output added as comments.
.mode lines
CREATE TABLE test(t TEXT, b BLOB);
INSERT INTO test VALUES ( 'A' || x'00' || 'B ',
CAST('A' || x'00' || 'B ' AS BLOB));
-- show actual data value and type of data
SELECT t, hex(t), typeof(t),
b, hex(b), typeof(b)
FROM test;
-- t = A
-- hex(t) = 41004220 -- OK. contains embedded nul character
-- typeof(t) = text
-- b = A
-- hex(b) = 41004220 -- OK. contains embedded nul character
-- typeof(b) = blob
SELECT length(t),
length(b)
FROM test;
-- length(t) = 1 -- OK. till first embedded nul character
-- length(b) = 4 -- OK. size of entire blob
SELECT lower(t),
hex(lower(t)),
lower(b),
hex(lower(b))
FROM test;
-- lower(t) = a
-- hex(lower(t)) = 61 -- OK. String is assumed to end at embedded nul
character.
-- lower(b) = a
-- hex(lower(b)) = 61 -- OK. Blob is interpreted as string (suggested in spec),
so processed till embedded nul character.
SELECT rtrim(t),
hex(rtrim(t)),
rtrim(b),
hex(rtrim(b))
FROM test;
-- rtrim(t) = A
-- hex(rtrim(t)) = 410042 -- NOT OK. Data after embedded nul character is
considered string content. Changed part is usually
-- -- not noticed if resulting data is
processed as string (so upto embedded nul char)
-- -- Note that output is not truncted at nul
character like lower() does.
-- rtrim(b) = A
-- hex(rtrim(b)) = 410042 -- Blob is interpreted as string. Same behaviour as
string.
SELECT quote(t),
hex(quote(t)),
quote(b),
hex(quote(b))
FROM test;
-- quote(t) = 'A'
-- hex(quote(t)) = 274127 -- OK. As specified. String till
first embedded nul character quoted
-- quote(b) = X'41004220' -- OK. As specified. Hex
representation of entire string
-- hex(quote(b)) = 5827343130303432323027
SELECT replace(t, 'B', 'C'),
hex(replace(t, 'B', 'C')),
replace(b, 'B', 'C'),
hex(replace(b, 'B', 'C'))
FROM test;
-- replace(t, 'B', 'C') = A
-- hex(replace(t, 'B', 'C')) = 41004320 -- NOT OK. Replaces also characters
after embedded nul. Chnaged part is usually not noticed if
-- -- resulting data is processed
as string (so upto embedded nul char)
-- -- Note that output is not
truncated at embedded nul character like lower() does.
-- replace(b, 'B', 'C') = A
-- hex(replace(b, 'B', 'C')) = 41004320 -- OK. Replaces in entire data
SELECT substr(t, 1, 10),
hex(substr(t, 1, 10)),
substr(b, 1, 10),
hex(substr(b, 1, 10)),
substr(t, 3, 10),
hex(substr(t, 3, 10)),
substr(b, 3, 10),
hex(substr(b, 3, 10))
FROM test;
-- substr(t, 1, 10) = A
-- hex(substr(t, 1, 10)) = 41 -- OK. Till embedded nul character
-- substr(b, 1, 10) = A
-- hex(substr(b, 1, 10)) = 41004220 -- OK. Entire data
-- substr(t, 3, 10) =
-- hex(substr(t, 3, 10)) = -- OK. Pos 3 is after emdedded nul
character, so not part of string
-- substr(b, 3, 10) = B
-- hex(substr(b, 3, 10)) = 4220 -- Ok substitute in all data
SELECT instr(t, 'B'),
instr(b, 'B')
FROM test;
-- instr(t, 'B') = 3 -- NOT OK. String ends at embedded nul
character. This violates the specification, which
-- -- explicitly says that that
arguments are treated as strings (so up to emdedded nul).
-- -- Should return 0 (=not found).
-- instr(b, 'B') = 3 -- OK. search in all data
Met Vriendelijke Groet, Kind Regards, 谨致问候,
Rob
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users