Re: [sqlite] UTF8 and NUL

J Decker Fri, 26 Jan 2018 13:05:49 -0800

On Fri, Jan 26, 2018 at 11:41 AM, Peter Da Silva <
[email protected]> wrote:

> On 1/26/18, 1:37 PM, "sqlite-users on behalf of J Decker" <
> [email protected] on behalf of [email protected]>
> wrote:
> >    doesn't get 26 either. 0x1a
>
> 26 isn't EOF, it's SUB (substitute). It was used to represent
> untranslatable characters when converting (for example) EBCDIC to ASCII.
>
I gave up ever using "rt" or "wt" because it IS EOF, depending on the
system.
I bet Windows command-line tools still honor it, because copy still has /B
and /A on Windows 10.

(interject, edit:  The effect of */a* depends on its position in the
command-line string. When */a* follows *Source*, *copy* treats the file as
an ASCII file and copies data that precedes the first end-of-file character.
https://en.wikipedia.org/wiki/End-of-file

"In Microsoft's DOS <https://en.wikipedia.org/wiki/DOS> and Windows
<https://en.wikipedia.org/wiki/Microsoft_Windows> (and in CP/M
<https://en.wikipedia.org/wiki/CP/M> and many DEC
<https://en.wikipedia.org/wiki/Digital_Equipment_Corporation> operating
systems), reading from the terminal will never produce an EOF. Instead,
programs recognize that the source is a terminal (or other "character
device") and interpret a given reserved character or sequence as an
end-of-file indicator; most commonly this is an *ASCII
<https://en.wikipedia.org/wiki/ASCII> Control-Z
<https://en.wikipedia.org/wiki/Substitute_character>, code 26.* Some
MS-DOS programs, including parts of the Microsoft MS-DOS shell (COMMAND.COM
<https://en.wikipedia.org/wiki/COMMAND.COM>) and operating-system utility
programs (such as EDLIN <https://en.wikipedia.org/wiki/EDLIN>), treat a
Control-Z in a text file as marking the end of meaningful data, and/or
append a Control-Z to the end when writing a text file. This was done for
two reasons:"

... ASCII <https://en.wikipedia.org/wiki/ASCII> Control-Z
<https://en.wikipedia.org/wiki/Substitute_character>, code 26. ....
)

I understand 0xFF was probably a good choice on punch cards, because you
could just knock out all the holes to make a correction; and that could be
an EOF on other systems, unless something like O_BINARY was used.

So now we just treat files as binary, get the length from the system, and
don't expect any transformations on our data.

------
More on my point though

SQLite returns result values through sqlite3_column_text(stmt,n) and
sqlite3_column_bytes(stmt,n), so any data, including NUL, whether it came
from bound values or otherwise, is returned.
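
A minimal sketch of that round trip, using Python's standard sqlite3 module
rather than the C API discussed here. The table name `t` and the sample
string are made up for illustration. It assumes the Python binding reads
text columns with an explicit byte count, which is how it behaves in
Python 3 as far as I know.

```python
import sqlite3

# Hypothetical in-memory table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(a TEXT)")

original = "ab\x00cd"  # text value with an embedded NUL
conn.execute("INSERT INTO t(a) VALUES (?)", (original,))

# Fetch the text itself plus its stored size in bytes (via a BLOB cast,
# because length() on text counts characters, not bytes).
fetched, stored_bytes = conn.execute(
    "SELECT a, length(CAST(a AS BLOB)) FROM t"
).fetchone()

print(repr(fetched), stored_bytes)
```

The bytes after the NUL survive both the bind and the fetch; whether a
caller sees them depends on whether it uses the byte count or stops at the
first NUL.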

strcmp() would have an issue.  Even StrNCmp() would; really you need a
comparison that includes the length of both strings.
Strlen is used constantly to find the lengths of column, table, and function
names, for things that should already be known.  It's not like there's a lot
of copying of those; the net effect is more speed, because it's not even a
'strlen' that can be auto-intrinsic-inlined, but a fancy function that
sanitizes the length (sqlite3StrLen()).
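
To illustrate why a NUL-terminated comparison is not enough, here is a
small sketch. Both helper names are hypothetical; `c_style_equal` only
mimics what strcmp() sees, it is not SQLite code.

```python
def c_style_equal(a: bytes, b: bytes) -> bool:
    """Mimic strcmp(a, b) == 0: only bytes before the first NUL are compared."""
    return a.split(b"\x00", 1)[0] == b.split(b"\x00", 1)[0]


def length_aware_equal(a: bytes, b: bytes) -> bool:
    """Compare using the full, already-known length of both buffers."""
    return len(a) == len(b) and a == b


left = b"key\x00one"
right = b"key\x00two"

print(c_style_equal(left, right))       # True: both look like "key"
print(length_aware_equal(left, right))  # False: the tails differ
```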

The SQL functions that deal with strings:
LENGTH, RTRIM, LTRIM, QUOTE....
MySQL returns bytes for length.  SQLite returns characters, and all string
functions work on characters, which means SQLite has to understand UTF-8
characters....
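
A short sketch of the character-versus-byte distinction on the SQLite side,
again through Python's sqlite3 module. Casting to BLOB is just one way to
get a byte count; it is my choice for the illustration, not something from
the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
text = "Как дела"  # 7 Cyrillic letters (2 bytes each in UTF-8) plus a space

# length() on TEXT counts characters; length() on a BLOB counts bytes.
chars, nbytes = conn.execute(
    "SELECT length(?), length(CAST(? AS BLOB))", (text, text)
).fetchone()

print(chars, nbytes)
```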

I wouldn't use any of those functions except in a one-off script, because
they are non-portable.  But they are non-conformant because they do support
a basic way of skipping UTF characters....  0x9X by itself is also not a
valid UTF-8 character (it's a continuation byte that had no lead-in length).
So that makes even the Unicode escapes in the range of 0x9X available
to encode as bytes, kinda OOB with the data.

Invalid characters (overlong and otherwise) should be replaced with U+FFFD:
http://unicode.org/pipermail/unicode/2017-May/005522.html (from this
thread, sort of; it was on ill-formed UTF-8, really a part of this thread,
but I didn't find it)
https://www.fileformat.info/info/unicode/char/fffd/index.htm
Comments: used to replace an incoming character whose value is unknown or
unrepresentable in Unicode;
compare the use of U+001A
<https://www.fileformat.info/info/unicode/char/001a/index.htm> as a control
character to indicate the substitute function
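
As an illustration of the U+FFFD replacement, here is how Python's own
UTF-8 decoder treats a stray continuation byte and an overlong sequence.
This shows one decoder's policy, not what SQLite does; the helper names
are hypothetical.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True only if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_lenient(raw: bytes) -> str:
    """Decode, substituting U+FFFD for each ill-formed byte sequence."""
    return raw.decode("utf-8", errors="replace")


stray_continuation = b"A\x90B"  # 0x90 is a continuation byte with no lead byte
overlong_nul = b"\xc0\x80"      # overlong (ill-formed) encoding of U+0000

for raw in (stray_continuation, overlong_nul):
    print(raw, is_well_formed_utf8(raw),
          [hex(ord(ch)) for ch in decode_lenient(raw)])
```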

(I would have said, "0xFEFF ? ZWNBSP zero width non breaking space(?) EF BB
BF "  but went and searched and found it was different than I thought )
A quick note about UTF8; every byte has one bit off.
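
The "one bit off" remark can be checked by brute force: encode every
Unicode scalar value and collect the byte values that actually occur. A
sketch (it loops over about 1.1 million code points, so it takes a moment):

```python
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates cannot be encoded as UTF-8
    used.update(chr(cp).encode("utf-8"))

# Byte values that never appear in well-formed UTF-8.
never_used = sorted(set(range(256)) - used)
print([hex(b) for b in never_used])

# U+FEFF (the BOM / ZWNBSP mentioned above) encodes as EF BB BF.
print("\ufeff".encode("utf-8").hex(" ").upper())  # hex(sep) needs Python 3.8+
```

Since 0xFF is among the bytes that never occur, every byte of well-formed
UTF-8 has at least one bit clear.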

My initial impression was that SQLite shouldn't care, being basically a
smart storage engine, since what I put in I could get back out.  Having
patched the input side to escape ' and NUL in string values, I don't need my
larger patch.
But then, having looked through so much of the string handling, the overall
effect is still positive.

Then there's internal logging and analysis, which should also escape the
output for strings; there IS a SQL way to include char(0).
sqlite3_column_text can't really be changed at this point, which means no
matter how much it is enforced, and however hard it is made, to not count 0
as a character, it doesn't matter, because it still will be.
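
A sketch of the SQL route to an embedded NUL, using the char() function and
the || operator (char() needs a reasonably recent SQLite; I am assuming
3.7.16 or later). The literal 'a' and 'b' values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Build a text value containing NUL purely in SQL, then read it back
# together with its size in bytes.
value, nbytes = conn.execute(
    "SELECT 'a' || char(0) || 'b',"
    "       length(CAST('a' || char(0) || 'b' AS BLOB))"
).fetchone()

print(repr(value), nbytes)
```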

( How are you?  Как дела?)

sqlite3 test.db
create table test(a);
insert into test (a) values ('Как дела');
select length(a),a from test;

8|??? ????

(bytes in db) 02 1D 3F 3F 3F 20 3F 3F 3F 3F        ◙♂☻↔??? ????

(or done with sqlite3 test.db < test.sql ) where test.sql was the above...
(on terminal)
8|╨Ü╨░╨║ ╨┤╨╡╨╗╨░

(sqlite3 test.db < test.sql > test.out)
8|Как дела

(bytes in db) 11 0E 02 2B D0 9A D0 B0 D0 BA 20 │ D0 B4 D0 B5 D0 BB D0 B0
◄♫☻+ÐšÐ°Ðº Ð´ÐµÐ»Ð°
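
The garbled outputs above can be reproduced by decoding the UTF-8 bytes
with the wrong code page. The sketch below assumes the terminal was using
code page 437 and the hex viewer Windows-1252; that is my guess from the
characters shown, not something stated in the message. It also reads the
0x1D and 0x2B bytes in the dumps as SQLite record serial types (odd values
of 13 or more mean text of (N-13)/2 bytes), which is likewise my
interpretation; the helper name is hypothetical.

```python
text = "Как дела"
utf8 = text.encode("utf-8")

print(utf8.hex(" ").upper())  # bytes.hex(sep) needs Python 3.8+
print(utf8.decode("cp437"))   # what a CP437 console would display
print(utf8.decode("cp1252"))  # what a Windows-1252 viewer would display


def text_len_from_serial_type(serial_type: int) -> int:
    """Byte length of a TEXT value given its SQLite record serial type."""
    if serial_type < 13 or serial_type % 2 == 0:
        raise ValueError("not a text serial type")
    return (serial_type - 13) // 2


print(text_len_from_serial_type(0x1D))  # first dump: '?' substituted text
print(text_len_from_serial_type(0x2B))  # second dump: real UTF-8 text
```

Under that reading, the first insert stored 8 bytes of '?' (the console
had already lost the Cyrillic), while the redirected script stored the
full 15-byte UTF-8 string.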

_______________________________________________

sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
