Ariel, Keith, Rowan, apologies, I re-ran the queries a few times, and I decided to include "rowid" to keep track of the changes.
The experiments were conducted by cut-and-paste of the í character from the email, hence UTF8, and using x'...' for inserts and concats. Note, I use sqlite3 shell by preference, but I use both the Firefox addon and "DB Browser for Sqlite" for GUI convenience - however, for inserts and updates I use the sqlite3 shell or the C programming interface. In C, I use int rather than char - FILE *pinfile = NULL; ... pinfile = fopen(argv[1],"rb"); .... int ch = fgetc (pinfile); /* changed from char to int to allow >127 & UTF */ Also, I use .mode csv and then a spreadsheet quite a lot. Note, at the end, I added typeof( ) - and most were BLOBs and a couple as TEXT. kevin@kevin-Aspire-V5-571G:~$ sqlite3 dir_md5sum_db.sqlite SQLite version 3.15.2 2016-11-28 19:13:37 Enter ".help" for usage hints. sqlite> SELECT * FROM dir_md5sum ...> where rowid >= 194576; kev| kev2| kev3| kev4| sqlite> insert into dir_md5sum values ( 'kev5', x'C3AD'); sqlite> SELECT * FROM dir_md5sum where rowid >= 194576; kev| kev2| kev3| kev4| kev5|í sqlite> .schema CREATE TABLE dir_md5sum (dir_name text, dir_md5sum text); sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum), unicode(dir_md5sum) FROM dir_md5sum where rowid >= 194576; 194576|kev|6B6576|�|EE|65533 194577|kev2|6B657632|�|EE|65533 194578|kev3|6B657633|�|EE|65533 194579|kev4|6B657634|�|ED|65533 194580|kev5|6B657635|í|C3AD|237 sqlite> insert into dir_md5sum values ( 'kev6', 'a' || x'C3AD' || 'b' ); sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum), unicode(dir_md5sum) FROM dir_md5sum where rowid >= 194576; 194576|kev|6B6576|�|EE|65533 194577|kev2|6B657632|�|EE|65533 194578|kev3|6B657633|�|EE|65533 194579|kev4|6B657634|�|ED|65533 194580|kev5|6B657635|í|C3AD|237 194581|kev6|6B657636|aíb|61C3AD62|97 sqlite> insert into dir_md5sum values ( 'kev7', 'c' || x'00ED' || 'd' ); sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum), unicode(dir_md5sum) FROM dir_md5sum where rowid >= 194576; 194576|kev|6B6576|�|EE|65533 194577|kev2|6B657632|�|EE|65533 194578|kev3|6B657633|�|EE|65533 194579|kev4|6B657634|�|ED|65533 194580|kev5|6B657635|í|C3AD|237 194581|kev6|6B657636|aíb|61C3AD62|97 194582|kev7|6B657637|c|6300ED64|99 sqlite> insert into dir_md5sum values ( 'kev8', x'00ED' ); sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum), unicode(dir_md5sum) FROM dir_md5sum where rowid >= 194576; 194576|kev|6B6576|�|EE|65533 194577|kev2|6B657632|�|EE|65533 194578|kev3|6B657633|�|EE|65533 194579|kev4|6B657634|�|ED|65533 194580|kev5|6B657635|í|C3AD|237 194581|kev6|6B657636|aíb|61C3AD62|97 194582|kev7|6B657637|c|6300ED64|99 194583|kev8|6B657638||00ED| sqlite> .mode csv sqlite> .once /home/kevin/Martin.csv sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum), unicode(dir_md5sum) FROM dir_md5sum where rowid >= 194576; sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum), unicode(dir_md5sum), typeof(dir_md5sum) FROM dir_md5sum where rowid >= 194576; 194576,kev,6B6576,"�",EE,65533,blob 194577,kev2,6B657632,"�",EE,65533,blob 194578,kev3,6B657633,"�",EE,65533,blob 194579,kev4,6B657634,"�",ED,65533,blob 194580,kev5,6B657635,"í",C3AD,237,blob 194581,kev6,6B657636,"aíb",61C3AD62,97,text 194582,kev7,6B657637,c,6300ED64,99,text 194583,kev8,6B657638,"",00ED,,blob sqlite> regs, Kev Date: Mon, 19 Dec 2016 11:12:59 +0800 From: Rowan Worth <row...@dug.com> To: SQLite mailing list <sqlite-users@mailinglists.sqlite.org> Subject: Re: [sqlite] Weird chars inserted Message-ID: <CAJtcO2TT1SLwU0DrvE5GQo4o-VFWHa8Oh3JuP7Vh0rT+QEDRqw@mail.gmail .com> Content-Type: text/plain; charset=UTF-8 On 19 December 2016 at 08:24, Kevin <kyou...@gmail.com> wrote: > Hi Martin, > > I had a go using a terminal session, with default encoding UTF-8. > > Try using the hex( ) and unicode( ) functions to check what is actually > stored in the sqlite table. > > I put a couple of rows at the end of an existing simple table.... > > kevin@kevin-Aspire-V5-571G:~$ sqlite3 /home/kevin/dir_md5sum_db.sqlite > SQLite version 3.15.2 2016-11-28 19:13:37 > Enter ".help" for usage hints. > sqlite> SELECT dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum), > unicode(dir_md5sum) FROM dir_md5sum > ...> where rowid >= 194576; > 194576|kev|6B6576|í|C3AD|237 > 194577|kev2|6B657632|�|ED|65533 > sqlite> .quit > kevin@kevin-Aspire-V5-571G:~$ > Hi Kevin, The problem here lies in whatever inserted these rows. sqlite just stores what it is given - it is up to the application to take care of encoding issues. In this case the "kev" row has been inserted using utf-8 encoding, so when you retrieve this value sqlite emits the bytes 0xC3 0xAD (exactly as they were stored), which your terminal interprets as utf-8 and renders the character í. The "kev2" row however is not utf-8 encoded. The dir_md5sum column contains a single byte 0xED, which is not valid utf-8 (the encoding specifies that when the highest bit is on, there is more information about the current character in the next byte). When you retrieve this value sqlite emits the byte 0xED (exactly as it was stored). Your terminal tries to interpret this as utf-8, but since it is not valid it instead inserts a unicode replacement character (U+FFFD). Sqlite3's unicode() function makes the same U+FFFD replacement when encountering an invalid encoding which is where the 65533 comes from. -Rowan _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users