Ariel, Keith, Rowan, 

apologies, I re-ran the queries a few times, and I decided to include
"rowid" to keep track of the changes.



The experiments were conducted by cut-and-paste of the í character from
the email, hence UTF8, and using x'...' for inserts and concats.

Note, I use sqlite3 shell by preference, but I use both the Firefox
addon and "DB Browser for Sqlite" for GUI convenience - however, for
inserts and updates I use the sqlite3 shell or the C programming
interface. 

In C, I use int rather than char -

FILE *pinfile = NULL;
...
pinfile = fopen(argv[1],"rb");

....

int ch = fgetc (pinfile); 
/* changed from char to int to allow >127 & UTF */


Also, I use .mode csv and then a spreadsheet quite a lot.

Note, at the end, I added typeof( ) - and most were BLOBs and a couple
as TEXT.




kevin@kevin-Aspire-V5-571G:~$ sqlite3 dir_md5sum_db.sqlite
SQLite version 3.15.2 2016-11-28 19:13:37
Enter ".help" for usage hints.
sqlite> SELECT * FROM dir_md5sum
   ...> where rowid >= 194576;
kev|
kev2|
kev3|
kev4|
sqlite> insert into dir_md5sum values ( 'kev5', x'C3AD');
sqlite> SELECT * FROM dir_md5sum where rowid >= 194576;
kev|
kev2|
kev3|
kev4|
kev5|í
sqlite> .schema
CREATE TABLE dir_md5sum (dir_name text, dir_md5sum text);
sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum,
hex(dir_md5sum), unicode(dir_md5sum)  FROM dir_md5sum where rowid >=
194576;
194576|kev|6B6576|�|EE|65533
194577|kev2|6B657632|�|EE|65533
194578|kev3|6B657633|�|EE|65533
194579|kev4|6B657634|�|ED|65533
194580|kev5|6B657635|í|C3AD|237
sqlite> insert into dir_md5sum values ( 'kev6', 'a' || x'C3AD' || 'b'
);
sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum,
hex(dir_md5sum), unicode(dir_md5sum)  FROM dir_md5sum where rowid >=
194576;
194576|kev|6B6576|�|EE|65533
194577|kev2|6B657632|�|EE|65533
194578|kev3|6B657633|�|EE|65533
194579|kev4|6B657634|�|ED|65533
194580|kev5|6B657635|í|C3AD|237
194581|kev6|6B657636|aíb|61C3AD62|97
sqlite> insert into dir_md5sum values ( 'kev7', 'c' || x'00ED' || 'd'
);
sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum,
hex(dir_md5sum), unicode(dir_md5sum)  FROM dir_md5sum where rowid >=
194576;
194576|kev|6B6576|�|EE|65533
194577|kev2|6B657632|�|EE|65533
194578|kev3|6B657633|�|EE|65533
194579|kev4|6B657634|�|ED|65533
194580|kev5|6B657635|í|C3AD|237
194581|kev6|6B657636|aíb|61C3AD62|97
194582|kev7|6B657637|c|6300ED64|99
sqlite> insert into dir_md5sum values ( 'kev8',  x'00ED'  );
sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum,
hex(dir_md5sum), unicode(dir_md5sum)  FROM dir_md5sum where rowid >=
194576;
194576|kev|6B6576|�|EE|65533
194577|kev2|6B657632|�|EE|65533
194578|kev3|6B657633|�|EE|65533
194579|kev4|6B657634|�|ED|65533
194580|kev5|6B657635|í|C3AD|237
194581|kev6|6B657636|aíb|61C3AD62|97
194582|kev7|6B657637|c|6300ED64|99
194583|kev8|6B657638||00ED|
sqlite> .mode csv
sqlite> .once /home/kevin/Martin.csv
sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum,
hex(dir_md5sum), unicode(dir_md5sum)  FROM dir_md5sum where rowid >=
194576;
sqlite> SELECT rowid, dir_name, hex(dir_name), dir_md5sum,
hex(dir_md5sum), unicode(dir_md5sum), typeof(dir_md5sum)  FROM
dir_md5sum where rowid >= 194576;
194576,kev,6B6576,"�",EE,65533,blob
194577,kev2,6B657632,"�",EE,65533,blob
194578,kev3,6B657633,"�",EE,65533,blob
194579,kev4,6B657634,"�",ED,65533,blob
194580,kev5,6B657635,"í",C3AD,237,blob
194581,kev6,6B657636,"aíb",61C3AD62,97,text
194582,kev7,6B657637,c,6300ED64,99,text
194583,kev8,6B657638,"",00ED,,blob
sqlite> 

regs,

Kev


Date: Mon, 19 Dec 2016 11:12:59 +0800
From: Rowan Worth <row...@dug.com>
To: SQLite mailing list <sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] Weird chars inserted
Message-ID:
        <CAJtcO2TT1SLwU0DrvE5GQo4o-VFWHa8Oh3JuP7Vh0rT+QEDRqw@mail.gmail
.com>
Content-Type: text/plain; charset=UTF-8

On 19 December 2016 at 08:24, Kevin <kyou...@gmail.com> wrote:

> Hi Martin,
>
> I had a go using a terminal session, with default encoding UTF-8.
>
> Try using the hex( ) and unicode( ) functions to check what is
actually
> stored in the sqlite table.
>
> I put a couple of rows at the end of an existing simple table....
>
> kevin@kevin-Aspire-V5-571G:~$ sqlite3
/home/kevin/dir_md5sum_db.sqlite
> SQLite version 3.15.2 2016-11-28 19:13:37
> Enter ".help" for usage hints.
> sqlite> SELECT dir_name, hex(dir_name), dir_md5sum, hex(dir_md5sum),
> unicode(dir_md5sum)  FROM dir_md5sum
>    ...> where rowid >= 194576;
> 194576|kev|6B6576|í|C3AD|237
> 194577|kev2|6B657632|�|ED|65533
> sqlite> .quit
> kevin@kevin-Aspire-V5-571G:~$
>

Hi Kevin,

The problem here lies in whatever inserted these rows. sqlite just
stores
what it is given - it is up to the application to take care of encoding
issues.

In this case the "kev" row has been inserted using utf-8 encoding, so
when
you retrieve this value sqlite emits the bytes 0xC3 0xAD (exactly as
they
were stored), which your terminal interprets as utf-8 and renders the
character í.

The "kev2" row however is not utf-8 encoded. The dir_md5sum column
contains
a single byte 0xED, which is not valid utf-8 (the encoding specifies
that
when the highest bit is on, there is more information about the current
character in the next byte).

When you retrieve this value sqlite emits the byte 0xED (exactly as it
was
stored). Your terminal tries to interpret this as utf-8, but since it
is
not valid it instead inserts a unicode replacement character (U+FFFD).
Sqlite3's unicode() function makes the same U+FFFD replacement when
encountering an invalid encoding which is where the 65533 comes from.

-Rowan
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to