Scott,

On Fri, Jun 24, 2016 at 2:16 PM, Scott Robison <sc...@casaderobison.com> wrote:
> On Fri, Jun 24, 2016 at 12:03 PM, Scott Robison <sc...@casaderobison.com>
> wrote:
>
>> On Windows, when you get a string of characters, you either get an ANSI
>> string using some code page, or you get a wide character string.
>>
>> When you get an ANSI string, it is just a sequence of 8 bit bytes. UTF-8
>> is also a sequence of 8 bit bytes. The meaning / encoding of those 8 bit
>> bytes are very different.
>>
>> SQLite will allow you to write any 8 bit byte sequence you want as a
>> string. It does not attempt to validate the bytes. It will read the bytes
>> back exactly as written. So if you wrote an ANSI string to the database
>> instead of a UTF-8 string, you will get back the ANSI string.
>>
>> This all assumes you're using the UTF-8 functions, which might be more
>> accurately described as byte functions. SQLite databases have an encoding.
>> They store either UTF-8 text or UTF-16 text. If your database is UTF-8 and
>> you use the char/byte based interface, SQLite won't interpret the bytes. If
>> your database is UTF-16 and you use the wide character based interface,
>> SQLite won't interpret the wide characters. It assumes you've given it
>> valid data and will use it as is. This is particularly convenient when
>> dealing with variant columns.
>>
>> If, however, your database is UTF-8 and you use the UTF-16 interface
>> functions, SQLite will attempt to convert the data between UTF-8 & UTF-16.
>> If your database is UTF-16 and you use the UTF-8 interface functions,
>> SQLite will attempt to convert the data. In these cases, it is important to
>> have valid UTF-whatever in the database.
>>
>> It looks to me like, in your case, some program wrote a byte sequence to
>> the database that was not UTF-8. You later read that string back out of the
>> database, and attempt to convert it to a wstring with your C++ code. The
>> byte sequence was not UTF-8, hence the failure.
>>
>> I seem to recall a recent discussion on the list about the shell and
>> console input / output and it not being treated 100% accurately as
>> UTF-whatever. Library internals are, but the IO layer in the shell, not so
>> much.
>>
>> Thus you cannot depend on the shell to translate non-ASCII characters on
>> Windows and write them as UTF-whatever. If using the shell is essential to
>> your process, you can't currently get there from here.
>>
>> Though maybe ... instead of typing ALT+225, try typing ALT+195 ALT+159. In
>> your windows console, that would give you the equivalent byte sequence for
>> that character, compensating for the fact that SQLite doesn't (I believe)
>> transform console input to UTF-8. If I am mistaken on that point, I
>> apologize.
>>
>> If the two alt-code byte sequences create data your C++ code can then
>> process (because it's valid UTF-8), you'll know for certain that the SQLite
>> shell on Windows does not process UTF-8 for console IO, just internally to
>> the database layer.
>>
>
> Okay, rather than guessing, I just did a test from a Windows 10 command
> prompt. I am getting appropriate UTF-8 sequences. Here is my experiment:
>
> I opened a memory database and issued the following commands:
>
> create table test(a text);
> insert into test values('ß'),('▀'),('á'),('ß'); -- for the first value I
> typed ALT+225, then ALT+223, then ALT+0225, then ALT+0223
> select a, hex(a) from test;
>
> Which resulted in four rows of output:
>
> ß|C3A1
> ▀|C39F
> á|C2A0
> ß|C3A1
>
> I'm hoping all these extended characters are handled properly by gmail and
> whatever email program you use.
>
> Windows supports legacy ALT+### codes that map to the legacy code page. It
> also supports ALT+0### which map to Unicode code points. This allows people
> who're accustomed to the ALT+### format to still see the character they
> expect, but translated to the equivalent Unicode code point.
>
> Again, this is with Windows 10. Perhaps you could try a similar sequence to
> what I typed above on your SQLite shell and Windows command prompt version
> and see what you get back.

These are the results of my attempt:

SQLite version 3.9.2 2015-11-02 18:31:45
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> CREATE TABLE abcß▀(id integer primary key, αΓ string);
(In the table name, the first extended character was typed as ALT+225 and the second as ALT+223.)
sqlite> SELECT name, hex(name) FROM sqlite_master;
abcß▀|616263E1DF
sqlite>
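
For comparison, the "bytes in, bytes out" behaviour you described can be reproduced outside the shell. This is only a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database; the function name roundtrip_as_text is made up for illustration. It stores the same five bytes as TEXT without any validation and reads them back unchanged:

```python
import sqlite3

def roundtrip_as_text(raw):
    """Store raw bytes as TEXT in an in-memory database and read them back."""
    con = sqlite3.connect(":memory:")
    # Return TEXT values as bytes instead of decoding them as UTF-8.
    con.text_factory = bytes
    con.execute("CREATE TABLE test(a TEXT)")
    # CAST relabels the blob as text; SQLite does not validate the bytes.
    con.execute("INSERT INTO test VALUES (CAST(? AS TEXT))", (raw,))
    row = con.execute("SELECT a, hex(a), typeof(a) FROM test").fetchone()
    con.close()
    return row

# Same bytes as in the hex(name) output above.
print(roundtrip_as_text(bytes.fromhex("616263E1DF")))
```

With the default text_factory, Python tries to decode the value as UTF-8 and fails, much like my wstring conversion does.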

So the question now is: what encoding is that value in, so that it can
be converted to a wstring successfully?

It is not UTF-8, it is not UTF-16, and it is definitely not ASCII.
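
One way to narrow this down is to try decoding the bytes with a few candidate encodings. The sketch below is in Python and uses its standard codecs; the helper name try_decodings and the list of candidates are my own guesses, not something confirmed in this thread:

```python
# Bytes reported by hex(name) in the session above.
raw = bytes.fromhex("616263E1DF")

def try_decodings(data, codecs=("utf-8", "cp437", "cp850", "cp1252")):
    """Return {codec: decoded text, or None if the bytes are invalid}."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results

for codec, text in try_decodings(raw).items():
    print(codec, repr(text))
```

The bytes are rejected as UTF-8. Decoded as cp437 or cp850 they give back the characters I typed, while cp1252 gives different ones. So the value looks consistent with the console's OEM code page, though that is an assumption about my console, not something I have verified.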

Thank you.

>
> --
> Scott Robison
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users