Scott,

On Fri, Jun 24, 2016 at 2:16 PM, Scott Robison <sc...@casaderobison.com> wrote:
> On Fri, Jun 24, 2016 at 12:03 PM, Scott Robison <sc...@casaderobison.com>
> wrote:
>
>> On Windows, when you get a string of characters, you either get an ANSI
>> string using some code page, or you get a wide character string.
>>
>> When you get an ANSI string, it is just a sequence of 8 bit bytes. UTF-8
>> is also a sequence of 8 bit bytes. The meaning / encoding of those 8 bit
>> bytes are very different.
>>
>> SQLite will allow you to write any 8 bit byte sequence you want as a
>> string. It does not attempt to validate the bytes. It will read the bytes
>> back exactly as written. So if you wrote an ANSI string to the database
>> instead of a UTF-8 string, you will get back the ANSI string.
>>
>> This all assumes you're using the UTF-8 functions, which might be more
>> accurately described as byte functions. SQLite databases have an encoding.
>> They store either UTF-8 text or UTF-16 text. If your database is UTF-8 and
>> you use the char/byte based interface, SQLite won't interpret the bytes. If
>> your database is UTF-16 and you use the wide character based interface,
>> SQLite won't interpret the wide characters. It assumes you've given it
>> valid data and will use it as is. This is particularly convenient when
>> dealing with variant columns.
>>
>> If, however, your database is UTF-8 and you use the UTF-16 interface
>> functions, SQLite will attempt to convert the data between UTF-8 & UTF-16.
>> If your database is UTF-16 and you use the UTF-8 interface functions,
>> SQLite will attempt to convert the data. In these cases, it is important to
>> have valid UTF-whatever in the database.
>>
>> It looks to me like, in your case, some program wrote a byte sequence to
>> the database that was not UTF-8. You later read that string back out of the
>> database, and attempt to convert it to a wstring with your C++ code. The
>> byte sequence was not UTF-8, hence the failure.
>>
>> I seem to recall a recent discussion on the list about the shell and
>> console input / output and it not being treated 100% accurately as
>> UTF-whatever. Library internals are, but the IO layer in the shell, not so
>> much.
>>
>> Thus you cannot depend on the shell to translate non-ASCII characters on
>> Windows and write them as UTF-whatever. If using the shell is essential to
>> your process, you can't currently get there from here.
>>
>> Though maybe ... instead of typing ALT+225, try typing ALT+195 ALT+159. In
>> your windows console, that would give you the equivalent byte sequence for
>> that character, compensating for the fact that SQLite doesn't (I believe)
>> transform console input to UTF-8. If I am mistaken on that point, I
>> apologize.
>>
>> If the two alt-code byte sequences create data your C++ code can then
>> process (because it's valid UTF-8), you'll know for certain that the SQLite
>> shell on Windows does not process UTF-8 for console IO, just internally to
>> the database layer.
>>
>
> Okay, rather than guessing, I just did a test from a Windows 10 command
> prompt. I am getting appropriate UTF-8 sequences. Here is my experiment:
>
> I opened a memory database and issued the following commands:
>
> create table test(a text);
> insert into test values('ß'),('▀'),('á'),('ß'); -- for the first value I
> typed ALT+225, then ALT+223, then ALT+0225, then ALT+0223
> select a, hex(a) from test;
>
> Which resulted in four rows of output:
>
> ß|C3A1
> ▀|C39F
> á|C2A0
> ß|C3A1
>
> I'm hoping all these extended characters are handled properly by gmail and
> whatever email program you use.
>
> Windows supports legacy ALT+### codes that map to the legacy code page. It
> also supports ALT+0### which map to Unicode code points. This allows people
> who're accustomed to the ALT+### format to still see the character they
> expect, but translated to the equivalent Unicode code point.
>
> Again, this is with Windows 10. Perhaps you could try a similar sequence to
> what I typed above on your SQLite shell and Windows command prompt version
> and see what you get back.

These are the results of my attempt:

SQLite version 3.9.2 2015-11-02 18:31:45
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> CREATE TABLE abcß▀(id integer primary key, αΓ string);

The first character was ALT+225, the second ALT+223.

sqlite> SELECT name, hex(name) FROM sqlite_master;
abcß▀|616263E1DF
sqlite>

So now the question is: what encoding is that value, so that it can be
successfully converted to a wstring? It is not UTF-8, it is not UTF-16, and
it's definitely not ASCII.

Thank you.

>
> --
> Scott Robison
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
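
The hex output 616263E1DF can be checked mechanically. What follows is a minimal C++ sketch, not code from this thread; is_valid_utf8 and cp437_to_wide are hypothetical helper names. It assumes the trailing bytes E1 and DF came from the console's OEM code page 437, where those positions are ß and ▀. The thread does not confirm which code page was active, and the lookup covers only those two bytes for illustration. On Windows, MultiByteToWideChar with CP_OEMCP would likely be the more usual route for a full conversion.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8
// (rejects overlong forms, surrogates, and values above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                     { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else return false;                 // stray continuation or invalid lead byte
        if (i + len > n) return false;     // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;                       // overlong
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;  // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;                 // surrogate
        i += len;
    }
    return true;
}

// Hypothetical partial lookup: handles ASCII plus the two code page 437
// bytes seen in this thread (assumption: the console used CP437).
// Any other high byte becomes U+FFFD.
std::wstring cp437_to_wide(const std::string& s) {
    std::wstring out;
    for (unsigned char b : s) {
        if (b < 0x80)       out.push_back(static_cast<wchar_t>(b));
        else if (b == 0xE1) out.push_back(L'\u00DF');  // ß
        else if (b == 0xDF) out.push_back(L'\u2580');  // ▀
        else                out.push_back(L'\uFFFD');  // not covered by this sketch
    }
    return out;
}
```

Calling is_valid_utf8 on the bytes 61 62 63 E1 DF returns false, because E1 opens a three-byte sequence that never completes, while the bytes C3 9F (ALT+195 ALT+159) pass. Under the code page assumption, cp437_to_wide turns the same five bytes into the five wide characters of abcß▀.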