Re: [sqlite] hex and char functions
Thanks Rowan. I’ve picked up bits and pieces from the various replies to get a basic idea of what’s going on. My question should’ve been posted on the c++ builder forum in the first place. Sorry for wasting everyone’s time. Tom From: Rowan Worth<mailto:row...@dug.com> Sent: 09 August 2017 03:29 To: SQLite mailing list<mailto:sqlite-users@mailinglists.sqlite.org> Subject: Re: [sqlite] hex and char functions On 8 August 2017 at 18:32, x wrote: > Why does it always turn out to be my own fault ☹ > Welcome to programming ;) It's very rare in practice to find an actual bug in a library or language, especially one as widely deployed and tested as sqlite. Glad you're sorted, I just wanted to have a quick go at clearing this up: but finding out that the UTF8 code for the UTF16 code \u0085 is in fact > \uc285 has tipped me over the edge > 1. \u0085 is a unicode code-point, 'NEXT LINE (NEL)' 2. \uc285 is a unicode code-point, 'HANGUL SYLLABLE SWIT (솅)' 3. Neither of these is a UTF8 or UTF8 "code" -- UTF8 and UTF16 are different ways of enconding/decoding unicode code-points 4. In UTF8, \u0085 is encoded as two bytes -- 0xc2 0x85 5. In UTF8, \uc285 is encoded as three bytes -- 0xec 0x8a 0x85 HTH, -Rowan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
On 8 August 2017 at 18:32, x wrote: > Why does it always turn out to be my own fault ☹ > Welcome to programming ;) It's very rare in practice to find an actual bug in a library or language, especially one as widely deployed and tested as sqlite. Glad you're sorted, I just wanted to have a quick go at clearing this up: but finding out that the UTF8 code for the UTF16 code \u0085 is in fact > \uc285 has tipped me over the edge > 1. \u0085 is a unicode code-point, 'NEXT LINE (NEL)' 2. \uc285 is a unicode code-point, 'HANGUL SYLLABLE SWIT (솅)' 3. Neither of these is a UTF8 or UTF8 "code" -- UTF8 and UTF16 are different ways of enconding/decoding unicode code-points 4. In UTF8, \u0085 is encoded as two bytes -- 0xc2 0x85 5. In UTF8, \uc285 is encoded as three bytes -- 0xec 0x8a 0x85 HTH, -Rowan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
Just received this on Embarcadero forum. You are assigning an Ansi literal string to a Unicode string so a conversion from the current Ansi locale is occurring. Try assigning a unicode literal: UnicodeString S= L"\u0085"; That L in front of the string makes Code==133 and it’s also 133 inside the udf after retrieval using sqlite3_value_text16. Why does it always turn out to be my own fault ☹ Many thanks to all who contributed. From: x<mailto:tam118...@hotmail.com> Sent: 08 August 2017 09:24 To: SQLite mailing list<mailto:sqlite-users@mailinglists.sqlite.org> Subject: Re: [sqlite] hex and char functions I accidentally deleted the latest 3 replies to this thread in my mail so apologies for not including any content from those mails. Ryan, the error happens immediately in c++ builder. E.g. UnicodeString S=”\u0085”; Int Code=S[1]; Code returns 63 (x3F) which is code for ‘?’ and the question mark is also what shows when I hover the cursor over S during debugging. As u+0085 is a control character, and so has no associated graphic, I can forgive the ‘?’ being printed in its place but why does the underlying Code change? I have posted this question on the c++ builder website and will post any answer here. Simon’s timing with his comment about debugging is spooky. Last night I found that when I try to ‘Evaluate/Modify’ S[1] during debugging I get the following error message E2015 Ambiguity between 'System::UnicodeString::System::UnicodeString::operator [](const int)' and 'System::UnicodeString::System::UnicodeString::operator [](const int)' Ambiguity between X and X?? I had no such problems earlier yesterday but, as it was working fine when I first encountered the ‘\u0085’ problem, I doubt it’s the debugger. I conducted the following experiment in SQLite Expert Pro by creating a couple of new empty databases. select hex(char(65,133,66)) returned '41C28542' when the current db was utf-8 '410085004200' when the current db was utf-16le '004100850042' when the current db was utf-16be The bottom one of those is what I expected when I first ran the select yet I think I'm right in saying my system (windows 10) is little endian. This is a further confusion thrown into the mix for me. As is usually the case. I've little doubt sqlite is coming up with the right answer in each case. Tom ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
I accidentally deleted the latest 3 replies to this thread in my mail so apologies for not including any content from those mails. Ryan, the error happens immediately in c++ builder. E.g. UnicodeString S=”\u0085”; Int Code=S[1]; Code returns 63 (x3F) which is code for ‘?’ and the question mark is also what shows when I hover the cursor over S during debugging. As u+0085 is a control character, and so has no associated graphic, I can forgive the ‘?’ being printed in its place but why does the underlying Code change? I have posted this question on the c++ builder website and will post any answer here. Simon’s timing with his comment about debugging is spooky. Last night I found that when I try to ‘Evaluate/Modify’ S[1] during debugging I get the following error message E2015 Ambiguity between 'System::UnicodeString::System::UnicodeString::operator [](const int)' and 'System::UnicodeString::System::UnicodeString::operator [](const int)' Ambiguity between X and X?? I had no such problems earlier yesterday but, as it was working fine when I first encountered the ‘\u0085’ problem, I doubt it’s the debugger. I conducted the following experiment in SQLite Expert Pro by creating a couple of new empty databases. select hex(char(65,133,66)) returned '41C28542' when the current db was utf-8 '410085004200' when the current db was utf-16le '004100850042' when the current db was utf-16be The bottom one of those is what I expected when I first ran the select yet I think I'm right in saying my system (windows 10) is little endian. This is a further confusion thrown into the mix for me. As is usually the case. I've little doubt sqlite is coming up with the right answer in each case. Tom ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
Actually a maximum of 4 bytes are required to encode a single valid code-point in UTF-8. > On Aug 8, 2017, at 2:44 AM, Jens Alfke wrote: > > >> On Aug 7, 2017, at 8:29 AM, x wrote: >> >> I thought I had learned enough about this string lunacy to get by but >> finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 >> has tipped me over the edge. I assumed they both used the same codes but >> UTF16 allowed some characters UTF8 didn’t have. > > UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent > the same characters as their ASCII equivalents. Beyond that, UTF-8 uses a > sequence of two to five bytes in the range 80-ff to encode a single Unicode > character/code-point. (You can sort of think of this as every byte holding 7 > bits of the actual character number, with its MSB set to 1. It’s not exactly > like that, but close.) > > IMHO UTF-8 is the best general purpose text encoding. Code that works with > ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally > work with UTF-8; the main thing to watch out for tends to be breaking or > trimming strings, because you don’t want to cut part of a multibyte sequence. > UTF-8 is also quite compact for Roman languages (although not non-Roman ones.) > > 16-bit encodings used to seem like a good idea back when Unicode has fewer > than 65,536 characters, so you could assume that one unichar = one character. > Those days are long gone. Now dealing with UTF-16 has all the same problems > of dealing with UTF-8 (i.e. multi-word sequences) without the benefits of > compactness or ASCII compatibility. > > 32-bit encodings are just silly, unless for some reason you really really > have to optimize for speed over size (and even then the added size may well > blow out your CPU caches and negate the speed boost.) > > —Jens > > PS: Apparently C++11 allows Unicode string literals by putting a letter U in > front of the initial quote. The result will be a string of wchar_t. > ___ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
> On Aug 7, 2017, at 8:29 AM, x wrote: > > I thought I had learned enough about this string lunacy to get by but finding > out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped > me over the edge. I assumed they both used the same codes but UTF16 allowed > some characters UTF8 didn’t have. UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent the same characters as their ASCII equivalents. Beyond that, UTF-8 uses a sequence of two to five bytes in the range 80-ff to encode a single Unicode character/code-point. (You can sort of think of this as every byte holding 7 bits of the actual character number, with its MSB set to 1. It’s not exactly like that, but close.) IMHO UTF-8 is the best general purpose text encoding. Code that works with ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally work with UTF-8; the main thing to watch out for tends to be breaking or trimming strings, because you don’t want to cut part of a multibyte sequence. UTF-8 is also quite compact for Roman languages (although not non-Roman ones.) 16-bit encodings used to seem like a good idea back when Unicode has fewer than 65,536 characters, so you could assume that one unichar = one character. Those days are long gone. Now dealing with UTF-16 has all the same problems of dealing with UTF-8 (i.e. multi-word sequences) without the benefits of compactness or ASCII compatibility. 32-bit encodings are just silly, unless for some reason you really really have to optimize for speed over size (and even then the added size may well blow out your CPU caches and negate the speed boost.) —Jens PS: Apparently C++11 allows Unicode string literals by putting a letter U in front of the initial quote. The result will be a string of wchar_t. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
On 7 Aug 2017, at 8:45pm, R Smith wrote: > If it ends up as something else in that database, follow the call stack > chain, see how those look in the debugger initially (which may be different > to what you expect, depending on the debugger display encoding) Note this point very hard. It is always possible that your program is working perfectly, and your debugger is faulty. I’ve wasted a lot of time over the years through believing the debugger after multiple versions of my own code all produced the same, apparently wrong, result. Simon. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
On 2017/08/07 9:01 PM, x wrote: Thanks Ryan for going to the trouble of typing that out. Hope you’re not a one fingered typewriter like myself. The borland related stuff is welcome but I still can’t say I’m any less confused by it all. I’m having a bad day today. I’ve spent most of it trying to fathom this stuff out. Igor & Gunter were correct earlier. The ‘\u0085’ is changed to ‘?’ before the string is anywhere near sqlite. Why I don’t know. It doesn’t seem unreasonable to want to put a Unicode code into a UnicodeString. As regards the hex(char(133)) returning C285, following the posts by Nico and Richard I’m wondering if it’s because I’m using SQLite Expert pro on a database that’s encoded in utf-8. I tried to change the coding to utf-16 to see if I would get a different result but, while the software seemed to accept the request, the request was never completed and no feedback was given aside from both the ‘Apply’ and ‘Cancel’ buttons both being greyed out for hours (it’s only a small database). I’ve had enough for today though. Don't worry, I can type faster than I can think (which isn't very fast, and quite evident from the typos!). I feel your pain though, and hope we can help get you un-discouraged soon. (Imagine that is a word.) One thing to Note: You can only change the Database encoding at the START of making the Database. Once you put stuff in the database, the encoding is set for life. That is an SQLite quirk - but that said, it wouldn't solve your problem, since the encoding problem (by your account) happens before the database is reached. It's real easy to test. Use the lowest ranked interface to simply add a string into the DB. By lowest ranked, I mean before it goes through any of your own code, or other wrapper code. Use direct SQL, such as DB.Execute('INSERT INTO t(v1) VALUES(''Geronimo'');'); (This assumes a table called t with a column v1 of type TEXT. Those are 2 single-quotes btw, not double-quotes). Use some SQLite tool or the command line to check what ended up in that DB. If that works, add some Unicode value, like this: DB.Execute('INSERT INTO t VALUES(''©Geronimo®'');'); If it ends up as something else in that database, follow the call stack chain, see how those look in the debugger initially (which may be different to what you expect, depending on the debugger display encoding), what you are looking for is Change - the moment it changes - you will see which thing is responsible for misinterpreting it, if any. Tell us more when you did the tests. Don't give up! Also - make sure that whatever you read it with reads (and can faithfully display) any unicode characters. Often the fault is not yours. The SQLite CLI for instance can read it correctly, but sometimes your computer stdout uses a codepage that doesn't display it correctly. Mail me off-list if you would like an SQLite DB and table with lots of Unicode test data in it and a correct interpretation of it to test with. Best of luck! Ryan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
Thanks Ryan for going to the trouble of typing that out. Hope you’re not a one fingered typewriter like myself. The borland related stuff is welcome but I still can’t say I’m any less confused by it all. I’m having a bad day today. I’ve spent most of it trying to fathom this stuff out. Igor & Gunter were correct earlier. The ‘\u0085’ is changed to ‘?’ before the string is anywhere near sqlite. Why I don’t know. It doesn’t seem unreasonable to want to put a Unicode code into a UnicodeString. As regards the hex(char(133)) returning C285, following the posts by Nico and Richard I’m wondering if it’s because I’m using SQLite Expert pro on a database that’s encoded in utf-8. I tried to change the coding to utf-16 to see if I would get a different result but, while the software seemed to accept the request, the request was never completed and no feedback was given aside from both the ‘Apply’ and ‘Cancel’ buttons both being greyed out for hours (it’s only a small database). I’ve had enough for today though. Thanks to all who have contributed. From: R Smith<mailto:rsm...@rsweb.co.za> Sent: 07 August 2017 19:33 To: sqlite-users@mailinglists.sqlite.org<mailto:sqlite-users@mailinglists.sqlite.org> Subject: Re: [sqlite] hex and char functions On 2017/08/07 5:29 PM, x wrote: > Apologies, I should have said I was using c++ builder Berlin on windows 10 > and that UnicodeString was UTF16. > > I thought I had learned enough about this string lunacy to get by but finding > out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped > me over the edge. I assumed they both used the same codes but UTF16 allowed > some characters UTF8 didn’t have. > > I’m now wondering if I should go to the trouble of changing my sqlite wrapper > over to communicate with the sqlite utf8 functions rather than the utf16 > ones. Trouble is many of c++ builder’s built in types such as TStringList etc > are utf16. No you shouldn't. UTF16 doesn't have "more" characters than UTF8, and TStringlist is not UTF16 - let me see if I can clear up some bit of the confusion. This next bit is very short and really requires much more study, but I hope I say enough (and correct enough) for you to get the picture a little better. First some (very short) history on the "string lunacy" you refer to. Note that when storing text in any system there are two confusing concepts that are hard to get your head around: Firstly there is the actual Characters, or Character-sets - these are the things referred to as ANSI, Latin, CN-Big5 etc., then there are Character Code Index mappings, these are things such as Code-Pages and the like that basically says stuff like the Uppercase Latin character A has a code of 65 (Hex 41) in the ASCII code-page etc. These may all differ for different code-pages, though there were good overlap. Eventually Unicode intended to save the World by indeed unifying all the code-paging (hence "Unicode") and they did a marvelous job of it - but there were very many real-World characters to cater for, so they have code-point indices much larger than any single or even double-byte character arrray or string can ever contain. Here we Enter the character-encodings. These are things like UTF8, UTF16LE and they specify an encoding, a way to make a sequence of bytes refer to a specific codepoint in a code-space (in typically the Unicode code-point space) that can be much larger than 8 or 16 bits may accommodate. UTF-8 for instance specifies that any byte value less than 128 refers to the first 127 code points, as soon as that final bit (MSB) goes high, it means another byte is needed (or byteS, depending on how many high bits follow the initial) to complete the encoding, and further bytes must specify a 1.0.x.x.x.x.x.x format in turn to ensure consistency and safely lets any reader know as soon as they encounter a high MSB that it is definitely part of a multi-byte UTF8 sequence - which is a brilliant encoding. Although slightly technical, it is very lean, we only escalate bytes when needed, and only as much as is needed. The UTF16 encoding is a bit less technical, we can represent far more code points with a consistent 2 byte setup, but even that is much smaller than the full Unicode world, so UTF16 has specific character ranges (0xd800 to 0xdbff) that requires follow-on double-bytes, also known as "Surrogate pairs" (this is the thing that you said pushed you over the edge, finding that some Unicode characters are represented by 2 double-byte characters, so 4-byte total width). There is much more to be said about all this, but I don't want to take everyone's time and the above is enough to understand the next bit regarding C++ history: One of the great features of the bcc32 compilers of yonder was that they (Borland) embraced strong typing, probably to this day the strongest-typed language around i
Re: [sqlite] hex and char functions
On 2017/08/07 5:29 PM, x wrote: Apologies, I should have said I was using c++ builder Berlin on windows 10 and that UnicodeString was UTF16. I thought I had learned enough about this string lunacy to get by but finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped me over the edge. I assumed they both used the same codes but UTF16 allowed some characters UTF8 didn’t have. I’m now wondering if I should go to the trouble of changing my sqlite wrapper over to communicate with the sqlite utf8 functions rather than the utf16 ones. Trouble is many of c++ builder’s built in types such as TStringList etc are utf16. No you shouldn't. UTF16 doesn't have "more" characters than UTF8, and TStringlist is not UTF16 - let me see if I can clear up some bit of the confusion. This next bit is very short and really requires much more study, but I hope I say enough (and correct enough) for you to get the picture a little better. First some (very short) history on the "string lunacy" you refer to. Note that when storing text in any system there are two confusing concepts that are hard to get your head around: Firstly there is the actual Characters, or Character-sets - these are the things referred to as ANSI, Latin, CN-Big5 etc., then there are Character Code Index mappings, these are things such as Code-Pages and the like that basically says stuff like the Uppercase Latin character A has a code of 65 (Hex 41) in the ASCII code-page etc. These may all differ for different code-pages, though there were good overlap. Eventually Unicode intended to save the World by indeed unifying all the code-paging (hence "Unicode") and they did a marvelous job of it - but there were very many real-World characters to cater for, so they have code-point indices much larger than any single or even double-byte character arrray or string can ever contain. Here we Enter the character-encodings. These are things like UTF8, UTF16LE and they specify an encoding, a way to make a sequence of bytes refer to a specific codepoint in a code-space (in typically the Unicode code-point space) that can be much larger than 8 or 16 bits may accommodate. UTF-8 for instance specifies that any byte value less than 128 refers to the first 127 code points, as soon as that final bit (MSB) goes high, it means another byte is needed (or byteS, depending on how many high bits follow the initial) to complete the encoding, and further bytes must specify a 1.0.x.x.x.x.x.x format in turn to ensure consistency and safely lets any reader know as soon as they encounter a high MSB that it is definitely part of a multi-byte UTF8 sequence - which is a brilliant encoding. Although slightly technical, it is very lean, we only escalate bytes when needed, and only as much as is needed. The UTF16 encoding is a bit less technical, we can represent far more code points with a consistent 2 byte setup, but even that is much smaller than the full Unicode world, so UTF16 has specific character ranges (0xd800 to 0xdbff) that requires follow-on double-bytes, also known as "Surrogate pairs" (this is the thing that you said pushed you over the edge, finding that some Unicode characters are represented by 2 double-byte characters, so 4-byte total width). There is much more to be said about all this, but I don't want to take everyone's time and the above is enough to understand the next bit regarding C++ history: One of the great features of the bcc32 compilers of yonder was that they (Borland) embraced strong typing, probably to this day the strongest-typed language around is Pascal, later Turbo Pascal, Delphi etc. I mention this because you use (apparently) the Borland origin version of C++ and they always had a stronger typed vision than the C++ standard. There is a precise type for everything - but that also came with problems when adapting as times changed. C++ specifically avoided a standard string type for quite a while, which was one of the design mistakes often noted by people reflecting on the early C++ development. Anyway... The first iterations of C++ started out long ago using the convention borrowed from C for pointer strings with null terminators. Indeed, a constant string such as 'Hello World!' today still is a pointer and null terminator setup and the "std::string" type has ways of converting itself to that still. After some time came a standardization in C++ with the "std::string" type. This standard string usually represents an array of 8-bit characters in ASCII encoding. ASCII of course did not last long as the defacto character set definition, and eventually everything went Unicode (as mentioned above). The first adaptations seen was "Wide" types (std::wstring) which allowed a nice range of 65536 character codes per string index "character", same as what Windows went for, but in no way guaranteed compatibility with specific encodings. Later still came things lik
Re: [sqlite] hex and char functions
On Mon, Aug 07, 2017 at 11:45:50AM -0400, Richard Hipp wrote: > On 8/7/17, Nico Williams wrote: > > Internally SQLite3 uses UTF-8. The SQLite3 API lets you deal with > > UTF-16, but this just transcodes to/from UTF-8 internally. > > That is not quite correct. > > SQL statements are always converted into UTF8 for parsing and code > generation. But data can be stored in the database file and processed > as UTF8, UTF16be, or UTF16le. All text content for a single database > must use the same encoding. When creating the database file, set the > encoding before adding any content by running one of: > > PRAGMA encoding('utf-8'); > PRAGMA encoding('utf-16be'); > PRAGMA encoding('utf-16le'); > > See https://www.sqlite.org/pragma.html#pragma_encoding for additional > information. Ah, OK, thanks for the correction! ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
On 8/7/17, Nico Williams wrote: > > Internally SQLite3 uses UTF-8. The SQLite3 API lets you deal with > UTF-16, but this just transcodes to/from UTF-8 internally. > That is not quite correct. SQL statements are always converted into UTF8 for parsing and code generation. But data can be stored in the database file and processed as UTF8, UTF16be, or UTF16le. All text content for a single database must use the same encoding. When creating the database file, set the encoding before adding any content by running one of: PRAGMA encoding('utf-8'); PRAGMA encoding('utf-16be'); PRAGMA encoding('utf-16le'); See https://www.sqlite.org/pragma.html#pragma_encoding for additional information. -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
On Mon, Aug 07, 2017 at 03:29:41PM +, x wrote: > Apologies, I should have said I was using c++ builder Berlin on > windows 10 and that UnicodeString was UTF16. > > I thought I had learned enough about this string lunacy to get by but > finding out that the UTF8 code for the UTF16 code \u0085 is in fact > \uc285 has tipped me over the edge. I assumed they both used the same > codes but UTF16 allowed some characters UTF8 didn’t have. Internally SQLite3 uses UTF-8. The SQLite3 API lets you deal with UTF-16, but this just transcodes to/from UTF-8 internally. So when you SELECT hex(); the hex() function sees UTF-8, not UTF-16. Nico -- ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
Apologies, I should have said I was using c++ builder Berlin on windows 10 and that UnicodeString was UTF16. I thought I had learned enough about this string lunacy to get by but finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped me over the edge. I assumed they both used the same codes but UTF16 allowed some characters UTF8 didn’t have. I’m now wondering if I should go to the trouble of changing my sqlite wrapper over to communicate with the sqlite utf8 functions rather than the utf16 ones. Trouble is many of c++ builder’s built in types such as TStringList etc are utf16. From: Igor Tandetnik<mailto:i...@tandetnik.org> Sent: 07 August 2017 15:49 To: sqlite-users@mailinglists.sqlite.org<mailto:sqlite-users@mailinglists.sqlite.org> Subject: Re: [sqlite] hex and char functions On 8/7/2017 9:38 AM, x wrote: > Related > > Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’. > > What is the ‘C2’ about? Two-byte sequence C2 85 is the UTF-8 encoding of the Unicode codepoint U+0085. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
On 8/7/2017 9:38 AM, x wrote: Related Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’. What is the ‘C2’ about? Two-byte sequence C2 85 is the UTF-8 encoding of the Unicode codepoint U+0085. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
The sqlite char() function returns unicode. Apparently, the encoding for code point 133 is two characters, namely c2 85. You seem to be expecting char() to return ISO characters, which it does not do. Calling sqlite3_value_text16 instructs SQLite to convert the contents of the field into utf16 with native byte order from (assumed) utf8, which may have funny results if the source is not indeed UTF8 but rather ISO or some strange (windows) codepage. -Ursprüngliche Nachricht- Von: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] Im Auftrag von x Gesendet: Montag, 07. August 2017 15:39 An: sqlite-users@mailinglists.sqlite.org Betreff: [sqlite] hex and char functions In c++ I tried to call a sqlite udf using the following sql UnicodeString SQL=“select udf(‘5\u00856’);” Which was prepared using sqlite3_prepare16_v2. (I was experimenting with sending a udf a group of numbers without getting involved with blobs). Debugging the udf I recovered the string (call it Str) using sqlite3_value_text16. Str[1]==’5’ & Str[3]==’6’ but Str[2]==’?’ (or Unicode 3F) I also tried retrieving the string using sqlite3_value_text to return a const unsigned char * but that also seemed to have a problem with values beyond 127. What am I missing? Related Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’. What is the ‘C2’ about? ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ Gunter Hick Software Engineer Scientific Games International GmbH FN 157284 a, HG Wien Klitschgasse 2-4, A-1130 Vienna, Austria Tel: +43 1 80100 0 E-Mail: h...@scigames.at This communication (including any attachments) is intended for the use of the intended recipient(s) only and may contain information that is confidential, privileged or legally protected. Any unauthorized use or dissemination of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender by return e-mail message and delete all copies of the original communication. Thank you for your cooperation. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] hex and char functions
On 8/7/2017 9:38 AM, x wrote: In c++ I tried to call a sqlite udf using the following sql UnicodeString SQL=“select udf(‘5\u00856’);” You are using a narrow string literal to try and hold a Unicode character. You are at the whim of your compiler as to how it represents the latter in the former. My guess is, it tries to convert according to the system default code page (I assume Windows here), and since the character is not in fact representable therein, it's converted to '?'. That is, the character was lost before the program even ran, let alone before SQLite got involved. Also, what's UnicodeString? -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] hex and char functions
In c++ I tried to call a sqlite udf using the following sql UnicodeString SQL=“select udf(‘5\u00856’);” Which was prepared using sqlite3_prepare16_v2. (I was experimenting with sending a udf a group of numbers without getting involved with blobs). Debugging the udf I recovered the string (call it Str) using sqlite3_value_text16. Str[1]==’5’ & Str[3]==’6’ but Str[2]==’?’ (or Unicode 3F) I also tried retrieving the string using sqlite3_value_text to return a const unsigned char * but that also seemed to have a problem with values beyond 127. What am I missing? Related Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’. What is the ‘C2’ about? ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users