On 2017/08/07 5:29 PM, x wrote:
> Apologies, I should have said I was using c++ builder Berlin on windows 10 and
> that UnicodeString was UTF16.
>
> I thought I had learned enough about this string lunacy to get by but finding
> out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped
> me over the edge. I assumed they both used the same codes but UTF16 allowed
> some characters UTF8 didn’t have.
>
> I’m now wondering if I should go to the trouble of changing my sqlite wrapper
> over to communicate with the sqlite utf8 functions rather than the utf16 ones.
> Trouble is many of c++ builder’s built in types such as TStringList etc are
> utf16.

No, you shouldn't. UTF16 doesn't have "more" characters than UTF8, and TStringList is not "UTF16" in the sense you mean (it simply holds strings; the encoding conversions are handled for you) - let me see if I can clear up some of the confusion. This next bit is very short and the subject really deserves much more study, but I hope I say enough (and correctly enough) for you to get the picture a little better.

First, some (very short) history on the "string lunacy" you refer to. When storing text in any system there are two confusing concepts that are hard to get your head around. Firstly there are the actual characters, or character sets - these are the things referred to as ANSI, Latin, CN-Big5 etc. Then there are character code index mappings - things such as code pages - which basically say stuff like "the uppercase Latin character A has a code of 65 (hex 41) in the ASCII code page". These may all differ between code pages, though there was a good deal of overlap. Eventually Unicode set out to save the world by unifying all the code pages (hence "Unicode"), and they did a marvellous job of it - but there are very many real-world characters to cater for, so the code-point indices run much larger than any single-byte or even double-byte character array or string can ever contain.

Here we enter the character encodings. These are things like UTF8 and UTF16LE, and they specify an encoding: a way to make a sequence of bytes refer to a specific code point in a code space (typically the Unicode code-point space) that can be much larger than 8 or 16 bits can accommodate. UTF8, for instance, specifies that any byte value less than 128 refers to one of the first 128 code points; as soon as that top bit (the MSB) goes high, more bytes are needed to complete the encoding - the number of leading 1 bits in the first byte tells you how many bytes the sequence has in total - and every follow-on byte must have the form 10xxxxxx. That keeps things consistent and safely lets any reader know, as soon as it sees a byte with the MSB set, that the byte is definitely part of a multi-byte UTF8 sequence. It is a brilliant encoding: slightly technical, but very lean - we only escalate to more bytes when needed, and only by as much as is needed.

The UTF16 encoding is a bit less technical: a consistent 2-byte unit covers far more code points in one go, but even 65536 values is much smaller than the full Unicode range, so UTF16 reserves a specific range (0xD800 to 0xDBFF) for lead units that must be followed by a second unit in the range 0xDC00 to 0xDFFF - together known as a "surrogate pair" - meaning some characters take two 16-bit units, 4 bytes in total. (Your \u0085 example is the UTF8 side of the same idea: the single code point U+0085 needs two UTF8 bytes, 0xC2 0x85 - the "\uc285" you saw is not a different character code, just the byte sequence of the encoding.)
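If it helps to see it in code, here is a little sketch of my own (plain standard C++, nothing C++ Builder specific, and the function names are just mine) that encodes one code point by hand, purely to show where the extra bytes and units come from. It reproduces your U+0085 -> 0xC2 0x85 case on the UTF8 side and a surrogate pair on the UTF16 side:

// Minimal sketch: encode one Unicode code point by hand.
#include <cstdint>
#include <cstdio>
#include <vector>

// UTF8: the number of leading 1-bits in the first byte gives the sequence
// length; every continuation byte looks like 10xxxxxx.
std::vector<std::uint8_t> to_utf8(std::uint32_t cp)
{
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {                 // 0xxxxxxx                   (1 byte)
        out.push_back(std::uint8_t(cp));
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx          (2 bytes)
        out.push_back(std::uint8_t(0xC0 | (cp >> 6)));
        out.push_back(std::uint8_t(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx (3 bytes)
        out.push_back(std::uint8_t(0xE0 | (cp >> 12)));
        out.push_back(std::uint8_t(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(std::uint8_t(0x80 | (cp & 0x3F)));
    } else {                         // 11110xxx + three continuation bytes
        out.push_back(std::uint8_t(0xF0 | (cp >> 18)));
        out.push_back(std::uint8_t(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(std::uint8_t(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(std::uint8_t(0x80 | (cp & 0x3F)));
    }
    return out;
}

// UTF16: anything above 0xFFFF becomes a surrogate pair.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(std::uint16_t(cp));
    } else {
        cp -= 0x10000;
        out.push_back(std::uint16_t(0xD800 | (cp >> 10)));   // high surrogate
        out.push_back(std::uint16_t(0xDC00 | (cp & 0x3FF))); // low surrogate
    }
    return out;
}

int main()
{
    // U+0085: one UTF16 unit (0x0085) but two UTF8 bytes - prints "C2 85".
    for (std::uint8_t b : to_utf8(0x0085))    std::printf("%02X ", b);
    std::printf("\n");
    // U+1F600 (an emoji): a UTF16 surrogate pair - prints "D83D DE00".
    for (std::uint16_t u : to_utf16(0x1F600)) std::printf("%04X ", u);
    std::printf("\n");
    return 0;
}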

There is much more to be said about all this, but I don't want to take everyone's time, and the above is enough to understand the next bit regarding C++ history. One of the great features of the old bcc32 compilers was that Borland embraced strong typing - Pascal (later Turbo Pascal, Delphi etc.) remains to this day one of the most strongly typed languages around. I mention this because you use (apparently) the Borland-origin version of C++, which has always had a stronger-typed vision than the C++ standard: there is a precise type for everything - but that also brought problems when adapting as times changed. C++ itself avoided a standard string type for quite a while, which is one of the design mistakes often noted by people reflecting on early C++ development. Anyway...

The first iterations of C++ started out long ago using the convention borrowed from C: a pointer to characters with a null terminator. Indeed, a constant string such as "Hello World!" today is still a pointer-plus-null-terminator setup, and the std::string type still has a way of converting itself back to that (c_str()). After some time came standardization in C++ with the std::string type. This standard string is simply an array of 8-bit char values; it carries no encoding information of its own - historically the content was assumed to be ASCII (or whatever the local code page happened to be).
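A tiny illustration of that last point (again just a sketch of mine): std::string happily holds UTF8 bytes, it just doesn't know that it is doing so, so size() counts bytes, not characters.

#include <cstdio>
#include <string>

int main()
{
    const char *legacy = "Hello World!";  // C style: pointer + null terminator
    std::string s = legacy;               // std::string wraps the same bytes
    std::printf("%s has %zu bytes\n", s.c_str(), s.size());  // c_str(): back to C style

    std::string nel = "\xC2\x85";         // the two UTF8 bytes of U+0085
    std::printf("U+0085 stored as %zu bytes\n", nel.size()); // prints 2
    return 0;
}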

ASCII of course did not last long as the de facto character set, and eventually everything went Unicode (as mentioned above). The first adaptations were the "wide" types (std::wstring), which gave a nice range of 65536 codes per string element on Windows (wchar_t is 16 bits there, the same choice Windows itself made; on most other platforms it is 32 bits), but guaranteed no specific encoding whatsoever. Later still came things like std::u16string and std::u32string to fix some of the incompatibilities. This all went much more smoothly for the other Borland projects - the Delphi/Lazarus string types carry an explicit length, so all those #0 bytes sitting between the wide characters could never be mistaken for a terminator. To this day, any normal string in Delphi/Lazarus and its ilk is made of double-byte characters starting at index 1 (not 0, for backward compatibility; the length and other metadata live in a small header in front of the string data, and you can switch the compiler to zero-indexed strings instead).
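For what it's worth, here are the std:: wide flavours side by side (another small sketch, standard C++ only): the types differ in the width of their code units, and none of them enforces an encoding by itself.

#include <cstdio>
#include <string>

int main()
{
    std::wstring   w = L"A";  // wchar_t: 16 bits on Windows, 32 bits on most Unixes
    std::u16string u = u"A";  // char16_t: always 16 bits (UTF16 code units)
    std::u32string v = U"A";  // char32_t: always 32 bits (one unit per code point)

    std::printf("code unit sizes: wchar_t=%zu, char16_t=%zu, char32_t=%zu bytes\n",
                sizeof(wchar_t), sizeof(char16_t), sizeof(char32_t));
    std::printf("each string holds %zu/%zu/%zu unit(s)\n",
                w.size(), u.size(), v.size());
    return 0;
}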

Back to the point: Unicode is much larger than those 65536 code points, so Delphi/C++ Builder devised the UTF8String type to accompany WideString and the other string types. It simply lets the compiler know that you want that variable to hold UTF8: anything assigned to it gets converted to UTF8, and when you read it into another string type it gets converted back out of UTF8. WideString/wstring, on the other hand, does no such bookkeeping, but there are functions in the standard interfaces that will encode to and from anything - very easy in fact - though of course you first have to know WHY and WHAT you want to encode before any of it makes sense. This can be especially confusing when you use an object made by someone else, such as an SQLite wrapper, and you don't know whether it already takes care of UTF8 encoding, or whether it has a setting or property like "useUTF8" or similar.
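On the wrapper question specifically: SQLite itself does not care which side you pick - it exposes both UTF8 and UTF16 interfaces and converts internally between them. Something along these lines (a sketch only; bind_and_read is a made-up helper name, and it assumes a statement prepared elsewhere with one text parameter and one text column, on Windows where wchar_t is 16 bits) can hand SQLite UTF16 straight from a UnicodeString and read the same data back as UTF8, or vice versa - no wholesale rewrite of your wrapper needed:

#include <sqlite3.h>

// Bind UTF16 in, read UTF8 out; SQLite converts internally.
void bind_and_read(sqlite3_stmt *stmt, const wchar_t *utf16_in)
{
    // e.g. utf16_in = someUnicodeString.c_str() in C++ Builder (UTF16 on Windows)
    sqlite3_bind_text16(stmt, 1, utf16_in, -1, SQLITE_TRANSIENT);

    if (sqlite3_step(stmt) == SQLITE_ROW) {
        // The same column handed back as UTF8 bytes - ready for a UTF8String,
        // or for any API that expects UTF8.
        const unsigned char *utf8_out = sqlite3_column_text(stmt, 0);
        (void)utf8_out;
    }
    sqlite3_reset(stmt);
}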

You can further read up on the TEncoding class (System.SysUtils) - it is available from C++ Builder as well.

Lastly, I think Unicode is great and UTF8 and UTF16 both have merit - right tool for the job and all that - but it can be confusing. You may be overwhelmed by needing to know all the above simply to get a string out of a DB, but trust me, your simple string will one day need to store Chinese, German and French characters and, naturally, an icon of a smiley poo - and then you will be delighted at how easy and automatic it all turned out to be, all because you went to the effort at the start of learning the Unicode way of doing things.


Hope that did not add to the confusion...
Cheers,
Ryan

PS: Since this is an SQLite forum and the topic of Unicode comes up from time to time, allow me to share a personal project I had to complete as part of a larger one, in case anyone is interested. The downloadable DB below contains the complete Unicode standard in SQLite format, as a few tables that can be joined in obvious ways to get the full information for every code point currently defined by the Unicode standard: its plane, section, surrogate pair, HTML entity, HTML code etc. The DB will remain available and will be updated whenever the Unicode standard changes, though I might host it somewhere else in future. Mail me off-list if you are interested in updates.
http://rifin.co.za/software/glyphlib/unicode.zip

