On 2017/08/07 5:29 PM, x wrote:
> Apologies, I should have said I was using c++ builder Berlin on windows 10 and
> that UnicodeString was UTF16.
>
> I thought I had learned enough about this string lunacy to get by but finding
> out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped
> me over the edge. I assumed they both used the same codes but UTF16 allowed
> some characters UTF8 didn’t have.
>
> I’m now wondering if I should go to the trouble of changing my sqlite wrapper
> over to communicate with the sqlite utf8 functions rather than the utf16 ones.
> Trouble is many of c++ builder’s built in types such as TStringList etc are
> utf16.

No, you shouldn't. UTF16 doesn't have "more" characters than UTF8, and TStringList is not "UTF16" in the sense you mean (it simply holds strings; the encoding conversions are handled for you) - let me see if I can clear up some of the confusion. This next bit is very short and the subject really deserves much more study, but I hope I say enough (and correctly enough) for you to get the picture a little better.

First, some (very short) history on the "string lunacy" you refer to. When storing text in any system there are two confusing concepts that are hard to get your head around. Firstly there are the actual characters, or character sets - these are the things referred to as ANSI, Latin, CN-Big5 etc. Then there are character code index mappings - things such as code pages - which basically say stuff like "the uppercase Latin character A has a code of 65 (hex 41) in the ASCII code page". These may all differ between code pages, though there was a good deal of overlap. Eventually Unicode set out to save the world by unifying all the code pages (hence "Unicode"), and they did a marvellous job of it - but there are very many real-world characters to cater for, so the code-point indices run much larger than any single-byte or even double-byte character array or string can ever contain.

Here we enter the character encodings. These are things like UTF8 and UTF16LE, and they specify an encoding: a way to make a sequence of bytes refer to a specific code point in a code space (typically the Unicode code-point space) that can be much larger than 8 or 16 bits can accommodate. UTF8, for instance, specifies that any byte value less than 128 refers to one of the first 128 code points; as soon as that top bit (the MSB) goes high, more bytes are needed to complete the encoding - the number of leading 1 bits in the first byte tells you how many bytes the sequence has in total - and every follow-on byte must have the form 10xxxxxx. That keeps things consistent and safely lets any reader know, as soon as it sees a byte with the MSB set, that the byte is definitely part of a multi-byte UTF8 sequence. It is a brilliant encoding: slightly technical, but very lean - we only escalate to more bytes when needed, and only by as much as is needed.

The UTF16 encoding is a bit less technical: a consistent 2-byte unit covers far more code points in one go, but even 65536 values is much smaller than the full Unicode range, so UTF16 reserves a specific range (0xD800 to 0xDBFF) for lead units that must be followed by a second unit in the range 0xDC00 to 0xDFFF - together known as a "surrogate pair" - meaning some characters take two 16-bit units, 4 bytes in total. (Your \u0085 example is the UTF8 side of the same idea: the single code point U+0085 needs two UTF8 bytes, 0xC2 0x85 - the "\uc285" you saw is not a different character code, just the byte sequence of the encoding.)
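If it helps to see it in code, here is a little sketch of my own (plain standard C++, nothing C++ Builder specific, and the function names are just mine) that encodes one code point by hand, purely to show where the extra bytes and units come from. It reproduces your U+0085 -> 0xC2 0x85 case on the UTF8 side and a surrogate pair on the UTF16 side:

// Minimal sketch: encode one Unicode code point by hand.
#include <cstdint>
#include <cstdio>
#include <vector>

// UTF8: the number of leading 1-bits in the first byte gives the sequence
// length; every continuation byte looks like 10xxxxxx.
std::vector<std::uint8_t> to_utf8(std::uint32_t cp)
{
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {                 // 0xxxxxxx                   (1 byte)
        out.push_back(std::uint8_t(cp));
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx          (2 bytes)
        out.push_back(std::uint8_t(0xC0 | (cp >> 6)));
        out.push_back(std::uint8_t(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx (3 bytes)
        out.push_back(std::uint8_t(0xE0 | (cp >> 12)));
        out.push_back(std::uint8_t(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(std::uint8_t(0x80 | (cp & 0x3F)));
    } else {                         // 11110xxx + three continuation bytes
        out.push_back(std::uint8_t(0xF0 | (cp >> 18)));
        out.push_back(std::uint8_t(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(std::uint8_t(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(std::uint8_t(0x80 | (cp & 0x3F)));
    }
    return out;
}

// UTF16: anything above 0xFFFF becomes a surrogate pair.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(std::uint16_t(cp));
    } else {
        cp -= 0x10000;
        out.push_back(std::uint16_t(0xD800 | (cp >> 10)));   // high surrogate
        out.push_back(std::uint16_t(0xDC00 | (cp & 0x3FF))); // low surrogate
    }
    return out;
}

int main()
{
    // U+0085: one UTF16 unit (0x0085) but two UTF8 bytes - prints "C2 85".
    for (std::uint8_t b : to_utf8(0x0085))    std::printf("%02X ", b);
    std::printf("\n");
    // U+1F600 (an emoji): a UTF16 surrogate pair - prints "D83D DE00".
    for (std::uint16_t u : to_utf16(0x1F600)) std::printf("%04X ", u);
    std::printf("\n");
    return 0;
}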

There is much more to be said about all this, but I don't want to take everyone's time, and the above is enough to understand the next bit regarding C++ history. One of the great features of the old bcc32 compilers was that Borland embraced strong typing - Pascal (later Turbo Pascal, Delphi etc.) remains to this day one of the most strongly typed languages around. I mention this because you use (apparently) the Borland-origin version of C++, which has always had a stronger-typed vision than the C++ standard: there is a precise type for everything - but that also brought problems when adapting as times changed. C++ itself avoided a standard string type for quite a while, which is one of the design mistakes often noted by people reflecting on early C++ development. Anyway...

The first iterations of C++ started out long ago using the convention borrowed from C: a pointer to characters with a null terminator. Indeed, a constant string such as "Hello World!" today is still a pointer-plus-null-terminator setup, and the std::string type still has a way of converting itself back to that (c_str()). After some time came standardization in C++ with the std::string type. This standard string is simply an array of 8-bit char values; it carries no encoding information of its own - historically the content was assumed to be ASCII (or whatever the local code page happened to be).
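A tiny illustration of that last point (again just a sketch of mine): std::string happily holds UTF8 bytes, it just doesn't know that it is doing so, so size() counts bytes, not characters.

#include <cstdio>
#include <string>

int main()
{
    const char *legacy = "Hello World!";  // C style: pointer + null terminator
    std::string s = legacy;               // std::string wraps the same bytes
    std::printf("%s has %zu bytes\n", s.c_str(), s.size());  // c_str(): back to C style

    std::string nel = "\xC2\x85";         // the two UTF8 bytes of U+0085
    std::printf("U+0085 stored as %zu bytes\n", nel.size()); // prints 2
    return 0;
}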

ASCII of course did not last long as the de facto character set, and eventually everything went Unicode (as mentioned above). The first adaptations were the "wide" types (std::wstring), which gave a nice range of 65536 codes per string element on Windows (wchar_t is 16 bits there, the same choice Windows itself made; on most other platforms it is 32 bits), but guaranteed no specific encoding whatsoever. Later still came things like std::u16string and std::u32string to fix some of the incompatibilities. This all went much more smoothly for the other Borland projects - the Delphi/Lazarus string types carry an explicit length, so all those #0 bytes sitting between the wide characters could never be mistaken for a terminator. To this day, any normal string in Delphi/Lazarus and its ilk is made of double-byte characters starting at index 1 (not 0, for backward compatibility; the length and other metadata live in a small header in front of the string data, and you can switch the compiler to zero-indexed strings instead).
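For what it's worth, here are the std:: wide flavours side by side (another small sketch, standard C++ only): the types differ in the width of their code units, and none of them enforces an encoding by itself.

#include <cstdio>
#include <string>

int main()
{
    std::wstring   w = L"A";  // wchar_t: 16 bits on Windows, 32 bits on most Unixes
    std::u16string u = u"A";  // char16_t: always 16 bits (UTF16 code units)
    std::u32string v = U"A";  // char32_t: always 32 bits (one unit per code point)

    std::printf("code unit sizes: wchar_t=%zu, char16_t=%zu, char32_t=%zu bytes\n",
                sizeof(wchar_t), sizeof(char16_t), sizeof(char32_t));
    std::printf("each string holds %zu/%zu/%zu unit(s)\n",
                w.size(), u.size(), v.size());
    return 0;
}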

Back to the point: Unicode is much larger than those 65536 code points, so Delphi/C++ Builder devised the UTF8String type to accompany WideString and the other string types. It simply lets the compiler know that you want that variable to hold UTF8: anything assigned to it gets converted to UTF8, and when you read it into another string type it gets converted back out of UTF8. WideString/wstring, on the other hand, does no such bookkeeping, but there are functions in the standard interfaces that will encode to and from anything - very easy in fact - though of course you first have to know WHY and WHAT you want to encode before any of it makes sense. This can be especially confusing when you use an object made by someone else, such as an SQLite wrapper, and you don't know whether it already takes care of UTF8 encoding, or whether it has a setting or property like "useUTF8" or similar.
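On the wrapper question specifically: SQLite itself does not care which side you pick - it exposes both UTF8 and UTF16 interfaces and converts internally between them. Something along these lines (a sketch only; bind_and_read is a made-up helper name, and it assumes a statement prepared elsewhere with one text parameter and one text column, on Windows where wchar_t is 16 bits) can hand SQLite UTF16 straight from a UnicodeString and read the same data back as UTF8, or vice versa - no wholesale rewrite of your wrapper needed:

#include <sqlite3.h>

// Bind UTF16 in, read UTF8 out; SQLite converts internally.
void bind_and_read(sqlite3_stmt *stmt, const wchar_t *utf16_in)
{
    // e.g. utf16_in = someUnicodeString.c_str() in C++ Builder (UTF16 on Windows)
    sqlite3_bind_text16(stmt, 1, utf16_in, -1, SQLITE_TRANSIENT);

    if (sqlite3_step(stmt) == SQLITE_ROW) {
        // The same column handed back as UTF8 bytes - ready for a UTF8String,
        // or for any API that expects UTF8.
        const unsigned char *utf8_out = sqlite3_column_text(stmt, 0);
        (void)utf8_out;
    }
    sqlite3_reset(stmt);
}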

You can further read up on the TEncoding class (System.SysUtils) - it is available from C++ Builder as well.

Lastly, I think Unicode is great and UTF8 and UTF16 both have merit - right tool for the job and all that - but it can be confusing. You may be overwhelmed by needing to know all the above simply to get a string out of a DB, but trust me, your simple string will one day need to store Chinese, German and French characters and, naturally, an icon of a smiley poo - and then you will be delighted at how easy and automatic it all turned out to be, all because you went to the effort at the start of learning the Unicode way of doing things.


Hope that did not add to the confusion...
Cheers,
Ryan

PS: Since this is an SQLite forum and the topic of Unicode comes up from time to time, allow me to share a personal project I had to complete as part of a larger one, in case anyone is interested. The downloadable DB below contains the complete Unicode standard in SQLite format, as a few tables that can be joined in obvious ways to get the full information for every code point currently defined by the Unicode standard: its plane, section, surrogate pair, HTML entity, HTML code etc. The DB will remain available and will be updated whenever the Unicode standard changes, though I might host it somewhere else in future. Mail me off-list if you are interested in updates.
http://rifin.co.za/software/glyphlib/unicode.zip

