Thanks Ryan for going to the trouble of typing that out. Hope you’re not a
one-fingered typist like myself. The Borland-related stuff is welcome but I
still can’t say I’m any less confused by it all.

I’m having a bad day today. I’ve spent most of it trying to fathom this stuff
out. Igor & Gunter were correct earlier: the ‘\u0085’ is changed to ‘?’ before
the string gets anywhere near SQLite. Why, I don’t know. It doesn’t seem
unreasonable to want to put a Unicode code point into a UnicodeString. As
regards hex(char(133)) returning C285, following the posts by Nico and Richard
I’m wondering if it’s because I’m using SQLite Expert Pro on a database that’s
encoded in UTF-8. I tried to change the encoding to UTF-16 to see if I would
get a different result but, while the software seemed to accept the request,
it was never completed and no feedback was given, aside from the ‘Apply’ and
‘Cancel’ buttons both being greyed out for hours (it’s only a small database).
I’ve had enough for today though.
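
The one bit I think I do follow now is the C285 part: U+0085 is above 0x7F, so
UTF-8 needs two bytes for it, and (if my reading is right) those two bytes are
C2 85, which would fit with hex() showing the UTF-8 bytes on a UTF-8 encoded
database. A quick stand-alone sketch of that reasoning, plain standard C++ and
nothing Builder- or SQLite-specific, just my own understanding:

#include <cstdio>

int main()
{
    unsigned cp = 0x0085;                   // the code point in question (char(133))
    unsigned char b0 = 0xC0 | (cp >> 6);    // 110xxxxx lead byte
    unsigned char b1 = 0x80 | (cp & 0x3F);  // 10xxxxxx continuation byte
    std::printf("%02X%02X\n", b0, b1);      // prints C285
    return 0;
}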

Thanks to all who have contributed.

From: R Smith <rsm...@rsweb.co.za>
Sent: 07 August 2017 19:33
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] hex and char functions


On 2017/08/07 5:29 PM, x wrote:
> Apologies, I should have said I was using C++ Builder Berlin on Windows 10
> and that UnicodeString was UTF-16.
>
> I thought I had learned enough about this string lunacy to get by, but finding
> out that the UTF-8 code for the UTF-16 code \u0085 is in fact \uc285 has tipped
> me over the edge. I assumed they both used the same codes but UTF-16 allowed
> some characters UTF-8 didn’t have.
>
> I’m now wondering if I should go to the trouble of changing my SQLite wrapper
> over to communicate with the SQLite UTF-8 functions rather than the UTF-16
> ones. Trouble is many of C++ Builder’s built-in types such as TStringList etc.
> are UTF-16.

No, you shouldn't. UTF-16 doesn't have "more" characters than UTF-8, and
TStringList is not UTF-16 - let me see if I can clear up some of the
confusion. This next bit is very short and really requires much more
study, but I hope I say enough (and correctly enough) for you to get the
picture a little better.
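
Just to back up that "No" before the history lesson: SQLite itself happily
accepts UTF-16 input through its "...16" API functions and converts
internally to whatever the database encoding happens to be, so there is no
need to rework a UTF-16 based wrapper. A bare-bones sketch against the C API
(your wrapper will look different, and the table name here is made up):

#include <sqlite3.h>

int main()
{
    sqlite3 *db = nullptr;
    if (sqlite3_open("test.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS t(txt TEXT)",
                 nullptr, nullptr, nullptr);

    // A UTF-16 literal containing U+0085, bound directly so no lossy
    // ANSI conversion ever touches it.
    const char16_t text[] = u"before\u0085after";

    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO t(txt) VALUES(?)", -1, &stmt, nullptr);
    // A negative length means "zero-terminated"; SQLITE_TRANSIENT tells
    // SQLite to take its own copy of the buffer.
    sqlite3_bind_text16(stmt, 1, text, -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_close(db);
    return 0;
}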

First some (very short) history on the "string lunacy" you refer to.
Note that when storing text in any system there are two confusing
concepts that are hard to get your head around: firstly there are the
actual characters, or character sets - these are the things referred to
as ANSI, Latin, CN-Big5 etc.; then there are character-code mappings,
things such as code pages and the like, which basically say stuff like
"the uppercase Latin character A has a code of 65 (hex 41) in the ASCII
code page". These may all differ between code pages, though there was a
good deal of overlap.  Eventually Unicode set out to save the World by
indeed unifying all the code-paging (hence "Unicode") and they did a
marvelous job of it - but there were very many real-World characters to
cater for, so the code-point indices are much larger than any single- or
even double-byte character array or string can ever contain.

Here we enter the character encodings. These are things like UTF-8 and
UTF-16LE, and they specify an encoding: a way to make a sequence of bytes
refer to a specific code point in a code space (typically the Unicode
code-point space) that can be much larger than 8 or 16 bits can
accommodate.  UTF-8, for instance, specifies that any byte value less
than 128 refers to one of the first 128 code points; as soon as that top
bit (the MSB) goes high, it means another byte is needed (or byteS,
depending on how many high bits follow the initial one) to complete the
encoding, and each follow-on byte must have the form 10xxxxxx. That keeps
things consistent and safely lets any reader know, as soon as it
encounters a byte with the MSB set, that it is definitely part of a
multi-byte UTF-8 sequence - which makes it a brilliant encoding.
Although slightly technical, it is very lean: we only escalate to more
bytes when needed, and only by as much as is needed.  The UTF-16 encoding
is a bit less technical: we can represent far more code points with a
consistent 2-byte unit, but even that is much smaller than the full
Unicode world, so UTF-16 reserves specific ranges (0xD800 to 0xDBFF,
followed by 0xDC00 to 0xDFFF) for "surrogate pairs" - two 16-bit units
that together encode one code point (this is the thing you said pushed
you over the edge: finding that some Unicode characters are represented
by two double-byte units, so 4 bytes total width). There is much more to
be said about all this, but I don't want to take everyone's time; the
above, plus the little sketch below, is enough to understand the next bit
regarding C++ history.
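
To make the bit-twiddling above concrete, here is a toy stand-alone sketch
(my own, nothing to do with SQLite or Builder) that encodes a single code
point under each scheme and prints the units in hex:

#include <cstdint>
#include <cstdio>

static void showEncodings(std::uint32_t cp)
{
    std::printf("U+%04X  UTF-8:", cp);

    // UTF-8: the number of leading 1-bits in the first byte gives the
    // sequence length; every continuation byte has the form 10xxxxxx.
    if (cp < 0x80) {
        std::printf(" %02X", cp);
    } else if (cp < 0x800) {
        std::printf(" %02X %02X", 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        std::printf(" %02X %02X %02X", 0xE0 | (cp >> 12),
                    0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
        std::printf(" %02X %02X %02X %02X", 0xF0 | (cp >> 18),
                    0x80 | ((cp >> 12) & 0x3F),
                    0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }

    // UTF-16: one 16-bit unit for code points up to 0xFFFF, otherwise a
    // surrogate pair (high 0xD800-0xDBFF followed by low 0xDC00-0xDFFF).
    std::printf("   UTF-16:");
    if (cp < 0x10000) {
        std::printf(" %04X", cp);
    } else {
        std::uint32_t v = cp - 0x10000;
        std::printf(" %04X %04X", 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF));
    }
    std::printf("\n");
}

int main()
{
    showEncodings(0x41);     // 'A'
    showEncodings(0x85);     // the U+0085 from this thread
    showEncodings(0x20AC);   // Euro sign
    showEncodings(0x1F4A9);  // the smiley poo mentioned further down
    return 0;
}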

One of the great features of the bcc32 compilers of yore was that they
(Borland) embraced strong typing - probably to this day the most strongly
typed language around is Pascal, later Turbo Pascal, Delphi etc. I
mention this because you (apparently) use the Borland-origin version of
C++, and they always had a more strongly typed vision than the C++
standard: there is a precise type for everything - but that also came
with problems when adapting as times changed. C++ itself avoided a
standard string type for quite a while, which is one of the design
mistakes often noted by people reflecting on early C++ development.
Anyway...

The first iterations of C++ started out long ago using the convention
borrowed from C: pointer strings with null terminators. Indeed, a constant
string such as 'Hello World!' is still a pointer-and-null-terminator setup
today, and the std::string type still has ways of converting itself to
that.  After some time came a standardization in C++ with the std::string
type. This standard string represents an array of 8-bit characters,
typically in ASCII or some other 8-bit encoding.
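
A tiny illustration of that interop, just a sketch:

#include <cstdio>
#include <string>

int main()
{
    std::string s = "Hello World!";   // an array of 8-bit characters
    const char *p = s.c_str();        // null-terminated view for C-style APIs
    std::printf("%s (%zu chars)\n", p, s.size());
    return 0;
}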

ASCII of course did not last long as the de facto character-set
definition, and eventually everything went Unicode (as mentioned above).
The first adaptations seen were the "wide" types (std::wstring), which
allowed a nice range of 65536 character codes per string index
"character" - the same route Windows went - but in no way guaranteed
compatibility with specific encodings. Later still came things like
std::u16string and std::u32string to try and fix some of the
incompatibilities.  This went much more smoothly for the other Borland
projects - the Delphi/Lazarus platforms used a length specifier, so all
those #0 characters in between the wide bytes did not trip up the bcc
compilers. To this day, any normal string in Delphi/Lazarus and its ilk
is made of double-byte characters starting at index 1 (not 0, because of
backward compatibility, though these days the length is contained fully
in a sort of variable registry - known as RTTI, or Run-Time Type
Information [Google it] - and you can switch your compiler to use
zero-indexed strings instead).
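
For reference, the standard flavours mentioned above look like this (a
small sketch; the unit sizes in the comments are the typical ones, not
guarantees - wchar_t in particular is 2 bytes on Windows and 4 on most
Unix-like systems):

#include <cstdio>
#include <string>

int main()
{
    std::string    s   = "plain 8-bit";   // char
    std::wstring   ws  = L"wide";         // wchar_t, platform-dependent width
    std::u16string s16 = u"sixteen";      // char16_t, UTF-16 code units
    std::u32string s32 = U"thirty-two";   // char32_t, one unit per code point

    std::printf("unit sizes: %zu %zu %zu %zu\n",
                sizeof(s[0]), sizeof(ws[0]), sizeof(s16[0]), sizeof(s32[0]));
    return 0;
}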

Back to the point: Unicode is much larger than those 65536 code points,
so Delphi/C++ devised the "UTF8String" type to accompany "WideString" and
the other string types. It simply lets the compiler know that you wish
for that string type to assume that any byte sequence stored to it comes
in the form of UTF-8, so it needs to be translated from UTF-8 on the way
in, and when you read that memory you intend for it to be translated out
of UTF-8 again. WideString/wstring on the other hand does no such thing,
but there are functions in the standard interfaces that will encode to
and from just about anything - very easy in fact, but of course you first
have to know WHY and WHAT you want to encode for any of it to make sense.
This can be especially confusing when you use an object made by someone
else, such as an SQLite wrapper, and don't know whether it already takes
care of UTF-8 encoding, or whether it has a setting or property like
"useUTF8" or similar.
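
If you ever need to do such a conversion explicitly on Windows, something
along these lines is roughly what those string types do for you behind
the scenes (a plain Win32 sketch, not Builder-specific). It also shows
how a conversion to an ANSI code page can silently substitute '?' for
characters it cannot represent - which is one way a U+0085 can turn into
a '?' long before SQLite ever sees it:

#include <windows.h>
#include <cstdio>
#include <string>

// UTF-16 -> UTF-8 using the Win32 conversion routine.
static std::string toUtf8(const std::wstring &ws)
{
    int n = WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), -1,
                                nullptr, 0, nullptr, nullptr);
    std::string out(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), -1,
                        &out[0], n, nullptr, nullptr);
    out.resize(n - 1);                       // drop the trailing null
    return out;
}

int main()
{
    std::wstring ws = L"a\u0085b";           // U+0085 in a UTF-16 string

    std::string u8 = toUtf8(ws);
    std::printf("UTF-8 bytes:");
    for (unsigned char c : u8) std::printf(" %02X", c);   // 61 C2 85 62
    std::printf("\n");

    // Lossy conversion to the ANSI code page: if the active code page has
    // no slot for U+0085, the default character '?' gets substituted and
    // usedDefault is set to tell you so.
    char ansi[16] = {0};
    BOOL usedDefault = FALSE;
    WideCharToMultiByte(CP_ACP, 0, ws.c_str(), -1,
                        ansi, sizeof(ansi), nullptr, &usedDefault);
    std::printf("ANSI: %s (default char used: %d)\n", ansi, usedDefault);
    return 0;
}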

You can further read up on the Encoding class (not sure if it is the
exact same name in C++).

Lastly, I think Unicode is great and UTF-8 and UTF-16 both have merit -
right tool for the job and all that - but it can be confusing.  You may be
overwhelmed by needing to know all the above to simply get a string out
of a DB, but trust me, your simple string will one day need to store
Chinese, German, French characters and, naturally, an icon of a smiley
poo - and then you will be delighted about how easy and automatic that
ended up being - all because you went through the effort at the start of
learning the Unicode way of doing things.


Hope that did not add to the confusion...
Cheers,
Ryan

PS: Since this is an SQLite forum, and the topic of Unicode comes up
from time to time, allow me to share a personal project I had to
accomplish as part of a larger one, in case anyone is interested. This
downloadable DB contains the complete Unicode standard in SQLite format,
via a few tables that can be joined in obvious ways to get the complete
Unicode character information currently defined by the standard: each
code point, its plane, section, surrogate pair, HTML entity, HTML code
etc.  This DB will remain available and will be updated whenever the
Unicode standard changes, though I might host it in a different place.
Mail me off-list if you are interested in updates.
http://rifin.co.za/software/glyphlib/unicode.zip


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
