On Windows, UTF-16 code units are represented by WCHAR, which is always
2 bytes.  UCS-2 is also fixed at 2 bytes, but it can only hold characters
in the Basic Multilingual Plane; UTF-16 encodes anything beyond that as a
surrogate pair (two 2-byte code units), and those are mostly extended
characters outside everyday language.  For example, musical notation
symbols need a surrogate pair.  I don't think any OS uses UCS-2 directly.
I know Oracle supports UTF-8, UTF-16, and UCS-2.  In fact, Oracle's
online documentation has a really good discussion of Unicode.  Look for
their internationalization book.  I wrote some code that shared data
between Oracle and Microsoft SQL Server and found that book very helpful.
Oracle generally favors UTF-8 while SQL Server favors UTF-16.

http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96529/toc.htm
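
To make the code-unit widths concrete, here is a small C sketch (my own
illustration, not something from the thread) showing that a BMP character
takes one 2-byte UTF-16 code unit while a musical symbol such as U+1D11E
needs a surrogate pair.  It assumes unsigned short is 2 bytes, which is
true on every platform I know of:

    #include <stdio.h>

    /* "A" (U+0041) followed by the musical symbol G clef (U+1D11E),
    ** written out as explicit UTF-16 code units.  The second character
    ** is above U+FFFF, so it needs a surrogate pair (0xD834, 0xDD1E). */
    int main(void){
      unsigned short s[] = { 0x0041, 0xD834, 0xDD1E, 0x0000 };
      int units = 0;
      while( s[units] ) units++;
      printf("2 characters, %d UTF-16 code units, %d bytes\n",
             units, units * (int)sizeof(unsigned short));
      return 0;
    }

The same two characters take 5 bytes in UTF-8, and the second one simply
cannot be represented in UCS-2 at all.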

If you are going to cast to unsigned char*, you must manage the fact that
each character in your strings is two (or more) bytes; you are
effectively just using a byte pointer to the string data.  I think
wchar_t* is what is typically used, but its size and encoding are
platform dependent.  The big problem you have is that your database files
are portable.  I think you will need to pick an internal format for
storing the strings in the db so that you can then translate as
appropriate for each platform.  You may be able to do some clever things
like use sizeof(wchar_t) to find out how many bytes are used for a
character and use that for your translation, as in the sketch below.
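
The sizeof(wchar_t) idea might look something like this sketch.  The
function name and the choice of little-endian UTF-16 as the stored format
are just my assumptions for illustration, and for simplicity it punts on
characters above U+FFFF rather than emitting surrogate pairs:

    #include <stddef.h>

    /* Rough sketch: copy a native wchar_t string into 2-byte
    ** little-endian UTF-16 so the database always holds one fixed
    ** format, whether wchar_t is 2 or 4 bytes on this platform.
    ** Characters above U+FFFF are replaced with U+FFFD instead of
    ** being encoded as surrogate pairs. */
    static size_t wchar_to_utf16le(const wchar_t *in,
                                   unsigned char *out, size_t outsize){
      size_t n = 0;
      for(; *in && n + 2 <= outsize; in++){
        unsigned long c = (unsigned long)*in;
        if( sizeof(wchar_t) > 2 && c > 0xFFFF ){
          c = 0xFFFD;   /* replacement character */
        }
        out[n++] = (unsigned char)(c & 0xFF);         /* low byte first */
        out[n++] = (unsigned char)((c >> 8) & 0xFF);  /* then high byte */
      }
      return n;   /* bytes written, with no terminator added */
    }

Reading strings back out is the mirror image: widen each 2-byte unit into
a wchar_t, which works the same way whether wchar_t is 2 or 4 bytes.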

There is a Unicode book available that covers the specs.  Unfortunately,
my experience has been that every implementation has its own nuances.
Generally, though, it is pretty consistent.

-- 
Andrew

On Wed, 7 Apr 2004, D. Richard Hipp wrote:

> Simon Berthiaume wrote:
> >  >> Notice that text strings are always transferred as type "char*" even
> > if the text representation is UTF-16.
> >
> > This might force users to explicitly type-cast some function calls
> > to avoid warnings. I would prefer UNICODE-neutral functions that can
> > take either one of them depending on the setting of a compilation
> > #define (UNICODE). Create a function that takes char * and another that
> > takes wchar_t *, then encourage the use of a #defined symbol that would
> > switch depending on context (see example below). That would allow people
> > to call the functions whichever way they want.
> >
> >     Example:
> >
> >         int sqlite3_open8(const char*, sqlite3**, const char**);
> >         int sqlite3_open16(const wchar_t*, sqlite3**, const wchar_t**);
> >         #ifdef UNICODE
> >             #define sqlite3_open sqlite3_open16
> >         #else
> >             #define sqlite3_open sqlite3_open8
> >         #endif
> >
>
> I'm told that wchar_t is 2 bytes on some systems and 4 bytes on others.
> Is it really acceptable to use wchar_t* as a UTF-16 string pointer?
>
> Note that internally, sqlite3 will cast all UTF-16 strings to be of
> type "unsigned char*".  So the type in the declaration doesn't really
> matter. But it would be nice to avoid compiler warnings.  So what datatype
> are most systems expecting to use for UTF-16 strings?  Who can provide
> me with a list?  Or even a few examples?
>
>
>
