Re: [sqlite] Some clarification needed about Unicode

John Crenshaw Thu, 29 Oct 2009 14:56:09 -0700

> MB_COMPOSITE has nothing to do with surrogate pairs

You're right. I was trying to determine whether the output was UTF-16 or UCS-2 
based on whether the output might use multiple bytes to represent a character, 
which is where I got tripped up.

> Do you believe _that's_ what differentiates UTF-16 and UCS-2? If so, you are 
> mistaken.

No. If that were the difference it wouldn't be a big deal. The difference is an 
encoding difference, loosely analogous to UTF-8 vs. ASCII: UTF-16 uses either 
2 or 4 bytes per code point, while UCS-2 always uses exactly 2 bytes. As a 
result, UCS-2 can't hold everything that UTF-16 can.
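
A minimal Python sketch of that size difference, assuming little-endian UTF-16 
without a BOM. The helper names utf16_byte_len and fits_in_ucs2 are made up 
for illustration and are not part of any Win32 or SQLite API:

```python
def utf16_byte_len(ch: str) -> int:
    """Number of bytes one code point occupies in UTF-16 (LE, no BOM)."""
    return len(ch.encode("utf-16-le"))


def fits_in_ucs2(text: str) -> bool:
    """True if every code point is at most U+FFFF, i.e. needs one 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_char = "A"                # U+0041, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"    # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_byte_len(bmp_char))     # 2
print(utf16_byte_len(astral_char))  # 4
print(fits_in_ucs2(bmp_char))       # True
print(fits_in_ucs2(astral_char))    # False
```

Anything for which fits_in_ucs2 returns True is byte-for-byte identical in 
both encodings, which is why the question only matters above U+FFFF.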

> > Microsoft never seems to clearly identify whether the wide APIs should
> > be given UTF-16 or UCS-2.
> 
> You mean, which Unicode normalization form they expect

No, I mean which encoding. You can't give a UTF-16 string to an API that only 
knows how to handle UCS-2 encoded data, just like you can't use a UTF-8 string 
when ASCII data is expected. When I tackled this nightmare last time, I was 
left with the understanding that the wide Win32 APIs expected data to be UCS-2 
encoded. Now I'm no longer sure, and I can't find any reliable documentation on 
this either way. It would be good if the APIs accept UTF-16, because that would 
mean they also accept UCS-2, but I couldn't find anything reliable to support 
this idea. Some folks say yes. Some say no. The documentation says nothing.

John

-----Original Message-----
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] 
On Behalf Of Igor Tandetnik
Sent: Thursday, October 29, 2009 5:08 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Some clarification needed about Unicode

John Crenshaw <johncrens...@priacta.com> wrote:
> 2. MultiByteToWideChar supports a "MB_COMPOSITE" flag, which appears to
> give UTF-16 output.

MB_COMPOSITE has nothing to do with surrogate pairs, and everything to do with 
whether, say, Latin-1 character Á (A with acute) is converted to a single 
character U+00C1, or two characters U+0041 U+0301 (capital A + combining acute 
accent). The latter is "composite", the former is "precomposed".
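
The same two spellings can be seen without Win32 at all. The sketch below uses 
Python's standard unicodedata module as a stand-in, on the assumption that NFD 
and NFC are close enough analogues of "composite" and "precomposed" for this 
one character; it does not call MultiByteToWideChar:

```python
import unicodedata

precomposed = "\u00C1"  # LATIN CAPITAL LETTER A WITH ACUTE, one code point

# Canonical decomposition: base letter followed by a combining mark.
composite = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(ch):04X}" for ch in composite])  # ['U+0041', 'U+0301']

# Canonical composition folds the pair back into the single code point.
recomposed = unicodedata.normalize("NFC", composite)
print(recomposed == precomposed)  # True

# Both spellings stay within U+FFFF, so neither needs a surrogate pair.
print(all(ord(ch) <= 0xFFFF for ch in precomposed + composite))  # True
```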

Do you believe _that's_ what differentiates UTF-16 and UCS-2? If so, you are 
mistaken. The difference between the two is in how Unicode characters U+10000 
and up are represented (as surrogate pairs in one case, unsupported in the 
other). U+0041 U+0301 is a valid UCS-2 sequence and a valid UTF-16 sequence.
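
As a rough illustration of the surrogate-pair arithmetic, here is a small 
Python sketch. The function name to_utf16_units is hypothetical, and the code 
assumes its input is a valid Unicode scalar value (not itself a surrogate):

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if code_point < 0x10000:
        # BMP code point: one unit, identical in UCS-2 and UTF-16.
        return [code_point]
    # Supplementary code point: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # lead (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trail (low) surrogate
    return [high, low]


print([hex(u) for u in to_utf16_units(0x0041)])   # ['0x41']
print([hex(u) for u in to_utf16_units(0x0301)])   # ['0x301']
print([hex(u) for u in to_utf16_units(0x1D11E)])  # ['0xd834', '0xdd1e']
```

UCS-2 has no equivalent of the second branch, so U+1D11E simply cannot be 
written in it, while U+0041 U+0301 comes out the same in either encoding.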

> Microsoft never seems to clearly identify whether the wide APIs should
> be given UTF-16 or UCS-2.

You mean, which Unicode normalization form they expect (see 
http://en.wikipedia.org/wiki/Unicode_equivalence), which, again, has 
absolutely nothing to do with UTF-16 vs UCS-2. The answer is, the Win32 API can 
handle any normalization form as well as denormalized strings. The FoldString 
API is provided to normalize strings to various normalization forms if desired.

Igor Tandetnik

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users