> MB_COMPOSITE has nothing to do with surrogate pairs

You're right. I was trying to determine whether the output was UTF-16 or UCS-2 based on whether the output might use multiple bytes to represent a character, which is where I got tripped up.

> Do you believe _that's_ what differentiates UTF-16 and UCS-2? If so, you are
> mistaken.

No. If that were the difference it wouldn't be a big deal. The difference is an encoding difference, similar to UTF-8 vs. ASCII (but different...). UTF-16 uses either 2 or 4 bytes for a character; UCS-2 always uses 2 bytes. As a result, UCS-2 can't hold everything that UTF-16 can.

> > Microsoft never seems to clearly identify whether the wide APIs should
> > be given UTF-16 or UCS-2.
>
> You mean, which Unicode normalization form they expect

No, I mean which encoding. You can't give a UTF-16 string to an API that only knows how to handle UCS-2 encoded data, just like you can't use a UTF-8 string when ASCII data is expected.

When I tackled this nightmare last time, I was left with the understanding that the wide Win32 APIs expected data to be UCS-2 encoded. Now I'm no longer sure, and I can't find any reliable documentation on this either way. It would be good if the APIs accepted UTF-16, because that would mean they also accept UCS-2, but I couldn't find anything reliable to support this idea. Some folks say yes. Some say no. The documentation says nothing.

John

-----Original Message-----
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Igor Tandetnik
Sent: Thursday, October 29, 2009 5:08 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Some clarification needed about Unicode

John Crenshaw <johncrens...@priacta.com> wrote:
> 2. MultiByteToWideChar supports a "MB_COMPOSITE" flag, which appears to
> give UTF-16 output.

MB_COMPOSITE has nothing to do with surrogate pairs, and everything to do with whether, say, Latin-1 character Á (A with acute) is converted to a single character U+00C1, or two characters U+0041 U+0301 (capital A + combining acute accent). The latter is "composite", the former is "precomposed".

Do you believe _that's_ what differentiates UTF-16 and UCS-2? If so, you are mistaken.
The difference between the two is in how Unicode characters U+10000 and up are represented (as surrogate pairs in one case, unsupported in the other). U+0041 U+0301 is a valid UCS-2 sequence and a valid UTF-16 sequence.

> Microsoft never seems to clearly identify whether the wide APIs should
> be given UTF-16 or UCS-2.

You mean, which Unicode normalization form they expect (see http://en.wikipedia.org/wiki/Unicode_equivalence), which, again, has absolutely nothing to do with UTF-16 vs UCS-2. The answer is, Win32 API can handle any normalization form as well as denormalized strings. FoldString API is provided to normalize strings to various normalization forms if desired.

Igor Tandetnik

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
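
For readers who want to see the two distinctions from this thread side by side, here is a minimal Python sketch. It does not call any Win32 API; it only inspects UTF-16 code units. The helper names `utf16_code_units` and `has_surrogates` are hypothetical and chosen for illustration. It shows that precomposed vs. composite is a normalization question (both forms fit in single 16-bit units), while a code point at U+10000 or above is what requires a surrogate pair.

```python
import unicodedata


def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as UTF-16 (big-endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def has_surrogates(text):
    """True if the UTF-16 form contains surrogate code units (not representable in UCS-2)."""
    return any(0xD800 <= unit <= 0xDFFF for unit in utf16_code_units(text))


precomposed = "\u00C1"   # A with acute as a single code point
composite = "A\u0301"    # capital A followed by combining acute accent
clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

for label, sample in [("precomposed", precomposed),
                      ("composite", composite),
                      ("U+1D11E", clef)]:
    units = " ".join(f"{u:04X}" for u in utf16_code_units(sample))
    print(f"{label:12} units=[{units}] surrogates={has_surrogates(sample)}")

# Normalization converts between the first two forms without involving surrogates.
print(unicodedata.normalize("NFC", composite) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == composite)
```

The first two samples differ only in normalization form and are valid in both UCS-2 and UTF-16; only the third needs two code units for one character.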