Re: [scintilla] UTF-8 to UCS conversion and other encoding utilities

Reece Dunn Mon, 16 Jan 2006 00:11:37 -0800

Neil Hodgson wrote:

Reece Dunn:


> The UTF8 <==> UCS conversion utilities in scintilla/src/UniConversion.h
> would be useful to the outside world. For example, in my application, I am
> returning the selected text and searching using UCS encoded Windows BSTRs.


   At least one person thinks that UniConversion sucks and mailed me
extensively on the subject. It's normally better to use platform
facilities for this. Scintilla defines enough for just its use so it
doesn't have to unify platform calls. If you want better generic
Unicode features use a project like ICU that is meant for the job.
SinkWorld has better code than Scintilla.

I know that UniConversion is not meant to be a full Unicode conversion
library. I'm currently using SetCodePage to get Scintilla to do the correct
conversions using the native platform calls.

That said, Scintilla stores the character buffers natively as UTF8. I am
successfully using UniConversion to provide Scintilla text to BSTR
conversions, without the heavyweight use of ICU as I don't need generic
Unicode facilities.

For me, UniConversion works and allows me to keep the code lightweight and
fast. I don't need any of the more complex support that ICU or the Mozilla
Firefox localisation interfaces provide.
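
A minimal sketch of the kind of lightweight UTF8 to UTF16 conversion
described above, assuming only BMP text is needed. This is not the
UniConversion code: the function name Utf8ToUtf16Bmp is hypothetical, and
the sketch does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Scintilla. Converts UTF-8 to UTF-16,
// handling only 1- to 3-byte sequences (the Basic Multilingual Plane).
// Returns false on a 4-byte lead byte, a stray continuation byte, or
// truncated input.
bool Utf8ToUtf16Bmp(const std::string &utf8, std::u16string &out) {
    out.clear();
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        std::size_t len;
        char32_t cp;
        if (b0 < 0x80) {                  // 0xxxxxxx: ASCII
            len = 1; cp = b0;
        } else if ((b0 & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
            len = 2; cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
            len = 3; cp = b0 & 0x0F;
        } else {
            return false;                 // 4-byte lead or invalid byte
        }
        if (i + len > utf8.size())
            return false;                 // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char bk = static_cast<unsigned char>(utf8[i + k]);
            if ((bk & 0xC0) != 0x80)
                return false;             // not a continuation byte
            cp = (cp << 6) | (bk & 0x3F);
        }
        out.push_back(static_cast<char16_t>(cp));
        i += len;
    }
    return true;
}
```

Japanese and Chinese characters in the BMP take three UTF8 bytes, so they
pass through a converter like this one. Only code points above U+FFFF need
the fourth byte.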

> NOTE: The conversion algorithm doesn't handle the 4th UTF8 byte. I'm
> assuming this is due to lack of support for UTF16 surrogate pairs and
> Unicode planar characters in Windows.

   AFAICT non-BMP use of Windows requires the Chinese GB-18030 add on.
SinkWorld supports non-BMP characters but I won't bother with it yet
for Scintilla.

Do you mean that the 4th byte of a UTF8 string corresponds to the Chinese
symbols? If so, aren't these available with the MS Mincho (and I think the
MS Gothic) fonts? You need to install the Japanese/Chinese language support
to provide the character support. When you have the correct language
installed (tested on Windows XP), the characters are available. Thus, you
can also view those characters with the regular fonts such as Times New
Roman.

Provided that you have the correct character, you can use the normal Windows
rendering (i.e. ExtTextOutW) to render the Japanese/Chinese text. For
example, U+3301 (I think) would be rendered as the "<<" character.

However, there are two planar character sets. IIRC, these are in the range
U+1Dxxx, and are the Fraktur mathematical characters and another
math-related character set. From what I can recall, Internet Explorer has
(had?) problems rendering these characters. I think this also extends to
Windows. I thought these would be in the 4th UTF8 byte.
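
To make the 4th-byte case concrete, here is a rough sketch of how a 4-byte
UTF8 sequence maps to a code point above U+FFFF and then to a UTF16
surrogate pair. The helper names DecodeUtf8FourByte and ToSurrogatePair are
made up for illustration, and the decoder assumes its input is already a
well-formed 4-byte sequence.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: decode one well-formed 4-byte UTF-8 sequence
// (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) into a code point.
char32_t DecodeUtf8FourByte(const unsigned char b[4]) {
    return (static_cast<char32_t>(b[0] & 0x07) << 18) |
           (static_cast<char32_t>(b[1] & 0x3F) << 12) |
           (static_cast<char32_t>(b[2] & 0x3F) << 6) |
            static_cast<char32_t>(b[3] & 0x3F);
}

// Hypothetical helper: split a code point in U+10000..U+10FFFF into the
// UTF-16 high (first) and low (second) surrogates.
std::pair<char16_t, char16_t> ToSurrogatePair(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const char32_t v = cp - 0x10000;  // 20 bits remain
    return { static_cast<char16_t>(0xD800 + (v >> 10)),     // top 10 bits
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // low 10 bits
}
```

For example, the bytes F0 9D 94 84 decode to U+1D504 (MATHEMATICAL FRAKTUR
CAPITAL A), which becomes the surrogate pair D835 DD04 in UTF16.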


- Reece


