When developing xIUA, I designed UTF-8 support to be used in two different ways: as a form of Unicode, or as yet another code page. In either case the two are handled, with few exceptions, in the same manner. The only difference is when you want to convert from UTF-8 to an underlying code page. In one case you have an underlying code page such as iso-8859-1 or whatever, and when UTF-8 itself is the code page there is no underlying code page.

To support UTF-8 data I have defined UChar8, which is unsigned char, to ensure consistency and make sure that strings are treated the same way across platforms. Functions starting with xiua_ are common routines. Functions starting with xiu8_ are explicit UTF-8 support functions.

There are conversion and UTF-8 transformation services. I started developing these for ICU 1.4. I added full UTF-16 support, character boundary support, and consistent null-terminated string support. Had I started developing the package with the upcoming ICU 2.0, I might have used the ICU support. There would have been a little extra overhead going through the common converter interface, but there would not have been the duplication of code. I will retain this support: with lots of short transformations for parameters I want as little overhead as possible, and with very short fields the process of resetting a converter to ensure that it is in the proper state could take as much processing as the transform itself. Another reason is that the code is well integrated into the application. For example, I use one of the translation tables to determine UTF-8 character length in many places in the code.

This code also allows you to work in a mix of UTF-8 and UTF-32 in terms of logical characters. In UTF-32, code units and characters are the same; but since UTF-8 is an MBCS encoding, this is not true of UTF-8. I can specify the number of bytes, the number of characters, or the string length when converting to UTF-32.
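The translation-table idea can be sketched as follows. This is an illustrative sketch only, not xIUA's actual table or names: a lead-byte classifier written as a function rather than a 256-entry table, plus a logical-character counter built on it (u8_seq_len and u8_charCount are hypothetical names).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char UChar8;

/* Bytes in the UTF-8 sequence that starts with lead byte b, or 0 if b
 * is a trailing byte or an invalid lead byte.  xIUA uses a translation
 * table for this; a function expresses the same mapping. */
static int u8_seq_len(UChar8 b)
{
    if (b < 0x80) return 1;   /* ASCII */
    if (b < 0xC2) return 0;   /* trailing byte or overlong lead */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;   /* supplementary planes */
    return 0;                 /* beyond the Unicode range */
}

/* Number of logical characters (not bytes) in a null-terminated UTF-8
 * string; ill-formed bytes are stepped over one byte at a time. */
static size_t u8_charCount(const UChar8 *s)
{
    size_t n = 0;
    while (*s) {
        int len = u8_seq_len(*s);
        s += (len ? len : 1);
        n++;
    }
    return n;
}
```

With such a helper, "number of bytes" is just strlen, while "number of characters" walks the string by sequence lengths, which is what lets byte counts and logical character counts be mapped to each other cheaply.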
There are direct UTF transform routines between any UTF format and any other. Thus a UTF-8 to UTF-32 transform is faster than going through UTF-16. I use ICU's converters for all code pages and then transform if needed, but not between UTF formats. I also use the ICU UTF-8 support macros where possible, and the performance is comparable.

The actual UTF-8 support functions fall into two major categories. There are explicit UTF-8 implementations, ranging from UTF-8 data validation routines to string handling routines. The other routines use common UTF-16 or UTF-32 routines. Some routines like strcmp could use a common routine, but the overhead would be too great; for UTF-8 a standard strcmp will do. To make comparison behave the same for all forms of Unicode, the UTF-16 version uses the ICU u_strcmpCodePointOrder function, which is a very efficient routine for comparing UTF-16 strings in Unicode code point order. I use a very similar routine for xiu2_strncmp.

Some routines, like xiu8_strtok, not only return a pointer into the original string but also insert nulls into the string. This kind of code must have separate implementations. Some functions have to be implemented slightly differently. UChar8 * xiu8_strchrEx(UChar8 *string, UChar8 *charptr); is an example of such a function. You are searching for logical characters that may be up to 4 bytes long, so it is impractical to pass the character you are searching for as an int.

Other routines are best handled by a common routine. For example, for strcoll you will want to convert the data from UTF-8 to UTF-16 and call ICU. Because xIUA is a starting model, if you are using UTF-8 you may want to tailor it in one of two ways. You will notice that while there is an xiua_strcoll there is no xiu8_strcoll. This is because some will want to use xiu8_strcollEx, where you specify collation strength and normalization, and others will only want a standard strcoll.
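An strchrEx-style search can be sketched like this, assuming a lead-byte length helper. The names my_strchrEx and u8_seq_len are hypothetical; this is not xIUA's implementation, only an illustration of why the search target is passed as a pointer and why the scan must advance one complete character at a time.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char UChar8;

/* Bytes in the sequence starting with lead byte b, 0 if invalid. */
static int u8_seq_len(UChar8 b)
{
    if (b < 0x80) return 1;
    if (b < 0xC2) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;
}

/* Find the logical character whose UTF-8 bytes start at charptr,
 * walking string one complete character at a time so a trailing
 * byte can never be mistaken for the start of a match. */
static UChar8 *my_strchrEx(const UChar8 *string, const UChar8 *charptr)
{
    int want = u8_seq_len(*charptr);
    if (want == 0) return NULL;
    while (*string) {
        int len = u8_seq_len(*string);
        if (len == 0) len = 1;    /* step over ill-formed bytes */
        if (len == want && memcmp(string, charptr, (size_t)want) == 0)
            return (UChar8 *)string;
        string += len;
    }
    return NULL;
}
```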
Those that also want an xiu8_strcoll can add a #define xiu8_strcoll(a, b) xiu8_strcollEx(a, b, XCOL_TERTIARY)

Routines like the case shifting routines have to convert from UTF-8 to UTF-16 and convert the result back to UTF-8. Because of special casing you should use separate source and target buffers, because the result may be larger or smaller than the original string. The routine should also map lengths as it converts back and forth. UTF-8 is not a good format for case shifting. There is special code in xIUA for those cases where the application was not designed properly to allow you to use a different results buffer: xiua_strtoupperInplace(char *string); gives you a last-ditch workaround that uses a common UTF-32 case shift routine. It uses ICU's u_toupper, which is actually a UTF-32 function, for efficiency.

To make these UTF transformations work efficiently we need another piece: storage for work areas and intermediate results. xIUA has its own storage management that minimizes the malloc/free overhead. The large segments are often conversions. If it uses ICU to convert a code page to UTF-8 it needs intermediate storage for the conversion. To keep the intermediate storage at a minimum it converts in chunks. You can tune this so that the chunks are large enough to use the conversion facilities efficiently but do not use too much storage.

To make UTF-8 really usable we also need other specialized routines. There are the expected routines that deal with character issues, such as character length, the number of logical characters in a string, and string navigation aids, but often you need other, not quite so obvious routines. A good example is xiu8_strncpyEx. This routine differs from a normal strncpy in that it only copies complete characters and always adds a null to the end of the string.
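That behavior can be sketched as follows; my_strncpyEx and u8_seq_len are hypothetical names, not xIUA's code. The copy stops before any character that would not fit whole, and the result is always null terminated:

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char UChar8;

/* Bytes in the sequence starting with lead byte b, 0 if invalid. */
static int u8_seq_len(UChar8 b)
{
    if (b < 0x80) return 1;
    if (b < 0xC2) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;
}

/* Copy into dst (capacity size bytes, including the null) without
 * ever splitting a multi-byte character; always null terminates. */
static UChar8 *my_strncpyEx(UChar8 *dst, const UChar8 *src, size_t size)
{
    size_t out = 0;
    while (size > 0 && *src) {
        int len = u8_seq_len(*src);
        if (len == 0) break;                  /* stop at ill-formed input */
        if (out + (size_t)len >= size) break; /* whole char plus null must fit */
        memcpy(dst + out, src, (size_t)len);
        out += (size_t)len;
        src += len;
    }
    if (size > 0) dst[out] = 0;
    return dst;
}
```

Given "a\xC3\xA9" and a capacity of 3, a plain strncpy would cut the two-byte character in half; a copy like this one keeps only the "a" and the terminating null.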
This routine can be used where you want to break data into chunks or limit the size of a string, but it will not produce broken or split UTF-8 characters. There are more details that go into UTF-8 support, but these are some of the major issues.

It is not hard to take existing MBCS code and treat UTF-8 as just another character set by customizing the length detection for UTF-8 like any other character set. Such routines, however, do not treat UTF-8 as Unicode. Some libraries only support 3-byte UTF-8 encoding and do not support characters in the other planes. Others are even more limited in that they may restrict some functions, like case shifting and character searching, to the ASCII portion of UTF-8. I have found that a proper UTF-8 support library will have full Unicode 3.1 support, including conformance to the new UTF-8 specifications. Also look at functions like strchr: do they have a way to search for the full range of UTF-8 characters? Also look at case shifting: do they provide different input and output areas? If not, they probably do not implement special casing, and you will not get proper case shifting even for common languages like German. The other factor is to check for good locale support. Many UTF-8 functions are locale independent, but other routines, like collation, date/time, and numeric formatting, are locale dependent. Is the locale support thread independent?

Carl