Somya asked:

> I have a unicode C application. I am using the following macro to define my
> strings as 2-byte-wide characters:
>
>     #ifdef UNICODE
>     #define _T(x) L##x
>
> But I see that the GCC compiler maps 'L' to wchar_t, which is 4 bytes on
> Linux. I have used the -fshort-wchar option on Linux, but I want my
> application to be portable to AIX as well, which does not have this option.
> I am not able to find the best way to define the UNICODE version of _T(x)
> that always takes 2-byte-wide characters.
> Taking this into account, what is the best way to define the UNICODE version
> of the _T(x) macro so that my strings will always use 2-byte-wide characters?

Well, some may disagree with me, but my first piece of advice would be to avoid
macros like that altogether. And second, absolutely avoid any use of wchar_t in
the context of processing Unicode characters and strings.

If you are working with C compilers that support the C99 standard, you can
instead make use of the stdint.h exact-width integer types, and then typedef
your Unicode code unit types to those exact-width types:

    uint8_t   <-- typedef your UTF-8 code unit type to this
    uint16_t  <-- typedef your UTF-16 code unit type to this
    uint32_t  <-- typedef your UTF-32 code unit type to this

See: http://en.wikipedia.org/wiki/Stdint.h

If you need to cross-compile on platforms that don't support the C99 types,
then you can probably get away with:

    unsigned char
    unsigned short
    unsigned int

which should normally resolve to 8-bit, 16-bit, and 32-bit types, respectively,
on all platforms.
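As a concrete illustration, here is a minimal sketch of such a header. The
names utf8_t, utf16_t, utf32_t and the UNICODE_TYPES_H guard are placeholders
invented for this example, not standard or library-defined names:

    /* unicode_types.h -- a minimal sketch of fixed-width code unit typedefs */
    #ifndef UNICODE_TYPES_H
    #define UNICODE_TYPES_H

    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
    #include <stdint.h>
    typedef uint8_t  utf8_t;   /* UTF-8 code unit  */
    typedef uint16_t utf16_t;  /* UTF-16 code unit */
    typedef uint32_t utf32_t;  /* UTF-32 code unit */
    #else
    /* Fallback for pre-C99 compilers: these normally resolve to
     * 8-, 16-, and 32-bit types on common platforms. */
    typedef unsigned char  utf8_t;
    typedef unsigned short utf16_t;
    typedef unsigned int   utf32_t;
    #endif

    #endif /* UNICODE_TYPES_H */

A UTF-16 string in your application is then simply an array of utf16_t code
units, independent of what the platform's wchar_t happens to be.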
Once you have your three fixed-width code unit typedefs in hand, do all of your
Unicode character and string processing using those types.

When you are making use of other Unicode libraries, those libraries often have
these typedefs already defined for you. ICU, for example, has typedefs for
UChar (an unsigned 16-bit integer) and UChar32 (a signed 32-bit integer). [The
choice between a signed or unsigned 32-bit integer comes down to library design
decisions, but in all cases the valid 32-bit values for Unicode characters are
in the positive range 0..0x10FFFF.]

See: http://userguide.icu-project.org/strings
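To give a feel for what string processing on top of such typedefs looks like,
here is a small sketch that counts the code points in a null-terminated UTF-16
string. It assumes the hypothetical utf16_t typedef and unicode_types.h header
from the example above, and applies the UTF-16 surrogate-pair rules referenced
below:

    #include <stddef.h>
    #include "unicode_types.h"   /* the hypothetical header sketched above */

    /* Count code points in a null-terminated UTF-16 string (sketch only). */
    size_t utf16_count_code_points(const utf16_t *s)
    {
        size_t count = 0;
        while (*s != 0) {
            if (*s >= 0xD800 && *s <= 0xDBFF &&       /* lead surrogate...    */
                s[1] >= 0xDC00 && s[1] <= 0xDFFF) {   /* ...followed by trail */
                s += 2;   /* one supplementary code point, two code units */
            } else {
                s += 1;   /* one BMP code point (an unpaired surrogate here
                             would be ill-formed and needs real error handling) */
            }
            count++;
        }
        return count;
    }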
Once you have your code set up to use typedefs like this for your Unicode
characters and strings, read, understand, and follow the rules for the UTF-8,
UTF-16, and UTF-32 encoding forms, as documented in Section 3.9, Unicode
Encoding Forms, of the Unicode Standard:

    http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf

Your Unicode string handling should then be correct and conformant.

--Ken