Re: XMLCh & wchar_t conversion on multiple platforms

Andy Heninger Mon, 14 May 2001 15:33:07 -0700

Dean wrote

Working in the native Unicode format would probably always have significant advantages ...

But wchar_t is not a native Unicode format. It can be almost anything at all, depending on platform and locale. The more I learn about wchar_t, the more I think that it's best avoided to the greatest extent possible.

A Unicode oriented editor that let its data pass through an internal wchar_t encoding at any point along the line has made a big design mistake. If the data is known to be Unicode, keep it in a known Unicode format.

It is true that wchar_t is Unicode based on many platforms. But definitely not all, and figuring out what you've got is not always simple.
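
As a rough illustration of why "figuring out what you've got" is awkward, here is a minimal C++ sketch that reports the two facts a compiler will tell you about wchar_t. The names WCharInfo and describeWChar are hypothetical, made up for this example. __STDC_ISO_10646__ is the standard macro an implementation may define to say that wchar_t values are ISO 10646 code points; its absence does not prove that wchar_t is something other than Unicode.

```cpp
#include <cstddef>

// Hypothetical helper type: what the compiler tells us about wchar_t.
struct WCharInfo {
    std::size_t bytes;   // sizeof(wchar_t), commonly 2 or 4
    bool iso10646;       // true if __STDC_ISO_10646__ is defined
};

// Hypothetical helper: collects the compile-time facts into one value.
inline WCharInfo describeWChar() {
    WCharInfo info;
    info.bytes = sizeof(wchar_t);
#ifdef __STDC_ISO_10646__
    info.iso10646 = true;    // implementation promises ISO 10646 values
#else
    info.iso10646 = false;   // no promise either way
#endif
    return info;
}
```

Even with both answers in hand, the encoding is only a guess: the size says nothing about the locale, and a platform can store Unicode in wchar_t without defining the macro.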

Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]

----- Original Message -----

From: Dean Roddey

To: '[EMAIL PROTECTED]'

Sent: Monday, May 14, 2001 2:52 PM

Subject: RE: XMLCh & wchar_t conversion on multiple platforms

Of course, someone writing a Unicode oriented editor, which has to load up a 128MB file and transcode that huge amount of text into another (at least as large if not larger) buffer just to load it up to read or edit, wouldn't agree that it's trivial or fast :-) Working in the native Unicode format would probably always have significant advantages when the bulk of data gets big.

--------------
Dean Roddey
Software Geek Extraordinaire
Portal, Inc
[EMAIL PROTECTED]

-----Original Message-----
From: Andy Heninger [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 14, 2001 2:17 PM
To: [EMAIL PROTECTED]
Subject: Re: XMLCh & wchar_t conversion on multiple platforms

Here is a proposal for wchar_t conversions from Markus Scherer on the ICU mailing list.

From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "icu list" <[EMAIL PROTECTED]>
Sent: Friday, May 11, 2001 1:32 PM
Subject: icu api proposal: in-process string transformations UChar*<->UTF-8/32/wchar_t*

This is a kind of FAQ:
"ICU processes strings in UTF-16, but my XYZ API uses UTF-8/32/wchar_t*. What do I do?"

This is especially interesting because the UTF transformations are trivial and fast, and because the wchar_t transformation on many platforms today is just a UTF transformation. Providing functions that portably perform these commonly requested transformations and do the legwork when wchar_t is not Unicode seems like a useful feature.

I propose the following 6 functions:

wchar_t *u_strToWCS(wchar_t *dest, int32_t destCapacity,
                    int32_t *pDestLength,
                    const UChar *src, int32_t srcLength,
                    UErrorCode *pErrorCode);

UChar *u_strFromWCS(UChar *dest, ...);

uint8_t *u_strToUTF8(uint8_t *dest, ...);
UChar *u_strFromUTF8(UChar *dest, ...);

uint32_t *u_strToUTF32(uint32_t *dest, ...);
UChar *u_strFromUTF32(UChar *dest, ...);

I propose this not to be part of the converter API. These functions work on process-internal string encodings, while converters are designed for external encodings. There is no buffer management here, and the UTF transformations will use our UTF macros.

Details of semantics:
- The functions always write a NUL termination if destCapacity is sufficient.
- If srcLength==-1 then u_strlen(src) is used as usual. In this case, if there is not enough destCapacity for the NUL, then a U_BUFFER_OVERFLOW_ERROR is set.
- If srcLength>=0 and only the NUL does not fit, then no error code is set.
- If any character except for the automatic NUL does not fit, then a U_BUFFER_OVERFLOW_ERROR is always set.
- All functions always write to the dest buffer.
Note that this would not be necessary when wchar_t carries UTF-16 anyway as on Win32. However, for consistent behavior, the WCS functions will still memcpy().
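
The semantics above can be sketched in C++ roughly as follows. This is not ICU code: sketchToUTF32 and SketchStatus are hypothetical stand-ins for the proposed u_strToUTF32 and UErrorCode, and passing unpaired surrogates through unchanged is an assumption the proposal does not address.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for UErrorCode, reduced to the two cases discussed.
enum class SketchStatus { Ok, BufferOverflow };

// Hypothetical sketch of a UTF-16 -> UTF-32 transformation with the
// NUL-termination and overflow rules listed in the proposal.
inline SketchStatus sketchToUTF32(char32_t* dest, std::int32_t destCapacity,
                                  std::int32_t* pDestLength,
                                  const char16_t* src, std::int32_t srcLength) {
    assert(destCapacity >= 0);
    const bool nulTerminated = (srcLength == -1);
    if (nulTerminated) {              // srcLength == -1: measure up to the NUL
        srcLength = 0;
        while (src[srcLength] != 0) ++srcLength;
    }
    std::int32_t out = 0;             // code points needed, whether or not they fit
    for (std::int32_t i = 0; i < srcLength; ++i) {
        char32_t c = src[i];
        // Combine a lead surrogate with the trail surrogate that follows it.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < srcLength &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        // Assumption: an unpaired surrogate is copied through unchanged.
        if (out < destCapacity) dest[out] = c;
        ++out;
    }
    if (pDestLength) *pDestLength = out;
    if (out > destCapacity) return SketchStatus::BufferOverflow;  // a character did not fit
    if (out < destCapacity) {         // room left: write the NUL terminator
        dest[out] = 0;
        return SketchStatus::Ok;
    }
    // Only the NUL does not fit: an error only when the source was NUL-terminated.
    return nulTerminated ? SketchStatus::BufferOverflow : SketchStatus::Ok;
}
```

The required length is reported through pDestLength even on overflow, so a caller could retry with a larger buffer. That preflighting behavior is a design choice of this sketch and is not stated in the proposal.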

Expiration: Friday, 2001-may-17

markus
_______________________________________________
icu mailing list
[EMAIL PROTECTED]
http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu

There's a bit of discussion on the topic going on over there; follow the links above if you are interested. In the API proposal, UChar is a 16-bit UTF-16 code unit, and thus would be completely interoperable with XMLCh.

Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]
