RE: XMLCh & wchar_t conversion on multiple platforms

Dean Roddey Mon, 14 May 2001 14:45:54 -0700

Of course, someone writing a Unicode-oriented editor, which has to load a 128MB file and transcode that huge amount of text into another buffer (at least as large, if not larger) just to read or edit it, wouldn't agree that it's trivial or fast :-) Working in the native Unicode format would probably always have significant advantages when the bulk of data gets big.

--------------
Dean Roddey
Software Geek Extraordinaire
Portal, Inc
[EMAIL PROTECTED]

-----Original Message-----
From: Andy Heninger [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 14, 2001 2:17 PM
To: [EMAIL PROTECTED]
Subject: Re: XMLCh & wchar_t conversion on multiple platforms

Here is a proposal for wchar_t conversions from Markus Scherer on the ICU mailing list.

From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "icu list" <[EMAIL PROTECTED]>
Sent: Friday, May 11, 2001 1:32 PM
Subject: icu api proposal: in-process string transformations UChar*<->UTF-8/32/wchar_t*

This is a kind of FAQ:
"ICU processes strings in UTF-16, but my XYZ API uses UTF-8/32/wchar_t*. What do I do?"

This is especially interesting because the UTF transformations are trivial and fast, and because the wchar_t transformation on many platforms today is just a UTF transformation. Providing functions that portably perform these commonly requested transformations and do the legwork when wchar_t is not Unicode seems like a useful feature.

I propose the following 6 functions:

wchar_t *u_strToWCS(wchar_t *dest, int32_t destCapacity,
                    int32_t *pDestLength,
                    const UChar *src, int32_t srcLength,
                    UErrorCode *pErrorCode);

UChar *u_strFromWCS(UChar *dest, ...);

uint8_t *u_strToUTF8(uint8_t *dest, ...);
UChar *u_strFromUTF8(UChar *dest, ...);

uint32_t *u_strToUTF32(uint32_t *dest, ...);
UChar *u_strFromUTF32(UChar *dest, ...);

I propose that this not be part of the converter API. These functions work on process-internal string encodings, while converters are designed for external encodings. There is no buffer management here, and the UTF transformations will use our UTF macros.

Details of semantics:
- The functions always write a NUL termination if destCapacity is sufficient.
- If srcLength==-1 then u_strlen(src) is used as usual. In this case, if there is not enough destCapacity for the NUL, then a U_BUFFER_OVERFLOW_ERROR is set.
- If srcLength>=0 and only the NUL does not fit, then no error code is set.
- If any character except for the automatic NUL does not fit, then a U_BUFFER_OVERFLOW_ERROR is always set.
- All functions always write to the dest buffer.
Note that this would not be necessary when wchar_t carries UTF-16 anyway as on Win32. However, for consistent behavior, the WCS functions will still memcpy().
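
To make the NUL and overflow rules above concrete, here is a minimal sketch of a UTF-16 to UTF-32 conversion that follows them. It is not the ICU implementation: the function name toUtf32Sketch and the Status enum are hypothetical stand-ins (Status takes the place of UErrorCode), char16_t/char32_t stand in for UChar and the 32-bit destination type, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status codes standing in for ICU's UErrorCode values.
enum class Status { Ok, BufferOverflow };

// Sketch only: converts UTF-16 to UTF-32 using the NUL/overflow rules from
// the proposal. Returns the number of UTF-32 units the full result needs,
// not counting the NUL. An unpaired surrogate is copied through unchanged.
std::int32_t toUtf32Sketch(char32_t *dest, std::int32_t destCapacity,
                           const char16_t *src, std::int32_t srcLength,
                           Status *status) {
    assert(dest != nullptr || destCapacity == 0);

    // srcLength == -1 means "NUL-terminated", as with u_strlen(src).
    const bool nulTerminated = (srcLength == -1);
    if (nulTerminated) {
        srcLength = 0;
        while (src[srcLength] != 0) ++srcLength;
    }

    std::int32_t needed = 0;
    for (std::int32_t i = 0; i < srcLength; ++i) {
        char32_t c = src[i];
        // Combine a lead/trail surrogate pair into one code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < srcLength &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        if (needed < destCapacity) dest[needed] = c;  // write what fits
        ++needed;
    }

    if (needed < destCapacity) {
        dest[needed] = 0;                 // the NUL fits, so write it
        *status = Status::Ok;
    } else if (needed == destCapacity && !nulTerminated) {
        *status = Status::Ok;             // explicit length, only the NUL is missing
    } else {
        *status = Status::BufferOverflow; // a character (or a required NUL) did not fit
    }
    return needed;
}
```

With a three-unit source holding 'A' and one surrogate pair, a destination of capacity 3 receives two code points plus the NUL. A capacity of 2 reports an overflow when srcLength is -1, but not when srcLength is passed as 3.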

Expiration: Friday, 2001-may-17

markus
_______________________________________________
icu mailing list
[EMAIL PROTECTED]
http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu

There's a bit of discussion on the topic going on over there; follow the links above if you are interested. In the API proposal, UChar is a 16-bit UTF-16 code unit, and thus would be completely interoperable with XMLCh.

Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]

----- Original Message -----
From: "Andy Heninger" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 02, 2001 4:55 PM
Subject: Re: XMLCh & wchar_t conversion on multiple platforms

> wchar_t is messy. For the platforms you mentioned, if sizeof(wchar_t) == 2,
> wchar_t will be UTF-16. If the size is 4 bytes and __STDC_ISO_10646__ is
> defined, wchar_t is UCS4. I think. But this definitely does not cover all
> possible platforms.
>
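
The heuristic in the paragraph above could be sketched roughly as follows. The names WideFormat, guessWideFormat and wideFormatName are made up for illustration, and, as the quoted text itself says, this does not cover all possible platforms.

```cpp
#include <cassert>

// Hypothetical labels for the wchar_t encodings discussed in this thread.
enum class WideFormat { Utf16, Ucs4, Unknown };

// Sketch of the heuristic: 2-byte wchar_t is taken to be UTF-16; 4-byte
// wchar_t is taken to be UCS4 only when __STDC_ISO_10646__ is defined.
// Anything else is reported as Unknown rather than guessed.
inline WideFormat guessWideFormat() {
    if (sizeof(wchar_t) == 2) return WideFormat::Utf16;
#ifdef __STDC_ISO_10646__
    if (sizeof(wchar_t) == 4) return WideFormat::Ucs4;
#endif
    return WideFormat::Unknown;
}

// Human-readable name, e.g. for a diagnostic message at startup.
inline const char *wideFormatName(WideFormat f) {
    switch (f) {
        case WideFormat::Utf16:   return "UTF-16";
        case WideFormat::Ucs4:    return "UCS4";
        case WideFormat::Unknown: return "unknown";
    }
    assert(false && "unhandled WideFormat value");
    return "unknown";
}
```

Returning Unknown instead of assuming a format leaves the caller free to fall back to a real transcoder (iconv or ICU) on platforms the heuristic does not recognize.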
> If you know that your Unicode data has no code points above U+FFFF, you can
> do a quick and dirty conversion to UCS4 by just unpacking the 16-bit values
> into 32 bits, with the high bytes being zero.
>
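
A minimal sketch of that quick and dirty widening, assuming the 16-bit units are held as std::uint16_t (XMLCh is described later in the thread as an unsigned short). The function name widenBmpOnly is hypothetical. Because the shortcut is only valid when there are no surrogates, the sketch refuses the input instead of silently producing wrong output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: zero-extends each 16-bit unit to 32 bits. Returns false and
// leaves `out` empty if a surrogate (0xD800..0xDFFF) is found, because the
// "no code points above U+FFFF" assumption does not hold for that input.
bool widenBmpOnly(const std::uint16_t *src, std::size_t len,
                  std::vector<std::uint32_t> &out) {
    assert(src != nullptr || len == 0);
    out.clear();
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        if (src[i] >= 0xD800 && src[i] <= 0xDFFF) {
            out.clear();
            return false;       // needs real surrogate handling
        }
        out.push_back(src[i]);  // high 16 bits become zero
    }
    return true;
}
```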
> You'd think that there would be simple to use library functions for
> converting to/from wchar_t, but there don't seem to be. I'm lobbying to
> get one added to ICU.
>
> Andy Heninger
> IBM, Cupertino, CA
> [EMAIL PROTECTED]
>
>
> ----- Original Message -----
> From: "Mark A Russell" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, May 01, 2001 12:00 PM
> Subject: RE: XMLCh & wchar_t conversion on multiple platforms
>
>
> > So am I correct then in assuming that I will need to instantiate a
> > transcoder of type ICU or Iconv just to do the conversion? If this is the
> > case, then what are the encodingName values that the constructors take,
> > the UConverter that ICU takes, and the block size that Iconv takes?
> >
> > Is there some sample code out there that gives a simple case of how this
> > works?
> > Also how do you go about determining wchar_t format? (Beyond just using
> > #ifdef's )
> >
> > Thanks,
> >
> > Mark R
> >
> > -----Original Message-----
> > From: Andy Heninger [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, May 01, 2001 10:31 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: XMLCh & wchar_t conversion on multiple platforms
> >
> >
> > wchar_t seems to be perpetually awkward, largely because its definition
> > varies so much from platform to platform. You will end up with some
> > platform specific code to find the local wchar_t format. Once you have
> > that you can use either iconv (UNIXes), ICU converters (all platforms,
> > assuming you have ICU around), or nothing (when wchar_t encoding is
> > utf-16) to get from utf-16 encoded XMLCh strings to wchar_t strings.
> >
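
For the two simplest cases described above, a conversion needs neither iconv nor ICU. The sketch below assumes wchar_t is UTF-16 when it is 2 bytes wide and UCS4 when it is 4 bytes wide, which has to be established separately for each platform. std::u16string stands in for an XMLCh string here, and toWideSketch is a hypothetical name, not a Xerces function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch only: copies UTF-16 units straight through when wchar_t is 2 bytes
// (the "nothing to do" case) and combines surrogate pairs into single code
// points when wchar_t is 4 bytes. Unpaired surrogates are copied unchanged.
std::wstring toWideSketch(const std::u16string &in) {
    static_assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4,
                  "this sketch only handles 2- and 4-byte wchar_t");
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (sizeof(wchar_t) == 4 && c >= 0xD800 && c <= 0xDBFF &&
            i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the trail surrogate has been consumed
        }
        out.push_back(static_cast<wchar_t>(c));
    }
    // Each input unit yields at most one output unit.
    assert(out.size() <= in.size());
    return out;
}
```

On a platform where wchar_t is something else entirely (a locale-dependent encoding, for instance), this shortcut does not apply and a real converter is still needed.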
> >
> > Andy Heninger
> > IBM, Cupertino, CA
> > [EMAIL PROTECTED]
> >
> > ----- Original Message -----
> > From: "Mark A Russell" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, May 01, 2001 6:50 AM
> > Subject: RE: XMLCh & wchar_t conversion on multiple platforms
> >
> >
> > > That seems to be the issue I'm running into, but I can't seem to figure
> > > out how to do the transcoding. I've looked through the docs, and more
> > > importantly the headers, and the closest thing I can find is the
> > > transcodeTo and transcodeFrom functions. The issue I have with those is
> > > that you have to determine which Transcoder to use, i.e. Iconv or ICU,
> > > you have to know the Unicode type when you instantiate the transcoder,
> > > and also they are not static functions, meaning I have to instantiate a
> > > transcoder just to do some conversions.
> > >
> > > Surely there is a simpler way to do the transcoding?
> > >
> > > Mark A Russell
> > > NextGen Software Engineer
> > > CSG Systems, Inc.
> > > E-Mail: [EMAIL PROTECTED]
> > >
> > >
> > > -----Original Message-----
> > > From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> > > Sent: Monday, April 30, 2001 4:44 PM
> > > To: '[EMAIL PROTECTED]'
> > > Subject: RE: XMLCh & wchar_t conversion on multiple platforms
> > >
> > >
> > > A decision was made a while back, which I didn't really agree with, to
> > > fix XMLCh to UTF-16 on all platforms. Partly this was because the DOM
> > > committee chose UTF-16 for its representation. So, if this is not
> > > compatible with your wchar_t, you must transcode all of the data to your
> > > local wide string representation before using it. On NT, the stuff spit
> > > out from the parser is directly usable, since UTF-16 is NT's native
> > > representation of Unicode. On other platforms, you'll have to transcode
> > > if they don't do the same.
> > >
> > > --------------
> > > Dean Roddey
> > > Software Geek Extraordinaire
> > > Portal, Inc
> > > [EMAIL PROTECTED]
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Mark A Russell [mailto:[EMAIL PROTECTED]]
> > > Sent: Monday, April 30, 2001 3:25 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: XMLCh & wchar_t conversion on multiple platforms
> > >
> > >
> > > Is there a way to convert between XMLCh and wchar_t on both the AIX 4.3
> > > and Solaris platforms that won't break my code on NT?
> > >
> > > I have some code that I'm trying to port from win32 that uses wchar_t
> > > for Unicode support. This code currently makes use of some of the xerces
> > > functions that only take XMLCh's. An example is shown below:
> > >
> > >       const wchar_t * szSourceBinding =
> > >           attributes.getValue(CBOITagFactory::ATTR_SOURCE_BINDING);
> > >
> > > The CBOITagFactory::ATTR_SOURCE_BINDING is simply a wchar_t. (XMLCh's
> > > are currently unsigned shorts.)
> > >
> > > My requirement is to maintain Unicode support on all three platforms. I
> > > thought about just redefining XMLCh's to wchar_t's like they used to be
> > > around 1.2; however, after looking at the documentation, that seems like
> > > a very bad idea because of an incompatibility that would arise on the
> > > Solaris platform.
> > >
> > > Any help would be much appreciated.
> > >
> > > btw - What happened to the mailing list archives? They seem to be
> > > unreachable.
> > >
> > > Mark A Russell
> > > NextGen Software Engineer
> > > CSG Systems, Inc.
> > > E-Mail: [EMAIL PROTECTED]
