Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson
Michael (michka) Kaplan writes: > > To find character n I have to walk all of the 16-bit values in that string accounting for surrogates. If I use UTF-32 I don't need to do that. This very issue came up during the discussion of how to handle surrogates in Python. > Would this not
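
To make the cost concrete, here is a minimal sketch (mine, not from the thread) of locating the n-th code point in a UTF-16 buffer: the walk has to skip over surrogate pairs, so it is linear, whereas with UTF-32 the same lookup is a constant-time array index.

    #include <stddef.h>
    #include <stdint.h>

    /* Return the offset (in 16-bit units) of the n-th code point in a
     * UTF-16 string of 'len' units, or (size_t)-1 if the string is too
     * short.  Each lead surrogate (0xD800..0xDBFF) followed by a trail
     * surrogate consumes an extra unit, which is why this cannot be the
     * constant-time s[n] that works for UTF-32. */
    static size_t utf16_index_of(const uint16_t *s, size_t len, size_t n)
    {
        size_t i = 0;
        while (i < len) {
            if (n == 0)
                return i;
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                i + 1 < len && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i += 2;            /* surrogate pair: one code point, two units */
            else
                i += 1;            /* BMP unit (or unpaired surrogate) */
            n--;
        }
        return (size_t)-1;
    }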

Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Michael (michka) Kaplan
From: "Tom Emerson" <[EMAIL PROTECTED]> > But if I have a text string, and that string is encoded in UTF-16, and > I want to access Unicode character values, then I cannot index that > string in constant time. > > To find character n I have to walk all of the 16-bit values in that > string accoun

RE: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson
Carl W. Brown writes: > If you implement an array that is directly indexed by Unicode code point it would have to have 1114111 entries. (I love the number) I don't think that many applications can afford to have over a megabyte of storage per byte of table width. If nothing else it would
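
As a back-of-the-envelope check (my own arithmetic, not from the post): code points run from U+0000 to U+10FFFF, so a flat, directly indexed table needs 0x110000 = 1,114,112 slots, roughly 1.06 MB per byte of entry width.

    #include <stdio.h>

    int main(void)
    {
        const unsigned long code_points = 0x110000UL;   /* U+0000 .. U+10FFFF */

        /* storage cost of a flat table directly indexed by code point */
        for (unsigned width = 1; width <= 4; width++)
            printf("%u-byte entries: %lu bytes (~%.2f MB)\n",
                   width, code_points * width,
                   code_points * width / (1024.0 * 1024.0));
        return 0;
    }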

RE: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Carl W. Brown
Tom, > Andy Heninger writes: > > Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon that it's OK for them to take a bail-out slow path. > Sure, but if you are using UTF-16 (or any other multibyte encoding)

Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson
Andy Heninger writes: > Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon that it's OK for them to take a bail-out slow path. Sure, but if you are using UTF-16 (or any other multibyte encoding) you lose the a

Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Andy Heninger
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> > Why would UTF-16 be easier for internal processing than UTF-8? > Both are variable-length encodings. > Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon t

Re: 3rd-party cross-platform UTF-8 support

2001-09-23 Thread Andy Heninger
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> > Why would UTF-16 be easier for internal processing than UTF-8? > Both are variable-length encodings. > Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon t

Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Michael (michka) Kaplan
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> > Why would UTF-16 be easier for internal processing than UTF-8? > Both are variable-length encodings. Good straw man! Working with UTF-16 is immensely easier than working with UTF-8. As I am am sure you know! :-) MichKa Michael Kaplan Tr

Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk
Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler <[EMAIL PROTECTED]> writes: > If you are expecting better performance from a library that takes UTF-8 APIs and then does all its internal processing in UTF-8 *without* converting to UTF-16, then I think you are mistaken. UTF-8 is a bad

RE: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yves Arrouye
> > UTF-16 <-> wchar_t* > Wait, be careful here. wchar_t is not an encoding. So, in theory, you cannot convert between UTF-16 and wchar_t. You, however, can convert between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16 to be the encoding for wchar_t. And he can also

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread David Starner
On Fri, Sep 21, 2001 at 04:16:50PM -0700, Yung-Fong Tang wrote: > Then... use Unicode internally in your software regardless of whether you use UTF-8 or UCS2 as the data type in the interface; eventually some code needs to convert it to UCS2 for most of the processing. Why? UCS2 shouldn't be used at

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer
Yung-Fong Tang wrote: > > UTF-16 <-> wchar_t* > Wait, be careful here. wchar_t is not an encoding. So, in theory, you cannot convert between UTF-16 and wchar_t. You, however, can convert between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16 as the encoding for wchar_

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang
Markus Scherer wrote: > I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations: > UTF-16 <-> UTF-8 > UTF-16 <-> UTF-32 > UTF-16 <-> wchar_t* Wait, be careful here. wchar_t is not an encoding. So, in theory, yo
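
To illustrate the point that wchar_t is not an encoding (my own sketch, not from the thread): the type's width differs by platform, so what a wchar_t* string actually holds differs too — 16-bit UTF-16 code units on Win32, but usually 32-bit values under Unix compilers.

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Prints 2 on Win32 (UTF-16 code units) and typically 4 on Unix
         * systems, where a single wchar_t can hold any code point and a
         * UTF-16 <-> wchar_t* conversion is effectively UTF-16 <-> UTF-32. */
        printf("sizeof(wchar_t) = %u bytes\n", (unsigned)sizeof(wchar_t));
        return 0;
    }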

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang
Mozilla also uses Unicode internally and is cross-platform. [EMAIL PROTECTED] wrote: For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party Unicode support I found so far is IBM ICU. It provides very good support for cross-platform software internationalization. However, ICU internally

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer
I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations: UTF-16 <-> UTF-8, UTF-16 <-> UTF-32, UTF-16 <-> wchar_t*. markus
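
For reference, a minimal sketch of what using such convenience functions looks like with ICU's C API; the u_strFromUTF8/u_strToUTF8 names and signatures are taken from ICU's unicode/ustring.h as I recall it, and are assumed here to be the functions Markus means.

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        const char *utf8 = "caf\xC3\xA9";   /* "café" encoded as UTF-8 */
        UChar utf16[16];                    /* UChar is ICU's UTF-16 code unit */
        char back[16];
        int32_t len16 = 0, len8 = 0;
        UErrorCode status = U_ZERO_ERROR;

        u_strFromUTF8(utf16, 16, &len16, utf8, -1, &status);   /* UTF-8 -> UTF-16 */
        u_strToUTF8(back, 16, &len8, utf16, len16, &status);   /* UTF-16 -> UTF-8 */

        if (U_SUCCESS(status))
            printf("round-tripped %d UTF-16 units / %d UTF-8 bytes\n",
                   (int)len16, (int)len8);
        return 0;
    }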

RE: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread Carl W. Brown
Ken, > > I have to convert from UTF-8 to UTF-16 before calling ICU functions (such as ucol_strcoll()). > > I'm worried about the performance overhead of this conversion. > You shouldn't be. The conversion from UTF-8 to UTF-16 and back is algorithmic and very fast. To make this
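
To show why the conversion is cheap, here is a minimal hand-rolled UTF-8 to UTF-16 decoder of my own (it assumes well-formed input; a real converter must validate and report errors): a handful of shifts and masks per character, which is why tuned converters such as ICU's run very fast.

    #include <stddef.h>
    #include <stdint.h>

    /* Convert well-formed UTF-8 to UTF-16; returns the number of 16-bit
     * units written to 'dst' (which must be large enough). */
    static size_t utf8_to_utf16(const uint8_t *src, size_t srclen, uint16_t *dst)
    {
        size_t i = 0, o = 0;
        while (i < srclen) {
            uint32_t cp;
            uint8_t b = src[i];
            if (b < 0x80) {                         /* 1 byte: U+0000..U+007F */
                cp = b;
                i += 1;
            } else if (b < 0xE0) {                  /* 2 bytes: U+0080..U+07FF */
                cp = ((uint32_t)(b & 0x1F) << 6) | (src[i + 1] & 0x3F);
                i += 2;
            } else if (b < 0xF0) {                  /* 3 bytes: U+0800..U+FFFF */
                cp = ((uint32_t)(b & 0x0F) << 12) |
                     ((uint32_t)(src[i + 1] & 0x3F) << 6) |
                     (src[i + 2] & 0x3F);
                i += 3;
            } else {                                /* 4 bytes: U+10000..U+10FFFF */
                cp = ((uint32_t)(b & 0x07) << 18) |
                     ((uint32_t)(src[i + 1] & 0x3F) << 12) |
                     ((uint32_t)(src[i + 2] & 0x3F) << 6) |
                     (src[i + 3] & 0x3F);
                i += 4;
            }
            if (cp < 0x10000) {
                dst[o++] = (uint16_t)cp;            /* BMP: single unit */
            } else {                                /* supplementary: surrogate pair */
                cp -= 0x10000;
                dst[o++] = (uint16_t)(0xD800 + (cp >> 10));
                dst[o++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
            }
        }
        return o;
    }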

Re: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread David Starner
On Thu, Sep 20, 2001 at 02:02:37PM -0400, [EMAIL PROTECTED] wrote: > I'm worried about the performance overhead of this conversion. How much is this performance overhead? Converting UTF-8 to UTF-16 is computationally trivial; my guess is that it would be significant for cat or grep (maybe . . . t

Re: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread Kenneth Whistler
Changjian Sun said: > For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party Unicode support I found so far is IBM ICU. It provides very good support for cross-platform software internationalization. However, ICU internally uses UTF-16. For our application using UTF-8 as inp

3rd-party cross-platform UTF-8 support

2001-09-20 Thread Changjian_Sun
For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party Unicode support I found so far is IBM ICU. It provides very good support for cross-platform software internationalization. However, ICU internally uses UTF-16. For our application, which uses UTF-8 as input and output, I have to convert fr