Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Andy Heninger
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED] Why would UTF-16 be easier for internal processing than UTF-8? Both are variable-length encodings. Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon that

Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson
Andy Heninger writes: Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon that it's OK for them to take a bail-out slow path. Sure, but if you are using UTF-16 (or any other multibyte encoding) you lose the
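
The fast-path idea quoted above can be sketched roughly as follows. This is only an illustration, not code from ICU or from anyone in the thread: the names `count_code_points` and `surrogate_width` are hypothetical, the input is assumed to be a list of 16-bit code unit values, and an unpaired surrogate is (by assumption) counted as one character.

```python
def surrogate_width(units, i):
    """Slow path (hypothetical helper): width in code units of the
    character starting at a surrogate code unit units[i]."""
    u = units[i]
    is_high = 0xD800 <= u <= 0xDBFF
    has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
    # A well-formed pair is two units; an unpaired surrogate is
    # treated as a single unit (an assumption of this sketch).
    return 2 if is_high and has_low else 1


def count_code_points(units):
    """Count characters in a list of UTF-16 code units."""
    count = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if u < 0xD800 or u > 0xDFFF:
            # Fast path: a BMP character is exactly one code unit.
            i += 1
        else:
            # Uncommon case: bail out to the slower surrogate handling.
            i += surrogate_width(units, i)
        count += 1
    return count
```

The common case costs one range check per code unit; only text containing surrogates ever reaches the helper.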

RE: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Carl W. Brown
Tom, Andy Heninger writes: Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon that it's OK for them to take a bail-out slow path. Sure, but if you are using UTF-16 (or any other multibyte encoding) you

RE: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson
Carl W. Brown writes: If you implement an array that is directly indexed by Unicode code point it would have to have 1114111 entries. (I love the number) I don't think that many applications can afford to have over a megabyte of storage per byte of table width. If nothing else it would be

Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Michael (michka) Kaplan
From: Tom Emerson [EMAIL PROTECTED] But if I have a text string, and that string is encoded in UTF-16, and I want to access Unicode character values, then I cannot index that string in constant time. To find character n I have to walk all of the 16-bit values in that string accounting for

Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson
Michael (michka) Kaplan writes: To find character n I have to walk all of the 16-bit values in that string accounting for surrogates. If I use UTF-32 I don't need to do that. This very issue came up during the discussion of how to handle surrogates in Python. Would this not be the
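
To make the indexing point concrete, here is a minimal sketch of finding character n in UTF-16 by walking the code units. The function names `utf16_units` and `code_point_at` are hypothetical, chosen for illustration only; the sketch assumes the string is held as a list of 16-bit values and passes unpaired surrogates through as-is.

```python
def utf16_units(s):
    """Encode a Python string as a list of UTF-16 code unit values."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_point_at(units, n):
    """Return the n-th code point (0-based) by scanning from the start.

    This is O(n): every earlier code unit has to be inspected, because a
    surrogate pair occupies two units but counts as one character.
    """
    i = 0
    count = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one supplementary character.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = u
            step = 1
        if count == n:
            return cp
        count += 1
        i += step
    raise IndexError("code point index out of range")


units = utf16_units("a\U00010400b")
print([hex(u) for u in units])
print(hex(code_point_at(units, 1)))
```

With UTF-32 the same lookup would be a single array access, since every character occupies exactly one 32-bit unit.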

Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk
Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] pisze: If you are expecting better performance from a library that takes UTF-8 API's and then does all its internal processing in UTF-8 *without* converting to UTF-16, then I think you are mistaken. UTF-8 is a bad form

Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Michael (michka) Kaplan
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED] Why would UTF-16 be easier for internal processing than UTF-8? Both are variable-length encodings. Good straw man! Working with UTF-16 is immensely easier than working with UTF-8. As I am sure you know! :-) MichKa Michael Kaplan

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer
I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations: UTF-16 - UTF-8 UTF-16 - UTF-32 UTF-16 - wchar_t* markus

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang
Mozilla also uses Unicode internally and is cross-platform. [EMAIL PROTECTED] wrote: For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party Unicode support I found so far is IBM ICU. It's very good support for cross-platform software internationalization. However, ICU internally

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang
Markus Scherer wrote: I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations: UTF-16 - UTF-8 UTF-16 - UTF-32 UTF-16 - wchar_t* Wait, be careful here. wchar_t is not an encoding. So, in theory, you cannot

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer
Yung-Fong Tang wrote: UTF-16 - wchar_t* Wait, be careful here. wchar_t is not an encoding. So, in theory, you cannot convert between UTF-16 and wchar_t. You can, however, convert between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16 as the encoding for wchar_t.

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread David Starner
On Fri, Sep 21, 2001 at 04:16:50PM -0700, Yung-Fong Tang wrote: Then... use Unicode internally in your software. Regardless of whether you use UTF-8 or UCS2 as the data type in the interface, eventually some code needs to convert it to UCS2 for most of the processing. Why? UCS2 shouldn't be used at

RE: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yves Arrouye
UTF-16 - wchar_t* Wait, be careful here. wchar_t is not an encoding. So, in theory, you cannot convert between UTF-16 and wchar_t. You can, however, convert between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16 as the encoding for wchar_t. And he can also do some

Re: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread Kenneth Whistler
Changjian Sun said: For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party Unicode support I found so far is IBM ICU. It's very good support for cross-platform software internationalization. However, ICU internally uses UTF-16. For our application using UTF-8 as input

Re: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread David Starner
On Thu, Sep 20, 2001 at 02:02:37PM -0400, [EMAIL PROTECTED] wrote: I'm worried about the performance overhead of this conversion. How much is this performance overhead? Converting UTF-8 to UTF-16 is computationally trivial; my guess is that it would be significant for cat or grep (maybe . . .

RE: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread Carl W. Brown
Ken I have to convert from UTF-8 to UTF-16, before calling ICU functions (such as ucol_strcoll() ) I'm worried about the performance overhead of this conversion. You shouldn't be. The conversion from UTF-8 to UTF-16 and back is algorithmic and very fast. To make this conversion
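
As a rough illustration of why the conversion is "algorithmic", the sketch below turns UTF-8 bytes into UTF-16 code units using only shifts and masks, with no lookup tables. It is not ICU's implementation: the function name `utf8_to_utf16` is hypothetical, and the input is assumed to be well-formed UTF-8 (no validation of continuation bytes, overlong forms, or truncated sequences).

```python
def utf8_to_utf16(data):
    """Convert well-formed UTF-8 bytes to a list of UTF-16 code units.

    Sketch only: malformed input is NOT detected.
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        # The lead byte determines the sequence length and its payload bits.
        if b < 0x80:
            cp, n = b, 1
        elif b < 0xE0:
            cp, n = b & 0x1F, 2
        elif b < 0xF0:
            cp, n = b & 0x0F, 3
        else:
            cp, n = b & 0x07, 4
        # Each continuation byte contributes its low six bits.
        for k in range(1, n):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        i += n
        if cp < 0x10000:
            # BMP character: a single UTF-16 code unit.
            units.append(cp)
        else:
            # Supplementary character: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    return units


print([hex(u) for u in utf8_to_utf16("a\u00e9\u20ac\U00010400".encode("utf-8"))])
```

A production converter would additionally have to reject or replace malformed sequences, which this sketch leaves out.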