Re: UCS-2 and UTF-16 [was Re: Encode, take five]

2000-09-14 Thread Mark Leisher
Philip> Yes, but if you just have a high surrogate, you can't do much with
Philip> it -- it doesn't represent a Unicode character but only half of
Philip> one. So you need a high surrogate plus a low surrogate to display
Philip> a character beyond U+, leading to a 32-bit repre

Re: UCS-2 and UTF-16 [was Re: Encode, take five]

2000-09-14 Thread Philip Newton
On 13 Sep 2000, at 11:57, Mark Leisher wrote:
> True, UTF-16 is not known as UCS-2. However, UTF-16 still consists
> of 2-byte chunks. It is essentially UCS-2 plus high and low
> surrogates (see the Unicode Standard 3.0 page 19).
Yes, but if you just have a high surrogate, you can't do much w

Re: Encode, take five

2000-09-13 Thread Philip Newton
On Wed, 13 Sep 2000, Matt Sergeant wrote:
> Until someone extends the Unicode character set beyond the current range,
This has "already" happened. Have a look at http://www.unicode.org/unicode/alloc/Pipeline.html , the Unicode allocation pipeline of proposed new characters and scripts. It lists

Re: Encode, take five

2000-09-13 Thread Ed Batutis
On Wed, 13 Sep 2000 19:49:32 Matt Sergeant wrote:
>On Wed, 13 Sep 2000, Philip Newton wrote:
>
>Until someone extends the Unicode character set beyond the current range,
>UCS-2 and UTF-16 currently have a one to one mapping. I assume that's the
>point being made.
An excerpt from the book I'm cur

Re: Encode, take five

2000-09-13 Thread Matt Sergeant
On Wed, 13 Sep 2000, Philip Newton wrote:
> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
>
> > UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
>
> As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte
> chunks, since UTF-16 contains surrogates (high-surrogate + low-

UCS-2 and UTF-16 [was Re: Encode, take five]

2000-09-13 Thread Mark Leisher
Philip> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
>> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
Philip> As I understand it, that's not true -- UTF-16 is 2-byte *or*
Philip> 4-byte chunks, since UTF-16 contains surrogates (high-surrogate +
Philip> low-su

Re: Encode, take five

2000-09-13 Thread Philip Newton
On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte chunks, since UTF-16 contains surrogates (high-surrogate + low-surrogate [or the other way around?] = 1 character, re
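For readers following the 2-byte versus 4-byte point, here is a minimal sketch of how one code point maps to UTF-16 code units. It is written in Python purely for illustration, and the function name to_utf16_units is made up for this note, not part of any Encode API discussed in the thread. It also settles the order question: the high surrogate comes first, then the low surrogate.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point.

    Hypothetical helper for illustration only.
    """
    # Surrogate code points themselves are not valid characters.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Inside the 16-bit range: a single code unit, identical to UCS-2.
        return [code_point]
    # Beyond the 16-bit range: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high surrogate, always first
    low = 0xDC00 + (offset & 0x3FF)   # low surrogate, always second
    return [high, low]
```

So to_utf16_units(0x41) gives one unit, while to_utf16_units(0x1D11E) gives the pair 0xD834, 0xDD1E, i.e. four bytes for a single character.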

Re: Encode, take five

2000-09-13 Thread Jarkko Hietaniemi
You must be careful or I'll appoint you to be the Encode designer/documenter :-)

As I said, it's in the sources now, and I have no time to play with it for a while; anyone who feels interested should hack at it.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word w

Re: Encode, take five

2000-09-13 Thread Jarkko Hietaniemi
> The 'in-place' stuff is a slight pain - these things become chop-like.
>
> I cannot just say
>
>    syswrite(Handle,chars_to_utf8(join(' ',map( { ...} ;
>
> But I guess the function style makes it hard to test for error cases.
Yes, and if errors can be of several kinds. Another reason i

Re: Encode, take five

2000-09-13 Thread Nick Ing-Simmons
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
>My names were meant to be used like this:
>
> sysread(Handle,$buffer,...); # buffer seq of bytes
> my $str = utf8_to_chars(substr($buffer,$start,$len));
> # now we have string of chars and we can

Re: Encode, take five

2000-09-12 Thread Jarkko Hietaniemi
I'm much tempted to go with these. Yes, it's ISO 8859-1-biased.

=item * bytes_to_utf8(STRING)

The bytes in STRING are encoded in-place into UTF-8. The bytes are expected to be encoded in US-ASCII or ISO 8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if there's a failure. [INT
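As a rough illustration of what the Latin 1 to UTF-8 step amounts to, here is a small Python sketch. It is not the proposed bytes_to_utf8: it returns a new byte string instead of working in place, it does not return a size, and the name latin1_to_utf8 is invented for this note. It only shows why the conversion is simple: every Latin 1 byte is its own code point, and code points 0x80..0xFF always take exactly two bytes in UTF-8.

```python
def latin1_to_utf8(data):
    """Re-encode ISO 8859-1 bytes as UTF-8 (hypothetical helper, not in place)."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            # US-ASCII range: unchanged in UTF-8.
            out.append(byte)
        else:
            # 0x80..0xFF: two-byte form 110xxxxx 10xxxxxx.
            out.append(0xC0 | (byte >> 6))
            out.append(0x80 | (byte & 0x3F))
    return bytes(out)
```

For example, latin1_to_utf8(b"caf\xe9") yields b"caf\xc3\xa9", so the result can grow to at most twice the input length.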

Re: Encode, take five (malformed UTF-8)

2000-09-12 Thread Jarkko Hietaniemi
> > =head2 Handling Malformed Data
>
> What exactly is malformed UTF-8 data here?
>
> Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.
>
> Does it also cover overlong UTF-8 sequences, i.e. any string
> containing any of the five bit sequences
>
> 110x,
> 111000
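To make "overlong" concrete, here is a minimal Python sketch that checks a single UTF-8 sequence. It is an illustration under stated assumptions, not the check Encode performs: the name is_overlong is made up, it handles only the 1- to 4-byte forms (not the 5- and 6-byte forms of the original ISO 10646 definition), and it expects exactly one character's worth of bytes. A sequence is overlong when the value it carries would have fit in a shorter form.

```python
def is_overlong(seq):
    """True if `seq` (UTF-8 bytes for ONE character) is longer than necessary.

    Hypothetical helper; raises ValueError on structurally invalid input.
    """
    first = seq[0]
    if first < 0x80:
        length, value = 1, first
    elif first >> 5 == 0b110:
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    # Smallest code point that legitimately needs 1, 2, 3, 4 bytes.
    minimum = (0x0, 0x80, 0x800, 0x10000)[length - 1]
    return value < minimum
```

For instance, b"\xC0\x80" (a two-byte NUL) and b"\xE0\x80\xAF" (a three-byte "/") are overlong, while b"\xC3\xA9" is the normal encoding of U+00E9. Accepting overlong forms lets the same character hide behind several byte sequences, which is why a decoder would usually want to treat them as malformed.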

Re: Encode, take five (malformed UTF-8)

2000-09-12 Thread Jarkko Hietaniemi
On Wed, Sep 13, 2000 at 01:33:33AM +0100, Markus Kuhn wrote:
> Jarkko Hietaniemi wrote on 2000-09-12 23:42 UTC:
> >   '7'     UTF-7
> >   '8'     UTF-8
> >   '16be'  UTF-16 big-endian
> >   '16le'  UTF-16 little-endian
> >   '16ne'

Re: Encode, take five

2000-09-12 Thread Jarkko Hietaniemi
> bytes_to_utf8(STRING [, CHECK])
>
> The bytes in STRING are encoded in-place into UTF-8. The bytes are
> assumed to be encoded in US-ASCII, bytes between 0 and 127, inclusive.
> Returns the new size of STRING, or C<undef> if there's a failure.
Okay, now it's time for me to slow down. Now I'

Re: Encode, take five (malformed UTF-8)

2000-09-12 Thread Markus Kuhn
Jarkko Hietaniemi wrote on 2000-09-12 23:42 UTC:
>   '7'     UTF-7
>   '8'     UTF-8
>   '16be'  UTF-16 big-endian
>   '16le'  UTF-16 little-endian
>   '16ne'  UTF-16 native-endian
>   '32be'  UTF-32 big-endian
>

Encode, take five

2000-09-12 Thread Jarkko Hietaniemi
I tried to explore the boundary cases and error conditions now thoroughly. A new feature is customizable error handling. Note also the s/strict/check/g.

=pod

=head1 NAME

Encode - character encodings

=head2 TERMINOLOGY

=over

=item * I: a B in the range 0..255

=item * I: a B in the range 0