Re: Encode, take four

2000-09-13 Thread Marc Lehmann
On Tue, Sep 12, 2000 at 03:59:15PM -0500, Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote: > bytes_to_utf8(STRING) > > The bytes in STRING are encoded in-place into UTF-8. Returns the new > size of STRING, or undef if there's a failure. [INTERNAL] Also the > UTF-8 flag is turned on. Just t

Re: Encode, take five

2000-09-13 Thread Philip Newton
On Wed, 13 Sep 2000, Matt Sergeant wrote: > Until someone extends the Unicode character set beyond the current range, This has "already" happened. Have a look at http://www.unicode.org/unicode/alloc/Pipeline.html , the Unicode allocation pipeline of proposed new characters and scripts. It lists

Re: Encode, take five

2000-09-13 Thread Ed Batutis
On Wed, 13 Sep 2000 19:49:32 Matt Sergeant wrote: >On Wed, 13 Sep 2000, Philip Newton wrote: > >Until someone extends the Unicode character set beyond the current range, >UCS-2 and UTF-16 currently have a one to one mapping. I assume that's the >point being made. An excerpt from the book I'm cur

Re: Encode, take five

2000-09-13 Thread Matt Sergeant
On Wed, 13 Sep 2000, Philip Newton wrote: > On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote: > > > UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks, > > As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte > chunks, since UTF-16 contains surrogates (high-surrogate + low-

UCS-2 and UTF-16 [was Re: Encode, take five]

2000-09-13 Thread Mark Leisher
Philip> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote: >> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks, Philip> As I understand it, that's not true -- UTF-16 is 2-byte *or* Philip> 4-byte chunks, since UTF-16 contains surrogates (high-surrogate + Philip> low- su

Re: Encode, take five

2000-09-13 Thread Philip Newton
On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote: > UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks, As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte chunks, since UTF-16 contains surrogates (high-surrogate + low-surrogate [or the other way around?] = 1 character, re

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-13 Thread Philip Newton
On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote: > I would go for UCS-2 (UTF-16) as soon as possible as the preferred > internal encoding. You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates). What's Perl's take on characters where ord($c) > 0x, anyway? (These two issues

Re: Encode, take five

2000-09-13 Thread Jarkko Hietaniemi
You must be careful or I'll appoint you to be the Encode designer/documenter :-) As I said, it's in the sources now, and I have no time to play with it for a while; anyone who feels interested should hack at it. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word w

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-13 Thread Jarkko Hietaniemi
On Wed, Sep 13, 2000 at 06:00:55PM +0200, Philip Newton wrote: > On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote: > > > I would go for UCS-2 (UTF-16) as soon as possible as the preferred > > internal encoding. > > You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates). Surroga

Re: Encode, take five

2000-09-13 Thread Jarkko Hietaniemi
> The 'in-place' stuff is a slight pain - these thing become chop-like. > > I cannot just say > >syswrite(Handle,chars_to_utf8(join(' ',map( { ...} ; > > But I guess the function style makes it hard to test for error cases. Yes, and if errors can be of several kinds. Another reason i

PICU: Perl wrappers for ICU

2000-09-13 Thread bstell
(the goal:) I've started putting a Perl wrapper around the IBM International Components for Unicode (ICU) library. (see the end of this email for more details) (about me:) I've done internationalization (i18n) for a while but I'm just learning ICU. I have done plenty of Perl but I am new to Xs

Re: Encode, my final take for a while

2000-09-13 Thread Nick Ing-Simmons
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: >I'm not committing myself (or my trusty deputy Nick) to having >it in 5.7.1, but as an incentive I now checked a skeleton for the >Encode extension into the source code repository so that it will haunt >us until we do something about it. Excellent. T

Re: Encode, take five

2000-09-13 Thread Nick Ing-Simmons
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: Nick Ing-Simmons <[EMAIL PROTECTED]> writes: >My names were meant to be used like this: > > sysread(Handle,$buffer,...); # buffer seq of bytes > my $str = utf8_to_chars(substr($buffer,$start,$len)); > # now we have string of chars and we can

Encode, my final take for a while

2000-09-13 Thread Jarkko Hietaniemi
I'm personally running out of time in churning out these API proposals (my vacation is coming up, tra-la-la). I'm personally also very much of the opinion that it's time to get *something* like this into the core. I'm not committing myself (or my trusty deputy Nick) to having it in 5.7.1, but

Re: Encode, take two

2000-09-13 Thread Mark Leisher
Jarkko> You got bored by my deluge of Encode takes and did not read the Jarkko> latest versions in where chars_to_utf8() and utf8_to_chars() have Jarkko> no encoding parameter? :-) I knew there had to be a simple answer :-)

Re: Encode, take two

2000-09-13 Thread Jarkko Hietaniemi
> Taking the view that "bytes are bytes" or "bytes are text in some encoding", > then bytes_to_utf8() and utf8_to_bytes() should take an encoding parameter. > Then chars_to_utf8() and utf8_to_chars() don't need an encoding parameter > because they simply convert between Unicode characters and UTF-

Re: Encode, take two

2000-09-13 Thread Mark Leisher
Jarkko> Assume I have a string, a bunch of bytes, that makes sense in Jarkko> Shift-JIS, as Shift-JIS characters. Now, how am I going to get it Jarkko> to Unicode? chars_to_blah() won't help since they are not yet in Jarkko> Unicode chars. So yes, I think you are right, we need t

Re: Encode, take two

2000-09-13 Thread Jarkko Hietaniemi
On Wed, Sep 13, 2000 at 09:21:21AM +0100, Nick Ing-Simmons wrote: > Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: > >>bytes_to_utf8($string, $encoding) > >>utf8_to_bytes($string, $encoding) > > > >Scratch these. Bytes are in no encoding. They are numbers. > > Yeah - but it is only a mat

Re: Encode, take three

2000-09-13 Thread Jarkko Hietaniemi
> >=head2 bytes > > > > bytes_to_utf8(STRING) > > > >The bytes in STRING are encoded in-place into UTF-8. Returns the new > >size of STRING, or undef if there's a failure. [INTERNAL] Also the > >UTF-8 flag is turned on. > > Is this a C or a perl API ? Perl. > If a perl API then converting

Re: Encode, take three

2000-09-13 Thread Nick Ing-Simmons
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: >=head1 NAME > >Encode - character encodings > >=head2 TERMINOLOGY > > byte a number in the range 0..255 > char a character in the range 0..maxint (at least 2**32-1) > >The marker [INTERNAL] marks Internal Implementation Details, in >

Re: Encode, take two

2000-09-13 Thread Nick Ing-Simmons
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: >> bytes_to_utf8($string, $encoding) >> utf8_to_bytes($string, $encoding) > >Scratch these. Bytes are in no encoding. They are numbers. Yeah - but it is only a matter of time before we want to take a bunch of Shift-JIS bytes and turn them

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-13 Thread Nick Ing-Simmons
Simon Cozens <[EMAIL PROTECTED]> writes: >On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote: >> Nick Ing-Simmons <[EMAIL PROTECTED]> writes: >> >> > My stab at names would be: >> > >> > utf8bytes_to_chars() >> > >> > chars_to_utf8bytes(); >> >> That works for me. > >That