On Tue, Sep 12, 2000 at 03:59:15PM -0500, Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:
> bytes_to_utf8(STRING)
>
> The bytes in STRING are encoded in-place into UTF-8. Returns the new
> size of STRING, or undef if there's a failure. [INTERNAL] Also the
> UTF-8 flag is turned on.
Just t
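
For readers following the draft above, here is a minimal sketch of the behaviour it describes (in-place conversion, new size returned, UTF-8 flag on). It assumes a modern Perl and uses the built-in utf8::upgrade and utf8::is_utf8 as a stand-in; bytes_to_utf8() itself is only a proposal in this thread, so this is an illustration, not the proposed implementation.

```perl
use strict;
use warnings;

# Assumption: utf8::upgrade stands in for the draft bytes_to_utf8(STRING).
my $s = "caf\xE9";                 # four bytes, the last one is 0xE9

# Converts the internal representation in place and returns the number
# of octets needed to hold the string as UTF-8.
my $octets = utf8::upgrade($s);

print "octets: $octets\n";                                    # octets: 5
print "flag:   ", (utf8::is_utf8($s) ? "on" : "off"), "\n";   # flag:   on
print "chars:  ", length($s), "\n";                           # chars:  4
```

The byte 0xE9 needs two octets in UTF-8, which is why the size grows from 4 to 5 while the character count stays at 4.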
On Wed, 13 Sep 2000, Matt Sergeant wrote:
> Until someone extends the Unicode character set beyond the current range,
This has "already" happened. Have a look at
http://www.unicode.org/unicode/alloc/Pipeline.html , the Unicode
allocation pipeline of proposed new characters and scripts. It lists
On Wed, 13 Sep 2000 19:49:32 Matt Sergeant wrote:
>On Wed, 13 Sep 2000, Philip Newton wrote:
>
>Until someone extends the Unicode character set beyond the current range,
>UCS-2 and UTF-16 currently have a one to one mapping. I assume that's the
>point being made. An excerpt from the book I'm cur
On Wed, 13 Sep 2000, Philip Newton wrote:
> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
>
> > UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
>
> As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte
> chunks, since UTF-16 contains surrogates (high-surrogate + low-
Philip> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
>> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
Philip> As I understand it, that's not true -- UTF-16 is 2-byte *or*
Philip> 4-byte chunks, since UTF-16 contains surrogates (high-surrogate +
Philip> low- su
On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte
chunks, since UTF-16 contains surrogates (high-surrogate + low-
surrogate [or the other way around?] = 1 character, re
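
To make the surrogate point concrete, here is a small sketch of how a code point above U+FFFF is split into two 16-bit units in UTF-16. The helper name to_surrogates() is hypothetical and only for illustration; the arithmetic is the standard UTF-16 rule, and the high surrogate comes first.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-16 code units for one code point.
sub to_surrogates {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;        # fits in a single 16-bit unit
    my $v    = $cp - 0x10000;             # 20-bit offset into the upper planes
    my $high = 0xD800 + ($v >> 10);       # high surrogate, top 10 bits
    my $low  = 0xDC00 + ($v & 0x3FF);     # low surrogate, bottom 10 bits
    return ($high, $low);
}

my @units = to_surrogates(0x10400);
print join(' ', map { sprintf '%04X', $_ } @units), "\n";   # D801 DC00
```

A fixed-width 2-byte encoding such as UCS-2 cannot represent U+10400 at all, which is the difference under discussion.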
On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote:
> I would go for UCS-2 (UTF-16) as soon as possible as the preferred
> internal encoding.
You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates).
What's Perl's take on characters where ord($c) > 0x, anyway?
(These two issues
You must be careful or I'll appoint you to be the Encode
designer/documenter :-) As I said, it's in the sources now, and I have
no time to play with it for a while, anyone who feels interested
should hack at it.
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word w
On Wed, Sep 13, 2000 at 06:00:55PM +0200, Philip Newton wrote:
> On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote:
>
> > I would go for UCS-2 (UTF-16) as soon as possible as the preferred
> > internal encoding.
>
> You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates).
Surroga
> The 'in-place' stuff is a slight pain - these thing become chop-like.
>
> I cannot just say
>
>syswrite(Handle,chars_to_utf8(join(' ',map( { ...} ;
>
> But I guess the function style makes it hard to test for error cases.
Yes, and if errors can be of several kinds.
Another reason i
(the goal:)
I've started putting a Perl wrapper around the IBM
International Components for Unicode (ICU) library.
(see the end of this email for more details)
(about me:)
I've done internationalization (i18n) for a while but
I'm just learning ICU. I have done plenty of Perl but
I am new to Xs
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>I'm not committing myself (or my trusty deputy Nick) to having
>it in 5.7.1, but as an incentive I now checked a skeleton for the
>Encode extension into the source code repository so that it will haunt
>us until we do something about it.
Excellent. T
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
>My names were meant to be used like this:
>
> sysread(Handle,$buffer,...); # buffer seq of bytes
> my $str = utf8_to_chars(substr($buffer,$start,$len));
> # now we have string of chars and we can
I'm personally running out of time in churning out these API proposals
(my vacation is coming up, tra-la-la). I'm personally also very much
of the opinion that it's time to get *something* like this into the
core. I'm not committing myself (or my trusty deputy Nick) to having
it in 5.7.1, but
Jarkko> You got bored by my deluge of Encode takes and did not read the
Jarkko> latest versions in where chars_to_utf8() and utf8_to_chars() have
Jarkko> no encoding parameter? :-)
I knew there had to be a simple answer :-)
> Taking the view that "bytes are bytes" or "bytes are text in some encoding",
> then bytes_to_utf8() and utf8_to_bytes() should take an encoding parameter.
> Then chars_to_utf8() and utf8_to_chars() don't need an encoding parameter
> because they simply convert between Unicode characters and UTF-
Jarkko> Assume I have a string, a bunch of bytes that makes sense in
Jarkko> Shift-JIS, as Shift-JIS characters. Now, how am I going to get it
Jarkko> to Unicode? chars_to_blah() won't help since they are not yet in
Jarkko> Unicode chars. So yes, I think you are right, we need t
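
The Shift-JIS case raised here can be sketched with the Encode module as it ships with current Perls. That is an assumption relative to this thread, where the API was still being designed; decode() and encode() below are today's functions, not the names proposed above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $sjis  = "\x93\xFA\x96\x7B";          # four bytes: two kanji in Shift-JIS
my $chars = decode('shiftjis', $sjis);   # bytes in a named encoding -> characters
my $utf8  = encode('UTF-8', $chars);     # characters -> UTF-8 bytes

printf "%d chars, first is U+%04X, %d UTF-8 bytes\n",
    length($chars), ord($chars), length($utf8);
```

The step from bytes to characters needs the encoding name, while the step from characters to UTF-8 does not depend on where the characters came from. That matches the split discussed in the thread.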
On Wed, Sep 13, 2000 at 09:21:21AM +0100, Nick Ing-Simmons wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
> >>bytes_to_utf8($string, $encoding)
> >>utf8_to_bytes($string, $encoding)
> >
> >Scratch these. Bytes are in no encoding. They are numbers.
>
> Yeah - but it is only a mat
> >=head2 bytes
> >
> > bytes_to_utf8(STRING)
> >
> >The bytes in STRING are encoded in-place into UTF-8. Returns the new
> >size of STRING, or undef if there's a failure. [INTERNAL] Also the
> >UTF-8 flag is turned on.
>
> Is this a C or a perl API ?
Perl.
> If a perl API then converting
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>=head1 NAME
>
>Encode - character encodings
>
>=head2 TERMINOLOGY
>
> byte    a number in the range 0..255
> char    a character in the range 0..maxint (at least 2**32-1)
>
>The marker [INTERNAL] marks Internal Implementation Details, in
>
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>> bytes_to_utf8($string, $encoding)
>> utf8_to_bytes($string, $encoding)
>
>Scratch these. Bytes are in no encoding. They are numbers.
Yeah - but it is only a matter of time before we want
to take a bunch of Shift-JIS bytes and turn them
Simon Cozens <[EMAIL PROTECTED]> writes:
>On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote:
>> Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
>>
>> > My stab at names would be:
>> >
>> > utf8bytes_to_chars()
>> >
>> > chars_to_utf8bytes();
>>
>> That works for me.
>
>That