Philip> Yes, but if you just have a high surrogate, you can't do much with
Philip> it -- it doesn't represent a Unicode character but only half of
Philip> one. So you need a high surrogate plus a low surrogate to display
Philip> a character beyond U+, leading to a 32-bit repre
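
To make the high-plus-low-surrogate point concrete, here is a minimal sketch of the pairing arithmetic defined by UTF-16. It is written in Python purely for illustration; the function names to_surrogates and from_surrogates are hypothetical and are not part of Encode or any proposal in this thread.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogates(0x10400)           # (0xD801, 0xDC00)
roundtrip = from_surrogates(*pair)      # 0x10400
```

A lone high surrogate fails the range check in from_surrogates, which is the "only half of a character" situation described above.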
On 13 Sep 2000, at 11:57, Mark Leisher wrote:
> True, UTF-16 is not known as UCS-2. However, UTF-16 still consists
> of 2-byte chunks. It is essentially UCS-2 plus high and low
> surrogates (see the Unicode Standard 3.0 page 19).
Yes, but if you just have a high surrogate, you can't do much w
On Wed, 13 Sep 2000, Matt Sergeant wrote:
> Until someone extends the Unicode character set beyond the current range,
This has "already" happened. Have a look at
http://www.unicode.org/unicode/alloc/Pipeline.html , the Unicode
allocation pipeline of proposed new characters and scripts. It lists
On Wed, 13 Sep 2000 19:49:32 Matt Sergeant wrote:
>On Wed, 13 Sep 2000, Philip Newton wrote:
>
>Until someone extends the Unicode character set beyond the current range,
>UCS-2 and UTF-16 currently have a one-to-one mapping. I assume that's the
>point being made. An excerpt from the book I'm cur
On Wed, 13 Sep 2000, Philip Newton wrote:
> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
>
> > UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
>
> As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte
> chunks, since UTF-16 contains surrogates (high-surrogate + low-
Philip> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
>> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
Philip> As I understand it, that's not true -- UTF-16 is 2-byte *or*
Philip> 4-byte chunks, since UTF-16 contains surrogates (high-surrogate +
Philip> low- su
On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte
chunks, since UTF-16 contains surrogates (high-surrogate + low-
surrogate [or the other way around?] = 1 character, re
You must be careful or I'll appoint you to be the Encode
designer/documenter :-) As I said, it's in the sources now, and I have
no time to play with it for a while, anyone who feels interested
should hack at it.
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word w
> The 'in-place' stuff is a slight pain - these thing become chop-like.
>
> I cannot just say
>
>syswrite(Handle,chars_to_utf8(join(' ',map( { ...} ;
>
> But I guess the function style makes it hard to test for error cases.
Yes, and if errors can be of several kinds.
Another reason i
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
>My names were meant to be used like this:
>
> sysread(Handle,$buffer,...); # buffer seq of bytes
> my $str = utf8_to_chars(substr($buffer,$start,$len));
> # now we have string of chars and we can
I'm much tempted to go with these. Yes, it's ISO 8859-1-biased.
=item *
bytes_to_utf8(STRING)
The bytes in STRING are encoded in-place into UTF-8. The bytes
are expected to be encoded in US-ASCII or ISO 8859-1 (Latin 1).
Returns the new size of STRING, or C<undef> if there's a failure.
[INT
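
As a rough illustration of the in-place, ISO 8859-1-biased behaviour described in this item, here is a sketch in Python rather than Perl. It only mimics the documented contract (recode in place, return the new size, return a null value on failure); the use of a bytearray and the choice of what counts as a failure are assumptions of the sketch, not the Encode implementation.

```python
def bytes_to_utf8(buf):
    """Re-encode ISO 8859-1 bytes held in a bytearray as UTF-8, in place.

    Returns the new length, or None on failure (assumption: the only
    failure modelled here is being handed something other than a bytearray).
    """
    if not isinstance(buf, bytearray):
        return None
    # Every byte value 0..255 is a valid Latin-1 code point,
    # so this decode step cannot fail.
    text = buf.decode("latin-1")
    buf[:] = text.encode("utf-8")       # overwrite the caller's buffer
    return len(buf)


data = bytearray(b"caf\xe9")            # "café" in Latin-1, 4 bytes
size = bytes_to_utf8(data)              # 5: the e-acute becomes two bytes
```

The chop-like feel mentioned elsewhere in the thread shows up here too: the useful result is the side effect on data, and the return value is only a size or an error signal.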
> > =head2 Handling Malformed Data
>
> What exactly is malformed UTF-8 data here?
>
> Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.
>
> Does it also cover overlong UTF-8 sequences, i.e. any string
> containing any of the five bit sequences
>
> 110x,
> 111000
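
For anyone who wants to poke at the overlong question, here is a small Python sketch. It relies on Python 3's strict UTF-8 decoder, which rejects overlong forms; that is a property of Python's codec and says nothing about what the draft Encode code does. The helper name is_well_formed_utf8 is made up for the example.

```python
def is_well_formed_utf8(data):
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")            # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


shortest_slash = b"\x2f"                # "/" (U+002F), shortest form
overlong_slash_2 = b"\xc0\xaf"          # same code point padded to 2 bytes
overlong_slash_3 = b"\xe0\x80\xaf"      # same code point padded to 3 bytes

results = [is_well_formed_utf8(b) for b in
           (shortest_slash, overlong_slash_2, overlong_slash_3)]
# results == [True, False, False]
```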
On Wed, Sep 13, 2000 at 01:33:33AM +0100, Markus Kuhn wrote:
> Jarkko Hietaniemi wrote on 2000-09-12 23:42 UTC:
> > '7' UTF-7
> > '8' UTF-8
> > '16be' UTF-16 big-endian
> > '16le' UTF-16 little-endian
> > '16ne'
> bytes_to_utf8(STRING [, CHECK])
>
> The bytes in STRING are encoded in-place into UTF-8. The bytes are
> assumed to be encoded in US-ASCII, bytes between 0 and 127, inclusive.
> Returns the new size of STRING, or C<undef> if there's a failure.
Okay, now it's time for me to slow down. Now I'
Jarkko Hietaniemi wrote on 2000-09-12 23:42 UTC:
> '7' UTF-7
> '8' UTF-8
> '16be' UTF-16 big-endian
> '16le' UTF-16 little-endian
> '16ne' UTF-16 native-endian
> '32be' UTF-32 big-endian
>
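
To show what the be/le/ne distinction in that list means at the byte level, here is a short Python sketch. The short tags ('16be' and so on) are the ones quoted above; the mapping from them to Python codec names is my own assumption for the example, not something from the proposal.

```python
import sys

# Hypothetical mapping from the quoted short tags to Python codec names.
CODECS = {
    "8": "utf-8",
    "16be": "utf-16-be",
    "16le": "utf-16-le",
    "32be": "utf-32-be",
}
# "Native-endian" depends on the machine running the code.
CODECS["16ne"] = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

euro = "\u20ac"                          # EURO SIGN, U+20AC
encoded = {tag: euro.encode(codec) for tag, codec in CODECS.items()}

# encoded["16be"] == b"\x20\xac"
# encoded["16le"] == b"\xac\x20"
# encoded["32be"] == b"\x00\x00\x20\xac"
```

The same code point yields the same two octets in swapped order for 16be and 16le, which is why the tag (or a byte order mark) has to travel with the data.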
I tried to explore the boundary cases and error conditions now thoroughly.
A new feature is customizable error handling. Note also the
s/strict/check/g.
=pod
=head1 NAME
Encode - character encodings
=head2 TERMINOLOGY
=over
=item *
I: a B in the range 0..255
=item *
I: a B in the range 0