Re: Bit ops on strings

Aaron Sherman Sat, 01 May 2004 16:54:35 -0700

On Sat, 2004-05-01 at 15:09, Jarkko Hietaniemi wrote:
> > How are you defining "valid UTF-8"? Is there a codepoint in UTF-8
> > between \x00 and \xff that isn't valid? Is there a reason to ever do
> 
> Like, half of them?  \x80 .. \xff are all invalid as UTF-8.


Heh, damn Ken Thompson and his placemat!

I am too new to UCS and UTF-8, and had thought it was always 8-bit. I
stand corrected, having read up on the UTF-8 and Unicode FAQ.
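
A quick way to check Jarkko's point, as a minimal sketch in Python (the helper name is made up for illustration): treat each byte value as a complete one-byte string and ask a strict UTF-8 decoder whether it is valid.

```python
def invalid_single_bytes():
    """Return every byte value that is not valid UTF-8 when it stands alone."""
    bad = []
    for b in range(256):
        try:
            # Strict decoding: raises on lone continuation bytes,
            # truncated multi-byte sequences, and bytes never used in UTF-8.
            bytes([b]).decode("utf-8")
        except UnicodeDecodeError:
            bad.append(b)
    return bad

bad = invalid_single_bytes()
print(hex(bad[0]), hex(bad[-1]), len(bad))  # 0x80 0xff 128
```

So exactly the upper half, \x80 .. \xff, is rejected as a one-byte UTF-8 string, while \x00 .. \x7f is plain ASCII.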

Jeff, yeah, I have to take back my statement. If Perl defaults to UTF-8,
then it's not valid to assume that a bit-op on a UTF-8 input string
won't throw an exception. I still think that's OK, and better than
expanding to the larger representation and doing the bit-op in that,
since that would mean bit-vectors have to be valid in
enum_stringrep_one, _two and _four as sort of alternate data
structures. I don't think we want to go there.
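
To make the "throw an exception" behavior concrete, here is a rough sketch in Python rather than Parrot. The function name and the choice of ValueError are assumptions for illustration only, not anything Parrot defines: the bit-op refuses any string containing a code point that does not fit in 8 bits, instead of widening the representation.

```python
def strict_bitwise_not(s: str) -> str:
    """Bitwise-not a string, refusing anything that is not 8-bit data.

    Hypothetical helper: models "bit-ops only on eight-bit data" by raising
    instead of expanding to a two- or four-byte representation.
    """
    for ch in s:
        if ord(ch) > 0xFF:
            raise ValueError(
                f"code point U+{ord(ch):04X} does not fit in 8 bits"
            )
    # Every code point fits in one byte, so invert within 8 bits.
    return "".join(chr(~ord(ch) & 0xFF) for ch in s)

print(repr(strict_bitwise_not("A")))  # '\xbe'
```

Passing a string such as "\u0100" raises rather than silently switching to a wider representation.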

For everything else, as Jeff correctly points out, this has nothing to
do with encoding. It matters only in the sense that the default encoding
in a language (Perl 6 is only one example) dictates which representation
you should expect to be the common case.

> > bitwise operations on anything other than 8-bit codepoints?
> 
> I am very confused.  THIS IS WHAT WE ALL SEEM TO BE SAYING.  BITOPS ONLY
> ON EIGHT-BIT DATA.  AM I WRONG?

No, you're not, and could you please not get emotional about this? It's
what you, Dan and I have been saying, but I was responding to Jeff, who
said:

        "Just FYI, the way I implemented bitwise-not so far, was to
        bitwise-not code points 0x{00}-0x{FF} as uint8-sized things,
        0x{100}-0x{FFFF} as uint16-sized things, and > 0x{FFFF} as
        uint32-sized things (but then bit-masking them with 0xFFFFF to
        make sure that they fell into a valid code point range)."

I thought it was kind of important to deal with the fact that I was
proposing a very different behavior for bit-shifting than currently
exists for the boolean operations.
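
For reference, Jeff's description quoted above can be sketched roughly as follows. This is Python, not his actual Parrot code; the function name is invented, and the 0xFFFFF mask is copied from his wording as-is.

```python
def bitwise_not_codepoint(cp: int) -> int:
    """Invert one code point at the width of the smallest unit that holds it."""
    if cp <= 0xFF:
        return ~cp & 0xFF            # treated as a uint8
    if cp <= 0xFFFF:
        return ~cp & 0xFFFF          # treated as a uint16
    # Treated as a uint32, then masked as in the quoted description.
    return (~cp & 0xFFFFFFFF) & 0xFFFFF

for cp in (0x41, 0x100, 0x10000):
    print(hex(cp), "->", hex(bitwise_not_codepoint(cp)))
# 0x41 -> 0xbe
# 0x100 -> 0xfeff
# 0x10000 -> 0xeffff
```

Note that the width, and so the result, depends on the value of each code point, which is exactly the per-representation behavior that differs from an 8-bit-only rule.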

The question becomes: should I CHANGE the existing bit-ops so that they
don't work on two- or four-byte representations, for symmetry?

If this continues to be so contentious, I'm tempted to agree with the
nay-sayers and say that Parrot shouldn't do bit-vectors on strings, and
we should just implement a bit-vector class later on. Perl will just
have to suffer the overhead of translation. This just IS NOT important
enough to waste this many brain cells on.

-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback
