Re: xor of unicode strings

Yitzchak Scott-Thoennes Wed, 21 Apr 2004 11:35:54 -0700

On Wed, Apr 21, 2004 at 07:37:42AM -0700, Yitzchak Scott-Thoennes <[EMAIL PROTECTED]> 
wrote:
> On Wed, 21 Apr 2004, Bernie Cosell wrote:
> 
> > I'm not sure if this is 'fun', but it might be at least curious: I don't
> > have a UTF-8 system handy to try, but I'm wondering: what happens with
> > the string-xor operator on UTF-8 strings.  It obviously cannot work byte-
> > by-byte, but it seems like it is going to be a bit tricky figuring out
> > what you get when you xor a 1 byte character with a 2-byte one [I guess
> > what it would have to do to be 'correct' would be to convert the UTF to
> > "real Unicode", XOR the 32-bit "characters" together, and then compress
> > that resulting "character" back to UTF-8...]
> 
> You got it!  What happens with ~ is somewhat more complicated...


Actually, pedantically, the "32-bit" part is wrong; perl uses its
native unsigned integer type, which may very well be 64-bit.  While
UTF-8 is technically only defined for code points of up to 21 bits,
perl switches as needed to an extension of the format that provides
up to 72 bits per character, which avoids having to do range checking
all over (though only 32 or 64 of those bits are actually usable,
depending on perl's configuration).
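
To make the "decode, XOR, re-encode" description concrete, here is a
rough sketch in Python rather than perl. It is an illustration only,
not perl's actual code. The function names are made up for this
example. The 7-byte (lead byte 0xFE) and 13-byte (lead byte 0xFF)
layouts are my assumption about how the extended format is laid out.
The 13-byte form is where 72 comes from: 12 continuation bytes of 6
bits each. Code points are kept as plain integers because Python's
own strings stop at U+10FFFF.

```python
from itertools import zip_longest


def xor_code_points(a: str, b: str) -> list:
    # XOR character by character, not byte by byte. The shorter string
    # is padded with zeros, so the longer string's trailing characters
    # pass through unchanged.
    return [ord(x) ^ ord(y) for x, y in zip_longest(a, b, fillvalue="\0")]


def encode_extended(cp: int) -> bytes:
    """Encode a non-negative integer UTF-8 style, past the 21-bit limit."""
    if cp < 0x80:
        return bytes([cp])  # plain ASCII, one byte
    # (total bytes, payload bits) for the classic 2..6 byte forms plus
    # an assumed 7-byte form whose lead byte is 0xFE.
    for nbytes, bits in ((2, 11), (3, 16), (4, 21), (5, 26), (6, 31), (7, 36)):
        if cp < (1 << bits):
            lead = (0xFF << (8 - nbytes)) & 0xFF  # nbytes leading 1 bits
            lead |= cp >> (6 * (nbytes - 1))      # top payload bits
            break
    else:
        # Assumed 13-byte form: lead byte 0xFF followed by 12
        # continuation bytes of 6 bits each (72 payload bits).
        nbytes, lead = 13, 0xFF
    tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(nbytes - 2, -1, -1)]
    return bytes([lead] + tail)


# A 1-byte character XORed with a 2-byte one: U+0041 ^ U+0100 = U+0141.
result = xor_code_points("A", "\u0100")
encoded = b"".join(encode_extended(cp) for cp in result)
```

For values within the ordinary Unicode range the encoder agrees with
standard UTF-8; the longer forms only come into play for values that
do not fit in 21 bits.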
