On Friday, April 12, 2002, at 10:38 , George W Gerrity wrote:
> To expand on this, imagine there is a text file in some encoding on 
> some medium created by a little-endian machine (say a DEC Vax or a 
> Macintosh 68000), and it is to be accessed on a big-endian machine (any 
> Intel 8080 -- Pentium architecture).

   I KNOW someone will pick this nit sooner or later, but since I recently 
implemented all UTF-(16|32)(LE|BE)? for Encode, the perl module that does 
all the Unicode-(from|to)-X transcoding (it comes with every Perl 5.8 and 
is also the biggest in size thanks to CJK Unification :), it would be 
appropriate for me to do so.
   Your endianness for MC 680x0 and IA-32 is upside down.  68k is in 
network byte order (big-endian) and IA-32 is in VAX byte order 
(little-endian).  Here is a C one-liner that tells the endianness.  It 
prints 4 for BE and 1 for LE.

   #include <stdio.h>
   int main(void){ int e=0x04030201; printf("%d\n", *((char *)&e)); return 0; }

And of course, as a perl one-liner.

   perl -e 'print pack("C", unpack("L", "1234")), "\n"'

> I acknowledge that the BOM _can_ be used to differentiate between 
> various encodings -- UTF-8, UTF-16, UTF-32, non-Unicode -- but then, 
> that has _nothing_ to do with byte order. Perhaps it should be renamed?

It definitely has A LITTLE to do with it --  if the BOM is the opposite of 
the endianness of your computer, flip the bytes before doing anything 
further.  It does not say anything about the endianness of the machine 
where the data originated, however, because any computer can choose to 
prepend either version of the BOM.
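
To make the "flip the bytes" step concrete, here is a minimal C sketch.  It 
assumes the UTF-16 data has already been read into an array of 16-bit units 
in host byte order; the function name utf16_fix_byte_order is made up for 
this example and is not part of Encode or any other library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Treats units[0] as a possible BOM:
 *   0xFEFF -> data is already in host byte order, nothing to do (returns 0)
 *   0xFFFE -> BOM is byte-swapped, so swap every unit in place (returns 1)
 *   other  -> no BOM found, buffer is left untouched (returns -1)
 */
int utf16_fix_byte_order(uint16_t *units, size_t n)
{
    assert(units != NULL || n == 0);
    if (n == 0)
        return -1;
    if (units[0] == 0xFEFF)
        return 0;
    if (units[0] != 0xFFFE)
        return -1;
    for (size_t i = 0; i < n; i++)
        units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
    return 1;
}
```

The check works on any host because U+FFFE is a noncharacter, so a leading 
0xFFFE can only be a BOM seen through the wrong end.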

FYI, the Encode module uses a BE BOM when it encodes (from perl's native 
UTF-8) to UTF-16 or UTF-32 with no endianness specified, even on my FreeBSD 
box, which is LE.  So far there is no option to make BOMmed UTF-16 or 
UTF-32 with a little-endian BOM.  This decision has made the code much 
simpler and should not affect usability a bit.   But if you don't like 
it, this weekend is the last chance to say so because code freeze is 
coming!
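
For the curious, a rough C sketch of why always writing a BE BOM keeps 
things simple: if each 16-bit unit is written out high byte first, the 
output is the same on every host and no endianness test is needed.  This 
is NOT how Encode is actually implemented; the function names are made up 
for this example, and the input is assumed to be valid Unicode scalar 
values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store one 16-bit unit high byte first. */
size_t put_u16be(unsigned char *out, uint16_t unit)
{
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
    return 2;
}

/* Hypothetical helper: write a BE BOM followed by the code points as
 * UTF-16BE.  The caller must supply at least 2 + 4*n bytes in out.
 * Assumes every code point is a valid scalar value (no checking).
 * Returns the number of bytes written. */
size_t utf16_encode_with_be_bom(const uint32_t *cps, size_t n,
                                unsigned char *out)
{
    assert(out != NULL);
    size_t len = put_u16be(out, 0xFEFF);           /* BOM, always BE */
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        if (cp >= 0x10000) {                       /* needs a surrogate pair */
            cp -= 0x10000;
            len += put_u16be(out + len, (uint16_t)(0xD800 | (cp >> 10)));
            len += put_u16be(out + len, (uint16_t)(0xDC00 | (cp & 0x3FF)));
        } else {
            len += put_u16be(out + len, (uint16_t)cp);
        }
    }
    return len;
}
```

The shifts operate on values, not on memory layout, so the same bytes come 
out on a 68k and on an IA-32 box.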

Dan the BOMeed Man

