On Thu, Apr 23, 2015 at 05:40:34PM -0400, Dave Angel wrote:
> On 04/23/2015 05:08 PM, Mark Lawrence wrote:
> >
> >Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :)
> >
>
> As I recall, it stands for "Byte Order Mark".  Applicable only to
> multi-byte storage formats (eg. UTF-16), it lets the reader decide
> which of the formats were used.
>
> For example, a file that reads
>
>    fe ff 41 00 42 00
>
> might be a big-endian version of UTF-16
>
> while
>    ff fe 00 41 00 42
>
> might be the little-endian version of the same data.

Almost :-)

You have the string "AB", two characters. In ASCII, Latin-1, Mac-Roman,
UTF-8, and many other encodings, that is represented by two code points.
I'm going to use "U+ hex digits" as the symbol for code points, to
distinguish them from raw bytes, which won't use the U+ prefix.

    string "AB" gives code points U+41 U+42

They get written out to a single byte each, and so we get

    41 42

as the sequence of bytes (still written in hex).

In UTF-16, those two characters are represented by the same two code
points, *but* the "code unit" is two bytes rather than one:

    U+0041 U+0042

with leading zeroes included. Each code unit gets written out as a
two-byte quantity:

    On little-endian systems like Intel hardware:    4100 4200
    On big-endian systems like Motorola hardware:    0041 0042

Insert the BOM, which is always code point U+FEFF:

    On little-endian systems:    FFFE 4100 4200
    On big-endian systems:       FEFF 0041 0042

If you take that file and read it back as Latin-1, you get:

    little-endian:    ÿþA\0B\0
    big-endian:       þÿ\0A\0B

Notice the \0 nulls? Your editor might complain that the file is a
binary file, and refuse to open it, unless you tell the editor it is
UTF-16.


> The same concept was used many years ago in two places I know of.
> Binary files representing faxes had "II" or "MM" at the beginning.

Yes, TIFF files use that scheme. They start with the two bytes "II"
(little-endian, for Intel) or "MM" (big-endian, for Motorola).


-- 
Steve
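
The byte sequences described above can be checked from Python. This is
only a minimal sketch using the standard codecs, with the two-character
string "AB" (code points U+0041 U+0042) as the example; the variable
names are made up for illustration.

```python
import codecs

text = "AB"

# The explicit-endian codecs write the code units only, with no BOM.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# Prepend the BOM (U+FEFF in the matching byte order) by hand.
le_with_bom = codecs.BOM_UTF16_LE + le
be_with_bom = codecs.BOM_UTF16_BE + be


def hexdump(data):
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join("%02x" % b for b in data)


print("little-endian:", hexdump(le_with_bom))
print("big-endian:   ", hexdump(be_with_bom))

# The plain "utf-16" codec reads the BOM, picks the byte order from it,
# and drops it from the decoded text.
assert le_with_bom.decode("utf-16") == text
assert be_with_bom.decode("utf-16") == text

# Misreading the same bytes as Latin-1 shows the BOM characters and
# the embedded NUL bytes.
print(repr(le_with_bom.decode("latin-1")))
print(repr(be_with_bom.decode("latin-1")))
```

Encoding with plain "utf-16" (no -le or -be suffix) also writes a BOM,
but in the machine's native byte order, which is why the sketch builds
both variants explicitly.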