I have just finished my first attempts to use Unicode with Perl, and I 
can't say that I am wholly confident that Perl has Got It Alright.

Context: ActiveState Perl AS-809, Windows XP.

One issue I ran into was this. I had the following script:

   use strict;
   
   use MSSQL::OlleDB;
   $| = 1;
   my $i = 0;
   foreach (1..2) {
      my $db = 'räksmörgås'; 
      print "Len " . length($db) . " Str: $db\n";
      my $X = MSSQL::OlleDB->connect(undef, undef, undef, $db);
      $i++;
      print "$i\n" if $i % 50 == 0;
   }
   
What I wanted to verify was that the XS module I'm writing handles
Unicode correctly. At first I had saved the file as ANSI, and I got the
expected result. I then saved the file as UTF-8, but to my surprise, the
length of the string was now 13, and Perl did not consider the string to
be UTF-8.

I found a solution in perluniintro by using a decode function, and browsing
this newsgroup/mailing list I found the cause of the problem: Perl does
not care about byte-order marks. I found a thread from 1999 where Larry
expressed the opinion that BOMs are an abomination. From a principled point
of view, I can agree. Then again, BOMs are not the first example of using
in-band data for meta-purposes. Ever heard of string-terminating nulls?
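For reference, the workaround I ended up with looks roughly like this
(a sketch using the core Encode module; the explicit BOM-stripping is my
own addition, since Perl leaves the BOM in the decoded string):

   use strict;
   use warnings;
   use Encode qw(decode);

   # The raw bytes of a short UTF-8 file saved by a Windows editor:
   # the BOM (EF BB BF) followed by "rak" with an a-umlaut.
   my $bytes = "\xEF\xBB\xBF" . "r\xC3\xA4k";

   my $str = decode('UTF-8', $bytes);  # bytes -> characters
   $str =~ s/^\x{FEFF}//;              # strip the BOM by hand

   print length($str), "\n";           # 3 characters, not 7 bytes

After the decode, length() counts characters instead of bytes, which is
what I expected to get in the first place.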

And more importantly, on the Windows platform, byte-order marks are what
you use to tell what encoding a file is in. At least it seems so to
me as a casual user.
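In fact, sniffing the BOM is simple enough to do by hand. Here is a
hypothetical helper (my own sketch, not anything Perl provides) that
guesses an encoding from the first bytes of a file, roughly the way
Notepad does:

   use strict;
   use warnings;

   # Hypothetical helper: map a leading BOM to an encoding name.
   # Returns undef when there is no BOM (ANSI, or anything really).
   # UTF-32 is ignored for brevity.
   sub bom_encoding {
      my ($path) = @_;
      open my $fh, '<:raw', $path or die "open $path: $!";
      read($fh, my $head, 3);
      close $fh;
      return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
      return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
      return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
      return undef;
   }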

Using a pragma like utf8 to determine the encoding of character literals
is not a good idea. Suddenly someone saves the file in a different
encoding, and guess what happens. And as long as Perl does not act
on byte-order marks, how would it be able to read a script that has
been saved in UTF-16LE, which is the normal way of saving Unicode data
on Windows?
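To be concrete about what the pragma does: use utf8 only declares that
the source file is encoded in UTF-8, so the literal below has length 10
only as long as the file really is saved that way. Save the same file as
Latin-1 and the very same line yields either mojibake or a malformed-UTF-8
error. (A minimal sketch; it assumes the script file itself is saved as
UTF-8.)

   use strict;
   use warnings;
   use utf8;   # promise: this source file is UTF-8 encoded

   my $s = 'räksmörgås';
   print length($s), "\n";   # 10 -- but only if the promise holds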

Of course, deducing the encoding of literals from the file format is also
questionable. In my primary working language, SQL, the syntax is to quote
Unicode literals with N''. And C++ appears to have L"". That would be a
bit ugly in Perl.

I said "the principle of least surprise", because having read perluniintro
my impression was that I should not really have to care which format the
string was in. I was proven wrong fairly directly. Certainly, if the
idea is that Perl is going to continue to ignore BOMs, this needs a mention
in perluniintro.

Another example where this causes problems is when reading a file that
has been saved on Windows. I did this:

   use strict;
   use utf8;
   open (F, '<:utf8', 'räkmacka-utf8.txt') or die "open: $!";
   open (G, '>:utf8', 'test.txt') or die "open: $!";
   while (<F>) {
      print G length($_), $_, "\n";
   }
   close F;
   
   open (F, '<:encoding(ucs-2le)', 'räkmacka-ucs2.txt') or die "open: $!";
   while (<F>) {
      print G length($_), $_, "\n";
   }
   close F;
   
   print G "\x{03A1}\x{03B5}\x{03BA}\n";
   
   close G;

I find that the BOM in the original files appears in the output, after
the length. Logical from the point of view that BOMs are ignored, but this
is a funny place for a zero-width no-break space. And other tools on
Windows will misinterpret the file.

And one thing seems just plain wrong to me: the "\n" is written as
0A 0D to the file, not 000A, 000D. But maybe there is some more manual
reading I need to do to find out how to do it.
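For what it's worth, the way I have since understood it from the PerlIO
documentation is that UTF-16LE output has to be requested explicitly as
an encoding layer, and the BOM printed by hand; something like this
sketch:

   use strict;
   use warnings;

   # Layers apply right to left on output: :crlf turns "\n" into "\r\n"
   # as characters, then :encoding(UTF-16LE) writes each as two bytes.
   open my $out, '>:raw:encoding(UTF-16LE):crlf', 'test-utf16.txt'
      or die "open: $!";
   print $out "\x{FEFF}";       # the BOM, written by hand
   print $out "r\x{E4}k\n";     # "\n" comes out as 0D 00 0A 00
   close $out;

With that layer stack the file starts with FF FE and every character,
including the line ending, is two bytes wide, which is what Windows tools
expect.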

It may be that generating and reacting on BOMs is a bad idea on other
operating systems. But it certainly should be considered for Windows.
Without it, it is difficult to say that Perl handles Unicode well.

-- 
Erland Sommarskog, Stockholm, [EMAIL PROTECTED]
