I have just finished my first attempts to use Unicode with Perl, and I can't say that I am wholly confident that Perl has Got It Alright.
Context: ActiveState Perl AS-809, Windows XP.

One issue I had was this. I had this script:

   use strict;
   use MSSQL::OlleDB;
   $| = 1;
   my $i = 0;
   foreach (1..2) {
      my $db = 'räksmörgås';
      print "Len " . length($db) . " Str: $db\n";
      my $X = MSSQL::OlleDB->connect(undef, undef, undef, $db);
      $i++;
      print "$i\n" if $i % 50 == 0;
   }

What I wanted to verify was that the XS module I'm writing handles Unicode correctly. At first I had saved the file as ANSI, and I got the expected result. I then saved the file as UTF-8, but to my surprise, the length of the string was now 13, and Perl did not consider the string to be UTF-8. I found a solution in perluniintro by using a decode function, and browsing this newsgroup/mailing list I found the cause of the problem: Perl does not care about byte-order marks.

I found a thread from 1999 where Larry expressed the opinion that BOMs are an abomination. From a principled point of view, I can agree. Then again, BOMs are not the first example of using in-band data for meta-purposes. Ever heard of string-terminating nulls? And more importantly, on the Windows platform, the byte-order mark is what you use to tell what encoding a file is in. At least it seems so to me as a casual user.

Using a pragma like "use utf8" to determine the encoding of character literals is not a good idea. Suddenly someone saves the file in a different encoding, and guess what happens. And as long as Perl does not act on byte-order marks, how would it be able to read a script that has been saved in UTF-16LE, which is the normal way of saving Unicode data on Windows?

Of course, deducing the encoding of literals from the file format is also questionable. In my prime working language, SQL, the syntax is to quote Unicode literals with N''. And C++ appears to have L"". That would be a bit ugly in Perl.

I mentioned the principle of least surprise, because having read perluniintro my impression was that I should not really have to care which format the string was saved in. I was proven wrong fairly directly.
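For the record, the workaround from perluniintro amounts to decoding the bytes explicitly with the core Encode module, rather than relying on how the source file happened to be saved. A sketch (the literal byte string below stands in for what Perl sees when the script is saved as UTF-8 without "use utf8"):

```perl
use strict;
use Encode qw(decode);

# The 13 bytes that the 10-character string 'räksmörgås' becomes
# when the script file is saved as UTF-8 (ä, ö, å take two bytes each).
my $bytes = "r\x{c3}\x{a4}ksm\x{c3}\x{b6}rg\x{c3}\x{a5}s";
print "Len before decode: ", length($bytes), "\n";   # 13

# Decode the raw bytes into a Perl character string.
my $db = decode('UTF-8', $bytes);
print "Len after decode:  ", length($db), "\n";      # 10
```

With the decoded string, length() counts characters again, and the string can be passed on to an XS module that expects Unicode.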
Certainly, if the idea is that Perl is going to continue to ignore BOMs, this needs a mention in perluniintro.

Another example where this causes problems is when reading a file that has been saved on Windows. I did this:

   use strict;
   use utf8;

   open (F, '<:utf8', 'räkmacka-utf8.txt');
   open (G, '>:utf8', 'test.txt');
   while (<F>) {
      print G length($_), $_, "\n";
   }
   close F;

   open (F, '<:encoding(ucs-2le)', 'räkmacka-ucs2.txt');
   while (<F>) {
      print G length($_), $_, "\n";
   }
   close F;

   print G "\x{03A1}\x{03B5}\x{03BA}\n";
   close G;

I find that the BOM in the original files appears in the output, after the length. Logical from the point of view that BOMs are ignored, but this is a funny place for a zero-width no-break space, and other tools on Windows will misinterpret the file.

And one thing seems just plain wrong to me: the "\n" is written as 0A 0D to the file, not 000A, 000D. But maybe there is some more manual reading I need to do to find out how to do that.

It may be that generating and reacting on BOMs is a bad idea on other operating systems. But it certainly should be considered for Windows. Without it, it is difficult to say that Perl handles Unicode well.

--
Erland Sommarskog, Stockholm, [EMAIL PROTECTED]
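PS: After some more manual reading, the incantation below is what I believe is supposed to address both complaints. It is a sketch (file names made up): ":raw" removes the default byte-level CRLF layer, ":encoding(UTF-16LE)" does the transcoding, and pushing ":crlf" on top makes the newline translation happen on characters before encoding, so "\n" comes out as 0D 00 0A 00 in the file. The BOM still has to be emitted and stripped by hand:

```perl
use strict;

# Write a UTF-16LE file with Windows-style line endings.
# Layer order matters: :crlf sits above :encoding, so \n -> \r\n
# happens in characters, and each of \r and \n is then encoded
# as a two-byte UTF-16LE code unit.
open(my $out, '>:raw:encoding(UTF-16LE):crlf', 'test-ucs2.txt')
    or die "open for write: $!";
print $out "\x{FEFF}";                   # emit the BOM explicitly
print $out "\x{03A1}\x{03B5}\x{03BA}\n"; # Greek Rho, epsilon, kappa
close $out;

# Read it back with the same layer stack.
open(my $in, '<:raw:encoding(UTF-16LE):crlf', 'test-ucs2.txt')
    or die "open for read: $!";
my $line = <$in>;
$line =~ s/^\x{FEFF}//;                  # Perl does not strip the BOM, so we do
close $in;
```

Alternatively, ":encoding(UTF-16)" without a byte-order suffix is documented to look at the BOM itself when reading, which would spare the manual stripping; I have not tested that on anything but little-endian files.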