I have a program that reads and writes (among other things) strings that should be utf8 encoded. I say "should", because somewhere deep inside the dark corners of that program, sometimes, the utf8 flag on a string is lost. (I'm still investigating where; tips on how to attack such a problem are welcome.)
Even when you try to set the UTF-8 flag on strings which consist entirely of ASCII ( /^[\x00-\x7f]*$/ ), the UTF-8 flag will not be on. See "The UTF-8 flag" section of 'perldoc Encode'. Here is a short summary.
perldoc Encode
o When you decode, the resulting utf8 flag is on unless you can
unambiguously represent data. Here is the definition of dis-ambiguity.

After "$utf8 = decode('foo', $octet);",

When $octet is...   The utf8 flag in $utf8 is
---------------------------------------------
In ASCII only (or EBCDIC only)            OFF
In ISO-8859-1                              ON
In any other Encoding                      ON
---------------------------------------------

As you see, there is one exception, In ASCII. That way you can assume
Goal #1. And with Encode Goal #2 is assumed but you still have to be
careful in such cases mentioned in CAVEAT paragraphs.
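
To see what the table means for your own installation, a small check along these lines may help. It is only a sketch: the sample octets and variable names are made up for illustration, and it prints the flag for the ASCII-only case rather than assuming a value, since what is_utf8() reports there may depend on the Encode version you have.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical sample octets, as they might arrive from a file or socket.
my $ascii_octets = "hello";                      # ASCII only
my $utf8_octets  = "\xe5\xb0\x8f\xe9\xa3\xbc";   # UTF-8 bytes for U+5C0F U+98FC

my $ascii_str = decode('UTF-8', $ascii_octets);
my $wide_str  = decode('UTF-8', $utf8_octets);

# Report the internal flag instead of assuming it. Only the second string
# must carry it, because its characters are above U+00FF.
printf "ASCII only: flag=%s length=%d\n",
    (is_utf8($ascii_str) ? 'on' : 'off'), length($ascii_str);
printf "non-ASCII : flag=%s length=%d\n",
    (is_utf8($wide_str) ? 'on' : 'off'), length($wide_str);

# Whatever the flag says, the ASCII string still compares equal to its octets.
print "same text\n" if $ascii_str eq $ascii_octets;
```

The point of the last line is that for pure ASCII the flag makes no difference to string comparison, which is why it is safe to ignore it there.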
When writing the string, the program clears the utf8 flag and writes a simple string of octets using:
$s = encode("utf8", $s) if $s =~ /[^\x00-\x7f]/;
$n = length($s); # yes, we need length in bytes
...
print $s;
If what you need is byte length, you can simply "use bytes" as follows. binmode is for print().
use bytes (); # avoid imports
binmode STDOUT => ":utf8";
my $s = "\x{5c0f}\x{98fc} \x{5f3e}";
# ...
my $n = length($s);
my $l = bytes::length($s);
# ...
print $s;
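
If you would rather not depend on the bytes pragma (as far as I know, its documentation in later Perl releases discourages it for anything but debugging), one alternative is to encode explicitly and measure the result. This is a sketch of that idea using the same sample string; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "\x{5c0f}\x{98fc} \x{5f3e}";   # three CJK characters plus one space

my $chars  = length($s);               # counts characters
my $octets = encode('UTF-8', $s);      # explicit byte string
my $bytes  = length($octets);          # counts bytes, no bytes pragma needed

print "characters: $chars, bytes: $bytes\n";   # characters: 4, bytes: 10
```

Printing $octets to a handle without an encoding layer then writes exactly $bytes bytes, so the length you report and the data you send stay in step.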
Why would someone test for pure 7-bit strings instead of:
$s = encode("utf8", $s) if Encode::is_utf8($s);
For most cases you don't have to and you should not have to (unless you maintain Encode and/or perl :). Complex as it may be, the internal UTF-8 flag was the best way to harness UTF-8 while keeping legacy, byte-oriented scripts compatible.
which seems superior to avoid double utf8 encodings, should the utf8-flag be lost. And it's faster.
Or even simply: Encode::_utf8_off($s)
The problem is that I'm usually wrong. Am I this time? Am I missing something? Or do I need more coffee?
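
One case where the two tests above give different answers is a string holding characters in the \x80-\xff range with the flag off. The following is a minimal sketch of that case, assuming a made-up Latin-1 sample string; it is not taken from the program in question.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# Hypothetical sample: "cafe" with an e-acute (U+00E9), stored with the flag off.
my $s = "caf\xe9";

# Test 1 (from the original program): any character outside 7-bit ASCII?
my $by_regex = ($s =~ /[^\x00-\x7f]/) ? encode('UTF-8', $s) : $s;

# Test 2 (the proposed shortcut): is the internal flag set?
my $by_flag = is_utf8($s) ? encode('UTF-8', $s) : $s;

printf "regex test: %d bytes\n", length($by_regex);   # e-acute became two bytes
printf "flag test : %d bytes\n", length($by_flag);    # e-acute left as a lone \xe9

# The same text with the flag switched on is still the same string to Perl.
my $upgraded = $s;
utf8::upgrade($upgraded);
print "equal strings\n" if $upgraded eq $s;
```

Here the flag test writes a lone \xe9, which is not valid UTF-8, while the regex test writes the two-byte sequence. Since $upgraded and $s compare equal but differ in their flag, the flag alone does not tell you whether a string still needs encoding.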
I have to admit that Encode and the Perl 5.8 way of handling Unicode need more recipes (Perl Cookbook 2nd Ed. does cover that issue in Ch. 8, but it was hardly enough).
Dan the Encode Maintainer