Re: Variation In Decoding Between Encode and XML::LibXML
ice. The first time, though, I get a "wide character in print" warning.

That warning arises because Perl's STDOUT is set to Latin-1 by default. It
wants to "downgrade" the UTF8 scalar to Latin-1, but it can't do so without
loss, so it warns and outputs the bytes as is. After we change STDOUT to
'utf8', the warning goes away.

The utf8::upgrade() call has no effect, because the scalar starts off as
UTF8.

Prior to the introduction of the UTF8 flag, there was no way to put the code
point \x{010d} into a Perl string because Latin-1 can't represent it. For
backwards compatibility reasons, \x escapes at or below 255 have to be
represented as Latin-1. Since you asked for \x{010d}, though, Perl knows that
the backwards compat rules don't apply and it can use a UTF8 scalar.

HTH,

Marvin Humphrey
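The two points above (which \x escapes produce a UTF8 scalar, and when the "wide character" warning fires) can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and it prints to in-memory filehandles rather than the real STDOUT so that the result does not depend on how the terminal is configured.

```perl
use strict;
use warnings;

# A code point above 0xFF cannot be stored as Latin-1, so Perl uses its
# internal UTF-8 representation (UTF8 flag on) from the start.
my $wide = "\x{010d}";

# A code point at or below 0xFF is stored as a single Latin-1 byte.
my $narrow = "\x{c4}";

my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Handle without an encoding layer: Perl cannot downgrade the string,
# so it warns and writes the internal bytes as they are.
my $raw_out = '';
open my $raw_fh, '>', \$raw_out or die "open failed: $!";
print {$raw_fh} $wide;
close $raw_fh;
my $warned_without_layer = scalar @warnings;

# Handle with an explicit UTF-8 layer: no warning.
@warnings = ();
my $utf8_out = '';
open my $utf8_fh, '>:encoding(UTF-8)', \$utf8_out or die "open failed: $!";
print {$utf8_fh} $wide;
close $utf8_fh;
my $warned_with_layer = scalar @warnings;
```

For the real STDOUT, the equivalent of the second handle is a call such as binmode(STDOUT, ':encoding(UTF-8)') before printing.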
Re: Variation In Decoding Between Encode and XML::LibXML
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:

> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is
> that crap?

That's octal notation, which I think Dump() uses for any byte greater than
127 and for control characters, so that it can output pure ASCII. That
sequence is only four bytes:

    mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
    SV = PV(0x801038) at 0x80e880
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
      CUR = 4  <--- four bytes
      LEN = 8
    mar...@smokey:~ $

The logical content of the string follows in the second quote:

> [UTF8 "Tomas Laurinavi\x{c4}\x{8d}ius"]

That's valid UTF-8.

> my $str = 'Tomas Laurinavičius';

In source code, I try to stick to pure ASCII and use \x escapes -- like
Dump() does.

    my $str = "Tomas Laurinavi\x{c4}\x{8d}ius";

However, because those code points are both representable as Latin-1, Perl
will create a Latin-1 string. If you want to force its internal encoding to
UTF-8, you need to do additional work.

    mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
    SV = PV(0x801038) at 0x80e870
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x2012e0 "\304"\0
      CUR = 1
      LEN = 4
    SV = PV(0x801038) at 0x80e870
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
      CUR = 2
      LEN = 3
    mar...@smokey:~ $

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand. Perl tries to hide
the details, but there are too many ways for it to fail silently. ("perl -C",
$YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey
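The second Dump() pair above shows utf8::upgrade() changing the internal bytes while the logical string stays the same. A minimal sketch of the same check without Devel::Peek follows; the variable names are illustrative only, and bytes::length() is used here purely to inspect the internal buffer size, not as something to rely on in production code.

```perl
use strict;
use warnings;
use bytes ();    # load the module without enabling byte semantics

# Both scalars hold the single code point U+00C4.
my $latin1   = "\x{c4}";
my $upgraded = "\x{c4}";

# Force the internal encoding of the second one to UTF-8.
utf8::upgrade($upgraded);

my $flag_latin1   = utf8::is_utf8($latin1)   ? 1 : 0;   # flag off
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;   # flag on

# Logical view: still one character, and the two strings compare equal.
my $chars = length $upgraded;
my $equal = $latin1 eq $upgraded ? 1 : 0;

# Internal view: one byte for the Latin-1 form (CUR = 1 in Dump),
# two bytes for the upgraded form (CUR = 2 in Dump).
my $bytes_latin1   = bytes::length($latin1);
my $bytes_upgraded = bytes::length($upgraded);
```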
Re: Variation In Decoding Between Encode and XML::LibXML
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:

> I think what I need is some code to strip non-utf8 characters from a string
> -- even if that string has the utf8 bit switched on. I thought that Encode
> would do that for me, but in this case apparently not. Anyone got an
> example?

Try this:

    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the Unicode
replacement character.

If you want to guarantee that the flag is on first, do this:

    utf8::upgrade($string);
    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);

Devel::Peek's Dump() function will come in handy for checking results.

Cheers,

Marvin Humphrey
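A small sketch of the first recipe in action, under the assumption that the input is a byte string holding mostly valid UTF-8 plus one stray byte. The sample data and variable names are invented for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical input: "caf" + UTF-8 for U+00E9, a space, then a lone
# continuation byte (0x80), which is not valid UTF-8 on its own.
my $string = "caf\xc3\xa9 \x80!";

# Treat the scalar as raw bytes, then decode them as UTF-8. With the
# default CHECK argument, decode() substitutes U+FFFD for malformed
# input instead of dying.
Encode::_utf8_off($string);
my $clean = Encode::decode('utf8', $string);

my $has_replacement = $clean =~ /\x{fffd}/ ? 1 : 0;
my $has_e_acute     = index($clean, "caf\x{e9}") == 0 ? 1 : 0;
```

After this runs, $clean is a character string: the valid sequence has become U+00E9 and the stray byte has become U+FFFD, which can then be stripped with a substitution if it is not wanted.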
Re: Win32 *W functions and old -C behavior
On Oct 31, 2006, at 11:57 PM, Oleg V. Volkov wrote:

> Alas, I couldn't get to general discussion - link at end of -C discussion
> seems to be either mangled in a way I can't restore or broken.

http://www.mail-archive.com/perl-unicode@perl.org/msg01963.html

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: perlio probs
On Sep 22, 2005, at 10:28 PM, Martin Hosken wrote:

> Does anyone have any experience of a bug I've encountered in 5.8.7 only
> whereby occasionally (which is what makes it hard to report) this type of
> code:
>
>     $fh = IO::File->new("< input.dat") || die;
>     binmode $fh;
>     # lots of code, even to different package
>     $fh->read($dat, $num_bytes);
>
> does utf-8 conversion on $dat !!

It's a shot in the dark, but is there any correlation between the screwup and
the value of the utf8 flag on $dat prior to the read call?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
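One way to look for that correlation is to record the flag on the buffer immediately before and after each read. The following is only a sketch under assumptions: it writes its own temporary file in place of input.dat, and the variable names are hypothetical. It shows the baseline case where the buffer starts out without the flag.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small binary file to stand in for input.dat (assumption).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xc4\x8d\xff\x00";
close $tmp_fh or die "close failed: $!";

open my $fh, '<', $path or die "open failed: $!";
binmode $fh;

# Record the state of the buffer before the read ...
my $dat = '';
my $flag_before = utf8::is_utf8($dat) ? 1 : 0;

my $got = read($fh, $dat, 4);
die "read failed: $!" unless defined $got;

# ... and again afterwards, to log alongside any corrupted result.
my $flag_after = utf8::is_utf8($dat) ? 1 : 0;
close $fh;
```

In the real program, logging $flag_before next to each bad read would show whether the problem only appears when the buffer already carried the flag.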