On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:

> On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
>> I think what I need is some code to strip non-utf8 characters from a string
>> -- even if that string has the utf8 bit switched on. I thought that Encode
>> would do that for me, but in this case apparently not. Anyone got an
>> example?
>
> Try this:
>
>     Encode::_utf8_off($string);
>     $string = Encode::decode('utf8', $string);
>
> That will replace any byte sequences which are invalid UTF-8 with the Unicode
> replacement character.

Yeah. Not working for me. See attached script. Devel::Peek says:

    SV = PV(0x100801f18) at 0x10082f368
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
      CUR = 29
      LEN = 32

So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is that crap?

Confused and frustrated,

David

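#!/usr/local/bin/perl -w

use 5.12.0;
use Encode;
use Devel::Peek;

# The literal below holds the raw bytes \303\204\302\215 (C3 84 C2 8D),
# as shown in the Devel::Peek output.
my $str = '<p>Tomas LaurinaviÃÂius</p>';
my $utf8 = decode('UTF-8', $str);

say $str;
binmode STDOUT, ':utf8';
say $utf8;
Dump($utf8);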
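
One possible reading of the dump, offered as a sketch and not a confirmed diagnosis: the bytes \303\204\302\215 are valid UTF-8 for the two characters U+00C4 and U+008D, so a decode that only replaces invalid sequences has nothing to replace. Those two code points are also what the UTF-8 bytes of "č" (C4 8D) look like when each byte is treated as a Latin-1 character, which is consistent with the text having been UTF-8 encoded twice somewhere upstream. Under that assumption, which the thread itself does not confirm, the damage could be undone like this. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The character string as Devel::Peek reports it: U+00C4 U+008D where a
# single character was expected.
my $mangled = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>";

# ASSUMPTION: the text was UTF-8 encoded twice. Map each character back to
# one byte (Latin-1), which recovers the original UTF-8 byte sequence.
my $bytes = encode('ISO-8859-1', $mangled);

# Decode those bytes strictly; this croaks if they are not valid UTF-8,
# which would mean the double-encoding assumption does not hold.
my $fixed = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Print the code point instead of the character to sidestep terminal encoding.
printf "U+%04X\n", ord(substr($fixed, 18, 1));   # prints U+010D
```

This only makes sense for strings known to be double-encoded. Run against text that is already correct, the Latin-1 step will either croak in the strict decode or silently mangle characters above U+00FF.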