On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
> On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
>
>> Try passing the parser options as a hash reference:
>>
>> my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});
>
> WTF! That fixes it! I don't understand why it seems to be ignoring the
> encoding set in the constructor. But I've noticed the same thing with other
> options. Seems like there's some consistency to be worked out in XML::LibXML
> options, still.

Okay, a bit more information: this was not quite it, alas.

>> In order to print Unicode text strings (as opposed to octet strings)
>> correctly to a terminal (UTF-8 or not), add the following line before
>> the first output:
>>
>> binmode STDOUT, ':utf8';
>>
>> But note that STDOUT is global.
>
> Yes, I do this all the time. Surprisingly, I don't get warnings for this
> script, even though it is outputting multibyte characters.

This is key. If I set the binmode on STDOUT to :utf8, the bogus characters
still print as garbage. If I set it to :raw, they come out right after
processing by both Encode and XML::LibXML (I'm assuming they're interpreted as
Latin-1).
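To make the :raw versus UTF-8 layer difference concrete, here is a minimal
sketch. The sample string is made up, and written_bytes() is a hypothetical
helper that prints into an in-memory filehandle instead of STDOUT so the
octets can be inspected. Characters below U+0100 are written as single bytes
under :raw with no "Wide character" warning, which may be why none shows up.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string through the given PerlIO layer into an
# in-memory buffer and return the octets that were actually written.
sub written_bytes {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

# Sample text (an assumption): every character is below U+0100.
my $text = "caf\x{E9}";

my $raw  = written_bytes(':raw', $text);              # U+00E9 as one byte
my $utf8 = written_bytes(':encoding(UTF-8)', $text);  # U+00E9 as two bytes

# Show the octets of each result in hex.
for my $pair ([raw => $raw], [utf8 => $utf8]) {
    my ($label, $octets) = @$pair;
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
    print "$label: $hex\n";
}
```

A terminal expecting UTF-8 will show the lone E9 byte from the :raw case as
garbage, while a Latin-1 terminal will show it as e-acute.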

So my question is this: Why isn't Encode dying when it runs into these
characters? They're not valid UTF-8, AFAICT. Are they somehow valid utf8 (that
is, valid in Perl's internal format)? Why would they be?
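For reference, a minimal sketch of when Encode does and does not die. With
the default CHECK argument, decode() substitutes U+FFFD for malformed input
rather than croaking; it dies only when asked to, for example with
Encode::FB_CROAK. The sample octets are made up for illustration and are not
the data from the script under discussion.

```perl
use strict;
use warnings;
use Encode ();

# Sample octets (an assumption): a lone 0xE9 byte is Latin-1 for e-acute,
# but it is not a complete UTF-8 sequence.
my $octets = "caf\xE9 ok";

# Default CHECK: malformed input is replaced with U+FFFD and nothing dies.
my $lenient  = Encode::decode('UTF-8', $octets);
my $replaced = ($lenient =~ /\x{FFFD}/) ? 1 : 0;

# FB_CROAK: the same input now raises an exception. LEAVE_SRC keeps
# decode() from modifying $octets in place.
my $strict = eval {
    Encode::decode('UTF-8', $octets,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $died = defined $strict ? 0 : 1;

print "default CHECK substituted U+FFFD: $replaced\n";
print "FB_CROAK died: $died\n";
```

Note that characters which were decoded from the wrong encoding are still
valid code points, so neither mode will complain about them.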

I think what I need is some code to strip non-UTF-8 characters from a string --
even if that string has the utf8 flag switched on. I thought that Encode would
do that for me, but in this case apparently not. Anyone got an example?
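One possible approach, as a minimal sketch: let decode() mark each malformed
sequence with U+FFFD (its default behaviour) and then delete those markers.
strip_invalid_utf8() is a hypothetical helper name and the sample octets are
made up. It works on undecoded octets, so it assumes they are still
available, and it also removes any U+FFFD that was legitimately in the input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode octets as UTF-8 and drop whatever was malformed.
sub strip_invalid_utf8 {
    my ($octets) = @_;
    # Default CHECK: each malformed sequence becomes U+FFFD instead of dying.
    my $text = Encode::decode('UTF-8', $octets);
    # Delete the replacement characters, leaving only the valid text.
    $text =~ s/\x{FFFD}//g;
    return $text;
}

# Sample input (an assumption): valid UTF-8 for e-acute, then a stray 0xE9.
my $clean = strip_invalid_utf8("caf\xC3\xA9 \xE9ok");

binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
```

This only helps when the damage is malformed octets. A string that was
decoded from the wrong encoding contains valid characters and passes through
untouched.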

Thanks,

David