On Wed, Apr 7, 2010 at 08:21, Jesper Persson <[email protected]> wrote:
> on this page: http://www.usm.edu/math/conferences/scc1/robota/contents.html
> The content-type meta tag contains a carriage return and a line feed:
>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859- 1">
>
> $response->decoded_content cannot handle this.
I don't think LWP should try to guess what charset was meant in cases
like this. Instead, I propose making it possible to specify an
alternative charset to use when the one given in the response headers
or in the content itself isn't known to Encode. Something like this patch:
diff --git a/lib/HTTP/Message.pm b/lib/HTTP/Message.pm
index 818efae..5ffa215 100644
--- a/lib/HTTP/Message.pm
+++ b/lib/HTTP/Message.pm
@@ -374,8 +374,24 @@ sub decoded_content
$content_ref = \$copy;
$content_ref_iscopy++;
}
- $content_ref = \Encode::decode($charset, $$content_ref,
- ($opt{charset_strict} ? Encode::FB_CROAK() : 0) | Encode::LEAVE_SRC());
+ eval {
+ $content_ref = \Encode::decode($charset, $$content_ref,
+ ($opt{charset_strict} ? Encode::FB_CROAK() : 0) | Encode::LEAVE_SRC());
+ };
+ if ($@) {
+ my $retried;
+ if ($@ =~ /^Unknown encoding/) {
+ my $alt_charset = lc($opt{alt_charset} || "");
+ if ($alt_charset && $charset ne $alt_charset) {
+ # Retry decoding with the alternative charset
+ $content_ref = \Encode::decode($alt_charset, $$content_ref,
+ ($opt{charset_strict} ? Encode::FB_CROAK() : 0) | Encode::LEAVE_SRC())
+ unless $alt_charset =~ /^(?:none|us-ascii|iso-8859-1)\z/;
+ $retried++;
+ }
+ }
+ die unless $retried;
+ }
die "Encode::decode() returned undef improperly" unless defined $$content_ref;
if ($is_xml) {
# Get rid of the XML encoding declaration if present
@@ -872,6 +888,13 @@ C<none> can used to suppress decoding of the charset.
This override the default charset guessed by content_charset() or
if that fails "ISO-8859-1".
+=item C<alt_charset>
+
+If decoding fails because the charset specified in the Content-Type header
+isn't recognized by Perl's Encode module, then try decoding using this charset
+instead of failing. The C<alt_charset> can be specified as C<none> to simply
+return the content without any charset decoding as the fallback.
+
=item C<charset_strict>
Abort decoding if malformed characters is found in the content. By
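
To make the intended behaviour easier to discuss, here is a rough
standalone sketch of the same fallback idea using only Encode. The
helper name decode_with_fallback and the charset name
"x-no-such-charset" are made up for illustration and are not part of
LWP or of the patch. Unlike the patch, this sketch only treats "none"
as "skip decoding" and ignores charset_strict, so take it as an
approximation rather than the actual change:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of LWP): decode $bytes as $charset and
# retry with $alt_charset only when Encode does not know $charset.
sub decode_with_fallback {
    my ($bytes, $charset, $alt_charset) = @_;
    $alt_charset = lc($alt_charset // "");

    my $text = eval { Encode::decode($charset, $bytes, Encode::LEAVE_SRC()) };
    return $text unless $@;

    # Re-throw anything that is not an unrecognized charset name,
    # and also re-throw when no alternative was given.
    die $@ unless $@ =~ /^Unknown encoding/;
    die $@ unless length $alt_charset;

    # "none" means: hand the bytes back without any charset decoding.
    return $bytes if $alt_charset eq "none";

    return Encode::decode($alt_charset, $bytes, Encode::LEAVE_SRC());
}

# "x-no-such-charset" stands in for whatever unusable name the page declared.
my $text = decode_with_fallback("caf\xC3\xA9", "x-no-such-charset", "UTF-8");
printf "%d characters\n", length $text;
```

With the patch applied, the equivalent call on a response would
presumably be $response->decoded_content(alt_charset => "UTF-8"),
assuming the option keeps the name used in the diff above.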
>
> Regards
> Jesper Persson
>