Hi,

I've written a module that collects encoding information for (X)HTML files. (X)HTML files may carry encoding information in

  1. the higher-level protocol (e.g. the charset parameter of the
     Content-Type header in HTTP and MIME)
  2. the XML declaration (for XHTML documents)
  3. the byte order mark at the beginning of the file
  4. meta elements like
     <meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>

At the user's option, the module tries to extract the explicitly given information from each of these sources. It then sorts the resulting list according to the order above. In list context it returns the whole list; in scalar context it returns the first encoding in the list (i.e. the encoding the user agent must use to parse the document). This looks like

  #!perl -w
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::Encoding '';

  # fetch the document together with its HTTP headers
  my $r = LWP::UserAgent->new->request(
      HTTP::Request->new(GET => 'http://www.w3.org/'));

  # scalar context: print only the first (authoritative) encoding
  print scalar HTML::Encoding::get_encoding
      check_bom     => 1,
      check_xmldecl => 1,
      check_meta    => 1,
      headers       => $r->headers,
      string        => $r->content;

This would currently print 'us-ascii', as http://www.w3.org/ returns

  Content-Type: text/html;charset=us-ascii

In list context this would return

  [
    { source => 4, encoding => 'us-ascii' },
    { source => 1, encoding => 'us-ascii' },
  ]

since the page also has a meta element

  <meta http-equiv='Content-Type' content='text/html;charset=us-ascii' />

The POD says:

  [...]
  The source value is mapped to one of the constants FROM_META,
  FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these
  constants solely into your namespace or using the ":constants"
  symbol, e.g.

    use HTML::Encoding ':constants';
  [...]

This is useful if you want to check whether there is a mismatch between the declared encodings.

Some issues that came to my mind while writing this module:

* HTTP::Headers should provide some information on whether LWP::UserAgent has already parsed the header section of the HTML file, so that I wouldn't need to do the same thing again.
  Currently one cannot distinguish whether multiple Content-Type headers were present in the original response or whether they come from meta elements.

* HTML::Encoding currently uses HTML::Parser to extract the meta element if version 3.21 or later is available (maybe I'll switch to HTML::HeadParser ...). The problem is that HTML::Parser is, AFAIK, currently unable to process documents in an encoding that is not compatible with US-ASCII (UTF-16BE, for example). I think it is out of scope for HTML::Encoding to recode the given string to some US-ASCII compatible encoding (that'd be UTF-8) in order to parse the document; this should be done by HTML::Parser using some encoding parameter. Personally I'd say that HTML::Parser should only output UTF-8 encoded characters, as XML::Parser does, but this will certainly clash with current users who expect to get ISO-8859-1 or something like that out of it... Is it likely that HTML::Parser will incorporate such a feature using the Unicode::* modules or Text::Iconv or whatever is currently available?

* Is there currently really no module that does what HTML::Encoding is supposed to do? In general you have to use such a module every time you try to do anything with an HTML document; hmm, maybe western people got too used to ISO-8859-1...

The current version can be found at

  http://www.websitedev.de/perl/HTML-Encoding-0.01.tar.gz

You'll currently need Perl 5.6.0 to use it. The distribution currently lacks a proper README and test files...

Is the module name appropriate? Any other comments or suggestions? I greatly appreciate them :-)

Thanks for your time,
-- 
Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
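
P.S. The mismatch check mentioned above could look roughly like the sketch below. It deliberately does not call HTML::Encoding: it works on hand-written data shaped like the list-context result shown earlier (hashes with 'source' and 'encoding' keys). The helper names distinct_encodings and encodings_mismatch are hypothetical and not part of the module, and the second data set is made up for illustration. Charset names are compared case-insensitively; aliases such as 'latin1' for 'iso-8859-1' are not resolved.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Encoding): returns the distinct
# encoding names, lower-cased, in the order they were first declared.
sub distinct_encodings {
    my @declarations = @_;
    my %seen;
    return grep { !$seen{$_}++ } map { lc $_->{encoding} } @declarations;
}

# Hypothetical helper: true if the declarations name more than one encoding.
sub encodings_mismatch {
    my @distinct = distinct_encodings(@_);
    return @distinct > 1 ? 1 : 0;
}

# Shaped like the list-context example above; the differing case is
# added here to show the case-insensitive comparison.
my @consistent = (
    { source => 4, encoding => 'us-ascii' },
    { source => 1, encoding => 'US-ASCII' },
);

# Made-up data: the two sources disagree.
my @conflicting = (
    { source => 4, encoding => 'utf-8' },
    { source => 1, encoding => 'iso-8859-1' },
);

for my $case (\@consistent, \@conflicting) {
    if (encodings_mismatch(@$case)) {
        print "mismatch: ", join(', ', distinct_encodings(@$case)), "\n";
    }
    else {
        print "consistent\n";
    }
}
```

With the real module, the array would presumably be filled from a list-context call to get_encoding instead of being written by hand.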