Bugs item #993380, was opened at 2004-07-18 14:04
Message generated for change (Settings changed) made by jshin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993380&group_id=59548

>Category: plugin: parse-html
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: figure out charset from the meta tag

Initial Comment:
The majority of HTML documents on the web are served without 'charset' specified in the Content-Type HTTP header. Currently, Nutch doesn't look 'into' HTML files to figure out the character encoding (charset), which is often specified in a 'meta tag' like this:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-7">

I changed the HTML parser to look 'into' HTML files and read off the value of 'charset'. A SAX XML parser can't be used because it needs to know the encoding before parsing. My patch uses a technique often used by browsers (I know for sure Mozilla does this), which is to inflate each byte (blindly) to 2 bytes by zero-padding and then apply a regular expression to the result. The first 2000 bytes are examined this way, and I was able to figure out the charset of most documents whose encoding wouldn't be known otherwise.

I also explicitly set the fallback charset (used when both the HTTP Content-Type header and the meta tag are missing) to 'windows-1252' (a superset of ISO-8859-1). Probably, this should be made even smarter in two ways:

1) this should be configurable
2) use a per-TLD / ccTLD fallback charset mapping table

In addition, I'm storing the value of the 'detected' character encoding as 'metadata' so that 'cached.jsp' can make use of it. I'll file a new bug and upload a patch to cached.jsp.

----------------------------------------------------------------------
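
The detection steps described in the comment can be sketched roughly as follows. This is not the patch itself (Nutch is written in Java, and the patch is not shown here), only a minimal Python illustration. The function names, the regular expression, the `fallback` parameter, and the choice to prefer the HTTP header over the meta tag are all assumptions made for the example. Decoding the bytes as ISO-8859-1 (latin-1) stands in for zero-padding, since it maps each byte to the code point with the same value.

```python
import re

# Number of leading bytes to inspect, as stated in the report.
SNIFF_LENGTH = 2000

# Fallback named in the report; exposed as a parameter below because the
# report suggests it should be configurable.
FALLBACK_CHARSET = "windows-1252"

# Hypothetical pattern (not taken from the patch): finds charset=... inside
# a <meta ...> tag, tolerating optional quotes and whitespace.
META_CHARSET_RE = re.compile(
    r"<meta\b[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:\-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(content: bytes):
    """Return the charset named in a meta tag within the first bytes, or None."""
    head = content[:SNIFF_LENGTH]
    # latin-1 never fails: every byte 0x00-0xFF becomes the character with
    # the same code point, which has the same effect as zero-padding each
    # byte to 16 bits before running the regular expression.
    text = head.decode("latin-1")
    match = META_CHARSET_RE.search(text)
    return match.group(1) if match else None


def resolve_charset(http_charset, content: bytes, fallback=FALLBACK_CHARSET):
    """Pick a charset: HTTP header, then meta tag, then fallback (assumed order)."""
    return http_charset or sniff_meta_charset(content) or fallback
```

For the meta tag quoted in the comment, `sniff_meta_charset` would return `ISO-8859-7`. A document with neither a header charset nor a meta tag in its first 2000 bytes would resolve to the fallback.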
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
