Bugs item #993380, was opened at 2004-07-18 11:04
Message generated for change (Comment added) made by cutting
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993380&group_id=59548

Category: plugin: parse-html
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: figure out charset from the meta tag

Initial Comment:
The majority of html documents on the web are served without
'charset' specified in the Content-Type HTTP header. Currently, Nutch
doesn't look 'into' html files to figure out the character encoding
(charset), which is often specified in a 'meta tag' like this:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-7">

I changed the html parser to look 'into' html files and read off the
value of 'charset'. A SAX xml parser can't be used because it needs to
know the encoding before parsing. My patch uses a technique often used
by browsers (I know for sure Mozilla does this), which is to inflate
the byte sequence (blindly) to 2-byte characters by zero-padding and
then to apply a regular expression. Only the first 2000 bytes are
examined this way, and I was able to figure out the charset of most
documents whose encoding wouldn't be known otherwise.

I also explicitly set the fallback charset (used when both the HTTP
C-T header and the meta tag are missing) to 'windows-1252' (a superset
of ISO-8859-1). Probably, this should be made even smarter in two ways:

1) this should be configurable
2) use a per-TLD / ccTLD fallback charset mapping table

In addition, I'm storing the value of the 'detected' character
encoding as 'metadata' so that 'cached.jsp' can make use of it. I'll
file a new bug and upload a patch to cached.jsp.

----------------------------------------------------------------------

>Comment By: Doug Cutting (cutting)
Date: 2004-07-22 09:37

Message:
Logged In: YES
user_id=21778

I just committed this. Thanks!

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-21 22:29

Message:
Logged In: YES
user_id=307557

This is a new patch addressing Doug's concerns.

----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-20 08:57

Message:
Logged In: YES
user_id=21778

I guess resolveEncodingAlias is okay in StringUtils, although I wonder
if it might be better on a new class like TextUtils or I18NUtils...
But I'm okay with that in StringUtils.

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-19 21:39

Message:
Logged In: YES
user_id=307557

Thanks for taking a look at the patch. I'll do what you suggested
(later this week) and upload a new patch. Btw, charset alias
resolution is not only for html but is also useful for other 'plugins'
(say, text/plain). Given that, how about keeping 'String
resolveEncodingAlias' in StringUtils while moving 'String
sniffCharacterEncoding' to htmlParser?

----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-19 14:21

Message:
Logged In: YES
user_id=21778

Sorry I missed this one. Could you please move the utility methods
from StringUtil into HtmlParser? These are not really universal string
utilities. Also, please add a default charset to
conf/nutch-default.xml and use NutchConf to access it. Look at other
classes for how they do this, or ask for help if it is not obvious.
Thanks!

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-19 13:53

Message:
Logged In: YES
user_id=307557

Doug, thanks for applying my patch for bug 993385 (cached.jsp fix:
https://sourceforge.net/tracker/index.php?func=detail&aid=993385&group_id=59548&atid=491356),
but that patch doesn't work without my patch for this bug. Can you
take a look at this patch and commit it as you see fit? Thanks.

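
For illustration, the sniffing approach described in the initial comment could be sketched in Java roughly as follows. This is not the committed patch. The class name CharsetSniffer, the method names sniff and detect, and both regular expressions are assumptions made up for this sketch; only the 2000-byte window, the windows-1252 fallback, and the idea of inflating bytes before matching come from the report. Decoding as ISO-8859-1 stands in for the zero-padding step, and the standard java.nio.charset.Charset.forName lookup is used here in place of the patch's resolveEncodingAlias, whose behavior is not shown in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the class from the actual Nutch patch.
final class CharsetSniffer {
    // Number of leading bytes to inspect; the report mentions 2000.
    private static final int CHUNK_SIZE = 2000;
    // Fallback named in the report; hard-coded here, configurable in practice.
    private static final String FALLBACK = "windows-1252";

    // Assumed patterns: locate a content-type meta tag, then its charset value.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a meta tag, or null if none is found. */
    static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // ISO-8859-1 maps every byte to the char with the same value, so this
        // is the "zero-padding" inflation: no byte sequence can fail to decode.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(head);
        if (!meta.find()) {
            return null;
        }
        Matcher charset = CHARSET.matcher(meta.group());
        return charset.find() ? charset.group(1) : null;
    }

    /** Prefers the HTTP header charset, then the meta tag, then the fallback. */
    static String detect(String httpCharset, byte[] content) {
        String name = (httpCharset != null) ? httpCharset : sniff(content);
        if (name == null) {
            name = FALLBACK;
        }
        try {
            // Resolves aliases to the JVM's canonical charset name.
            return Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return FALLBACK;
        }
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"content-type\" "
            + "content=\"text/html; charset=ISO-8859-7\"></head></html>")
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(null, page));
    }
}
```

The match works on the inflated bytes because the tag and the charset name are plain ASCII in nearly every encoding a page is likely to use, so the document's real encoding does not need to be known before the meta tag is read.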
