Bugs item #993102, was opened at 2004-07-17 21:37
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993102&group_id=59548
Category: plugin: parse-html
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: add char. encoding aliases

Initial Comment:
There are numerous documents with a mislabelled charset. The most common case is a label naming a charset that is a subset of the charset actually used in the document. For example, ISO-8859-1 is a subset of Windows-1252: in their common part (0x00 - 0x7f, 0xa0 - 0xff) both encode exactly the same set of characters, and Windows-1252 adds further characters in 0x80 - 0x9f. Documents that use characters covered only by Windows-1252 are often mislabelled as ISO-8859-1.

On encountering such a document, Nutch currently bails out (because 'SAX' raises an exception - invalid character exception?). Given the state of the web today, it is not desirable to reject those documents, and virtually all browsers are generous enough to treat ISO-8859-1-labelled documents as Windows-1252.

There are a few other cases (EUC-KR < x-windows-949, GB2312/x-EUC-CN < GBK < GB18030, TIS620 < ISO-8859-11 < Windows-874). My patch mimics this 'generosity' of web browsers.

----------------------------------------------------------------------
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
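The aliasing described in the comment can be illustrated with a small lookup table. This is only a minimal sketch, not the code from the attached patch: the class name CharsetAliases and the methods resolve and charsetFor are hypothetical, the choice to map each label to the largest superset in its chain is an assumption, and the target names are the Java charset names, whose availability depends on the JVM in use.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not taken from the Nutch patch.
final class CharsetAliases {
    // Declared label (lower case) -> superset charset to decode with.
    // Assumption: each label maps to the largest superset named in the report.
    private static final Map<String, String> SUPERSETS = new HashMap<>();
    static {
        SUPERSETS.put("iso-8859-1", "windows-1252");
        SUPERSETS.put("euc-kr", "x-windows-949");
        SUPERSETS.put("gb2312", "GB18030");
        SUPERSETS.put("x-euc-cn", "GB18030");
        SUPERSETS.put("gbk", "GB18030");
        SUPERSETS.put("tis620", "x-windows-874");
        SUPERSETS.put("iso-8859-11", "x-windows-874");
    }

    private CharsetAliases() {}

    // Returns the superset name for a declared label, or the label itself
    // (unchanged) when no alias is known. Matching ignores case and
    // surrounding whitespace.
    static String resolve(String declared) {
        if (declared == null) {
            return null;
        }
        String key = declared.trim().toLowerCase(Locale.ROOT);
        String superset = SUPERSETS.get(key);
        return superset != null ? superset : declared;
    }

    // Returns a Charset for a non-null declared label, preferring the
    // superset when this JVM supports it. Throws the usual Charset.forName
    // exceptions if the declared label itself is unsupported or illegal.
    static Charset charsetFor(String declared) {
        String name = resolve(declared);
        if (!Charset.isSupported(name)) {
            name = declared;
        }
        return Charset.forName(name);
    }
}
```

A parser would call CharsetAliases.charsetFor(label) on the label taken from the HTTP header or meta tag before decoding the page, so a byte such as 0x93 in a page labelled ISO-8859-1 is read as a Windows-1252 character instead of causing the document to be rejected.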
