Bugs item #993102, was opened at 2004-07-17 21:37
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993102&group_id=59548

Category: plugin: parse-html
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: add char. encoding aliases

Initial Comment:
Many documents on the web carry a mislabelled charset.
The most common case is a label naming a charset that
is a subset of the charset actually used in the
document. For example, ISO-8859-1 is in practice a
subset of Windows-1252: in their common part (0x00-0x7f,
0xa0-0xff) they encode exactly the same characters, and
Windows-1252 assigns additional printable characters in
0x80-0x9f. Documents that use those additional
characters are often mislabelled as ISO-8859-1. On
encountering such a document, Nutch currently bails out
(because 'SAX' raises an exception - invalid character
exception?).
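
To illustrate the mismatch described above, here is a minimal
sketch (not part of the patch; the class and method names are
made up for this example) that decodes the same byte under a
given charset name using the standard java.nio.charset.Charset
API:

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class MislabelDemo {
    // Decode raw bytes using the charset with the given name.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }
}
```

Decoding the single byte 0x93 as "windows-1252" yields U+201C
(left double quotation mark), while decoding it as "ISO-8859-1"
yields the control character U+0093, which is almost never what
the author of the page intended.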

Given the state of the web today, it's not desirable
to reject those documents; virtually all browsers are
generous enough to treat documents labelled ISO-8859-1
as Windows-1252. There are a few other cases, where
A < B means A is a subset of B (EUC-KR < x-windows-949,
GB2312/x-EUC-CN < GBK < GB18030, TIS620 < ISO-8859-11 <
Windows-874). My patch mimics this 'generosity' of web
browsers.
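
The idea can be sketched as a simple lookup table. This is only
an illustration under stated assumptions, not the code in the
patch: the class name, the method name, and the choice to map
each label to the largest superset in its chain are all
hypothetical, and the charset names are copied from the list
above (the names a given JVM accepts may differ).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: map a declared charset label to a superset
// charset that should be used for decoding instead.
final class EncodingAliases {
    private static final Map<String, String> SUPERSET = new HashMap<>();

    static {
        // Keys are lower-cased labels; values are the assumed supersets.
        SUPERSET.put("iso-8859-1", "windows-1252");
        SUPERSET.put("euc-kr", "x-windows-949");
        SUPERSET.put("gb2312", "GB18030");
        SUPERSET.put("x-euc-cn", "GB18030");
        SUPERSET.put("gbk", "GB18030");
        SUPERSET.put("tis620", "windows-874");
        SUPERSET.put("iso-8859-11", "windows-874");
    }

    // Return the superset for a known label, or the label unchanged.
    static String resolve(String label) {
        if (label == null) {
            return null;
        }
        String key = label.trim().toLowerCase(Locale.ROOT);
        String superset = SUPERSET.get(key);
        return superset != null ? superset : label;
    }
}
```

With this table, EncodingAliases.resolve("ISO-8859-1") returns
"windows-1252", while a label with no entry, such as "UTF-8",
is returned unchanged.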



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers