Bugs item #993380, was opened at 2004-07-18 14:04
Message generated for change (Settings changed) made by jshin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993380&group_id=59548

>Category: plugin: parse-html
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: figure out charset from the meta tag

Initial Comment:
The majority of HTML documents on the web are served
without 'charset' specified in the Content-Type HTTP
header. Currently, Nutch doesn't look 'into' HTML files
to figure out the character encoding (charset), which is
often specified in a 'meta' tag like this:

<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-7">

I changed the HTML parser to look 'into' HTML files to
read off the value of 'charset'. A SAX XML parser can't
be used because it needs to know the encoding before
parsing. My patch uses a technique often used by
browsers (I know for sure Mozilla does this), which is
to inflate the byte sequence (blindly) to 2-byte
characters (by zero-padding) and to apply a regular
expression. The first 2000 bytes are examined this way,
and I was able to figure out the charset of most of the
documents whose encoding wouldn't be known otherwise.
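
The sniffing step described above could look roughly like
the sketch below. It is not the code from the patch: the
class name MetaCharsetSniffer, the method sniff(), the
constant SNIFF_LIMIT and the exact regular expression are
all assumptions made for illustration. Decoding the bytes
as ISO-8859-1 stands in for the zero-padding, since that
charset maps every byte to the character with the same
numeric value.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not taken from the patch.
final class MetaCharsetSniffer {
    // Assumed limit, mirroring the "first 2000 bytes" mentioned above.
    private static final int SNIFF_LIMIT = 2000;

    // Loose pattern: a <meta ...> tag that contains charset=NAME.
    // [^>] also matches line breaks, so a tag wrapped across lines is found.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([a-z0-9][a-z0-9_.:-]*)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String sniff(byte[] content) {
        int length = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 turns each byte 0x00-0xFF into the char with the same
        // value, i.e. the blind zero-padding to 2 bytes.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

The returned name is whatever the page claims, so a caller
would still need to check that the JVM supports it (for
example with java.nio.charset.Charset.isSupported) before
decoding with it.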

I also explicitly set the fallback charset (used when
both the HTTP Content-Type header and the meta tag are
missing) to 'windows-1252' (a superset of ISO-8859-1).
Probably, this should be made even smarter in two ways:

1) the fallback should be configurable
2) use a per-TLD / ccTLD fallback charset mapping table
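
To make the two suggestions concrete, here is a minimal
sketch of how the lookup order might be arranged: HTTP
header first, then the meta tag, then a per-TLD table, then
a configurable default. None of this exists in the patch;
CharsetResolver, addTldFallback() and resolve() are
hypothetical names, and any TLD-to-charset entries a caller
registers are that caller's own guesses.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical resolver; names and structure are assumptions for illustration.
final class CharsetResolver {
    private final String defaultCharset;                 // configurable fallback
    private final Map<String, String> tldFallbacks = new HashMap<>();

    CharsetResolver(String defaultCharset) {
        this.defaultCharset = defaultCharset;
    }

    /** Registers a fallback charset for a TLD such as "gr" (no leading dot). */
    void addTldFallback(String tld, String charset) {
        tldFallbacks.put(tld.toLowerCase(Locale.ROOT), charset);
    }

    /** Picks a charset: header, then meta tag, then TLD table, then default. */
    String resolve(String headerCharset, String metaCharset, String host) {
        if (headerCharset != null) return headerCharset;
        if (metaCharset != null) return metaCharset;
        String byTld = tldFallbacks.get(tldOf(host));
        return byTld != null ? byTld : defaultCharset;
    }

    // Everything after the last dot of the host name, lower-cased.
    private static String tldOf(String host) {
        if (host == null) return "";
        return host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
    }
}
```

For example, a resolver created with "windows-1252" as the
default and a "gr" entry of "ISO-8859-7" would return
ISO-8859-7 for an undeclared page on a .gr host and
windows-1252 for one on a .com host.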
  
In addition, I'm storing the value of the 'detected'
character encoding as 'metadata' so that 'cached.jsp'
can make use of it. I'll file a new bug and upload a
patch to cached.jsp.



----------------------------------------------------------------------



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers