I have run into the same problem as Kenneth.
I am crawling a page whose actual charset is GB18030, but the meta tag of the
page declares gb2312,
so I get unreadable characters when the page is parsed.
Following Ken's advice, I changed StringUtil.resolveEncodingAlias() by adding
encodingAliases.put("GB2312", "GB18030");
and I now get the message "setting encoding to GB18030".
However, it seems to make no difference: the result still contains unreadable
characters.
It seems that the parser is still using the original encoding, gb2312.
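To make the intent concrete, here is a minimal, self-contained sketch of the aliasing idea. It is not Nutch's actual code: the class name EncodingAliasSketch, the map ENCODING_ALIASES, and the helpers resolveAlias() and decode() are all hypothetical names chosen for illustration. It only assumes that the declared charset is looked up in an alias table before the bytes are decoded.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the Nutch implementation.
class EncodingAliasSketch {
    // Maps a declared charset name to the superset that should really be used.
    private static final Map<String, String> ENCODING_ALIASES = new HashMap<>();
    static {
        ENCODING_ALIASES.put("GB2312", "GB18030");
        ENCODING_ALIASES.put("BIG5", "Big5-HKSCS");
    }

    // Returns the aliased charset name, or the trimmed input if no alias exists.
    static String resolveAlias(String declared) {
        if (declared == null) {
            return null;
        }
        String trimmed = declared.trim();
        String mapped = ENCODING_ALIASES.get(trimmed.toUpperCase(Locale.ROOT));
        return mapped != null ? mapped : trimmed;
    }

    // Decodes raw page bytes using the alias-resolved charset.
    static String decode(byte[] raw, String declared) {
        return new String(raw, Charset.forName(resolveAlias(declared)));
    }

    public static void main(String[] args) {
        // U+9555 is commonly cited as a character missing from GB2312
        // but present in GBK/GB18030.
        String text = "\u6731\u9555\u57fa";
        byte[] raw = text.getBytes(Charset.forName("GB18030"));

        // Decoding with the declared charset, without alias resolution.
        System.out.println(new String(raw, Charset.forName("GB2312")));
        // Decoding after alias resolution.
        System.out.println(decode(raw, "gb2312"));
    }
}
```

The point of the sketch is that the alias only helps if the resolved name is the one actually passed to the decoder; if the parser decodes with the originally declared name, the alias entry has no effect.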
Could you give me a hand?
Thanks in advance.
King Kong
Ken Krugler wrote:
>
>>Thanks for your reply.
>>
>>I have found that the method you mentioned looks into the http header from
>>web server. It looks for "charset" and does the mapping. The apache web
>>server which contains the document has already configured:
>>
>>AddDefaultCharset Big5-HKSCS
>>
>>The crawl engine does treat the encoding of all pages from the web server as
>>Big5-HKSCS.
>>But the crawl engine also looks into the meta tag of the html page.
>>I have two identical html pages with hong kong big5 characters. One has the
>>tag
>>
>><meta http-equiv="Content-Type" content="text/html; charset=Big5" />
>>
>>The other
>>
>><meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
>>
>>When both of these html pages are in the search result page, the "summary"
>>of the first one contains unreadable characters.
>>So I think I need to modify the code which reads the meta tag of the html
>>page.
>>Do you have any idea?
>
> From a quick look at the source, this eventually also calls
> StringUtil.resolveEncodingAlias().
>
> HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(),
> passing it the content-type meta data, and then takes the returned
> charset name and calls StringUtil.resolveEncodingAlias().
>
> So if you fix StringUtil.resolveEncodingAlias(), I think it will take
> care of both issues (HTTP server and HTML pages).
>
> -- Ken
>
>
>>-----Original Message-----
>>>I want to crawl documents with charset="big5-hkscs" (which is an
>>>extension of big5, with extra hong kong chinese characters). But the
>>>document's meta tags set content="text/html; charset=big5" instead. So the
>>>crawl engine treats the document as "big5" instead of "big5-hkscs". That
>>>makes the extra hong kong characters unreadable on the search result page.
>>>Now my question is: Can I force the crawl engine to treat the document as
>>>"big5-hkscs"?
>>
>>I don't know of a way to do this without some coding.
>>
>>You could modify the resolveEncodingAlias method to add (or
>>uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
>>rebuild Nutch.
>>
>>See the resolveEncodingAlias() method here:
>>
>>http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
>
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>
>
>
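For reference, the flow Ken describes in the quoted text above (take the charset out of the Content-Type value, then resolve aliases) could be sketched roughly as follows. This is an illustration under assumptions, not the code in HtmlParser or StringUtil: the class MetaCharsetSketch, the regular expression, and the helpers parseCharset() and resolveAlias() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "parse the charset, then resolve its alias".
class MetaCharsetSketch {
    // Matches charset=NAME in a Content-Type value, with optional quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Extracts the charset name from e.g. "text/html; charset=Big5".
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Maps a declared charset to the superset charset used for decoding.
    static String resolveAlias(String name) {
        if (name == null) {
            return null;
        }
        if (name.equalsIgnoreCase("big5")) {
            return "Big5-HKSCS";
        }
        if (name.equalsIgnoreCase("gb2312")) {
            return "GB18030";
        }
        return name;
    }

    public static void main(String[] args) {
        String meta = "text/html; charset=Big5";
        // prints "Big5-HKSCS"
        System.out.println(resolveAlias(parseCharset(meta)));
    }
}
```

Because the same two steps apply to the HTTP header value and to the meta tag value, a single alias table covers both sources in this sketch.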
--
View this message in context:
http://www.nabble.com/Charset-question-tf2231717.html#a6353390
Sent from the Nutch - User forum at Nabble.com.