Re: Encoding problem with ExtractRequestHandler for HTML indexing

Teruhiko Kurosaka Wed, 24 Mar 2010 14:14:20 -0700

I suppose you mean Extract_ing_RequestHandler.

Out of curiosity, I sent in a Japanese HTML file encoded in EUC-JP.
It was converted to Unicode properly, and the index contains the
correct Japanese words.


Do your HTML files have a META tag for Content-Type whose value
includes charset= ? For example, this is what I have:
    <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />
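
If you want to check your files quickly, something like the following
Python sketch could help. It is only an illustration and not part of
Solr: the function names (declared_charset, to_unicode) and the 4096-byte
scan limit are my own assumptions. It looks for a charset declaration in
a META tag and decodes the bytes to Unicode accordingly.

```python
import re

# Matches charset=... inside a <meta ...> tag. This covers the http-equiv
# form and the shorter <meta charset="..."> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(html_bytes, scan_limit=4096):
    """Return the charset named in a META tag, or None if none is declared.

    Hypothetical helper; scan_limit is an arbitrary assumption about how
    far into the file the declaration may appear.
    """
    match = _META_CHARSET.search(html_bytes[:scan_limit])
    return match.group(1).decode("ascii") if match else None


def to_unicode(html_bytes, fallback="utf-8"):
    """Decode raw HTML bytes with the declared charset, else the fallback.

    Undecodable bytes are replaced rather than raising an error.
    """
    charset = declared_charset(html_bytes) or fallback
    return html_bytes.decode(charset, errors="replace")
```

A file for which declared_charset() returns None is one where the
encoding has to be guessed, so those are the files I would look at first.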


On Mar 21, 2010, at 9:45 AM, Ukyo Virgden wrote:

> Hi,
> 
> I'm trying to index HTML documents with different encodings. My HTML files
> are either in win-12XX, ISO-8859-X or UTF-8 encoding. The handler correctly
> parses all HTML in their respective encodings and indexes them. However, on
> the web interface I'm developing I enter query terms in UTF-8, which
> naturally do not match content in different encodings. Also, the results I
> see in my web app are not UTF-8 encoded as expected.
> 
> My question: is there any filter I can use to convert all content extracted
> by the handler to UTF-8 prior to indexing?
> 
> Does it make sense to write a filter which would convert tokens to UTF-8, or
> is it even possible with multiple encodings?
> 
> Thanks in advance.
> Ukyo

----
Teruhiko "Kuro" Kurosaka
RLP + Lucene & Solr = powerful search for global contents
