I suppose you mean Extract_ing_RequestHandler. Out of curiosity, I sent in a Japanese HTML file encoded in EUC-JP, and it was converted to Unicode properly; the index contains the correct Japanese words.
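To illustrate why a correctly decoded document should match a UTF-8 query, here is a small Python sketch. It is not what the handler does internally; the function name to_unicode and the sample word are made up for the example, and it assumes the charset of each input is already known.

```python
# Hypothetical illustration only; not the handler's actual code.
def to_unicode(raw: bytes, charset: str) -> str:
    """Decode raw bytes into a Unicode string using the declared charset."""
    return raw.decode(charset)

word = "検索"  # sample Japanese word ("search"), chosen for this example

euc_jp_bytes = word.encode("euc_jp")     # bytes as they would sit in an EUC-JP HTML file
utf8_query_bytes = word.encode("utf-8")  # bytes as a UTF-8 web form would send them

# The raw bytes differ between the two encodings...
bytes_differ = euc_jp_bytes != utf8_query_bytes

# ...but once each side is decoded with its own charset, the strings are equal.
indexed_term = to_unicode(euc_jp_bytes, "euc_jp")
query_term = to_unicode(utf8_query_bytes, "utf-8")
terms_match = indexed_term == query_term

print(bytes_differ, terms_match)
```

So if the terms do not match in your setup, I would suspect that one side is being decoded with the wrong charset, rather than that the index needs a separate conversion to UTF-8.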
Do your HTML files have a META tag for Content-Type whose value includes charset= ? For example, this is what I have:

<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />

On Mar 21, 2010, at 9:45 AM, Ukyo Virgden wrote:

> Hi,
>
> I'm trying to index HTML documents with different encodings. My html are
> either in win-12XX, ISO-8859-X or UTF8 encoding. handler correctly parses
> all html in their respective encodings and indexes. However on the web
> interface I'm developing I enter query terms in UTF-8 which naturally does
> not match with content with different encodings. Also the results I see on
> my web app is not utf8 encoded as expected.
>
> My question, is there any filter I can use to convert all content extracted
> by the handler to UTF-8 prior to indexing?
>
> Does it make sense to write a filter which would convert tokens to UTF-8, or
> even is it possible with multiple encodings?
>
> Thanks in advance.
> Ukyo

----
Teruhiko "Kuro" Kurosaka
RLP + Lucene & Solr = powerful search for global contents