Re: UTF-8 indexing and searching

2005-07-01 Thread Paul Libbrecht
Be careful: in the HTTP world there's an ambiguity: x-www-form-urlencoded does not specify the character encoding that the bytes represented in the %-escaped sequences are written with. That's fixed by the very recent URI spec, where absence means UTF-8... My experience was that Tomcat simply con
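
To make the ambiguity concrete, here is a minimal Java sketch (not taken from the thread; the class name and sample string are made up for illustration). It decodes the same %-escaped bytes once as UTF-8 and once as ISO-8859-1, using only the standard java.net.URLDecoder:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: the same escaped bytes yield different text
// depending on which charset the decoder assumes.
final class PercentDecodeDemo {

    // Decode a %-escaped string, assuming the escaped bytes are in `charset`.
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "caf%C3%A9"; // the UTF-8 bytes of an e-acute, %-escaped
        System.out.println(decode(escaped, "UTF-8"));      // one accented character
        System.out.println(decode(escaped, "ISO-8859-1")); // two characters of mojibake
    }
}
```

Nothing in the escaped string itself says which reading is right, so the client and the server have to agree on the charset by some other means.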

Re: UTF-8 indexing and searching

2005-07-01 Thread pierre.conti
Did you check that the request string you get at the analyzer level is correctly encoded as UTF-8? We had the same problem with French accented characters encoded as UTF-8 and transmitted by Tomcat as ISO-8859-1. It was just for a test, so we didn't investigate much, but re-encode in URL/ISO-8
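
A common workaround for this kind of mix-up, sketched here under the assumption that the container decoded UTF-8 bytes as ISO-8859-1 (the class and method names are hypothetical, and this is not necessarily what the poster did), is to recover the raw bytes and decode them again as UTF-8:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undo a wrong ISO-8859-1 decoding of UTF-8 bytes.
final class Reencode {

    // ISO-8859-1 maps every byte to exactly one char, so getBytes() gives
    // back the original bytes, which can then be decoded as UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromContainer = "caf\u00c3\u00a9"; // what a mis-decoded parameter looks like
        System.out.println(latin1ToUtf8(fromContainer));
    }
}
```

This only helps when the string really was mis-decoded; applying it to text that is already correct will damage any non-ASCII characters, so it is worth checking the query string before it reaches the analyzer.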

UTF-8 indexing and searching

2005-07-01 Thread Faulkner, Jeffrey
I'm trying to index and search HTML and JSP files that are saved using UTF-8 encoding. The pages are indexed on the file system using the StandardAnalyzer. The files can contain a mix of English, Chinese, Japanese, etc. saved as UTF-8. Searches using English terms are successful but none of the