Philipp Chudinov wrote:

> Hi!
> I am trying to use Lucene with Russian texts. I created an index of XML
> documents (UTF-8 encoded), but when I try to search the index with a
> query from a servlet, it seems that Lucene just finds nothing (though I am
> SURE it MUST find a term). The search string is re-encoded to UTF-8 too, so I
> do not know what to do... If I search this index with English letters, it
> works as it should (there are mixed characters in the XML files). Could anybody
> help me? Please note, I use the latest nightly build (12 Nov) - it claims to
> have non-ASCII search ability :(

The problem probably lies in the QueryParser class, which keeps only the low-order byte of each character given in the query. I had a very similar problem when querying for Polish strings, since they contain characters that are encoded as two bytes in UTF-8. In addition, the characters specific to the Polish alphabet were not covered by the grammar definition that the query parser accepted.
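
To make the effect concrete, here is a minimal Java sketch of the kind of truncation I mean. It is not the QueryParser's actual code: the class LowByteDemo and the method truncateToLowByte are hypothetical names made up for this illustration, and it assumes the damage amounts to keeping only the low-order 8 bits of each character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is NOT code taken from Lucene.
class LowByteDemo {

    // Keeps only the low-order 8 bits of every UTF-16 code unit,
    // which is the assumed failure mode described in this message.
    static String truncateToLowByte(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append((char) (s.charAt(i) & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Polish word and a Russian word, written with Unicode escapes
        // so the source file itself stays plain ASCII.
        String polish = "\u017C\u00F3\u0142w";
        String russian = "\u043C\u0438\u0440";

        // Each of these letters needs more than one byte in UTF-8.
        int polishBytes = polish.getBytes(StandardCharsets.UTF_8).length;
        int russianBytes = russian.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("UTF-8 lengths: " + polishBytes + ", " + russianBytes);

        // After truncation the terms no longer match what was indexed.
        System.out.println(truncateToLowByte(polish));
        System.out.println(truncateToLowByte(russian));
    }
}
```

With this truncation, U+017C loses its high byte and becomes 0x7C, the '|' character, and U+0142 becomes 0x42, the letter 'B'. A truncated letter can therefore land on a character that the grammar treats specially, which is why the low-order bytes matter for the delimiter list as well.
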
The solution I developed was to modify the original grammar definition so that it accepts non-English characters. In addition, the grammar's list of term delimiters must not contain the low-order bytes of the accepted characters. This was posted as a bug report on the old Lucene site, but when the search engine moved to Jakarta, all the old bug reports were lost. If you want the modified grammar as an example, or if you want further guidance, feel free to mail me at: [EMAIL PROTECTED]

Andrzej Jarmoniuk
Internet Developer
E-Point S.A.