Philipp Chudinov wrote:

>Hi!
>I am trying to use Lucene with Russian texts. I created an index of XML
>documents (UTF-8 encoded), but when I try to search the index with a
>query from a servlet, it seems that Lucene just finds nothing (though I am
>SURE it MUST find a term). The search string is re-encoded to UTF-8 too, so I do
>not know what to do... If I search this index with English letters, it
>works as it should (there are mixed characters in the XML files). Could anybody
>help me? Please note, I use the latest nightly build (12 Nov) - it claims to
>have non-ASCII search ability :(
>
The problem probably lies in the QueryParser class, which seems to keep only
the low-order byte of each character given in the query. I had a
very similar problem when querying for Polish strings, since they contain
characters that take two bytes in UTF-8. In addition, the
characters of the Polish alphabet were not included in the
grammar definition that the query parser accepted.
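
To illustrate the effect, here is a minimal sketch of what happens when only
the low byte of each character is kept. It is not the QueryParser code; the
class LowByteDemo and its methods are hypothetical names made up for this
example, and the truncation step is an assumption about the parser's behavior.

```java
import java.nio.charset.StandardCharsets;

public class LowByteDemo {
    // Keep only the low-order 8 bits of a 16-bit Java char
    // (assumed to mimic what a byte-oriented lexer would do).
    static char lowByte(char c) {
        return (char) (c & 0xFF);
    }

    // Apply the truncation to every char of the string.
    static String truncate(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(lowByte(s.charAt(i)));
        }
        return sb.toString();
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0142 is LATIN SMALL LETTER L WITH STROKE (Polish).
        System.out.println(utf8Length("\u0142"));        // two bytes in UTF-8
        System.out.println((int) lowByte('\u0142'));     // 0x42, the code of 'B'
        // U+0420 is CYRILLIC CAPITAL LETTER ER (Russian).
        System.out.println((int) lowByte('\u0420'));     // 0x20, the code of a space
    }
}
```

Under this assumption the Polish letter turns into an unrelated ASCII letter,
and the Russian letter turns into a space, so the term no longer matches
anything in the index.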

The solution I developed was to modify the original grammar
definition so that it accepts non-English characters. In addition, the
term delimiter list in the grammar must not contain the low-order bytes
of the accepted characters, or those characters will be split as delimiters.
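
As a rough sketch of the idea (not the modified grammar itself, which is a
parser grammar rather than plain Java), the check below classifies each
character by its full 16-bit value. The class TermCharSketch, its methods, and
the delimiter list are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TermCharSketch {
    // Hypothetical delimiter set; the real grammar's list may differ.
    private static final String DELIMITERS = " \t\r\n\"()+-*:";

    // A term character is any Unicode letter or digit that is not a delimiter.
    // The whole char is compared, never just its low byte.
    static boolean isTermChar(char c) {
        return DELIMITERS.indexOf(c) < 0 && Character.isLetterOrDigit(c);
    }

    // Split a query into terms at every non-term character.
    static List<String> splitTerms(String query) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (isTermChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // A Russian word followed by a Polish word, written with escapes
        // so the source file stays pure ASCII.
        String query = "\u0420\u0443\u0441\u044c \u0142\u00f3d\u017a";
        System.out.println(splitTerms(query).size());
    }
}
```

With this kind of check, U+0420 stays part of its term even though its low
byte equals the space character.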

This was posted as a bug report on the old Lucene site, but when the
search engine moved to Jakarta, all the old bug reports were lost.

If you want the modified grammar as an example, or if you want further 
guidance, feel free to mail me at: [EMAIL PROTECTED]

Andrzej Jarmoniuk
Internet Developer
E-Point S.A.

