Re: Hit highlighting for non-english unicode index/queries not working?

2009-05-26 Thread Erick Erickson
LowerCaseFilter is part of Lucene, as are any number of other filters. The basic idea is just that *after* tokenization, there may be further transformations you want to do on each token, such as lower-casing it, stemming it, or skipping it. But watch out a bit: there are token Filters and search
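
The tokenize-then-filter idea described above can be sketched in plain Java. This is only an illustration of the concept, not Lucene's API: the class name TokenPipelineSketch, its methods, and the stop-word list are all hypothetical, and a real analyzer would chain Lucene's own tokenizer and filter classes instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenPipelineSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Step 1: tokenization. Split on whitespace only, nothing else.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Step 2: per-token filters, applied after tokenization.
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lowered = t.toLowerCase(Locale.ROOT); // lower-casing
            if (STOP_WORDS.contains(lowered)) continue;  // skipping
            out.add(lowered);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(filter(tokenize("The Evolution of Lucene")));
    }
}
```

Keeping the two steps separate is the point: the tokenizer decides where tokens begin and end, and each filter only transforms or drops tokens it is handed.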

Re: Hit highlighting for non-english unicode index/queries not working?

2009-05-26 Thread KK
Thank you Erick. As of now I'm using WhitespaceAnalyzer with no stemming and no stop word removal. Now I feel writing a simple analyzer won't be that difficult after going through your mail. I'll give it a try. I don't have any idea on filters but I'm pretty sure it must be simple and will definitely go thr

Re: Hit highlighting for non-english unicode index/queries not working?

2009-05-26 Thread Erick Erickson
It's fairly easy to construct your own analyzer by stringing together some filters and tokenizers. LIA (1st ed) had a SynonymAnalyzer. You probably want something like (WARNING, example only, I'm not even sure it compiles!! Ripped off from the WIKI) public class MyAnalyzer extends Analyzer { p

Re: Hit highlighting for non-english unicode index/queries not working?

2009-05-26 Thread KK
Thank you @Muir. I was earlier using SimpleAnalyzer for all purposes, but as you recommended the whitespace one, I tried that analyzer, and the good thing is that I'm able to index/search non-English text as well as support hit highlighting for these non-English texts. Thank you very much. B

Re: Hit highlighting for non-english unicode index/queries not working?

2009-05-25 Thread Robert Muir
As mentioned previously, I don't think your text is being analyzed the way you want. SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE (பரிணாம) into 3 tokens: \u0BAA\u0BB0 \u0BA3 \u0BAE Not only does it incorrectly split your word into three words, but it completely drops t
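
The three-token split described above can be reproduced in plain Java without Lucene. This is a rough approximation, assuming a letter-based tokenizer keeps only characters for which Character.isLetter is true; the class LetterSplitDemo and its methods are hypothetical names, not Lucene code. The Tamil vowel signs \u0BBF and \u0BBE are combining marks rather than letters, so a letter-based split breaks the word at each of them and discards the marks.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitDemo {
    // Approximates a letter-based tokenizer: a token is a maximal run of
    // characters for which Character.isLetter returns true.
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-letter (here: a combining vowel sign) ends the token
                // and is itself thrown away.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // Whitespace-only splitting leaves combining marks attached to the word.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE";
        System.out.println(splitOnNonLetters(word).size()); // 3 tokens
        System.out.println(splitOnWhitespace(word).size()); // 1 token
    }
}
```

Whitespace splitting keeps the word intact, which fits the report elsewhere in this thread that switching to the whitespace analyzer made search and highlighting work for this text.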

Re: Hit highlighting for non-english unicode index/queries not working?

2009-05-25 Thread Michael McCandless
Could you boil down this example to a smaller test case that fails? E.g., make a RAMDir, index one document (that should show highlighting), search it, run highlight and show that it's not working? Mike On Mon, May 25, 2009 at 10:02 AM, KK wrote: > Hi, > I'm trying to index some non-english texts. I

Hit highlighting for non-english unicode index/queries not working?

2009-05-25 Thread KK
Hi, I'm trying to index some non-English texts. Indexing and searching are working fine. From the command line I'm able to provide the UTF-8 text as Unicode escapes like this, \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE and am able to get the search results. Then I tried to add hit highlighting for the same. So I