Hi Steven,

Thank you very much for your answer. I tested with the StandardAnalyzer and it really does tokenize the text ideograph by ideograph. Maybe, as Samir says in his mail, this is not convenient for people who use a CJK language, because too many documents will match. But the thing is that in this case (when using StandardAnalyzer) the range searches work correctly; I tested it. The logic is the same as in English range searches: if in English you had the word "brown" and some tokenizer tokenized it letter by letter, into 'b' 'r' 'o' 'w' 'n', you could still search with bounds of more than one character. For example, with the search content:[aaa TO ccc], the token 'b' would be found. Of course, for letter-based languages it makes no sense to tokenize letter by letter. But in CJK, as far as I know, single ideographs are in a great number of cases separate words, or even groups of words.

I tested range searches on Chinese text indexed with the StandardAnalyzer, and everything in this context is OK. The searches:

content:[\u0E80 TO 的\u0E80]
content:[\u0E80\u0E80 TO 的\u0E80]
content:[\u0E80\u0E80\u0E80 TO 的\u0E80\u0E80]

not only work but return the same result set as:

content:[\u0E80 TO 的]

Here \u0E80 is the first ideograph of the CJK Unicode code points and 的 is an ideograph that occurs in some of the text files.
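For anyone who wants to reproduce the behaviour, here is a minimal, self-contained sketch of the kind of test I ran, written against the Lucene 2.2-era API (the indexed text, the field name "content", and the class name are illustrative only, not my actual test data):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.store.RAMDirectory;

public class CjkRangeSearchSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();

        // StandardAnalyzer emits one token per CJK ideograph, so the
        // two-ideograph word below is indexed as two single-character
        // terms: \u7684 (的) and \u786E (确).
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("content", "\u7684\u786E",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);

        // Single-character bounds: content:[\u0E80 TO 的]
        RangeQuery shortBounds = new RangeQuery(
                new Term("content", "\u0E80"),
                new Term("content", "\u7684"), true);

        // Multi-character bounds: content:[\u0E80\u0E80 TO 的\u0E80]
        RangeQuery longBounds = new RangeQuery(
                new Term("content", "\u0E80\u0E80"),
                new Term("content", "\u7684\u0E80"), true);

        Hits h1 = searcher.search(shortBounds);
        Hits h2 = searcher.search(longBounds);
        System.out.println(h1.length() + " == " + h2.length()); // 1 == 1
        searcher.close();
    }
}

Both queries return the same single hit because every indexed term is exactly one ideograph long: the comparison against the multi-character bounds is decided by the first character, so the longer bounds select the same slice of the term dictionary.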
All of this of course also works with the CJKAnalyzer. But with the StandardAnalyzer, I think, the case that I described in my previous mail will be avoided. So I know range searches are a bit slower, but I am just fulfilling the requirements of our customers. They will decide whether range searches are convenient or not and which Analyzer will help them better.

Thanks once again :)

Best Regards,
Ivan

Steven Rowe wrote:
> Hi Ivan,
>
> Ivan Vasilev wrote:
>
>> But how to understand the meaning of this: "To overcome this, you
>> have to index chinese characters as single tokens (this will increase
>> recall, but decrease precision)."
>>
>> I understand it so: To increase the results I have to use instead of
>> the Chinese another analyzer that makes tokenization of the text
>> character by character.
>
> StandardTokenizer[1] produces single-character tokens for Chinese
> ideographs and Japanese kana.
>
> However, AFAIK, you will no longer be able to perform range searches
> like [AG TO PQ], because the terms "AG" and "PQ" will not be present in
> the index. [A TO P] should work, but I don't know how useful the
> results would be, since this would match all words that contain the
> ideographs [A TO P], not just those that start with them. (Note that
> this is also the case with the bigram tokens produced by CJKAnalyzer.)
>
> By the way, what is the use case for matching a range of words? Doesn't
> exposing this kind of functionality cause performance concerns?
>
> Steve
>
> [1] Lucene's StandardTokenizer API doc:
> <http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
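P.S. A small hypothetical sketch of the precision point Steve makes above (same Lucene 2.2-era API as before; the two words and the field name are made up for illustration): a range over single-character terms matches words that merely contain an in-range ideograph, not only words that start with one.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.store.RAMDirectory;

public class CjkRangePrecisionSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // \u7684\u786E (的确, "indeed") starts with 的;
        // \u76EE\u7684 (目的, "purpose") ends with it.
        for (String word : new String[] { "\u7684\u786E", "\u76EE\u7684" }) {
            Document doc = new Document();
            doc.add(new Field("content", word,
                              Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();

        // After per-ideograph tokenization both documents contain the
        // term 的, so the one-character range [的 TO 的] matches both:
        // the word-level notion of "starts with" is lost.
        IndexSearcher searcher = new IndexSearcher(dir);
        RangeQuery q = new RangeQuery(new Term("content", "\u7684"),
                                      new Term("content", "\u7684"), true);
        System.out.println(searcher.search(q).length()); // prints 2
        searcher.close();
    }
}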