Hello community,

I am evaluating indexing strategies for CJK text. I am comparing
"unigram", "bigram", "unigram + bigram" and "word based" indexing.
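
To make clear what I mean by the first three strategies, here is a minimal
sketch in plain Java. It does not use Lucene and is not how any Lucene
analyzer is implemented; the class and method names are made up for
illustration, and it treats every code point as one "character" without
looking at the script.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
class CjkNgramSketch {

    // "unigram": one token per code point (surrogate pairs stay together).
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // "bigram": one token per pair of adjacent code points (overlapping).
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < chars.size(); i++) {
            out.add(chars.get(i) + chars.get(i + 1));
        }
        return out;
    }

    // "unigram + bigram": both token sets go into the same field.
    // Here the unigrams are simply listed first, then the bigrams.
    static List<String> unigramsAndBigrams(String text) {
        List<String> out = new ArrayList<>(unigrams(text));
        out.addAll(bigrams(text));
        return out;
    }

    public static void main(String[] args) {
        // U+6771 U+4EAC U+90FD, the three characters of "Tokyo-to"
        String text = "\u6771\u4EAC\u90FD";
        System.out.println(unigrams(text));
        System.out.println(bigrams(text));
        System.out.println(unigramsAndBigrams(text));
    }
}
```

For a three-character input this gives 3 unigrams, 2 bigrams, and 5 tokens
for the combined strategy. The sketch ignores edge cases such as a single
isolated character and mixed CJK and non-CJK text.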

1.
I used the StandardAnalyzer for "unigram". I think it works for Chinese, but
it does something different for Japanese and Korean. In Japanese some
characters get combined into one token, and for Korean it behaves like a
WhitespaceAnalyzer, right? Which analyzer would you recommend for "unigrams"
in Japanese and Korean? Is there a flag in the CJKAnalyzer to output
"unigrams" only?

2.
I used the CJKAnalyzer for "bigrams" and "unigrams + bigrams". I think it
works correctly, but I have some performance issues. The query time for
"unigram + bigram" is about 8-20 times higher than for "bigram" only. Any
ideas?

Thank you.