> I would be against such a move.  I think Lucene's core has too many 
> analyzers in it already, such as the German and Russian ones.  The core 
> could do without any of the concrete analyzers altogether, in my 
> opinion - but it is handy to have a few general purpose convenience 
> ones.
+1
> 
> What benefit, besides convenience, would there be in bringing 
> CJKAnalyzer into the core?  What about all the others in the sandbox?  
> If we bring one in, why not all of them?
But CJK text has no natural spaces between words, so bigram co-occurrence 
is the better approach to word discrimination. 
For example, if the term C1C2 is segmented into C1 and C2, the results will also 
contain C2C1, but in Chinese the words C1C2 and C2C1 may have different meanings.
Compared to the unigram-based tokenization implemented in StandardTokenizer, 
bigram-based tokens return MUCH better results.
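
To make the C1C2 / C2C1 point concrete, here is a minimal sketch in plain Java. It 
does not use the Lucene API or the actual CJKTokenizer code; the class and method 
names (BigramSketch, unigrams, bigrams, matchesAll) are made up for illustration, 
and the matching is a simplified "all query tokens appear in the document" check 
that ignores token positions. The sample words are 上海 ("Shanghai") and 海上 
("at sea").

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BigramSketch {
    // Split text into single-character tokens (unigrams).
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // Split text into overlapping two-character tokens (bigrams).
    // A single-character input yields that character as its only token.
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> out = new ArrayList<>();
        if (chars.size() == 1) {
            out.add(chars.get(0));
            return out;
        }
        for (int i = 0; i + 1 < chars.size(); i++) {
            out.add(chars.get(i) + chars.get(i + 1));
        }
        return out;
    }

    // Simplified match: true if every query token occurs somewhere in the
    // document tokens. Token order and position are deliberately ignored.
    static boolean matchesAll(List<String> queryTokens, List<String> docTokens) {
        Set<String> doc = new HashSet<>(docTokens);
        return doc.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        String query = "上海"; // C1C2
        String doc = "海上";   // C2C1, a different word

        // Unigrams: both strings reduce to the same two characters, so they match.
        System.out.println(matchesAll(unigrams(query), unigrams(doc))); // true

        // Bigrams: the token 上海 is not the token 海上, so there is no false match.
        System.out.println(matchesAll(bigrams(query), bigrams(doc)));   // false
    }
}
```

With position-insensitive matching, the unigram tokens cannot tell the two words 
apart, while the bigram tokens can.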

According to the feedback I have received on CJKTokenizer: 
for CJK users, the bigram-based CJKTokenizer is strongly recommended for better 
results.

For more:

Word Discrimination Based on Bigram Co-occurrences
... There is a match routine that detects any common segment between the target 
word and each of the ... The entries of the matrix indicate whether a reference 
word and a lexicon word share at least one n-gram ... It also shows the bigram 
match list for an unknown word generated by the feature-matching process ... 
www.ecse.rpi.edu/homepages/nagy/PDF_files/ElNasan-Nagy-ICDAR01.pdf 

Segmenting Chinese in Unicode
... However, to date no in-depth analysis has been performed analyzing the 
deficiencies in segmentation that lead to the improved performance of the simpler 
bigram methods. ... The part-of-speech of the segment and the ... A study on 
integrating Chinese word segmentation and part-of-speech tagging. ... 
www.basistech.com/papers/chinese/iuc-16-paper.pdf 

> 
> It has been brought up to bring in the SnowballAnalyzer - as it 
> actually is general purpose and spans many languages.  I'm not really 
> for bringing that one in either.
> 
> I'm but one voice and would not veto bringing in other analyzers, I 
> just don't think there is much benefit, especially if we improve the 
> release process to incorporate the sandbox goodies into a single 
> distribution but as separate JARs.
> 
> Erik
Thank you, Erik. I hope we can have more communication on this issue with other 
East Asian language users.

Che Dong
