Hi Tom,
As far as I know, the following are widely used, open-source Chinese
tokenizers:
* https://github.com/fxsjy/jieba
* http://sourceforge.net/projects/zpar/
* https://github.com/NLPchina/ansj_seg
And this proprietary one:
* http://ictclas.nlpir.org/
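For what it's worth, jieba (the first link above) is pip-installable and
has a very small API. A minimal usage sketch, assuming Python 3 and
`pip install jieba`:

    # Segment a Chinese sentence with jieba in its default (accurate) mode.
    import jieba

    sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"
    tokens = jieba.lcut(sentence)   # returns a plain list of token strings
    print(tokens)                   # ['我', '来到', '北京', '清华大学']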
(Disclaimer: I am one of the
Hi Tom,
There used to be a freely available Chinese word segmenter provided by
the LDC as well. Unfortunately, things keep disappearing from the web.
https://web.archive.org/web/20130907032401/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm
For Arabic, I think that many academic research groups
I'm looking for Chinese and Arabic tokenizers. We've been using
Stanford's for a while, but it has drawbacks. The Chinese mode loads its
statistical models very slowly, and the Arabic mode stems the resulting
tokens. The coup de grâce is that their latest jar update (9 days ago)
was compiled run