Hi Tom,
As far as I know, the following are widely used, open-source Chinese
tokenizers:
* https://github.com/fxsjy/jieba
* http://sourceforge.net/projects/zpar/
* https://github.com/NLPchina/ansj_seg
And this proprietary one:
* http://ictclas.nlpir.org/
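For what it's worth, jieba (the first link above) is pip-installable and
has a very small API. A minimal usage sketch, assuming Python 3 and
`pip install jieba`:

    # Segment a Chinese sentence with jieba in its default (accurate) mode.
    import jieba

    sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"
    tokens = jieba.lcut(sentence)   # returns a plain list of token strings
    print(tokens)                   # ['我', '来到', '北京', '清华大学']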
(Disclaimer: I am one of the
Hi Tom,
There used to be a freely available Chinese word segmenter provided by
the LDC as well. Unfortunately, things keep disappearing from the web.
https://web.archive.org/web/20130907032401/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm
For Arabic, I think that many academic research groups
I'm looking for Chinese and Arabic tokenizers. We've been using
Stanford's for a while, but it has drawbacks. The Chinese mode loads its
statistical models very slowly, and the Arabic mode stems the resulting
tokens. The coup de grâce is that their latest jar update (9 days ago)
was compiled run