Hi all,
Thank you all for the tips. I am going with the Stanford Segmenter, then.

I am currently producing a language model from Christian's raw Chinese CommonCrawl data (www.statmt.org/ngrams). Once I am done, I will be happy to share it back.
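Roughly, the plan looks something like the sketch below (file names are placeholders, and the KenLM lmplz step and the CTB segmenter model are only illustrative choices on my side, not a fixed recipe):

import subprocess

RAW = "zh.commoncrawl.raw"        # placeholder: one sentence per line, UTF-8
SEGMENTED = "zh.commoncrawl.seg"
ARPA = "zh.commoncrawl.arpa"

# Stanford Segmenter CLI: segment.sh <ctb|pku> <file> <encoding> <n-best>
with open(SEGMENTED, "w", encoding="utf-8") as out:
    subprocess.run(["./segment.sh", "ctb", RAW, "UTF-8", "0"],
                   stdout=out, check=True)

# Estimate a 5-gram language model with KenLM's lmplz on the segmented text
with open(SEGMENTED, encoding="utf-8") as inp, open(ARPA, "w") as out:
    subprocess.run(["lmplz", "-o", "5"], stdin=inp, stdout=out, check=True)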
Best,
Marcin

On 20.03.2015 at 15:43, Tom Hoar wrote:
We also use the Stanford Segmenter most of the time, but we have tried many others. Surprisingly, LDC's manseg also gives very good results with SMT, and it is much faster to load than Stanford's.

As Ventzi notes, a segmenter's absolute accuracy relative to the human interpretation of what counts as a "word" is not the most important factor when using it as a tokenizer for SMT. It is much more important for the tool to produce consistent co-occurrence results relative to the paired-language tokens. In "from ZH" environments, the segmented/tokenized form is never seen by humans. In "to ZH" environments, the recaser/detokenizer step(s) can actually repair errors and restore the string to what it should be.
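To illustrate the "to ZH" side, here is a minimal Python sketch of the space-stripping idea (just the principle, not Moses's actual detokenizer): drop the spaces the segmenter introduced between CJK characters and keep them only between Latin or numeric tokens.

def is_cjk(ch):
    # CJK Unified Ideographs block; good enough for a sketch
    return "\u4e00" <= ch <= "\u9fff"

def detokenize_zh(tokens):
    out = []
    for tok in tokens:
        # keep a space only when neither neighbouring character is CJK
        if out and not (is_cjk(out[-1][-1]) or is_cjk(tok[0])):
            out.append(" ")
        out.append(tok)
    return "".join(out)

print(detokenize_zh("我们 使用 Moses 3.0 进行 翻译 。".split()))
# -> 我们使用Moses 3.0进行翻译。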

@Ventzi, thanks for mentioning KyTea. We'll test and compare.

Tom



On 03/20/2015 08:43 PM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
Hi Marcin,

At Autodesk we’ve been successfully using KyTea since 2011. The main reason we chose this particular tool is that it comes with readily available models for both Chinese and Japanese, which simplified integration into our workflows. At least for Japanese, we also evaluated MeCab in 2011, but found that KyTea served us better.

Keep in mind, though, that we are not particularly interested in the quality of the segmentation per se; what we need is MT output of sufficient quality, regardless of whether the segmentation tool's output makes sense on its own.
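In case it helps, a minimal Python sketch of piping text through the KyTea command line (the model path is a placeholder; the -notags flag, which drops the tag output and keeps only the segmentation, is per the KyTea documentation, so check your build if the flags differ):

import subprocess

MODEL = "models/zh-segmentation.mod"   # placeholder: a downloaded KyTea Chinese model

def segment(lines):
    # -notags: output word boundaries only, no POS/pronunciation tags
    proc = subprocess.run(["kytea", "-model", MODEL, "-notags"],
                          input="\n".join(lines) + "\n",
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

print(segment(["这是一个测试句子。"]))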


Cheers,

Ventzi

–––––––
Dr. Ventsislav Zhechev
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies
Localisation Services

MAIN +41 32 723 91 22
FAX +41 32 723 93 99

http://VentsislavZhechev.eu

Autodesk, Inc.
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
www.autodesk.com




On 20.03.2015, at 14:32, moses-support-requ...@mit.edu wrote:

Date: Fri, 20 Mar 2015 13:19:02 +0100
From: Marcin Junczys-Dowmunt <junc...@amu.edu.pl>
Subject: [Moses-support] Chinese segmentation/tokenization
To: Moses Support <moses-support@mit.edu>
Message-ID: <e4d171cb90994cb853a9965facaeb...@amu.edu.pl>
Content-Type: text/plain; charset="us-ascii"



Hi,

Questions about Chinese segmentation/tokenization come up on the list from time to time. I saw Barry mention LingPipe and other tools. Is there a favourite tool you prefer over the others?

Thanks,

Marcin



_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
