Hi all,
Thank you all for the tips. I am going with the Stanford Segmenter, then.

I am currently producing a language model from Christian's raw Chinese CommonCrawl data (www.statmt.org/ngrams). Once I am done, I will be happy to share it back.
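Roughly, the plan looks something like the sketch below (file names are placeholders, and the KenLM lmplz step and the CTB segmenter model are only illustrative choices on my side, not a fixed recipe):

import subprocess

RAW = "zh.commoncrawl.raw"        # placeholder: one sentence per line, UTF-8
SEGMENTED = "zh.commoncrawl.seg"
ARPA = "zh.commoncrawl.arpa"

# Stanford Segmenter CLI: segment.sh <ctb|pku> <file> <encoding> <n-best>
with open(SEGMENTED, "w", encoding="utf-8") as out:
    subprocess.run(["./segment.sh", "ctb", RAW, "UTF-8", "0"],
                   stdout=out, check=True)

# Estimate a 5-gram language model with KenLM's lmplz on the segmented text
with open(SEGMENTED, encoding="utf-8") as inp, open(ARPA, "w") as out:
    subprocess.run(["lmplz", "-o", "5"], stdin=inp, stdout=out, check=True)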
Best,
Marcin

On 20.03.2015 at 15:43, Tom Hoar wrote:
We also use the Stanford Segmenter most of the time, but we have tried many others. Surprisingly, LDC's manseg also gives very good results with SMT, and it is much faster to load than Stanford's.

As Ventzi notes, a segmenter's absolute accuracy relative to the human interpretation of what counts as a "word" is not the most important factor when using it as a tokenizer for SMT. It is much more important for the tool to produce consistent co-occurrence results relative to the paired-language tokens. In "from ZH" environments, the segmented/tokenized form is never seen by humans. In "to ZH" environments, the recaser/detokenizer step(s) can actually repair errors and restore the string to what it should be.
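To illustrate the "to ZH" side, here is a minimal Python sketch of the space-stripping idea (just the principle, not Moses's actual detokenizer): drop the spaces the segmenter introduced between CJK characters and keep them only between Latin or numeric tokens.

def is_cjk(ch):
    # CJK Unified Ideographs block; good enough for a sketch
    return "\u4e00" <= ch <= "\u9fff"

def detokenize_zh(tokens):
    out = []
    for tok in tokens:
        # keep a space only when neither neighbouring character is CJK
        if out and not (is_cjk(out[-1][-1]) or is_cjk(tok[0])):
            out.append(" ")
        out.append(tok)
    return "".join(out)

print(detokenize_zh("我们 使用 Moses 3.0 进行 翻译 。".split()))
# -> 我们使用Moses 3.0进行翻译。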

@Ventzi, thanks for mentioning KyTea. We'll test and compare.

Tom



On 03/20/2015 08:43 PM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
Hi Marcin,

At Autodesk we’ve been successfully using KyTea since 2011. The main reason we chose this particular tool is that it comes with readily available models for both Chinese and Japanese, which simplified integration into our workflows. At least for Japanese, we also evaluated MeCab in 2011, but found that KyTea served us better.

Keep in mind, though, that we are not particularly interested in the quality of the segmentation per se; what we need is MT output of sufficient quality, regardless of whether the segmentation tool's output makes sense on its own.
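In case it helps, a minimal Python sketch of piping text through the KyTea command line (the model path is a placeholder; the -notags flag, which drops the tag output and keeps only the segmentation, is per the KyTea documentation, so check your build if the flags differ):

import subprocess

MODEL = "models/zh-segmentation.mod"   # placeholder: a downloaded KyTea Chinese model

def segment(lines):
    # -notags: output word boundaries only, no POS/pronunciation tags
    proc = subprocess.run(["kytea", "-model", MODEL, "-notags"],
                          input="\n".join(lines) + "\n",
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

print(segment(["这是一个测试句子。"]))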


Cheers,

Ventzi

–––––––
Dr. Ventsislav Zhechev
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies
Localisation Services

MAIN +41 32 723 91 22
FAX +41 32 723 93 99

http://VentsislavZhechev.eu

Autodesk, Inc.
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
www.autodesk.com




On 20.03.2015, at 14:32, moses-support-requ...@mit.edu wrote:

Date: Fri, 20 Mar 2015 13:19:02 +0100
From: Marcin Junczys-Dowmunt <junc...@amu.edu.pl>
Subject: [Moses-support] Chinese segmentation/tokenization
To: Moses Support <moses-support@mit.edu>
Message-ID: <e4d171cb90994cb853a9965facaeb...@amu.edu.pl>
Content-Type: text/plain; charset="us-ascii"



Hi,

Questions about Chinese segmentation/tokenization come up on the list from time to time. I saw Barry mention LingPipe and other tools. Is there a favourite tool you prefer over the others?

Thanks,

Marcin



_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
