Hi all,
Thank you all for the tips. I am going with Stanford then.
I am currently producing a language model from Christian's raw
Chinese CommonCrawl data (www.statmt.org/ngrams). Once I am done I will
be happy to share back.
Best,
Marcin
On 20.03.2015 at 15:43, Tom Hoar wrote:
We also use the Stanford Segmenter most of the time, but have also
tried many others. Surprisingly, LDC's manseg also gives very good
results with SMT, and it is much faster to load than Stanford's.
As Ventzi commented, a segmenter's absolute accuracy relative to
the human interpretation of what constitutes a "word" is not the most
important factor when using it as a tokenizer for SMT. It is much more
important for the tool to give consistent co-occurrence results relative
to the paired language's tokens. In "from ZH" environments, the
segmented/tokenized form is never seen by humans. In "to ZH"
environments, the recaser/detokenizer step(s) can actually repair
errors and restore the string to what it should be.
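(A toy sketch of why consistency matters more than per-word accuracy: the data and the helper below are hypothetical, not taken from any of the tools discussed. If a segmenter always splits the same surface string the same way, the zh-en co-occurrence counts that alignment relies on pool onto one token; an inconsistent segmenter fragments that evidence across variants.)

```python
from collections import Counter
from itertools import product

def cooccurrence_counts(parallel_pairs):
    """Count zh/en token co-occurrences over pre-tokenized sentence pairs."""
    counts = Counter()
    for zh_tokens, en_tokens in parallel_pairs:
        for zh, en in product(zh_tokens, en_tokens):
            counts[(zh, en)] += 1
    return counts

# A consistent segmenter always emits 北京大学 as one token...
consistent = [
    (["北京大学", "很", "大"], ["Peking", "University", "is", "big"]),
    (["我", "在", "北京大学"], ["I", "am", "at", "Peking", "University"]),
]
# ...an inconsistent one sometimes splits it, scattering the evidence.
inconsistent = [
    (["北京大学", "很", "大"], ["Peking", "University", "is", "big"]),
    (["我", "在", "北京", "大学"], ["I", "am", "at", "Peking", "University"]),
]

c1 = cooccurrence_counts(consistent)
c2 = cooccurrence_counts(inconsistent)
print(c1[("北京大学", "University")])  # 2: evidence pools on one token
print(c2[("北京大学", "University")])  # 1: rest is split over 北京 / 大学
```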
@Ventzi, thanks for mentioning KyTea. We'll test & compare.
Tom
On 03/20/2015 08:43 PM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
Hi Marcin,
At Autodesk we’ve been successfully using KyTea since 2011. The main
reason we chose this specific tool is that it has readily available
models for both Chinese and Japanese, which simplified the
integration in our workflows.
At least for Japanese, we also evaluated Mecab in 2011, but found
KyTea to serve us better.
Keep in mind, though, that we are not very interested in the quality
of the segmentation per se; instead, we need the MT output to be of
sufficient quality, regardless of whether what the segmentation tool
does makes sense on its own.
Cheers,
Ventzi
–––––––
Dr. Ventsislav Zhechev
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies
Localisation Services
MAIN +41 32 723 91 22
FAX +41 32 723 93 99
http://VentsislavZhechev.eu
Autodesk, Inc.
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
www.autodesk.com
On 20.03.2015 at 14:32, moses-support-requ...@mit.edu wrote:
Date: Fri, 20 Mar 2015 13:19:02 +0100
From: Marcin Junczys-Dowmunt <junc...@amu.edu.pl>
Subject: [Moses-support] Chinese segmentation/tokenization
To: Moses Support <moses-support@mit.edu>
Message-ID: <e4d171cb90994cb853a9965facaeb...@amu.edu.pl>
Content-Type: text/plain; charset="us-ascii"
Hi,
questions appear from time to time on the list concerning Chinese
segmentation/tokenization. I saw Barry mention Lingpipe and other tools.
Is there a favourite tool you guys prefer to use over others?
Thanks,
Marcin
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support