Hi All,
I am trying to build an Urdu SMT system using Moses. I have an Urdu parallel
corpus, and the first step in the manual is to tokenize it, but when I enter
the following command:
~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur
~/SMT/corpus/training/mycorpus.ur-en.ur
The default tokenizer script only knows specific rules for a few languages. The
fallback (English) rules may suffice for your purposes: they do the obvious
thing with spaces and English punctuation, and also handle some special cases
for abbreviations such as Mr. and Mrs.
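To make the fallback behaviour described above concrete, here is a minimal Python sketch of that style of tokenization. It is not the Moses implementation, whose rules are more extensive; the function name simple_tokenize and the three-entry prefix set are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, tiny subset of an English nonbreaking-prefix list;
# a real list would be much longer.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr"}


def simple_tokenize(line, prefixes=NONBREAKING_PREFIXES):
    """Simplified illustration of English-style tokenization (not the Moses code)."""
    # Put spaces around common ASCII punctuation, leaving periods for later.
    line = re.sub(r'([,!?;:()"])', r" \1 ", line)
    words = line.split()
    tokens = []
    for i, word in enumerate(words):
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            is_last = i == len(words) - 1
            # Keep the period attached for known abbreviations that are
            # not the final word; otherwise split it off as its own token.
            if stem in prefixes and not is_last:
                tokens.append(word)
            else:
                tokens.extend([stem, "."])
        else:
            tokens.append(word)
    return " ".join(tokens)


print(simple_tokenize("Mr. Smith arrived, finally."))
```

Note that this sketch only looks at ASCII punctuation and an English abbreviation list, which is why rules of this kind cannot be assumed to fit another language.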
I'd suggest you
The output will be tokenized, but probably very badly. If you know Urdu and
can create a better tokenizer, please add it to Moses.
You can start by looking at the configuration file for the English
tokenizer in
scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
You can copy that and
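For anyone looking at that file before adapting it, the sketch below shows one way to read its format in Python. It assumes the layout is one prefix per line, lines starting with # are comments, and a trailing #NUMERIC_ONLY# marks prefixes that are nonbreaking only before a number; check this against the actual file. The function name and the sample text are hypothetical.

```python
NUMERIC_MARKER = "#NUMERIC_ONLY#"


def parse_nonbreaking_prefixes(text):
    """Split prefix-file text into (always-nonbreaking, numeric-only) sets."""
    always, numeric_only = set(), set()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        if line.endswith(NUMERIC_MARKER):
            # Entry applies only when the next token is a number.
            numeric_only.add(line[: -len(NUMERIC_MARKER)].strip())
        else:
            always.add(line)
    return always, numeric_only


# Hypothetical sample in the assumed format, not the real file contents.
SAMPLE = """\
# Titles that should not end a sentence
Mr
Mrs
No #NUMERIC_ONLY#
"""

always, numeric_only = parse_nonbreaking_prefixes(SAMPLE)
print(sorted(always), sorted(numeric_only))
```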
And what about the truecaser and cleaning? Will I have to create those for
Urdu as well?
Regards
Asad A.Malik
Sent from my iPod
On Dec 27, 2013, at 9:07 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:
The output will be tokenized, but probably very badly. If you know Urdu and
can create a better
Nope, just the tokenizer.
On 27 December 2013 18:21, Asad asad_12...@yahoo.com wrote:
And what about truecaser and cleaning??? Will I have to create that also
for urdu?