Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Hieu Hoang
nope, just the tokenizer

On 27 December 2013 18:21, Asad wrote:
> And what about truecaser and cleaning??? Will I have to create that also
> for urdu?
>
> Regards
> Asad A.Malik
>
> Sent from my iPod
>
> On Dec 27, 2013, at 9:07 PM, Hieu Hoang wrote:
>
> The output will be tokenized, but proba

Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Asad
And what about the truecaser and cleaning? Will I have to create those for Urdu as well?

Regards
Asad A.Malik

Sent from my iPod

On Dec 27, 2013, at 9:07 PM, Hieu Hoang wrote:
> The output will be tokenized, but probably very badly. If you know Urdu and
> can create a better tokenizer, please add i

Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Hieu Hoang
The output will be tokenized, but probably very badly. If you know Urdu and can create a better tokenizer, please add it to Moses. You can start by looking at the configuration file for the English tokenizer in
scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
You can copy that and chang
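For anyone unsure what such a prefix file contains, here is a minimal Python sketch of how the format could be read. It assumes the layout seen in the English file: one prefix per line, lines starting with `#` treated as comments, and an optional `#NUMERIC_ONLY#` marker after a prefix that is only nonbreaking before a number. The function name `load_nonbreaking_prefixes` and the sample entries are hypothetical, and this is not the code Moses itself uses.

```python
def load_nonbreaking_prefixes(lines):
    """Parse prefix-file lines into two sets (illustrative sketch only).

    Returns (always, numeric_only):
      always       - prefixes whose trailing period never ends a sentence
      numeric_only - prefixes that are nonbreaking only before a number
    """
    marker = "#NUMERIC_ONLY#"
    always, numeric_only = set(), set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        if line.endswith(marker):
            numeric_only.add(line[: -len(marker)].strip())
        else:
            always.add(line)
    return always, numeric_only


# Hypothetical sample content, not copied from any real Moses file.
sample = """\
# Titles that should not end a sentence
Mr
Mrs

# Only nonbreaking when followed by a number
No #NUMERIC_ONLY#
""".splitlines()

always, numeric_only = load_nonbreaking_prefixes(sample)
```

A file for a new language would presumably list that language's own abbreviations in the same layout.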

Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread John D. Burger
The default tokenizer script only knows specific rules for a few languages. The fallback (English) rules may suffice for your purposes: they do the obvious thing with spaces and English punctuation, and also handle some special cases for abbreviations like "Mr." and "Mrs.". I'd suggest you eye
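To make the fallback behaviour described above concrete, here is a rough Python sketch of that kind of rule set: split on spaces, separate common English punctuation, and keep the period attached to known abbreviations. It is a deliberately simplified illustration, not the actual logic of tokenizer.perl, and the names `simple_tokenize` and `NONBREAKING_PREFIXES` are made up for this example.

```python
import re

# Hypothetical, tiny stand-in for an abbreviation list; a real list would
# be much longer.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr"}


def simple_tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Whitespace split plus basic English punctuation handling."""
    tokens = []
    for word in text.split():
        # Put spaces around punctuation other than the period.
        word = re.sub(r'([,!?;:()"])', r" \1 ", word)
        for piece in word.split():
            if piece.endswith(".") and len(piece) > 1:
                stem = piece[:-1]
                if stem in prefixes:
                    # Known abbreviation: keep "Mr." as a single token.
                    tokens.append(piece)
                else:
                    # Otherwise treat the period as its own token.
                    tokens.extend([stem, "."])
            else:
                tokens.append(piece)
    return tokens


result = simple_tokenize("Mr. Smith arrived, finally.")
# result == ['Mr.', 'Smith', 'arrived', ',', 'finally', '.']
```

Rules like these only know about English punctuation and abbreviations, so it is worth checking the output on a sample of the actual corpus before relying on it.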

[Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Asad A.Malik
Hi All,

I am trying to develop an Urdu SMT system using Moses. I have an Urdu parallel corpus, and the first step in the manual is to tokenize the corpus, but when I enter the following command:

~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur < ~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus