[Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Asad A.Malik
Hi all, I am trying to develop an Urdu SMT system using Moses. I have an Urdu parallel corpus, and the first step in the manual is to tokenize the corpus, but when I enter the following command: ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur ~/SMT/corpus/training/mycorpus.ur-en.ur

Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread John D. Burger
The default tokenizer script only has specific rules for a few languages. The fallback (English) rules may suffice for your purposes: they do the obvious thing with spaces and English punctuation, and also handle some special cases for abbreviations like Mr. and Mrs. I'd suggest you
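
For readers unfamiliar with what "special cases for abbreviations" means in practice, here is a rough Python sketch of the idea: split on spaces, separate punctuation, but keep the period attached to a known abbreviation. This is not the Moses tokenizer.perl code; the function name `tokenize` and the small prefix set are made up for illustration.

```python
import re

# Illustrative prefix set only (an assumption); the real lists ship with Moses.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr"}

def tokenize(line, prefixes=NONBREAKING_PREFIXES):
    """Whitespace/punctuation tokenizer that respects abbreviation periods."""
    # Put spaces around common ASCII punctuation other than the period.
    line = re.sub(r'([,!?;:"()])', r" \1 ", line)
    tokens = []
    for word in line.split():
        # Does the word end in one or more ASCII periods?
        match = re.fullmatch(r"(.+?)(\.+)", word)
        if match is None:
            tokens.append(word)
            continue
        stem, dots = match.groups()
        if stem in prefixes:
            # Known abbreviation: keep "Mr." as a single token.
            tokens.append(word)
        else:
            # Ordinary sentence-final period: split it off.
            tokens.extend([stem, dots])
    return tokens

print(tokenize("Mr. Smith arrived, finally."))
```

Note that this sketch only knows ASCII punctuation, so an Urdu full stop (U+06D4) would stay attached to the preceding word here. The actual Moses script may well behave differently; check its output on a few Urdu sentences before relying on it.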

Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Hieu Hoang
The output will be tokenized, but probably very badly. If you know Urdu and can create a better tokenizer, please add it to Moses. You can start by looking at the configuration file for the English tokenizer in scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en. You can copy that and
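
To get a feel for what such a prefix file contains before writing an Urdu one, the following Python sketch parses prefix-file lines into sets. It assumes the layout used by the English file: one prefix per line, comment lines starting with `#`, and an optional `#NUMERIC_ONLY#` marker after a prefix that should only be protected before a number. The function name `load_nonbreaking_prefixes` and the sample lines are illustrative, not part of Moses.

```python
def load_nonbreaking_prefixes(lines):
    """Parse prefix-file lines into (prefixes, numeric_only_prefixes)."""
    marker = "#NUMERIC_ONLY#"
    prefixes, numeric_only = set(), set()
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        if line.endswith(marker):
            # Prefix that is non-breaking only when followed by a number.
            numeric_only.add(line[: -len(marker)].strip())
        else:
            prefixes.add(line)
    return prefixes, numeric_only

# Sample content for illustration only (an assumption, not a real Urdu list).
SAMPLE = """\
# Any single-token abbreviation goes on its own line
Mr
Mrs
No #NUMERIC_ONLY#
""".splitlines()

prefixes, numeric_only = load_nonbreaking_prefixes(SAMPLE)
print(sorted(prefixes), sorted(numeric_only))
```

An Urdu version would list Urdu abbreviations in the same way; which entries belong there is a question for someone who knows the language.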

Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Asad
And what about the truecaser and cleaning? Will I have to create those for Urdu as well? Regards, Asad A. Malik

Re: [Moses-support] Warning during tokenizing Urdu Corpus

2013-12-27 Thread Hieu Hoang
Nope, just the tokenizer.