Re: [Moses-support] Warning during tokenizing Urdu Corpus

Asad Fri, 27 Dec 2013 10:20:37 -0800

And what about the truecaser and the cleaning step? Will I have to create 
those for Urdu as well?


Regards
Asad A.Malik


On Dec 27, 2013, at 9:07 PM, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:

> The output will be tokenized, but probably very badly. If you know Urdu and 
> can create a better tokenizer, please add it to Moses.
> 
> You can start by looking at the configuration file for the English tokenizer 
> in
>    scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
> You can copy that and change it specifically for Urdu.
> 
> 
> 
> On 26 December 2013 16:35, Asad A.Malik <asad_12...@yahoo.com> wrote:
> Hi All,
> 
> I am trying to develop an Urdu SMT system using Moses. I have an Urdu 
> parallel corpus, and the first step in the manual is to tokenize the corpus, 
> but when I enter the following command:
> 
> ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur < ~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus.ur-en.tok.ur
> 
> it gives me this warning:
> 
> WARNING: No known abbreviations for language 'ur', attempting fall-back to English version...
> 
> It also generates the output file, but I don't know whether this output is 
> actually tokenized.
> 
> 
> Regards
> 
> Asad A.Malik
> 
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
> 
> 
> 
> 
> -- 
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
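
The copy-and-adapt step suggested in the quoted reply could be sketched roughly as below. This is only an illustration, not part of Moses: the MOSES_SCRIPTS variable and the seed_prefix_file helper are hypothetical names, and the default path is taken from the command in the original question. It also assumes the tokenizer selects its prefix file by the language suffix, which is what the warning and the file name nonbreaking_prefix.en suggest.

```shell
#!/bin/sh
# Sketch: seed an Urdu nonbreaking-prefix file from the English one.
# MOSES_SCRIPTS is an assumed variable; point it at your own checkout.
MOSES_SCRIPTS="${MOSES_SCRIPTS:-$HOME/SMT/mosesdecoder/scripts}"
PREFIX_DIR="$MOSES_SCRIPTS/share/nonbreaking_prefixes"

# seed_prefix_file DIR LANG (hypothetical helper)
# Copies DIR/nonbreaking_prefix.en to DIR/nonbreaking_prefix.LANG,
# refusing to overwrite a file that is already there.
seed_prefix_file() {
    src="$1/nonbreaking_prefix.en"
    dst="$1/nonbreaking_prefix.$2"
    if [ ! -f "$src" ]; then
        echo "missing $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "$dst already exists, leaving it alone" >&2
        return 0
    fi
    cp "$src" "$dst"
    echo "created $dst; now edit it by hand for Urdu"
}

# Only act when the Moses directory is actually present.
if [ -d "$PREFIX_DIR" ]; then
    seed_prefix_file "$PREFIX_DIR" ur
fi
```

The copy is only a starting point: the English abbreviations would still have to be replaced by hand with ones that make sense for Urdu. After that, rerunning the tokenizer.perl -l ur command from the original message should show whether the fall-back warning has gone away.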

