And what about the truecaser and cleaning? Will I have to create those for Urdu as well?
Regards
Asad A.Malik
Sent from my iPod

On Dec 27, 2013, at 9:07 PM, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:

> The output will be tokenized, but probably very badly. If you know Urdu and
> can create a better tokenizer, please add it to Moses.
>
> You can start by looking at the configuration file for the English tokenizer
> in
> scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
> You can copy that and change it specifically for Urdu.
>
> On 26 December 2013 16:35, Asad A.Malik <asad_12...@yahoo.com> wrote:
> Hi All,
>
> I am trying to develop Urdu SMT using MOSES. I have Urdu parallel corpus and
> the 1st step in manual is to tokenize the corpus, but when I enter following
> command:
>
> ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur < ~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus.ur-en.tok.ur
>
> it gives me warning:
>
> WARNING: No known abbreviations for language 'ur', attempting fall-back to
> English version...
>
> It also generates the output file but I don't know that this output is
> tokenized or not
>
> Regards
>
> Asad A.Malik
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
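
To make the role of the nonbreaking-prefix file mentioned above more concrete, here is a rough Python sketch of the idea. It is not the logic of tokenizer.perl, only an illustration under simplifying assumptions: the function names load_prefixes and tokenize are hypothetical, markers such as #NUMERIC_ONLY# are ignored, and a sentence-final stop is always split off. The sample list uses English entries; an Urdu list would hold Urdu abbreviations instead.

```python
def load_prefixes(text):
    """Parse a prefix list: one entry per line, lines starting with '#' are comments.

    Simplification: anything after the first field (e.g. a '#NUMERIC_ONLY#'
    marker) is ignored in this sketch.
    """
    prefixes = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            prefixes.add(line.split()[0])
    return prefixes


def tokenize(sentence, prefixes, full_stops=".\u06d4"):
    """Split a trailing full stop off each word unless the word is a listed prefix.

    full_stops holds the characters treated as sentence-ending stops; U+06D4
    (ARABIC FULL STOP) is included here as an assumption about Urdu text.
    """
    words = sentence.split()
    tokens = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if len(word) > 1 and word[-1] in full_stops:
            stem, stop = word[:-1], word[-1]
            if stem in prefixes and not is_last:
                tokens.append(word)          # abbreviation: keep the stop attached
            else:
                tokens.extend([stem, stop])  # ordinary word: split the stop off
        else:
            tokens.append(word)
    return tokens


# Hypothetical sample list in the same one-entry-per-line layout.
sample_list = """# titles
Mr
Dr
"""

prefixes = load_prefixes(sample_list)
print(tokenize("Mr. Smith arrived.", prefixes))
print(tokenize("Mr. Smith arrived.", set()))
```

With the list loaded, "Mr." stays as one token; with an empty list it is split into "Mr" and ".", which is the kind of difference a language-specific prefix file is meant to control.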