Nope, just the tokenizer.
On 27 December 2013 18:21, Asad wrote:
> And what about the truecaser and cleaning? Will I have to create those for
> Urdu as well?
>
> Regards
> Asad A.Malik
>
> On Dec 27, 2013, at 9:07 PM, Hieu Hoang wrote:
>
> The output will be tokenized, but proba
And what about the truecaser and cleaning? Will I have to create those for
Urdu as well?
Regards
Asad A.Malik
On Dec 27, 2013, at 9:07 PM, Hieu Hoang wrote:
> The output will be tokenized, but probably very badly. If you know Urdu and
> can create a better tokenizer, please add i
The output will be tokenized, but probably very badly. If you know Urdu and
can create a better tokenizer, please add it to Moses.
You can start by looking at the configuration file for the English
tokenizer in
scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
You can copy that and chang
The default tokenizer script only has language-specific rules for a few
languages. The fallback (English) rules may suffice for your purposes: they do
the obvious thing with spaces and English punctuation, and also handle some
special cases for abbreviations like "Mr." and "Mrs.".
I'd suggest you eye
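To make the fallback behaviour described above a little more concrete, here is
a minimal Python sketch of the general idea: split on whitespace, separate
punctuation, but keep the period attached to a known abbreviation. This is only
an illustration and not the logic of tokenizer.perl; the function name
simple_tokenize and the tiny prefix set are hypothetical, standing in for a
real nonbreaking-prefix list.

```python
import re

# Hypothetical, tiny stand-in for a nonbreaking-prefix list
# (assumption: real lists are much longer and language-specific).
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr"}


def simple_tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split text into tokens, keeping 'Mr.'-style abbreviations whole."""
    tokens = []
    for word in text.split():
        # Keep the trailing period attached when the word is a known prefix.
        if word.endswith(".") and word[:-1] in prefixes:
            tokens.append(word)
            continue
        # Otherwise split runs of letters/digits from punctuation characters.
        tokens.extend(re.findall(r"[^\W_]+|[^\w\s]|_", word))
    return tokens


if __name__ == "__main__":
    print(" ".join(simple_tokenize("Mr. Smith arrived, finally.")))
```

With an empty prefix set the same sentence would have "Mr" and "." split
apart, which is roughly why a language without its own prefix list may end up
tokenized less well than English.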
Hi All,
I am trying to develop an Urdu SMT system using Moses. I have an Urdu parallel
corpus, and the first step in the manual is to tokenize the corpus, but when I
enter the following command:
~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur <
~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus