Hey,
First you need to check out and compile this fork of nplm:
https://github.com/rsennrich/nplm
Then you need to compile Moses with the nplm switch:
./bjam --with-nplm=path/to/nplm
Then you can see how to use it here:
http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc31
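In case it helps, the whole sequence looks roughly like this (the paths are placeholders, and the exact nplm build steps are described in its README, so treat the make line as a sketch):

  git clone https://github.com/rsennrich/nplm
  cd nplm/src && make          # see the nplm README for prerequisites (Boost, Eigen)
  cd /path/to/mosesdecoder
  ./bjam --with-nplm=/path/to/nplm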
Hi,
nplm is a toolkit for neural probabilistic language models. In Moses it
can be used both as a language model and as a bilingual LM (the neural
network joint model, ACL 2014). Both parts have been updated in the
mosesdecoder repository on GitHub.
If you want to use nplm in Moses, you have to compile Moses with the
--with-nplm switch so that it links against nplm.
So to summarize:
The main issue is that, on some versions of perl, the Moses tokenizer
operates at the character rather than the grapheme level, treating
combining characters (which are arguably parts of words in many cases)
as non-alphanumeric and splitting them off.
Older versions of perl appear to behave this way.
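A quick way to see the character/grapheme distinction (just an illustration, not the tokenizer's actual code): perl's length() counts code points, while \X matches a whole grapheme cluster:

  perl -le 'my $s = "e\x{301}"; print length $s; print scalar( () = $s =~ /\X/g )'
  # prints 2 (code points), then 1 (grapheme cluster)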
Japanese is another language that suffers from standard Unicode NFKC,
because the normalization applies changes that cannot be reversed.
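For example (made-up data, not from this thread): NFKC folds halfwidth katakana into fullwidth, and once that has happened there is no way to recover the original form:

  perl -CS -MUnicode::Normalize -le 'print NFKC("\x{FF76}")'
  # U+FF76 HALFWIDTH KATAKANA KA comes out as U+30AB KATAKANA KA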
On 12/30/2014 04:40 AM, John D Burger wrote:
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
The escaping is necessary because Moses reserves these characters for
other uses. When corpora are consistently prepared, the escaping has no
effect on translation results. It looks like you have not prepared your
corpora consistently. Note my results ('s) are different from yours
(' s):
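Something like the following reproduces the standard behaviour (assuming the stock scripts/tokenizer/tokenizer.perl; the -no-escape option turns the escaping off if you really do not want it):

  echo "keep your notification's payload under 5 kb." \
      | perl scripts/tokenizer/tokenizer.perl -l en
  # with default escaping I would expect:
  # keep your notification &apos;s payload under 5 kb .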
> This is also a reason to turn Unicode normalization on. If the
> tokenizer did NFKC at the beginning, then the problem would go away.
If I understand the situation correctly, this would only fix this particular
example and a few others like it. There are many base+combining grapheme
clusters that have no precomposed form, so normalization would leave them
as separate code points.
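To make that concrete (illustration only): NFKC can only merge a base+combining pair when a precomposed character exists, e.g. e + U+0301 composes to é, while q + U+0301 stays as two code points:

  perl -MUnicode::Normalize -le 'print length NFKC("e\x{301}"); print length NFKC("q\x{301}")'
  # prints 1, then 2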
Dear Moses,
The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
, tokenizes differently on different machines.
I'm running tokenizer.perl from head (481a07dc) with this perl:
This is perl 5, version 18
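To reproduce, I run roughly this on each machine and diff the output (attached-line.txt is just whatever the attachment gets saved as):

  perl -v | head -2
  perl scripts/tokenizer/tokenizer.perl -l en < attached-line.txt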
Dear all,
When I tokenize files, the apostrophes are replaced with "&apos;", which
makes sense, but on the other hand it completely breaks the meaning and the
order of the words, for example:
Sentence before tokenization :
Src : keep your notification's payload under 5 kb.
Trg: اجعل حمولة الإعل
I have Arabic into English translation ... Factored Model.
My question is: do I have to add POS for both the source and the target, or
just for the target that I want to translate into (through training and tuning)?
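For reference, as far as I understand the factored format, the factors are simply appended to each surface word, separated by | (the POS tags below are invented just to show the format):

  keep|VB your|PRP$ notification|NN 's|POS payload|NN under|IN 5|CD kb|NN .|.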
In case I have to add it for both, how can I add CCG supertags to the
Arabic side, because there is