Re: [Moses-support] Problem with corpus preparation

Rico Sennrich Thu, 26 Mar 2015 04:51:14 -0700

Abdelfetah Boumerdas <aa_boumerdas@...> writes:

> Hi All,
> i'm trying to build a translation model using moses, and to do that i'm
> using 2 corpora (europarl and the news commentary corpus provided in the
> manual) but when i reached the corpus preparation step i noticed the
> following problem: in the prepared europarl files i find that the apostrophe
> (') and the quotation marks are replaced respectively with (&apos;) and
> (&quot;) but in the second corpus they're still unchanged.
> can anyone please tell me why?? is it a problem with the files encoding (i
> checked and they're both utf8)?? or is it another problem that i don't know
> about???
> Thanks in advance.
> --



Hi Abdelfetah,

Some special characters (<, >, [, ], ", ', |) are reserved because they have
special meaning in the phrase table and/or are needed to support XML input. The
tokenizer.perl script automatically replaces them with escape sequences, and
the detokenizer unescapes them again. There are also the scripts
escape-special-chars.perl and deescape-special-chars.perl to convert between
the two forms without (de)tokenizing.
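
To illustrate the escaping described above, here is a minimal Python sketch of
the round trip. It is not the Moses code itself: the function names are made up
for this example, and the mapping is my understanding of what current versions
of the Perl scripts emit (it also includes &, which has to be escaped so that
the sequences stay unambiguous). Older versions may use different sequences, so
treat the table as an assumption and check it against your own copy of the
scripts.

```python
# Assumed mapping, modelled on escape-special-chars.perl; verify locally.
# "&" comes first so that the "&" inside the other sequences is not re-escaped.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(line):
    """Replace reserved characters with escape sequences (hypothetical helper)."""
    for raw, escaped in ESCAPES:
        line = line.replace(raw, escaped)
    return line


def deescape_special_chars(line):
    """Undo escape_special_chars; "&amp;" is handled last (hypothetical helper)."""
    for raw, escaped in reversed(ESCAPES):
        line = line.replace(escaped, raw)
    return line


if __name__ == "__main__":
    sample = 'He said "it\'s fine" [sic]'
    escaped = escape_special_chars(sample)
    print(escaped)
    print(deescape_special_chars(escaped) == sample)
```

The order matters: escaping & first and unescaping it last means that text
which already contains something like &lt; survives the round trip unchanged.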

Consistency (between corpora and between training and test time) is
important. Is it possible that you used different versions of the
tokenizer.perl script? Older versions did not do escaping.
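
A quick way to check this is to count, for each prepared file, how many lines
contain escape sequences and how many still contain the raw reserved
characters. The sketch below is only an illustration: the function name and the
file names in the usage example are placeholders, and the list of escape
sequences is the same assumed mapping as in current versions of the tokenizer.

```python
import re

# Escape sequences assumed to be produced by an escaping tokenizer.
ENTITY = re.compile(r"&(?:amp|lt|gt|apos|quot|#124|#91|#93);")
# Reserved characters in raw form (& is left out, it also starts every entity).
RAW = re.compile(r"""[<>\[\]"'|]""")


def escaping_profile(lines):
    """Count lines containing escape sequences and lines containing raw characters."""
    escaped = 0
    raw = 0
    for line in lines:
        if ENTITY.search(line):
            escaped += 1
        if RAW.search(line):
            raw += 1
    return {"escaped": escaped, "raw": raw}


if __name__ == "__main__":
    import sys

    # Usage (placeholder file names): python profile.py europarl.tok.en news.tok.en
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            print(path, escaping_profile(handle))
```

If one file shows mostly "escaped" lines and the other mostly "raw" lines, the
two were most likely prepared with different tokenizer versions or options, and
re-tokenizing both with the same script should make them consistent.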

best wishes,
Rico

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
