Re: [Moses-support] Problem with corpus preparation
On 28/03/15 13:26, Abdelfetah Boumerdas wrote:

> Hi Rico,
>
> Thank you so much for your help. The deescape-special-chars.perl script did the job perfectly and removed all the special XML characters.
>
> Now I have another question. I followed the Moses manual and trained Moses on the News Commentary corpus, so I now have the moses.ini file. Before starting the tuning step, I tried to test the trained system by translating a simple French sentence into English, but Moses consumed all the memory I have, which caused my laptop to stop responding (I have an Intel i7-4702MQ processor with 8GB RAM and enough space on disk).
>
> Can you please tell me what the problem was? Do I have to binarise the translation table, or is it normal for the system to consume that much memory?
>
> Thanks again.

Hi Abdelfetah,

it's not uncommon for Moses to use more than 8GB of RAM during decoding, depending on the size of your models. Here are some ways to reduce memory usage, but you might also want to consider using a computer with more memory:

http://www.statmt.org/moses/?n=Moses.Optimize#ntoc19

best wishes,
Rico

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Problem with corpus preparation
Hi Rico,

Thank you so much for your help. The deescape-special-chars.perl script did the job perfectly and removed all the special XML characters.

Now I have another question. I followed the Moses manual and trained Moses on the News Commentary corpus, so I now have the moses.ini file. Before starting the tuning step, I tried to test the trained system by translating a simple French sentence into English, but Moses consumed all the memory I have, which caused my laptop to stop responding (I have an Intel i7-4702MQ processor with 8GB RAM and enough space on disk).

Can you please tell me what the problem was? Do I have to binarise the translation table, or is it normal for the system to consume that much memory?

Thanks again.

2015-03-26 12:47 GMT+01:00 Rico Sennrich:

> Abdelfetah Boumerdas writes:
>
> > Hi All,
> >
> > I'm trying to build a translation model using Moses, and to do that I'm using two corpora (Europarl and the News Commentary corpus provided in the manual). When I reached the corpus preparation step I noticed the following problem: in the prepared Europarl files, the apostrophe (') and the quotation mark (") are replaced with (&apos;) and (&quot;) respectively, but in the second corpus they are still unchanged.
> >
> > Can anyone please tell me why? Is it a problem with the file encoding (I checked and they're both UTF-8), or is it another problem that I don't know about?
> >
> > Thanks in advance.
>
> Hi Abdelfetah,
>
> some special characters (<, >, [, ], ", ', |) are reserved because they have special meaning in the phrase table and/or to support XML input. The tokenizer.perl script automatically replaces them with escape sequences, and the detokenizer unescapes them again. There are also the (de)escape-special-chars.perl scripts to go from one form to the other without (de)tokenizing.
>
> Consistency (between corpora, and between training and test time) is important. Is it possible that you used different versions of the tokenizer.perl script? Older versions did not do escaping.
>
> best wishes,
> Rico

--
BOUMERDAS Abdelfetah
5ème Année Option Systèmes Informatiques (SIQ)
Ecole nationale Supérieure d'Informatique ESI (ex INI)
BP 68 M Oued Smar 16309 - ALGER
Re: [Moses-support] Problem with corpus preparation
Abdelfetah Boumerdas writes:

> Hi All,
>
> I'm trying to build a translation model using Moses, and to do that I'm using two corpora (Europarl and the News Commentary corpus provided in the manual). When I reached the corpus preparation step I noticed the following problem: in the prepared Europarl files, the apostrophe (') and the quotation mark (") are replaced with (&apos;) and (&quot;) respectively, but in the second corpus they are still unchanged.
>
> Can anyone please tell me why? Is it a problem with the file encoding (I checked and they're both UTF-8), or is it another problem that I don't know about?
>
> Thanks in advance.

Hi Abdelfetah,

some special characters (<, >, [, ], ", ', |) are reserved because they have special meaning in the phrase table and/or to support XML input. The tokenizer.perl script automatically replaces them with escape sequences, and the detokenizer unescapes them again. There are also the (de)escape-special-chars.perl scripts to go from one form to the other without (de)tokenizing.

Consistency (between corpora, and between training and test time) is important. Is it possible that you used different versions of the tokenizer.perl script? Older versions did not do escaping.

best wishes,
Rico
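
For readers who want to see what the escaping described above amounts to, here is a minimal Python sketch. It is not the Moses code itself: the function names are hypothetical, and the character-to-entity table is an assumption modelled on what the tokenizer is understood to emit, so check it against your own copy of tokenizer.perl. The point it illustrates is the ordering: "&" has to be escaped first and restored last, otherwise the entities get escaped twice.

```python
# Assumed mapping from reserved characters to escape sequences.
# Verify against the tokenizer.perl version you actually use.
ESCAPES = [
    ("&", "&amp;"),    # first, so the entities below are not double-escaped
    ("|", "&#124;"),   # factor separator
    ("<", "&lt;"),     # XML input
    (">", "&gt;"),     # XML input
    ("'", "&apos;"),   # XML input
    ('"', "&quot;"),   # XML input
    ("[", "&#91;"),    # non-terminal brackets
    ("]", "&#93;"),    # non-terminal brackets
]


def escape_special_chars(text: str) -> str:
    """Replace each reserved character with its escape sequence."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def deescape_special_chars(text: str) -> str:
    """Undo escape_special_chars; '&amp;' is restored last."""
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text


def has_raw_reserved_chars(text: str) -> bool:
    """True if the line still contains a reserved character other than '&'."""
    return any(raw in text for raw, _ in ESCAPES if raw != "&")
```

Running `has_raw_reserved_chars` over a few lines of each prepared corpus is a quick way to spot the kind of inconsistency discussed in this thread: a corpus prepared with escaping should contain no raw apostrophes or quotation marks, while one prepared without it will.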