Re: [Moses-support] Incremental training
Hi Raj,

I am still getting the same error, as follows:

    reading vocabulary files
    Reading vocabulary file from:new_corpus/inc.fr.vcb
    ERROR: TOKEN ID must be unique for each token, in line : 2 traité 34
    TOKEN ID 2 has already been assigned to: traité

Your script is generating duplicate items. Maybe you can forward me the .py script again. I hope we are not using different versions of the same script!

However, I made some changes to the .py script based on your suggestion, and it now works without any error. Please see the attached scripts.

Regards,
sandipan

On 20 November 2014 09:24, Raj Dabre prajda...@gmail.com wrote:

Hey,

I just remembered that I have a pathetic memory. I forgot to add the lines for sorting the .vcb file in increasing order of id. Just add the following lines to align_new.sh after the line

    $MGIZA/scripts/plain2snt-hasvcb.py ../corpus/$4.vcb ../corpus/$3.vcb $2 $1 $2_$1.snt $1_$2.snt $2.vcb $1.vcb

The lines to add are:

    sort -n $1.vcb > tmp
    mv tmp $1.vcb
    sort -n $2.vcb > tmp
    mv tmp $2.vcb

And it will run perfectly. I am sure of it. I used your folder just to be sure. It works. Sorry for my silliness. Lemme know if it works now.

Regards.

On Thu, Nov 20, 2014 at 1:13 AM, Raj Dabre prajda...@gmail.com wrote:

Well, then your paths must be wrong. I can't see why the files are not being generated. I'll look into it tomorrow and let you know.

On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat sandipandanda...@gmail.com wrote:

When I use your script there is no problem. But when I modified the lines to nid = len(fvcb)+2; there are no .vcb files in the new_corpus/ dir. I used these two commands:

    sh full_train.sh org.en org.fr
    sh align_new.sh inc.en inc.fr org.en org.fr

Is the above right? I have kept the paths (MGIZA, MODEL_BASE, CORPUS_BASE and NEW_CORPUS_BASE) hard-coded in the scripts.

On 19 November 2014 15:49, Raj Dabre prajda...@gmail.com wrote:

Cannot open file??? Does the file exist?? Are you passing the path properly?
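The sorting step from Raj's 20 November reply (sort -n on each .vcb file) can also be sketched in Python, together with a quick check for the duplicate ids that trigger the "TOKEN ID must be unique" error. This is only a sketch: it assumes each .vcb line has the form "id token count", as the line "2 traité 34" in the error output suggests, and the function names are made up for illustration.

```python
def sort_vcb_lines(lines):
    """Return non-empty vocabulary lines ordered by their integer token id.

    Assumes the id is the first whitespace-separated field of each line.
    """
    entries = [ln.rstrip("\n") for ln in lines if ln.strip()]
    return sorted(entries, key=lambda ln: int(ln.split()[0]))


def duplicate_ids(lines):
    """Return the sorted list of token ids that appear on more than one line."""
    seen, dups = set(), set()
    for ln in lines:
        if not ln.strip():
            continue
        tok_id = int(ln.split()[0])
        if tok_id in seen:
            dups.add(tok_id)
        else:
            seen.add(tok_id)
    return sorted(dups)


# Hypothetical sample in the assumed "id token count" layout.
sample = ["24 roi 2", "2 traité 34", "3 le 10", "24 reine 1"]
print(sort_vcb_lines(sample))
print(duplicate_ids(sample))
```

Sorting alone does not remove duplicates; if duplicate_ids reports anything, the id assignment in the Python script is the place to look.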
On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat sandipandanda...@gmail.com wrote:

Hi,

I made the changes based on your suggestions; it is now generating a different error, as below:

    reading vocabulary files
    Reading vocabulary file from:new_corpus/inc.fr.vcb
    Cannot open vocabulary file new_corpus/inc.fr.vcbfil

I am attaching the working dir and the .py scripts herewith. The 10 parallel sentences for incremental alignment are in inc_data/, whereas the original 500 sentences are in the mtdata/ directory.

Thanks a ton for your help.

Regards,
sandipan

On 19 November 2014 15:18, Raj Dabre prajda...@gmail.com wrote:

Hey,

I am pretty sure that my script does not generate duplicate token ids. In fact, I used to get the same error till I modified the script. In case you want to avoid this error without using my script:

1. Open the original python script: plain2snt-hasvcb.py
2. There is a line which increments the id counter by 1 (the line is nid = len(fvcb)+1;)
3. Make this line: nid = len(fvcb)+2; (This is because the id numbering starts from 1, and thus if you have 23 tokens then the ids will go from 2 to 24. The original update script will do nid = 23 + 1 = 24, and the modification will give 25 correctly.) This is in 2 places; the other one is nid = len(evcb)+2;

Do this and it will work. In any case, send me a zip file of your working directory (if it's small; you are testing it on small data, right?). I will see what the problem is.

On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat sandipandanda...@gmail.com wrote:

Dear Raj,

I also tried to use your scripts for incremental alignment. I copied your python script into the desired directory, but I am still receiving the same error as posted by Ihab.

    reading vocabulary files
    Reading vocabulary file from:new_corpus/inc.fr.vcb
    ERROR: TOKEN ID must be unique for each token, in line : 24 roi 2
    TOKEN ID 24 has already been assigned to: roi

I took only 500 sentence pairs for full_train.sh and it worked fine, with 758 lines in the corpus/tgt_filename.vcb file. I took only 10 sentences for the incremental align_new.sh, which generated the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb. Is there any problem? Can you please help me with this?

Thanks and regards,
sandipan

On 4 November 2014 16:13, prajdabre prajda...@gmail.com wrote:

Dear Ihab,

There is a python script in the google drive folder from the first mail I sent you. Please replace the existing file with my copy. It has to work.

Regards.

Sent from Samsung Mobile

Original message
From: Ihab Ramadan i.rama...@saudisoft.com
Date: 05/11/2014 00:54 (GMT+09:00)
To: 'Raj Dabre' prajda...@gmail.com
Cc: moses-support@mit.edu
Subject: RE: [Moses-support] Incremental training

Dear Raj,

Your point is clear and I tried to follow the steps you mentioned, but I am stuck now in the align_new.sh script, which gives me this error reading
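Raj's arithmetic for the nid fix (23 existing tokens with ids 2 to 24, so the next free id is 25) can be sketched as follows. This is not the code of plain2snt-hasvcb.py; the function names and the dictionary representation are assumptions made for illustration. The second helper, which takes the largest existing id plus one, is a swapped-in alternative that does not rely on the ids being gap-free.

```python
def next_id_by_length(vocab):
    """Next free id, assuming ids run 2..len(vocab)+1 with no gaps (Raj's fix)."""
    return len(vocab) + 2


def next_id_by_max(vocab):
    """Alternative (an assumption, not from the thread): largest id in use plus one."""
    return max(vocab.values(), default=1) + 1


def extend_vocab(vocab, new_tokens, next_id=next_id_by_length):
    """Return a copy of vocab in which every unseen token gets a fresh id."""
    extended = dict(vocab)
    nid = next_id(extended)
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = nid
            nid += 1
    return extended


# 23 hypothetical tokens with ids 2..24, as in Raj's example.
old = {"w%d" % i: i for i in range(2, 25)}
new = extend_vocab(old, ["roi", "w2", "reine"])
print(new["roi"], new["reine"])  # ids given to the two unseen tokens
```

With the original +1 offset the first new token would receive id 24, which already belongs to an existing token; that collision is what the error message complains about.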
Re: [Moses-support] Incremental training
Dear Raj,

I also tried to use your scripts for incremental alignment. I copied your python script into the desired directory, but I am still receiving the same error as posted by Ihab.

    reading vocabulary files
    Reading vocabulary file from:new_corpus/inc.fr.vcb
    ERROR: TOKEN ID must be unique for each token, in line : 24 roi 2
    TOKEN ID 24 has already been assigned to: roi

I took only 500 sentence pairs for full_train.sh and it worked fine, with 758 lines in the corpus/tgt_filename.vcb file. I took only 10 sentences for the incremental align_new.sh, which generated the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb. Is there any problem? Can you please help me with this?

Thanks and regards,
sandipan

On 4 November 2014 16:13, prajdabre prajda...@gmail.com wrote:

Dear Ihab,

There is a python script in the google drive folder from the first mail I sent you. Please replace the existing file with my copy. It has to work.

Regards.

Sent from Samsung Mobile

Original message
From: Ihab Ramadan i.rama...@saudisoft.com
Date: 05/11/2014 00:54 (GMT+09:00)
To: 'Raj Dabre' prajda...@gmail.com
Cc: moses-support@mit.edu
Subject: RE: [Moses-support] Incremental training

Dear Raj,

Your point is clear and I tried to follow the steps you mentioned, but I am stuck now in the align_new.sh script, which gives me this error:

    reading vocabulary files
    Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
    ERROR: TOKEN ID must be unique for each token, in line : 29107 q-1 4

Do you have any idea what this error means?

*From:* Raj Dabre [mailto:prajda...@gmail.com]
*Sent:* Tuesday, November 4, 2014 12:06 PM
*To:* i.rama...@saudisoft.com
*Cc:* moses-support@mit.edu
*Subject:* Re: [Moses-support] Incremental training

Dear Ihab,

Perhaps I should have explained much more clearly what my scripts do. Sorry for that.

Let me start with this: there is no direct/easy way to generate the moses.ini file as you need.

1. Suppose you have 2 million lines of parallel corpora and you trained an SMT system on them. This naturally gives the phrase table, reordering table and moses.ini.
2. Suppose you got 500k more lines of parallel corpora. There are 2 ways:
   a. Retrain on all 2.5 million lines from scratch (will take lots of time: ~ 2-3 days on a regular machine).
   b. Train on only the 500k new lines using the alignment information of the original training data (faster: ~ 6-7 hours).

What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.*

1. full_train.sh -- This trains on the original corpus of 2 million lines. (Generates alignment files only for the original corpus.)
2. align_new.sh -- This trains on the new corpus of 500k lines. (Generates alignment files only for the new corpus, using the alignments from 1.)

*Why this split*

Because the basic training step of Moses does not preserve the alignment probability information. Only the alignments are saved. To continue training we need the probability information. You can pass flags to moses to preserve this information (this flag is --giza-option). If you do this then you will not need full_train.sh, but you will have to change the config files before using align_new.sh.

*HOW TO GET UPDATED PHRASE TABLE:*

1. Append the forward alignments (fwd) generated by align_new.sh to the forward (fwd) alignments generated by full_train.sh.
2. Append the inverse alignments (inv) generated by align_new.sh to the inverse (inv) alignments generated by full_train.sh.
3. Run the moses training script with additional flags:
   - --first-step -- first step in the training process (default 1). This will be 4.
   - --last-step -- last step in the training process (default 7). This will remain 7.
   - --giza-f2e -- path to folder/new_giza.fwd
   - --giza-e2f -- path to folder/new_giza.inv

For example:

    ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
      -corpus <your new corpus name> \
      -f src -e tgt -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
      -lm 0:3:<path to LM>:8 \
      --first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd --giza-e2f <path to folder>/new_giza.inv \
      -external-bin-dir <path to giza++ binaries>

For more details on the training step read this: http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters

What this does is assume that you already have the alignments, continue with phrase extraction and reordering, and generate the new moses.ini file.

WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*

If you are still unclear then please ask and I will try to help you as much as I can.

Regards.

On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan i.rama...@saudisoft.com wrote:

Dear Raj,

That’s a great work my friend, This files
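Steps 1 and 2 of Raj's recipe (append the new forward and inverse alignments to the original ones) amount to file concatenation. A minimal Python sketch is given below. It assumes the alignment files are plain, uncompressed text; the function name and all file names in the demo are hypothetical, and real GIZA output may need to be decompressed first.

```python
import os
import tempfile


def append_alignments(original_path, new_path, combined_path):
    """Write the original alignment file followed by the new one.

    Every line is re-terminated with a newline so that a missing final
    newline in the first file cannot merge two lines. Returns the number
    of lines written.
    """
    count = 0
    with open(combined_path, "w", encoding="utf-8") as out:
        for path in (original_path, new_path):
            with open(path, encoding="utf-8") as src:
                for line in src:
                    out.write(line.rstrip("\n") + "\n")
                    count += 1
    return count


# Tiny demo with made-up file names and placeholder contents.
workdir = tempfile.mkdtemp()
orig_fwd = os.path.join(workdir, "orig_giza.fwd")
inc_fwd = os.path.join(workdir, "inc_giza.fwd")
new_fwd = os.path.join(workdir, "new_giza.fwd")
with open(orig_fwd, "w", encoding="utf-8") as f:
    f.write("original line 1\noriginal line 2")  # note: no final newline
with open(inc_fwd, "w", encoding="utf-8") as f:
    f.write("incremental line 1\n")
print(append_alignments(orig_fwd, inc_fwd, new_fwd))
```

The same call would be repeated for the inverse (inv) files. The order matters: the combined alignments have to line up with a corpus in which the new sentences follow the original ones.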
Re: [Moses-support] Help in Moses Incremental Retraing
Hi Ulrich,

Sorry for sending the doubts to you directly. I will keep in mind to post my future queries to moses-support. Thanks a lot for the clarification. Let me play with MERT.

Thanks and regards,
sandipan

On 24 October 2014 01:51, Ulrich Germann ulrich.germ...@gmail.com wrote:

Hi Sandipan,

first, please post Moses-related questions to moses-support@mit.edu, not to individual contributors.

second, the current seven features used by Mmsapt / PhraseDictionaryBitextSampling are listed below (for details, see my recent paper on this phrase table implementation: https://www.researchgate.net/publication/267270863_Dynamic_Phrase_Tables_for_Machine_Translation_in_an_Interactive_Post-editing_Scenario ). THE STANDARD SET OF FEATURES MAY CHANGE AT ANY TIME, as this is still work in progress.

- forward and backward lexically smoothed phrase scores (2 scores; same as standard features)
- rarity penalty (1/(x+1)), where x is the number of phrase pair occurrences in the corpus/sample (1 score)
- the lower bound on forward and backward phrase-level probabilities, with confidence level .99 (2 scores)
- 2 provenance features (x/(x+1)), where x is the number of phrase pair occurrences in the (static) background and (dynamic) foreground corpus (2 scores)

third, you need to retrain the feature weights for good performance. Any of the standard techniques will do; I usually use MERT. The executable simulate-pe allows you to feed in references and word alignments one sentence at a time; there are additional parameters --spe-src, --spe-trg, --spe-aln to specify source, target, and alignment (symal output format). Source and target files are one sentence per line, tokenized. Michael Denkowski is currently in the process of integrating online tuning into Moses, but I'm not sure whether that's ready to be deployed yet.

Regards - Uli

On Thu, Oct 23, 2014 at 1:47 AM, Sandipan Dandapat sandipandanda...@gmail.com wrote:

Dear Ulrich,

I got your reference from Prashanta Mathur.
I am a postdoctoral researcher in CNGL, DCU and I am working with Moses incremental retraining. It would be great if you could help me with a couple of doubts:

1. I found there are 7 weights to define for PT0 (PT0 is the Mmsapt name), i.e.

    Mmsapt name=PT0 output-factor=0 num-features=7 base=/home/sandipan/inc_retrain/MT_sys/En-Fr/dgt/50_i/mmsa_pt/train. L1=en L2=fr
    [weight]
    PT0= 0.1 0.2 0.3 0.4 0.5 0.6 0.7

   num-features in the PBSMT model is 4, which does not work with Mmsapt. What are these 7 weights? Can I use uniform weights for all 7 features? If not, how do I adjust these values?

2. I found there is a significant difference in BLEU score between the standard PBSMT model and the Mmsapt-based model. Is this because of the weights I am using, or am I doing something wrong?

It would be a real help if you could help me understand the above issues. Thanking you.

Regards,
sandipan

--
Ulrich Germann
Research Associate
School of Informatics
University of Edinburgh

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
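Two of the feature types in Uli's list are given as explicit formulas: the rarity penalty 1/(x+1) and the provenance feature x/(x+1), where x is a phrase-pair occurrence count. The Python sketch below only evaluates those two formulas. The function names are invented here, and whether Moses applies any further transformation (for example a logarithm) to these values is not stated in the thread, so none is applied. The lower-bound and lexical scores are left out because the thread gives no formula for them.

```python
def rarity_penalty(count):
    """1/(x+1): equals 1 for an unseen phrase pair and shrinks as it gets frequent."""
    return 1.0 / (count + 1)


def provenance(count):
    """x/(x+1): equals 0 for an unseen phrase pair and approaches 1 as it gets frequent."""
    return count / (count + 1)


# How the two values move as the occurrence count x grows.
for x in (0, 1, 9, 99):
    print(x, rarity_penalty(x), provenance(x))
```

For any count the two values sum to 1, so one provenance score per corpus (background and foreground) carries the same information as a per-corpus rarity penalty would.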
Re: [Moses-support] Incremental retraining
Hi Hieu,

I am at the last step of the incremental training.

1. I have produced the alignment file for the incremental data.
2. I then just appended the new alignment file and the incremental data to the original alignment file and data.
3. Finally, I created the memory-mapped suffix array phrase table (mmsapt).

When I use the new mmsapt information during decoding, I get the following error (decoding was fine before adding the incremental data):

    Created input-output object : [0.086] seconds
    this is a test sentence
    Translating line 0 in thread id 139739320297216
    Translating: this is a test sentence
    binary file loaded, default OFF_T: -1
    Line 0: Initialize search took 0.064 seconds total
    Alignment range error at sentence 53994! 4/7 6/5
    Alignment range error at sentence 54530! 17/19 18/18
    Alignment range error at sentence 50292! 0/13 9/9
    Alignment range error at sentence 50120! 25/36 31/31
    Alignment range error at sentence 55089! 11/27 19/19
    terminate called recursively
    terminate called after throwing an instance of 'terminate called recursively
    terminate called recursively
    util::Exception'
    terminate called recursively
    Aborted

Am I doing anything wrong here?

Thanks and regards,
sandipan

On 8 September 2014 10:11, Sandipan Dandapat sandipandanda...@gmail.com wrote:

Hi, This worked.

Thanks and regards,
sandipan

On 7 September 2014 16:09, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

sorry, i meant

On 7 September 2014 16:08, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

you HAVE to change both

    num-features=7

AND

    [weight]
    PT0= 0.1 0.2 0.3 0.4 0.5 0.6 0.7

On 7 September 2014 16:06, Sandipan Dandapat sandipandanda...@gmail.com wrote:

Hi Hieu,

I also tried with '7' and it fails with the error message

    Exception: moses/ScoreComponentCollection.cpp:248 in void Moses::ScoreComponentCollection::Assign(const Moses::FeatureFunction*, const std::vector<float>) threw util::Exception'.
    Feature function PT0 specified 7 dense scores or weights.
    Actually has 4

In contrast, when I use the binarised phrase table, I use num-features=4 and this works fine. I am attaching the moses.ini file in case I am doing anything wrong there.

Thanks and regards,
sandipan

On 7 September 2014 15:46, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

maybe it's 7 scores

On 7 September 2014 14:59, Sandipan Dandapat sandipandanda...@gmail.com wrote:

Hi Hieu,

I also tried the same, but it generates the error below:

    Exception: moses/TranslationModel/UG/mmsapt.cpp:381 in virtual void Moses::Mmsapt::Load() threw util::Exception because `this->m_feature_names.size() != this->m_numScoreComponents'.
    At moses/TranslationModel/UG/mmsapt.cpp:381: number of feature values provided by Phrase table (7) does not match number specified in Moses config file (4)!

Thanks and regards,
sandipan

On 6 September 2014 09:50, Hieu Hoang hieuho...@gmail.com wrote:

I'm not sure how many scores there are in the phrase table PhraseDictionaryBitextSampling. It may be 4. In which case you must specify

    [feature]
    PhraseDictionaryBitextSampling name=PT0 num-features=4 ...
    [weight]
    PT0= 0.1 0.2 0.3 0.4

On 05/09/14 14:12, Sandipan Dandapat wrote:

Hi,

During incremental retraining I specified the following line in moses.ini:

    PhraseDictionaryBitextSampling name=PT0 output-factor=0 num-features=9 path=/home/sandipan/inc_retrain/MT_sys/EnPl/mtdata_pro/train. L1=en L2=pl pfwd=g pbwd=g smooth=0 sample=1000 workers=1

This generates the error:

    Feature function PT0 specified 9 dense scores or weights. Actually has 0.

That error goes away when num-features is changed to '0', but then the error below is generated:

    Exception: moses/TranslationModel/UG/mmsapt.cpp:381 in virtual void Moses::Mmsapt::Load() threw util::Exception because `this->m_feature_names.size() != this->m_numScoreComponents'.
    At moses/TranslationModel/UG/mmsapt.cpp:381: number of feature values provided by Phrase table (7) does not match number specified in Moses config file (0)!

Changing it to 7 also does not help.
I have also tried with

    Mmsapt name=PT0 output-factor=0 num-features=0 base=/home/sandipan/inc_retrain/MT_sys/EnPl/mtdata_pro/train. L1=en L2=pl

but it does not work. What do I need to do at this stage of retraining using Moses?

Thanks and regards,
sandipan

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
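The "Alignment range error" lines reported at the top of this thread appeared after new data and alignments were appended to the originals. One plausible cause, offered here only as an assumption and not confirmed anywhere in the thread, is that some appended alignment lines no longer match their sentence pair. A small Python sketch of a sanity check is shown below. It assumes tokenized one-sentence-per-line source and target text and alignment lines made of zero-based "i-j" pairs (the symal-style format); the function name is hypothetical and this is not the check Mmsapt itself performs.

```python
def alignment_range_errors(src_lines, trg_lines, aln_lines):
    """Return (sentence_number, point) for every alignment point out of range.

    A point "i-j" is in range when i indexes a source token and j indexes
    a target token of the same sentence pair (both zero-based).
    """
    if not (len(src_lines) == len(trg_lines) == len(aln_lines)):
        raise ValueError("source, target and alignment files differ in length")
    errors = []
    for n, (src, trg, aln) in enumerate(zip(src_lines, trg_lines, aln_lines)):
        src_len, trg_len = len(src.split()), len(trg.split())
        for point in aln.split():
            i, j = (int(x) for x in point.split("-"))
            if not (0 <= i < src_len and 0 <= j < trg_len):
                errors.append((n, point))
    return errors


# Hypothetical two-sentence corpus; the second alignment line is shifted.
src = ["this is a test", "hello world"]
trg = ["ceci est un test", "bonjour monde"]
aln = ["0-0 1-1 2-2 3-3", "0-0 2-1"]
print(alignment_range_errors(src, trg, aln))
```

Running such a check on the concatenated corpus before building the phrase table would show whether the appended part is offset, for example by a missing or extra line.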
[Moses-support] Incremental retraining
Hi,

During incremental retraining I specified the following line in moses.ini:

    PhraseDictionaryBitextSampling name=PT0 output-factor=0 num-features=9 path=/home/sandipan/inc_retrain/MT_sys/EnPl/mtdata_pro/train. L1=en L2=pl pfwd=g pbwd=g smooth=0 sample=1000 workers=1

This generates the error:

    Feature function PT0 specified 9 dense scores or weights. Actually has 0.

That error goes away when num-features is changed to '0', but then the error below is generated:

    Exception: moses/TranslationModel/UG/mmsapt.cpp:381 in virtual void Moses::Mmsapt::Load() threw util::Exception because `this->m_feature_names.size() != this->m_numScoreComponents'.
    At moses/TranslationModel/UG/mmsapt.cpp:381: number of feature values provided by Phrase table (7) does not match number specified in Moses config file (0)!

Changing it to 7 also does not help. I have also tried with

    Mmsapt name=PT0 output-factor=0 num-features=0 base=/home/sandipan/inc_retrain/MT_sys/EnPl/mtdata_pro/train. L1=en L2=pl

but it does not work. What do I need to do at this stage of retraining using Moses?

Thanks and regards,
sandipan
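The errors quoted in this message share one pattern: the num-features value on the feature line and the number of values on the matching [weight] line disagree with each other or with what the phrase table provides. The Python sketch below checks only the first of these, the internal consistency of a moses.ini text. The parsing is deliberately simplified, the function name is made up, and the sample path is a placeholder, so treat it as an illustration and not as a description of how Moses reads its configuration.

```python
import re


def weight_mismatches(ini_text):
    """Map feature name -> (num-features, number of weights) where they differ."""
    declared, weights = {}, {}
    section = None
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "feature":
            name = re.search(r"\bname=(\S+)", line)
            num = re.search(r"\bnum-features=(\d+)", line)
            if name and num:
                declared[name.group(1)] = int(num.group(1))
        elif section == "weight":
            key, _, values = line.partition("=")
            weights[key.strip()] = len(values.split())
    return {
        name: (count, weights.get(name, 0))
        for name, count in declared.items()
        if weights.get(name, 0) != count
    }


# Sample config: seven features declared but only four weights given.
sample_ini = """
[feature]
Mmsapt name=PT0 output-factor=0 num-features=7 base=/path/to/train. L1=en L2=pl

[weight]
PT0= 0.1 0.2 0.3 0.4
"""
print(weight_mismatches(sample_ini))
```

An empty result means the two numbers agree with each other; it says nothing about whether they also match the number of scores the phrase table implementation actually produces.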
[Moses-support] xmlrpc problem
Hi,

I am using inc-giza-pp and xmlrpc-c. Both compiled successfully; my xmlrpc is installed in a local directory and the library files are located in /home/sdandapat/Moses/xmlrpc-c/lib64. The Moses decoder also compiled fine with inc-giza-pp. While building the translation model, it fails with the following error:

    /home/sdandapat/Moses/mosesdecoder/tools/GIZA++: error while loading shared libraries: libxmlrpc_server_abyss++.so.8: cannot open shared object file: No such file or directory
    Exit code: 127
    ERROR: Giza did not produce the output file /home/sdandapat/Moses_retrain/EnPl/train/giza.en-pl/en-pl.Ahmm.5. Is your corpus clean (reasonably-sized sentences)? at /home/sdandapat/Moses/mosesdecoder/scripts/training/train-model.perl line 1199.

I am attaching the entire log. What am I doing wrong? I found a similar post at http://blog.gmane.org/gmane.comp.nlp.moses.user/month=20111201 but could not find the final solution.

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile: http://scholar.google.co.in/citations?user=DWD_FiQJhl=en

    Tokenizer Version 1.1
    Language: en
    Number of threads: 1
    Tokenizer Version 1.1
    Language: pl
    Number of threads: 1
    clean-corpus.perl: processing mtdata/train.tok.en .pl to mtdata/train.clean, cutoff 1-100 .
    Input sentences: 5 Output sentences: 5
    mkdir: cannot create directory ‘lm’: File exists
    Pruning
    === 1/5 Counting and sorting n-grams ===
    Reading /home/sdandapat/Moses_retrain/EnPl/mtdata/train.clean.pl
    5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    Unigram tokens 917274 types 51774
    === 2/5 Calculating and sorting adjusted counts ===
    Chain sizes: 1:621288 2:2931873280 3:5497262592
    Statistics:
    1 51774 D1=0.607105 D2=1.0428 D3+=1.51615
    2 379279 D1=0.801548 D2=1.15112 D3+=1.4064
    3 655686 D1=0.884806 D2=1.19448 D3+=1.37341
    Memory estimate for binary LM:
    type kB
    probing 21729 assuming -p 1.5
    probing 24154 assuming -r models -p 1.5
    trie 9558 without quantization
    trie 5545 assuming -q 8 -b 8 quantization
    trie 8990 assuming -a 22 array pointer compression
    trie 4976 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
    === 3/5 Calculating and sorting initial probabilities ===
    Chain sizes: 1:621288 2:6068464 3:13113720
    === 4/5 Calculating and writing order-interpolated probabilities ===
    Chain sizes: 1:621288 2:6068464 3:13113720
    Chain sizes: 1:621288 2:6068464 3:13113720
    === 5/5 Writing ARPA model ===
    Name:lmplz VmPeak:8778804 kB VmRSS:23180 kB RSSMax:1931028 kB user:2.96355 sys:1.23781 CPU:4.20136 real:4.89953
    Reading lm/train.arpa.pl
    5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    SUCCESS
    Using SCRIPTS_ROOTDIR: /home/sdandapat/Moses/mosesdecoder/scripts
    Using single-thread GIZA
    (1) preparing corpus @ Wed Aug 13 12:44:56 IST 2014
    Executing: mkdir -p /home/sdandapat/Moses_retrain/EnPl/train/corpus
    (1.0) selecting factors @ Wed Aug 13 12:44:56 IST 2014
    (1.1) running mkcls @ Wed Aug 13 12:44:56 IST 2014
    /home/sdandapat/Moses/mosesdecoder/tools/mkcls -c50 -n2 -p/home/sdandapat/Moses_retrain/EnPl/mtdata/train.clean.en -V/home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb.classes opt
    /home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb.classes already in place, reusing
    (1.1) running mkcls @ Wed Aug 13 12:44:56 IST 2014
    /home/sdandapat/Moses/mosesdecoder/tools/mkcls -c50 -n2 -p/home/sdandapat/Moses_retrain/EnPl/mtdata/train.clean.pl -V/home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb.classes opt
    /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb.classes already in place, reusing
    (1.2) creating vcb file /home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb @ Wed Aug 13 12:44:56 IST 2014
    (1.2) creating vcb file /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb @ Wed Aug 13 12:44:56 IST 2014
    (1.3) numberizing corpus /home/sdandapat/Moses_retrain/EnPl/train/corpus/en-pl-int-train.snt @ Wed Aug 13 12:44:57 IST 2014
    /home/sdandapat/Moses_retrain/EnPl/train/corpus/en-pl-int-train.snt already in place, reusing
    (1.3) numberizing corpus /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl-en-int-train.snt @ Wed Aug 13 12:44:57 IST 2014
    /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl-en-int-train.snt already in place, reusing
    (2) running giza @ Wed Aug 13 12:44:57 IST 2014
    (2.1a) running snt2cooc en-pl @ Wed Aug 13 12:44:57 IST 2014
    Executing: mkdir -p /home/sdandapat/Moses_retrain/EnPl/train/giza.en-pl
    Executing: /home/sdandapat/Moses/mosesdecoder/tools/snt2cooc.out /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb /home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb /home/sdandapat
Re: [Moses-support] xmlrpc problem
Hi Barry,

Thanks a ton, it worked. I have a couple of follow-up questions.

1. Do we need to use the moses-server version for incremental retraining with Moses?

2. On the website, I found that we need to run something like

   % zcat ${CORPUS}.${L1}.gz | mtt-build -i -o /some/path/${CORPUS}.${L1}

   But what generates this ${CORPUS}.${L1}.gz file? I am using inc-giza-pp, I compiled mosesdecoder with the --with-mm option, and during training I used option

However, I am unable to figure out how to perform steps 2 and 3 in http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc40

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile: http://scholar.google.co.in/citations?user=DWD_FiQJhl=en

On Wed, Aug 13, 2014 at 8:50 PM, Barry Haddow bhad...@staffmail.ed.ac.uk wrote:

Hi Sandipan

It looks as though GIZA++ cannot find the xmlrpc-c libraries. You need to set your LD_LIBRARY_PATH, like this (assuming bash):

export LD_LIBRARY_PATH=/home/sdandapat/Moses/xmlrpc-c/lib64:$LD_LIBRARY_PATH

then check if you can run GIZA++ from the command line, before you try building the model,

cheers - Barry

On 13/08/14 14:04, Sandipan Dandapat wrote:

Hi,

I am using inc-giza-pp and xmlrpc-c. Both compiled successfully. xmlrpc-c is installed in a local directory, and its library files are located in /home/sdandapat/Moses/xmlrpc-c/lib64. The Moses decoder also compiled fine with inc-giza-pp. While building the translation model, training fails with the following error:

/home/sdandapat/Moses/mosesdecoder/tools/GIZA++: error while loading shared libraries: libxmlrpc_server_abyss++.so.8: cannot open shared object file: No such file or directory
Exit code: 127
ERROR: Giza did not produce the output file /home/sdandapat/Moses_retrain/EnPl/train/giza.en-pl/en-pl.Ahmm.5. Is your corpus clean (reasonably-sized sentences)? at /home/sdandapat/Moses/mosesdecoder/scripts/training/train-model.perl line 1199.

I am attaching the entire log. What am I doing wrong? I found a similar post at http://blog.gmane.org/gmane.comp.nlp.moses.user/month=20111201 but could not find the final solution.

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile: http://scholar.google.co.in/citations?user=DWD_FiQJhl=en

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
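Barry's advice to check GIZA++ from the command line before training can be sketched as a small helper. This is only a sketch, assuming a Linux system where `ldd` is available; the function name `missing_libs` is hypothetical and not part of Moses or GIZA++. It filters `ldd` output for shared libraries that the dynamic loader cannot resolve.

```shell
#!/usr/bin/env bash
# missing_libs (hypothetical helper): read `ldd` output on stdin and print
# the name of every shared library reported as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage with the paths from this thread (adjust to your installation):
#   export LD_LIBRARY_PATH=/home/sdandapat/Moses/xmlrpc-c/lib64:$LD_LIBRARY_PATH
#   ldd /home/sdandapat/Moses/mosesdecoder/tools/GIZA++ | missing_libs
# Empty output suggests that every library was resolved.
```

Reading from stdin keeps the helper independent of any particular binary, so the same filter can be applied to other tools in the Moses tools directory.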
[Moses-support] Error compiling moses decoder --with-mm
Hi,

I was compiling mosesdecoder with './bjam --with-mm' to use the memory-mapped dynamic suffix array phrase table. The build failed. I am attaching the log here. Can you please help me find the problem? The build worked without the --with-mm option.

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile: http://scholar.google.co.in/citations?user=DWD_FiQJhl=en

build.log.gz
Description: GNU Zip compressed data
Re: [Moses-support] Error compiling moses decoder --with-mm
I am using Ubuntu 14.04 and gcc 4.4.7. I used

./bjam --with-mm

I found that the typename keyword was causing the problem, so in ./moses/TranslationModel/UG/mm/ug_bitext.h I changed

typedef typename boost::unordered_map<uint64_t, jstats> trg_map_t;

to

typedef boost::unordered_map<uint64_t, jstats> trg_map_t;

and it compiles successfully.

More questions on retraining:

1. Regarding "the word alignment between these files in symal output format": is it the standard format found in aligned.grow-diag-final, i.e.

   0-0 1-1 2-2 2-3 6-4 3-5 4-5 5-5 6-5 7-6 8-7 9-8
   0-0 1-1 2-2 3-3 4-4
   ...

2. I have the normal training and aligned corpus, i.e. train.en, train.fr and aligned.grow-diag-final. During retraining we need to do the following:

   % zcat ${CORPUS}.${L1}.gz | mtt-build -i -o /some/path/${CORPUS}.${L1}
   % zcat ${CORPUS}.${L2}.gz | mtt-build -i -o /some/path/${CORPUS}.${L2}
   % zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
   % mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex -c /some/path/${CORPUS}.${L1}-${L2}.coc

   I do not understand how the ${CORPUS}.${L1}.gz, ${CORPUS}.${L2}.gz and ${CORPUS}.${L1}-${L2}.symal.gz files are generated. Or do they refer to the standard training and alignment files?

Thanks,

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile: http://scholar.google.co.in/citations?user=DWD_FiQJhl=en

On Wed, Jul 30, 2014 at 4:51 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

What OS version are you using? What gcc version are you using? What is the exact command you used to compile?

On 30 July 2014 14:47, Sandipan Dandapat sandipandanda...@gmail.com wrote:

Hi,

I was compiling mosesdecoder with './bjam --with-mm' to use the memory-mapped dynamic suffix array phrase table. The build failed. I am attaching the log here. Can you please help me find the problem? The build worked without the --with-mm option.

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile: http://scholar.google.co.in/citations?user=DWD_FiQJhl=en

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
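On the question of where the .gz inputs come from: one plausible reading, which nobody in this thread confirms, is that they are gzip-compressed copies of the tokenized training files and of the symal-format alignment (for example aligned.grow-diag-final). The sketch below works under that assumption. It packs three such files under the ${CORPUS}.${L1}.gz naming used by the commands above, after checking that the line counts agree and that every alignment token looks like i-j. The function names `check_symal` and `pack_corpus` and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# check_symal FILE (hypothetical helper): succeed only if every
# whitespace-separated token in FILE has the form <int>-<int>, e.g. "0-0 1-1".
check_symal() {
    awk '{ for (i = 1; i <= NF; i++) if ($i !~ /^[0-9]+-[0-9]+$/) bad = 1 }
         END { exit (bad ? 1 : 0) }' "$1"
}

# pack_corpus SRC TGT ALIGN PREFIX L1 L2 (hypothetical helper):
# write PREFIX.L1.gz, PREFIX.L2.gz and PREFIX.L1-L2.symal.gz.
# Assumption: the .gz inputs are plain gzip copies of the training files.
pack_corpus() {
    local src=$1 tgt=$2 align=$3 prefix=$4 l1=$5 l2=$6
    # A parallel corpus and its alignment should have one line per sentence pair.
    [ "$(wc -l < "$src")" -eq "$(wc -l < "$tgt")" ] || { echo "line count mismatch: $src vs $tgt" >&2; return 1; }
    [ "$(wc -l < "$src")" -eq "$(wc -l < "$align")" ] || { echo "line count mismatch: $src vs $align" >&2; return 1; }
    check_symal "$align" || { echo "unexpected alignment format: $align" >&2; return 1; }
    gzip -c "$src"   > "${prefix}.${l1}.gz"
    gzip -c "$tgt"   > "${prefix}.${l2}.gz"
    gzip -c "$align" > "${prefix}.${l1}-${l2}.symal.gz"
}

# Example call (paths are illustrative only):
#   pack_corpus train.en train.fr aligned.grow-diag-final /some/path/corpus en fr
```

Failing early on a line-count or format mismatch is a deliberate choice here, so that a mismatched alignment file is caught before any later indexing step runs on it.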
[Moses-support] Issues with Incremental retraining using Moses
Hi,

I am trying to use Moses incremental retraining as described in http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc33

I have two questions:

1. I am able to generate the new alignment file using additional data on top of the previously used data. Once the new alignment file is generated, the page says to update the model. I do not understand how to use it during decoding. Can you please help me understand how to proceed once my new alignment file is generated?

2. What happens when we update the moses.ini file using

   PhraseDictionaryDynSuffixArray source=path-to-source-corpus target=path-to-target-corpus alignment=path-to-alignments

   I am unable to see any reference to this updated moses.ini file in the rest of the section.

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile: http://scholar.google.co.in/citations?user=DWD_FiQJhl=en
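The moses.ini entry quoted in the message takes three paths. As a minimal sketch, the hypothetical helper below assembles that line from a source corpus, a target corpus and an alignment file. The key names are copied from the message; the function name `dyn_sa_line` and the example paths are assumptions. Where the line belongs inside moses.ini depends on the Moses version, so the helper only prints it and leaves editing the file to the user.

```shell
#!/usr/bin/env bash
# dyn_sa_line SRC TGT ALIGN (hypothetical helper): print the
# PhraseDictionaryDynSuffixArray entry quoted in the message, with the three
# placeholders filled in. Refuses to print if any input file is missing.
dyn_sa_line() {
    local f
    for f in "$1" "$2" "$3"; do
        [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
    done
    printf 'PhraseDictionaryDynSuffixArray source=%s target=%s alignment=%s\n' "$1" "$2" "$3"
}

# Example call (paths are illustrative only):
#   dyn_sa_line corpus/train.en corpus/train.fr model/aligned.grow-diag-final
```

Checking that the three files exist before printing is meant to catch a mistyped path before the decoder is started.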