Dear Raj,

I also tried to use your scripts for incremental alignment. I copied your Python script into the required directory, but I still get the same error that Ihab posted:

reading vocabulary files
Reading vocabulary file from:new_corpus/inc.fr.vcb
ERROR: TOKEN ID must be unique for each token, in line :
24 roi 2
TOKEN ID 24 has already been assigned to: roi

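One way to narrow this down is to check whether the .vcb file really lists the same ID on more than one line. The sketch below assumes each line holds three whitespace-separated fields (ID, token, count), as the "24 roi 2" line in the error suggests; the function name and the sample data are made up for illustration and are not part of MGIZA++ or of Raj's scripts.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {token_id: [tokens]} for every ID that occurs on more than one line."""
    seen = defaultdict(list)
    for raw in lines:
        parts = raw.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token_id, token = parts[0], parts[1]
        seen[token_id].append(token)
    return {tid: toks for tid, toks in seen.items() if len(toks) > 1}


# Made-up sample in the assumed "ID token count" layout.
sample = ["2 le 10", "24 roi 2", "25 reine 1", "24 roi 2"]
print(find_duplicate_ids(sample))
```

To run it on a real file, pass an open file handle instead of the list, for example open("new_corpus/inc.fr.vcb", encoding="utf-8"). If an ID shows up twice with the same token, the file was probably written or appended twice rather than built from conflicting vocabularies.
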
I used only 500 sentence pairs for full_train.sh and it worked fine, giving 758 lines in the corpus/tgt_filename.vcb file. I used only 10 sentences for the incremental alignment with align_new.sh, which produced the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb.

Is there a problem here? Could you please help me with this?

Thanks and regards,
sandipan

On 4 November 2014 16:13, prajdabre <prajda...@gmail.com> wrote:

> Dear Ihab.
> There is a python script that was in the google drive folder in the
> first mail I sent you.
> Please replace the existing file with my copy.
>
> It has to work.
>
> Regards.
>
> Sent from Samsung Mobile
>
> -------- Original message --------
> From: Ihab Ramadan <i.rama...@saudisoft.com>
> Date: 05/11/2014 00:54 (GMT+09:00)
> To: 'Raj Dabre' <prajda...@gmail.com>
> Cc: moses-support@mit.edu
> Subject: RE: [Moses-support] Incremental training
>
> Dear Raj,
>
> Your point is clear and I tried to follow the steps you mentioned, but I
> am now stuck at the align_new.sh script, which gives me this error:
>
> reading vocabulary files
> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
> ERROR: TOKEN ID must be unique for each token, in line :
> 29107 q-1 4
>
> Do you have any idea what this error means?
>
> *From:* Raj Dabre [mailto:prajda...@gmail.com]
> *Sent:* Tuesday, November 4, 2014 12:06 PM
> *To:* i.rama...@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
> Dear Ihab,
>
> Perhaps I should have explained more clearly what my scripts do.
> Sorry for that.
>
> Let me start with this: there is no direct or easy way to generate the
> moses.ini file you need.
>
> 1. Suppose you have 2 million lines of parallel corpora and you trained an
> SMT system on them. This naturally gives you the phrase table, the
> reordering table and moses.ini.
>
> 2. Suppose you then get 500k more lines of parallel corpora. There are 2
> ways to proceed:
>
> a. Retrain on all 2.5 million lines from scratch (this takes a long time:
> ~2-3 days on a regular machine).
>
> b. Train on only the 500k new lines using the alignment information of
> the original training data (faster: ~6-7 hours).
>
> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.*
>
> 1. full_train.sh: trains on the original corpus of 2 million lines
> (generates alignment files only for the original corpus).
>
> 2. align_new.sh: trains on the new corpus of 500k lines (generates
> alignment files only for the new corpus, using the alignments from 1).
>
> *Why this split?* Because the basic training step of Moses does not
> preserve the alignment probability information. Only the alignments are
> saved. To continue training we need the probability information.
>
> You can pass flags to Moses to preserve this information (the flag is
> --giza-option). If you do this you will not need full_train.sh, but you
> will have to change the config files before using align_new.sh.
>
> *HOW TO GET UPDATED PHRASE TABLE:*
>
> 1. Append the forward alignments (fwd) generated by align_new.sh to the
> forward (fwd) alignments generated by full_train.sh.
> 2. Append the inverse alignments (inv) generated by align_new.sh to the
> inverse (inv) alignments generated by full_train.sh.
> 3. Run the moses training script with additional flags:
>
> - --first-step: first step in the training process (default 1). This
>   will be 4.
> - --last-step: last step in the training process (default 7). This
>   will remain 7.
> - --giza-f2e: <path to folder>/new_giza.fwd
> - --giza-e2f: <path to folder>/new_giza.inv
>
> For example:
>
> ~/mosesdecoder/scripts/training/train-model.perl \
>   -root-dir <your training directory> \
>   -corpus <your new corpus name> \
>   -f <src> -e <tgt> -alignment grow-diag-final-and \
>   -reordering msd-bidirectional-fe \
>   -lm 0:3:<path to LM>:8 \
>   --first-step 4 --last-step 7 \
>   --giza-f2e <path to folder>/new_giza.fwd \
>   --giza-e2f <path to folder>/new_giza.inv \
>   -external-bin-dir <path to giza++ binaries>
>
> For more details on the training step read this:
> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>
> This assumes that you already have the alignments. It continues with
> phrase extraction and reordering and generates the new moses.ini file.
>
> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>
> If anything is still unclear then please ask and I will try to help you
> as much as I can.
>
> Regards.
>
> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <i.rama...@saudisoft.com>
> wrote:
>
> Dear Raj,
>
> That’s great work, my friend.
>
> These files make the script work, but it takes a long time to finish and
> it did not generate the model folder which contains the moses.ini file.
>
> Is this normal?
>
> I am now trying to run it again, as I suspect that the server was shut
> down before the training completed, but I notice that it starts from the
> beginning and does not use the files it already generated.
>
> Thanks Raj, it is still great work.
>
> *From:* Raj Dabre [mailto:prajda...@gmail.com]
> *Sent:* Thursday, October 30, 2014 4:54 PM
> *To:* i.rama...@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
> Ahh.... I totally forgot that part.
>
> Sorry.
>
> PFA.
>
> Just place them in the folder where the shell scripts full_train.sh and
> align_new.sh are.
>
> Hopefully it should run now.
>
> Please let me know if you succeed.
>
> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <i.rama...@saudisoft.com>
> wrote:
>
> Dear Raj,
>
> It is a great solution.
>
> I installed MGIZA++ successfully and I am using your scripts to run
> training.
>
> I followed the steps you mentioned, but I got this error when running
> the full_train.sh script:
>
> bla bla bla
> .
> .
> .
> .
>
> Starting MGIZA
> Initializing Global Paras
> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
> ERROR: Cannot open configuration file configgiza.fwd!
> Starting MGIZA
> Initializing Global Paras
> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
> ERROR: Cannot open configuration file configgiza.rev!
>
> These two files do not exist.
> Should they be generated by the installation?
> How do I get them?
>
> *From:* Raj Dabre [mailto:prajda...@gmail.com]
> *Sent:* Sunday, October 26, 2014 6:21 PM
> *To:* i.rama...@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
> Hello Ihab,
>
> I would suggest using mgiza++.
> http://www.kyloo.net/software/doku.php/mgiza:overview
>
> It is very easy to use.
>
> I also wrote some scripts to make training easier.
> Visit the link below for my scripts.
>
> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>
> Usage:
>
> To train basic IBM models:
> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>
> To align 2 new files using previously trained models (i.e. continue
> training):
> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name>
> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base>
> <corpus_folder_base> <path_to_mgizapp_installation>
>
> There is also a python script which you should use to replace the one in
> the scripts folder of mgiza++. I have modified it to work with my scripts.
>
> Hope this helps.
>
> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <i.rama...@saudisoft.com>
> wrote:
>
> Dear All,
>
> I just need clear steps on how to do incremental training in Moses, as
> the explanation in the manual is not clear enough.
>
> Thanks
>
> Best Regards
>
> *Ihab Ramadan* | Senior Developer | Saudisoft <http://www.saudisoft.com/> -
> Egypt | *Tel* +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax
> +20233032036
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Raj Dabre.
> Research Student,
> Graduate School of Informatics,
> Kyoto University.
> CSE MTech, IITB., 2011-2014
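
The three steps under "HOW TO GET UPDATED PHRASE TABLE" in Raj's quoted message (append the fwd alignments, append the inv alignments, rerun train-model.perl from step 4) could be sketched in Python roughly as below. This is only an illustration under assumptions: the helper names, the demo file names and every path in the demo are hypothetical, and the flags are simply the ones quoted in the message. The corpus given to -corpus presumably has to line up sentence for sentence with the combined alignment files.

```python
import tempfile
from pathlib import Path


def append_alignments(old_path, new_path, out_path):
    """Write the old alignment file followed by the new one to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in (old_path, new_path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # Guard against a missing newline at the end of a file.
                    out.write(line if line.endswith("\n") else line + "\n")


def training_command(moses_dir, root_dir, corpus, src, tgt, lm_path,
                     giza_dir, bin_dir):
    """Build (but do not run) the train-model.perl call for steps 4 to 7."""
    return [
        str(Path(moses_dir) / "scripts" / "training" / "train-model.perl"),
        "-root-dir", str(root_dir),
        "-corpus", str(corpus),
        "-f", src, "-e", tgt,
        "-alignment", "grow-diag-final-and",
        "-reordering", "msd-bidirectional-fe",
        "-lm", f"0:3:{lm_path}:8",
        "--first-step", "4", "--last-step", "7",
        "--giza-f2e", str(Path(giza_dir) / "new_giza.fwd"),
        "--giza-e2f", str(Path(giza_dir) / "new_giza.inv"),
        "-external-bin-dir", str(bin_dir),
    ]


# Demo with throwaway files; the contents are placeholders, not real alignments.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "old.fwd").write_text("old line 1\nold line 2\n", encoding="utf-8")
    (tmp / "new.fwd").write_text("new line 1", encoding="utf-8")
    append_alignments(tmp / "old.fwd", tmp / "new.fwd", tmp / "new_giza.fwd")
    combined = (tmp / "new_giza.fwd").read_text(encoding="utf-8")

# All paths below are hypothetical placeholders.
cmd = training_command("/opt/mosesdecoder", "work", "work/corpus", "fr", "en",
                       "/path/to/lm", "work/giza", "/path/to/mgiza/bin")
print(combined, end="")
print(" ".join(cmd))
```

The command is returned as a list so that it can be inspected first and then handed to subprocess.run once the paths have been checked, which fits the warning in the message that wrong filenames or paths make the training fail.
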