Re: [Moses-support] Incremental training

2014-11-20 Thread Sandipan Dandapat
Hi Raj,
I am still getting the same error as follows:
reading vocabulary files
Reading vocabulary file from:new_corpus/inc.fr.vcb
ERROR: TOKEN ID must be unique for each token, in line :
2 traité 34
TOKEN ID 2 has already been assigned to: traité

Your script is generating duplicate items. Maybe you can forward me the
.py script again; I hope we are not using different versions of it!

However, I made some changes to the .py script based on your suggestion and
it is now working without any error. Please see the attached scripts.
Regards,
sandipan


On 20 November 2014 09:24, Raj Dabre prajda...@gmail.com wrote:

 Hey,

 I just remembered that I have a pathetic memory.
 I forgot to add the lines for sorting the .vcb file in increasing order of
 id.

 Just add the following lines to align_new.sh after the line -
 $MGIZA/scripts/plain2snt-hasvcb.py ../corpus/$4.vcb ../corpus/$3.vcb $2 $1
 $2_$1.snt $1_$2.snt  $2.vcb $1.vcb :

 sort -n $1.vcb > tmp
 mv tmp $1.vcb
 sort -n $2.vcb > tmp
 mv tmp $2.vcb

 And it will run perfectly. I am sure of it. I used your folder just to be
 sure. It works.
 Sorry for my silliness. Lemme know if it works now.

 Regards.
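A minimal Python sketch of the same idea, for checking a .vcb file before passing it to the aligner. It assumes each line has the layout "id token count", which is inferred only from the error output quoted in this thread (e.g. "2 traité 34"); the function names are illustrative and not part of MGIZA.

```python
def parse_vcb_line(line):
    # Assumed layout (inferred from the error message): "<id> <token> <count>".
    token_id, token, count = line.split()
    return int(token_id), token, int(count)


def find_duplicate_ids(lines):
    """Return (id, first_token, later_token) for every ID that is reused."""
    seen = {}
    duplicates = []
    for line in lines:
        if not line.strip():
            continue
        token_id, token, _ = parse_vcb_line(line)
        if token_id in seen:
            duplicates.append((token_id, seen[token_id], token))
        else:
            seen[token_id] = token
    return duplicates


def sort_by_id(lines):
    """Order entries by numeric ID, comparable to `sort -n` on the file."""
    entries = [line for line in lines if line.strip()]
    return sorted(entries, key=lambda line: parse_vcb_line(line)[0])


# Made-up sample entries, not taken from a real vocabulary file.
sample = ["3 roi 2", "2 traité 34", "2 le 7"]
print(find_duplicate_ids(sample))
print(sort_by_id(sample))
```

Sorting only changes the order of the entries; any duplicate IDs reported by the first function would still have to be fixed in the script that generates the file.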

 On Thu, Nov 20, 2014 at 1:13 AM, Raj Dabre prajda...@gmail.com wrote:

 Well then your paths must be wrong.
 I can't see why the files are not being generated.
 I'll look into it tomorrow and let you know.


 On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat sandipandanda...@gmail.com
 wrote:

 When I use your script, there is no problem. But when I modify the lines
 to nid = len(fvcb)+2; no .vcb files are created in the new_corpus/ dir.
 I used these two commands:

 sh full_train.sh org.en org.fr
  sh align_new.sh inc.en inc.fr org.en org.fr

 Is the above right?

 I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE,
 NEW_CORPUS_BASE) hard-coded in the scripts.


 On 19 November 2014 15:49, Raj Dabre prajda...@gmail.com wrote:

 Cannot open file???
 Does the file exist??
 Are you passing the path properly?


 On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat 
 sandipandanda...@gmail.com wrote:

 Hi,
 I made the changes based on your suggestions; it's now generating a
 different error, as below:


 reading vocabulary files
 Reading vocabulary file from:new_corpus/inc.fr.vcb

 Cannot open vocabulary file new_corpus/inc.fr.vcbfil

 I am attaching the working dir and the .py scripts herewith. The 10
 parallel sentences for incremental alignment are in inc_data/, whereas
 the original 500 sentences are in the mtdata/ directory.

 Thanks a ton for your help.

 Regards,
 sandipan

 On 19 November 2014 15:18, Raj Dabre prajda...@gmail.com wrote:

 Hey,

 I am pretty sure that my script does not generate duplicate token id.

 In fact, I used to get the same error till I modified the script.

 In case you do want to avoid this error and not use my script then:

 1. Open the original python script: plain2snt-hasvcb.py
 2. There is a line which increments the id counter by 1 (the line is
 nid = len(fvcb)+1;)
 3. Change this line to: nid = len(fvcb)+2; (This is because the id
 numbering starts from 1, and thus if you have 23 tokens then the ids will
 go from 2 to 24. The original update script will do: nid = 23 + 1 = 24,
 which is already taken, and the modification will give 25 correctly.) The
 same change is needed in a second place: nid = len(evcb)+2;

 Do this and it will work.
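The arithmetic in step 3 can be illustrated with a short Python sketch. It assumes, as the message describes, that 23 existing tokens occupy IDs 2 to 24; the dictionary and the helper name are hypothetical and do not come from plain2snt-hasvcb.py.

```python
def next_free_id(vocab, first_id=2):
    """First unused ID when existing IDs run from first_id upwards without gaps."""
    return len(vocab) + first_id


# Hypothetical vocabulary: 23 tokens holding IDs 2..24.
vocab = {"tok%d" % i: i for i in range(2, 25)}

highest_used = max(vocab.values())
original_choice = len(vocab) + 1       # what "len(fvcb)+1" would pick
patched_choice = next_free_id(vocab)   # what "len(fvcb)+2" would pick

print(highest_used, original_choice, patched_choice)
```

Under these assumptions the original expression picks an ID that is already in use, which matches the "TOKEN ID must be unique" error, while the patched expression picks the next free one.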

 In any case... send me a zip file of your working directory (if it's
 small; you are testing it on small data, right?). I will see what the
 problem is.




Re: [Moses-support] Incremental training

2014-11-19 Thread Sandipan Dandapat
Dear Raj,
I also tried to use your scripts for incremental alignment. I copied your
Python script into the desired directory, but I am still receiving the same
error as posted by Ihab.
reading vocabulary files
Reading vocabulary file from:new_corpus/inc.fr.vcb
ERROR: TOKEN ID must be unique for each token, in line :
24 roi 2
TOKEN ID 24 has already been assigned to: roi

I took only 500 sentence pairs for full_train.sh and it worked fine, with
758 lines in the corpus/tgt_filename.vcb file.

I took only 10 sentences for incremental alignment_new.sh, which generated
the error, and I found 8054 lines in the new_corpus/new_tgt_file.vcb.
Is there any problem? Can you please help me with this?

Thanks and regards,
sandipan


On 4 November 2014 16:13, prajdabre prajda...@gmail.com wrote:

 Dear Ihab.
 There is a python script that was there in the google drive folder in the
 first mail I sent you.
 Please replace the existing file with my copy.

 It has to work.

 Regards.





  Original message 
 From: Ihab Ramadan i.rama...@saudisoft.com
 Date: 05/11/2014 00:54 (GMT+09:00)
 To: 'Raj Dabre' prajda...@gmail.com
 Cc: moses-support@mit.edu
 Subject: RE: [Moses-support] Incremental training


 Dear Raj,

 Your point is clear and I tried to follow the steps you mentioned, but I
 am stuck now at the align_new.sh script, which gives me this error:

 reading vocabulary files

 Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb

 ERROR: TOKEN ID must be unique for each token, in line :

 29107 q-1 4

 Do you have any idea what this error means?



 *From:* Raj Dabre [mailto:prajda...@gmail.com]
 *Sent:* Tuesday, November 4, 2014 12:06 PM
 *To:* i.rama...@saudisoft.com
 *Cc:* moses-support@mit.edu
 *Subject:* Re: [Moses-support] Incremental training



 Dear Ihab,

 Perhaps I should have mentioned much more clearly what my script does.
 Sorry for that.

 Let me start with this: There is no direct/easy way to generate the
 moses.ini file as you need.

 1. Suppose you have 2 million lines of parallel corpora and you trained an
 SMT system on them. This naturally gives the phrase table, reordering table
 and moses.ini.

 2. Suppose you then get 500k more lines of parallel corpora. There are 2
 options:

 a. Retrain on all 2.5 million lines from scratch (will take lots of time:
 ~2-3 days on a regular machine)

 b. Train on only the 500k new lines using the alignment information of
 the original training data. (Faster: ~6-7 hours).



 What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.*

 1. full_train.sh -- This trains on the original corpus of 2
 million lines. (Generate alignment files only for the original corpus)

 2. align_new.sh -- This trains on the new corpus of 500 k
 lines. (Generate alignment files only for the new corpus using the
 alignments for 1)



 *Why this split?* Because the basic training step of Moses does not
 preserve the alignment probability information. Only the alignments are
 saved. To continue training we need the probability information.

 You can pass flags to Moses to preserve this information (the flag is
 --giza-option). If you do this then you will not need full_train.sh, but
 you will have to change the config files before using align_new.sh.

 *HOW TO GET UPDATED PHRASE TABLE:*

 1. Append the forward alignments (fwd) generated by align_new.sh to the
 forward (fwd) alignments generated by full_train.sh.
 2. Append the inverse alignments (inv) generated by align_new.sh to the
 inverse (inv) alignments generated by full_train.sh.

 3. Run the moses training script with additional flags:

- --first-step -- first step in the training process (default 1). This
  will be 4.
- --last-step -- last step in the training process (default 7). This will
  remain 7.
- --giza-f2e -- path to folder/new_giza.fwd
- --giza-e2f -- path to folder/new_giza.inv

 For example:

 ~/mosesdecoder/scripts/training/train-model.perl \
  -root-dir <your training directory> \
  -corpus <your new corpus name> \
  -f src -e tgt -alignment grow-diag-final-and \
  -reordering msd-bidirectional-fe \
  -lm 0:3:<path to LM>:8 \
  --first-step 4 --last-step 7 \
  --giza-f2e <path to folder>/new_giza.fwd \
  --giza-e2f <path to folder>/new_giza.inv \
  -external-bin-dir <path to giza++ binaries>

 For more details on the training step read this:
 http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters

 This assumes that you already have the alignments; it continues with phrase
 extraction and reordering, and generates the new moses.ini file.

 WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
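The three steps above could be scripted roughly as follows. This is only a sketch in Python: it assumes the alignment files can be concatenated byte for byte (same format, the first file ending in a newline), and it uses only the train-model.perl flags that appear in this message. The file names, directories and helper names in the demo are placeholders.

```python
import os
import shutil
import tempfile


def append_files(base_path, new_path, out_path):
    """Write base_path followed by new_path into out_path, byte for byte."""
    with open(out_path, "wb") as out:
        for path in (base_path, new_path):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def build_train_command(script, root_dir, corpus, src, tgt, lm_spec,
                        giza_f2e, giza_e2f, external_bin_dir):
    """Argument list for steps 4 to 7, using only flags quoted in the message."""
    return [
        script,
        "-root-dir", root_dir,
        "-corpus", corpus,
        "-f", src, "-e", tgt,
        "-alignment", "grow-diag-final-and",
        "-reordering", "msd-bidirectional-fe",
        "-lm", lm_spec,
        "--first-step", "4",
        "--last-step", "7",
        "--giza-f2e", giza_f2e,
        "--giza-e2f", giza_e2f,
        "-external-bin-dir", external_bin_dir,
    ]


# Self-check with throwaway files; the contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    old_fwd = os.path.join(tmp, "old.fwd")
    new_fwd = os.path.join(tmp, "new.fwd")
    merged = os.path.join(tmp, "new_giza.fwd")
    with open(old_fwd, "wb") as f:
        f.write(b"old alignment line\n")
    with open(new_fwd, "wb") as f:
        f.write(b"new alignment line\n")
    append_files(old_fwd, new_fwd, merged)
    with open(merged, "rb") as f:
        combined = f.read()

# Placeholder paths; replace them with the real locations before running Moses.
command = build_train_command(
    "train-model.perl", "work", "corpus/all", "src", "tgt",
    "0:3:/path/to/lm:8", "work/new_giza.fwd", "work/new_giza.inv", "bin")
print(combined)
print(" ".join(command))
```

The argument list could be handed to subprocess.run once the placeholder paths are replaced. The inverse alignments would be merged the same way as the forward ones.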



 If you are still unclear then please ask and I will try to help you as
 much as I can.

 Regards.







 On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan i.rama...@saudisoft.com
 wrote:

 Dear Raj,

 That’s a great work my friend,

 This files 

Re: [Moses-support] Incremental training

2014-11-19 Thread Sandipan Dandapat
When I use your script, there is no problem. But when I modify the
lines to nid = len(fvcb)+2; no .vcb files are created in the new_corpus/ dir.
I used these two commands:

sh full_train.sh org.en org.fr
 sh align_new.sh inc.en inc.fr org.en org.fr

Is the above right?

I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE, NEW_CORPUS_BASE)
hard-coded in the scripts.



Re: [Moses-support] Help in Moses Incremental Retraing

2014-10-24 Thread Sandipan Dandapat
Hi Ulrich,
Sorry for sending the doubts to you directly.  I will keep in mind to post
my future queries to moses-support.

Thanks a lot for the clarification. Let me play with the MERT.

Thanks and regards,
sandipan

On 24 October 2014 01:51, Ulrich Germann ulrich.germ...@gmail.com wrote:

 Hi Sandipan,

 first, please post Moses-related questions to moses-support@mit.edu, not
 individual contributors.

 second, the current seven features used by Mmsapt /
 PhraseDictionaryBitextSampling are listed below (for details, see my
 recent paper on this phrase table implementation:
 https://www.researchgate.net/publication/267270863_Dynamic_Phrase_Tables_for_Machine_Translation_in_an_Interactive_Post-editing_Scenario
 ).

 THE STANDARD SET OF FEATURES MAY CHANGE AT ANY TIME, as this is still work
 in progress.

 - forward and backward lexically smoothed phrase scores (2 scores; same as
 standard features)
 - rarity penalty (1/(x+1)), where x is the number of phrase pair
 occurrences in the corpus/sample (1 score)
 - the lower bound on forward and backward phrase-level probabilities, with
 confidence level .99 (2 scores)
 - 2 provenance features (x/(x+1)), where x is the number of phrase pair
 occurrences in the (static) background and (dynamic) foreground corpus (2
 scores)
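The two closed-form features in this list can be written out directly. The Python sketch below covers only the rarity penalty 1/(x+1) and the provenance feature x/(x+1) as stated above; whether the decoder uses the raw values or a transformation of them is not specified here, and the function names are illustrative.

```python
def rarity_penalty(occurrences):
    """1/(x+1): close to 1 for rare phrase pairs, approaching 0 for frequent ones."""
    return 1.0 / (occurrences + 1)


def provenance(occurrences):
    """x/(x+1): 0 for pairs unseen in a corpus, approaching 1 for frequent ones."""
    return occurrences / (occurrences + 1.0)


# Made-up occurrence counts for illustration.
for x in (0, 1, 9):
    print(x, rarity_penalty(x), provenance(x))
```

For any count x the two values sum to 1, so the rarity penalty falls as the provenance score rises.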

 third, you need to retrain the feature weights for good performance. Any of
 the standard techniques will do; I usually use MERT. The
 executable simulate-pe allows you to feed in references and word
 alignments one sentence at a time; there are additional parameters
 --spe-src, --spe-trg, --spe-aln to specify source, target, and alignment
 (symal output format). Source and target files are one sentence per line,
 tokenized. Michael Denkowski is currently in the process of integrating
 online tuning into Moses, but I'm not sure whether that's ready to be
 deployed yet.

 Regards - Uli



 On Thu, Oct 23, 2014 at 1:47 AM, Sandipan Dandapat 
 sandipandanda...@gmail.com wrote:

 Dear Ulrich,
 I got your reference from Prashanta Mathur. I am a postdoctoral
 researcher at CNGL, DCU, and I am working with Moses incremental
 retraining. It would be great if you could help me with a couple of doubts:

 1. I found there are 7 weights to define for PT0 (PT0 is the Mmsapt name)
 i.e.

 Mmsapt name=PT0 output-factor=0 num-features=7
 base=/home/sandipan/inc_retrain/MT_sys/En-Fr/dgt/50_i/mmsa_pt/train. L1=en
 L2=fr
 [weight]
 PT0= 0.1 0.2 0.3 0.4 0.5 0.6 0.7

 num-features in the PBSMT model is 4, which does not work with Mmsapt. What
 are these 7 weights? Can I use uniform weights for all 7 features? If not,
 how do I adjust these values?

 2. I found there is a significant difference in BLEU score between the
 standard PBSMT model and the Mmsapt-based model. Is this because
 of the weights I am using, or am I doing something wrong?

 It would be a great help if you could help me understand the above issues.
 Thanking you.

 Regards,
 sandipan





 --
 Ulrich Germann
 Research Associate
 School of Informatics
 University of Edinburgh

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Incremental retraining

2014-09-09 Thread Sandipan Dandapat
Hi Hieu,
I am at the last step of the incremental training.
1. I have produced the alignment file for the incremental data.
2. I then just appended the new alignment file and the incremental data to
the original alignment file and data.
3. Furthermore, I created the memory-mapped suffix array phrase
table (mmsapt).

When I use the new mmsapt during decoding I get the
following error (decoding worked fine before adding the incremental data):

Created input-output object : [0.086] seconds
this is a test sentence
Translating line 0  in thread id 139739320297216
Translating: this is a test sentence
binary file loaded, default OFF_T: -1
Line 0: Initialize search took 0.064 seconds total
Alignment range error at sentence 53994!
4/7 6/5

Alignment range error at sentence 54530!
17/19 18/18

Alignment range error at sentence 50292!
0/13 9/9

Alignment range error at sentence 50120!
25/36 31/31

Alignment range error at sentence 55089!
11/27 19/19

terminate called recursively
terminate called after throwing an instance of 'terminate called recursively
terminate called recursively
util::Exception'
terminate called recursively
Aborted

I am not sure if I am doing anything wrong here?
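One way to look for the cause is to check the appended files line by line. The Python sketch below assumes the alignment file uses 0-based "i-j" pairs (source index, target index), one sentence per line, and that the corpus files are tokenized with one sentence per line. The exact meaning of the number pairs in the error output above is not confirmed here, so this is only a diagnostic aid with illustrative names.

```python
def out_of_range_points(src_line, tgt_line, aln_line):
    """Return alignment points that fall outside either sentence."""
    src_len = len(src_line.split())
    tgt_len = len(tgt_line.split())
    bad = []
    for point in aln_line.split():
        s, t = (int(v) for v in point.split("-"))
        if not (0 <= s < src_len and 0 <= t < tgt_len):
            bad.append((s, t))
    return bad


def find_bad_sentences(src_lines, tgt_lines, aln_lines):
    """Return (sentence_number, bad_points) for every sentence with a problem."""
    report = []
    for n, (src, tgt, aln) in enumerate(zip(src_lines, tgt_lines, aln_lines)):
        bad = out_of_range_points(src, tgt, aln)
        if bad:
            report.append((n, bad))
    return report


# Made-up data: the second alignment line refers to a second source word
# that does not exist.
src = ["this is a test", "hello"]
tgt = ["c' est un test", "bonjour"]
aln = ["0-0 1-1 2-2 3-3", "0-0 1-0"]
print(find_bad_sentences(src, tgt, aln))
```

If every reported sentence number lies in the appended part of the corpus, that would suggest the new alignment lines no longer match the sentences they were appended next to.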

Thanks and regards,
sandipan


On 8 September 2014 10:11, Sandipan Dandapat sandipandanda...@gmail.com
wrote:

 Hi,
 This worked.
 Thanks and regards,
 sandipan

 On 7 September 2014 16:09, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

 sorry, i meant


 On 7 September 2014 16:08, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

 you HAVE to change both
   num-features=7
 AND
   [weight]
   PT0= 0.1 0.2 0.3 0.4 0.5 0.6 0.7
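A small consistency check can catch this kind of mismatch before starting the decoder. The Python sketch below compares the num-features value of each feature line with the number of weights listed under the same name. It handles only the simple "[feature]" / "[weight]" layout shown in this thread, and the base= path in the sample is a placeholder.

```python
def mismatched_features(ini_text):
    """Map feature name -> (declared num-features, weights given) where they differ."""
    section = None
    declared = {}
    weights = {}
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "feature":
            # Skip the feature type (first field); keep key=value fields.
            fields = dict(f.split("=", 1) for f in line.split()[1:] if "=" in f)
            if "name" in fields and "num-features" in fields:
                declared[fields["name"]] = int(fields["num-features"])
        elif section == "weight":
            name, _, values = line.partition("=")
            weights[name.strip()] = len(values.split())
    return {name: (count, weights.get(name, 0))
            for name, count in declared.items()
            if weights.get(name, 0) != count}


# Sample config with a placeholder path: 7 features declared, 4 weights given.
sample_ini = """
[feature]
Mmsapt name=PT0 output-factor=0 num-features=7 base=/path/to/train. L1=en L2=fr

[weight]
PT0= 0.1 0.2 0.3 0.4
"""
print(mismatched_features(sample_ini))
```

An empty result means each declared feature count matches its weight list; it does not guarantee that the count is the one the phrase table implementation expects.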


 --
 Hieu Hoang
 Research Associate
 University of Edinburgh
 http://www.hoang.co.uk/hieu



Re: [Moses-support] Incremental retraining

2014-09-07 Thread Sandipan Dandapat
Hi Hieu,
I also tried the same, but it generates the error below:

Exception: moses/TranslationModel/UG/mmsapt.cpp:381 in virtual void
Moses::Mmsapt::Load() threw util::Exception because
`this->m_feature_names.size() != this->m_numScoreComponents'.
At moses/TranslationModel/UG/mmsapt.cpp:381: number of feature values
provided by Phrase table (7) does not match number specified in Moses
config file (4)!

Thanks and regards,
sandipan


On 6 September 2014 09:50, Hieu Hoang hieuho...@gmail.com wrote:

 I'm not sure how many scores there are in the phrase table
 PhraseDictionaryBitextSampling.
 It may be 4, in which case you must specify

 [feature]
 PhraseDictionaryBitextSampling name=PT0 num-features=4 ...

 [weight]
 PT0= 0.1 0.2 0.3 0.4




Re: [Moses-support] Incremental retraining

2014-09-07 Thread Sandipan Dandapat
Hi Hieu,
I even tried with '7', and it fails with the error message:

Exception: moses/ScoreComponentCollection.cpp:248 in void
Moses::ScoreComponentCollection::Assign(const Moses::FeatureFunction*,
const std::vector<float>) threw util::Exception'.
Feature function PT0 specified 7 dense scores or weights. Actually has 4

In contrast, when I use a binarised phrase table, I use num-features=4
and this works fine. I am attaching the moses.ini file in case I am doing
anything wrong there.

Thanks and regards,
sandipan


On 7 September 2014 15:46, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

 maybe it's 7 scores




[Moses-support] Incremental retraining

2014-09-05 Thread Sandipan Dandapat
Hi,
During incremental retraining I specified the following line in moses.ini
PhraseDictionaryBitextSampling name=PT0 output-factor=0 num-features=9
path=/home/sandipan/inc_retrain/MT_sys/EnPl/mtdata_pro/train. L1=en L2=pl
pfwd=g pbwd=g smooth=0 sample=1000 workers=1

this generates the error:
Feature function PT0 specified 9 dense scores or weights. Actually has 0.

which is solved when num-features is changed to '0'
but generates the error below:

Exception: moses/TranslationModel/UG/mmsapt.cpp:381 in virtual void
Moses::Mmsapt::Load() threw util::Exception because
`this->m_feature_names.size() != this->m_numScoreComponents'.
At moses/TranslationModel/UG/mmsapt.cpp:381: number of feature values
provided by Phrase table (7) does not match number specified in Moses
config file (0)!
Changing it to 7 also does not help.

I have tried with
Mmsapt name=PT0 output-factor=0 num-features=0
base=/home/sandipan/inc_retrain/MT_sys/EnPl/mtdata_pro/train. L1=en L2=pl

but it does not work.
What do I need to do at this stage of retraining using Moses?

Thanks and regards,
sandipan


[Moses-support] xmlrpc problem

2014-08-13 Thread Sandipan Dandapat
Hi,
I am using inc-giza-pp and xmlrpc-c. All of them compiled successfully; my
xmlrpc-c is installed in a local directory and the library files are located
in /home/sdandapat/Moses/xmlrpc-c/lib64.

Moses decoder also compiled fine with inc-giza-pp.

While building the translation model, it fails with the following error:

/home/sdandapat/Moses/mosesdecoder/tools/GIZA++: error while loading shared
libraries: libxmlrpc_server_abyss++.so.8: cannot open shared object file:
No such file or directory
Exit code: 127
ERROR: Giza did not produce the output file
/home/sdandapat/Moses_retrain/EnPl/train/giza.en-pl/en-pl.Ahmm.5. Is your
corpus clean (reasonably-sized sentences)? at
/home/sdandapat/Moses/mosesdecoder/scripts/training/train-model.perl line
1199.

I am attaching the entire log. What am I doing wrong?
I found similar post in
http://blog.gmane.org/gmane.comp.nlp.moses.user/month=20111201 but could
not find the end solution.
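An error of this form generally means the dynamic loader cannot find the shared library at run time. On Linux, one common way to point it at a locally installed library is the LD_LIBRARY_PATH environment variable. The Python sketch below builds such an environment for a child process; whether this resolves the problem in this particular setup is not confirmed in the thread, and the helper names are illustrative.

```python
import os


def env_with_library_dir(lib_dir, base_env=None):
    """Copy the environment and put lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + os.pathsep + current if current else lib_dir
    return env


def library_present(lib_dir, name):
    """True if a file called name exists directly inside lib_dir."""
    return os.path.exists(os.path.join(lib_dir, name))


# Directory and library name are taken from the message above.
lib_dir = "/home/sdandapat/Moses/xmlrpc-c/lib64"
env = env_with_library_dir(lib_dir, {"LD_LIBRARY_PATH": "/usr/local/lib"})
print(env["LD_LIBRARY_PATH"])
print(library_present(lib_dir, "libxmlrpc_server_abyss++.so.8"))
```

The resulting dictionary could be passed as env= to subprocess.run when launching train-model.perl, so that the GIZA++ binary it starts inherits the setting.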

Thanks and regards,
sandipan



Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile:
http://scholar.google.co.in/citations?user=DWD_FiQJhl=en
Tokenizer Version 1.1
Language: en
Number of threads: 1
Tokenizer Version 1.1
Language: pl
Number of threads: 1
clean-corpus.perl: processing mtdata/train.tok.en  .pl to mtdata/train.clean, cutoff 1-100
.
Input sentences: 5  Output sentences:  5
mkdir: cannot create directory ‘lm’: File exists
Pruning === 1/5 Counting and sorting n-grams ===
Reading /home/sdandapat/Moses_retrain/EnPl/mtdata/train.clean.pl
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

Unigram tokens 917274 types 51774
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:621288 2:2931873280 3:5497262592
Statistics:
1 51774 D1=0.607105 D2=1.0428 D3+=1.51615
2 379279 D1=0.801548 D2=1.15112 D3+=1.4064
3 655686 D1=0.884806 D2=1.19448 D3+=1.37341
Memory estimate for binary LM:
type   kB
probing 21729 assuming -p 1.5
probing 24154 assuming -r models -p 1.5
trie 9558 without quantization
trie 5545 assuming -q 8 -b 8 quantization 
trie 8990 assuming -a 22 array pointer compression
trie 4976 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:621288 2:6068464 3:13113720
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:621288 2:6068464 3:13113720
Chain sizes: 1:621288 2:6068464 3:13113720
=== 5/5 Writing ARPA model ===
Name:lmplz	VmPeak:8778804 kB	VmRSS:23180 kB	RSSMax:1931028 kB	user:2.96355	sys:1.23781	CPU:4.20136	real:4.89953
Reading lm/train.arpa.pl
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

SUCCESS
Using SCRIPTS_ROOTDIR: /home/sdandapat/Moses/mosesdecoder/scripts
Using single-thread GIZA
(1) preparing corpus @ Wed Aug 13 12:44:56 IST 2014
Executing: mkdir -p /home/sdandapat/Moses_retrain/EnPl/train/corpus
(1.0) selecting factors @ Wed Aug 13 12:44:56 IST 2014
(1.1) running mkcls  @ Wed Aug 13 12:44:56 IST 2014
/home/sdandapat/Moses/mosesdecoder/tools/mkcls -c50 -n2 -p/home/sdandapat/Moses_retrain/EnPl/mtdata/train.clean.en -V/home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb.classes opt
  /home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb.classes already in place, reusing
(1.1) running mkcls  @ Wed Aug 13 12:44:56 IST 2014
/home/sdandapat/Moses/mosesdecoder/tools/mkcls -c50 -n2 -p/home/sdandapat/Moses_retrain/EnPl/mtdata/train.clean.pl -V/home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb.classes opt
  /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb.classes already in place, reusing
(1.2) creating vcb file /home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb @ Wed Aug 13 12:44:56 IST 2014
(1.2) creating vcb file /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb @ Wed Aug 13 12:44:56 IST 2014
(1.3) numberizing corpus /home/sdandapat/Moses_retrain/EnPl/train/corpus/en-pl-int-train.snt @ Wed Aug 13 12:44:57 IST 2014
  /home/sdandapat/Moses_retrain/EnPl/train/corpus/en-pl-int-train.snt already in place, reusing
(1.3) numberizing corpus /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl-en-int-train.snt @ Wed Aug 13 12:44:57 IST 2014
  /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl-en-int-train.snt already in place, reusing
(2) running giza @ Wed Aug 13 12:44:57 IST 2014
(2.1a) running snt2cooc en-pl @ Wed Aug 13 12:44:57 IST 2014

Executing: mkdir -p /home/sdandapat/Moses_retrain/EnPl/train/giza.en-pl
Executing: /home/sdandapat/Moses/mosesdecoder/tools/snt2cooc.out /home/sdandapat/Moses_retrain/EnPl/train/corpus/pl.vcb /home/sdandapat/Moses_retrain/EnPl/train/corpus/en.vcb /home/sdandapat

Re: [Moses-support] xmlrpc problem

2014-08-13 Thread Sandipan Dandapat
Hi Barry,
Thanks a ton, it worked. I have a couple of follow-up questions.

1. Do we need to use moses-server version for incremental retraining using
moses?

2. On the website, I found that we need to run something like

% zcat ${CORPUS}.${L1}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L1}


But what generates this ${CORPUS}.${L1}.gz file? I am using inc-giza-pp, and
I compiled mosesdecoder with the --with-mm option, and during training I used
option

However, I am unable to figure out how to perform steps 2 and 3 in
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc40
Thanks and regards,
sandipan


Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile:
http://scholar.google.co.in/citations?user=DWD_FiQJhl=en


On Wed, Aug 13, 2014 at 8:50 PM, Barry Haddow bhad...@staffmail.ed.ac.uk
wrote:

  Hi Sandipan

 It looks as though GIZA++ cannot find the xmlrpc-c libraries. You need to
 set your LD_LIBRARY_PATH, like this (assuming bash):

 export LD_LIBRARY_PATH=/home/sdandapat/Moses/xmlrpc-c/lib64:$LD_LIBRARY_PATH

 then check if you can run GIZA++ from the command line, before you try
 building the model,

 cheers - Barry
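
A minimal sketch of that check, assuming a POSIX-like shell and that `ldd` is available. The helper names prepend_lib_dir and missing_libs are made up for illustration and are not part of Moses or GIZA++. An empty result from missing_libs would suggest that the loader can now resolve every shared library the binary needs.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_lib_dir() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
}

# Hypothetical helper: print the shared libraries the loader cannot resolve for a binary.
# ldd reports such libraries as "libname => not found"; the first field is the name.
missing_libs() {
    ldd "$1" 2>/dev/null | awk '/not found/ {print $1}'
}

# Library directory taken from the thread.
prepend_lib_dir /home/sdandapat/Moses/xmlrpc-c/lib64

# Usage: missing_libs /home/sdandapat/Moses/mosesdecoder/tools/GIZA++
```

The export only affects the current shell and its children, so it has to be set in the same session (or shell startup file) from which the training script is launched.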






___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Error compiling moses decoder --with-mm

2014-07-30 Thread Sandipan Dandapat
Hi,
I was compiling mosesdecoder with './bjam --with-mm' to use the memory-mapped
dynamic suffix array phrase table. The build failed. I am attaching the log
here. Can you please help me find the problem? The build worked without
the --with-mm option.

Thanks and regards,
sandipan


Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile:
http://scholar.google.co.in/citations?user=DWD_FiQJhl=en


build.log.gz
Description: GNU Zip compressed data


Re: [Moses-support] Error compiling moses decoder --with-mm

2014-07-30 Thread Sandipan Dandapat
I am using Ubuntu 14.04 and gcc 4.4.7.
I used ./bjam --with-mm

I found that the typename keyword was causing the problem, so I changed this
line in the ./moses/TranslationModel/UG/mm/ug_bitext.h file

typedef typename boost::unordered_map<uint64_t, jstats> trg_map_t;

to
typedef boost::unordered_map<uint64_t, jstats> trg_map_t;
and it compiles successfully
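
For anyone who wants to apply the same change without editing by hand, here is a minimal sketch. It assumes the typedef in the header reads boost::unordered_map<uint64_t, jstats> exactly as above, and the function name patch_trg_map_typedef is made up for illustration. It only removes the leading typename from that one line and keeps a .bak copy of the original file.

```shell
# Hypothetical helper: drop the leading "typename" from the trg_map_t typedef.
# $1 is the header to edit; the original is preserved as "$1.bak".
patch_trg_map_typedef() {
    sed -i.bak 's/typedef typename \(boost::unordered_map<uint64_t, *jstats> *trg_map_t;\)/typedef \1/' "$1"
}

# Usage, from the mosesdecoder checkout:
#   patch_trg_map_typedef ./moses/TranslationModel/UG/mm/ug_bitext.h
```

If the line in a given checkout differs from the one quoted here, the substitution simply does not match and the file is left unchanged.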

More questions on retraining:
1.

 - the word alignment between these files in symal output format

Is it the standard format found in aligned.grow-diag-final, i.e.

0-0 1-1 2-2 2-3 6-4 3-5 4-5 5-5 6-5 7-6 8-7 9-8
0-0 1-1 2-2 3-3 4-4
...

2. I have the normal training and aligned corpus, i.e. train.en, train.fr
and aligned.grow-diag-final. During retraining we need to do the following:

   % zcat ${CORPUS}.${L1}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L1}
   % zcat ${CORPUS}.${L2}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L2}
   % zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
   % mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex -c /some/path/${CORPUS}.${L1}-${L2}.coc


  I am unable to understand how the ${CORPUS}.${L1}.gz, ${CORPUS}.${L2}.gz and
${CORPUS}.${L1}-${L2}.symal.gz files are generated. Or do they refer to
the standard training and alignment files?
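
One possible reading, which is an assumption and is not confirmed anywhere in this thread, is that the three .gz inputs are just gzipped copies of the tokenized training text and of the alignment file, with one "src-trg" index pair per token as in the sample above. Under that assumption, a rough sketch of preparing them could look like this. The helper names check_symal and pack_inputs are made up for illustration.

```shell
# Hypothetical helper: succeed only if every token on every line looks like "<src>-<trg>".
check_symal() {
    awk '{ for (i = 1; i <= NF; i++)
               if ($i !~ /^[0-9]+-[0-9]+$/) { print "line " NR ": bad token " $i; bad = 1 } }
         END { exit bad + 0 }' "$1"
}

# Hypothetical helper: gzip the plain files under the names the commands above read.
# $1 = CORPUS prefix, $2 = L1, $3 = L2, $4 = source text, $5 = target text, $6 = alignment file
pack_inputs() {
    check_symal "$6" || return 1
    gzip -c "$4" > "$1.$2.gz"
    gzip -c "$5" > "$1.$3.gz"
    gzip -c "$6" > "$1.$2-$3.symal.gz"
}

# Usage: pack_inputs corpus en fr train.en train.fr aligned.grow-diag-final
```

The format check runs first so that a malformed alignment file is reported before any output is written.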

Thanks,






Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile:
http://scholar.google.co.in/citations?user=DWD_FiQJhl=en


On Wed, Jul 30, 2014 at 4:51 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

 What OS and version are you using? What gcc version are you using? What is
 the exact command you used to compile?






 --
 Hieu Hoang
 Research Associate
 University of Edinburgh
 http://www.hoang.co.uk/hieu




[Moses-support] Issues with Incremental retraining using Moses

2014-07-28 Thread Sandipan Dandapat
Hi,
I am trying to use Moses Incremental Retraining as described in
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc33

I have two doubts:

1. I am able to generate the new-alignment-file using additional data on
top of the previously used data. Once the new alignment file is generated, the
page says to update the model. I am unable to understand how to use it
during decoding. Can you please help me understand how I can
proceed once my new-alignment-file is generated?

2. What happens when we update the moses.ini file using

 PhraseDictionaryDynSuffixArray source=path-to-source-corpus
target=path-to-target-corpus alignment=path-to-alignments

 I am unable to see any reference to this updated moses.ini file in the
rest of the section.
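
To make the moses.ini line above concrete, here is a small sketch that only fills in the three placeholders and checks that the files are readable. The helper name dyn_sa_feature_line is made up for illustration, and the sketch says nothing about how the decoder uses the line afterwards.

```shell
# Hypothetical helper: print the feature line for moses.ini after checking the three files.
# $1 = source corpus, $2 = target corpus, $3 = alignment file
dyn_sa_feature_line() {
    for f in "$1" "$2" "$3"; do
        [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    done
    printf 'PhraseDictionaryDynSuffixArray source=%s target=%s alignment=%s\n' "$1" "$2" "$3"
}

# Usage: dyn_sa_feature_line train.en train.fr aligned.grow-diag-final
```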

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile:
http://scholar.google.co.in/citations?user=DWD_FiQJhl=en


[Moses-support] Issues with Incremental retraining using Moses

2014-07-25 Thread Sandipan Dandapat
Hi,
I am trying to use Moses Incremental Retraining as described in
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc33

I have two doubts:

1. I am able to generate the new-alignment-file using additional data on
top of the previously used data. Once the new alignment file is generated, the
page says to update the model. I am unable to understand how to use it
during decoding. Can you please help me understand how I can
proceed once my new-alignment-file is generated?

2. What happens when we update the moses.ini file using

 PhraseDictionaryDynSuffixArray source=path-to-source-corpus
target=path-to-target-corpus alignment=path-to-alignments

 I am unable to see any reference to this updated moses.ini file in the
rest of the section.

Thanks and regards,
sandipan

Sandipan Dandapat
Postdoctoral Researcher
CNGL, School of Computing
Dublin City University
Google Scholar Profile:
http://scholar.google.co.in/citations?user=DWD_FiQJhl=en