Re: [Moses-support] Incremental training

2014-11-19 Thread Raj Dabre
Well then your paths must be wrong.
I can't see why the files are not being generated.
I'll look into it tomorrow and let you know.

On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat 
wrote:

> When I use your script, there is no problem. But when I modified the
> line to nid = len(fvcb)+2; there are no .vcb files in the new_corpus/ dir.
> I used these two commands.
>
> sh full_train.sh org.en org.fr
>  sh align_new.sh inc.en inc.fr org.en org.fr
>
> Is the above right?
>
> I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE, NEW_CORPUS_BASE)
> hard-coded in the scripts.
>
>
> On 19 November 2014 15:49, Raj Dabre  wrote:
>
>> Cannot open file???
>> Does the file exist??
>> Are you passing the path properly?
>>
>>
>> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat 
>> wrote:
>>
>>> Hi,
>>> I made the changes based on your suggestions; it's now generating a
>>> different error, as below:
>>>
>>>
>>> reading vocabulary files
>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>
>>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>>
>>> I am attaching the working dir and the .py scripts herewith. The
>>> 10 parallel sentences for incremental alignment are in inc_data/, whereas
>>> the original 500 sentences are in the mtdata/ directory.
>>>
>>> Thanks a ton for your help.
>>>
>>> Regards,
>>> sandipan
>>>
>>> On 19 November 2014 15:18, Raj Dabre  wrote:
>>>
 Hey,

 I am pretty sure that my script does not generate duplicate token ids.

 In fact, I used to get the same error till I modified the script.

 In case you want to avoid this error without using my script, then:

 1. Open the original python script: plain2snt-hasvcb.py
 2. There is a line which increments the id counter by 1 (the line is
 nid = len(fvcb)+1;)
 3. Change this line to: nid = len(fvcb)+2; (This is because the token id
 numbering starts from 2, so if you have 23 tokens the ids run from 2 to
 24. The original script computes nid = 23 + 1 = 24, whereas the
 modification correctly gives 25.) The same change is needed in a second
 place: nid = len(evcb)+2;

 Do this and it will work.
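The arithmetic behind the fix can be sketched as follows (a minimal illustration; the dictionary contents and variable names are stand-ins mirroring the description above, not the script's exact code):

```python
# GIZA++-style .vcb vocabularies reserve the low ids, so real tokens
# start at id 2: a vocabulary of N tokens occupies ids 2..N+1.
fvcb = {"le": 2, "roi": 3}  # token -> id, as read from a .vcb file (N = 2)

# Original line: nid = len(fvcb) + 1 -> 3, which collides with the id
# already assigned to "roi" -- exactly the "TOKEN ID must be unique" error.
# Fixed line: the first unused id is len(fvcb) + 2.
nid = len(fvcb) + 2

print(nid)  # 4, the next free token id
```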

 In any case... send me a zip file of your working directory (if it's
 small; you are testing it on small data, right?). I will see what the
 problem is.



 On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
 sandipandanda...@gmail.com> wrote:

> Dear Raj,
> I also tried to use your scripts for incremental alignment. I copied
> your python script into the desired directory; still I am receiving the same
> error as posted by Ihab.
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
> ERROR: TOKEN ID must be unique for each token, in line :
> 24 roi 2
> TOKEN ID 24 has already been assigned to: roi
>
> I took only 500 sentence pairs for full_train.sh and it worked fine,
> with 758 lines in the corpus/tgt_filename.vcb file.
>
> I took only 10 sentences for incremental alignment with align_new.sh, which
> generated the error, and I found 8054 lines in
> new_corpus/new_tgt_file.vcb.
> Is there any problem? Can you please help me with this?
>
> Thanks and regards,
> sandipan
>
>
> On 4 November 2014 16:13, prajdabre  wrote:
>
>> Dear Ihab.
>> There is a python script in the google drive folder from the first mail
>> I sent you.
>> Please replace the existing file with my copy.
>>
>> It has to work.
>>
>> Regards.
>>
>>
>> Sent from Samsung Mobile
>>
>>
>>
>>  Original message 
>> From: Ihab Ramadan 
>> Date: 05/11/2014 00:54 (GMT+09:00)
>> To: 'Raj Dabre' 
>> Cc: moses-support@mit.edu
>> Subject: RE: [Moses-support] Incremental training
>>
>>
>> Dear Raj,
>>
>> Your point is clear and I tried to follow the steps you mentioned, but I
>> am now stuck at the align_new.sh script, which gives me this error:
>>
>> reading vocabulary files
>>
>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>
>> ERROR: TOKEN ID must be unique for each token, in line :
>>
>> 29107 q-1 4
>>
>> Do you have any idea what this error means?
>>
>>
>>
>> *From:* Raj Dabre [mailto:prajda...@gmail.com]
>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>> *To:* i.rama...@saudisoft.com
>> *Cc:* moses-support@mit.edu
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Dear Ihab,
>>
>> Perhaps I should have mentioned much more clearly what my script
>> does. Sorry for that.
>>
>> Let me start with this: There is no direct/easy way to generate the
>> moses.ini file as you need.
>>
>> 1. Suppose you have 2 million lines of parallel corpora and you
>> trained a SMT system for it. This naturally gives the phrase table

Re: [Moses-support] Incremental training

2014-11-19 Thread Sandipan Dandapat
When I use your script, there is no problem. But when I modified the
line to nid = len(fvcb)+2; there are no .vcb files in the new_corpus/ dir.
I used these two commands.

sh full_train.sh org.en org.fr
 sh align_new.sh inc.en inc.fr org.en org.fr

Is the above right?

I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE, NEW_CORPUS_BASE)
hard-coded in the scripts.


On 19 November 2014 15:49, Raj Dabre  wrote:

> Cannot open file???
> Does the file exist??
> Are you passing the path properly?
>
>
> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat 
> wrote:
>
>> Hi,
>> I made the changes based on your suggestions; it's now generating a
>> different error, as below:
>>
>>
>> reading vocabulary files
>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>
>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>
>> I am attaching the working dir and the .py scripts herewith. The
>> 10 parallel sentences for incremental alignment are in inc_data/, whereas
>> the original 500 sentences are in the mtdata/ directory.
>>
>> Thanks a ton for your help.
>>
>> Regards,
>> sandipan
>>
>> On 19 November 2014 15:18, Raj Dabre  wrote:
>>
>>> Hey,
>>>
>>> I am pretty sure that my script does not generate duplicate token ids.
>>>
>>> In fact, I used to get the same error till I modified the script.
>>>
>>> In case you want to avoid this error without using my script, then:
>>>
>>> 1. Open the original python script: plain2snt-hasvcb.py
>>> 2. There is a line which increments the id counter by 1 (the line is
>>> nid = len(fvcb)+1;)
>>> 3. Change this line to: nid = len(fvcb)+2; (This is because the token id
>>> numbering starts from 2, so if you have 23 tokens the ids run from 2 to
>>> 24. The original script computes nid = 23 + 1 = 24, whereas the
>>> modification correctly gives 25.) The same change is needed in a second
>>> place: nid = len(evcb)+2;
>>>
>>> Do this and it will work.
>>>
>>> In any case... send me a zip file of your working directory (if it's
>>> small; you are testing it on small data, right?). I will see what the
>>> problem is.
>>>
>>>
>>>
>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
>>> sandipandanda...@gmail.com> wrote:
>>>
 Dear Raj,
 I also tried to use your scripts for incremental alignment. I copied
 your python script into the desired directory; still I am receiving the same
 error as posted by Ihab.
 reading vocabulary files
 Reading vocabulary file from:new_corpus/inc.fr.vcb
 ERROR: TOKEN ID must be unique for each token, in line :
 24 roi 2
 TOKEN ID 24 has already been assigned to: roi

 I took only 500 sentence pairs for full_train.sh and it worked fine,
 with 758 lines in the corpus/tgt_filename.vcb file.

 I took only 10 sentences for incremental alignment with align_new.sh, which
 generated the error, and I found 8054 lines in
 new_corpus/new_tgt_file.vcb.
 Is there any problem? Can you please help me with this?

 Thanks and regards,
 sandipan


 On 4 November 2014 16:13, prajdabre  wrote:

> Dear Ihab.
> There is a python script in the google drive folder from the first mail
> I sent you.
> Please replace the existing file with my copy.
>
> It has to work.
>
> Regards.
>
>
> Sent from Samsung Mobile
>
>
>
>  Original message 
> From: Ihab Ramadan 
> Date: 05/11/2014 00:54 (GMT+09:00)
> To: 'Raj Dabre' 
> Cc: moses-support@mit.edu
> Subject: RE: [Moses-support] Incremental training
>
>
> Dear Raj,
>
> Your point is clear and I tried to follow the steps you mentioned, but I
> am now stuck at the align_new.sh script, which gives me this error:
>
> reading vocabulary files
>
> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>
> ERROR: TOKEN ID must be unique for each token, in line :
>
> 29107 q-1 4
>
> Do you have any idea what this error means?
>
>
>
> *From:* Raj Dabre [mailto:prajda...@gmail.com]
> *Sent:* Tuesday, November 4, 2014 12:06 PM
> *To:* i.rama...@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
>
>
> Dear Ihab,
>
> Perhaps I should have mentioned much more clearly what my script does.
> Sorry for that.
>
> Let me start with this: There is no direct/easy way to generate the
> moses.ini file as you need.
>
> 1. Suppose you have 2 million lines of parallel corpora and you
> trained a SMT system for it. This naturally gives the phrase table,
> reordering table and moses.ini.
>
> 2. Suppose you get 500k more lines of parallel corpus; there are
> 2 ways:
>
> a. Retrain 2.5 million lines from scratch (will take lots of time:
> ~ 2-3 days on a regular machine)
>
> b. Train on only the 500k new lines using the alignment
> informati

[Moses-support] EMS required for Hierarchical model?

2014-11-19 Thread Asad A.Malik
Hi, 

I wanted to ask: in order to implement the hierarchical model, do I have to
use EMS?

Kind Regards,

Mr. Asad Abdul Malik
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Incremental training

2014-11-19 Thread Raj Dabre
Cannot open file???
Does the file exist??
Are you passing the path properly?

On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat 
wrote:

> Hi,
> I made the changes based on your suggestions; it's now generating a
> different error, as below:
>
>
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
>
> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>
> I am attaching the working dir and the .py scripts herewith. The
> 10 parallel sentences for incremental alignment are in inc_data/, whereas
> the original 500 sentences are in the mtdata/ directory.
>
> Thanks a ton for your help.
>
> Regards,
> sandipan
>
> On 19 November 2014 15:18, Raj Dabre  wrote:
>
>> Hey,
>>
>> I am pretty sure that my script does not generate duplicate token ids.
>>
>> In fact, I used to get the same error till I modified the script.
>>
>> In case you want to avoid this error without using my script, then:
>>
>> 1. Open the original python script: plain2snt-hasvcb.py
>> 2. There is a line which increments the id counter by 1 (the line is
>> nid = len(fvcb)+1;)
>> 3. Change this line to: nid = len(fvcb)+2; (This is because the token id
>> numbering starts from 2, so if you have 23 tokens the ids run from 2 to
>> 24. The original script computes nid = 23 + 1 = 24, whereas the
>> modification correctly gives 25.) The same change is needed in a second
>> place: nid = len(evcb)+2;
>>
>> Do this and it will work.
>>
>> In any case... send me a zip file of your working directory (if it's
>> small; you are testing it on small data, right?). I will see what the
>> problem is.
>>
>>
>>
>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
>> sandipandanda...@gmail.com> wrote:
>>
>>> Dear Raj,
>>> I also tried to use your scripts for incremental alignment. I copied
>>> your python script into the desired directory; still I am receiving the same
>>> error as posted by Ihab.
>>> reading vocabulary files
>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>> ERROR: TOKEN ID must be unique for each token, in line :
>>> 24 roi 2
>>> TOKEN ID 24 has already been assigned to: roi
>>>
>>> I took only 500 sentence pairs for full_train.sh and it worked fine,
>>> with 758 lines in the corpus/tgt_filename.vcb file.
>>>
>>> I took only 10 sentences for incremental alignment with align_new.sh, which
>>> generated the error, and I found 8054 lines in
>>> new_corpus/new_tgt_file.vcb.
>>> Is there any problem? Can you please help me with this?
>>>
>>> Thanks and regards,
>>> sandipan
>>>
>>>
>>> On 4 November 2014 16:13, prajdabre  wrote:
>>>
 Dear Ihab.
 There is a python script in the google drive folder from the first mail
 I sent you.
 Please replace the existing file with my copy.

 It has to work.

 Regards.


 Sent from Samsung Mobile



  Original message 
 From: Ihab Ramadan 
 Date: 05/11/2014 00:54 (GMT+09:00)
 To: 'Raj Dabre' 
 Cc: moses-support@mit.edu
 Subject: RE: [Moses-support] Incremental training


 Dear Raj,

 Your point is clear and I tried to follow the steps you mentioned, but I
 am now stuck at the align_new.sh script, which gives me this error:

 reading vocabulary files

 Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb

 ERROR: TOKEN ID must be unique for each token, in line :

 29107 q-1 4

 Do you have any idea what this error means?



 *From:* Raj Dabre [mailto:prajda...@gmail.com]
 *Sent:* Tuesday, November 4, 2014 12:06 PM
 *To:* i.rama...@saudisoft.com
 *Cc:* moses-support@mit.edu
 *Subject:* Re: [Moses-support] Incremental training



 Dear Ihab,

 Perhaps I should have mentioned much more clearly what my script does.
 Sorry for that.

 Let me start with this: There is no direct/easy way to generate the
 moses.ini file as you need.

 1. Suppose you have 2 million lines of parallel corpora and you trained
 a SMT system for it. This naturally gives the phrase table, reordering
 table and moses.ini.

 2. Suppose you get 500k more lines of parallel corpus; there are 2
 ways:

 a. Retrain 2.5 million lines from scratch (will take lots of time:
 ~ 2-3 days on a regular machine)

 b. Train on only the 500k new lines using the alignment information
 of the original training data. (Faster: ~ 6-7 hours).



 What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
 TABLES.*

 1. full_train.sh -- This trains on the original corpus of 2
 million lines. (Generate alignment files only for the original corpus)

 2. align_new.sh -- This trains on the new corpus of 500 k
 lines. (Generate alignment files only for the new corpus using the
 alignments for 1)



 *Why this split?* Because the basic training step of Moses do

Re: [Moses-support] Incremental training

2014-11-19 Thread Raj Dabre
Hey,

I am pretty sure that my script does not generate duplicate token ids.

In fact, I used to get the same error till I modified the script.

In case you want to avoid this error without using my script, then:

1. Open the original python script: plain2snt-hasvcb.py
2. There is a line which increments the id counter by 1 (the line is
nid = len(fvcb)+1;)
3. Change this line to: nid = len(fvcb)+2; (This is because the token id
numbering starts from 2, so if you have 23 tokens the ids run from 2 to
24. The original script computes nid = 23 + 1 = 24, whereas the
modification correctly gives 25.) The same change is needed in a second
place: nid = len(evcb)+2;

Do this and it will work.

In any case... send me a zip file of your working directory (if it's
small; you are testing it on small data, right?). I will see what the
problem is.



On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
sandipandanda...@gmail.com> wrote:

> Dear Raj,
> I also tried to use your scripts for incremental alignment. I copied your
> python script into the desired directory; still I am receiving the same error
> as posted by Ihab.
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
> ERROR: TOKEN ID must be unique for each token, in line :
> 24 roi 2
> TOKEN ID 24 has already been assigned to: roi
>
> I took only 500 sentence pairs for full_train.sh and it worked fine, with
> 758 lines in the corpus/tgt_filename.vcb file.
>
> I took only 10 sentences for incremental alignment with align_new.sh, which
> generated the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb.
> Is there any problem? Can you please help me with this?
>
> Thanks and regards,
> sandipan
>
>
> On 4 November 2014 16:13, prajdabre  wrote:
>
>> Dear Ihab.
>> There is a python script in the google drive folder from the first mail
>> I sent you.
>> Please replace the existing file with my copy.
>>
>> It has to work.
>>
>> Regards.
>>
>>
>> Sent from Samsung Mobile
>>
>>
>>
>>  Original message 
>> From: Ihab Ramadan 
>> Date: 05/11/2014 00:54 (GMT+09:00)
>> To: 'Raj Dabre' 
>> Cc: moses-support@mit.edu
>> Subject: RE: [Moses-support] Incremental training
>>
>>
>> Dear Raj,
>>
>> Your point is clear and I tried to follow the steps you mentioned, but I
>> am now stuck at the align_new.sh script, which gives me this error:
>>
>> reading vocabulary files
>>
>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>
>> ERROR: TOKEN ID must be unique for each token, in line :
>>
>> 29107 q-1 4
>>
>> Do you have any idea what this error means?
>>
>>
>>
>> *From:* Raj Dabre [mailto:prajda...@gmail.com]
>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>> *To:* i.rama...@saudisoft.com
>> *Cc:* moses-support@mit.edu
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Dear Ihab,
>>
>> Perhaps I should have mentioned much more clearly what my script does.
>> Sorry for that.
>>
>> Let me start with this: There is no direct/easy way to generate the
>> moses.ini file as you need.
>>
>> 1. Suppose you have 2 million lines of parallel corpora and you trained a
>> SMT system for it. This naturally gives the phrase table, reordering table
>> and moses.ini.
>>
>> 2. Suppose you get 500k more lines of parallel corpus; there are 2
>> ways:
>>
>> a. Retrain 2.5 million lines from scratch (will take lots of time: ~
>> 2-3 days on a regular machine)
>>
>> b. Train on only the 500k new lines using the alignment information
>> of the original training data. (Faster: ~ 6-7 hours).
>>
>>
>>
>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>> TABLES.*
>>
>> 1. full_train.sh -- This trains on the original corpus of 2
>> million lines. (Generate alignment files only for the original corpus)
>>
>> 2. align_new.sh -- This trains on the new corpus of 500 k
>> lines. (Generate alignment files only for the new corpus using the
>> alignments for 1)
>>
>>
>>
>> *Why this split?* Because the basic training step of Moses does not
>> preserve the alignment probability information. Only the alignments are
>> saved. To continue training we need the probability information.
>>
>> You can pass flags to moses to preserve this information (the flag is
>> --giza-option). If you do this then you will not need full_train.sh, but
>> you will have to change the config files before using align_new.sh.
>>
>> *HOW TO GET UPDATED PHRASE TABLE:*
>>
>> 1. Append the forward alignments (fwd) generated by align_new.sh to the
>> forward (fwd) alignments generated by full_train.sh.
>> 2. Append the inverse alignments (inv) generated by align_new.sh to the
>> inverse (inv) alignments generated by full_train.sh.
>>
>> 3. Run the moses training script with additional flags:
>>
>>- --first-step -- the first step in the training process (default
>>1); this will be 4
>>- --last-step -- the last step in the training process (default
>>7); this will remain 7
>>- --giza-f2e -- /new

[Moses-support] How much time will tokenizing take?

2014-11-19 Thread Asad A.Malik
Hi All, 

I am tokenizing my corpus and have entered the command, but it is taking too 
long; I just wanted to know how much time it will take. 
P.S. The corpus is the same, maybe around 1000 sentences, and in French.
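For reference, tokenizing a corpus of ~1000 sentences normally finishes in seconds. A typical invocation of the standard Moses tokenizer looks like this (the mosesdecoder path and file names are assumptions; adjust to your checkout):

```shell
# Tokenize a French corpus; -l selects the language-specific rules.
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
    < corpus.fr > corpus.tok.fr
```

If it hangs for minutes on a corpus this small, one common cause is the script waiting on stdin because the input redirection was omitted.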
 Kind Regards,

Mr. Asad Abdul Malik


Re: [Moses-support] Incremental training

2014-11-19 Thread Sandipan Dandapat
Dear Raj,
I also tried to use your scripts for incremental alignment. I copied your
python script into the desired directory; still I am receiving the same error
as posted by Ihab.
reading vocabulary files
Reading vocabulary file from:new_corpus/inc.fr.vcb
ERROR: TOKEN ID must be unique for each token, in line :
24 roi 2
TOKEN ID 24 has already been assigned to: roi

I took only 500 sentence pairs for full_train.sh and it worked fine, with
758 lines in the corpus/tgt_filename.vcb file.

I took only 10 sentences for incremental alignment with align_new.sh, which
generated the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb.
Is there any problem? Can you please help me with this?

Thanks and regards,
sandipan


On 4 November 2014 16:13, prajdabre  wrote:

> Dear Ihab.
> There is a python script in the google drive folder from the first mail
> I sent you.
> Please replace the existing file with my copy.
>
> It has to work.
>
> Regards.
>
>
> Sent from Samsung Mobile
>
>
>
>  Original message 
> From: Ihab Ramadan 
> Date: 05/11/2014 00:54 (GMT+09:00)
> To: 'Raj Dabre' 
> Cc: moses-support@mit.edu
> Subject: RE: [Moses-support] Incremental training
>
>
> Dear Raj,
>
> Your point is clear and I tried to follow the steps you mentioned, but I
> am now stuck at the align_new.sh script, which gives me this error:
>
> reading vocabulary files
>
> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>
> ERROR: TOKEN ID must be unique for each token, in line :
>
> 29107 q-1 4
>
> Do you have any idea what this error means?
>
>
>
> *From:* Raj Dabre [mailto:prajda...@gmail.com]
> *Sent:* Tuesday, November 4, 2014 12:06 PM
> *To:* i.rama...@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
>
>
> Dear Ihab,
>
> Perhaps I should have mentioned much more clearly what my script does.
> Sorry for that.
>
> Let me start with this: There is no direct/easy way to generate the
> moses.ini file as you need.
>
> 1. Suppose you have 2 million lines of parallel corpora and you trained a
> SMT system for it. This naturally gives the phrase table, reordering table
> and moses.ini.
>
> 2. Suppose you get 500k more lines of parallel corpus; there are 2
> ways:
>
> a. Retrain 2.5 million lines from scratch (will take lots of time: ~
> 2-3 days on a regular machine)
>
> b. Train on only the 500k new lines using the alignment information of
> the original training data. (Faster: ~ 6-7 hours).
>
>
>
> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.*
>
> 1. full_train.sh -- This trains on the original corpus of 2
> million lines. (Generate alignment files only for the original corpus)
>
> 2. align_new.sh -- This trains on the new corpus of 500 k
> lines. (Generate alignment files only for the new corpus using the
> alignments for 1)
>
>
>
> *Why this split?* Because the basic training step of Moses does not
> preserve the alignment probability information. Only the alignments are
> saved. To continue training we need the probability information.
>
> You can pass flags to moses to preserve this information (the flag is
> --giza-option). If you do this then you will not need full_train.sh, but
> you will have to change the config files before using align_new.sh.
>
> *HOW TO GET UPDATED PHRASE TABLE:*
>
> 1. Append the forward alignments (fwd) generated by align_new.sh to the
> forward (fwd) alignments generated by full_train.sh.
> 2. Append the inverse alignments (inv) generated by align_new.sh to the
> inverse (inv) alignments generated by full_train.sh.
>
> 3. Run the moses training script with additional flags:
>
>- --first-step -- the first step in the training process (default
>1); this will be 4
>- --last-step -- the last step in the training process (default
>7); this will remain 7
>- --giza-f2e -- /new_giza.fwd
>- --giza-e2f -- /new_giza.inv
>
> For example:
>
> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <root directory> \
>
>  -corpus <corpus file> \
>
>  -f <source language> -e <target language> -alignment grow-diag-final-and -reordering 
> msd-bidirectional-fe \
>
>  -lm 0:3:<lm file>:8  \
>  --first-step 4  --last-step 7 --giza-f2e <giza directory>/new_giza.fwd 
> --giza-e2f <giza directory>/new_giza.inv \
>  -external-bin-dir <path to external binaries>
>
> For more details on the training step read this:
> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
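The append in steps 1 and 2 above can be sketched like this (all paths and file names here are illustrative stand-ins, not the exact names the scripts produce):

```shell
# Combine original and incremental alignments before re-running phrase
# extraction. Stand-in files are created so the sketch is self-contained;
# real runs would use the files produced by full_train.sh and align_new.sh.
mkdir -p demo/full_train demo/align_new
printf '0-0 1-1\n' > demo/full_train/giza.fwd   # original fwd alignments
printf '0-1\n'     > demo/align_new/giza.fwd    # incremental fwd alignments

# Order matters: the original alignments come first, the new ones are
# appended after them. Repeat the same append for the inverse (.inv) files.
cat demo/full_train/giza.fwd demo/align_new/giza.fwd > demo/new_giza.fwd
wc -l < demo/new_giza.fwd   # line count of the combined file
```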
>
> What this does is assume that you already have alignments, then continue with
> phrase extraction, reordering, and generation of the new moses.ini file.
>
> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>
>
>
> If you are still unclear then please ask and I will try to help you as
> much as I can.
>
> Regards.
>
>
>
>
>
>
>
> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan 
> wrote:
>
> Dear Raj,
>
> That’s a great work my friend,
>
> These files make the script work, but it takes a long time to finish; also it
> di

Re: [Moses-support] placeholders for numbers - extract step

2014-11-19 Thread Vito Mandorino
Thank you Hieu, that worked very well. I am now tackling the decoding part
and I have two questions.


1) Sometimes, I get the following error message during decoding:

terminate called after throwing an instance of 'util::Exception'
  what():  moses-cmd/IOWrapper.cpp:213 in std::map MosesCmd::GetPlaceholders(const
Moses::Hypothesis&, Moses::FactorType) threw util::Exception because
`targetPos.size() != 1'.
Placeholder should be aligned to 1, and only 1, word
Aborted

I don't understand why. I checked the phrase-table and I didn't find
phrase pairs where the '@num@' token is aligned to 2 or more words.

2) This may be related to the first question. If I run the decoder to
translate the input using the suggested command

./moses  -placeholder-factor 1 -xml-input exclusive

I get the '@num@' string in the output and not the expected number. I
do get the number if I use the option '-placeholder-factor 0'. The
model that I am using is a phrase-based, non-factored model.


Vito




2014-11-19 10:32 GMT+01:00 Hieu Hoang :

> hi vito
>
> On 18 November 2014 11:30, Vito Mandorino <
> vito.mandor...@linguacustodia.com> wrote:
>
>> Hello everyone,
>>
>> I am trying to use placeholders for numbers in phrase-based MT, according
>> to http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc75
>>
>> The above page says
>>
>> ---
>>
>>  During extraction, add the following to the extract command
>> (phrase-based only for now):
>>
>> ./extract --Placeholders @num@ 
>>
>> --
>>
>> Does this mean that I have to first run train-model.perl with
>> --last-step=4, then the line above and then again train-model.perl with
>> --first-step=6?
>>
> when you run train-model.perl,  add the argument
>-extract-options '--Placeholders @num@'
> You can see it in this script that the EMS creates
>
> http://www.statmt.org/moses/RELEASE-2.1/models/cs-en/steps/3/TRAINING_extract-phrases.3
>
>>
>> If this is the case, which arguments and options should I pass to extract
>> for a baseline training? I think the syntax is something like
>>
> The script will then call extract with the following argument
>--Placeholders @num@
> You can see it in the STDERR file of the above script
>
> http://www.statmt.org/moses/RELEASE-2.1/models/cs-en/steps/3/TRAINING_extract-phrases.3.STDERR
>
>>
>>  syntax: extract en de align extract max-length [orientation [ --model
>> [wbe|phrase|hier]-[msd|mslr|mono] ] | --OnlyOutputSpanInfo | --NoTTable |
>> --GZOutput | --IncludeSentenceId | --SentenceOffset n | --InstanceWeights
>> filename ]
>>
>> In particular I cannot figure out what should be passed as 'align' and
>> 'extract' arguments.
>>
>>
>> Regards,
>>
>> Vito
>>
>>  --
>>
>> *M**. Vito MANDORINO -- Chief Scientist*
>>
>>
>>
>>  *The Translation Trustee*
>>
>> *1, Place Charles de Gaulle, **78180 Montigny-le-Bretonneux*
>>
>> *Tel : +33 1 30 44 04 23   Mobile : +33 6 84 65 68 89*
>>
>> *Email :*  *vito.mandor...@linguacustodia.com*
>>
>> *Website :*  *www.linguacustodia.com  -
>> www.thetranslationtrustee.com  *
>>
>>
>>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>


-- 
*M**. Vito MANDORINO -- Chief Scientist*



 *The Translation Trustee*

*1, Place Charles de Gaulle, **78180 Montigny-le-Bretonneux*

*Tel : +33 1 30 44 04 23   Mobile : +33 6 84 65 68 89*

*Email :*  *vito.mandor...@linguacustodia.com*

*Website :*  *www.linguacustodia.com  -
www.thetranslationtrustee.com  *


Re: [Moses-support] RandLM make Error

2014-11-19 Thread Tom Hoar
The attached Bash script (renamed to .txt for email) compiles RandLM on 
Ubuntu 12.04 and 14.04. The paths and logging in this script might not 
be what you want, but you can modify as you see fit.


Note that the Debian package manager includes packages called 
"sparsehash" and "libsparsehash-dev." It describes them as "Google's 
extremely memory-efficient C++ hash_map implementation." I do not 
remember whether one of these packages is a dependency of RandLM, but your 
error seems to hint that it could be.


The resulting binaries worked fine on 12.04. We just recently learned 
that although the package compiles on 14.04, the resulting binaries do 
not run. The above dependency might also be the source of our error, but 
we haven't had time to debug the problem.
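If the missing google/sparse_hash_map header is indeed provided by that Debian package (an assumption based on the hint above, not something the RandLM build documents), installing it before rebuilding is worth trying:

```shell
# Install Google's sparsehash headers, which provide google/sparse_hash_map
# on Debian/Ubuntu (assumption: this is the dependency RandLM's LDHT
# component expects).
sudo apt-get update
sudo apt-get install -y libsparsehash-dev
# Then rebuild RandLM:
# make clean && make
```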


Tom


On 11/19/2014 07:47 PM, Hieu Hoang wrote:
There is a script within the randlm project that compiles just the 
library needed to integrate the library into Moses.

https://sourceforge.net/p/randlm/code/HEAD/tree/trunk/manual-compile/compile.sh
It's been a while since people have asked about RandLM, I'm not sure 
who's still using it and who has time & experience to take care of it.


On 19 November 2014 11:50, Achchuthan Yogarajah wrote:


Hi Everyone,

when I build RandLM with the following command:
make
I got some errors:

Making all in RandLM
make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
Making all in LDHT
make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
/bin/bash ../../libtool  --tag=CXX --mode=compile g++
-DHAVE_CONFIG_H -I. -I../..  -I./  -fPIC -Wno-deprecated -Wall
-ggdb -DTIXML_USE_TICPP -g -O2 -MT libLDHT_la-Client.lo -MD -MP
-MF .deps/libLDHT_la-Client.Tpo -c -o libLDHT_la-Client.lo `test
-f 'Client.cpp' || echo './'`Client.cpp
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -I./ -fPIC
-Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g -O2 -MT
libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c
Client.cpp -fPIC -DPIC -o .libs/libLDHT_la-Client.o
In file included from Client.cpp:6:0:
Client.h:8:34: fatal error: google/sparse_hash_map: No such file
or directory
 #include <google/sparse_hash_map>
compilation terminated.
make[1]: *** [libLDHT_la-Client.lo] Error 1
make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
make: *** [all-recursive] Error


-- 


*Thanks & Regards,
**Yogarajah Achchuthan*






--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu





#! /bin/bash
#set -e

project=randlm
make_bin=$(dirname `readlink -f ${0}`)
if [ $EUID -ne 0 -a "${make_bin}" == "/usr/local/bin" ] ; then
echo 'This script must be run as root/super user' 1>&2
exit 1
elif [ $EUID -eq 0 -a ! "${make_bin}" == "/usr/local/bin" ] ; then
echo 'This script can not be run as root/super user' 1>&2
exit 1
fi
parent=$(dirname ${make_bin})



usage() {
cat <<- EOF
usage: ${0##*/} options

This script compiles RandLM and places the results in 
subfolders under the --prefix folder.

OPTIONS:
   -p|-P   prefix directory (${parent})
   -u|-U   uninstall
   -h|-H   Show this message
EOF
}




# parse command line
while getopts "hHuUp:P:" OPTION ; do
case $OPTION in
h|H)
usage
exit 0
;;
p|P)
prefix=$OPTARG
;;
u|U)
uninstall=1
;;
*)
usage
exit 1
;;
esac
done









prefix=${prefix:-"${parent}"}
bin="${prefix}/bin"
uninstall=${uninstall:-0}
libdir="${prefix}/lib/${project}"


pushd "${parent}/src/${project}" >/dev/null

if [ $uninstall -eq 1 ] ; then

procs=${procs:-$(nproc 2>/dev/null || echo 1)}  # parallel make jobs; default to all cores
make -j${procs} uninstall
exitcode=(${PIPESTATUS[*]})

[ -d "${libdir}" ] && find "${libdir}" -depth -type d -empty -exec rmdir {} \;

if [ -d "${bin}" ] ; then
# remove broken symlinks

Re: [Moses-support] RandLM make Error

2014-11-19 Thread Hieu Hoang
There is a script within the randlm project that compiles just the library
needed to integrate RandLM into Moses:

https://sourceforge.net/p/randlm/code/HEAD/tree/trunk/manual-compile/compile.sh
It's been a while since people have asked about RandLM, I'm not sure who's
still using it and who has time & experience to take care of it.
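The missing header in the error below comes from Google's sparsehash library. A quick way to probe whether the compiler can see it (the Debian/Ubuntu package name mentioned in the script is an assumption; other distributions package sparsehash under different names):

```shell
#!/bin/sh
# Probe whether g++ can see the sparsehash header that RandLM's LDHT
# component includes (header name taken from the compile error below).
check_sparsehash() {
    echo '#include <google/sparse_hash_map>' | g++ -x c++ -fsyntax-only - 2>/dev/null
}

if check_sparsehash; then
    echo 'sparsehash headers found'
else
    # Package name below is an assumption for Debian/Ubuntu systems.
    echo 'sparsehash headers missing; try: sudo apt-get install libsparsehash-dev'
fi
```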

On 19 November 2014 11:50, Achchuthan Yogarajah  wrote:

> Hi Everyone,
>
> when i build RandLM with the following command
> make
> i got some error
>
> Making all in RandLM
> make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
> make[1]: Nothing to be done for `all'.
> make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
> Making all in LDHT
> make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
> /bin/bash ../../libtool  --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H
> -I. -I../..  -I./  -fPIC -Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g
> -O2 -MT libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c -o
> libLDHT_la-Client.lo `test -f 'Client.cpp' || echo './'`Client.cpp
> libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -I./ -fPIC
> -Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g -O2 -MT
> libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c Client.cpp
> -fPIC -DPIC -o .libs/libLDHT_la-Client.o
> In file included from Client.cpp:6:0:
> Client.h:8:34: fatal error: google/sparse_hash_map: No such file or
> directory
>  #include <google/sparse_hash_map>
>           ^
> compilation terminated.
> make[1]: *** [libLDHT_la-Client.lo] Error 1
> make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
> make: *** [all-recursive] Error
>
>
> --
>
>
> Thanks & Regards,
> Yogarajah Achchuthan
> [ LinkedIn | Twitter | Facebook ]
>
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


[Moses-support] RandLM make Error

2014-11-19 Thread Achchuthan Yogarajah
Hi Everyone,

When I build RandLM with the following command
make
I get the following error:

Making all in RandLM
make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/RandLM'
Making all in LDHT
make[1]: Entering directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
/bin/bash ../../libtool  --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.
-I../..  -I./  -fPIC -Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g -O2
-MT libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c -o
libLDHT_la-Client.lo `test -f 'Client.cpp' || echo './'`Client.cpp
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -I./ -fPIC
-Wno-deprecated -Wall -ggdb -DTIXML_USE_TICPP -g -O2 -MT
libLDHT_la-Client.lo -MD -MP -MF .deps/libLDHT_la-Client.Tpo -c Client.cpp
-fPIC -DPIC -o .libs/libLDHT_la-Client.o
In file included from Client.cpp:6:0:
Client.h:8:34: fatal error: google/sparse_hash_map: No such file or
directory
 #include <google/sparse_hash_map>
          ^
compilation terminated.
make[1]: *** [libLDHT_la-Client.lo] Error 1
make[1]: Leaving directory `/home/achchuthan/randlm-0.2.5/src/LDHT'
make: *** [all-recursive] Error


-- 


Thanks & Regards,
Yogarajah Achchuthan
[ LinkedIn | Twitter | Facebook ]


Re: [Moses-support] mgiza++ force alignment: segmentation fault when reloading a big N table

2014-11-19 Thread Hala Almaghout
Hi Lefteris,

I commented in the issue you opened on github. The OS I'm running MGIZA on
is openSUSE 12.2

Best,

Hala

On 19 November 2014 10:43, Eleftherios Avramidis <
eleftherios.avrami...@dfki.de> wrote:

>  Hi Hala,
>
> I never found a solution to this, since I had a strict deadline and so I
> stopped using mgiza and went back to IBM model 1. Thanks for the reminder,
> as I may need it in the future. In case it helps, I have opened this
> issue here: https://github.com/moses-smt/mgiza/issues/2
>
> Please go there and add your feedback as well.
>
> What is your operating system? I have some suspicion that this may be due
> to some outdated Ubuntu libraries.
>
> best
> Lefteris
>
>
> On 19.11.2014 12:34, Hala Maghout wrote:
>
> Hi,
>
>  I'm facing a segmentation fault problem when loading big N tables during
> forced alignment using MGIZA, which was posted previously on moses list
> (thread below) but no solution was suggested. As Lefteris explained, it's
> due to the big N table size. Any suggestion on how to solve it other than
> cutting out entries from the N table?
>
>  Many thanks,
>
>  Best,
>
>  Hala
>
>
> On 11 August 2014 11:03, Hieu Hoang  wrote:
>
>>  did you manage to solve this issue? I tried contacting Qin Gao but
>> there's been no reply so far.
>>
>>  From my experience with mgiza a while ago, force alignment works ok
>>
>>
>>  On 3 August 2014 23:34, Eleftherios Avramidis <
>> eleftherios.avrami...@dfki.de> wrote:
>>
>>>   Hi,
>>>
>>> I am trying to produce word alignment for individual sentences. For this
>>> purpose I am using the "force align" functionality of mgiza++. Unfortunately,
>>> when I am loading a big N table (fertility), mgiza crashes with a
>>> segmentation fault.
>>>
>>> In particular, I have initially run mgiza on the full training parallel
>>> corpus using the default settings of the Moses script:
>>>
>>> /project/qtleap/software/moses-2.1.1/bin/training-tools/mgiza  
>>> -CoocurrenceFile 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de.cooc
>>>  -c 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en-de-int-train.snt
>>>  -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 
>>> -ncpus 24 -nodumps 0 -nsmooth 4 -o 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de 
>>> -onlyaldumps 0 -p0 0.999 -s 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/de.vcb
>>>  -t 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en.vcb
>>>
>>>  Afterwards, by executing the mgiza force-align script, I run the
>>> following command
>>>
>>> /project/qtleap/software/moses-2.1.1/mgizapp-code/mgizapp//bin/mgiza 
>>> giza.en-de/en-de.gizacfg -c 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en-de.snt
>>>  -o 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de
>>>  -s 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./de.vcb
>>>  -t 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en.vcb
>>>  -m1 0 -m2 0 -mh 0 -coocurrence 
>>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de.cooc
>>>  -restart 11 -previoust giza.en-de/en-de.t3.final -previousa 
>>> giza.en-de/en-de.a3.final -previousd giza.en-de/en-de.d3.final -previousn 
>>> giza.en-de/en-de.n3.final -previousd4 giza.en-de/en-de.d4.final 
>>> -previousd42 giza.en-de/en-de.D4.final -m3 0 -m4 1
>>>
>>>  This runs fine, until I get the following error:
>>>
>>>   We are going to load previous N model from giza.en-de/en-de.n3.final
>>>
>>>  Reading fertility table from giza.en-de/en-de.n3.final
>>>
>>>  Segmentation fault (core dumped)
>>>
>>>
>>>  The n-table that is failing has about 300k entries. For this reason, I
>>> thought I should try to see if the size is a problem. So I truncated the
>>> table to 60k entries. And it works! But the alignments are not good.
>>>
>>> I am struggling to fix this, so any help would be appreciated. I am
>>> running a freshly installed mgiza, on Ubuntu 12.04
>>>
>>> cheers,
>>> Lefteris
>>>
>>> --
>>> MSc. Inf. Eleftherios Avramidis
>>> DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
>>> Tel. +49-30 238 95-1806
>>>
>>> Fax. +49-30 238 95-1810
>>>
>>> ---
>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>
>>> Geschaeftsfuehrung:
>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>> Dr. Walter Olthoff
>>>
>>> Vorsitzender des Aufsichtsrats:
>>> Prof. Dr. h.c. Hans A. Aukes
>>>
>>> Amtsgericht Kaiserslautern, HRB 2313
>>> ---
>>> 
>>>
>>>

Re: [Moses-support] mgiza++ force alignment: segmentation fault when reloading a big N table

2014-11-19 Thread Eleftherios Avramidis

Hi Hala,

I never found a solution to this, since I had a strict deadline and so I
stopped using mgiza and went back to IBM model 1. Thanks for the reminder,
as I may need it in the future. In case it helps, I have opened this
issue here: https://github.com/moses-smt/mgiza/issues/2

Please go there and add your feedback as well.

What is your operating system? I have some suspicion that this may be 
due to some outdated Ubuntu libraries.


best
Lefteris

On 19.11.2014 12:34, Hala Maghout wrote:

Hi,

I'm facing a segmentation fault problem when loading big N tables 
during forced alignment using MGIZA, which was posted previously on 
moses list (thread below) but no solution was suggested. As Lefteris 
explained, it's due to the big N table size. Any suggestion on how to 
solve it other than cutting out entries from the N table?


Many thanks,

Best,

Hala


On 11 August 2014 11:03, Hieu Hoang wrote:


did you manage to solve this issue? I tried contacting Qin Gao but
there's been no reply so far.

From my experience with mgiza a while ago, force alignment works ok


On 3 August 2014 23:34, Eleftherios Avramidis
<eleftherios.avrami...@dfki.de> wrote:

Hi,

I am trying to produce word alignment for individual
sentences. For this purpose I am using the "force align"
functionality of mgiza++. Unfortunately, when I am loading a big
N table (fertility), mgiza crashes with a segmentation fault.

In particular, I have initially run mgiza on the full training
parallel corpus using the default settings of the Moses script:

/project/qtleap/software/moses-2.1.1/bin/training-tools/mgiza  
-CoocurrenceFile 
/local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de.cooc 
-c 
/local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en-de-int-train.snt
 -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 
24 -nodumps 0 -nsmooth 4 -o 
/local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de 
-onlyaldumps 0 -p0 0.999 -s 
/local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/de.vcb 
-t 
/local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en.vcb

Afterwards, by executing the mgiza force-align script, I run
the following command


/project/qtleap/software/moses-2.1.1/mgizapp-code/mgizapp//bin/mgiza 
giza.en-de/en-de.gizacfg -c 
/local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en-de.snt
 -o 
/local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de
 -s 
/local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./de.vcb
 -t 
/local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en.vcb
 -m1 0 -m2 0 -mh 0 -coocurrence 
/local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de.cooc
 -restart 11 -previoust giza.en-de/en-de.t3.final -previousa 
giza.en-de/en-de.a3.final -previousd giza.en-de/en-de.d3.final -previousn 
giza.en-de/en-de.n3.final -previousd4 giza.en-de/en-de.d4.final -previousd42 
giza.en-de/en-de.D4.final -m3 0 -m4 1

This runs fine, until I get the following error:

   We are going to load previous N model from 
giza.en-de/en-de.n3.final

Reading fertility table from giza.en-de/en-de.n3.final

Segmentation fault (core dumped)

The n-table that is failing has about 300k entries. For this
reason, I thought I should try to see if the size is a
problem. So I truncated the table to 60k entries. And it
works! But the alignments are not good.

I am struggling to fix this, so any help would be appreciated.
I am running a freshly installed mgiza, on Ubuntu 12.04

cheers,
Lefteris

-- 
MSc. Inf. Eleftherios Avramidis

DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806
Fax. +49-30 238 95-1810



---
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313

---







-- 
Hieu Hoang

Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

Re: [Moses-support] mgiza++ force alignment: segmentation fault when reloading a big N table

2014-11-19 Thread Hala Almaghout
Hi,

I'm facing a segmentation fault problem when loading big N tables during
forced alignment using MGIZA, which was posted previously on moses list
(thread below) but no solution was suggested. As Lefteris explained, it's
due to the big N table size. Any suggestion on how to solve it other than
cutting out entries from the N table?

Many thanks,

Best,

Hala

On 11 August 2014 11:03, Hieu Hoang  wrote:

> did you manage to solve this issue? I tried contacting Qin Gao but there's
> been no reply so far.
>
> From my experience with mgiza a while ago, force alignment works ok
>
>
> On 3 August 2014 23:34, Eleftherios Avramidis <
> eleftherios.avrami...@dfki.de> wrote:
>
>>  Hi,
>>
>> I am trying to produce word alignment for individual sentences. For this
>> purpose I am using the "force align" functionality of mgiza++. Unfortunately,
>> when I am loading a big N table (fertility), mgiza crashes with a
>> segmentation fault.
>>
>> In particular, I have initially run mgiza on the full training parallel
>> corpus using the default settings of the Moses script:
>>
>> /project/qtleap/software/moses-2.1.1/bin/training-tools/mgiza  
>> -CoocurrenceFile 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de.cooc
>>  -c 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en-de-int-train.snt
>>  -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 
>> -ncpus 24 -nodumps 0 -nsmooth 4 -o 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de 
>> -onlyaldumps 0 -p0 0.999 -s 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/de.vcb
>>  -t 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en.vcb
>>
>>  Afterwards, by executing the mgiza force-align script, I run the
>> following command
>>
>> /project/qtleap/software/moses-2.1.1/mgizapp-code/mgizapp//bin/mgiza 
>> giza.en-de/en-de.gizacfg -c 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en-de.snt
>>  -o 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de
>>  -s 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./de.vcb
>>  -t 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en.vcb
>>  -m1 0 -m2 0 -mh 0 -coocurrence 
>> /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de.cooc
>>  -restart 11 -previoust giza.en-de/en-de.t3.final -previousa 
>> giza.en-de/en-de.a3.final -previousd giza.en-de/en-de.d3.final -previousn 
>> giza.en-de/en-de.n3.final -previousd4 giza.en-de/en-de.d4.final -previousd42 
>> giza.en-de/en-de.D4.final -m3 0 -m4 1
>>
>>  This runs fine, until I get the following error:
>>
>>   We are going to load previous N model from giza.en-de/en-de.n3.final
>>
>>  Reading fertility table from giza.en-de/en-de.n3.final
>>
>>  Segmentation fault (core dumped)
>>
>>
>>  The n-table that is failing has about 300k entries. For this reason, I
>> thought I should try to see if the size is a problem. So I truncated the
>> table to 60k entries. And it works! But the alignments are not good.
>>
>> I am struggling to fix this, so any help would be appreciated. I am
>> running a freshly installed mgiza, on Ubuntu 12.04
>>
>> cheers,
>> Lefteris
>>
>> --
>> MSc. Inf. Eleftherios Avramidis
>> DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
>> Tel. +49-30 238 95-1806
>>
>> Fax. +49-30 238 95-1810
>>
>> ---
>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>
>> Geschaeftsfuehrung:
>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>> Dr. Walter Olthoff
>>
>> Vorsitzender des Aufsichtsrats:
>> Prof. Dr. h.c. Hans A. Aukes
>>
>> Amtsgericht Kaiserslautern, HRB 2313
>> ---
>>  
>>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


Re: [Moses-support] Regarding Factored Model

2014-11-19 Thread Mukund Roy


On Wed, 19 Nov 2014 09:49:01 +
Hieu Hoang  wrote:

> So the target side of your phrase table contains lemma and POS tags?
> 



Yes Sir, the phrase table does contain lemma and POS.




> Also, is the moses.ini file you sent the exact 1 you used? There are
> 2 LM in specified, but only weight for 1 of them
> 


Before MERT, moses.ini had both LM0 and LM1. After MERT, only LM0
existed; the LM1 weight was dropped.

Thanks & regards
Mukund K Roy





> On 18 November 2014 11:30, Mukund Roy  wrote:
> 
> > Dear Sir
> >
> > I used below command for building factored model
> >
> > $MOSES_HOME/scripts/training/
> > train-model.perl -root-dir
> > $WORKING_DIR/train -corpus $WORKING_DIR/Train.true.clean -f $slang
> > -e $tlang  -alignment grow-diag-final-and -reordering
> > msd-bidirectional-fe --lm
> > 2:3:$WORKING_DIR/lm/lm-corpus.blm.POS.$tlang:0 --alignment-facor
> > 0-0 --translation-factors 0-0,2 --reordering-factors 0-0
> > --decoding-steps t0
> >
> > I have a factored corpus with two factor: lemma & POS. The baseline
> > Phrase based model produced BLEU score of around 27 but using above
> > command for Factored model, BLEU score dipped to 3.5.
> >
> > @ Hoang: Sir As you said I am attaching the ini file and Sample
> > input outputs of Baseline phrase based model and Factored Model
> >
> > Thanks & Regards
> >
> >
> >
> >
> >
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> >
> 
> 


---
[ C-DAC is on Social-Media too. Kindly follow us at:
Facebook: https://www.facebook.com/CDACINDIA & Twitter: @cdacindia ]

This e-mail is for the sole use of the intended recipient(s) and may
contain confidential and privileged information. If you are not the
intended recipient, please contact the sender by reply e-mail and destroy
all copies and the original message. Any unauthorized review, use,
disclosure, dissemination, forwarding, printing or copying of this email
is strictly prohibited and appropriate legal action will be taken.
---



Re: [Moses-support] How should I properly change the moses.ini file for tuning if I did not prepare an arpa file (and do we need an arpa file)?

2014-11-19 Thread Barry Haddow
Hi Daniel

That's good news. I'm not sure how the equals sign dropped out of the 
Moses documentation but I've put it back in now and simplified things a bit,

cheers - Barry
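The ARPA-free KenLM route mentioned in the thread below can be sketched in two commands. This is illustrative only: the corpus name, LM order, and the moses install location are placeholders.

```shell
# Estimate a 3-gram ARPA LM with KenLM's lmplz, then binarise it with
# build_binary for faster loading at tuning/decoding time.
~/mosesdecoder/bin/lmplz -o 3 < corpus.true.en > lm.en.arpa
~/mosesdecoder/bin/build_binary lm.en.arpa lm.en.blm
```

The binarised lm.en.blm is what would then be referenced from moses.ini, so no ARPA file needs to be kept around after this step.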

On 18/11/14 16:26, Daniel Seita wrote:
> Putting in the equal sign appeared to do the trick. So --text=yes 
> works but not --text yes.
>
> (PS: Sorry for emailing this directly to you Barry, I meant to respond 
> to the whole mailing list so everyone could know.)
>
> Thanks,
> Daniel
>
>
> On Tue, Nov 18, 2014 at 7:25 AM, Barry Haddow 
> mailto:bhad...@staffmail.ed.ac.uk>> wrote:
>
> Hi Daniel
>
> On 18/11/14 15:14, Daniel Seita wrote:
>
> Thanks for the response Barry. I'm still confused after
> reading your suggestions so perhaps you or someone can clarify
> when you have time?
>
> (1) I think that the /tuning/ step requires /both/ the arpa
> and the binarized files, right? While the /training/ step only
> requires the binarized version? I haven't reached the testing
> step yet.
>
>
> The training step doesn't actually use the LM, it just inserts the
> path into the moses.ini file. The tuning step can use either an
> arpa or a binarised file (not both) but using a binarised file
> will take up less RAM.
>
>
> (2) OK, so as you mention, the baseline instructions assume we
> use IRSTLM to create the arpa file, then use KenLM to binarize
> it. Under the "Language Model Training" section, there are six
> boxes that have command line instructions (the last one is
> querying the language model). I assume this means you /only/
> want us to execute the commands in the first and fifth boxes?
>
>
> Yes, you should only run the first and the fifth. The others are
> options which imho confuse the reader.
>
>
> (3) Is it possible to get the entire training, tuning, and
> testing steps done /without/ an arpa file? This might help
> avoid my problems because I don't think I have a problem
> getting my binarized IRSTLM files. The instructions, as you
> say, do not explain how to configure Moses to do that (and we
> do this by changing the moses.ini file, right?).
>
>
> You need a language model file for tuning and testing, but if you
> directly build an IRSTLM binarised file, then you don't need an
> ARPA file. You do need to make changes to moses.ini (as compared
> to the baseline instructions) and at the moment I can't lay my
> hands on the correct arguments.
>
>
> I'm going to check the IRSTLM documentation because in the
> version I have (5.80.06) both "--text yes" and "--text" fail
> and create the exact error "DEBUG: warning too many arguments"
> that we see in the mailing list discussion that we both linked
> to. Also, running that perl script (to do "steps 1-5") to get
> the LM also fails (that command itself doesn't fail; it causes
> problems later in the sequence), and using the EMS fails on
> the tuning step, I assume because of the same issues above,
> but that's a story for another day.
>
>
> That's all a bit strange. The "official" IRSTLM argument is
> "--text=yes" so that should work. The other methods you mention
> should also work.
>
> cheers - Barry
>
>
>
> Thanks,
> Daniel
>
>
>
> On Tue, Nov 18, 2014 at 1:30 AM, Barry Haddow
>  
>  >> wrote:
>
> Hi Daniel
>
> I looked at the baseline system instructions, and they are
> a bit
> confusing around the LM building. They explain how to use
> IRSTLM
> to binarise a language model, but do not say how to configure
> Moses to load an IRSTLM-binarised model.
>
> In fact, when I wrote the original baseline system manual, I
> assumed that you would build an ARPA file with IRSTLM
> (since KENLM
> didn't do estimation then, and SRILM wasn't open-source),
> and then
> binarise with KENLM and use it at runtime.
>
> Now, however, KENLM does estimation, and creates ARPA
> files. This
> could be one solution to your problem:
> http://kheafield.com/code/kenlm/estimation/
>
> If you want to build an ARPA file with IRSTLM, then this is
> definitely possible, but as noted here
> http://comments.gmane.org/gmane.comp.nlp.moses.user/9924
> there is some uncertainty over the arguments. I assume
> this is a
> versioning issue, but the bottom line is that either
> "--text yes"
> or "--text" should work. When I originally wrote the baseline
> instructions, the arguments I gave worked with th

Re: [Moses-support] Regarding Factored Model

2014-11-19 Thread Hieu Hoang
So the target side of your phrase table contains lemma and POS tags?

Also, is the moses.ini file you sent the exact one you used? There are 2 LMs
specified, but only a weight for one of them.
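For reference, a minimal sketch of how two LMs should appear in moses.ini: every feature line under [feature] needs a matching entry under [weight]. The paths, orders, and factor indices below are assumptions, not taken from the attached file.

```ini
[feature]
KENLM name=LM0 factor=0 path=/path/to/surface.blm order=3
KENLM name=LM1 factor=2 path=/path/to/pos.blm order=3

[weight]
LM0= 0.5
LM1= 0.5
```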

On 18 November 2014 11:30, Mukund Roy  wrote:

> Dear Sir
>
> I used below command for building factored model
>
> $MOSES_HOME/scripts/training/
> train-model.perl -root-dir
> $WORKING_DIR/train -corpus $WORKING_DIR/Train.true.clean -f $slang  -e
> $tlang  -alignment grow-diag-final-and -reordering msd-bidirectional-fe
> --lm 2:3:$WORKING_DIR/lm/lm-corpus.blm.POS.$tlang:0 --alignment-facor
> 0-0 --translation-factors 0-0,2 --reordering-factors 0-0
> --decoding-steps t0
>
> I have a factored corpus with two factor: lemma & POS. The baseline
> Phrase based model produced BLEU score of around 27 but using above
> command for Factored model, BLEU score dipped to 3.5.
>
> @ Hoang: Sir As you said I am attaching the ini file and Sample input
> outputs of Baseline phrase based model and Factored Model
>
> Thanks & Regards
>
>
>
>
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


[Moses-support] Fwd: Moses-support post from eabau...@umail.iu.edu requires approval

2014-11-19 Thread Hieu Hoang
Please subscribe to the Moses mailing list before posting to it. You can
subscribe here
   http://mailman.mit.edu/mailman/listinfo/moses-support
You say the computer has boost installed. Where are the boost libraries?
What is the exact file name of the boost thread library? ie. is it
   libboost_thread.a
   libboost_thread-mt.a
   libboost_thread.so
   libboost_thread-mt.so
When I install moses, I prefer to compile my own boost library. The
instructions to do this is here
   http://www.statmt.org/moses/?n=Development.GetStarted
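A quick way to answer the question above is to search the usual library directories for the boost thread variants (the directory list is an assumption; adjust it if boost was installed under a custom prefix):

```shell
#!/bin/sh
# List the boost thread library variants visible in the common library
# directories; an empty result means boost is elsewhere (or not installed),
# in which case its location must be passed to bjam explicitly.
find /usr/lib /usr/lib64 /usr/local/lib -name 'libboost_thread*' 2>/dev/null
```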

-- Forwarded message --
From: 
Date: 19 November 2014 01:22
Subject: Moses-support post from eabau...@umail.iu.edu requires approval
To: moses-support-ow...@mit.edu


As list administrator, your authorization is requested for the
following mailing list posting:

List:Moses-support@mit.edu
From:eabau...@umail.iu.edu
Subject: installation errors
Reason:  Post by non-member to a members-only list

At your convenience, visit:

http://mailman.mit.edu/mailman/admindb/moses-support

to approve or deny the request.


-- Forwarded message --
From: Eric Baucom 
To: moses-support@mit.edu
Cc:
Date: Tue, 18 Nov 2014 20:22:34 -0500
Subject: installation errors
Hello,

I've unzipped the latest Moses decoder archive from git into a directory,
and used the command
" ./bjam --prefix=~/software/mymoses/ -j8 "
to attempt to install the Moses decoder.  I'm trying to install on a Cray
Linux computing cluster ( https://kb.iu.edu/d/bcqt ), with Boost version
1.55.0.  Any insight is appreciated!  The build log is attached.

Thanks,
Eric Baucom


-- Forwarded message --
From: moses-support-requ...@mit.edu
To:
Cc:
Date:
Subject: confirm f41adf1c1441c2027da9a5140a232d585c2f5993
If you reply to this message, keeping the Subject: header intact,
Mailman will discard the held message.  Do this if the message is
spam.  If you reply to this message and include an Approved: header
with the list password in it, the message will be approved for posting
to the list.  The Approved: header can also appear in the first line
of the body of the reply.



-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


build.log.gz
Description: GNU Zip compressed data


Re: [Moses-support] placeholders for numbers - extract step

2014-11-19 Thread Hieu Hoang
Hi Vito,

On 18 November 2014 11:30, Vito Mandorino  wrote:

> Hello everyone,
>
> I am trying to use placeholders for numbers in phrase-based MT, according
> to http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc75
>
> The above page says
>
> ---
>
>  During extraction, add the following to the extract command (phrase-based
> only for now):
>
> ./extract --Placeholders @num@ 
>
> --
>
> Does this mean that I have to first run train-model.perl with
> --last-step=4, then the line above and then again train-model.perl with
> --first-step=6?
>
When you run train-model.perl, add the argument
   -extract-options '--Placeholders @num@'
You can see it in this script that the EMS creates

http://www.statmt.org/moses/RELEASE-2.1/models/cs-en/steps/3/TRAINING_extract-phrases.3

>
> If this is the case, which arguments and options should I pass to extract
> for a baseline training? I think the syntax is something like
>
The script will then call extract with the following argument
   --Placeholders @num@
You can see it in the STDERR file of the above script

http://www.statmt.org/moses/RELEASE-2.1/models/cs-en/steps/3/TRAINING_extract-phrases.3.STDERR
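Putting this together, a baseline training run with the placeholder option can be sketched as follows. Everything except -extract-options is illustrative: the corpus name, language pair, and LM path are placeholders.

```shell
# Illustrative only: corpus, language pair, and LM path are placeholders.
~/mosesdecoder/scripts/training/train-model.perl \
    -root-dir train -corpus corpus.clean -f fr -e en \
    -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:$HOME/lm/lm.en.blm:8 \
    -extract-options '--Placeholders @num@'
```

This runs the whole pipeline in one go, so there is no need to split training at step 4 and restart at step 6 just to pass the option to extract.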

>
>  syntax: extract en de align extract max-length [orientation [ --model
> [wbe|phrase|hier]-[msd|mslr|mono] ] | --OnlyOutputSpanInfo | --NoTTable |
> --GZOutput | --IncludeSentenceId | --SentenceOffset n | --InstanceWeights
> filename ]
>
> In particular I cannot figure out what should be passed as 'align' and
> 'extract' arguments.
>
>
> Regards,
>
> Vito
>
>  --
>
> M. Vito MANDORINO -- Chief Scientist
>
> The Translation Trustee
>
> 1, Place Charles de Gaulle, 78180 Montigny-le-Bretonneux
>
> Tel: +33 1 30 44 04 23   Mobile: +33 6 84 65 68 89
>
> Email: vito.mandor...@linguacustodia.com
>
> Website: www.linguacustodia.com - www.thetranslationtrustee.com
> *Website :*  *www.linguacustodia.com  -
> www.thetranslationtrustee.com  *
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu