Re: [Moses-support] Factored model configuration using stems and POS

2016-08-02 Thread Gmehlin Floran
Hi Saso & Hieu,

Thank you for your replies!

This is what I am currently doing: starting with simpler models (on a very 
short corpus so that I get quicker feedback). I tried the configuration with 
translation-factors = "word+stem+pos -> word+stem+pos" and it works, giving me 
the best results so far (around 28-29 BLEU).

Whenever I try to add a generation step, however simple it might be, it 
crashes at the TUNING:tune phase.
Here is what I fail to understand: in order to use a factor in a generation 
step, say for example:

generation-factors = "word+stem -> pos"

Do you first need to translate the left-hand side factors, e.g. "word -> 
word, stem -> stem" or "word+stem -> word+stem"?

Thank you for your help!

From: Hieu Hoang [hieuho...@gmail.com]
Sent: 01 August 2016 20:50
To: Gmehlin Floran
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Factored model configuration using stems and POS

I would start simple, then build it up once I know what it's doing, e.g. start 
with
input-factors = word stem pos
output-factors = word stem pos
alignment-factors = "word -> word"
translation-factors = "word+stem+pos -> word+stem+pos"
reordering-factors = "word -> word"
generation-factors = ""
decoding-steps = "t0"


Hieu Hoang
http://www.hoang.co.uk/hieu

On 27 July 2016 at 11:46, Gmehlin Floran 
<fgmeh...@student.ethz.ch> wrote:
Hi,

I have been trying to build a factored translation model using stems and 
part-of-speech for a week now and I cannot get satisfying results. This 
probably comes from my factor configuration, as I probably do not fully 
understand how it works (I am following the paper "Factored Translation 
Models" by Koehn and Hoang).

I previously built a standard phrase-based model (with the same corpus) which 
gave me a BLEU score of around 24-25 (DE-EN). For my current factored model, 
the BLEU score is around 1 (?).

I tried opening the moses.ini files (tuned and untuned) to see if I could get 
something translated by copy/pasting some lines from the original corpus, but 
it only translates from German to German and does not recognize most of the 
words, if not all of them.

The motivation behind the factored model is that there are too many OOVs with 
the standard phrase-based model, so I wanted to try using stems to reduce them.

I am annotating the corpus with TreeTagger, and the factor configuration is as 
follows:

input-factors = word stem pos
output-factors = word stem pos
alignment-factors = "word+stem -> word+stem"
translation-factors = "stem -> stem,pos -> pos"
reordering-factors = "word -> word"
generation-factors = "stem -> pos,stem+pos -> word"
decoding-steps = "t0,g0,t1,g1"

Is there something wrong with that?

I only use a single language model over surface forms as the LM over POS yields 
a segmentation fault in the tuning phase.

Does anyone have an idea how I should configure my model to exploit stems in 
the source language?
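
One configuration I would like to try, if I understand alternative decoding 
paths correctly (this is only a sketch, and the syntax may well be off), is a 
back-off from surface forms to stems:

translation-factors = "word -> word, stem -> word"
decoding-steps = "t0:t1"

Here t0 would translate known surface forms and t1 would be the back-off path 
that maps source stems directly to target words.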

Thanks a lot,

Floran



[Moses-support] Decoder Died during TUNING:Tune phase (Factored EMS)

2016-07-22 Thread Gmehlin Floran
Hi,

The decoder dies when reaching the TUNING:tune phase of the EMS and I have no 
idea why. I'm running a factored model with 2 factors as input and 2 factors 
as output.

The following is written in the file TUNING_tune.8.STDERR:

Using SCRIPTS_ROOTDIR: /local/moses/mosesdecoder/scripts
Asking moses for feature names and values from 
/local/experiments/de_en_fact/tuning/moses.filtered.ini.8
Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0  -config 
/local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -show-weights
exec: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0  -config 
/local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -show-weights
Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0  -config 
/local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -show-weights > 
./features.list 2> /dev/null
MERT starting values and ranges for random generation:
  LexicalReordering0 =   0.300 ( 0.00 ..  1.00)
  LexicalReordering0 =   0.300 ( 0.00 ..  1.00)
  LexicalReordering0 =   0.300 ( 0.00 ..  1.00)
  LexicalReordering0 =   0.300 ( 0.00 ..  1.00)
  LexicalReordering0 =   0.300 ( 0.00 ..  1.00)
  LexicalReordering0 =   0.300 ( 0.00 ..  1.00)
  Distortion0 =   0.300 ( 0.00 ..  1.00)
  LM0 =   0.500 ( 0.00 ..  1.00)
  LM1 =   0.500 ( 0.00 ..  1.00)
  WordPenalty0 =  -1.000 ( 0.00 ..  1.00)
  PhrasePenalty0 =   0.200 ( 0.00 ..  1.00)
  TranslationModel0 =   0.200 ( 0.00 ..  1.00)
  TranslationModel0 =   0.200 ( 0.00 ..  1.00)
  TranslationModel0 =   0.200 ( 0.00 ..  1.00)
  TranslationModel0 =   0.200 ( 0.00 ..  1.00)
  TranslationModel1 =   0.200 ( 0.00 ..  1.00)
  TranslationModel1 =   0.200 ( 0.00 ..  1.00)
  TranslationModel1 =   0.200 ( 0.00 ..  1.00)
  TranslationModel1 =   0.200 ( 0.00 ..  1.00)
  GenerationModel0 =   0.300 ( 0.00 ..  1.00)
  GenerationModel0 =   0.000 ( 0.00 ..  1.00)
  GenerationModel1 =   0.300 ( 0.00 ..  1.00)
  GenerationModel1 =   0.000 ( 0.00 ..  1.00)
featlist: LexicalReordering0=0.30
featlist: LexicalReordering0=0.30
featlist: LexicalReordering0=0.30
featlist: LexicalReordering0=0.30
featlist: LexicalReordering0=0.30
featlist: LexicalReordering0=0.30
featlist: Distortion0=0.30
featlist: LM0=0.50
featlist: LM1=0.50
featlist: WordPenalty0=-1.00
featlist: PhrasePenalty0=0.20
featlist: TranslationModel0=0.20
featlist: TranslationModel0=0.20
featlist: TranslationModel0=0.20
featlist: TranslationModel0=0.20
featlist: TranslationModel1=0.20
featlist: TranslationModel1=0.20
featlist: TranslationModel1=0.20
featlist: TranslationModel1=0.20
featlist: GenerationModel0=0.30
featlist: GenerationModel0=0.00
featlist: GenerationModel1=0.30
featlist: GenerationModel1=0.00
Saved: ./run1.moses.ini
Normalizing lambdas: 0.30 0.30 0.30 0.30 0.30 0.30 
0.30 0.50 0.50 -1.00 0.20 0.20 0.20 0.20 
0.20 0.20 0.20 0.20 0.20 0.30 0.00 0.30 0.00
DECODER_CFG = -weight-overwrite 'GenerationModel0= 0.046154 0.00 
GenerationModel1= 0.046154 0.00 LM0= 0.076923 LM1= 0.076923 WordPenalty0= 
-0.153846 PhrasePenalty0= 0.030769 TranslationModel0= 0.030769 0.030769 
0.030769 0.030769 Distortion0= 0.046154 LexicalReordering0= 0.046154 0.046154 
0.046154 0.046154 0.046154 0.046154 TranslationModel1= 0.030769 0.030769 
0.030769 0.030769'
Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0   -config 
/local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -weight-overwrite 
'GenerationModel0= 0.046154 0.00 GenerationModel1= 0.046154 0.00 LM0= 
0.076923 LM1= 0.076923 WordPenalty0= -0.153846 PhrasePenalty0= 0.030769 
TranslationModel0= 0.030769 0.030769 0.030769 0.030769 Distortion0= 0.046154 
LexicalReordering0= 0.046154 0.046154 0.046154 0.046154 0.046154 0.046154 
TranslationModel1= 0.030769 0.030769 0.030769 0.030769'  -n-best-list 
run1.best100.out 100 distinct  -input-file 
/local/experiments/de_en_fact/tuning/input.split.6 > run1.out
Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0   -config 
/local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -weight-overwrite 
'GenerationModel0= 0.046154 0.00 GenerationModel1= 0.046154 0.00 LM0= 
0.076923 LM1= 0.076923 WordPenalty0= -0.153846 PhrasePenalty0= 0.030769 
TranslationModel0= 0.030769 0.030769 0.030769 0.030769 Distortion0= 0.046154 
LexicalReordering0= 0.046154 0.046154 0.046154 0.046154 0.046154 0.046154 
TranslationModel1= 0.030769 0.030769 0.030769 0.030769'  -n-best-list 
run1.best100.out 100 distinct  -input-file 
/local/experiments/de_en_fact/tuning/input.split.6 > run1.out
binary file loaded, default OFF_T: -1
terminate called recursively
terminate called recursively
terminate called recursively
sh: line 1:  4269 Aborted /local/moses/mosesdecoder/bin/moses 
-threads 4 -v 0 -config 

[Moses-support] Moses EMS Config file for factored training (pos+stem) using TreeTagger

2016-07-19 Thread Gmehlin Floran
Hi,

I am not sure whether I have to provide the EMS with the files (DE & EN) 
already annotated with factors (e.g. word0|stem0|pos0 ...) or whether it builds 
them itself from the original files and the tagging tool.

I am using TreeTagger to tag POS and stems in my original corpora. However, I 
am not really sure how to define this in the EMS config file (also, is it 
necessary to tag both source and target, e.g. DE & EN, or just the target EN?).

In the case where it builds the factored corpora itself, what should I write in 
the EMS config file to use both POS and stems in the corpora? Otherwise, how 
can I provide the factored files directly?
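
For what it's worth, this is roughly what I understand the example 
config.factored shipped with EMS to look like; the section names, keys and 
wrapper-script paths below are only my guess from memory, not a working 
configuration:

[TRAINING]
input-factors = word stem pos
output-factors = word stem pos

[INPUT-FACTOR:pos]
# hypothetical wrapper-script path, to be replaced by a real TreeTagger wrapper
factor-script = "$moses-script-dir/training/wrappers/make-pos.perl"

[OUTPUT-FACTOR:pos]
# hypothetical wrapper-script path, to be replaced by a real TreeTagger wrapper
factor-script = "$moses-script-dir/training/wrappers/make-pos.perl"

If that is right, EMS would build the factored corpora itself and I would not 
need to provide pre-annotated files.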

Thank you for your help,

Floran


[Moses-support] TreeTagger and format with pipes for Factored Model in moses

2016-07-18 Thread Gmehlin Floran
Hi,

I would like to try factored training on my corpus. I see that with 
TreeTagger (from uni-muenchen.de) we can parse a text file so that it outputs 
the POS. However, I haven't been able to produce the desired format for Moses 
(with POS and lemmas). There are a bunch of scripts in the 
scripts/training/wrappers/ folder, including one for TreeTagger, but all it 
does is produce a separate file with POS only.

I have seen that this question was already posted two years ago on this mailing 
list, but it remained unanswered.

Is there a script, or some other way, to parse a text file and get as output a 
file in the Moses format for factored training?

E.g.:

word0factor0|word0factor1|word0factor2 word1factor0|word1factor1|word1factor2
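
Something along the lines of the sketch below is what I imagine is needed. It 
is untested, the file names are placeholders, and it assumes TreeTagger was run 
on the already-tokenized corpus and printed one token per line as 
word<TAB>POS<TAB>lemma:

#!/usr/bin/env python
# Rough, untested sketch: merge TreeTagger output back into the Moses
# factored format word|lemma|POS. File names are placeholders.
# Assumes corpus.tok.de has one tokenized sentence per line and tagged.de
# is the TreeTagger output for exactly the same tokens, in the same order,
# one token per line as "word<TAB>POS<TAB>lemma".
import sys

def read_tags(path):
    # yield (word, pos, lemma) triples from the TreeTagger output
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:
                yield parts

def main(corpus_path, tagged_path):
    tags = read_tags(tagged_path)
    with open(corpus_path, encoding="utf-8") as corpus:
        for sentence in corpus:
            factored = []
            for token in sentence.split():
                _, pos, lemma = next(tags)
                if lemma == "<unknown>":  # TreeTagger marks unknown lemmas
                    lemma = token         # fall back to the surface form
                factored.append("%s|%s|%s" % (token, lemma, pos))
            print(" ".join(factored))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

I would then call it as, e.g., "python merge_factors.py corpus.tok.de tagged.de 
> corpus.factored.de" (all of these names are hypothetical).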

Thank you for your help!


[Moses-support] German compound splitter stuck

2016-07-07 Thread Gmehlin Floran
Hi,

I'm using the compound-splitting script 
(/mosesdecoder/scripts/generic/compound-splitter.perl) on the German side of my 
parallel corpus. The corpus contains around 4M sentences and, as I just 
noticed, may contain a few English sentences. The script has now been running 
for 14 hours on a 4-core 3 GHz, 16 GB RAM machine and seems to be stuck where 
these English sentences appear.

Is it normal for it to run for such a long time? Could the English sentences in 
the corpus be causing the trouble?

Thanks


[Moses-support] snt2cooc option fails the training

2016-07-05 Thread Gmehlin Floran
Hi,

It seems that whenever I use the option "-snt2cooc snt2cooc.pl" the training 
fails (see the error log below). When I run the training on the same corpus 
(rather small because of memory limitations) without this option, it works.

Does anyone have a clue about this?

Reading more sentence pairs into memory ...
[sent:340]
 Train total # sentence pairs (weighted): 6.74606e+06
Size of source portion of the training corpus: 2.05275e+08 tokens
Size of the target portion of the training corpus: 2.40869e+08 tokens
In source portion of the training corpus, only 4208312 unique tokens appeared
In target portion of the training corpus, only 1428380 unique tokens appeared
lambda for PP calculation in IBM-1,IBM-2,HMM:= 
2.40869e+08/(2.12021e+08-6.74606e+06)== 1.1734
Dictionary Loading complete
Inputfile in /local/para_corpora/pattr/claims/giza.en-de/en-de.cooc
ERROR: Execution of: /nas/fgmehlin/bin/mgizapp/mgiza  -CoocurrenceFile 
/local/para_corpora/pattr/claims/giza.en-de/en-de.cooc -c 
/local/para_corpora/pattr/claims/corpus/en-de-int-train.snt -m1 5 -m2 0 -m3 3 
-m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 4 -nodumps 1 
-nsmooth 4 -o /local/para_corpora/pattr/claims/giza.en-de/en-de -onlyaldumps 1 
-p0 0.999 -s /local/para_corpora/pattr/claims/corpus/de.vcb -t 
/local/para_corpora/pattr/claims/corpus/en.vcb
  died with signal 11, without coredump
...
Reading more sentence pairs into memory ...
Reading more sentence pairs into memory ...
...
[sent:350]Compacted Vocabulary, eliminated 1 entries 1428381 remains
Compacted Vocabulary, eliminated 1 entries 4208312 remains
 Train total # sentence pairs (weighted): 6.74606e+06
Size of source portion of the training corpus: 2.40869e+08 tokens
Size of the target portion of the training corpus: 2.05275e+08 tokens
In source portion of the training corpus, only 1428381 unique tokens appeared
In target portion of the training corpus, only 4208311 unique tokens appeared
lambda for PP calculation in IBM-1,IBM-2,HMM:= 
2.05275e+08/(2.47615e+08-6.74606e+06)== 0.852227
Dictionary Loading complete
Inputfile in /local/para_corpora/pattr/claims/giza.de-en/de-en.cooc
ERROR: Execution of: /nas/fgmehlin/bin/mgizapp/mgiza  -CoocurrenceFile 
/local/para_corpora/pattr/claims/giza.de-en/de-en.cooc -c 
/local/para_corpora/pattr/claims/corpus/de-en-int-train.snt -m1 5 -m2 0 -m3 3 
-m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 4 -nodumps 1 
-nsmooth 4 -o /local/para_corpora/pattr/claims/giza.de-en/de-en -onlyaldumps 1 
-p0 0.999 -s /local/para_corpora/pattr/claims/corpus/en.vcb -t 
/local/para_corpora/pattr/claims/corpus/de.vcb
  died with signal 11, without coredump

The command I used to start the training is the following:

/local/moses/mosesdecoder/scripts/training/train-model.perl --root-dir . 
--corpus clean_lc_utf8 --f de --e en -cores 4 --parallel -external-bin-dir 
/nas/fgmehlin/bin/mgizapp -mgiza -mgiza-cpus 4 -snt2cooc snt2cooc.pl -lm 
0:5:/local/para_corpora/pattr/claims/lm.arpa

Thank you in advance,

Floran


[Moses-support] Truecaser vs. Lowercase

2016-07-04 Thread Gmehlin Floran
Hi,

I see from this page (http://www.statmt.org/moses/?n=Moses.Baseline) that we 
should train a truecaser before training the translation model.

However, on the page "Preparing training data" 
(http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining), it says to 
lowercase the data, and nothing is mentioned about the truecaser.

Can you please explain which is best to do?
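
For reference, these are the commands I understand the baseline page to be 
describing (the corpus and model paths are just my placeholders):

# train a truecasing model on the tokenized training data
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
  --model truecase-model.en --corpus corpus.tok.en

# apply it to the training, tuning and test data
~/mosesdecoder/scripts/recaser/truecase.perl \
  --model truecase-model.en < corpus.tok.en > corpus.tc.en

whereas the factored-training page seems to suggest simply running 
scripts/tokenizer/lowercase.perl over the data instead.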