Re: [Moses-support] Factored model configuration using stems and POS
Hi Saso and Hieu,

Thank you for your replies! This is what I am currently doing: starting with simpler models (on a very short corpus, so I get quicker feedback). I tried the configuration

    translation-factors = "word+stem+pos -> word+stem+pos"

and it works, giving me the best results so far (around 28-29 BLEU). Whenever I try to add a generation step, however simple it might be, it crashes at the TUNING:tune phase.

Here is what I fail to understand: in order to use a factor in a generation step, say for example

    generation-factors = "word+stem -> pos"

do you first need to translate the left-hand-side factors, e.g. "word -> word,stem -> stem" or "word+stem -> word+stem"?

Thank you for your help!

From: Hieu Hoang [hieuho...@gmail.com]
Sent: 01 August 2016 20:50
To: Gmehlin Floran
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Factored model configuration using stems and POS

I would start simple, then build it up once I know what it's doing, e.g. start with

    input-factors = word stem pos
    output-factors = word stem pos
    alignment-factors = "word -> word"
    translation-factors = "word+stem+pos -> word+stem+pos"
    reordering-factors = "word -> word"
    generation-factors = ""
    decoding-steps = "t0"

Hieu Hoang
http://www.hoang.co.uk/hieu

On 27 July 2016 at 11:46, Gmehlin Floran <fgmeh...@student.ethz.ch> wrote:
> [...]
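P.S. To make my question concrete, here is my guess at an internally consistent setup. If every factor on the left-hand side of a generation step has to be produced on the target side by an earlier step (which is how I read the factored-model paper), then something like

    translation-factors = "word -> word,stem -> stem"
    generation-factors = "word+stem -> pos"
    decoding-steps = "t0,t1,g0"

should be legal, because t0 and t1 produce the target word and stem before g0 consumes them. Is that the right way to think about it?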
[Moses-support] Factored model configuration using stems and POS
Hi,

I have been trying to build a factored translation model using stems and part-of-speech for a week now and I cannot get satisfying results. This probably comes from my factor configuration, as I probably do not fully understand how it works (I am following the paper "Factored Translation Models" by Koehn and Hoang).

I previously built a standard phrase-based model (with the same corpus) which gave me around 24-25 BLEU (DE-EN). For my factored model, the BLEU score is around 1 (?). I tried opening the moses.ini files (tuned or not) to see if I could get something translated by copy/pasting some lines from the original corpus, but it only translates from German to German and does not recognize most of the words, if not all.

The motivation behind the factored model is that there are too many OOVs with the standard phrase-based model, so I wanted to try using stems to reduce them. I am annotating the corpus with TreeTagger and the factor configuration is as follows:

    input-factors = word stem pos
    output-factors = word stem pos
    alignment-factors = "word+stem -> word+stem"
    translation-factors = "stem -> stem,pos -> pos"
    reordering-factors = "word -> word"
    generation-factors = "stem -> pos,stem+pos -> word"
    decoding-steps = "t0,g0,t1,g1"

Is there something wrong with that? I only use a single language model over surface forms, as the LM over POS yields a segmentation fault in the tuning phase.

Does anyone have an idea how I should configure my model to exploit stems in the source language?

Thanks a lot,
Floran
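For completeness, with input-factors = word stem pos, a line of my annotated German input should look something like the following (factors separated by "|", in the order word|stem|pos; the tags here are just illustrative STTS tags, not copied from my actual corpus):

    die|die|ART kleinen|klein|ADJA Häuser|Haus|NN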
[Moses-support] Decoder Died during TUNING:Tune phase (Factored EMS)
Hi,

The decoder dies when reaching the TUNING:tune phase of the EMS and I have no idea why. I'm running a factored model with 2 factors as input and 2 factors as output. The following is written in the file TUNING_tune.8.STDERR:

    Using SCRIPTS_ROOTDIR: /local/moses/mosesdecoder/scripts
    Asking moses for feature names and values from /local/experiments/de_en_fact/tuning/moses.filtered.ini.8
    Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0 -config /local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -show-weights
    exec: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0 -config /local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -show-weights
    Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0 -config /local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -show-weights > ./features.list 2> /dev/null
    MERT starting values and ranges for random generation:
    LexicalReordering0 = 0.300 ( 0.00 .. 1.00)
    LexicalReordering0 = 0.300 ( 0.00 .. 1.00)
    LexicalReordering0 = 0.300 ( 0.00 .. 1.00)
    LexicalReordering0 = 0.300 ( 0.00 .. 1.00)
    LexicalReordering0 = 0.300 ( 0.00 .. 1.00)
    LexicalReordering0 = 0.300 ( 0.00 .. 1.00)
    Distortion0 = 0.300 ( 0.00 .. 1.00)
    LM0 = 0.500 ( 0.00 .. 1.00)
    LM1 = 0.500 ( 0.00 .. 1.00)
    WordPenalty0 = -1.000 ( 0.00 .. 1.00)
    PhrasePenalty0 = 0.200 ( 0.00 .. 1.00)
    TranslationModel0 = 0.200 ( 0.00 .. 1.00)
    TranslationModel0 = 0.200 ( 0.00 .. 1.00)
    TranslationModel0 = 0.200 ( 0.00 .. 1.00)
    TranslationModel0 = 0.200 ( 0.00 .. 1.00)
    TranslationModel1 = 0.200 ( 0.00 .. 1.00)
    TranslationModel1 = 0.200 ( 0.00 .. 1.00)
    TranslationModel1 = 0.200 ( 0.00 .. 1.00)
    TranslationModel1 = 0.200 ( 0.00 .. 1.00)
    GenerationModel0 = 0.300 ( 0.00 .. 1.00)
    GenerationModel0 = 0.000 ( 0.00 .. 1.00)
    GenerationModel1 = 0.300 ( 0.00 .. 1.00)
    GenerationModel1 = 0.000 ( 0.00 .. 1.00)
    featlist: LexicalReordering0=0.30
    featlist: LexicalReordering0=0.30
    featlist: LexicalReordering0=0.30
    featlist: LexicalReordering0=0.30
    featlist: LexicalReordering0=0.30
    featlist: LexicalReordering0=0.30
    featlist: Distortion0=0.30
    featlist: LM0=0.50
    featlist: LM1=0.50
    featlist: WordPenalty0=-1.00
    featlist: PhrasePenalty0=0.20
    featlist: TranslationModel0=0.20
    featlist: TranslationModel0=0.20
    featlist: TranslationModel0=0.20
    featlist: TranslationModel0=0.20
    featlist: TranslationModel1=0.20
    featlist: TranslationModel1=0.20
    featlist: TranslationModel1=0.20
    featlist: TranslationModel1=0.20
    featlist: GenerationModel0=0.30
    featlist: GenerationModel0=0.00
    featlist: GenerationModel1=0.30
    featlist: GenerationModel1=0.00
    Saved: ./run1.moses.ini
    Normalizing lambdas: 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.50 0.50 -1.00 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.30 0.00 0.30 0.00
    DECODER_CFG = -weight-overwrite 'GenerationModel0= 0.046154 0.00 GenerationModel1= 0.046154 0.00 LM0= 0.076923 LM1= 0.076923 WordPenalty0= -0.153846 PhrasePenalty0= 0.030769 TranslationModel0= 0.030769 0.030769 0.030769 0.030769 Distortion0= 0.046154 LexicalReordering0= 0.046154 0.046154 0.046154 0.046154 0.046154 0.046154 TranslationModel1= 0.030769 0.030769 0.030769 0.030769'
    Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0 -config /local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -weight-overwrite 'GenerationModel0= 0.046154 0.00 GenerationModel1= 0.046154 0.00 LM0= 0.076923 LM1= 0.076923 WordPenalty0= -0.153846 PhrasePenalty0= 0.030769 TranslationModel0= 0.030769 0.030769 0.030769 0.030769 Distortion0= 0.046154 LexicalReordering0= 0.046154 0.046154 0.046154 0.046154 0.046154 0.046154 TranslationModel1= 0.030769 0.030769 0.030769 0.030769' -n-best-list run1.best100.out 100 distinct -input-file /local/experiments/de_en_fact/tuning/input.split.6 > run1.out
    Executing: /local/moses/mosesdecoder/bin/moses -threads 4 -v 0 -config /local/experiments/de_en_fact/tuning/moses.filtered.ini.8 -weight-overwrite 'GenerationModel0= 0.046154 0.00 GenerationModel1= 0.046154 0.00 LM0= 0.076923 LM1= 0.076923 WordPenalty0= -0.153846 PhrasePenalty0= 0.030769 TranslationModel0= 0.030769 0.030769 0.030769 0.030769 Distortion0= 0.046154 LexicalReordering0= 0.046154 0.046154 0.046154 0.046154 0.046154 0.046154 TranslationModel1= 0.030769 0.030769 0.030769 0.030769' -n-best-list run1.best100.out 100 distinct -input-file /local/experiments/de_en_fact/tuning/input.split.6 > run1.out
    binary file loaded, default OFF_T: -1
    terminate called recursively
    terminate called recursively
    terminate called recursively
    sh: line 1: 4269 Aborted /local/moses/mosesdecoder/bin/moses -threads 4 -v 0 -config
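To narrow this down, I will try re-running the decoder command from the log by hand, outside the EMS, with a single thread and higher verbosity (same config and input files as in the log above; -v and -threads are standard moses flags, and I am leaving out the -weight-overwrite for a first test):

    /local/moses/mosesdecoder/bin/moses -threads 1 -v 2 \
        -config /local/experiments/de_en_fact/tuning/moses.filtered.ini.8 \
        -input-file /local/experiments/de_en_fact/tuning/input.split.6 > run1.debug.out

If it still aborts, the verbose output should at least show which sentence or which feature function it dies on.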
[Moses-support] Moses EMS Config file for factored training (pos+stem) using TreeTagger
Hi,

I am not sure whether I have to provide the files (DE & EN) with the factors already attached (e.g. word0|stem0|pos0 ...) to the EMS, or whether the EMS builds them itself from the original files and the tagging tool. I am using TreeTagger to tag POS and stems in my original corpora, but I am not really sure how to define this in the EMS config file. (Also, is it necessary to tag both source and target, i.e. DE & EN, or just the target EN?)

In the case where the EMS builds the factored corpora itself, what should I write in the EMS config file to use both POS and stem? Otherwise, how can I directly provide the factored files?

Thank you for your help,
Floran
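In case it helps to show what I mean: from the factored example config shipped with the EMS, my guess is that the factors are declared with factor-script entries, something like the sketch below. The wrapper script names, their arguments and the TreeTagger path are placeholders I have not verified, so please correct me:

    [TRAINING]
    input-factors = word stem pos
    output-factors = word stem pos

    [INPUT-FACTOR:pos]
    # placeholder: a wrapper that calls TreeTagger and emits one POS per token
    factor-script = "$moses-script-dir/training/wrappers/make-factor-pos.tree-tagger.perl -tree-tagger /path/to/tree-tagger"

    [INPUT-FACTOR:stem]
    # placeholder: the simple truncation stemmer from the example config
    factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 4"

with matching [OUTPUT-FACTOR:...] sections for the English side. Is this the intended mechanism, or am I on the wrong track?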
[Moses-support] TreeTagger and format with pipes for Factored Model in moses
Hi,

I would like to try factored training on my corpus. I see that with TreeTagger (from uni-muenchen.de) we can parse a text file so that it outputs the POS. However, I have not been able to produce the desired format for Moses (with POS and lemmas). There are a bunch of scripts in the scripts/training/wrappers/ folder, including one for TreeTagger, but all it does is produce a separate file containing the POS only.

I have seen that this question was already posted two years ago on this mailing list, but it remained unanswered. Is there a script, or some other way, to parse a text file and get as output a file in the Moses format for factored training? E.g.:

    word0factor0|word0factor1|word0factor2 word1factor0|word1factor1|word1factor2

Thank you for your help!
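In the meantime I have been playing with a small merging script of my own (a quick sketch, not a Moses script). It assumes the corpus is tokenized one sentence per line, that each token is put on its own line with a sentinel line between sentences before tagging, and that TreeTagger prints word<TAB>POS<TAB>lemma (as it does with the -token -lemma options used by its own cmd/ wrapper scripts):

    #!/usr/bin/env python
    # Sketch: merge TreeTagger output (word<TAB>POS<TAB>lemma, one token
    # per line) back into Moses factored format word|lemma|POS, one
    # sentence per line. Assumes a sentinel line "<S_BOUNDARY>" was
    # inserted between sentences before tagging, so that sentence
    # boundaries survive the round trip through the tagger.
    import sys

    SENTINEL = "<S_BOUNDARY>"

    def main():
        sentence = []
        for line in sys.stdin:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith(SENTINEL):
                print(" ".join(sentence))
                sentence = []
                continue
            fields = line.split("\t")
            if len(fields) != 3:
                continue  # skip malformed lines
            word, pos, lemma = fields
            # TreeTagger prints "<unknown>" when it has no lemma;
            # fall back to the surface form in that case
            if lemma == "<unknown>":
                lemma = word
            sentence.append("%s|%s|%s" % (word, lemma, pos))
        if sentence:
            print(" ".join(sentence))

    if __name__ == "__main__":
        main()

Used roughly like this (paths are placeholders for my installation, and merge_factors.py is the sketch above):

    awk '{ for (i = 1; i <= NF; i++) print $i; print "<S_BOUNDARY>" }' corpus.tok.de \
        | /path/to/treetagger/bin/tree-tagger -token -lemma /path/to/treetagger/lib/german.par \
        | python merge_factors.py > corpus.factored.de

Does something like this already exist in the wrappers folder, so that I do not reinvent it badly?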
[Moses-support] German compound splitter stuck
Hi,

I'm using the compound-splitting script (mosesdecoder/scripts/generic/compound-splitter.perl) on the German side of my parallel corpus. The corpus contains around 4M sentences and, as I just noticed, may contain a few English sentences. The script has now been running for 14 hours on a 4-core 3GHz 16GB RAM machine and seems to be stuck where these English sentences appear.

Is it normal for it to run for such a long time? Could the English sentences be causing the trouble?

Thanks
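While waiting, I am tempted to split the corpus into chunks and run one splitter process per chunk, both to use all four cores and to narrow down where it hangs. I am assuming here that the splitter is applied with a previously trained -model and reads from stdin; the exact flags should be double-checked in the script itself:

    # hypothetical chunked run: whichever chunk never finishes pinpoints
    # the sentences (perhaps the English ones) that cause the hang
    split -l 100000 -d corpus.de chunk.
    for f in chunk.??; do
        /local/moses/mosesdecoder/scripts/generic/compound-splitter.perl \
            -model splitter.model < $f > $f.split 2> $f.log &
    done
    wait
    cat chunk.??.split > corpus.split.de

(With 40 chunks on 4 cores it would be wiser to start only a few at a time, but this is just a sketch.)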
[Moses-support] snt2cooc option fails the training
Hi,

It seems that whenever I use the option "-snt2cooc snt2cooc.pl" the training fails (see below for the error log). When I try the same thing on a smaller corpus (kept rather small because of memory limitations) it works. Does anyone have a clue about this?

    Reading more sentence pairs into memory ... [sent:340]
    Train total # sentence pairs (weighted): 6.74606e+06
    Size of source portion of the training corpus: 2.05275e+08 tokens
    Size of the target portion of the training corpus: 2.40869e+08 tokens
    In source portion of the training corpus, only 4208312 unique tokens appeared
    In target portion of the training corpus, only 1428380 unique tokens appeared
    lambda for PP calculation in IBM-1,IBM-2,HMM:= 2.40869e+08/(2.12021e+08-6.74606e+06)== 1.1734
    Dictionary Loading complete
    Inputfile in /local/para_corpora/pattr/claims/giza.en-de/en-de.cooc
    ERROR: Execution of: /nas/fgmehlin/bin/mgizapp/mgiza -CoocurrenceFile /local/para_corpora/pattr/claims/giza.en-de/en-de.cooc -c /local/para_corpora/pattr/claims/corpus/en-de-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 4 -nodumps 1 -nsmooth 4 -o /local/para_corpora/pattr/claims/giza.en-de/en-de -onlyaldumps 1 -p0 0.999 -s /local/para_corpora/pattr/claims/corpus/de.vcb -t /local/para_corpora/pattr/claims/corpus/en.vcb died with signal 11, without coredump
    ...
    Reading more sentence pairs into memory ...
    Reading more sentence pairs into memory ... [sent:350]
    Compacted Vocabulary, eliminated 1 entries 1428381 remains
    Compacted Vocabulary, eliminated 1 entries 4208312 remains
    Train total # sentence pairs (weighted): 6.74606e+06
    Size of source portion of the training corpus: 2.40869e+08 tokens
    Size of the target portion of the training corpus: 2.05275e+08 tokens
    In source portion of the training corpus, only 1428381 unique tokens appeared
    In target portion of the training corpus, only 4208311 unique tokens appeared
    lambda for PP calculation in IBM-1,IBM-2,HMM:= 2.05275e+08/(2.47615e+08-6.74606e+06)== 0.852227
    Dictionary Loading complete
    Inputfile in /local/para_corpora/pattr/claims/giza.de-en/de-en.cooc
    ERROR: Execution of: /nas/fgmehlin/bin/mgizapp/mgiza -CoocurrenceFile /local/para_corpora/pattr/claims/giza.de-en/de-en.cooc -c /local/para_corpora/pattr/claims/corpus/de-en-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 4 -nodumps 1 -nsmooth 4 -o /local/para_corpora/pattr/claims/giza.de-en/de-en -onlyaldumps 1 -p0 0.999 -s /local/para_corpora/pattr/claims/corpus/en.vcb -t /local/para_corpora/pattr/claims/corpus/de.vcb died with signal 11, without coredump

The command I used to start the training is the following:

    /local/moses/mosesdecoder/scripts/training/train-model.perl --root-dir . \
        --corpus clean_lc_utf8 --f de --e en -cores 4 --parallel \
        -external-bin-dir /nas/fgmehlin/bin/mgizapp -mgiza -mgiza-cpus 4 \
        -snt2cooc snt2cooc.pl -lm 0:5:/local/para_corpora/pattr/claims/lm.arpa

Thank you in advance,
Floran
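To get more information I plan to re-run the failing mgiza command from the log directly under gdb and grab a backtrace, e.g. for the en-de direction (arguments copied verbatim from the ERROR line above):

    gdb --args /nas/fgmehlin/bin/mgizapp/mgiza \
        -CoocurrenceFile /local/para_corpora/pattr/claims/giza.en-de/en-de.cooc \
        -c /local/para_corpora/pattr/claims/corpus/en-de-int-train.snt \
        -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 \
        -ncpus 4 -nodumps 1 -nsmooth 4 \
        -o /local/para_corpora/pattr/claims/giza.en-de/en-de -onlyaldumps 1 -p0 0.999 \
        -s /local/para_corpora/pattr/claims/corpus/de.vcb \
        -t /local/para_corpora/pattr/claims/corpus/en.vcb
    (gdb) run
    (gdb) bt

I will post the backtrace if it reveals anything useful.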
[Moses-support] Truecaser vs. Lowercase
Hi,

I see on this page (http://www.statmt.org/moses/?n=Moses.Baseline) that we should train a truecaser before training the translation model. However, the page "Preparing training data" (http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining) says to lowercase the data and mentions nothing about the truecaser. Can you please explain which is best to do?
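For reference, the commands I mean from the baseline page are the following (file names are from my own setup, so treat them as placeholders):

    # train the truecasing model on the tokenized training data
    ~/mosesdecoder/scripts/recaser/train-truecaser.perl \
        --model truecase-model.en --corpus corpus.tok.en
    # apply it
    ~/mosesdecoder/scripts/recaser/truecase.perl \
        --model truecase-model.en < corpus.tok.en > corpus.true.en

whereas the factored-training page simply lowercases everything:

    ~/mosesdecoder/scripts/tokenizer/lowercase.perl < corpus.tok.en > corpus.lc.en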