Yes, the language model was built earlier, when I first went through the
manual to build a French-English baseline system. So I just reused it for my
Samoan-English system.

Yes, for all three runs I used the same training and testing files.
How can I determine how much parallel data I should set aside for tuning
and testing? I have only 10,028 segments (198,385 words) altogether. At the
moment I'm using 259 segments for testing and the rest for training.
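
For concreteness, here is the kind of split and tuning run I have in mind.
The sizes and file names below are just placeholders (say, 1,000 segments
each for tuning and testing), and the mert-moses.pl call is adapted from the
baseline manual's tuning step, which I skipped so far:

# Hypothetical split of the 10,028 segments: 8,028 train / 1,000 dev / 1,000 test.
# Both language sides must be split on the same line numbers to stay parallel.
head -n 8028 compilation.true.sm > train.sm
head -n 8028 compilation.true.en > train.en
sed -n '8029,9028p' compilation.true.sm > dev.sm
sed -n '8029,9028p' compilation.true.en > dev.en
tail -n 1000 compilation.true.sm > test.sm
tail -n 1000 compilation.true.en > test.en

# Tuning (MERT) as in the baseline manual, run on the held-out dev set:
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl \
  ~/corpus/dev.sm ~/corpus/dev.en \
  ~/mosesdecoder/bin/moses train/model/moses.ini \
  --mertdir ~/mosesdecoder/bin/ &> mert.out &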

Thanks,
Hilton

On 22 June 2015 at 02:52, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:

> I don't see any reason for indeterminism here, unless mgiza is less stable
> on small data than I thought. Was the LM lm/news-commentary-v8.fr-en.blm.en
> built somewhere earlier?
>
> And to be sure: for all three runs you used exactly the same data,
> training and test set?
>
> On 22.06.2015 09:34, Hokage Sama wrote:
>
>> Wow, that was a long read. Still reading, though :) but I see that tuning
>> is essential. I am fairly new to Moses, so could you please check whether
>> the commands I ran were correct (minus the tuning part)? I just modified
>> the commands on the Moses website for building a baseline system. Below are
>> the commands I ran. My training files are "compilation.en" and
>> "compilation.sm". My test files are "test.en" and "test.sm".
>>
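>> # Tokenize and truecase both sides of the training data, then drop
>> # sentence pairs outside the 1-80 token range: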
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/training/compilation.en > ~/corpus/compilation.tok.en
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < ~/corpus/training/compilation.sm > ~/corpus/compilation.tok.sm
>> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus ~/corpus/compilation.tok.en
>> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.sm --corpus ~/corpus/compilation.tok.sm
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/compilation.tok.en > ~/corpus/compilation.true.en
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.sm < ~/corpus/compilation.tok.sm > ~/corpus/compilation.true.sm
>> ~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/compilation.true sm en ~/corpus/compilation.clean 1 80
>>
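>> # Train the phrase-based model, reusing the English LM built earlier: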
>> cd ~/working
>> nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
>>   -corpus ~/corpus/compilation.clean -f sm -e en \
>>   -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>   -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 \
>>   -external-bin-dir ~/mosesdecoder/tools >& training.out &
>>
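>> # Tokenize and truecase the test set with the truecasing models trained above: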
>> cd ~/corpus
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.en > test.tok.en
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < test.sm > test.tok.sm
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < test.tok.en > test.true.en
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.sm < test.tok.sm > test.true.sm
>>
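>> # Filter and binarize the phrase table for the test set, decode, and score
>> # with case-insensitive BLEU: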
>> cd ~/working
>> ~/mosesdecoder/scripts/training/filter-model-given-input.pl filtered-test \
>>   train/model/moses.ini ~/corpus/test.true.sm \
>>   -Binarizer ~/mosesdecoder/bin/processPhraseTableMin
>> nohup nice ~/mosesdecoder/bin/moses -f ~/working/filtered-test/moses.ini \
>>   < ~/corpus/test.true.sm > ~/working/test.translated.en 2> ~/working/test.out
>> ~/mosesdecoder/scripts/generic/multi-bleu.perl -lc ~/corpus/test.true.en < ~/working/test.translated.en
>>
>> On 22 June 2015 at 01:20, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>>     Hm. That's interesting. The language should not matter.
>>
>>     1) Do not report results without tuning. They are meaningless.
>>     There is a whole thread on that; look for "Major bug found in
>>     Moses". If you ignore the trollish aspects, it contains many good
>>     explanations of why this is a mistake.
>>
>>     2) Assuming it was the same data every time (was it?), I do not
>>     quite see where the variance is coming from without tuning. This
>>     rather suggests something weird in your pipeline. Mgiza is the only
>>     stochastic element there, but usually its results are quite
>>     consistent. For the same weights in your ini file you should get
>>     very similar results. Tuning would be the part that introduces
>>     instability, but even then these differences would be a little on
>>     the extreme end, though possible.
>>
>>         On 22.06.2015 08:12, Hokage Sama wrote:
>>
>>         Thanks Marcin. It's for a new resource-poor language, so I only
>>         trained it with what I could collect so far (i.e. only 190,630
>>         words of parallel data). I retrained the entire system each time,
>>         without any tuning.
>>
>>         On 22 June 2015 at 01:00, Marcin Junczys-Dowmunt
>>         <junc...@amu.edu.pl> wrote:
>>
>>             Hi,
>>             I think the average is OK; your variance is quite high,
>>             however. Did you retrain the entire system or just optimize
>>             parameters a couple of times?
>>
>>             Two useful papers on the topic:
>>
>>             https://www.cs.cmu.edu/~jhclark/pubs/significance.pdf
>>             http://www.mt-archive.info/MTS-2011-Cettolo.pdf
>>
>>
>>             On 22.06.2015 02:37, Hokage Sama wrote:
>>             > Hi,
>>             >
>>             > Since MT training is non-convex and thus the BLEU score
>>             > varies, which score should I use for my system? I trained
>>             > my system three times using the same data and obtained the
>>             > three different scores below. Should I take the average or
>>             > the best score?
>>             >
>>             > BLEU = 17.84, 49.1/22.0/12.5/7.5 (BP=1.000, ratio=1.095, hyp_len=3952, ref_len=3609)
>>             > BLEU = 16.51, 48.4/20.7/11.4/6.5 (BP=1.000, ratio=1.093, hyp_len=3945, ref_len=3609)
>>             > BLEU = 15.33, 48.2/20.1/10.3/5.5 (BP=1.000, ratio=1.087, hyp_len=3924, ref_len=3609)
>>             >
>>             > Thanks,
>>             > Hilton
>>             >
>>             >
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
