Ok, my scores don't vary as much when I run tokenisation, truecasing, and cleaning just once. I found that the differences begin with the truecased files.
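For reference, here is roughly what I now run a single time up front and then reuse for every training run. This is only a sketch following the standard Moses baseline scripts; the ~/mosesdecoder path, the -l en fallback for the Samoan side, and the 1-80 sentence-length limits are assumptions from my setup:

# Tokenise both sides once (there are no Samoan tokeniser rules, so I assume -l en as a fallback)
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train.en > ~/corpus/train.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train.sm > ~/corpus/train.tok.sm

# Train a truecasing model once per language, then apply it (shown for English; the .sm side is analogous)
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus ~/corpus/train.tok.en
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/train.tok.en > ~/corpus/train.true.en

# Clean once: drop sentence pairs that are empty, longer than 80 tokens, or badly length-mismatched
~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/train.true sm en ~/corpus/train.clean 1 80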
Here are my results now:

BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929, ref_len=3609)
BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914, ref_len=3609)
BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917, ref_len=3609)
BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920, ref_len=3609)
BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935, ref_len=3609)
BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937, ref_len=3609)

On 22 June 2015 at 17:53, Hokage Sama <nvnc...@gmail.com> wrote:
> Ok, will do.
>
> On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>
>> I don't think so. However, when you repeat those experiments, you might
>> try to identify where two trainings start to diverge by pairwise
>> comparisons of the same files between the two runs. Maybe then we can
>> deduce something.
>>
>> On 23.06.2015 00:25, Hokage Sama wrote:
>>
>>> Hi, I delete all the files (I think) generated during a training job
>>> before rerunning the entire training. Do you think this could cause the
>>> variation? Here are the commands I run to delete them:
>>>
>>> rm ~/corpus/train.tok.en
>>> rm ~/corpus/train.tok.sm
>>> rm ~/corpus/train.true.en
>>> rm ~/corpus/train.true.sm
>>> rm ~/corpus/train.clean.en
>>> rm ~/corpus/train.clean.sm
>>> rm ~/corpus/truecase-model.en
>>> rm ~/corpus/truecase-model.sm
>>> rm ~/corpus/test.tok.en
>>> rm ~/corpus/test.tok.sm
>>> rm ~/corpus/test.true.en
>>> rm ~/corpus/test.true.sm
>>> rm -rf ~/working/filtered-test
>>> rm ~/working/test.out
>>> rm ~/working/test.translated.en
>>> rm ~/working/training.out
>>> rm -rf ~/working/train/corpus
>>> rm -rf ~/working/train/giza.en-sm
>>> rm -rf ~/working/train/giza.sm-en
>>> rm -rf ~/working/train/model
>>>
>>> On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>>
>>>> You're welcome. Take another close look at those varying BLEU scores,
>>>> though. That would worry me if it happened to me with the same data and
>>>> the same weights.
>>>>
>>>> On 22.06.2015 10:31, Hokage Sama wrote:
>>>>
>>>>> Ok, thanks. I appreciate your help.
>>>>>
>>>>> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>>>>
>>>>>> Difficult to tell with that little data. Once you get beyond 100,000
>>>>>> segments (or at least 50,000), I would say 2,000 segments each for the
>>>>>> dev (tuning) and test sets, and the rest for training. With that few
>>>>>> segments it's hard to give you any recommendation, since it might just
>>>>>> not give meaningful results. It's currently a toy model: good for
>>>>>> learning and playing around with options, but not for trying to infer
>>>>>> anything from BLEU scores.
>>>>>>
>>>>>> On 22.06.2015 10:17, Hokage Sama wrote:
>>>>>>
>>>>>>> Yes, the language model was built earlier, when I first went through
>>>>>>> the manual to build a French-English baseline system, so I just
>>>>>>> reused it for my Samoan-English system. And yes, for all three runs I
>>>>>>> used the same training and testing files. How can I determine how
>>>>>>> much parallel data I should set aside for tuning and testing? I have
>>>>>>> only 10,028 segments (198,385 words) altogether. At the moment I'm
>>>>>>> using 259 segments for testing and the rest for training.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Hilton
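P.S. To act on Marcin's suggestion, here is a rough sketch of how I plan to compare the same intermediate files between two complete runs to spot where they first diverge. The run1/ and run2/ directories are hypothetical copies of each run's outputs, made before I delete anything:

# Compare each intermediate file pairwise between two runs; the first file
# reported as differing is where the trainings start to diverge.
for f in train.tok.en train.tok.sm train.true.en train.true.sm \
         train.clean.en train.clean.sm truecase-model.en truecase-model.sm
do
    if cmp -s run1/$f run2/$f; then
        echo "same:    $f"
    else
        echo "DIFFERS: $f"
    fi
done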
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support