I think you are good now. That's what I'm getting for a 500-sentence test
set, trained on 10,000 sentences. Similar to your results. For a larger
test set (4,000 sentences) and the same training data there is nearly no
variance: 12.89 vs. 12.91. So now you need to scale up and tune.

BLEU = 12.37, 49.6/17.2/7.5/3.7 (BP=1.000, ratio=1.004, hyp_len=9358, ref_len=9322)
BLEU = 12.51, 49.9/17.6/7.7/3.6 (BP=1.000, ratio=1.005, hyp_len=9364, ref_len=9322)
BLEU = 12.25, 49.7/17.1/7.4/3.6 (BP=1.000, ratio=1.003, hyp_len=9348, ref_len=9322)
BLEU = 12.29, 49.6/17.3/7.5/3.5 (BP=1.000, ratio=1.004, hyp_len=9361, ref_len=9322)
BLEU = 12.45, 49.7/17.5/7.8/3.6 (BP=1.000, ratio=1.005, hyp_len=9373, ref_len=9322)
BLEU = 12.30, 49.6/17.6/7.5/3.5 (BP=1.000, ratio=1.007, hyp_len=9385, ref_len=9322)
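
Assuming these come from Moses' multi-bleu.perl (the format matches), each
line decomposes as the brevity penalty times the geometric mean of the four
n-gram precisions, so a single score is easy to sanity-check. For the first
line (the small gap vs. the printed 12.37 is rounding, since the precisions
are printed to one decimal):

awk 'BEGIN { p1 = 49.6; p2 = 17.2; p3 = 7.5; p4 = 3.7; bp = 1.000;
             # BLEU = BP * exp(mean of log n-gram precisions)
             printf "%.2f\n", bp * exp((log(p1)+log(p2)+log(p3)+log(p4))/4) }'
# prints 12.40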


On 23.06.2015 09:11, Marcin Junczys-Dowmunt wrote:
> Now that I think of it, truecasing should not change file sizes: after
> all, it only replaces uppercase letters with their lowercase versions,
> so the file should stay the same size. Unless Samoan has some unusual
> UTF-8 letters whose capitalized and uncapitalized versions have
> different byte sizes.
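>
> Such characters do exist, though they are rare; e.g. the Kelvin sign K
> (U+212A) takes 3 bytes in UTF-8 while its lowercase mapping, the plain
> k (U+006B), takes one. Easy to check from the shell:
>
> printf 'K' | wc -c   # 3 bytes: U+212A KELVIN SIGN
> printf 'k' | wc -c   # 1 byte:  U+006B LATIN SMALL LETTER K
>
> Samoan itself should be safe: its alphabet is plain Latin letters plus
> macronized vowels (ā and Ā are both 2 bytes) and the caseless ʻokina
> (U+02BB).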
>
> On 23.06.2015 08:36, Marcin Junczys-Dowmunt wrote:
>> I checked some of my experiments and I get nearly identical BLEU scores
>> when using the standard weights; the differences, if any, are in the
>> second decimal place. These results now seem more plausible, though
>> there is still variance.
>>
>> I am still wondering why truecasing would produce different files. Can
>> truecasing be nondeterministic on the same data, anyone?
>>
>> Also, did you check where your files now start to differ, given common
>> tokenized/truecased files?
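>>
>> A cheap determinism check: truecase the same tokenized file twice with
>> the same model and compare the outputs. A sketch, with paths assumed
>> from the baseline setup:
>>
>> TC=~/mosesdecoder/scripts/recaser/truecase.perl
>> $TC --model ~/corpus/truecase-model.en < ~/corpus/train.tok.en > run1.true.en
>> $TC --model ~/corpus/truecase-model.en < ~/corpus/train.tok.en > run2.true.en
>> # cmp is silent when the files are identical; otherwise show first diffs
>> cmp run1.true.en run2.true.en || diff run1.true.en run2.true.en | head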
>>
>> On 23.06.2015 05:06, Hokage Sama wrote:
>>> Ok, my scores don't vary so much when I run tokenisation, truecasing,
>>> and cleaning only once. I found some differences beginning with the
>>> truecased files. Here are my results now:
>>>
>>> BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929, ref_len=3609)
>>> BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914, ref_len=3609)
>>> BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917, ref_len=3609)
>>> BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920, ref_len=3609)
>>> BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935, ref_len=3609)
>>> BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937, ref_len=3609)
>>>
>>> On 22 June 2015 at 17:53, Hokage Sama <nvnc...@gmail.com> wrote:
>>>
>>>       Ok will do
>>>
>>>       On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt
>>>       <junc...@amu.edu.pl> wrote:
>>>
>>>           I don't think so. However, when you repeat those experiments,
>>>           you might try to identify where two training runs start to
>>>           diverge, by pairwise comparison of the same files between the
>>>           two runs. Maybe then we can deduce something.
>>>
>>>           On 23.06.2015 00:25, Hokage Sama wrote:
>>>
>>>               Hi, I delete all the files (I think) generated during a
>>>               training job before rerunning the entire training. Do you
>>>               think this could cause the variation? Here are the commands
>>>               I run to delete them:
>>>
>>>               rm ~/corpus/train.tok.en
>>>               rm ~/corpus/train.tok.sm
>>>               rm ~/corpus/train.true.en
>>>               rm ~/corpus/train.true.sm
>>>               rm ~/corpus/train.clean.en
>>>               rm ~/corpus/train.clean.sm
>>>               rm ~/corpus/truecase-model.en
>>>               rm ~/corpus/truecase-model.sm
>>>               rm ~/corpus/test.tok.en
>>>               rm ~/corpus/test.tok.sm
>>>               rm ~/corpus/test.true.en
>>>               rm ~/corpus/test.true.sm
>>>               rm -rf ~/working/filtered-test
>>>               rm ~/working/test.out
>>>               rm ~/working/test.translated.en
>>>               rm ~/working/training.out
>>>               rm -rf ~/working/train/corpus
>>>               rm -rf ~/working/train/giza.en-sm
>>>               rm -rf ~/working/train/giza.sm-en
>>>               rm -rf ~/working/train/model
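>>>
>>>               (One more idea for the divergence hunt: record checksums
>>>               before deleting, so any two runs can be compared file by
>>>               file afterwards. A sketch, assuming the layout above:)
>>>
>>>               find ~/corpus ~/working -type f | sort | xargs md5sum \
>>>                   > ~/run-$(date +%Y%m%d-%H%M%S).md5
>>>               # later: diff the .md5 files of two runs; differing lines
>>>               # point to the files in which the runs diverged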
>>>
>>>               On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt
>>>               <junc...@amu.edu.pl> wrote:
>>>
>>>                   You're welcome. Take another close look at those
>>>                   varying BLEU scores, though. That would worry me if it
>>>                   happened with the same data and the same weights.
>>>
>>>                   On 22.06.2015 10:31, Hokage Sama wrote:
>>>
>>>                       Ok thanks. Appreciate your help.
>>>
>>>                       On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt
>>>                       <junc...@amu.edu.pl> wrote:
>>>
>>>                           Difficult to tell with that little data. Once
>>>                           you get beyond 100,000 segments (or at least
>>>                           50,000), I would say 2,000 segments each for
>>>                           the dev (tuning) and test sets, and the rest
>>>                           for training. With that few segments it's hard
>>>                           to give you any recommendation, since it might
>>>                           just not give meaningful results. It's
>>>                           currently a toy model, good for learning and
>>>                           playing around with options, but not good for
>>>                           trying to infer anything from BLEU scores.
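>>>
>>>                           As an illustration only (file names assumed),
>>>                           a line-based split keeps the two sides of a
>>>                           parallel corpus aligned; e.g. 9,028/500/500
>>>                           out of the 10,028 segments mentioned below:
>>>
>>>                           head -n 9028 all.en > train.en
>>>                           head -n 9028 all.sm > train.sm
>>>                           sed -n '9029,9528p' all.en > dev.en
>>>                           sed -n '9029,9528p' all.sm > dev.sm
>>>                           tail -n 500 all.en > test.en
>>>                           tail -n 500 all.sm > test.sm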
>>>
>>>
>>>                           On 22.06.2015 10:17, Hokage Sama wrote:
>>>
>>>                               Yes, the language model was built earlier,
>>>                               when I first went through the manual to
>>>                               build a French-English baseline system, so
>>>                               I just reused it for my Samoan-English
>>>                               system. And yes, for all three runs I used
>>>                               the same training and testing files.
>>>                               How can I determine how much parallel data
>>>                               I should set aside for tuning and testing?
>>>                               I have only 10,028 segments (198,385 words)
>>>                               altogether. At the moment I'm using 259
>>>                               segments for testing and the rest for
>>>                               training.
>>>
>>>                               Thanks,
>>>                               Hilton

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
