Now that I think of it, truecasing should not change file sizes; after 
all, it only substitutes uppercase letters with their lowercase versions, 
so the file should stay the same size. Unless Samoan has some unusual 
UTF-8 letters whose capitalized and uncapitalized versions differ in 
byte size.
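
A quick way to check (the characters below are just examples, including 
the macron vowels Samoan uses; this is not a claim about what the Moses 
truecaser itself does) is to compare the UTF-8 byte counts of a letter 
and its lowercase form:

printf 'A' | wc -c    # 1 byte
printf 'a' | wc -c    # 1 byte
printf 'Ā' | wc -c    # 2 bytes (U+0100)
printf 'ā' | wc -c    # 2 bytes (U+0101)

For plain ASCII letters and the macron vowels both cases take the same 
number of bytes, so a size change would have to come from something 
other than case substitution.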

On 23.06.2015 08:36, Marcin Junczys-Dowmunt wrote:
> I checked some of my experiments and I get nearly identical BLEU
> scores when using the standard weights; differences are in the second
> decimal place, if at all. These results now seem more plausible,
> though there is still variance.
>
> I am still wondering why truecasing would produce different files. Can
> truecasing be nondeterministic on the same data, anyone?
>
> Also, did you check where your files start to differ now, given common
> tokenized/truecased files?
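>
> For example (run1/ and run2/ are hypothetical directories holding the
> intermediate files of two runs), cmp will report the first differing
> byte and diff the first differing lines:
>
> cmp run1/train.true.en run2/train.true.en
> diff run1/train.true.en run2/train.true.en | head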
>
> On 23.06.2015 05:06, Hokage Sama wrote:
>> OK, my scores don't vary so much when I just run tokenisation,
>> truecasing, and cleaning once. I found some differences beginning with
>> the truecased files. Here are my results now:
>>
>> BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929,
>> ref_len=3609)
>> BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914,
>> ref_len=3609)
>> BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917,
>> ref_len=3609)
>> BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920,
>> ref_len=3609)
>> BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935,
>> ref_len=3609)
>> BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937,
>> ref_len=3609)
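>>
>> Just to quantify the spread, a quick awk check (the six scores above
>> pasted in by hand) gives a mean of about 16.78 and a standard deviation
>> of about 0.26:
>>
>> printf '16.85\n16.82\n16.59\n16.40\n17.25\n16.78\n' \
>>   | awk '{s+=$1;q+=$1*$1;n++}END{m=s/n;printf "mean=%.2f sd=%.2f\n",m,sqrt(q/n-m*m)}'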
>>
>> On 22 June 2015 at 17:53, Hokage Sama <nvnc...@gmail.com> wrote:
>>
>>      Ok will do
>>
>>      On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>>          I don't think so. However, when you repeat those experiments,
>>          you might try to identify where two trainings are starting to
>>          diverge by pairwise comparisons of the same files between two
>>          runs. Maybe then we can deduce something.
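>>
>>          For example (assuming the intermediate files of each run are
>>          kept, say under hypothetical run1/ and run2/ directories),
>>          checksums make the pairwise comparison quick and show at which
>>          stage the two pipelines first diverge:
>>
>>          md5sum run1/train.tok.en run2/train.tok.en
>>          md5sum run1/train.true.en run2/train.true.en
>>          md5sum run1/train.clean.en run2/train.clean.en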
>>
>>          On 23.06.2015 00:25, Hokage Sama wrote:
>>
>>              Hi, I delete all the files (I think) generated during a
>>              training job before rerunning the entire training. Do you
>>              think this could cause variation? Here are the commands I
>>              run to delete them:
>>
>>              rm ~/corpus/train.tok.en
>>              rm ~/corpus/train.tok.sm
>>              rm ~/corpus/train.true.en
>>              rm ~/corpus/train.true.sm
>>              rm ~/corpus/train.clean.en
>>              rm ~/corpus/train.clean.sm
>>              rm ~/corpus/truecase-model.en
>>              rm ~/corpus/truecase-model.sm
>>              rm ~/corpus/test.tok.en
>>              rm ~/corpus/test.tok.sm
>>              rm ~/corpus/test.true.en
>>              rm ~/corpus/test.true.sm
>>              rm -rf ~/working/filtered-test
>>              rm ~/working/test.out
>>              rm ~/working/test.translated.en
>>              rm ~/working/training.out
>>              rm -rf ~/working/train/corpus
>>              rm -rf ~/working/train/giza.en-sm
>>              rm -rf ~/working/train/giza.sm-en
>>              rm -rf ~/working/train/model
>>
>>              On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>>                  You're welcome. Take another close look at those
>>                  varying BLEU scores though. That would make me worry
>>                  if it happened to me for the same data and the same
>>                  weights.
>>
>>                  On 22.06.2015 10:31, Hokage Sama wrote:
>>
>>                      Ok thanks. Appreciate your help.
>>
>>                      On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>>                          Difficult to tell with that little data. Once
>>                          you get beyond 100,000 segments (or 50,000 at
>>                          least) I would say 2000 each for the dev
>>                          (tuning) and test sets, and the rest for
>>                          training. With that few segments it's hard to
>>                          give you any recommendations, since it might
>>                          just not give meaningful results. It's
>>                          currently a toy model, good for learning and
>>                          playing around with options, but not good for
>>                          trying to infer anything from BLEU scores.
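>>
>>                          For example (hypothetical file names, and
>>                          assuming neither side of the corpus contains
>>                          tab characters), you could carve off 2000-line
>>                          dev and test sets from a shuffled parallel
>>                          corpus like this:
>>
>>                          paste corpus.sm corpus.en | shuf > shuffled.sm-en
>>                          head -n 2000 shuffled.sm-en | cut -f1 > dev.sm
>>                          head -n 2000 shuffled.sm-en | cut -f2 > dev.en
>>                          sed -n '2001,4000p' shuffled.sm-en | cut -f1 > test.sm
>>                          sed -n '2001,4000p' shuffled.sm-en | cut -f2 > test.en
>>                          tail -n +4001 shuffled.sm-en | cut -f1 > train.sm
>>                          tail -n +4001 shuffled.sm-en | cut -f2 > train.en
>>
>>                          The paste/cut round trip keeps the two sides
>>                          aligned through the shuffle.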
>>
>>
>>                          On 22.06.2015 10:17, Hokage Sama wrote:
>>
>>                              Yes, the language model was built earlier
>>                              when I first went through the manual to
>>                              build a French-English baseline system, so
>>                              I just reused it for my Samoan-English
>>                              system. Yes, for all three runs I used the
>>                              same training and testing files.
>>                              How can I determine how much parallel data
>>                              I should set aside for tuning and testing?
>>                              I have only 10,028 segments (198,385 words)
>>                              altogether. At the moment I'm using 259
>>                              segments for testing and the rest for
>>                              training.
>>
>>                              Thanks,
>>                              Hilton
>>
>>
>>
>>
>>
>>
>>
>>

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
