Re: [Moses-support] BLEU Score Variance: Which score to use?

Hokage Sama Tue, 23 Jun 2015 01:21:51 -0700

Nice, thanks. Yeah, the truecased files I checked had about 18 or so
differences where one file would capitalise the first letter and the other
file wouldn't. I am going to try to compile more data, but I think I will
only manage to get about 10k to 15k parallel segments altogether. It took me
quite a while to extract, sentence-align, and clean my data. I am currently
working on my thesis and have been reading some papers on SMT for
resource-poor languages. The least amount of data used was about 200,000
words for training and 1,000 segments for testing (Genzel et al., 2009).
Another used about 17,000 segments for training, 500 for testing, and 1,000
for tuning (Lewis and Yang, 2012).
So I was wondering whether there is any other way I could use the data I
have for experiments, since tuning and the BLEU score will be useless with
the amount I have?


On 23 June 2015 at 02:57, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:

> I think you are good now. That's what I am getting for a 500-sentence
> test set, trained on 10,000 sentences. Similar to your results. For a
> larger test set (4000 sentences) and the same training data there is
> nearly no variance, 12.89 vs. 12.91. So now you need to scale up and tune.
>
> BLEU = 12.37, 49.6/17.2/7.5/3.7 (BP=1.000, ratio=1.004, hyp_len=9358,
> ref_len=9322)
> BLEU = 12.51, 49.9/17.6/7.7/3.6 (BP=1.000, ratio=1.005, hyp_len=9364,
> ref_len=9322)
> BLEU = 12.25, 49.7/17.1/7.4/3.6 (BP=1.000, ratio=1.003, hyp_len=9348,
> ref_len=9322)
> BLEU = 12.29, 49.6/17.3/7.5/3.5 (BP=1.000, ratio=1.004, hyp_len=9361,
> ref_len=9322)
> BLEU = 12.45, 49.7/17.5/7.8/3.6 (BP=1.000, ratio=1.005, hyp_len=9373,
> ref_len=9322)
> BLEU = 12.30, 49.6/17.6/7.5/3.5 (BP=1.000, ratio=1.007, hyp_len=9385,
> ref_len=9322)
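
To put a number on "similar to your results", here is a minimal sketch that pulls the BLEU values out of output lines like the six above and reports their mean, sample standard deviation, and range. The function names `extract_bleu` and `summarize` are made up for illustration; the scores are copied from the list above with the wrapped lines joined.

```python
import re
import statistics

# The six runs listed above, with each wrapped line joined back together.
RAW_OUTPUT = """\
BLEU = 12.37, 49.6/17.2/7.5/3.7 (BP=1.000, ratio=1.004, hyp_len=9358, ref_len=9322)
BLEU = 12.51, 49.9/17.6/7.7/3.6 (BP=1.000, ratio=1.005, hyp_len=9364, ref_len=9322)
BLEU = 12.25, 49.7/17.1/7.4/3.6 (BP=1.000, ratio=1.003, hyp_len=9348, ref_len=9322)
BLEU = 12.29, 49.6/17.3/7.5/3.5 (BP=1.000, ratio=1.004, hyp_len=9361, ref_len=9322)
BLEU = 12.45, 49.7/17.5/7.8/3.6 (BP=1.000, ratio=1.005, hyp_len=9373, ref_len=9322)
BLEU = 12.30, 49.6/17.6/7.5/3.5 (BP=1.000, ratio=1.007, hyp_len=9385, ref_len=9322)
"""

# Matches the headline score at the start of each "BLEU = ..." line.
BLEU_LINE = re.compile(r"BLEU = (\d+(?:\.\d+)?)")


def extract_bleu(text):
    """Return every headline BLEU score found in `text`, in order."""
    return [float(m.group(1)) for m in BLEU_LINE.finditer(text)]


def summarize(scores):
    """Mean, sample standard deviation, and max-min range of the scores."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "range": max(scores) - min(scores),
    }


if __name__ == "__main__":
    print(summarize(extract_bleu(RAW_OUTPUT)))
```

The same two functions can be pointed at the 16.40 to 17.25 runs further down the thread to compare the spread on the smaller test set.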
>
>
> On 23.06.2015 09:11, Marcin Junczys-Dowmunt wrote:
> > Now that I think of it, truecasing should not change file sizes. After
> > all, it only substitutes single letters with their lowercase versions, so
> > the file should stay the same size. Unless Samoan has some weird UTF-8
> > letters that have different byte sizes between capitalized and
> > uncapitalized versions.
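
The byte-size question can be checked directly. Below is a rough sketch that flags any character whose UTF-8 length changes under Python's own `str.upper()` / `str.lower()`. The `SAMOAN_LETTERS` string is my assumption about the alphabet (plain vowels, macron vowels, the consonants, and the ʻokina), not something taken from the corpus, and Python's case mapping is only a stand-in for whatever the truecaser does internally.

```python
def case_changes_byte_length(ch):
    """True if upper- or lower-casing `ch` changes its UTF-8 byte length."""
    size = len(ch.encode("utf-8"))
    return (
        len(ch.lower().encode("utf-8")) != size
        or len(ch.upper().encode("utf-8")) != size
    )


def size_changing_chars(text):
    """Sorted list of distinct characters in `text` whose case variants
    occupy a different number of UTF-8 bytes."""
    return sorted({ch for ch in text if case_changes_byte_length(ch)})


# Assumed Samoan letter inventory (hypothetical, for illustration only).
SAMOAN_LETTERS = "aeiouāēīōūfglmnpstvhkrʻ"

if __name__ == "__main__":
    both_cases = SAMOAN_LETTERS + SAMOAN_LETTERS.upper()
    print(size_changing_chars(both_cases))
    # For contrast: dotless i (2 bytes) uppercases to plain "I" (1 byte).
    print(size_changing_chars("ı"))
```

Running `size_changing_chars` over the actual training text rather than an assumed alphabet would be the more reliable check, since stray characters from other scripts could be present.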
> >
> > On 23.06.2015 08:36, Marcin Junczys-Dowmunt wrote:
> >> I checked for some of my experiments and I get nearly identical BLEU
> >> scores when using the standard weights; differences are in the second
> >> decimal place, if at all. These results now seem more likely,
> >> though there is still variance.
> >>
> >> I am still wondering why truecasing would produce different files. Can
> >> truecasing be nondeterministic on the same data, anyone?
> >>
> >> Also, did you check where your files start to differ now, with common
> >> tokenized/truecased files?
> >>
> >> On 23.06.2015 05:06, Hokage Sama wrote:
> >>> Ok my scores don't vary so much when I just run tokenisation,
> >>> truecasing, and cleaning once. Found some differences beginning from
> >>> the truecased files. Here are my results now:
> >>>
> >>> BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929,
> >>> ref_len=3609)
> >>> BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914,
> >>> ref_len=3609)
> >>> BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917,
> >>> ref_len=3609)
> >>> BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920,
> >>> ref_len=3609)
> >>> BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935,
> >>> ref_len=3609)
> >>> BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937,
> >>> ref_len=3609)
> >>>
> >>> On 22 June 2015 at 17:53, Hokage Sama <nvnc...@gmail.com> wrote:
> >>>
> >>>       Ok will do
> >>>
> >>>       On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt
> >>>       <junc...@amu.edu.pl> wrote:
> >>>
> >>>           I don't think so. However, when you repeat those experiments,
> >>>           you might try to identify where two trainings start to
> >>>           diverge by pairwise comparisons of the same files between
> >>>           two runs. Maybe then we can deduce something.
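
One way to do that pairwise comparison is sketched below: walk the intermediate files in pipeline order and report the first line at which each pair differs. The file names follow the ones used in this thread; the run directories (`run1`, `run2`) and the function names are hypothetical, and the sketch assumes each run's outputs were copied aside before being deleted.

```python
from itertools import zip_longest
from pathlib import Path


def first_difference(path_a, path_b):
    """Return the 1-based number of the first differing line, or None
    if the two files are identical (a length mismatch counts as a difference)."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        for lineno, (a, b) in enumerate(zip_longest(fa, fb), start=1):
            if a != b:
                return lineno
    return None


def compare_runs(run_a, run_b, names):
    """Map each file name to None (identical), a line number (first
    difference), or "missing" if either run lacks the file."""
    report = {}
    for name in names:
        a, b = Path(run_a) / name, Path(run_b) / name
        if not (a.exists() and b.exists()):
            report[name] = "missing"
        else:
            report[name] = first_difference(a, b)
    return report


# Intermediate files in the order the pipeline produces them.
STAGES = [
    "train.tok.en", "train.tok.sm",
    "train.true.en", "train.true.sm",
    "train.clean.en", "train.clean.sm",
]

if __name__ == "__main__":
    # "run1" and "run2" are placeholder directories holding saved copies.
    for name, result in compare_runs("run1", "run2", STAGES).items():
        print(f"{name}: {result}")
```

The earliest stage that reports a line number is where the two runs start to diverge; everything downstream of it will differ as a consequence.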
> >>>
> >>>           On 23.06.2015 00:25, Hokage Sama wrote:
> >>>
> >>>               Hi, I delete all the files (I think) generated during a
> >>>               training job before rerunning the entire training. Do
> >>>               you think this could cause variation? Here are the
> >>>               commands I run to delete them:
> >>>
> >>>               rm ~/corpus/train.tok.en
> >>>               rm ~/corpus/train.tok.sm
> >>>               rm ~/corpus/train.true.en
> >>>               rm ~/corpus/train.true.sm
> >>>               rm ~/corpus/train.clean.en
> >>>               rm ~/corpus/train.clean.sm
> >>>               rm ~/corpus/truecase-model.en
> >>>               rm ~/corpus/truecase-model.sm
> >>>               rm ~/corpus/test.tok.en
> >>>               rm ~/corpus/test.tok.sm
> >>>               rm ~/corpus/test.true.en
> >>>               rm ~/corpus/test.true.sm
> >>>               rm -rf ~/working/filtered-test
> >>>               rm ~/working/test.out
> >>>               rm ~/working/test.translated.en
> >>>               rm ~/working/training.out
> >>>               rm -rf ~/working/train/corpus
> >>>               rm -rf ~/working/train/giza.en-sm
> >>>               rm -rf ~/working/train/giza.sm-en
> >>>               rm -rf ~/working/train/model
> >>>
> >>>               On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt
> >>>               <junc...@amu.edu.pl> wrote:
> >>>
> >>>                   You're welcome. Take another close look at those
> >>>                   varying BLEU scores though. That would make me
> >>>                   worry if it happened to me for the same data and
> >>>                   the same weights.
> >>>
> >>>                   On 22.06.2015 10:31, Hokage Sama wrote:
> >>>
> >>>                       Ok thanks. Appreciate your help.
> >>>
> >>>                       On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt
> >>>                       <junc...@amu.edu.pl> wrote:
> >>>
> >>>                           Difficult to tell with that little data. Once
> >>>                           you get beyond 100,000 segments (or 50,000 at
> >>>                           least) I would say 2000 each for the dev
> >>>                           (tuning) and test sets, rest for training.
> >>>                           With that few segments it's hard to give you
> >>>                           any recommendations since it might just not
> >>>                           give meaningful results. It's currently a toy
> >>>                           model, good for learning and playing around
> >>>                           with options. But not good for trying to
> >>>                           infer anything from BLEU scores.
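
For reference, a split along those lines (fixed-size dev and test sets, rest for training) could be sketched as follows. `split_parallel` and its parameters are hypothetical names; the fixed seed is my addition so that repeated experiments see exactly the same partition, and the sizes are whatever the corpus can afford.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size, test_size, seed=1234):
    """Shuffle aligned sentence pairs with a fixed seed and cut them into
    (train, dev, test) lists of (source, target) tuples."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of segments")
    if dev_size + test_size >= len(src_lines):
        raise ValueError("dev + test sizes leave nothing for training")

    # Pair the sides first so shuffling can never break the alignment.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)

    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test


if __name__ == "__main__":
    # Toy data standing in for the real Samoan-English files.
    src = [f"sm sentence {i}" for i in range(100)]
    tgt = [f"en sentence {i}" for i in range(100)]
    train, dev, test = split_parallel(src, tgt, dev_size=10, test_size=10)
    print(len(train), len(dev), len(test))
```

Writing the three lists back out once and reusing those files for every run keeps the data side of the experiment constant.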
> >>>
> >>>
> >>>                           On 22.06.2015 10:17, Hokage Sama wrote:
> >>>
> >>>                               Yes the language model was built earlier
> >>>                               when I first went through the manual to
> >>>                               build a French-English baseline system.
> >>>                               So I just reused it for my Samoan-English
> >>>                               system.
> >>>                               Yes for all three runs I used the same
> >>>                               training and testing files.
> >>>                               How can I determine how much parallel
> >>>                               data I should set aside for tuning and
> >>>                               testing? I have only 10,028 segments
> >>>                               (198,385 words) altogether. At the moment
> >>>                               I'm using 259 segments for testing and
> >>>                               the rest for training.
> >>>
> >>>                               Thanks,
> >>>                               Hilton
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >> _______________________________________________
> >> Moses-support mailing list
> >> Moses-support@mit.edu
> >> http://mailman.mit.edu/mailman/listinfo/moses-support
