Ok, my scores don't vary as much when I run tokenisation, truecasing, and cleaning just once. I found that the differences begin with the truecased files.
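For reference, here is roughly what I now run a single time up front and then reuse for every training run. This is only a sketch following the standard Moses baseline scripts; the ~/mosesdecoder path, the -l en fallback for the Samoan side, and the 1-80 sentence-length limits are assumptions from my setup:

# Tokenise both sides once (there are no Samoan tokeniser rules, so I assume -l en as a fallback)
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train.en > ~/corpus/train.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/train.sm > ~/corpus/train.tok.sm

# Train a truecasing model once per language, then apply it (shown for English; the .sm side is analogous)
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus ~/corpus/train.tok.en
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/train.tok.en > ~/corpus/train.true.en

# Clean once: drop sentence pairs that are empty, longer than 80 tokens, or badly length-mismatched
~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/train.true sm en ~/corpus/train.clean 1 80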
Here are my results now:

BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929, ref_len=3609)
BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914, ref_len=3609)
BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917, ref_len=3609)
BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920, ref_len=3609)
BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935, ref_len=3609)
BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937, ref_len=3609)

On 22 June 2015 at 17:53, Hokage Sama <nvnc...@gmail.com> wrote:
> Ok, will do.
>
> On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>
>> I don't think so. However, when you repeat those experiments, you might
>> try to identify where two trainings start to diverge by pairwise
>> comparisons of the same files between the two runs. Maybe then we can
>> deduce something.
>>
>> On 23.06.2015 00:25, Hokage Sama wrote:
>>
>>> Hi, I delete all the files (I think) generated during a training job
>>> before rerunning the entire training. Do you think this could cause the
>>> variation? Here are the commands I run to delete them:
>>>
>>> rm ~/corpus/train.tok.en
>>> rm ~/corpus/train.tok.sm
>>> rm ~/corpus/train.true.en
>>> rm ~/corpus/train.true.sm
>>> rm ~/corpus/train.clean.en
>>> rm ~/corpus/train.clean.sm
>>> rm ~/corpus/truecase-model.en
>>> rm ~/corpus/truecase-model.sm
>>> rm ~/corpus/test.tok.en
>>> rm ~/corpus/test.tok.sm
>>> rm ~/corpus/test.true.en
>>> rm ~/corpus/test.true.sm
>>> rm -rf ~/working/filtered-test
>>> rm ~/working/test.out
>>> rm ~/working/test.translated.en
>>> rm ~/working/training.out
>>> rm -rf ~/working/train/corpus
>>> rm -rf ~/working/train/giza.en-sm
>>> rm -rf ~/working/train/giza.sm-en
>>> rm -rf ~/working/train/model
>>>
>>> On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>>
>>>> You're welcome. Take another close look at those varying BLEU scores,
>>>> though. That would worry me if it happened to me with the same data and
>>>> the same weights.
>>>>
>>>> On 22.06.2015 10:31, Hokage Sama wrote:
>>>>
>>>>> Ok, thanks. I appreciate your help.
>>>>>
>>>>> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>>>>
>>>>>> Difficult to tell with that little data. Once you get beyond 100,000
>>>>>> segments (or at least 50,000), I would say 2,000 segments each for the
>>>>>> dev (tuning) and test sets, and the rest for training. With that few
>>>>>> segments it's hard to give you any recommendation, since it might just
>>>>>> not give meaningful results. It's currently a toy model: good for
>>>>>> learning and playing around with options, but not for trying to infer
>>>>>> anything from BLEU scores.
>>>>>>
>>>>>> On 22.06.2015 10:17, Hokage Sama wrote:
>>>>>>
>>>>>>> Yes, the language model was built earlier, when I first went through
>>>>>>> the manual to build a French-English baseline system, so I just
>>>>>>> reused it for my Samoan-English system. And yes, for all three runs I
>>>>>>> used the same training and testing files. How can I determine how
>>>>>>> much parallel data I should set aside for tuning and testing? I have
>>>>>>> only 10,028 segments (198,385 words) altogether. At the moment I'm
>>>>>>> using 259 segments for testing and the rest for training.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Hilton
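P.S. To act on Marcin's suggestion, here is a rough sketch of how I plan to compare the same intermediate files between two complete runs to spot where they first diverge. The run1/ and run2/ directories are hypothetical copies of each run's outputs, made before I delete anything:

# Compare each intermediate file pairwise between two runs; the first file
# reported as differing is where the trainings start to diverge.
for f in train.tok.en train.tok.sm train.true.en train.true.sm \
         train.clean.en train.clean.sm truecase-model.en truecase-model.sm
do
    if cmp -s run1/$f run2/$f; then
        echo "same:    $f"
    else
        echo "DIFFERS: $f"
    fi
done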
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support