Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Marcin Junczys-Dowmunt
I checked for some of my experiments and I get nearly identical BLEU 
scores when using the standard weights; the differences, if any, are in 
the second decimal place. These results now seem more plausible, though 
there is still some variance.

I am still wondering why truecasing would produce different files. Can 
truecasing be nondeterministic on the same data, anyone?
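
One quick way to check is to train the truecaser twice on the same 
tokenized input and diff the resulting models (a sketch using the standard 
Moses scripts; the /tmp file names are just placeholders):

~/mosesdecoder/scripts/recaser/train-truecaser.perl --model /tmp/tc-run1.en --corpus ~/corpus/train.tok.en
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model /tmp/tc-run2.en --corpus ~/corpus/train.tok.en
diff /tmp/tc-run1.en /tmp/tc-run2.en && echo "truecaser models identical"

If the models come out identical, the divergence has to creep in later 
(cleaning, training or decoding).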

Also, did you check where your files start to differ now that the 
tokenized/truecased files are shared between runs?

On 23.06.2015 05:06, Hokage Sama wrote:
> Ok my scores don't vary so much when I just run tokenisation, 
> truecasing, and cleaning once. Found some differences beginning from 
> the truecased files. Here are my results now:
>
> BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929, 
> ref_len=3609)
> BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914, 
> ref_len=3609)
> BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917, 
> ref_len=3609)
> BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920, 
> ref_len=3609)
> BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935, 
> ref_len=3609)
> BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937, 
> ref_len=3609)
>
> On 22 June 2015 at 17:53, Hokage Sama wrote:
>
> Ok will do
>
> On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>
> I don't think so. However, when you repeat those experiments,
> you might try to identify where two trainings are starting to
> diverge by pairwise comparisons of the same files between two
> runs. Maybe then we can deduce something.
>
> On 23.06.2015 00:25, Hokage Sama wrote:
>
> Hi, I delete all the files (I think) generated during a
> training job before rerunning the entire training. Do you
> think this could cause variation? Here are the commands I
> run to delete:
>
> rm ~/corpus/train.tok.en
> rm ~/corpus/train.tok.sm
> rm ~/corpus/train.true.en
> rm ~/corpus/train.true.sm
> rm ~/corpus/train.clean.en
> rm ~/corpus/train.clean.sm
> rm ~/corpus/truecase-model.en
> rm ~/corpus/truecase-model.sm
> rm ~/corpus/test.tok.en
> rm ~/corpus/test.tok.sm
> rm ~/corpus/test.true.en
> rm ~/corpus/test.true.sm
> rm -rf ~/working/filtered-test
> rm ~/working/test.out
> rm ~/working/test.translated.en
> rm ~/working/training.out
> rm -rf ~/working/train/corpus
> rm -rf ~/working/train/giza.en-sm
> rm -rf ~/working/train/giza.sm-en
> rm -rf ~/working/train/model
>
> On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>
> You're welcome. Take another close look at those
> varying bleu
> scores though. That would make me worry if it happened
> to me for
> the same data and the same weights.
>
> On 22.06.2015 10:31, Hokage Sama wrote:
>
> Ok thanks. Appreciate your help.
>
> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
> Difficult to tell with that little data. Once
> you get beyond
> 100,000 segments (or 50,000 at least) i would
> say 2000 per dev
> (for tuning) and test set, rest for training.
> With that few
> segments it's hard to give you any
> recommendations since
> it might
> just not give meaningful results. It's
> currently a toy
> model, good
> for learning and playing around with options.
> But not good for
> trying to infer anything from BLEU scores.
>
>
> On 22.06.2015 10:17, Hokage Sama wrote:
>
> 

Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Hokage Sama
Ok my scores don't vary so much when I just run tokenisation, truecasing,
and cleaning once. Found some differences beginning from the truecased
files. Here are my results now:

BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929,
ref_len=3609)
BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914,
ref_len=3609)
BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917,
ref_len=3609)
BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920,
ref_len=3609)
BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935,
ref_len=3609)
BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937,
ref_len=3609)
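
I guess for reporting I should give the mean over the runs together with 
the spread (or do a proper significance test as in the papers you linked). 
A quick sketch for computing that from the scores above:

printf '%s\n' 16.85 16.82 16.59 16.40 17.25 16.78 | \
  awk '{s+=$1; q+=$1*$1; n++} END {m=s/n; printf "mean=%.2f stdev=%.2f\n", m, sqrt(q/n - m*m)}'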

On 22 June 2015 at 17:53, Hokage Sama  wrote:

> Ok will do
>
> On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt wrote:
>
>> I don't think so. However, when you repeat those experiments, you might
>> try to identify where two trainings are starting to diverge by pairwise
>> comparisons of the same files between two runs. Maybe then we can deduce
>> something.
>>
>> On 23.06.2015 00:25, Hokage Sama wrote:
>>
>>> Hi, I delete all the files (I think) generated during a training job
>>> before rerunning the entire training. Do you think this could cause variation?
>>> Here are the commands I run to delete:
>>>
>>> rm ~/corpus/train.tok.en
>>> rm ~/corpus/train.tok.sm 
>>> rm ~/corpus/train.true.en
>>> rm ~/corpus/train.true.sm 
>>> rm ~/corpus/train.clean.en
>>> rm ~/corpus/train.clean.sm 
>>> rm ~/corpus/truecase-model.en
>>> rm ~/corpus/truecase-model.sm 
>>> rm ~/corpus/test.tok.en
>>> rm ~/corpus/test.tok.sm 
>>> rm ~/corpus/test.true.en
>>> rm ~/corpus/test.true.sm 
>>> rm -rf ~/working/filtered-test
>>> rm ~/working/test.out
>>> rm ~/working/test.translated.en
>>> rm ~/working/training.out
>>> rm -rf ~/working/train/corpus
>>> rm -rf ~/working/train/giza.en-sm
>>> rm -rf ~/working/train/giza.sm-en
>>> rm -rf ~/working/train/model
>>>
>>> On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt wrote:
>>>
>>> You're welcome. Take another close look at those varying bleu
>>> scores though. That would make me worry if it happened to me for
>>> the same data and the same weights.
>>>
>>> On 22.06.2015 10:31, Hokage Sama wrote:
>>>
>>> Ok thanks. Appreciate your help.
>>>
>>> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>>
>>> Difficult to tell with that little data. Once you get beyond
>>> 100,000 segments (or 50,000 at least) i would say 2000 per
>>> dev
>>> (for tuning) and test set, rest for training. With that few
>>> segments it's hard to give you any recommendations since
>>> it might
>>> just not give meaningful results. It's currently a toy
>>> model, good
>>> for learning and playing around with options. But not good
>>> for
>>> trying to infer anything from BLEU scores.
>>>
>>>
>>> On 22.06.2015 10:17, Hokage Sama wrote:
>>>
>>> Yes the language model was built earlier when I first
>>> went
>>> through the manual to build a French-English baseline
>>> system.
>>> So I just reused it for my Samoan-English system.
>>> Yes for all three runs I used the same training and
>>> testing files.
>>> How can I determine how much parallel data I should
>>> set aside
>>> for tuning and testing? I have only 10,028 segments
>>> (198,385
>>> words) altogether. At the moment I'm using 259
>>> segments for
>>> testing and the rest for training.
>>>
>>> Thanks,
>>> Hilton
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Hokage Sama
Ok will do

On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt  wrote:

> I don't think so. However, when you repeat those experiments, you might
> try to identify where two trainings are starting to diverge by pairwise
> comparisons of the same files between two runs. Maybe then we can deduce
> something.
>
> On 23.06.2015 00:25, Hokage Sama wrote:
>
>> Hi, I delete all the files (I think) generated during a training job
>> before rerunning the entire training. Do you think this could cause variation?
>> Here are the commands I run to delete:
>>
>> rm ~/corpus/train.tok.en
>> rm ~/corpus/train.tok.sm 
>> rm ~/corpus/train.true.en
>> rm ~/corpus/train.true.sm 
>> rm ~/corpus/train.clean.en
>> rm ~/corpus/train.clean.sm 
>> rm ~/corpus/truecase-model.en
>> rm ~/corpus/truecase-model.sm 
>> rm ~/corpus/test.tok.en
>> rm ~/corpus/test.tok.sm 
>> rm ~/corpus/test.true.en
>> rm ~/corpus/test.true.sm 
>> rm -rf ~/working/filtered-test
>> rm ~/working/test.out
>> rm ~/working/test.translated.en
>> rm ~/working/training.out
>> rm -rf ~/working/train/corpus
>> rm -rf ~/working/train/giza.en-sm
>> rm -rf ~/working/train/giza.sm-en
>> rm -rf ~/working/train/model
>>
>> On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt wrote:
>>
>> You're welcome. Take another close look at those varying bleu
>> scores though. That would make me worry if it happened to me for
>> the same data and the same weights.
>>
>> On 22.06.2015 10:31, Hokage Sama wrote:
>>
>> Ok thanks. Appreciate your help.
>>
>> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>> Difficult to tell with that little data. Once you get beyond
>> 100,000 segments (or 50,000 at least) i would say 2000 per dev
>> (for tuning) and test set, rest for training. With that few
>> segments it's hard to give you any recommendations since
>> it might
>> just not give meaningful results. It's currently a toy
>> model, good
>> for learning and playing around with options. But not good for
>> trying to infer anything from BLEU scores.
>>
>>
>> On 22.06.2015 10:17, Hokage Sama wrote:
>>
>> Yes the language model was built earlier when I first went
>> through the manual to build a French-English baseline
>> system.
>> So I just reused it for my Samoan-English system.
>> Yes for all three runs I used the same training and
>> testing files.
>> How can I determine how much parallel data I should
>> set aside
>> for tuning and testing? I have only 10,028 segments
>> (198,385
>> words) altogether. At the moment I'm using 259
>> segments for
>> testing and the rest for training.
>>
>> Thanks,
>> Hilton
>>
>>
>>
>>
>>
>>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Marcin Junczys-Dowmunt
I don't think so. However, when you repeat those experiments, you might 
try to identify where two trainings are starting to diverge by pairwise 
comparisons of the same files between two runs. Maybe then we can 
deduce something.
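
A sketch of what I mean, assuming you keep the two runs in separate 
directories (say ~/run1 and ~/run2, hypothetical names) with the same 
layout, and compare the intermediate files step by step:

for f in corpus/train.true.en corpus/train.true.sm \
         corpus/train.clean.en corpus/train.clean.sm \
         working/train/model/lex.e2f working/train/model/phrase-table.gz; do
    cmp -s ~/run1/$f ~/run2/$f && echo "same:    $f" || echo "differs: $f"
done

The first file that differs points at the step that introduces the 
randomness.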

On 23.06.2015 00:25, Hokage Sama wrote:
> Hi, I delete all the files (I think) generated during a training job 
> before rerunning the entire training. Do you think this could cause 
> variation? Here are the commands I run to delete:
>
> rm ~/corpus/train.tok.en
> rm ~/corpus/train.tok.sm 
> rm ~/corpus/train.true.en
> rm ~/corpus/train.true.sm 
> rm ~/corpus/train.clean.en
> rm ~/corpus/train.clean.sm 
> rm ~/corpus/truecase-model.en
> rm ~/corpus/truecase-model.sm 
> rm ~/corpus/test.tok.en
> rm ~/corpus/test.tok.sm 
> rm ~/corpus/test.true.en
> rm ~/corpus/test.true.sm 
> rm -rf ~/working/filtered-test
> rm ~/working/test.out
> rm ~/working/test.translated.en
> rm ~/working/training.out
> rm -rf ~/working/train/corpus
> rm -rf ~/working/train/giza.en-sm
> rm -rf ~/working/train/giza.sm-en
> rm -rf ~/working/train/model
>
> On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt  > wrote:
>
> You're welcome. Take another close look at those varying bleu
> scores though. That would make me worry if it happened to me for
> the same data and the same weights.
>
> On 22.06.2015 10:31, Hokage Sama wrote:
>
> Ok thanks. Appreciate your help.
>
> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>
> Difficult to tell with that little data. Once you get beyond
> 100,000 segments (or 50,000 at least) i would say 2000 per dev
> (for tuning) and test set, rest for training. With that few
> segments it's hard to give you any recommendations since
> it might
> just not give meaningful results. It's currently a toy
> model, good
> for learning and playing around with options. But not good for
> trying to infer anything from BLEU scores.
>
>
> On 22.06.2015 10:17, Hokage Sama wrote:
>
> Yes the language model was built earlier when I first went
> through the manual to build a French-English baseline
> system.
> So I just reused it for my Samoan-English system.
> Yes for all three runs I used the same training and
> testing files.
> How can I determine how much parallel data I should
> set aside
> for tuning and testing? I have only 10,028 segments
> (198,385
> words) altogether. At the moment I'm using 259
> segments for
> testing and the rest for training.
>
> Thanks,
> Hilton
>
>
>
>
>

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Major bug found in Moses

2015-06-22 Thread Marcin Junczys-Dowmunt
That would make for very cool student projects.
Also that video is acing it, even the voice-over is synthetic :)

On 23.06.2015 00:27, Ondrej Bojar wrote:
> ...and I wouldn't be surprised to find Moses also behind this Java-to-C# 
> automatic translation:
>
> https://www.youtube.com/watch?v=CHDDNnRm-g8
>
> O.
>
> - Original Message -
>> From: "Marcin Junczys-Dowmunt" 
>> To: moses-support@mit.edu
>> Sent: Friday, 19 June, 2015 19:21:45
>> Subject: Re: [Moses-support] Major bug found in Moses
>> On that interesting idea that moses should be naturally good at
>> translating things, just for general considerations.
>>
>> Since some said this thread has educational value I would like to share
>> something that might not be obvious due to the SMT-biased posts here.
>> Moses is also the _leading_ tool for automatic grammatical error
>> correction (GEC) right now. The first and third system of the CoNLL
>> shared task 2014 were based on Moses. By now I have results that surpass
>> the CoNLL results by far by adding some specialized features to Moses
>> (which thanks to Hieu is very easy).
>>
>> It even gets good results for GEC when you do crazy things like
>> inverting the TM (so it should actually make the input worse) provided
>> you tune on the correct metric and for the correct task. The interaction
>> of all the other features after tuning makes that possible.
>>
>> So, if anything, Moses is just a very flexible text-rewriting tool.
>> Tuning (and data) turns it into a translator, GEC tool, POS-tagger,
>> Chunker, Semantic Tagger etc.
>>
>> On 19.06.2015 18:40, Lane Schwartz wrote:
>>> On Fri, Jun 19, 2015 at 11:28 AM, Read, James C >> > wrote:
>>>
>>>  What I take issue with is the en-masse denial that there is a
>>>  problem with the system if it behaves in such a way with no LM +
>>>  no pruning and/or tuning.
>>>
>>>
>>> There is no mass denial taking place.
>>>
>>> Regardless of whether or not you tune, the decoder will do its best to
>>> find translations with the highest model score. That is the expected
>>> behavior.
>>>
>>> What I have tried to tell you, and what other people have tried to
>>> tell you, is that translations with high model scores are not
>>> necessarily good translations.
>>>
>>> We all want our models to be such that high model scores correspond to
>>> good translations, and that low model scores correspond with bad
>>> translations. But unfortunately, our models do not innately have this
>>> characteristic. We all know this. We also know a good way to deal with
>>> this shortcoming, namely tuning. Tuning is the process by which we
>>> attempt to ensure that high model scores correspond to high quality
>>> translations, and that low model scores correspond to low quality
>>> translations.
>>>
>>> If you can design models that naturally correspond with translation
>>> quality without tuning, that's great. If you can do that, you've got a
>>> great shot at winning a Best Paper award at ACL.
>>>
>>> In the meantime, you may want to consider an apology for your rude
>>> behavior and unprofessional attitude.
>>>
>>> Goodbye.
>>> Lane
>>>
>>>
>>>
>>> ___
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Major bug found in Moses

2015-06-22 Thread Ondrej Bojar
...and I wouldn't be surprised to find Moses also behind this Java-to-C# 
automatic translation:

https://www.youtube.com/watch?v=CHDDNnRm-g8

O.

- Original Message -
> From: "Marcin Junczys-Dowmunt" 
> To: moses-support@mit.edu
> Sent: Friday, 19 June, 2015 19:21:45
> Subject: Re: [Moses-support] Major bug found in Moses

> On that interesting idea that moses should be naturally good at
> translating things, just for general considerations.
> 
> Since some said this thread has educational value I would like to share
> something that might not be obvious due to the SMT-biased posts here.
> Moses is also the _leading_ tool for automatic grammatical error
> correction (GEC) right now. The first and third system of the CoNLL
> shared task 2014 were based on Moses. By now I have results that surpass
> the CoNLL results by far by adding some specialized features to Moses
> (which thanks to Hieu is very easy).
> 
> It even gets good results for GEC when you do crazy things like
> inverting the TM (so it should actually make the input worse) provided
> you tune on the correct metric and for the correct task. The interaction
> of all the other features after tuning makes that possible.
> 
> So, if anything, Moses is just a very flexible text-rewriting tool.
> Tuning (and data) turns it into a translator, GEC tool, POS-tagger,
> Chunker, Semantic Tagger etc.
> 
> On 19.06.2015 18:40, Lane Schwartz wrote:
>> On Fri, Jun 19, 2015 at 11:28 AM, Read, James C > > wrote:
>>
>> What I take issue with is the en-masse denial that there is a
>> problem with the system if it behaves in such a way with no LM +
>> no pruning and/or tuning.
>>
>>
>> There is no mass denial taking place.
>>
>> Regardless of whether or not you tune, the decoder will do its best to
>> find translations with the highest model score. That is the expected
>> behavior.
>>
>> What I have tried to tell you, and what other people have tried to
>> tell you, is that translations with high model scores are not
>> necessarily good translations.
>>
>> We all want our models to be such that high model scores correspond to
>> good translations, and that low model scores correspond with bad
>> translations. But unfortunately, our models do not innately have this
>> characteristic. We all know this. We also know a good way to deal with
>> this shortcoming, namely tuning. Tuning is the process by which we
>> attempt to ensure that high model scores correspond to high quality
>> translations, and that low model scores correspond to low quality
>> translations.
>>
>> If you can design models that naturally correspond with translation
>> quality without tuning, that's great. If you can do that, you've got a
>> great shot at winning a Best Paper award at ACL.
>>
>> In the meantime, you may want to consider an apology for your rude
>> behavior and unprofessional attitude.
>>
>> Goodbye.
>> Lane
>>
>>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
> 
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support

-- 
Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Hokage Sama
Hi, I delete all the files (I think) generated during a training job before
rerunning the entire training. Do you think this could cause variation? Here are
the commands I run to delete:

rm ~/corpus/train.tok.en
rm ~/corpus/train.tok.sm
rm ~/corpus/train.true.en
rm ~/corpus/train.true.sm
rm ~/corpus/train.clean.en
rm ~/corpus/train.clean.sm
rm ~/corpus/truecase-model.en
rm ~/corpus/truecase-model.sm
rm ~/corpus/test.tok.en
rm ~/corpus/test.tok.sm
rm ~/corpus/test.true.en
rm ~/corpus/test.true.sm
rm -rf ~/working/filtered-test
rm ~/working/test.out
rm ~/working/test.translated.en
rm ~/working/training.out
rm -rf ~/working/train/corpus
rm -rf ~/working/train/giza.en-sm
rm -rf ~/working/train/giza.sm-en
rm -rf ~/working/train/model

On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt  wrote:

> You're welcome. Take another close look at those varying bleu scores
> though. That would make me worry if it happened to me for the same data and
> the same weights.
>
> On 22.06.2015 10:31, Hokage Sama wrote:
>
>> Ok thanks. Appreciate your help.
>>
>> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt > > wrote:
>>
>> Difficult to tell with that little data. Once you get beyond
>> 100,000 segments (or 50,000 at least) i would say 2000 per dev
>> (for tuning) and test set, rest for training. With that few
>> segments it's hard to give you any recommendations since it might
>> just not give meaningful results. It's currently a toy model, good
>> for learning and playing around with options. But not good for
>> trying to infer anything from BLEU scores.
>>
>>
>> On 22.06.2015 10:17, Hokage Sama wrote:
>>
>> Yes the language model was built earlier when I first went
>> through the manual to build a French-English baseline system.
>> So I just reused it for my Samoan-English system.
>> Yes for all three runs I used the same training and testing files.
>> How can I determine how much parallel data I should set aside
>> for tuning and testing? I have only 10,028 segments (198,385
>> words) altogether. At the moment I'm using 259 segments for
>> testing and the rest for training.
>>
>> Thanks,
>> Hilton
>>
>>
>>
>>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] How to re-run tuning using EMS

2015-06-22 Thread Barry Haddow
Just remove steps/1/TUNING_tune.1.DONE (replacing 1 with your experiment 
id) and then re-run.


It would be nice if EMS supported multiple tuning runs without 
intervention, but afaik it doesn't.
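
A sketch of the steps, assuming experiment id 1 and that you launch EMS 
with experiment.perl as usual (adjust paths and the run id to your setup):

cd /path/to/your/ems/working/dir
rm steps/1/TUNING_tune.1.DONE
nohup nice ~/mosesdecoder/scripts/ems/experiment.perl -continue 1 -exec >& rerun-tuning.out &

With the DONE marker gone, EMS treats tuning and everything downstream of 
it as pending, while the finished training steps are reused.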


On 22/06/15 16:15, Lane Schwartz wrote:
Given a successful run of EMS, what do I need to do to configure a new 
run that re-uses all of the training, but re-runs tuning?


Thanks,
Lane



___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] How to re-run tuning using EMS

2015-06-22 Thread Lane Schwartz
Given a successful run of EMS, what do I need to do to configure a new run
that re-uses all of the training, but re-runs tuning?

Thanks,
Lane
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Hokage Sama
Ok I will.

On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt  wrote:

> You're welcome. Take another close look at those varying bleu scores
> though. That would make me worry if it happened to me for the same data and
> the same weights.
>
> On 22.06.2015 10:31, Hokage Sama wrote:
>
>> Ok thanks. Appreciate your help.
>>
>> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt > > wrote:
>>
>> Difficult to tell with that little data. Once you get beyond
>> 100,000 segments (or 50,000 at least) i would say 2000 per dev
>> (for tuning) and test set, rest for training. With that few
>> segments it's hard to give you any recommendations since it might
>> just not give meaningful results. It's currently a toy model, good
>> for learning and playing around with options. But not good for
>> trying to infer anything from BLEU scores.
>>
>>
>> On 22.06.2015 10:17, Hokage Sama wrote:
>>
>> Yes the language model was built earlier when I first went
>> through the manual to build a French-English baseline system.
>> So I just reused it for my Samoan-English system.
>> Yes for all three runs I used the same training and testing files.
>> How can I determine how much parallel data I should set aside
>> for tuning and testing? I have only 10,028 segments (198,385
>> words) altogether. At the moment I'm using 259 segments for
>> testing and the rest for training.
>>
>> Thanks,
>> Hilton
>>
>>
>>
>>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Marcin Junczys-Dowmunt
You're welcome. Take another close look at those varying bleu scores 
though. That would make me worry if it happened to me for the same data 
and the same weights.

On 22.06.2015 10:31, Hokage Sama wrote:
> Ok thanks. Appreciate your help.
>
> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt  > wrote:
>
> Difficult to tell with that little data. Once you get beyond
> 100,000 segments (or 50,000 at least) i would say 2000 per dev
> (for tuning) and test set, rest for training. With that few
> segments it's hard to give you any recommendations since it might
> just not give meaningful results. It's currently a toy model, good
> for learning and playing around with options. But not good for
> trying to infer anything from BLEU scores.
>
>
> On 22.06.2015 10:17, Hokage Sama wrote:
>
> Yes the language model was built earlier when I first went
> through the manual to build a French-English baseline system.
> So I just reused it for my Samoan-English system.
> Yes for all three runs I used the same training and testing files.
> How can I determine how much parallel data I should set aside
> for tuning and testing? I have only 10,028 segments (198,385
> words) altogether. At the moment I'm using 259 segments for
> testing and the rest for training.
>
> Thanks,
> Hilton
>
>
>

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Hokage Sama
Ok thanks. Appreciate your help.

On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt  wrote:

> Difficult to tell with that little data. Once you get beyond 100,000
> segments (or 50,000 at least) i would say 2000 per dev (for tuning) and
> test set, rest for training. With that few segments it's hard to give you
> any recommendations since it might just not give meaningful results. It's
> currently a toy model, good for learning and playing around with options.
> But not good for trying to infer anything from BLEU scores.
>
>
> On 22.06.2015 10:17, Hokage Sama wrote:
>
>> Yes the language model was built earlier when I first went through the
>> manual to build a French-English baseline system. So I just reused it for
>> my Samoan-English system.
>> Yes for all three runs I used the same training and testing files.
>> How can I determine how much parallel data I should set aside for tuning
>> and testing? I have only 10,028 segments (198,385 words) altogether. At the
>> moment I'm using 259 segments for testing and the rest for training.
>>
>> Thanks,
>> Hilton
>>
>>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Marcin Junczys-Dowmunt
Difficult to tell with that little data. Once you get beyond 100,000 
segments (or 50,000 at least) I would say 2000 each for the dev (tuning) 
and test sets, and the rest for training. With that few segments it's hard 
to give you any recommendations, since it might just not give meaningful 
results. It's currently a toy model, good for learning and playing around 
with options, but not good for trying to infer anything from BLEU scores.
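
If it helps, a rough way to carve out dev and test sets once you have more 
data (a sketch with made-up file names; it assumes GNU shuf/sed, no tab 
characters in the text, and keeps the two sides of the parallel corpus 
aligned):

paste corpus.sm corpus.en | shuf > corpus.both
head -n 2000 corpus.both | cut -f1 > dev.sm
head -n 2000 corpus.both | cut -f2 > dev.en
sed -n '2001,4000p' corpus.both | cut -f1 > test.sm
sed -n '2001,4000p' corpus.both | cut -f2 > test.en
tail -n +4001 corpus.both | cut -f1 > train.sm
tail -n +4001 corpus.both | cut -f2 > train.en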

On 22.06.2015 10:17, Hokage Sama wrote:
> Yes the language model was built earlier when I first went through the 
> manual to build a French-English baseline system. So I just reused it 
> for my Samoan-English system.
> Yes for all three runs I used the same training and testing files.
> How can I determine how much parallel data I should set aside for 
> tuning and testing? I have only 10,028 segments (198,385 words) 
> altogether. At the moment I'm using 259 segments for testing and the 
> rest for training.
>
> Thanks,
> Hilton
>

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Hokage Sama
Yes the language model was built earlier when I first went through the
manual to build a French-English baseline system. So I just reused it for
my Samoan-English system.
Yes for all three runs I used the same training and testing files.
How can I determine how much parallel data I should set aside for tuning
and testing? I have only 10,028 segments (198,385 words) altogether. At the
moment I'm using 259 segments for testing and the rest for training.

Thanks,
Hilton

On 22 June 2015 at 02:52, Marcin Junczys-Dowmunt  wrote:

> Don't see any reason for indeterminism here. Unless mgiza is less stable
> for small data than I thought. The lm lm/news-commentary-v8.fr-en.blm.en
> has been built earlier somewhere?
>
> And to be sure: for all three runs you used exactly the same data,
> training and test set?
>
> On 22.06.2015 09:34, Hokage Sama wrote:
>
>> Wow that was a long read. Still reading though :) but I see that tuning
>> is essential. I am fairly new to Moses so could you please check if the
>> commands I ran were correct (minus the tuning part). I just modified the
>> commands on the Moses website for building a baseline system. Below are the
>> commands I ran. My training files are "compilation.en" and "compilation.sm".
>> My test files are "test.en" and "test.sm".
>>
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <
>> ~/corpus/training/compilation.en > ~/corpus/compilation.tok.en
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm <
>> ~/corpus/training/compilation.sm > ~/corpus/compilation.tok.sm
>> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model
>> ~/corpus/truecase-model.en --corpus ~/corpus/compilation.tok.en
>> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model
>> ~/corpus/truecase-model.sm --corpus ~/corpus/compilation.tok.sm
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model
>> ~/corpus/truecase-model.en < ~/corpus/compilation.tok.en >
>> ~/corpus/compilation.true.en
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model
>> ~/corpus/truecase-model.sm < ~/corpus/compilation.tok.sm >
>> ~/corpus/compilation.true.sm
>> ~/mosesdecoder/scripts/training/clean-corpus-n.perl
>> ~/corpus/compilation.true sm en ~/corpus/compilation.clean 1 80
>>
>> cd ~/working
>> nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir
>> train -corpus ~/corpus/compilation.clean -f sm -e en -alignment
>> grow-diag-final-and -reordering msd-bidirectional-fe -lm
>> 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 -external-bin-dir
>> ~/mosesdecoder/tools >& training.out &
>>
>> cd ~/corpus
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.en >
>> test.tok.en
>> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < test.sm >
>> test.tok.sm
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en <
>> test.tok.en > test.true.en
>> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.sm <
>> test.tok.sm > test.true.sm
>>
>> cd ~/working
>> ~/mosesdecoder/scripts/training/filter-model-given-input.pl filtered-test
>> train/model/moses.ini ~/corpus/test.true.sm -Binarizer
>> ~/mosesdecoder/bin/processPhraseTableMin
>> nohup nice ~/mosesdecoder/bin/moses -f ~/working/filtered-test/moses.ini
>> < ~/corpus/test.true.sm > ~/working/test.translated.en 2> ~/working/test.out
>> ~/mosesdecoder/scripts/generic/multi-bleu.perl -lc ~/corpus/test.true.en
>> < ~/working/test.translated.en
>>
>> On 22 June 2015 at 01:20, Marcin Junczys-Dowmunt > > wrote:
>>
>> Hm. That's interesting. The language should not matter.
>>
>> 1) Do not report results without tuning. They are meaningless.
>> There is a whole thread on that, look for "Major bug found in
>> Moses". If you ignore the trollish aspects, it contains many good
>> descriptions of why this is a mistake.
>>
>> 2) Assuming it was the same data every time (was it?), without
>> tuning however I do not quite see where the variance is coming
>> from. This rather suggests you have something weird in your
>> pipeline. Mgiza is the only stochastic element there, but usually
>> its results are quite consistent. For the same weights in your
>> ini-file you should have very similar results. Tuning would be the
>> part that introduces instability, but even then these differences
>> would be a little on the extreme end, though possible.
>>
>> On 22.06.2015 08:12, Hokage Sama wrote:
>>
>> Thanks Marcin. Its for a new resource-poor language so I only
>> trained it with

Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Marcin Junczys-Dowmunt
Don't see any reason for indeterminism here. Unless mgiza is less stable 
for small data than I thought. The lm lm/news-commentary-v8.fr-en.blm.en 
has been built earlier somewhere?

And to be sure: for all three runs you used exactly the same data, 
training and test set?
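
For reference, the baseline LM from the tutorial would have been built 
roughly like this with the KenLM tools shipped in mosesdecoder (a sketch; 
file names follow the manual):

~/mosesdecoder/bin/lmplz -o 3 < ~/corpus/news-commentary-v8.fr-en.true.en > ~/lm/news-commentary-v8.fr-en.arpa.en
~/mosesdecoder/bin/build_binary ~/lm/news-commentary-v8.fr-en.arpa.en ~/lm/news-commentary-v8.fr-en.blm.en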

On 22.06.2015 09:34, Hokage Sama wrote:
> Wow that was a long read. Still reading though :) but I see that 
> tuning is essential. I am fairly new to Moses so could you please 
> check if the commands I ran were correct (minus the tuning part). I 
> just modified the commands on the Moses website for building a 
> baseline system. Below are the commands I ran. My training files are 
> "compilation.en" and "compilation.sm". My test 
> files are "test.en" and "test.sm".
>
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <
> ~/corpus/training/compilation.en > ~/corpus/compilation.tok.en
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm <
> ~/corpus/training/compilation.sm > ~/corpus/compilation.tok.sm
> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model
> ~/corpus/truecase-model.en --corpus ~/corpus/compilation.tok.en
> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model
> ~/corpus/truecase-model.sm --corpus ~/corpus/compilation.tok.sm
> ~/mosesdecoder/scripts/recaser/truecase.perl --model
> ~/corpus/truecase-model.en < ~/corpus/compilation.tok.en >
> ~/corpus/compilation.true.en
> ~/mosesdecoder/scripts/recaser/truecase.perl --model
> ~/corpus/truecase-model.sm < ~/corpus/compilation.tok.sm >
> ~/corpus/compilation.true.sm
> ~/mosesdecoder/scripts/training/clean-corpus-n.perl
> ~/corpus/compilation.true sm en ~/corpus/compilation.clean 1 80
>
> cd ~/working
> nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir
> train -corpus ~/corpus/compilation.clean -f sm -e en -alignment
> grow-diag-final-and -reordering msd-bidirectional-fe -lm
> 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 -external-bin-dir
> ~/mosesdecoder/tools >& training.out &
>
> cd ~/corpus
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.en >
> test.tok.en
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < test.sm >
> test.tok.sm
> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en <
> test.tok.en > test.true.en
> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.sm <
> test.tok.sm > test.true.sm
>
> cd ~/working
> ~/mosesdecoder/scripts/training/filter-model-given-input.pl filtered-test
> train/model/moses.ini ~/corpus/test.true.sm -Binarizer
> ~/mosesdecoder/bin/processPhraseTableMin
> nohup nice ~/mosesdecoder/bin/moses -f ~/working/filtered-test/moses.ini <
> ~/corpus/test.true.sm > ~/working/test.translated.en 2> ~/working/test.out
> ~/mosesdecoder/scripts/generic/multi-bleu.perl -lc ~/corpus/test.true.en <
> ~/working/test.translated.en
>
> On 22 June 2015 at 01:20, Marcin Junczys-Dowmunt  > wrote:
>
> Hm. That's interesting. The language should not matter.
>
> 1) Do not report results without tuning. They are meaningless.
> There is a whole thread on that, look for "Major bug found in
> Moses". If you ignore the trollish aspects, it contains many good
> descriptions of why this is a mistake.
>
> 2) Assuming it was the same data every time (was it?), without
> tuning however I do not quite see where the variance is coming
> from. This rather suggests you have something weird in your
> pipeline. Mgiza is the only stochastic element there, but usually
> its results are quite consistent. For the same weights in your
> ini-file you should have very similar results. Tuning would be the
> part that introduces instability, but even then these differences
> would be a little on the extreme end, though possible.
>
> On 22.06.2015 08:12, Hokage Sama wrote:
>
> Thanks Marcin. It's for a new resource-poor language, so I only
> trained it with what I could collect so far (i.e. only 190,630
> words of parallel data). I retrained the entire system each
> time without any tuning.
>
> On 22 June 2015 at 01:00, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>
> Hi,
> I think the average is OK; your variance is however quite high.
> Did you retrain the entire system or just optimize parameters a
> couple of times?
>
> Two useful papers on

Re: [Moses-support] BLEU Score Variance: Which score to use?

2015-06-22 Thread Hokage Sama
Wow that was a long read. Still reading though :) but I see that tuning is
essential. I am fairly new to Moses so could you please check if the
commands I ran were correct (minus the tuning part). I just modified the
commands on the Moses website for building a baseline system. Below are the
commands I ran. My training files are "compilation.en" and "compilation.sm".
My test files are "test.en" and "test.sm".

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <
~/corpus/training/compilation.en > ~/corpus/compilation.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < ~/corpus/training/
compilation.sm > ~/corpus/compilation.tok.sm
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model
~/corpus/truecase-model.en --corpus ~/corpus/compilation.tok.en
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/
truecase-model.sm --corpus ~/corpus/compilation.tok.sm
~/mosesdecoder/scripts/recaser/truecase.perl --model
~/corpus/truecase-model.en < ~/corpus/compilation.tok.en >
~/corpus/compilation.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/
truecase-model.sm < ~/corpus/compilation.tok.sm > ~/corpus/
compilation.true.sm
~/mosesdecoder/scripts/training/clean-corpus-n.perl
~/corpus/compilation.true sm en ~/corpus/compilation.clean 1 80

cd ~/working
nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train
-corpus ~/corpus/compilation.clean -f sm -e en -alignment
grow-diag-final-and -reordering msd-bidirectional-fe -lm
0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 -external-bin-dir
~/mosesdecoder/tools >& training.out &

cd ~/corpus
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.en >
test.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < test.sm >
test.tok.sm
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en <
test.tok.en > test.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.sm <
test.tok.sm > test.true.sm

cd ~/working
~/mosesdecoder/scripts/training/filter-model-given-input.pl filtered-test
train/model/moses.ini ~/corpus/test.true.sm -Binarizer
~/mosesdecoder/bin/processPhraseTableMin
nohup nice ~/mosesdecoder/bin/moses -f ~/working/filtered-test/moses.ini <
~/corpus/test.true.sm > ~/working/test.translated.en 2> ~/working/test.out
~/mosesdecoder/scripts/generic/multi-bleu.perl -lc ~/corpus/test.true.en <
~/working/test.translated.en
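
I understand the step missing here compared to the full baseline recipe is 
tuning. I guess it would look something like this (assuming I first hold 
out a dev set, dev.true.sm and dev.true.en, prepared the same way as the 
test set):

cd ~/working
nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl \
  ~/corpus/dev.true.sm ~/corpus/dev.true.en \
  ~/mosesdecoder/bin/moses train/model/moses.ini \
  --mertdir ~/mosesdecoder/bin/ &> mert.out &

The tuned weights should then end up in mert-work/moses.ini, which is what 
I would filter and decode with instead of train/model/moses.ini.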

On 22 June 2015 at 01:20, Marcin Junczys-Dowmunt  wrote:

> Hm. That's interesting. The language should not matter.
>
> 1) Do not report results without tuning. They are meaningless. There is a
> whole thread on that, look for "Major bug found in Moses". If you ignore
> the trollish aspects, it contains many good descriptions of why this is a
> mistake.
>
> 2) Assuming it was the same data every time (was it?), without tuning
> however I do not quite see where the variance is coming from. This rather
> suggests you have something weird in your pipeline. Mgiza is the only
> stochastic element there, but usually its results are quite consistent. For
> the same weights in your ini-file you should have very similar results.
> Tuning would be the part that introduces instability, but even then these
> differences would be a little on the extreme end, though possible.
>
> On 22.06.2015 08:12, Hokage Sama wrote:
>
>> Thanks Marcin. It's for a new resource-poor language, so I only trained it
>> with what I could collect so far (i.e. only 190,630 words of parallel
>> data). I retrained the entire system each time without any tuning.
>>
>> On 22 June 2015 at 01:00, Marcin Junczys-Dowmunt > > wrote:
>>
>> Hi,
>> I think the average is OK; your variance is however quite high.
>> Did you retrain the entire system or just optimize parameters a couple of
>> times?
>>
>> Two useful papers on the topic:
>>
>> https://www.cs.cmu.edu/~jhclark/pubs/significance.pdf
>> 
>> http://www.mt-archive.info/MTS-2011-Cettolo.pdf
>>
>>
>> On 22.06.2015 02:37, Hokage Sama wrote:
>> > Hi,
>> >
>> > Since MT training is non-convex and thus the BLEU score varies,
>> which
>> > score should I use for my system? I trained my system three times
>> > using the same data and obtained the three different scores below.
>> > Should I take the average or the best score?
>> >
>> > BLEU = 17.84, 49.1/22.0/12.5/7.5 (BP=1.000, ratio=1.095,
>> hyp_len=3952,
>> > ref_len=3609)
>> > BLEU = 16.51, 48.4/20.7/11.4/6.5 (BP=1.000, ratio=1.093,
>> hyp_len=3945,
>> > ref_len=3609)
>> > BLEU = 15.33, 48.2/20.1/10.3/5.5 (BP=1.000, ratio=1.087,
>> hyp_len=3924,
>> > ref_len=3609)
>> >
>> > Thanks,
>> > Hilton
>> >
>> >
>> > ___
>> > Moses-support mailing list
>> > Moses-support@mit.edu 
>> > http://ma