Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

Michael Denkowski Wed, 14 Oct 2015 09:16:30 -0700

Hi Davood,

If you're comparing two versions of the system to see what effect your work
has on translation quality, you can run Jon Clark's MultEval
<https://github.com/jhclark/multeval> (an implementation of the hypothesis
testing described in the paper).  From the BLEU differences you reported,
1000 sentences should be enough to get pretty stable results for your
system.  If you run MERT 3 times for each system and MultEval reports
statistically significant improvement across all metrics (BLEU, TER,
Meteor), that's a pretty good indicator that the system is better.
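
As a rough illustration of the idea (not MultEval's actual code), here is a
small Python sketch. It summarizes BLEU over replicated tuning runs and runs
a paired approximate randomization test. The function names are hypothetical,
and using per-sentence scores with a difference of means is a simplifying
assumption; MultEval itself handles corpus-level metrics across optimizer
runs.

```python
import random
from statistics import mean, stdev


def run_stats(bleu_scores):
    """Mean and sample standard deviation of BLEU over replicated runs."""
    return mean(bleu_scores), stdev(bleu_scores)


def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Paired approximate randomization test (hypothetical helper).

    scores_a and scores_b are per-sentence scores for the same test
    sentences from two systems (an assumption made for this sketch).
    Returns an estimated two-sided p-value for the difference in means.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must score the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(mean(scores_a) - mean(scores_b))
    hits = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are
            # exchangeable, so swap each pair with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        # Small tolerance guards against floating-point noise on ties.
        if abs(sum_a - sum_b) / n >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    # The two BLEU scores reported earlier in this thread.
    avg, sd = run_stats([21.95, 21.82])
    print(f"mean BLEU {avg:.3f}, std dev {sd:.3f}")
```

If the spread between tuning runs of one system is about as large as the gap
between the two systems, the gap alone is not good evidence of improvement.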


Best,
Michael

On Wed, Oct 14, 2015 at 1:50 AM, Davood Mohammadifar <davood...@hotmail.com>
wrote:

> Thanks Michael for the paper and thanks Tom.
>
> Based on the paper, one solution is replication of MERT and testing at
> least three times.
>
> My ideas have subtle effects on BLEU. Do you recommend that I run MERT and
> testing three times or more? Should I increase the number of sentences for
> tuning?
>
> My dataset for Persian to English includes:
> Training: about 240000 sentences
> Tune: 1000 sentences
> Test: 1000 sentences
>
> ------------------------------
> From: tah...@precisiontranslationtools.com
> Date: Sun, 11 Oct 2015 12:53:37 +0700
> To: moses-support@mit.edu
> Subject: Re: [Moses-support] BLEU score difference about 0.13 for one
> dataset is normal?
>
>
> Yes. Each tuning with the same test set will give you small variations in
> the final BLEU. Yours look like they're in a normal range.
>
>
>
> Date: Sun, 11 Oct 2015 04:23:56 +0000
> From: Davood Mohammadifar <davood...@hotmail.com>
> Subject: [Moses-support] BLEU score difference about 0.13 for one
> dataset is normal?
> To: Moses Support <moses-support@mit.edu>
>
> Hello everyone,
>
> I noticed different BLEU scores for the same dataset. The difference is
> small, about 0.13.
>
> I trained on my dataset and tuned on the development set for
> Persian-English translation. After testing, the score was 21.95. The
> second time I ran the same process and obtained 21.82. (My tools were
> mgiza, mert, ...)
>
> Is this difference normal?
>
> My system:
> CPU: Core i7-4790K
> RAM: 16GB
> OS: Ubuntu 12.04
>
> Thanks
>
> _______________________________________________ Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

