Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-14 Thread Michael Denkowski
Hi Davood,

If you're comparing two versions of the system to see what effect your work
has on translation quality, you can run Jon Clark's MultEval
<https://github.com/jhclark/multeval> (an implementation of the hypothesis
testing described in the paper). From the BLEU differences you reported,
1000 sentences should be enough to get pretty stable results for your
system. If you run MERT 3 times for each system and MultEval reports a
statistically significant improvement across all metrics (BLEU, TER,
Meteor), that's a pretty good indicator that the system is better.
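For intuition, the significance test MultEval implements (paired approximate randomization) can be sketched in a few lines of Python. This is a toy illustration under assumed inputs, not MultEval's actual code: it takes per-sentence metric scores for two systems on the same test set and estimates how often a random relabeling of the paired outputs produces a difference at least as large as the one observed.

```python
import random

def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Paired approximate randomization test (a sketch of the idea
    behind MultEval's significance testing).

    scores_a, scores_b: per-sentence metric scores for systems A and B
    on the same test set, aligned by sentence.
    Returns an approximate p-value for the observed mean difference.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        sa, sb = 0.0, 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are
            # exchangeable, so swap each pair with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sa += a
            sb += b
        if abs(sa - sb) / n >= observed:
            hits += 1
    # Add-one smoothing so the p-value is never exactly zero.
    return (hits + 1) / (trials + 1)
```

If the systems' per-sentence scores are essentially identical, the returned p-value is near 1; a large, consistent gap drives it toward 0.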

Best,
Michael

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-14 Thread Tom Hoar

Davood,

I don't know enough about your data and use cases to recommend one way
or the other. Running MERT multiple times will give you different BLEU
scores, but I have never found the deltas to make a difference in a
production environment.


Tom




Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-13 Thread Davood Mohammadifar



Thanks Michael for the paper, and thanks Tom.

Based on the paper, one solution is to replicate MERT and testing at least
three times.

My ideas have subtle effects on BLEU. Do you recommend running MERT and
testing three times or more? Should I increase the number of sentences for
tuning?

My dataset for Persian to English includes:
Training: about 24 sentences
Tune: 1000 sentences
Test: 1000 sentences



Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-10 Thread Tom Hoar
Yes. Each tuning run with the same test set will give you small variations
in the final BLEU. Yours look like they're in a normal range.





[Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-10 Thread Davood Mohammadifar
Hello everyone,

I noticed different BLEU scores for the same dataset. The difference is not
large, about 0.13.

I trained my system and tuned on a development set for Persian-English
translation. After testing, the score was 21.95. The second time I ran the
same process, I obtained 21.82. (My tools were MGIZA, MERT, ...)

Is this difference normal?

My system:
CPU: Core i7-4790K
RAM: 16GB
OS: ubuntu 12.04

Thanks


Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-10 Thread Michael Denkowski
Hi Davood,

Optimizers like MERT will give you a slightly different result every time
you run them, leading to variance in BLEU score. It's generally a good
idea to use multiple optimizer runs, especially when comparing two
systems. There's a good paper on hypothesis testing for MT that goes into
detail on this. Some other parts of a standard system, like word
alignment, can also be non-deterministic, but the optimizer is the most
frequent cause of fluctuating metric scores.
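A common way to present results under this variance is to average over several optimizer runs and report the spread rather than a single run's score. A minimal sketch (the three values are hypothetical, in the range Davood reported):

```python
from statistics import mean, stdev

# BLEU scores from three independent MERT runs of the same system
# (hypothetical values, for illustration only)
runs = [21.95, 21.82, 21.90]

# Report mean and standard deviation rather than a single run's BLEU
print(f"BLEU: {mean(runs):.2f} +/- {stdev(runs):.2f} over {len(runs)} runs")
```

With a spread like this, a 0.13 difference between two single runs is within run-to-run noise.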

Best,
Michael
