Thanks, that's a very useful answer. I figured something similar, but I
was curious why such huge differences between the methods are never
reported anywhere. Even in your paper they are just a few percent.
Also, could it be that the default METEOR setting is slightly
overfitting to the WMT ranking task? I have the impression that for
systems with generally higher BLEU scores than WMT systems (beyond
45% BLEU), METEOR seems to flatten out, barely changing values, while
BLEU differences are 4-6% absolute. This does not happen for BLEU
values around 20-30%; in that range METEOR scales nearly linearly,
following BLEU scores quite closely.
Cheers,
Marcin
On 26.11.2014 at 22:31, Michael Denkowski wrote:
Hi Marcin,
Meteor scores can vary widely across tasks due to the training data
and goal. The default ranking task tries to replicate WMT rankings,
so the absolute scores are not as important as the relative scores
between systems. The adequacy task tries to fit Meteor scores to
numeric adequacy judgements as linearly as possible. If you're
looking to evaluate a system in isolation to see if the translations
are good, you can simulate an adequacy scale with the adq task.
If you're comparing multiple systems, you should get the most reliable
ranking with the default rank task, but the absolute scores will be
less meaningful.
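For reference, the task is selected with Meteor's -t flag on the command
line (a sketch from memory of the Meteor README; hyp.txt and ref.txt are
placeholder file names, and memory settings may need adjusting):

```shell
# Rank task (the default): most reliable for comparing systems against
# each other, but absolute scores are less meaningful
java -Xmx2G -jar meteor-1.5.jar hyp.txt ref.txt -l en -t rank

# Adequacy task: scores are fit to numeric adequacy judgements as
# linearly as possible, useful for judging a system in isolation
java -Xmx2G -jar meteor-1.5.jar hyp.txt ref.txt -l en -t adq
```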
Best,
Michael
On Wed, Nov 26, 2014 at 9:34 AM, Marcin Junczys-Dowmunt
junc...@amu.edu.pl wrote:
Hi,
A question concerning METEOR, maybe someone has some experience. I
am seeing huge differences between values for English with the
default ranking task and any other of the tasks (e.g. adq), up
to 30-40 points. Is this normal? In the literature I only ever see
marginal differences of maybe 1 or 2 percent, but nothing like 35%
vs. 65%. For the language-independent setting I still get a score
of 55%.
See for instance:
http://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-wmt11.pdf where
the Urdu-English system shows much smaller differences between
ranking and adq. I get the same discrepancies with
meteor-1.3.jar and meteor-1.5.jar.
Cheers,
Marcin
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support