Thanks, that's a very useful answer. I figured something similar, but I
was curious why such huge differences between the methods are never
reported anywhere. Even in your paper they are just a few percent.
Also, could it be that the default METEOR setting is slightly
overfitting to the WMT ranking task? I have the impression that for
systems with generally higher BLEU scores than WMT systems (beyond
45% BLEU), METEOR seems to flatten out, barely changing values, while
BLEU differences are 4-6% absolute. This does not happen for BLEU
values around 20-30%; in that range METEOR scales nearly linearly,
following BLEU scores quite closely.
Cheers,
Marcin
On 26.11.2014 at 22:31, Michael Denkowski wrote:
Hi Marcin,
Meteor scores can vary widely across tasks due to the training data
and goal. The default ranking task tries to replicate WMT rankings,
so the absolute scores are not as important as the relative scores
between systems. The adequacy task tries to fit Meteor scores to
numeric adequacy judgements as linearly as possible. If you're
looking to evaluate a system in isolation to see if the translations
are good, you can simulate an adequacy scale with the adq task.
If you're comparing multiple systems, you should get the most reliable
ranking with the default rank task, but the absolute scores will be
less meaningful.
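For reference, the task is selected with Meteor's -t flag on the command
line (a sketch from memory of the Meteor README; hyp.txt and ref.txt are
placeholder file names, and memory settings may need adjusting):

```shell
# Rank task (the default): most reliable for comparing systems against
# each other, but absolute scores are less meaningful
java -Xmx2G -jar meteor-1.5.jar hyp.txt ref.txt -l en -t rank

# Adequacy task: scores are fit to numeric adequacy judgements as
# linearly as possible, useful for judging a system in isolation
java -Xmx2G -jar meteor-1.5.jar hyp.txt ref.txt -l en -t adq
```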
Best,
Michael
On Wed, Nov 26, 2014 at 9:34 AM, Marcin Junczys-Dowmunt
junc...@amu.edu.pl wrote:
Hi,
A question concerning METEOR, maybe someone has some experience. I
am seeing huge differences between values for English with the
default ranking task and any other of the tasks (e.g. adq), up
to 30-40 points. Is this normal? In the literature I only ever see
marginal differences of maybe 1 or 2 percent, but nothing like 35%
vs. 65%. For the language-independent setting I still get a score
of 55%.
See for instance:
http://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-wmt11.pdf where
the Urdu-English system shows much smaller differences between
ranking and adq. I get the same discrepancies with
meteor-1.3.jar and meteor-1.5.jar.
Cheers,
Marcin
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support