Thanks, that's a very useful answer. I figured something similar, but I
was curious why these huge differences between the methods are
never reported anywhere. Even in your paper they are just a few percent.
Also, could it be that the default METEOR setting is slightly
overfitting to the WMT ranking data?
Hi Marcin,
Meteor scores can vary widely across tasks because of differences in the
training data and the tuning objective. The default ranking task tries to
replicate WMT rankings, so the
absolute scores are not as important as the relative scores between
systems. The adequacy task tries to fit Meteor scores to numeric adequacy
judgments.
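
For what it's worth, a quick way to see the gap yourself is to score the
same output under both tasks and compare the final scores. Below is a
rough, untested Python sketch; the jar name, the file names, the task
names passed to -t, and the "Final score" output line are assumptions
based on Meteor 1.5's documented usage, so please check the README for
your version:

  import subprocess

  # Hypothetical file names -- point these at your Meteor jar and data.
  METEOR_JAR = "meteor-1.5.jar"
  HYP, REF = "system.hyp", "newstest.ref"

  # Score the same hypotheses under the default ranking task and the
  # adequacy task; only the tuned parameter set changes, not the matcher.
  for task in ("rank", "adq"):
      out = subprocess.run(
          ["java", "-Xmx2G", "-jar", METEOR_JAR, HYP, REF,
           "-l", "en", "-t", task],
          capture_output=True, text=True, check=True,
      ).stdout
      for line in out.splitlines():
          # Meteor prints the corpus-level score on a "Final score" line.
          if "Final score" in line:
              print(task, "->", line.strip())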
Hi,
A question concerning METEOR, maybe someone has some experience. I am
seeing huge differences between values for English with the default task
"ranking" and any of the other tasks (e.g. "adq"), up to 30-40 points.
Is this normal? In the literature I only ever see marginal differences
of maybe a few points.