Hi Marcin,

Meteor scores can vary widely across tasks because each task's parameters
are tuned on different data toward a different objective.  The default
"rank" task tries to replicate WMT human rankings, so the absolute scores
matter less than the relative scores between systems.  The "adq" (adequacy)
task tries to fit Meteor scores to numeric adequacy judgements as linearly
as possible.  If you're evaluating a system in isolation to see whether the
translations are "good", you can approximate an adequacy scale with the
"adq" task.  If you're comparing multiple systems, you should get the most
reliable ranking with the default "rank" task, but the absolute scores will
be less meaningful.
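If it helps, here is a minimal sketch of scoring the same output under both
tasks, assuming the -l, -norm, and -t options as documented in the Meteor
1.5 README, with placeholder file names (hyps.txt, refs.txt):

    # Default ranking task (tuned to replicate WMT rankings)
    java -Xmx2G -jar meteor-1.5.jar hyps.txt refs.txt -l en -norm -t rank

    # Adequacy task (tuned to fit numeric adequacy judgements linearly)
    java -Xmx2G -jar meteor-1.5.jar hyps.txt refs.txt -l en -norm -t adq

The resulting scores are only comparable within the same task, so pick one
task and use it consistently when comparing systems.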

Best,
Michael

On Wed, Nov 26, 2014 at 9:34 AM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl>
wrote:

>  Hi,
>
> A question concerning METEOR, maybe someone has some experience. I am
> seeing huge differences between values for English with the default task
> "ranking" and any other of the tasks (e.g. "adq"), up to 30-40 points. Is
> this normal? In the literature I only ever see marginal differences of
> maybe 1 or 2 per cent, nothing like 35% vs. 65%. For the
> language-independent setting I still get a score of 55%.
>
> See, for instance,
> http://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-wmt11.pdf, where the
> Urdu-English system shows much smaller differences between "ranking" and
> "adq". I get the same discrepancies with both meteor-1.3.jar and
> meteor-1.5.jar.
>
> Cheers,
>
> Marcin
>
>
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support