Hi all,

I'm running some experiments for my thesis, and a more experienced user has
told me that the BLEU/METEOR scores my MT engine achieves are too good to be
true. Since this is the very first MT engine I've ever built and I have
little experience interpreting these metrics, I really don't know how to
judge the numbers. The first test set achieves a BLEU score of 0.6508 (v13)
and a METEOR score of 0.7055 (v1.3; exact, stem, paraphrase). A second test
set gives slightly lower scores: BLEU 0.6267 and METEOR 0.6748.

Here are some basic facts about my system:
Decoding direction: EN-DE
Training corpus: 1.8 million sentences
Tuning runs: 5
Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
LM type: trigram
TM type: unfactored
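
For what it's worth, one sanity check I was considering is whether any test
sentences also occur verbatim in the training corpus, since that kind of
overlap would inflate both metrics. Below is a rough sketch of what I had in
mind (the file names are just placeholders for my one-sentence-per-line
plain-text data):

    # Rough overlap check between the training and test source sides.
    # Assumes one sentence per line; file names are placeholders.

    def load_sentences(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    train = set(load_sentences("train.en"))      # training source side
    for name in ("test_a.en", "test_b.en"):      # the two test sets
        test = load_sentences(name)
        overlap = sum(1 for s in test if s in train)
        print(f"{name}: {overlap}/{len(test)} sentences also in training "
              f"({100.0 * overlap / len(test):.1f}%)")

If that fraction turned out to be high, it would at least partly explain the
numbers.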

I'm now trying to figure out whether these scores are realistic at all,
since various papers report far lower BLEU scores, e.g. Koehn and Hoang
2011. Any comments on what scores are typical for this decoding direction
would be much appreciated.

Best,
Daniel