Hi all, I'm running some experiments for my thesis, and a more experienced user told me that the BLEU/METEOR scores my MT engine achieves are too good to be true. Since this is the very first MT engine I have ever built and I am not experienced with interpreting these metrics, I don't really know what to make of the numbers. The first test set achieves a BLEU score of 0.6508 (v13). METEOR's final score is 0.7055 (v1.3; exact, stem, paraphrase modules). A second test set gives slightly lower scores: BLEU 0.6267 and METEOR 0.6748.
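In case it helps to rule out a scoring-script or tokenization issue on my side, here is a rough sketch of how I could cross-check the corpus-level BLEU independently with NLTK's corpus_bleu (the file names are placeholders and the whitespace tokenization is naive, so this would only approximate the mteval/multi-bleu output, not reproduce it exactly):

    # Sketch: cross-check corpus-level BLEU with NLTK
    # (placeholder file names; naive whitespace tokenization)
    from nltk.translate.bleu_score import corpus_bleu

    def load_tokenized(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip().split() for line in f]

    # MT output and reference, one sentence per line, parallel files
    hypotheses = load_tokenized("testset_a.hyp.de")
    references = [[ref] for ref in load_tokenized("testset_a.ref.de")]  # single reference each

    bleu = corpus_bleu(references, hypotheses)  # default: 4-gram, uniform weights
    print("BLEU: %.4f" % bleu)                  # same 0-1 scale as the scores above

Different BLEU implementations can disagree by a point or two because of tokenization, but not by the margin between my scores and published ones, so I assume the scripts themselves are not the explanation.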
Here are some basic facts about my system:
- Decoding direction: EN-DE
- Training corpus: 1.8 million sentences
- Tuning runs: 5
- Test sets: (a) 2,000 sentences, (b) 1,000 sentences (both in-domain)
- LM type: trigram
- TM type: unfactored

I'm now trying to figure out whether these scores are realistic at all, since published papers report far lower BLEU scores for this direction, e.g. Koehn and Hoang 2011. Any comments on this decoding direction and the scores one can typically expect would be much appreciated.

Best,
Daniel