I =think= I recall that pairwise BLEU scores for human translators are usually around 0.50, so anything much better than that is indeed suspect.
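If it helps, here is a minimal sketch of how such a pairwise BLEU score could be computed, assuming two human translations of the same source sit in hypothetical files ref_a.de and ref_b.de (one tokenized sentence per line). It uses NLTK's corpus_bleu rather than the mteval-v13 script, so the exact value will differ somewhat from the numbers Daniel reports:

    # Pairwise BLEU sketch: score translator A's output against translator B's,
    # treating B as the single reference. File names below are hypothetical.
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    with open("ref_a.de", encoding="utf-8") as f:
        hyps = [line.split() for line in f]       # translator A, scored as "system" output
    with open("ref_b.de", encoding="utf-8") as f:
        refs = [[line.split()] for line in f]     # translator B, used as the reference

    # corpus_bleu expects one list of reference token lists per segment.
    bleu = corpus_bleu(refs, hyps,
                       smoothing_function=SmoothingFunction().method1)
    print("pairwise BLEU: %.4f" % bleu)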
- JB

On Apr 26, 2012, at 14:18, Daniel Schaut wrote:

> Hi all,
>
> I'm running some experiments for my thesis, and a more experienced user
> has told me that the BLEU/METEOR scores achieved by my MT engine are too
> good to be true. Since this is the very first MT engine I've ever built
> and I have no experience interpreting scores, I really don't know what to
> make of them. The first test set achieves a BLEU score of 0.6508 (v13).
> METEOR's final score is 0.7055 (v1.3, exact, stem, paraphrase). A second
> test set gives a slightly lower BLEU score of 0.6267 and a METEOR score
> of 0.6748.
>
> Here are some basic facts about my system:
>
> Decoding direction: EN-DE
> Training corpus: 1.8 mil sentences
> Tuning runs: 5
> Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
> LM type: trigram
> TM type: unfactored
>
> I'm now trying to figure out whether these scores are realistic at all,
> as different papers report far lower BLEU scores, e.g. Koehn and Hoang
> 2011. Any comments on this decoding direction and the related scores
> would be much appreciated.
>
> Best,
> Daniel