I *think* I recall that pairwise BLEU scores for human translators are usually 
around 0.50, so anything much better than that is indeed suspect.
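
If you want an independent sanity check on the numbers outside of the scorer 
bundled with Moses, a minimal sketch along these lines (Python with NLTK; the 
file names are placeholders, and the files are assumed to be tokenized, one 
sentence per line) recomputes corpus-level BLEU:

    # Hypothetical cross-check: recompute corpus BLEU with NLTK,
    # independently of the scorer used for the reported numbers.
    from nltk.translate.bleu_score import corpus_bleu

    def read_tokenized(path):
        # One sentence per line, split on whitespace.
        with open(path) as f:
            return [line.split() for line in f]

    hypotheses = read_tokenized("test.hyp.de")                     # system output
    references = [[ref] for ref in read_tokenized("test.ref.de")]  # one reference per sentence

    # corpus_bleu expects, for each sentence, a list of reference token lists.
    print("BLEU: %.4f" % corpus_bleu(references, hypotheses))

If that comes out in the same ballpark as your reported scores, at least the 
scoring itself is not the problem.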

- JB

On Apr 26, 2012, at 14:18 , Daniel Schaut wrote:

> Hi all,
> 
> 
> I’m running some experiments for my thesis, and a more experienced user has 
> told me that the BLEU/METEOR scores my MT engine achieves are too good to be 
> true. Since this is the very first MT engine I’ve ever built and I have no 
> experience interpreting scores, I really don’t know how to assess them. The 
> first test set achieves a BLEU score of 0.6508 (v13). METEOR’s final score is 
> 0.7055 (v1.3, exact, stem, paraphrase). A second test set gives a slightly 
> lower BLEU score of 0.6267 and a METEOR score of 0.6748.
> 
> 
> Here are some basic facts about my system:
> 
> Decoding direction: EN-DE
> 
> Training corpus: 1.8 million sentences
> 
> Tuning runs: 5
> 
> Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
> 
> LM type: trigram
> 
> TM type: unfactored
> 
> 
> I’m now trying to figure out whether these scores are realistic at all, since 
> various papers report far lower BLEU scores, e.g. Koehn and Hoang 2011. Any 
> comments regarding this decoding direction and the related scores would be 
> much appreciated.
> 
> 
> Best,
> 
> Daniel
> 


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
