On Thu, 26 Apr 2012 at 20:18 +0200, Daniel Schaut wrote:
> Hi all,
> 
> I’m running some experiments for my thesis and I’ve been told by a
> more experienced user that the BLEU/METEOR scores my MT engine
> achieves are too good to be true. Since this is the very first MT
> engine I’ve ever built and I have little experience interpreting
> scores, I really don’t know how to judge them. The first test set
> achieves a BLEU score of 0.6508 (v13). METEOR’s final score is 0.7055
> (v1.3, exact, stem, paraphrase). A second test set gives a slightly
> lower BLEU score of 0.6267 and a METEOR score of 0.6748.
> 
> Here are some basic facts about my system:
> 
> Decoding direction: EN-DE
> 
> Training corpus: 1.8 million sentences
> 
> Tuning runs: 5
> 
> Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
> 
> LM type: trigram
> 
> TM type: unfactored
> 
> I’m now trying to figure out whether these scores are realistic at
> all, as various papers report far lower BLEU scores for this decoding
> direction, e.g. Koehn and Hoang 2011. Any comments on the decoding
> direction and the scores that can be expected for it would be much
> appreciated.

Did you try looking at the sentences? 1,000 is few enough to eyeball
them. Have you tried the same system with a different corpus (e.g.
EuroParl)? Have you checked that your test set and your training set do
not intersect? A quick script like the one below can catch that.
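
Something along these lines would do as a rough sketch; the file names
train.en and test.en are placeholders for the source side of your
training corpus and your test set (one sentence per line), not files
from any particular setup:

#!/usr/bin/env python3
# Rough check for overlap between training and test data.
# "train.en" and "test.en" are placeholder names for the source-side
# files of the training corpus and the test set, one sentence per line.

def load_sentences(path):
    """Return a set of whitespace-normalised, lower-cased lines."""
    with open(path, encoding="utf-8") as f:
        return {" ".join(line.split()).lower() for line in f if line.strip()}

train = load_sentences("train.en")
test = load_sentences("test.en")

overlap = test & train
print("%d of %d test sentences also occur in the training data (%.1f%%)"
      % (len(overlap), len(test), 100.0 * len(overlap) / len(test)))

# Show a few of the offending sentences for manual inspection.
for sentence in sorted(overlap)[:10]:
    print(sentence)

If a noticeable fraction of the test sentences turns up verbatim in the
training data, that alone could explain unusually high BLEU/METEOR
scores.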

If the scores don't seem believable, then probably they aren't :)

Fran
