On Thu, 26 Apr 2012 at 20:18 +0200, Daniel Schaut wrote:
> Hi all,
>
> I'm running some experiments for my thesis, and I've been told by a
> more experienced user that the BLEU/METEOR scores my MT engine
> achieves are too good to be true. Since this is the very first MT
> engine I've ever built and I am not experienced in interpreting
> scores, I really don't know what to make of them. The first test set
> achieves a BLEU score of 0.6508 (v13). METEOR's final score is 0.7055
> (v1.3, exact, stem, paraphrase). A second test set gave a slightly
> lower BLEU score of 0.6267 and a METEOR score of 0.6748.
>
> Here are some basic facts about my system:
>
> Decoding direction: EN-DE
> Training corpus: 1.8 mil sentences
> Tuning runs: 5
> Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
> LM type: trigram
> TM type: unfactored
>
> I'm now trying to figure out whether these scores are realistic at
> all, as various papers report far lower BLEU scores, e.g. Koehn and
> Hoang 2011. Any comments regarding the mentioned decoding direction
> and related scores will be much appreciated.
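One quick sanity check on the numbers themselves is to re-score the same output with an independent BLEU implementation, e.g. NLTK's corpus_bleu. A minimal sketch (the file names are placeholders for your tokenized MT output and reference; tokenization and smoothing differ from your scorer, so expect somewhat different numbers):

    # Rough cross-check of corpus-level BLEU with NLTK.
    # "test.output.de" and "test.ref.de" are placeholder file names.
    from nltk.translate.bleu_score import corpus_bleu

    def read_tokenized(path):
        # one sentence per line, whitespace-tokenized
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f]

    hyps = read_tokenized("test.output.de")               # MT output
    refs = [[r] for r in read_tokenized("test.ref.de")]   # one reference per sentence

    print("corpus BLEU: %.4f" % corpus_bleu(refs, hyps))

If an independent scorer lands in the same region, the score itself is probably computed correctly and the question becomes whether the test set is too easy.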
Did you try looking at the sentences? 1,000 is few enough to eyeball
them.

Have you tried the same system with a different corpus (e.g. EuroParl)?

Have you checked that your test set and your training set do not
intersect?

If the scores don't seem believable, then probably they aren't :)

Fran
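P.S. A minimal way to check the last point, assuming plain one-sentence-per-line files (the file names below are just placeholders for the source sides of your training and test data): count how many test sentences occur verbatim in the training corpus.

    # Count test sentences that also appear, whitespace-normalized,
    # in the training data. "train.en" and "test.en" are placeholder names.
    def sentences(path):
        with open(path, encoding="utf-8") as f:
            return [" ".join(line.split()) for line in f]

    train = set(sentences("train.en"))
    test = sentences("test.en")

    overlap = sum(1 for s in test if s in train)
    print("%d of %d test sentences also appear in the training data"
          % (overlap, len(test)))

If a sizeable fraction of the test sentences shows up in the training data, unusually high BLEU/METEOR scores are easy to explain.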