Hi guys,
Thank you for your comprehensive comments.

> The most likely thing is that you have some of your test set included
> in your training set.

Indeed, there are some similarities owing to the domain (instruction manuals). Typically, for all kinds of manuals you will find a high degree of similarity, e.g. at the sub-segment level. I extracted test set A and the tuning sets from the whole corpus before training my engine, to make sure that test set A doesn't interfere with the training set. Hmmm… that's an epic fail then… Test set B was provided at a much later stage, when the training process was already done.

> Did you try looking at the sentences? -- 1,000 is few enough to eyeball
> them. Have you tried the same system with a different corpus? (e.g.
> EuroParl). Have you checked that your test set and your training set do
> not intersect?

Apart from scoring, I checked almost every sentence in both test sets for my thesis. The quality of the outputs is moderate for sentences up to 50 words; everything beyond that is of lesser quality. In particular, sentences up to 20 words come out at a good level. I've just prepared a third and a fourth test set, one from the OpenOffice corpus files and one from another batch of in-domain files. For the OO files (2,000 sentences), BLEU is 0.0858 and METEOR is 0.3031. Kind of disappointing… The fourth test set of 2,000 sentences shows similar scores to the other in-domain test sets.

> Very short sentences will give you high scores.

This might truly be another issue boosting the scores. On average, almost half of the sentences in test sets A and B are quite short.

To conclude, could one say that I've created an engine suitable for a specific domain, but whose performance outside that domain is almost zero?
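For anyone wanting to run the two checks discussed above, here is a minimal sketch (not from the original thread) of (1) an exact-sentence overlap check between training and test data, and (2) the share of "short" test sentences that can inflate BLEU. The file handling is omitted and the 20-word threshold is an assumption taken from my quality observations above.

```python
def overlap(train_sents, test_sents):
    """Return test sentences that also occur verbatim in the training data."""
    train_set = {s.strip() for s in train_sents}
    return [s for s in test_sents if s.strip() in train_set]

def short_ratio(test_sents, max_words=20):
    """Fraction of test sentences with at most `max_words` tokens."""
    if not test_sents:
        return 0.0
    short = sum(1 for s in test_sents if len(s.split()) <= max_words)
    return short / len(test_sents)

if __name__ == "__main__":
    # Toy manual-style data for illustration only.
    train = ["press the power button", "open the cover", "insert the cartridge"]
    test = ["open the cover", "replace the toner cartridge carefully"]
    print(overlap(train, test))   # -> ['open the cover']
    print(short_ratio(test))      # -> 1.0 (both toy sentences are short)
```

A real check on manuals should probably also look at sub-segment (n-gram) overlap rather than only full sentences, since the domain repeats phrases heavily even when whole sentences differ.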
Best,
Daniel

From: miles...@gmail.com [mailto:miles...@gmail.com] On behalf of Miles Osborne
Sent: 26 April 2012 21:17
To: John D Burger
Cc: Daniel Schaut; moses-support@mit.edu
Subject: Re: [Moses-support] Higher BLEU/METEOR score than usual for EN-DE

Very short sentences will give you high scores. Also multiple references will boost them.

Miles

On Apr 26, 2012 8:13 PM, "John D Burger" <j...@mitre.org> wrote:

I =think= I recall that pairwise BLEU scores for human translators are usually around 0.50, so anything much better than that is indeed suspect.

- JB

On Apr 26, 2012, at 14:18 , Daniel Schaut wrote:

> Hi all,
>
> I'm running some experiments for my thesis and I've been told by a more
> experienced user that the achieved scores for BLEU/METEOR of my MT engine
> were too good to be true. Since this is the very first MT engine I've ever
> made and I am not experienced with interpreting scores, I really don't know
> how to interpret them. The first test set achieves a BLEU score of 0.6508
> (v13). METEOR's final score is 0.7055 (v1.3, exact, stem, paraphrase). A
> second test set indicated a slightly lower BLEU score of 0.6267 and a METEOR
> score of 0.6748.
>
> Here are some basic facts about my system:
>
> Decoding direction: EN-DE
> Training corpus: 1.8 mil sentences
> Tuning runs: 5
> Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
> LM type: trigram
> TM type: unfactored
>
> I'm now trying to figure out if these scores are realistic at all, as
> different papers indicate far lower BLEU scores, e.g. Koehn and Hoang
> 2011. Any comments regarding the mentioned decoding direction and related
> scores will be much appreciated.
> Best,
> Daniel

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support