Hi guys,

 

Thank you for your comprehensive comments.

 

The most likely thing is that you have some of your test set included in your training set.

 

Indeed, there are some similarities owing to the domain (instruction manuals). In manuals of all kinds you will typically find a high degree of similarity, e.g. at the sub-segment level. I extracted test set A and the tuning sets from the whole corpus before training my engine, to make sure that test set A does not overlap with the training set. Test set B, however, was provided at a much later stage, when the training process was already done, so I cannot rule out an overlap there. Hmmm… that would be an epic fail then…
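
Just to be sure, I will run a quick check for verbatim overlap between test set B and the training data along these lines (only a rough sketch; the file names are placeholders for my own corpus files, and I'm assuming plain one-sentence-per-line text):

# check_overlap.py - count test sentences that also occur verbatim in the training data
# (train.en and testsetB.en are placeholder names for my own files)

def load_sentences(path):
    # normalise whitespace so different spacing does not hide a match
    with open(path, encoding="utf-8") as f:
        return [" ".join(line.split()) for line in f if line.strip()]

train = set(load_sentences("train.en"))
test = load_sentences("testsetB.en")

overlap = sum(1 for s in test if s in train)
print(f"{overlap} of {len(test)} test sentences "
      f"({100.0 * overlap / len(test):.1f}%) also occur in the training data")

If that number turns out to be non-trivial, it would explain part of the inflated scores.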

 

Did you try looking at the sentences? -- 1,000 is few enough to eyeball them. Have you tried the same system with a different corpus? (e.g. EuroParl). Have you checked that your test set and your training set do not intersect?

 

Apart from scoring, I checked almost every sentence in both test sets for my thesis. The quality of the output is moderate for sentences of up to 50 words; everything beyond that is of lower quality. Sentences of up to 20 words in particular are of good quality.

I've just prepared a third and a fourth test set, one from the OpenOffice corpus files and one from another batch of in-domain files. For the OO files (2,000 sentences), BLEU is 0.0858 and METEOR is 0.3031. Kind of disappointing…
The fourth test set of 2,000 sentences shows scores similar to those of the other in-domain test sets.

Very short sentences will give you high scores. 

This might indeed be another factor boosting the scores. Almost half of the sentences in test sets A and B are quite short.
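
To put a number on that, I counted sentence lengths in the reference sides roughly like this (again just a sketch; the file name is a placeholder for my own data):

# rough sentence-length histogram for the German reference of a test set
from collections import Counter

lengths = Counter()
with open("testsetA.de", encoding="utf-8") as f:
    for line in f:
        n = len(line.split())
        if n == 0:
            continue
        elif n <= 20:
            lengths["short (<= 20 words)"] += 1
        elif n <= 50:
            lengths["medium (21-50 words)"] += 1
        else:
            lengths["long (> 50 words)"] += 1

total = sum(lengths.values())
for bucket, count in lengths.most_common():
    print(f"{bucket}: {count} ({100.0 * count / total:.1f}%)")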

 

To conclude, could one say that I've created an engine that is suitable for a specific domain, but whose performance outside that domain is close to zero?

 

Best,

Daniel

 

From: miles...@gmail.com [mailto:miles...@gmail.com] On Behalf Of Miles Osborne
Sent: 26 April 2012 21:17
To: John D Burger
Cc: Daniel Schaut; moses-support@mit.edu
Subject: Re: [Moses-support] Higher BLEU/METEOR score than usual for EN-DE

 

Very short sentences will give you high scores. 

Also multiple references will boost them

Miles

On Apr 26, 2012 8:13 PM, "John D Burger" <j...@mitre.org> wrote:

I =think= I recall that pairwise BLEU scores for human translators are usually 
around 0.50, so anything much better than that is indeed suspect.

- JB

On Apr 26, 2012, at 14:18 , Daniel Schaut wrote:

> Hi all,
>
>
> I’m running some experiments for my thesis and I’ve been told by a more 
> experienced user that the achieved scores for BLEU/METEOR of my MT engine 
> were too good to be true. Since this is the very first MT engine I’ve ever 
> made and I am not experienced with interpreting scores, I really don't know 
> how to judge them. The first test set achieves a BLEU score of 0.6508 
> (v13). METEOR’s final score is 0.7055 (v1.3, exact, stem, paraphrase). A 
> second test set indicated a slightly lower BLEU score of 0.6267 and a METEOR 
> score of 0.6748.
>
>
> Here are some basic facts about my system:
>
> Decoding direction: EN-DE
>
> Training corpus: 1.8 mil sentences
>
> Tuning runs: 5
>
> Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
>
> LM type: trigram
>
> TM type: unfactored
>
>
> I’m now trying to figure out if these scores are realistic at all, as 
> different papers indicate by far lower BLEU scores, e.g. Koehn and Hoang 
> 2011. Any comments regarding the mentioned decoding direction and related 
> scores will be much appreciated.
>
>
> Best,
>
> Daniel
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
