Hi Kenneth,

For my experiments, decoding compound nouns with moses works fine when the
phrase table assigns a high probability to the correct phrase; otherwise the
process can get quite messy. Decoding compound nouns with moses_chart is
sometimes a bit more complicated, e.g. when the modifier and the head of a
compound are swapped. Where necessary, I adjust the reordering weights to get
better results. Compound verbs are even more difficult, as they sometimes
split into two parts that take different positions in the sentence. I've had
several ideas for further experiments:

1. adding special rules to the rule table of moses_chart
2. using a factored model
3. creating a second, separate LM containing only compound units
4. testing a hybrid SMT system, e.g. the one Groves and Way describe
5. resolving compound noun contractions with a regex pass before decoding
(see the sketch after this list); Schmid proposes a similar method for
resolving verb contractions in English during tokenization
6. pre-editing
7. etc.
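
To make idea 5 a bit more concrete, here is a minimal sketch of what such a
regex pass could look like, assuming it targets slash-contracted noun pairs
in the English source before decoding. The pattern, the function name and the
example sentence are purely illustrative, not what I actually run:

import re

# Sketch only: expand slash-contracted noun pairs such as
# "input/output signal" into "input signal and output signal", so the
# decoder sees two ordinary noun phrases instead of a rare contracted
# token. The pattern is deliberately naive and will over-match forms
# like "and/or"; a real pass would need a stop list.
SLASH_PAIR = re.compile(r"\b([a-z]+)/([a-z]+)\s+([a-z]+)\b", re.IGNORECASE)

def expand_slash_contractions(sentence):
    """Rewrite 'X/Y Z' as 'X Z and Y Z' before tokenization."""
    return SLASH_PAIR.sub(r"\1 \3 and \2 \3", sentence)

print(expand_slash_contractions("Check the input/output signal first."))
# -> Check the input signal and output signal first.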

Best,
Daniel

-----Original Message-----
From: Kenneth Heafield [mailto:mo...@kheafield.com] 
Sent: 27 April 2012 17:20
To: Daniel Schaut
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Higher BLEU/METEOR score than usual for EN-DE

Hi,

        Since this is EN-DE, how are you processing German compounds?

Kenneth

On 04/27/2012 07:43 AM, Daniel Schaut wrote:
> Hi guys,
>
> Thank you for your comprehensive comments.
>
> The most likely thing is that you have some of your test set included 
> in your training set,
>
> Indeed, there are some similarities owing to the domain (instruction 
> manuals). Typically, all kinds of manuals show a high degree of 
> similarity, e.g. at the sub-segment level. I extracted test set A and 
> the tuning sets from the whole corpus before training my engine to 
> make sure that test set A doesn’t interfere with the training set. 
> Hmmm… that’s an epic fail then… Test set B was provided at a much 
> later stage, when the training process was already done.
>
> Did you try looking at the sentences? -- 1,000 is few enough to 
> eyeball them. Have you tried the same system with a different corpus 
> (e.g. EuroParl)? Have you checked that your test set and your 
> training set do not intersect?
>
> Apart from scoring, I checked almost every sentence in both test sets 
> for my thesis. The quality of the output is moderate for sentences of 
> up to 50 words; everything beyond that is of lesser quality. 
> Sentences of up to 20 words in particular come out at a good level.
>
> I’ve just prepared a third and a fourth test set, one from the 
> OpenOffice corpus files and one from another bunch of in-domain 
> files. For the OO files (2,000 sentences), BLEU is 0.0858 and METEOR 
> is 0.3031. Kind of disappointing… The fourth test set of 2,000 
> sentences shows scores similar to the other in-domain test sets.
>
> Very short sentences will give you high scores.
>
> This might indeed be another factor boosting the scores. On average, 
> almost half of the sentences in test sets A and B are quite short.
>
> To conclude, could one say that I’ve created an engine suitable for a 
> specific domain, but one whose performance outside that domain is 
> close to zero?
>
> Best,
>
> Daniel
>
> *From:* miles...@gmail.com [mailto:miles...@gmail.com] *On behalf of* 
> Miles Osborne
> *Sent:* 26 April 2012 21:17
> *To:* John D Burger
> *Cc:* Daniel Schaut; moses-support@mit.edu
> *Subject:* Re: [Moses-support] Higher BLEU/METEOR score than usual for 
> EN-DE
>
> Very short sentences will give you high scores.
>
> Also multiple references will boost them
>
> Miles
>
> On Apr 26, 2012 8:13 PM, "John D Burger" <j...@mitre.org> wrote:
>
> I =think= I recall that pairwise BLEU scores for human translators are 
> usually around 0.50, so anything much better than that is indeed suspect.
>
> - JB
>
> On Apr 26, 2012, at 14:18 , Daniel Schaut wrote:
>
>  > Hi all,
>  >
>  >
>  > I’m running some experiments for my thesis and I’ve been told by a 
> more experienced user that the BLEU/METEOR scores achieved by my MT 
> engine were too good to be true. Since this is the very first MT 
> engine I’ve ever built and I am not experienced with interpreting 
> scores, I really don’t know how to judge them. The first test set 
> achieves a BLEU score of 0.6508 (v13). METEOR’s final score is 0.7055 
> (v1.3, exact, stem, paraphrase). A second test set shows a slightly 
> lower BLEU score of 0.6267 and a METEOR score of 0.6748.
>  >
>  >
>  > Here are some basic facts about my system:
>  >
>  > Decoding direction: EN-DE
>  >
>  > Training corpus: 1.8 mil sentences
>  >
>  > Tuning runs: 5
>  >
>  > Test sets: a) 2,000 sentences, b) 1,000 sentences (both in-domain)
>  >
>  > LM type: trigram
>  >
>  > TM type: unfactored
>  >
>  > I’m now trying to figure out if these scores are realistic at all, 
> as different papers report far lower BLEU scores, e.g. Koehn and 
> Hoang 2011. Any comments regarding the mentioned decoding direction 
> and related scores will be much appreciated.
>  >
>  >
>  > Best,
>  >
>  > Daniel
>  >


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
