Hi Barry,

Thanks for your information.
The scores are calculated by MultiEval on the test set, and I used only one
reference in development.
I re-calculated the BLEU score via multi-bleu.pl:
BLEU = 29.02, 65.8/36.2/22.0/13.7 (BP=0.996, ratio=0.996, hyp_len=19684, 
ref_len=19755)
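
(As a quick sanity check, assuming multi-bleu.pl uses the standard BLEU
formula, BP x the geometric mean of the four n-gram precisions:
BP = exp(1 - 19755/19684) ~ 0.996, and
0.996 x (0.658 x 0.362 x 0.220 x 0.137)^(1/4) ~ 0.996 x 0.291 ~ 0.290,
i.e. 29.0, which is consistent with the 29.02 reported above.)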

It's very close to the score calculated by MultiEval now.
I'm also very interested in the multiple references. Does that mean I need to
use multiple development sets to tune the MT engine's weights?

Thanks,
Jun



-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk] 
Sent: Thursday, 24 January 2013 5:44 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] The BLEU score from MultiEval is much lower than
the one generated by the Moses mert-moses.pl script

Hi Jun

mert-moses.pl is not an evaluation script; it's for tuning the MT
engine. It will report BLEU scores obtained during tuning, but these are
on the development set. The scores you're showing from MultiEval are (I
hope!) on the test set, which would make them different. That's quite a
big difference between development and test, though - are you using
multiple references in development?
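
To be clear, "multiple references" means a single development set in which
each source sentence has several independent reference translations, not
several development sets. If I remember the Moses convention correctly, you
keep one input file and number the reference files with a shared stem, e.g.
(file names here are just an example):

  dev.input   # source sentences, one per line
  dev.ref0    # first reference translation, line-aligned with dev.input
  dev.ref1    # second reference translation

and pass the stem (dev.ref) as the reference argument to mert-moses.pl.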

The NaNs in the MultiEval output are a bit strange. I'm not familiar
with this tool, but Moses contains multi-bleu.pl (in scripts/generic),
which you can also use to calculate BLEU.
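
Typical usage is something like this (substitute your own file names):

  scripts/generic/multi-bleu.pl reference.txt < hypothesis.txt

I believe it also accepts a reference stem, so passing "ref" will pick up
ref0, ref1, ... if you have multiple reference files.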

cheers - Barry

On 24/01/13 02:49, Tan, Jun wrote:
> Hello all,
> I have created an English-Chinese MT engine via Moses, and I'm doing a
> translation quality evaluation of this engine. I have an evaluation
> report created by the MultiEval tool on about 1000 sentences. I found
> the BLEU score is much lower than the score generated by the
> mert-moses.pl script: it's only 0.3 from MultiEval, but 0.65 from
> mert-moses.pl.
> MultiEval report:
>
>              BLEU (s_sel/s_opt/p)  METEOR (s_sel/s_opt/p)  TER (s_sel/s_opt/p)  Length (s_sel/s_opt/p)
> EMC DATA     29.0 (0.6/NaN/-)      31.7 (0.3/NaN/-)        57.1 (0.7/NaN/-)     100.4 (0.6/NaN/-)
> TAUS DATA    21.8 (0.5/NaN/0.00)   28.1 (0.2/NaN/0.00)     61.8 (0.6/NaN/0.00)  97.5 (0.6/NaN/0.00)
>
> Top unmatched hypothesis words according to METEOR:
> [的 x 341, , x 177, 在 x 117, " x 91, 和 x 85, 中 x 84, 到 x 84, 将 x 74, / x 65, 一个 x 65]
> [的 x 436, , x 273, 在 x 163, 将 x 85, 中 x 82, 时 x 71, 上 x 65, 以 x 54, 为 x 52, 数据 x 50]
> [的 x 400, , x 197, 在 x 139, 一个 x 91, 数据 x 89, 将 x 89, 是 x 85, " x 85, 和 x 82, 数据域 x 77]
> [的 x 369, , x 227, 在 x 151, Domain x 139, Data x 136, 数据 x 115, 上 x 96, 中 x 93, 将 x 86, 消除 x 83]
> I have the following questions regarding this issue:
>
>  1. What could cause this discrepancy?
>  2. Has anyone else had a similar experience?
>  3. Is this normal?
>  4. Which tool do you recommend for MT evaluation?
>  5. How can I improve the engine according to the MultiEval report?
>
> Any questions or suggestions are welcome ~
> Thanks,
> Jun


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


