Hi all,

could any of you point me to some materials on how to select sample
data for BLEU/NIST evaluation?
I mean, how many lines of data should I choose for the evaluation, and how
can I choose the data so that it is representative of our domain/use?
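
For reference, here is roughly how I selected the samples (a simplified
Python sketch of what I am doing; the file names are placeholders for our
in-domain parallel corpus):

import random

# Draw a held-out sample from in-domain parallel data.
SAMPLE_SIZE = 1000
random.seed(42)  # fixed seed so the sample can be reproduced

with open("corpus.src", encoding="utf-8") as f_src, \
     open("corpus.tgt", encoding="utf-8") as f_tgt:
    pairs = list(zip(f_src, f_tgt))  # keep source/target lines aligned

# Uniform random sampling, so the sample should mirror the domain mix
# of the full corpus (the corpus must have at least SAMPLE_SIZE lines).
sample = random.sample(pairs, SAMPLE_SIZE)

with open("test.src", "w", encoding="utf-8") as out_src, \
     open("test.tgt", "w", encoding="utf-8") as out_tgt:
    for src_line, tgt_line in sample:
        out_src.write(src_line)
        out_tgt.write(tgt_line)

Is uniform random sampling like this enough, or should the test set be
stratified by document/topic?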


I have tried generating BLEU scores using a sample of 1000 lines and a
sample of 12000 lines, both of which are in our domain, but the second
evaluation produced higher scores. Does this make sense?
I actually trained two Moses engines: in the first evaluation (1000 lines),
Moses Engine 1's score is lower than Moses Engine 2's, but in the second
(12000 lines), Moses Engine 1's score is higher than Moses Engine 2's.
Which result should I trust? This phenomenon makes me trust the scores less.
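
To explain what I mean by trusting a score: I was wondering whether paired
bootstrap resampling over the test set would show if the ranking difference
between the two engines is real or just sampling noise. Here is a minimal
sketch of my understanding, assuming the sacrebleu Python package; the
output file names are placeholders, and this is not how the scores above
were produced:

import random
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = read_lines("test.tgt")     # reference translations
sys1 = read_lines("engine1.out")  # Moses Engine 1 output
sys2 = read_lines("engine2.out")  # Moses Engine 2 output
assert len(refs) == len(sys1) == len(sys2)

random.seed(42)
n, trials, wins1 = len(refs), 1000, 0
for _ in range(trials):
    # Resample test sentences with replacement and score both systems
    # on the same resampled subset (hence "paired").
    idx = [random.randrange(n) for _ in range(n)]
    r = [refs[i] for i in idx]
    bleu1 = sacrebleu.corpus_bleu([sys1[i] for i in idx], [r]).score
    bleu2 = sacrebleu.corpus_bleu([sys2[i] for i in idx], [r]).score
    if bleu1 > bleu2:
        wins1 += 1

# If one engine wins on, say, 95% or more of the resamples, the
# difference is unlikely to be noise from the choice of test sentences.
print(f"Engine 1 wins {wins1} of {trials} resamples")

Would something like this be the right way to decide between the two
results?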

Does anybody have similar experiences? Is there a problem with my
evaluation data?
How can I generate more accurate scores?

Thanks so much,
Wenlong
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
