Hi all,

Why are MT test sets the sizes they are? Most contain between 1200 and 3000
sentences, usually with one reference, though a few have 4 references. How
are these sizes justified? I am sure they are not arbitrary, but I have not
found an answer in the conference proceedings I have looked at. What is the
goal? (For example, maybe the goal is that a difference of 0.1 BLEU is
statistically significant at the 95% confidence level...)
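
To make the significance question concrete, here is a minimal sketch of
paired bootstrap resampling, one common way to check whether a score
difference between two systems holds up on a given test set. Everything in
it is an assumption for illustration: the function name paired_bootstrap is
hypothetical, and the inputs are per-sentence quality scores whose mean
stands in for the metric. Real corpus-level BLEU is not a mean of sentence
scores, so an actual test would recompute BLEU on each resampled set.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resampled test sets on which A outscores B.

    scores_a and scores_b are hypothetical per-sentence scores for two
    systems on the SAME test sentences. The mean is a stand-in for the
    metric; it is not corpus-level BLEU.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement; using the same indices
        # for both systems is what makes the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

If the returned fraction is at least 0.95, A's advantage would conventionally
be called significant at the 95% level. With more sentences the resampled
means vary less, so a small true difference passes this check more often,
which is one way a target test set size could be derived.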

What about multiple references? Is it better to have a test set with 1200
sentences and 4 references, or a test set with 4800 sentences and 1
reference? Any intuition?

Thanks, everyone. I have been curious about this for a while, and am sure
there is much insight to be gained from the people on this forum!

Kazi


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
