Hi all,

Why are MT test sets the sizes they are? Most have between 1200 and 3000 sentences, usually with one reference each, though a few have 4 references. How are these sizes justified? I am sure they are not arbitrary, but I have not found an answer in the conference proceedings I have looked at. What is the goal? (For example, maybe the goal is that a difference of 0.1 BLEU is statistically significant at the 95% confidence level...)
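To make that example goal concrete: one common way to check whether a score difference between two systems is significant is paired bootstrap resampling. Below is a minimal sketch of the idea, not a definitive implementation. It assumes per-sentence scores are available for both systems and that the test-set score is their mean, which is a simplification, since corpus-level BLEU is not a mean of sentence scores. The function name `paired_bootstrap` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of bootstrap resamples in which system A
    has a higher mean score than system B.

    scores_a, scores_b: per-sentence scores on the SAME test sentences
    (assumption: the test-set score is the mean of sentence scores).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement and use the same
        # indices for both systems, so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

Under this sketch, A would be called better than B at the 95% level if the returned fraction is at least 0.95. With more test sentences the resampled means vary less, so smaller differences can reach that threshold, which is why I suspect test set size is tied to some target like this.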
What about multiple references? Is it better to have a test set with 1200 sentences and 4 references each, or a test set with 4800 sentences and 1 reference each? Any intuition?

Thanks, everyone. I have been curious about this for a while, and I am sure there is much insight to be gained from the people on this list!

Kazi
