Hi,

It is not obvious to me why this would happen due to data duplication. There are things like Good-Turing smoothing that would be affected by count doubling, but that is not turned on by default. Do the phrase translation tables look at all different?
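(A toy sketch of why count doubling matters for Good-Turing: the estimator c* = (c+1) * N_{c+1} / N_c depends on the counts-of-counts N_c, and doubling the corpus empties every odd count class, so the smoothed counts change drastically. This is an illustrative stand-alone script, not Moses or SRILM code.)

```python
from collections import Counter

def good_turing(counts):
    # Good-Turing smoothed count: c* = (c+1) * N_{c+1} / N_c,
    # where N_c is the number of types observed exactly c times.
    freq_of_freq = Counter(counts.values())
    return {w: (c + 1) * freq_of_freq.get(c + 1, 0) / freq_of_freq[c]
            for w, c in counts.items()}

counts = Counter("a a a b b c".split())      # a:3, b:2, c:1
doubled = Counter("a a a b b c".split() * 2) # a:6, b:4, c:2

print(good_turing(counts))   # {'a': 0.0, 'b': 3.0, 'c': 2.0}
print(good_turing(doubled))  # {'a': 0.0, 'b': 0.0, 'c': 0.0}
```

On the doubled corpus every count is even, so N_{c+1} is zero for each observed c and the toy estimates collapse; real corpora are less degenerate, but the estimates still shift.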
There is a clear effect on language model training if you double the data, because SRILM's ngram-count by default drops higher-order singletons (which would not exist in a doubled corpus). It may also just be due to different tuning runs, which are random processes that add noise. You could check this by re-using the weights from the other run, and vice versa.

-phi

On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun <jun....@emc.com> wrote:
> Hi all,
>
> Just like the thread title says, what will happen in that situation?
>
> I did an experiment to create two Moses translation models, one created from
> the original corpus, the other created from two copies of the same corpus.
> In the end, I found that the BLEU score is a little different between the
> two models. The model built from two copies of the same corpus is about 1.2%
> higher than the engine created from the original corpus.
>
> Can anybody tell me whether this is normal? What is the impact if I use a
> lot of copies of the same corpus to create the model?
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
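(The singleton effect described above is easy to see on a toy example: every higher-order n-gram that occurs once in the original corpus occurs twice in the doubled corpus, so a count cutoff of 2 — mimicking what ngram-count's default singleton dropping does for higher orders — removes everything from the former and nothing from the latter. This is an illustrative sketch; the function names and the min_count parameter are made up, not SRILM's API.)

```python
from collections import Counter

def trigram_counts(sentences):
    # Count trigrams per sentence, so duplicating the corpus
    # never creates new cross-sentence trigrams.
    counts = Counter()
    for s in sentences:
        toks = s.split()
        counts.update(zip(toks, toks[1:], toks[2:]))
    return counts

def apply_singleton_cutoff(counts, min_count=2):
    # Mimics dropping higher-order singletons (count < 2).
    return {g: c for g, c in counts.items() if c >= min_count}

sent = "the cat sat on the mat"
once = trigram_counts([sent])         # 4 trigrams, each seen once
twice = trigram_counts([sent, sent])  # same 4 trigrams, each seen twice

print(len(apply_singleton_cutoff(once)))   # 0 trigrams survive
print(len(apply_singleton_cutoff(twice)))  # all 4 survive
```

So the doubled-corpus LM keeps trigrams the single-copy LM throws away, which is one plausible source of the BLEU difference.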