Hi, this is a bit odd -
if the phrase table is larger, then it must contain phrase pairs that were
not in the original phrase table. However, these were extracted from the
same data - why were they not extracted in the first place? Can you check
this?

I am not surprised that the language model is larger, if you used default
settings, since there will be fewer singletons (actually, none) to be
pruned out, but I would have expected a bigger increase than 10%.

-phi

On Tue, Aug 28, 2012 at 7:23 PM, Tan, Jun <jun....@emc.com> wrote:
> Hi Koehn,
>
> Thanks for your reply.
> I checked both phrase tables; most entries are the same. The difference is
> that the phrase table created from the duplicated corpus is about 5% larger
> than the one from the original corpus. For the language model, the one from
> the duplicated corpus is 10% larger than the one from the original corpus.
>
> I think the tuning processes are the same for both Moses engines; the only
> change is the training data. The steps and the tuning data are the same for
> both of them.
>
>
> -----Original Message-----
> From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
> Sent: Wednesday, August 29, 2012 4:31 AM
> To: Tan, Jun
> Cc: moses-support@mit.edu
> Subject: Re: [Moses-support] What will happen if training Moses with
> duplicated corpus?
>
> Hi,
>
> It is not obvious to me why this would happen due to data duplication - there
> are things like Good-Turing smoothing that would be affected by count
> doubling, but that is not turned on by default. Do the phrase translation
> tables look at all different?
>
> There is a clear effect on language model training if you double the data,
> because SRILM's ngram-count by default drops higher-order singletons (which
> would not exist in a doubled corpus).
>
> It may just be due to different tuning runs (which are random processes that
> add noise). You could check this by re-using the weights from the other run,
> and vice versa.
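[Editor's note: the singleton effect described above can be illustrated with a short sketch. This is a toy example with made-up sentences, not an invocation of SRILM itself; it only mimics the default behaviour of dropping count-1 higher-order n-grams. Duplicating a corpus doubles every n-gram count, so no n-gram is a singleton and the cutoff prunes nothing - hence the larger model.]

```python
from collections import Counter

def trigram_counts(sentences):
    """Count trigrams over a list of tokenized sentences."""
    counts = Counter()
    for toks in sentences:
        for i in range(len(toks) - 2):
            counts[tuple(toks[i:i + 3])] += 1
    return counts

# Toy corpus (hypothetical data for illustration only).
corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the dog sat on the rug",
]]

single = trigram_counts(corpus)       # original corpus
doubled = trigram_counts(corpus * 2)  # same data, two copies

# Mimic a singleton cutoff: keep only trigrams seen more than once.
kept_single = {g: c for g, c in single.items() if c > 1}
kept_doubled = {g: c for g, c in doubled.items() if c > 1}

# In the doubled corpus every count is even, so nothing is pruned.
print(len(kept_single), "trigrams kept from original corpus")
print(len(kept_doubled), "trigrams kept from doubled corpus")
```

In this toy run only one trigram survives the cutoff in the original corpus, while all seven survive in the doubled one, which is the mechanism behind the larger language model.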
>
> -phi
>
> On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun <jun....@emc.com> wrote:
>> Hi all,
>>
>> Just like the thread title says, what will happen in that situation?
>>
>> I did an experiment to create two Moses translation models, one created
>> from the original corpus, the other created from two copies of the same
>> corpus. In the end, I found that the BLEU scores of the two models differ
>> a little. The model built from two copies of the same corpus scores about
>> 1.2% higher than the engine created from the original corpus.
>>
>> Can anybody tell me whether this is normal? What is the impact if I use
>> many copies of the same corpus to create the model?
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
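[Editor's note: to follow up on the suggestion to compare the two phrase tables, a sketch like the following can list phrase pairs present in one table but not the other. The table contents here are hypothetical; in practice you would read the real files with open() - the standard Moses format of " ||| "-separated fields is assumed.]

```python
def phrase_pairs(lines):
    """Extract (source, target) pairs from Moses-style phrase-table lines."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ||| ")
        pairs.add((fields[0], fields[1]))
    return pairs

# Toy tables standing in for the two real phrase-table files.
orig_table = [
    "das Haus ||| the house ||| 0.8 0.7 0.8 0.7",
    "Haus ||| house ||| 0.9 0.8 0.9 0.8",
]
dup_table = orig_table + [
    "das ||| the ||| 0.6 0.5 0.6 0.5",  # pair only in the larger table
]

orig = phrase_pairs(orig_table)
dup = phrase_pairs(dup_table)

extra = dup - orig
print("pairs only in duplicated-corpus table:", sorted(extra))
```

Running this over the actual tables would show exactly which phrase pairs account for the 5% size difference, which is the open question in the thread.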