Hi, this is a bit odd -
if the phrase table is larger, then it must contain phrase pairs that were
not in the original phrase table. However, these were extracted from the
same data - why were they not extracted in the first place? Can you check
this?

I am not surprised that the language model is larger, if you used default
settings, since there will be fewer singletons (actually, none) to be
pruned out, but I would have expected a bigger increase than 10%.

-phi

On Tue, Aug 28, 2012 at 7:23 PM, Tan, Jun <jun....@emc.com> wrote:
> Hi Koehn,
>
> Thanks for your reply.
> I checked both phrase tables; most entries are the same. The difference is
> that the phrase table created from the duplicated corpus is about 5% larger
> than the one from the original corpus. For the language model, the one from
> the duplicated corpus is 10% larger than the one from the original corpus.
>
> I think the tuning processes are the same for both Moses engines; the only
> change is the training data. The steps and the tuning data are the same for
> both of them.
>
>
> -----Original Message-----
> From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
> Sent: Wednesday, August 29, 2012 4:31 AM
> To: Tan, Jun
> Cc: moses-support@mit.edu
> Subject: Re: [Moses-support] What will happen if training Moses with
> duplicated corpus?
>
> Hi,
>
> It is not obvious to me why this would happen due to data duplication - there
> are things like Good-Turing smoothing that would be affected by count
> doubling, but that is not turned on by default. Do the phrase translation
> tables look at all different?
>
> There is a clear effect on language model training if you double the data,
> because SRILM's ngram-count by default drops higher-order singletons (which
> would not exist in a doubled corpus).
>
> It may just be due to different tuning runs (which are random processes that
> add noise). You could check this by re-using the weights from the other run,
> and vice versa.
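[Editor's note: the singleton effect described above can be illustrated with a short sketch. This is a toy example with made-up sentences, not an invocation of SRILM itself; it only mimics the default behaviour of dropping count-1 higher-order n-grams. Duplicating a corpus doubles every n-gram count, so no n-gram is a singleton and the cutoff prunes nothing - hence the larger model.]

```python
from collections import Counter

def trigram_counts(sentences):
    """Count trigrams over a list of tokenized sentences."""
    counts = Counter()
    for toks in sentences:
        for i in range(len(toks) - 2):
            counts[tuple(toks[i:i + 3])] += 1
    return counts

# Toy corpus (hypothetical data for illustration only).
corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the dog sat on the rug",
]]

single = trigram_counts(corpus)       # original corpus
doubled = trigram_counts(corpus * 2)  # same data, two copies

# Mimic a singleton cutoff: keep only trigrams seen more than once.
kept_single = {g: c for g, c in single.items() if c > 1}
kept_doubled = {g: c for g, c in doubled.items() if c > 1}

# In the doubled corpus every count is even, so nothing is pruned.
print(len(kept_single), "trigrams kept from original corpus")
print(len(kept_doubled), "trigrams kept from doubled corpus")
```

In this toy run only one trigram survives the cutoff in the original corpus, while all seven survive in the doubled one, which is the mechanism behind the larger language model.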
>
> -phi
>
> On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun <jun....@emc.com> wrote:
>> Hi all,
>>
>> Just like the thread title says, what will happen in that situation?
>>
>> I did an experiment to create two Moses translation models, one created
>> from the original corpus, the other created from two copies of the same
>> corpus. In the end, I found that the BLEU scores of the two models differ
>> a little. The model built from two copies of the same corpus scores about
>> 1.2% higher than the engine created from the original corpus.
>>
>> Can anybody tell me whether this is normal? What is the impact if I use
>> many copies of the same corpus to create the model?
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
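[Editor's note: to follow up on the suggestion to compare the two phrase tables, a sketch like the following can list phrase pairs present in one table but not the other. The table contents here are hypothetical; in practice you would read the real files with open() - the standard Moses format of " ||| "-separated fields is assumed.]

```python
def phrase_pairs(lines):
    """Extract (source, target) pairs from Moses-style phrase-table lines."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ||| ")
        pairs.add((fields[0], fields[1]))
    return pairs

# Toy tables standing in for the two real phrase-table files.
orig_table = [
    "das Haus ||| the house ||| 0.8 0.7 0.8 0.7",
    "Haus ||| house ||| 0.9 0.8 0.9 0.8",
]
dup_table = orig_table + [
    "das ||| the ||| 0.6 0.5 0.6 0.5",  # pair only in the larger table
]

orig = phrase_pairs(orig_table)
dup = phrase_pairs(dup_table)

extra = dup - orig
print("pairs only in duplicated-corpus table:", sorted(extra))
```

Running this over the actual tables would show exactly which phrase pairs account for the 5% size difference, which is the open question in the thread.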