Hi Koehn,

The phrase-table has too many lines for me to check by hand. I compared both 
files and found that the corpus may not be clean enough; there are a lot of 
non-meaningful phrases.

[root@Redhat-251 tmp]# wc -l phrase-table
19992218 phrase-table
[root@Redhat-251 tmp]# wc -l phrase-table1
21546088 phrase-table1
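
One way to check exactly which entries are new is to compare just the phrase 
pairs in the two tables. This is only a rough sketch; it assumes phrase-table 
was built from the original corpus, phrase-table1 from the duplicated one, and 
that both use the standard Moses "source ||| target ||| scores" format:

# keep only the source/target phrases, drop the scores and alignments
awk -F' \\|\\|\\| ' '{print $1" ||| "$2}' phrase-table  | sort -u > pairs.orig
awk -F' \\|\\|\\| ' '{print $1" ||| "$2}' phrase-table1 | sort -u > pairs.dup
# phrase pairs that only occur in the duplicated-corpus table
comm -13 pairs.orig pairs.dup | head

If the extra pairs look like junk, it might help to clean the corpus before 
training, for example with Moses's clean-corpus-n.perl, which drops empty and 
overly long sentence pairs.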

-----Original Message-----
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Thursday, August 30, 2012 5:02 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] What will happen if training Moses with duplicated 
corpus?

Hi,

this is a bit odd -

if the phrase table is larger, then it must contain phrase pairs that were not 
in the original phrase table. However, these were extracted from the same data 
- why were they not extracted in the first place?

Can you check this?

I am not surprised that the language model is larger, if you used default 
settings, since there will be fewer singletons (actually, none) to be pruned 
out, but I would have expected a bigger increase than 10%.
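
For reference, a minimal sketch of how the singleton cutoffs can be changed in 
SRILM (file names here are just placeholders):

# by default ngram-count drops n-grams of order 3 and above that occur only
# once; setting the cutoff to 1 keeps them (for higher orders, -gt4min 1 etc.)
ngram-count -order 3 -text corpus.en -lm lm.arpa -gt3min 1

In a doubled corpus every count is even, so there are no singletons left to 
prune, which is why that language model comes out larger.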

-phi

On Tue, Aug 28, 2012 at 7:23 PM, Tan, Jun <jun....@emc.com> wrote:
> Hi Koehn,
>
> Thanks for your reply.
> I checked both phrase-tables, and most of the entries are the same. The main 
> difference is that the phrase-table built from the duplicated corpus is about 
> 5% larger than the one built from the original corpus. The language model 
> built from the duplicated corpus is about 10% larger than the one built from 
> the original corpus.
>
> The tuning process is the same for both Moses engines; the only change is the 
> training data. The steps and the tuning data are identical for both of them.
>
>
> -----Original Message-----
> From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of 
> Philipp Koehn
> Sent: Wednesday, August 29, 2012 4:31 AM
> To: Tan, Jun
> Cc: moses-support@mit.edu
> Subject: Re: [Moses-support] What will happen if training Moses with 
> duplicated corpus?
>
> Hi,
>
> It is not obvious to me why this would happen due to data duplication - there 
> are things like Good Turing smoothing that would be affected by count 
> doubling, but that is not turned on by default. Do the phrase translation 
> tables look at all different?
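>
> (A rough illustration of why count doubling would matter if Good-Turing were 
> enabled: the discounted count is c* = (c+1) * N(c+1) / N(c), where N(c) is 
> the number of distinct events seen exactly c times. Duplicating the corpus 
> doubles every count, so all counts become even, N(1) drops to 0, and the 
> count-of-count statistics the estimator relies on change completely.)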
>
> There is a clear effect on language model training if you double the data, 
> because SRILM's ngram-count by default drops higher-order singletons (which 
> would not exist in a doubled corpus).
>
> It may just be due to different tuning runs (which are random processes that 
> add noise). You could check this by re-using the weights from the other run, 
> and vice versa.
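>
> (A rough sketch of that check, with placeholder file names: take the feature 
> weights that tuning produced for one system, put them into the other 
> system's moses.ini, decode the same test set with both configurations, and 
> score the outputs against the same references, e.g.
>
> moses -f moses.swapped-weights.ini < test.src > test.out
> multi-bleu.perl test.ref < test.out
>
> If the BLEU difference largely disappears once both systems use the same 
> weights, the 1.2% gap was mostly tuning noise.)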
>
> -phi
>
> On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun <jun....@emc.com> wrote:
>> Hi all,
>>
>>
>>
>> Just like the thread title says, what will happen in that situation?
>>
>> I ran an experiment to create two Moses translation models, one built 
>> from the original corpus and the other built from two copies of the 
>> same corpus. In the end, I found that the BLEU scores of the two models 
>> are slightly different: the model built from two copies of the same 
>> corpus scores about 1.2% higher than the engine built from the original 
>> corpus.
>>
>>
>>
>> Can anybody tell me whether this is normal? What is the impact if I use 
>> many copies of the same corpus to create the model?
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
