Hi,

It is not obvious to me why this would happen due to
data duplication. There are things like Good-Turing
smoothing that would be affected by count doubling,
but that is not turned on by default. Do the phrase
translation tables look at all different?

There is a clear effect on language model training
if you double the data, because SRILM's ngram-count
by default drops higher-order singletons (which would
not exist in a doubled corpus).
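
To see why, here is a minimal sketch (toy sentences, not a real Moses/SRILM pipeline) showing that duplicating a corpus sentence-by-sentence doubles every n-gram count, so no n-gram can have a count of exactly 1 anymore:

```python
from collections import Counter


def bigram_counts(sentences):
    """Count bigrams per sentence (no bigrams across sentence boundaries)."""
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        counts.update(zip(toks, toks[1:]))
    return counts


def singletons(counts):
    """Bigrams that occur exactly once; ngram-count drops these by default."""
    return [ng for ng, n in counts.items() if n == 1]


corpus = ["the cat sat on the mat", "a dog sat on a log"]
doubled = corpus * 2  # two copies of the same corpus

print(len(singletons(bigram_counts(corpus))))   # several singleton bigrams
print(len(singletons(bigram_counts(doubled))))  # 0: every count is even now
```

So with the doubled corpus, nothing gets discarded as a singleton, and the resulting language model keeps n-grams that the original model threw away.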

It may just be due to different tuning runs (which are
random processes that add noise). You could check this
by re-using the weights from the other run, and vice versa.

-phi

On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun <jun....@emc.com> wrote:
> Hi all,
>
>
>
> Just like the thread title says, what will happen in that situation?
>
> I did an experiment to create two Moses translation models, one created from
> the original corpus, the other created from two copies of the same corpus.
> In the end, I found that the BLEU score is a little different between the
> two models.  The model with two copies of the same corpus is about 1.2% higher
> than the engine created from the original corpus.
>
>
>
> Can anybody tell me whether this is normal?   What's the impact if I use a
> lot of copies of the same corpus to create the model?
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
