Jehan, here are my strategies; others may vary.

 1/ The 100-word (token) limit is a constraint imposed by GIZA++ and 
 MGIZA++, not just a convenience for speed. If you make the effort to 
 use the BerkeleyAligner, this limit disappears.
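 
 If you want to do the length filtering yourself (Moses also provides a 
 corpus cleaning script for this), a minimal Python sketch, assuming 
 hypothetical parallel files named corpus.fr/corpus.en, would be:
 
    # Drop sentence pairs where either side exceeds a token limit.
    # File names are placeholders for your own corpus.
    MAX_TOKENS = 100

    with open("corpus.fr", encoding="utf-8") as src, \
         open("corpus.en", encoding="utf-8") as tgt, \
         open("corpus.clean.fr", "w", encoding="utf-8") as src_out, \
         open("corpus.clean.en", "w", encoding="utf-8") as tgt_out:
        for s, t in zip(src, tgt):
            # Count whitespace-separated tokens on each side.
            if len(s.split()) <= MAX_TOKENS and len(t.split()) <= MAX_TOKENS:
                src_out.write(s)
                tgt_out.write(t)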

 2/ From a statistics and survey methodology point of view, your 
 training data is a subset of individual samples selected from a whole 
 population (the linguistic domain) so as to estimate the 
 characteristics of that population. Duplicates can therefore exist, and 
 they play an important role in determining statistical significance and 
 calculating probabilities. Some data sources, however, repeat 
 information with little relevance to the linguistic balance of the 
 whole domain; one example is a web site with repetitive menus on every 
 page. For our use, we keep duplicates where we believe they represent a 
 balanced sampling and the results we want to achieve, and we remove 
 them when they do not. Not everyone, however, agrees with this 
 approach.
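 
 If you want to see how often pairs repeat before deciding, here is a 
 small Python sketch (file names are hypothetical) that lists the most 
 frequent duplicate pairs for manual inspection:
 
    # Report the most frequently repeated (source, target) pairs so you
    # can judge whether they reflect genuine usage or boilerplate such as
    # repeated web-site menus.
    from collections import Counter

    with open("corpus.fr", encoding="utf-8") as src, \
         open("corpus.en", encoding="utf-8") as tgt:
        pair_counts = Counter(zip(src, tgt))

    for (s, t), n in pair_counts.most_common(20):
        if n > 1:
            print(n, s.strip(), "|||", t.strip())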

 3/ Yes, none of the data pairs in the tuning set should be present in 
 your training data. If they are, the tuning weights are skewed to give 
 excellent BLEU scores on the tuning set but horrible scores on "real 
 world" translations.
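 
 A simple way to enforce this, sketched in Python with hypothetical file 
 names, is to drop from the training corpus any pair that also appears 
 in the tuning set:
 
    # Keep only training pairs that do not occur verbatim in the tuning set.
    def read_pairs(src_path, tgt_path):
        with open(src_path, encoding="utf-8") as src, \
             open(tgt_path, encoding="utf-8") as tgt:
            return list(zip(src, tgt))

    tuning_pairs = set(read_pairs("tune.fr", "tune.en"))

    with open("train.filtered.fr", "w", encoding="utf-8") as src_out, \
         open("train.filtered.en", "w", encoding="utf-8") as tgt_out:
        for s, t in read_pairs("train.fr", "train.en"):
            if (s, t) not in tuning_pairs:
                src_out.write(s)
                tgt_out.write(t)
 
 This matches exact pairs; filtering on the source side alone would be 
 stricter and guarantees the tuning source sentences are completely 
 unseen during training.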

 Tom


 On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages <je...@mygengo.com> wrote:
> Hi all,
>
> I have a few questions about the quality of training and tuning. If
> anyone has any clarifications, that would be nice! :-)
>
> 1/ According to the documentation:
> «
> sentences longer than 100 words (and their corresponding translations)
> have to be eliminated
>    (note that a shorter sentence length limit will speed up training)
> »
> So is it only for the sake of training speed or can too long sentences
> end up being a liability in MT quality? In other words, when I finally
> need to train "for real usage", should I really remove long sentences?
>
> 2/ My data is taken from real crowd-sourced translated data. As a
> consequence, we end up with some duplicates (same original text and
> same translation). I wonder whether, for training, duplicates don't
> matter, should be removed, or are actually better to keep.
>
> I would imagine the latter (keeping duplicates) is best, as this is
> "statistical machine learning" and, after all, these represent "real
> life" duplicates (text we often encounter and apparently usually
> translate the same way), so it would be good to "insist on" these
> translations during training.
> Am I right?
>
> 3/ Do training and tuning data necessarily have to be different? I
> guess for it to be meaningful they should, and various examples on the
> website seem to go that way, but I could not find anything clearly
> stating this.
>
> Thanks.
>
> Jehan
>


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
