Re: [Moses-support] Various questions about training and tuning

2011-11-19 Thread Jehan Pages
Hi,

On Fri, Nov 18, 2011 at 11:22 PM, Tom Hoar
 wrote:
> Jehan,
>
> A brute-force method to give some phrases more weight is to simply create
> intentional duplicates in your training data set. Miles' option has more
> finesse.

Though I would definitely fall back on this if I cannot make weights
work in GIZA++, I'd like to try the GIZA++ way first (unless someone
tells me it is actually exactly the same thing?).

I have been trying to find the GIZA++ option to do this, but the
documentation is definitely lacking here, and the web is not very
talkative on the matter either.
A grep through the GIZA++ source suggests that only plain2snt has an
option named weight (unless the feature goes by another name in the
other tools of the archive). It apparently works like this:
---
plain2snt txt1 txt2 [txt3 txt4 -weight w]
 Converts plain text into GIZA++ snt-format
---

That alone is not enough to be sure how it works (though it gives a
hint), and more importantly train-model.perl never seems to call this
tool (nor is it used internally by GIZA++, according to its source).
Since the Moses documentation does not ask to link this tool into bin/
(so I didn't), I am quite sure it is never used during the training
process.
So is there something I am missing? How does one set weights on
training data through GIZA++ (or MGIZA++, whose multi-threading
support I am trying right now)?
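
For reference, here is the kind of fallback I was considering if the
weight never gets exposed through train-model.perl: editing the counts
in the generated .snt files directly. This is only a sketch, based on
my reading that the snt format stores three lines per sentence pair
with a per-pair count on the first line; the file names and the
boosted pair indices below are invented.
---
def reweight_snt(snt_in, snt_out, weights):
    """weights maps a 0-based sentence-pair index to a new integer count."""
    with open(snt_in) as fin, open(snt_out, "w") as fout:
        pair = 0
        while True:
            count = fin.readline()
            if not count:          # end of file
                break
            src = fin.readline()   # source sentence as vocabulary ids
            trg = fin.readline()   # target sentence as vocabulary ids
            if pair in weights:
                count = "%d\n" % weights[pair]
            fout.write(count + src + trg)
            pair += 1

if __name__ == "__main__":
    # give pairs 12 and 57 five times the default count of 1
    reweight_snt("corpus.fr-en.snt", "corpus.fr-en.weighted.snt",
                 {12: 5, 57: 5})
---
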
Thanks.

Jehan

> Tom
>
> On Fri, 18 Nov 2011 18:29:50 +0900, Jehan Pages  wrote:
>>
>> Hi,
>>
>> On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne  wrote:
>>>
>>> re: not tuning on training data, in principle this shouldn't matter
>>> (especially if the tuning set is large and/or representative of the
>>> task).
>>>
>>> in reality, Moses will assign far too much weight to these examples,
>>> at the detriment of the others.  (it will drastically overfit).  this
>>> is why the tuning and training sets are typically disjoint.  this is a
>>> standard tactic in NLP and not just Moses.
>>
>> Ok thanks. Actually I think that reminds me indeed what I learned
>> years ago on the topic (when I was still in university, in fact
>> working on these kind of topics, though now that's kind of far away).
>>
>> [Also, Tom Hoar, forget my questions on what you answer at this point
>> (when I asked "how do you do so?" and such). I misunderstood the
>> meaning of your answer! Now with Miles's answer, and rereading your
>> first one, I understand]
>>
>>> re:  assigning more weight to certain translations, you have two
>>> options here.  the first would be to assign more weight to these pairs
>>> when you run Giza++.  (you can assign per-sentence pair weights at
>>> this stage).  this is really just a hint and won't guarantee anything.
>>>  the second option would be to force translations (using the XML
>>> markup).
>>
>> I see. Interesting. For what I want, the weights on GIZA++ looks nice.
>> I'll try to find information on this.
>>
>> Thanks a lot for the answers.
>>
>> Jehan
>>
>>> Miles
>>>
>>> On 18 November 2011 08:42, Jehan Pages  wrote:

 Hi,

 On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
  wrote:
>
> Jehan, here are my strategies, others may vary.

 Thanks.

> 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++,
> not
> just a convenience for speed. If you make the effort to use the
> BerkeleyAligner, this limit disappears.

 Ok I didn't know this alternative to GIZA++. I see there are some
 explanation on the website for switching to this aligner. I may give
 it a try someday then. :-)

> 2/ From a statistics and survey methodology point of view, your
> training
> data is a subset of individual samples selected from a whole population
> (linguistic domain) so-as to estimate the characteristics of the whole
> population. So, duplicates can exist and they play an important role in
> determining statistical significance and calculating probabilities.
> Some
> data sources, however, repeat information with little relevance to the
> linguistic balance of the whole domain. One example is a web sites with
> repetitive menus on every page. Therefore, for our use, we keep
> duplicates
> where we believe they represent a balanced sampling and results we want
> to
> achieve. We remove them when they do not. Not everyone, however, agrees
> with
> this approach.

 I see. And that confirms my thoughts. I don't know for sure what will
 be my strategy, but I think that will be keeping them all then, most
 probably. Making conditional removal like you do is interesting, but
 that would prove hard to do on our platform as we don't have context
 on translations stored.

> 3/ Yes, none of the data pairs in the tuning set should be present in
> your
> training data. To do so skews the tuning weights to give excellent BLEU
> scores on the tuning results, but horrible scores on "real world"
> translations.

Re: [Moses-support] Various questions about training and tuning

2011-11-18 Thread Tom Hoar
 Jehan,

 A brute-force method to give some phrases more weight is to simply 
 create intentional duplicates in your training data set. Miles' option 
 has more finesse.
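
 Mechanically it is nothing more than repeating the chosen lines on
 both sides of the parallel corpus, for example (a sketch only; the
 file names and the boosted line numbers are made up):
---
def duplicate_pairs(src_in, trg_in, src_out, trg_out, boost, copies=3):
    """Rewrite the parallel corpus, repeating the pairs whose 0-based
    line numbers are in `boost` so that they appear `copies` times."""
    with open(src_in) as fs, open(trg_in) as ft, \
         open(src_out, "w") as out_s, open(trg_out, "w") as out_t:
        for i, (s, t) in enumerate(zip(fs, ft)):
            n = copies if i in boost else 1
            out_s.write(s * n)   # each line already ends with '\n'
            out_t.write(t * n)

if __name__ == "__main__":
    duplicate_pairs("train.fr", "train.en",
                    "train.boosted.fr", "train.boosted.en",
                    boost={12, 57})
---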

 Tom

 On Fri, 18 Nov 2011 18:29:50 +0900, Jehan Pages  
 wrote:
> Hi,
>
> On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne  
> wrote:
>> re: not tuning on training data, in principle this shouldn't matter
>> (especially if the tuning set is large and/or representative of the
>> task).
>>
>> in reality, Moses will assign far too much weight to these examples,
>> at the detriment of the others.  (it will drastically overfit). 
>>  this
>> is why the tuning and training sets are typically disjoint.  this is 
>> a
>> standard tactic in NLP and not just Moses.
>
> Ok thanks. Actually I think that reminds me indeed what I learned
> years ago on the topic (when I was still in university, in fact
> working on these kind of topics, though now that's kind of far away).
>
> [Also, Tom Hoar, forget my questions on what you answer at this point
> (when I asked "how do you do so?" and such). I misunderstood the
> meaning of your answer! Now with Miles's answer, and rereading your
> first one, I understand]
>
>> re:  assigning more weight to certain translations, you have two
>> options here.  the first would be to assign more weight to these 
>> pairs
>> when you run Giza++.  (you can assign per-sentence pair weights at
>> this stage).  this is really just a hint and won't guarantee 
>> anything.
>>  the second option would be to force translations (using the XML
>> markup).
>
> I see. Interesting. For what I want, the weights on GIZA++ looks 
> nice.
> I'll try to find information on this.
>
> Thanks a lot for the answers.
>
> Jehan
>
>> Miles
>>
>> On 18 November 2011 08:42, Jehan Pages  wrote:
>>> Hi,
>>>
>>> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>>>  wrote:
 Jehan, here are my strategies, others may vary.
>>>
>>> Thanks.
>>>
 1/ the 100-word (token) limit is a dependency of GIZA++ and 
 MGIZA++, not
 just a convenience for speed. If you make the effort to use the
 BerkeleyAligner, this limit disappears.
>>>
>>> Ok I didn't know this alternative to GIZA++. I see there are some
>>> explanation on the website for switching to this aligner. I may 
>>> give
>>> it a try someday then. :-)
>>>
 2/ From a statistics and survey methodology point of view, your 
 training
 data is a subset of individual samples selected from a whole 
 population
 (linguistic domain) so-as to estimate the characteristics of the 
 whole
 population. So, duplicates can exist and they play an important 
 role in
 determining statistical significance and calculating 
 probabilities. Some
 data sources, however, repeat information with little relevance to 
 the
 linguistic balance of the whole domain. One example is a web sites 
 with
 repetitive menus on every page. Therefore, for our use, we keep 
 duplicates
 where we believe they represent a balanced sampling and results we 
 want to
 achieve. We remove them when they do not. Not everyone, however, 
 agrees with
 this approach.
>>>
>>> I see. And that confirms my thoughts. I don't know for sure what 
>>> will
>>> be my strategy, but I think that will be keeping them all then, 
>>> most
>>> probably. Making conditional removal like you do is interesting, 
>>> but
>>> that would prove hard to do on our platform as we don't have 
>>> context
>>> on translations stored.
>>>
 3/ Yes, none of the data pairs in the tuning set should be present 
 in your
 training data. To do so skews the tuning weights to give excellent 
 BLEU
 scores on the tuning results, but horrible scores on "real world"
 translations.
>>>
>>> I am not sure I understand what you say. How do you do so? Also why
>>> would we want to give horrible score to real world translations? 
>>> Isn't
>>> the point exactly that the tuning data should actually "represent"
>>> this real world translations that we want to get close to?
>>>
>>>
>>> 4/ Also I was wondering something else that I just remember. So 
>>> that
>>> will be a fourth question!
>>> Suppose in our system, we have some translations we know for sure 
>>> are
>>> very good (all are good but some are supposed to be more like
>>> "certified quality"). Is there no way in Moses to give some more
>>> weight to some translations in order to influence the system 
>>> towards
>>> quality data (still keeping all data though)?
>>>
>>> Thanks again!
>>>
>>> Jehan
>>>
 Tom


 On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages 
  wrote:
>
> Hi all,
>
> I have a few questions about quality of training and tuning. If 
> anyone
> has any clarifications, that would be nice! :-)
>
> 1/ According to the documentation:
> «
> sentences longer than 100 words (and their corresponding 
> translations)
> have to be eliminated
>   (note that a shorter sentence length limit will speed up training
> »

Re: [Moses-support] Various questions about training and tuning

2011-11-18 Thread Jehan Pages
Hi,

On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne  wrote:
> re: not tuning on training data, in principle this shouldn't matter
> (especially if the tuning set is large and/or representative of the
> task).
>
> in reality, Moses will assign far too much weight to these examples,
> at the detriment of the others.  (it will drastically overfit).  this
> is why the tuning and training sets are typically disjoint.  this is a
> standard tactic in NLP and not just Moses.

OK, thanks. That indeed reminds me of what I learned years ago on the
topic (back when I was still at university, actually working on this
kind of topic, though that is now rather far away).

[Also, Tom Hoar, forget my questions about your answer on this point
(the "how do you do so?" ones and such). I misunderstood what you
meant! With Miles's answer, and rereading your first one, I now
understand.]

> re:  assigning more weight to certain translations, you have two
> options here.  the first would be to assign more weight to these pairs
> when you run Giza++.  (you can assign per-sentence pair weights at
> this stage).  this is really just a hint and won't guarantee anything.
>  the second option would be to force translations (using the XML
> markup).

I see, interesting. For what I want, the weights in GIZA++ look nice.
I'll try to find more information on this.

Thanks a lot for the answers.

Jehan

> Miles
>
> On 18 November 2011 08:42, Jehan Pages  wrote:
>> Hi,
>>
>> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>>  wrote:
>>> Jehan, here are my strategies, others may vary.
>>
>> Thanks.
>>
>>> 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++, not
>>> just a convenience for speed. If you make the effort to use the
>>> BerkeleyAligner, this limit disappears.
>>
>> Ok I didn't know this alternative to GIZA++. I see there are some
>> explanation on the website for switching to this aligner. I may give
>> it a try someday then. :-)
>>
>>> 2/ From a statistics and survey methodology point of view, your training
>>> data is a subset of individual samples selected from a whole population
>>> (linguistic domain) so-as to estimate the characteristics of the whole
>>> population. So, duplicates can exist and they play an important role in
>>> determining statistical significance and calculating probabilities. Some
>>> data sources, however, repeat information with little relevance to the
>>> linguistic balance of the whole domain. One example is a web sites with
>>> repetitive menus on every page. Therefore, for our use, we keep duplicates
>>> where we believe they represent a balanced sampling and results we want to
>>> achieve. We remove them when they do not. Not everyone, however, agrees with
>>> this approach.
>>
>> I see. And that confirms my thoughts. I don't know for sure what will
>> be my strategy, but I think that will be keeping them all then, most
>> probably. Making conditional removal like you do is interesting, but
>> that would prove hard to do on our platform as we don't have context
>> on translations stored.
>>
>>> 3/ Yes, none of the data pairs in the tuning set should be present in your
>>> training data. To do so skews the tuning weights to give excellent BLEU
>>> scores on the tuning results, but horrible scores on "real world"
>>> translations.
>>
>> I am not sure I understand what you say. How do you do so? Also why
>> would we want to give horrible score to real world translations? Isn't
>> the point exactly that the tuning data should actually "represent"
>> this real world translations that we want to get close to?
>>
>>
>> 4/ Also I was wondering something else that I just remember. So that
>> will be a fourth question!
>> Suppose in our system, we have some translations we know for sure are
>> very good (all are good but some are supposed to be more like
>> "certified quality"). Is there no way in Moses to give some more
>> weight to some translations in order to influence the system towards
>> quality data (still keeping all data though)?
>>
>> Thanks again!
>>
>> Jehan
>>
>>> Tom
>>>
>>>
>>> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages  wrote:

 Hi all,

 I have a few questions about quality of training and tuning. If anyone
 has any clarifications, that would be nice! :-)

 1/ According to the documentation:
 «
 sentences longer than 100 words (and their corresponding translations)
 have to be eliminated
   (note that a shorter sentence length limit will speed up training
 »
 So is it only for the sake of training speed or can too long sentences
 end up being a liability in MT quality? In other words, when I finally
 need to train "for real usage", should I really remove long sentences?

 2/ My data is taken from real crowd-sourced translated data. As a
 consequence, we end up with some duplicates (same original text and
 same translation). I wonder if for training, that either doesn't
 matter, or else we should remove duplicates, or finally that's better
 to have duplicates.

Re: [Moses-support] Various questions about training and tuning

2011-11-18 Thread Miles Osborne
Re: not tuning on training data: in principle this shouldn't matter
(especially if the tuning set is large and/or representative of the
task).

In reality, Moses will assign far too much weight to these examples,
to the detriment of the others (it will drastically overfit). This is
why the tuning and training sets are typically kept disjoint; it is a
standard tactic in NLP, not just in Moses.

Re: assigning more weight to certain translations, you have two
options here. The first is to assign more weight to these pairs when
you run GIZA++ (you can assign per-sentence-pair weights at this
stage); this is really just a hint and won't guarantee anything. The
second option is to force translations (using the XML markup).
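
As a rough illustration of the XML markup route (only a sketch: check
the Moses documentation for the exact attribute names and the
-xml-input decoder switch; the sentences and file names below are
invented):
---
def mark_forced(source_line, phrase, forced_translation):
    """Wrap `phrase` in XML markup so the decoder is steered towards
    (or, in exclusive mode, restricted to) the given translation."""
    markup = '<np translation="%s">%s</np>' % (forced_translation, phrase)
    return source_line.replace(phrase, markup, 1)

if __name__ == "__main__":
    print(mark_forced("das ist ein kleines haus",
                      "ein kleines haus", "a small house"))
    # then decode with XML input enabled, e.g.
    #   moses -xml-input exclusive -f moses.ini < marked-up.input
---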

Miles

On 18 November 2011 08:42, Jehan Pages  wrote:
> Hi,
>
> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>  wrote:
>> Jehan, here are my strategies, others may vary.
>
> Thanks.
>
>> 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++, not
>> just a convenience for speed. If you make the effort to use the
>> BerkeleyAligner, this limit disappears.
>
> Ok I didn't know this alternative to GIZA++. I see there are some
> explanation on the website for switching to this aligner. I may give
> it a try someday then. :-)
>
>> 2/ From a statistics and survey methodology point of view, your training
>> data is a subset of individual samples selected from a whole population
>> (linguistic domain) so-as to estimate the characteristics of the whole
>> population. So, duplicates can exist and they play an important role in
>> determining statistical significance and calculating probabilities. Some
>> data sources, however, repeat information with little relevance to the
>> linguistic balance of the whole domain. One example is a web sites with
>> repetitive menus on every page. Therefore, for our use, we keep duplicates
>> where we believe they represent a balanced sampling and results we want to
>> achieve. We remove them when they do not. Not everyone, however, agrees with
>> this approach.
>
> I see. And that confirms my thoughts. I don't know for sure what will
> be my strategy, but I think that will be keeping them all then, most
> probably. Making conditional removal like you do is interesting, but
> that would prove hard to do on our platform as we don't have context
> on translations stored.
>
>> 3/ Yes, none of the data pairs in the tuning set should be present in your
>> training data. To do so skews the tuning weights to give excellent BLEU
>> scores on the tuning results, but horrible scores on "real world"
>> translations.
>
> I am not sure I understand what you say. How do you do so? Also why
> would we want to give horrible score to real world translations? Isn't
> the point exactly that the tuning data should actually "represent"
> this real world translations that we want to get close to?
>
>
> 4/ Also I was wondering something else that I just remember. So that
> will be a fourth question!
> Suppose in our system, we have some translations we know for sure are
> very good (all are good but some are supposed to be more like
> "certified quality"). Is there no way in Moses to give some more
> weight to some translations in order to influence the system towards
> quality data (still keeping all data though)?
>
> Thanks again!
>
> Jehan
>
>> Tom
>>
>>
>> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages  wrote:
>>>
>>> Hi all,
>>>
>>> I have a few questions about quality of training and tuning. If anyone
>>> has any clarifications, that would be nice! :-)
>>>
>>> 1/ According to the documentation:
>>> «
>>> sentences longer than 100 words (and their corresponding translations)
>>> have to be eliminated
>>>   (note that a shorter sentence length limit will speed up training
>>> »
>>> So is it only for the sake of training speed or can too long sentences
>>> end up being a liability in MT quality? In other words, when I finally
>>> need to train "for real usage", should I really remove long sentences?
>>>
>>> 2/ My data is taken from real crowd-sourced translated data. As a
>>> consequence, we end up with some duplicates (same original text and
>>> same translation). I wonder if for training, that either doesn't
>>> matter, or else we should remove duplicates, or finally that's better
>>> to have duplicates.
>>>
>>> I would imagine the latter (keep duplicates) is the best as this is
>>> "statistical machine learning" and after all, these represent "real
>>> life" duplicates (text we often encounter and that we apparently
>>> usually translate the same way) so that would be good to "insist on"
>>> these translations during training.
>>> Am I right?
>>>
>>> 3/ Do training and tuning data have necessarily to be different? I
>>> guess for it to be meaningful, it should, and various examples on the
>>> website seem to go in that way, but I could not read anything clearly
>>> stating this.
>>>
>>> Thanks.
>>>
>>> Jehan
>>>

Re: [Moses-support] Various questions about training and tuning

2011-11-18 Thread Jehan Pages
Hi,

On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
 wrote:
> Jehan, here are my strategies, others may vary.

Thanks.

> 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++, not
> just a convenience for speed. If you make the effort to use the
> BerkeleyAligner, this limit disappears.

OK, I didn't know about this alternative to GIZA++. I see there are
some explanations on the website about switching to this aligner. I
may give it a try someday then. :-)

> 2/ From a statistics and survey methodology point of view, your training
> data is a subset of individual samples selected from a whole population
> (linguistic domain) so-as to estimate the characteristics of the whole
> population. So, duplicates can exist and they play an important role in
> determining statistical significance and calculating probabilities. Some
> data sources, however, repeat information with little relevance to the
> linguistic balance of the whole domain. One example is a web sites with
> repetitive menus on every page. Therefore, for our use, we keep duplicates
> where we believe they represent a balanced sampling and results we want to
> achieve. We remove them when they do not. Not everyone, however, agrees with
> this approach.

I see, and that confirms my thoughts. I don't know for sure what my
strategy will be, but most probably it will be to keep them all.
Conditional removal like you do is interesting, but that would be hard
to do on our platform, as we don't store any context for the
translations.

> 3/ Yes, none of the data pairs in the tuning set should be present in your
> training data. To do so skews the tuning weights to give excellent BLEU
> scores on the tuning results, but horrible scores on "real world"
> translations.

I am not sure I understand what you are saying. How do you do so?
Also, why would we want horrible scores on real-world translations?
Isn't the point exactly that the tuning data should "represent" the
real-world translations we want to get close to?


4/ I was also wondering about something else I just remembered, so
that will be a fourth question!
Suppose that in our system we have some translations we know for sure
are very good (all are good, but some are more like "certified
quality"). Is there no way in Moses to give more weight to certain
translations, so as to influence the system towards the quality data
(while still keeping all the data)?

Thanks again!

Jehan

> Tom
>
>
> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages  wrote:
>>
>> Hi all,
>>
>> I have a few questions about quality of training and tuning. If anyone
>> has any clarifications, that would be nice! :-)
>>
>> 1/ According to the documentation:
>> «
>> sentences longer than 100 words (and their corresponding translations)
>> have to be eliminated
>>   (note that a shorter sentence length limit will speed up training
>> »
>> So is it only for the sake of training speed or can too long sentences
>> end up being a liability in MT quality? In other words, when I finally
>> need to train "for real usage", should I really remove long sentences?
>>
>> 2/ My data is taken from real crowd-sourced translated data. As a
>> consequence, we end up with some duplicates (same original text and
>> same translation). I wonder if for training, that either doesn't
>> matter, or else we should remove duplicates, or finally that's better
>> to have duplicates.
>>
>> I would imagine the latter (keep duplicates) is the best as this is
>> "statistical machine learning" and after all, these represent "real
>> life" duplicates (text we often encounter and that we apparently
>> usually translate the same way) so that would be good to "insist on"
>> these translations during training.
>> Am I right?
>>
>> 3/ Do training and tuning data have necessarily to be different? I
>> guess for it to be meaningful, it should, and various examples on the
>> website seem to go in that way, but I could not read anything clearly
>> stating this.
>>
>> Thanks.
>>
>> Jehan
>>

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Various questions about training and tuning

2011-11-17 Thread Tom Hoar
 Jehan, here are my strategies, others may vary.

 1/ The 100-word (token) limit is a dependency of GIZA++ and MGIZA++, 
 not just a convenience for speed. If you make the effort to use the 
 BerkeleyAligner, this limit disappears.
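
 If you stay with GIZA++/MGIZA++, the cleanup itself is just a filter
 over the parallel files, along these lines (a sketch only; Moses also
 ships a corpus-cleaning script for this, the file names here are made
 up and tokens are simply whitespace-split):
---
def drop_long_pairs(src_in, trg_in, src_out, trg_out, max_tokens=100):
    """Copy the parallel corpus, skipping pairs where either side has
    more than max_tokens whitespace-separated tokens."""
    with open(src_in) as fs, open(trg_in) as ft, \
         open(src_out, "w") as out_s, open(trg_out, "w") as out_t:
        for s, t in zip(fs, ft):
            if len(s.split()) <= max_tokens and len(t.split()) <= max_tokens:
                out_s.write(s)
                out_t.write(t)

if __name__ == "__main__":
    drop_long_pairs("train.fr", "train.en",
                    "train.clean.fr", "train.clean.en")
---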

 2/ From a statistics and survey methodology point of view, your 
 training data is a subset of individual samples selected from a whole 
 population (the linguistic domain) so as to estimate the 
 characteristics of that whole population. So, duplicates can exist, 
 and they play an important role in determining statistical 
 significance and calculating probabilities. Some data sources, 
 however, repeat information with little relevance to the linguistic 
 balance of the whole domain; one example is a web site with repetitive 
 menus on every page. Therefore, for our use, we keep duplicates where 
 we believe they represent a balanced sample and the results we want to 
 achieve, and we remove them when they do not. Not everyone, however, 
 agrees with this approach.
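
 When we do decide to drop the repeats, it is the exact (source,
 target) pairs that we remove, for example (sketch only; the file
 names are made up):
---
def drop_exact_duplicates(src_in, trg_in, src_out, trg_out):
    """Keep only the first occurrence of each identical pair."""
    seen = set()
    with open(src_in) as fs, open(trg_in) as ft, \
         open(src_out, "w") as out_s, open(trg_out, "w") as out_t:
        for s, t in zip(fs, ft):
            if (s, t) in seen:
                continue
            seen.add((s, t))
            out_s.write(s)
            out_t.write(t)

if __name__ == "__main__":
    drop_exact_duplicates("train.fr", "train.en",
                          "train.dedup.fr", "train.dedup.en")
---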

 3/ Yes, none of the data pairs in the tuning set should be present in 
 your training data. To do so skews the tuning weights to give excellent 
 BLEU scores on the tuning results, but horrible scores on "real world" 
 translations.
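
 A simple way to enforce this is to strip from the training data, before
 training, any pair that also occurs in the tuning files, for example
 (sketch only; the file names are made up):
---
def remove_tuning_overlap(train_src, train_trg, tune_src, tune_trg,
                          out_src, out_trg):
    """Drop every training pair that also appears verbatim in the
    tuning set."""
    with open(tune_src) as fs, open(tune_trg) as ft:
        tuning = set((s.rstrip("\n"), t.rstrip("\n"))
                     for s, t in zip(fs, ft))
    with open(train_src) as fs, open(train_trg) as ft, \
         open(out_src, "w") as out_s, open(out_trg, "w") as out_t:
        for s, t in zip(fs, ft):
            if (s.rstrip("\n"), t.rstrip("\n")) in tuning:
                continue
            out_s.write(s)
            out_t.write(t)

if __name__ == "__main__":
    remove_tuning_overlap("train.fr", "train.en", "tune.fr", "tune.en",
                          "train.disjoint.fr", "train.disjoint.en")
---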

 Tom


 On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages  
 wrote:
> Hi all,
>
> I have a few questions about quality of training and tuning. If 
> anyone
> has any clarifications, that would be nice! :-)
>
> 1/ According to the documentation:
> «
> sentences longer than 100 words (and their corresponding 
> translations)
> have to be eliminated
>(note that a shorter sentence length limit will speed up training
> »
> So is it only for the sake of training speed or can too long 
> sentences
> end up being a liability in MT quality? In other words, when I 
> finally
> need to train "for real usage", should I really remove long 
> sentences?
>
> 2/ My data is taken from real crowd-sourced translated data. As a
> consequence, we end up with some duplicates (same original text and
> same translation). I wonder if for training, that either doesn't
> matter, or else we should remove duplicates, or finally that's better
> to have duplicates.
>
> I would imagine the latter (keep duplicates) is the best as this is
> "statistical machine learning" and after all, these represent "real
> life" duplicates (text we often encounter and that we apparently
> usually translate the same way) so that would be good to "insist on"
> these translations during training.
> Am I right?
>
> 3/ Do training and tuning data have necessarily to be different? I
> guess for it to be meaningful, it should, and various examples on the
> website seem to go in that way, but I could not read anything clearly
> stating this.
>
> Thanks.
>
> Jehan
>


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Various questions about training and tuning

2011-11-17 Thread Jehan Pages
Hi all,

I have a few questions about quality of training and tuning. If anyone
has any clarifications, that would be nice! :-)

1/ According to the documentation:
«
sentences longer than 100 words (and their corresponding translations)
have to be eliminated
   (note that a shorter sentence length limit will speed up training
»
So is it only for the sake of training speed, or can overly long
sentences end up being a liability for MT quality? In other words,
when I finally need to train "for real usage", should I really remove
long sentences?

2/ My data comes from real crowd-sourced translations. As a
consequence, we end up with some duplicates (same original text and
same translation). I wonder whether, for training, this doesn't
matter, or we should remove the duplicates, or it is actually better
to have duplicates.

I would imagine the latter (keeping duplicates) is best, since this is
"statistical machine learning" and, after all, these are "real life"
duplicates (text we encounter often and apparently usually translate
the same way), so it would be good to "insist on" these translations
during training.
Am I right?

3/ Do training and tuning data necessarily have to be different? I
guess that for tuning to be meaningful they should, and various
examples on the website seem to go that way, but I could not find
anything clearly stating this.

Thanks.

Jehan

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support