Re: [Moses-support] apostrophe: detokenization or corpus issue ?

2016-03-14 Thread Vincent Nguyen
I think I found the culprit. This is very tricky. It's not a detokenizer issue but a "normalize-punctuation | tokenizer" issue. The normalize-punctuation script converts the special apostrophe (UTF-8 sequence E2 80 99) when it is surrounded by [a-z] on both sides.
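To illustrate why a rule restricted to [a-z] can misfire on French text, here is a minimal Python sketch. It is not the Moses script itself: the function name `normalize_apostrophe` is hypothetical, and the fallback that turns any remaining special apostrophe into a double quote is an assumption made for illustration, chosen because it would be consistent with the d" âge symptom reported in this thread.

```python
import re

# U+2019 RIGHT SINGLE QUOTATION MARK; its UTF-8 encoding is E2 80 99.
APOSTROPHE = "\u2019"

# Rule as described in the message: the special apostrophe is handled
# when it has an ASCII letter on both sides.
BETWEEN_ASCII_LETTERS = re.compile("([a-zA-Z])" + APOSTROPHE + "([a-zA-Z])")


def normalize_apostrophe(text: str) -> str:
    """Hypothetical re-creation of the behaviour, for illustration only."""
    # Between two ASCII letters, use a plain ASCII apostrophe.
    text = BETWEEN_ASCII_LETTERS.sub(r"\1'\2", text)
    # ASSUMPTION: any special apostrophe left over becomes a double quote.
    return text.replace(APOSTROPHE, '"')


if __name__ == "__main__":
    # "h" is in [a-z], so the first rule applies.
    print(normalize_apostrophe("l\u2019homme"))
    # "â" is not in [a-z], so the first rule does not apply.
    print(normalize_apostrophe("d\u2019\u00e2ge"))
```

Under these assumptions, l’homme becomes l'homme, while d’âge becomes d"âge because â falls outside [a-z]. A tokenizer that then treats the double quote as a separate token would yield the kind of stray quote seen in the incorrect translations.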

Re: [Moses-support] apostrophe: detokenization or corpus issue ?

2016-03-10 Thread Vincent Nguyen
this age group is decoded as: ce groupe d âge
I'll check my corpus and see why it got " instead of ' in there. Thanks.

On 10/03/2016 13:00, Philipp Koehn wrote:
Hi, I do not think that the detokenizer would cause conversion of ' to ". You can check the raw output of the decoder, and see

Re: [Moses-support] apostrophe: detokenization or corpus issue ?

2016-03-10 Thread Philipp Koehn
Hi,

I do not think that the detokenizer would cause conversion of ' to ". You can check the raw output of the decoder, and see how it is changed by the detokenizer.

-phi

On Wed, Mar 9, 2016 at 11:44 AM, Vincent Nguyen wrote:
> Hi,
>
> I got the following situation:
>
> This

[Moses-support] apostrophe: detokenization or corpus issue ?

2016-03-09 Thread Vincent Nguyen
Hi,

I got the following situation: "This group age" is sometimes translated as:

ce groupe d'âge (correct)
ce groupe d" âge (incorrect)
ce groupe d "âge (incorrect)

I am wondering whether this is more a detokenizer issue or a corpus issue, or both. Technically in French, there shouldn't be any space