I think I found the culprit.
This is very tricky. It's not a detokenizer issue but a
"normalize-punctuation | tokenizer" issue.
The normalize-punctuation script converts the special apostrophe (UTF-8
sequence E2 80 99)
only when it is surrounded by [a-z] on both sides, and an accented letter
such as â is not in [a-z].
this age group
is decoded as
ce groupe d âge
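To illustrate the behaviour described above, here is a minimal Python sketch. It is not the actual normalize-punctuation script: the two rules in normalize_ascii_only are assumptions modelled on what this thread reports (a rule restricted to [a-z], with a fallback to a double quote for the remaining cases, which would match the d" âge output), and normalize_unicode_aware is a hypothetical variant that accepts any Unicode letter.

```python
import re

# U+2019 RIGHT SINGLE QUOTATION MARK; its UTF-8 encoding is E2 80 99.
CURLY = "\u2019"


def normalize_ascii_only(text: str) -> str:
    """Hypothetical model of an ASCII-only normalization rule."""
    # Assumed rule 1: curly apostrophe between two ASCII letters -> '
    text = re.sub(
        r"([a-z])\u2019([a-z])", r"\1'\2", text,
        flags=re.IGNORECASE | re.ASCII,
    )
    # Assumed rule 2 (fallback): any remaining curly apostrophe -> "
    return text.replace(CURLY, '"')


def normalize_unicode_aware(text: str) -> str:
    """Hypothetical variant: accept any Unicode letter on both sides."""
    # [^\W\d_] matches word characters that are neither digits nor "_",
    # i.e. letters, including accented ones such as "â".
    return re.sub(r"([^\W\d_])\u2019([^\W\d_])", r"\1'\2", text)


if __name__ == "__main__":
    sample = "ce groupe d\u2019âge"
    print(normalize_ascii_only(sample))
    print(normalize_unicode_aware(sample))
```

With the ASCII-only rule, the apostrophe in front of â falls through to the fallback, while the same apostrophe between two unaccented letters is converted as expected.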
I'll check my corpus and see why it got ’ instead of ' in there.
Thanks.
On 10/03/2016 13:00, Philipp Koehn wrote:
Hi,
I do not think that the detokenizer would cause conversion of ' to ".
You can check the raw output of the decoder, and see how it is
changed by the detokenizer.
-phi
On Wed, Mar 9, 2016 at 11:44 AM, Vincent Nguyen wrote:
Hi,
I got the following situation:
This group age
is sometimes translated as:
ce groupe d'âge (correct)
ce groupe d" âge (incorrect)
ce groupe d "âge (incorrect)
I am wondering whether this is more of a detokenizer issue or a corpus issue, or
both.
Technically in French, there shouldn't be any space