Re: [Moses-support] apostrophe: detokenization or corpus issue ?

2016-03-14 Thread Vincent Nguyen


I think I found the culprit.
This is very tricky. It's not a detokenizer issue but a
"normalize-punctuation | tokenizer" issue.


The normalize-punctuation script converts the special apostrophe (UTF-8
sequence E2 80 99) to a regular ' when it is surrounded by [a-z] on both sides:

s/([a-z])‘([a-z])/$1\'$2/gi;
s/([a-z])’([a-z])/$1\'$2/gi;

The problem is that when the apostrophe is followed by an accented
character such as é or â (UTF-8 sequences C3 A9 and C3 A2), the [a-z]
test fails and the substitution does not apply.
The script then converts these leftover apostrophes to double quotes ":
s/‘/\"/g;
s/‚/\"/g;
s/’/\"/g;
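
To make the failure easy to reproduce, here is a small Python re-creation of the two rule groups quoted above. It is only a sketch: the real script is Perl and contains more rules, the function name normalize_as_described is made up for this example, and the middle character of the second group is assumed to be U+201A.

```python
import re

# Python re-creation of the rules quoted above (illustration only,
# not the actual normalize-punctuation script).
def normalize_as_described(text: str) -> str:
    # Group 1: a curly quote between two ASCII letters becomes a plain apostrophe.
    text = re.sub(r"([a-z])\u2018([a-z])", r"\1'\2", text, flags=re.IGNORECASE)
    text = re.sub(r"([a-z])\u2019([a-z])", r"\1'\2", text, flags=re.IGNORECASE)
    # Group 2: any curly quote still present becomes a double quote.
    # U+201A for the middle character is an assumption about the quoted rule.
    for ch in ("\u2018", "\u201a", "\u2019"):
        text = text.replace(ch, '"')
    return text

# ASCII letter after the apostrophe: handled by group 1.
print(normalize_as_described("l\u2019homme"))
# Accented letter after the apostrophe: group 1 skips it, group 2 rewrites it.
print(normalize_as_described("ce groupe d\u2019\u00e2ge"))
```

The first call prints l'homme, the second prints ce groupe d"âge, which is the kind of input the tokenizer then splits around the quote.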

Either we need to fix the [a-z] character class so that it also matches
accented letters, or maybe change the last 3 conversions to convert to the
regular ' no matter what.
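
The two options can be sketched side by side in Python. This is not a patch for the Perl script, and both function names are hypothetical. The class [^\W\d_] is used here as a rough stand-in for "any Unicode letter"; in Perl the analogous class would probably be \p{L}, provided the script treats its input as decoded UTF-8.

```python
import re

# Rough "any Unicode letter": word characters minus digits and underscore.
LETTER = r"[^\W\d_]"

def normalize_unicode_letters(text: str) -> str:
    # Option 1: widen the letter test, keep the fallback to a double quote.
    text = re.sub(rf"(?<={LETTER})[\u2018\u2019](?={LETTER})", "'", text)
    for ch in ("\u2018", "\u201a", "\u2019"):
        text = text.replace(ch, '"')
    return text

def normalize_always_apostrophe(text: str) -> str:
    # Option 2: map the three characters to a plain apostrophe unconditionally.
    for ch in ("\u2018", "\u201a", "\u2019"):
        text = text.replace(ch, "'")
    return text

print(normalize_unicode_letters("ce groupe d\u2019\u00e2ge"))
print(normalize_always_apostrophe("\u2018quoted\u2019"))
```

Option 1 still turns a genuinely quoted ‘word’ into "word", while option 2 turns it into 'word'. Which of the two is preferable depends on how single quotes are used in the corpus.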


Hope this is clear.



On 10/03/2016 13:00, Philipp Koehn wrote:

Hi,

I do not think that the detokenizer would cause conversion of ' to ".
You can check the raw output of the decoder, and see how it is
changed by the detokenizer.

-phi

On Wed, Mar 9, 2016 at 11:44 AM, Vincent Nguyen wrote:


Hi,

I have the following situation:

This group age
is sometimes translated as:
ce groupe d'âge (correct)
ce groupe d" âge (incorrect)
ce groupe d "âge (incorrect)

I am wondering whether this is more a detokenizer issue or a corpus
issue, or both.

Technically, in French there shouldn't be any space before or after the
apostrophe.
In the Europarl corpus, as well as in the News2014 one, there are some
instances with a space before or after.

So I have the feeling that the decoder gets an apostrophe with surrounding
spaces, which leads the detokenizer to transform it into "

Has anyone had a similar issue?

thanks.
___
Moses-support mailing list
Moses-support@mit.edu 
http://mailman.mit.edu/mailman/listinfo/moses-support






Re: [Moses-support] apostrophe: detokenization or corpus issue ?

2016-03-10 Thread Vincent Nguyen


this age group
is decoded as
ce groupe d &quot; âge

I'll check my corpus and see why it got &quot; instead of &apos; in there.
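
One way to do that check is to list the corpus lines that contain a curly quote which is not sitting between two ASCII letters. The Python below is only a sketch under that assumption; the function name find_unhandled_apostrophes and the sample lines are made up for illustration.

```python
import re
from typing import Iterable, Iterator, Tuple

# A curly single quote that lacks an ASCII letter on at least one side.
UNHANDLED = re.compile(r"(?<![A-Za-z])[\u2018\u2019]|[\u2018\u2019](?![A-Za-z])")

def find_unhandled_apostrophes(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for lines containing such a quote."""
    for number, line in enumerate(lines, start=1):
        if UNHANDLED.search(line):
            yield number, line.rstrip("\n")

# Made-up sample; for a real corpus, pass a file opened with encoding="utf-8".
sample = ["ce groupe d\u2019\u00e2ge\n", "l\u2019homme\n", "aujourd\u2019hui\n"]
for number, line in find_unhandled_apostrophes(sample):
    print(number, line)
```

On the sample, only the first line is reported, because the apostrophe there is followed by â rather than by an ASCII letter.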

thanks.








Re: [Moses-support] apostrophe: detokenization or corpus issue ?

2016-03-10 Thread Philipp Koehn
Hi,

I do not think that the detokenizer would cause conversion of ' to ".
You can check the raw output of the decoder, and see how it is
changed by the detokenizer.

-phi
