Re: [Moses-support] how does Moses handle with the apostrophes?

Tan, Jun Tue, 07 Aug 2012 19:03:19 -0700

Hi Barry,

I have checked the source data used for training. I found that some of the 
apostrophes have already been converted to '&apos;', but some characters such 
as ’ and &#91; remain.
As I understand it, the tool you mentioned converts apostrophes from Unicode 
to ASCII, so it would only work for English-to-Chinese translation.  Is that 
right?  The apostrophe in Chinese is two-byte, while in English it is one-byte. 
If I use the tool (http://www.statmt.org/wmt11/normalize-punctuation.perl), 
what will the translation result be for the apostrophes (’, ‘, ')?
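
For illustration only, here is a minimal Python sketch of the kind of mapping 
being asked about. It is not a port of normalize-punctuation.perl, whose actual 
rules should be read from the script itself; the character table and the 
function name normalize_apostrophes are assumptions made for this example.

```python
# Hypothetical table of apostrophe-like characters. This is an assumption for
# the sketch, not the rule set of normalize-punctuation.perl.
APOSTROPHE_VARIANTS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\uff07": "'",  # FULLWIDTH APOSTROPHE
}

# Translation table built once and reused for every call.
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(text: str) -> str:
    """Replace each listed variant with the ASCII apostrophe (U+0027)."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_apostrophes("the company\u2019s results"))
```

Once a character has been mapped to the ASCII apostrophe, the tokenizer rule 
s/\'/\&apos;/g quoted later in this thread will match it.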



-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk] 
Sent: Tuesday, August 07, 2012 6:18 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] how does Moses handle with the apostrophes?

Hi Jun

If you're using this version of the tokeniser on your source sentence, then I 
would expect it to convert the apostrophe to &apos;. The fact that there is no 
&apos; in your output suggests that either the decoder is translating it to ' 
(unlikely) or the apostrophe in your source is not a regular apostrophe, but 
some unicode variant. So you need to check for that.
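
One way to run that check, sketched in Python on the assumption that the text 
has already been decoded (for example from UTF-8). The function name 
find_apostrophe_variants and the set of look-alike characters are illustrative 
and not part of Moses.

```python
import unicodedata

# Assumed set of non-ASCII characters that are easily mistaken for the
# ASCII apostrophe (U+0027). Extend it as needed for your data.
LOOKALIKES = {"\u2018", "\u2019", "\u02bc", "\u00b4", "\uff07"}


def find_apostrophe_variants(line: str):
    """Return (index, code point, Unicode name) for each look-alike found."""
    hits = []
    for i, ch in enumerate(line):
        if ch in LOOKALIKES:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


if __name__ == "__main__":
    for hit in find_apostrophe_variants("the company\u2019s 10th quarter"):
        print(hit)
```

An empty result means the line contains none of the listed variants. Note that 
U+2019 is encoded as the three bytes e2 80 99 in UTF-8, which is what a hex 
dump of such a file would show in place of the single byte 27.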

This script will normalise a lot of the punctuation: 
http://www.statmt.org/wmt11/normalize-punctuation.perl
However, if you use it, then you should also run it over your training data 
and retrain.

cheers - Barry

On 07/08/12 11:00, Tan, Jun wrote:
> Hi Barry,
>
> I think the version is new. Below is an excerpt from the file tokenizer.perl:
>   #escape special chars
>    $text =~ s/\&/\&amp;/g;   # escape escape
>    $text =~ s/\|/\&#124;/g;  # factor separator
>    $text =~ s/\</\&lt;/g;    # xml
>    $text =~ s/\>/\&gt;/g;    # xml
>    $text =~ s/\'/\&apos;/g;  # xml
>    $text =~ s/\"/\&quot;/g;  # xml
>    $text =~ s/\[/\&#91;/g;   # syntax non-terminal
>    $text =~ s/\]/\&#93;/g;   # syntax non-terminal
>
>
>
> -----Original Message-----
> From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
> Sent: Tuesday, August 07, 2012 5:55 PM
> To: Tan, Jun
> Cc: moses-support@mit.edu
> Subject: Re: [Moses-support] how does Moses handle with the apostrophes?
>
> Hi Jun
>
> Recent versions of the tokeniser have a line like
>
> $text =~ s/\'/\&apos;/g;  # xml
>
> to escape apostrophes.
>
> cheers - Barry
>
> On 07/08/12 09:51, Tan, Jun wrote:
>> Hi Barry,
>>
>> How do I check the Moses version?  I'm sure that the tokeniser used for 
>> training is the same as the one used for testing. I'm using the Stanford 
>> Word Segmenter for Chinese.
>>
>> -----Original Message-----
>> From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
>> Sent: Tuesday, August 07, 2012 4:43 PM
>> To: Tan, Jun
>> Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
>> Subject: Re: [Moses-support] how does Moses handle with the apostrophes?
>>
>> Hi Jun
>>
>> Is the apostrophe in your source data an ascii apostrophe, or a unicode 
>> variant (use xxd to check this)? As Tom said, recent versions of the Moses 
>> tokeniser escape apostrophes, so either you're using an old version, or it 
>> does not recognise it as an apostrophe.
>>
>> Make sure you are using the same tokeniser in training and test.
>>
>> cheers - Barry
>>
>> On 07/08/12 06:38, jun....@emc.com wrote:
>>> Yes, I’m using Moses’ tokenizer.perl for English, and Moses was installed 
>>> in June, so the version should be relatively new.
>>> Do you have any ideas how to fix it?
>>> From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
>>> Sent: Tuesday, August 07, 2012 1:13 PM
>>> To: Tan, Jun
>>> Cc: moses-support@mit.edu
>>> Subject: Re: [Moses-support] how does Moses handle with the apostrophes?
>>>
>>>
>>> If you're using Moses' tokenizer.perl script, the English handling 
>>> separates the "company's" into "company 's". In recent (~2 months) moses 
>>> github releases, the tokenizer.perl script also escapes the string to this 
>>> "company&apos;s". The English detokenizer unescapes the "&apos;s" to "'s" 
>>> and restores it without the preceding space.
>>>
>>>
>>>
>>> On Tue, 7 Aug 2012 00:33:07 -0400, <jun....@emc.com> wrote:
>>> Hi all,
>>>
>>> When I use Moses to translate sentences that contain apostrophes, it 
>>> doesn’t work correctly.
>>> Source:
>>> EMC Corporation (NYSE:EMC) today reported strong financial results for the 
>>> second quarter of 2012, marking the company's 10th consecutive quarter of 
>>> double-digit year-over-year growth for consolidated revenue, GAAP net 
>>> income, and GAAP and non-GAAP EPS. EMC expects to achieve its full-year 
>>> 2012 goals for consolidated revenue, non-GAAP EPS and free cash flow.
>>>
>>> Translation result:
>>> 2012 年 7 月 24 日 — EMC 公司 （ NYSE ： EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2
>>> 季度 ， 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 ， 以 实现 整合 的 收入 、 GAAP 净
>>> 收入 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、
>>> 非 GAAP EPS 和 自由 现金流 。
>>>
>>> As we can see, “company's” is translated as “公司 's”: the apostrophe (') 
>>> and the letter “s” are left untranslated.
>>> Does anybody know the cause of this issue? Do I need some other module to 
>>> handle it? Does anybody know how to fix it?
>>>
>>>
>>> Thanks
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland, 
>> with registration number SC005336.
>>
>>