Hi Barry,

I have checked the source data used for training. Some of the apostrophes have already been converted to '&apos;', but others, such as ’ and [, remain. As I understand it, the tool you mentioned converts apostrophes from Unicode to ASCII, so it would only work for English-to-Chinese translation. Is that right? The Chinese apostrophe is a two-byte character, while the English one is a single byte. If I use the tool (http://www.statmt.org/wmt11/normalize-punctuation.perl), what will the translation result be for the apostrophes (’, ‘, ')?
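
A minimal Python sketch of the check discussed in this thread: find apostrophe-like Unicode characters in a sentence and map them to the ASCII apostrophe before tokenisation. The variant table and the function names are assumptions made for illustration. They are not taken from normalize-punctuation.perl, whose own rules may differ, so read the script itself to see exactly what it does with each character.

```python
import unicodedata

# Apostrophe-like code points that can appear in mixed English/Chinese text.
# This table is an assumption for illustration, not the rule set used by
# normalize-punctuation.perl.
APOSTROPHE_VARIANTS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\uff07": "'",  # FULLWIDTH APOSTROPHE
    "\u00b4": "'",  # ACUTE ACCENT
}


def find_apostrophe_variants(text):
    """List (index, char, Unicode name, UTF-8 bytes as hex) for each variant found."""
    return [
        (i, ch, unicodedata.name(ch), ch.encode("utf-8").hex())
        for i, ch in enumerate(text)
        if ch in APOSTROPHE_VARIANTS
    ]


def normalize_apostrophes(text):
    """Replace every variant in the table with the ASCII apostrophe."""
    return text.translate(str.maketrans(APOSTROPHE_VARIANTS))


if __name__ == "__main__":
    sentence = "marking the company\u2019s 10th consecutive quarter"
    for hit in find_apostrophe_variants(sentence):
        print(hit)
    print(normalize_apostrophes(sentence))
```

In UTF-8, ’ (U+2019) is encoded as the three bytes e2 80 99, while the ASCII apostrophe is the single byte 27; that difference is what xxd would show. Any normalisation applied to the test input would also have to be applied to the training data before retraining.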

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Tuesday, August 07, 2012 6:18 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] how does Moses handle with the apostrophes?

Hi Jun

If you're using this version of the tokeniser on your source sentence, then I would expect it to convert the apostrophe to &apos;

The fact that there is no &apos; in your output suggests that either the decoder is translating it to ' (unlikely) or the apostrophe in your source is not a regular apostrophe, but some unicode variant. So you need to check for that.

This script will normalise a lot of the punctuation
http://www.statmt.org/wmt11/normalize-punctuation.perl
However if you use it, then you should also run it over your training data, and retrain.

cheers - Barry

On 07/08/12 11:00, Tan, Jun wrote:
> Hi Barry,
>
> I think the version is new. Below is the relevant part of the file tokenizer.perl:
> #escape special chars
> $text =~ s/\&/\&amp;/g; # escape escape
> $text =~ s/\|/\&#124;/g; # factor separator
> $text =~ s/\</\&lt;/g; # xml
> $text =~ s/\>/\&gt;/g; # xml
> $text =~ s/\'/\&apos;/g; # xml
> $text =~ s/\"/\&quot;/g; # xml
> $text =~ s/\[/\&#91;/g; # syntax non-terminal
> $text =~ s/\]/\&#93;/g; # syntax non-terminal
>
> -----Original Message-----
> From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
> Sent: Tuesday, August 07, 2012 5:55 PM
> To: Tan, Jun
> Cc: moses-support@mit.edu
> Subject: Re: [Moses-support] how does Moses handle with the apostrophes?
>
> Hi Jun
>
> Recent versions of the tokeniser have a line like
>
> $text =~ s/\'/\&apos;/g; # xml
>
> to escape apostrophes.
>
> cheers - Barry
>
> On 07/08/12 09:51, Tan, Jun wrote:
>> Hi Barry,
>>
>> How do I check the Moses version? I'm sure that the tokeniser used for training is
>> the same as the one used for testing. I'm using the Stanford Word Segmenter for Chinese.
>>
>> -----Original Message-----
>> From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
>> Sent: Tuesday, August 07, 2012 4:43 PM
>> To: Tan, Jun
>> Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
>> Subject: Re: [Moses-support] how does Moses handle with the apostrophes?
>>
>> Hi Jun
>>
>> Is the apostrophe in your source data an ascii apostrophe, or a unicode
>> variant (use xxd to check this)? As Tom said, recent versions of the Moses
>> tokeniser escape apostrophes, so either you're using an old version, or it
>> does not recognise it as an apostrophe.
>>
>> Make sure you are using the same tokeniser in training and test.
>>
>> cheers - Barry
>>
>> On 07/08/12 06:38, jun....@emc.com wrote:
>>> Yes, I’m using Moses’s tokenizer.perl for English, and Moses
>>> was installed in June, so the version should be relatively new.
>>> Do you have any ideas how to fix it?
>>> From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
>>> Sent: Tuesday, August 07, 2012 1:13 PM
>>> To: Tan, Jun
>>> Cc: moses-support@mit.edu
>>> Subject: Re: [Moses-support] how does Moses handle with the apostrophes?
>>>
>>> If you're using Moses' tokenizer.perl script, the English handling
>>> separates the "company's" into "company 's". In recent (~2 months) moses
>>> github releases, the tokenizer.perl script also escapes the string to this
>>> "company&apos;s". The English detokenizer unescapes the "&apos;s" to "'s"
>>> and restores it without the preceding space.
>>>
>>> On Tue, 7 Aug 2012 00:33:07 -0400, <jun....@emc.com<mailto:jun....@emc.com>>
>>> wrote:
>>> Hi all,
>>>
>>> When I use Moses to translate sentences that contain apostrophes, it
>>> doesn’t work correctly.
>>> Source:
>>> EMC Corporation (NYSE:EMC) today reported strong financial results for the
>>> second quarter of 2012, marking the company's 10th consecutive quarter of
>>> double-digit year-over-year growth for consolidated revenue, GAAP net
>>> income, and GAAP and non-GAAP EPS.
>>> EMC expects to achieve its full-year
>>> 2012 goals for consolidated revenue, non-GAAP EPS and free cash flow.
>>>
>>> Translation result:
>>> 2012 年 7 月 24 日 — EMC 公司 ( NYSE : EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2
>>> 季度 , 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 , 以 实现 整合 的 收入 、 GAAP 净 收入
>>> 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、
>>> 非 GAAP EPS 和 自由 现金流 。
>>>
>>> As we can see, the translation result of “company's” is “公司 's”, so the
>>> translation of the apostrophe (‘) and the letter (s) failed.
>>> Does anybody know the cause of this issue? Do I need some other module to
>>> handle it? Does anybody know how to fix it? Below is an example:
>>>
>>> Thanks
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
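
The escaping step discussed in this thread can be sketched in Python as follows. This is an illustrative re-implementation, not the Moses code: the entity table assumes the replacements used by the escape step of recent tokenizer.perl versions (for example ' becoming &apos;), and the function names are hypothetical. It shows why an ASCII apostrophe ends up as &apos; in tokenised text while a Unicode variant such as ’ passes through unchanged.

```python
# (raw character, escaped form) pairs. "&" comes first so that the
# ampersands introduced by later replacements are not escaped again.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(text):
    """Escape the characters listed in ESCAPES, in table order."""
    for raw, escaped in ESCAPES:
        text = text.replace(raw, escaped)
    return text


def unescape_special_chars(text):
    """Undo escape_special_chars; "&amp;" is restored last."""
    for raw, escaped in reversed(ESCAPES):
        text = text.replace(escaped, raw)
    return text


if __name__ == "__main__":
    print(escape_special_chars("company 's"))       # ASCII apostrophe is escaped
    print(escape_special_chars("company \u2019s"))  # Unicode variant is left as-is
    print(unescape_special_chars("company &apos;s"))
```

If the training data was tokenised with this kind of escaping, a test sentence containing an unescaped or non-ASCII apostrophe will not match the phrase table entries, which is consistent with the untranslated 's in the output above.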