Re: [Moses-support] The BLEU score from MultiEval is much lower than the one generated by the Moses mert-moses.pl script
Hi Barry,

Thanks for your information. The scores are calculated by MultiEval on the test set, and I used only one reference in development. I re-calculated the BLEU score via multi-bleu.pl:

BLEU = 29.02, 65.8/36.2/22.0/13.7 (BP=0.996, ratio=0.996, hyp_len=19684, ref_len=19755)

It's very close to the scores calculated by MultiEval now. I'm also very interested in the multiple references. Does that mean I need to use multiple development sets to tune the MT engine's weights?

Thanks,
Jun

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Thursday, 24 January 2013 5:44 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] The BLEU score from MultiEval is much lower than the one generated by the Moses mert-moses.pl script

Hi Jun

mert-moses.pl is not an evaluation script; it's for tuning the MT engine. It will report BLEU scores obtained during tuning, but these are on the development set. The scores you're showing from MultiEval are (I hope!) on the test set, which would make them different. It's quite a big difference between development and test though - are you using multiple references in development?

The NaNs in the MultiEval output are a bit strange. I'm not familiar with this tool, but Moses contains multi-bleu.pl (in scripts/generic) which you can also use to calculate BLEU.

cheers - Barry

On 24/01/13 02:49, Tan, Jun wrote:

Hello all,

I have created an English-Chinese MT engine via Moses, and I'm doing a translation quality evaluation of this engine. I have an evaluation report created by the MultiEval tool on about 1000 sentences. I found that the BLEU score is much lower than the score generated by the mert-moses.pl script: it's only 0.3 from MultiEval, but 0.65 from mert-moses.pl.
MultiEval report:

            BLEU (s_sel/s_opt/p)   METEOR (s_sel/s_opt/p)   TER (s_sel/s_opt/p)   Length (s_sel/s_opt/p)
EMC DATA    29.0 (0.6/NaN/-)       31.7 (0.3/NaN/-)         57.1 (0.7/NaN/-)      100.4 (0.6/NaN/-)
TAUS DATA   21.8 (0.5/NaN/0.00)    28.1 (0.2/NaN/0.00)      61.8 (0.6/NaN/0.00)   97.5 (0.6/NaN/0.00)

Top unmatched hypothesis words according to METEOR:
[的 x 341, , x 177, 在 x 117, &quot; x 91, 和 x 85, 中 x 84, 到 x 84, 将 x 74, / x 65, 一个 x 65]
[的 x 436, , x 273, 在 x 163, 将 x 85, 中 x 82, 时 x 71, 上 x 65, 以 x 54, 为 x 52, 数据 x 50]
[的 x 400, , x 197, 在 x 139, 一个 x 91, 数据 x 89, 将 x 89, 是 x 85, “ x 85, 和 x 82, 数据域 x 77]
[的 x 369, , x 227, 在 x 151, Domain x 139, Data x 136, 数据 x 115, 上 x 96, 中 x 93, 将 x 86, 消除 x 83]

I have the following questions regarding this issue:
1. What causes this issue?
2. Has anyone else had a similar experience?
3. Is it normal?
4. Which tool do you recommend for MT evaluation?
5. How can I improve the engine according to the MultiEval report?

Any question or suggestion is welcome.

Thanks,
Jun

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
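For readers who want to sanity-check such numbers: multi-bleu.pl computes corpus-level BLEU as the geometric mean of the modified 1- to 4-gram precisions, multiplied by a brevity penalty exp(1 - ref_len/hyp_len) when the hypothesis is shorter than the reference. Below is a minimal single-reference sketch of that computation in Python (the function names are mine; this is an illustration, not the actual Perl script):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in the style of multi-bleu.pl (single reference)."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

Plugging in the figures above (precisions 65.8/36.2/22.0/13.7 with BP=0.996) gives roughly 0.29, consistent with the reported BLEU = 29.02.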
[Moses-support] The BLEU score from MultiEval is much lower than the one generated by the Moses mert-moses.pl script
Hello all,

I have created an English-Chinese MT engine via Moses, and I'm doing a translation quality evaluation of this engine. I have an evaluation report created by the MultiEval tool on about 1000 sentences. I found that the BLEU score is much lower than the score generated by the mert-moses.pl script: it's only 0.3 from MultiEval, but 0.65 from mert-moses.pl.

MultiEval report:

            BLEU (s_sel/s_opt/p)   METEOR (s_sel/s_opt/p)   TER (s_sel/s_opt/p)   Length (s_sel/s_opt/p)
EMC DATA    29.0 (0.6/NaN/-)       31.7 (0.3/NaN/-)         57.1 (0.7/NaN/-)      100.4 (0.6/NaN/-)
TAUS DATA   21.8 (0.5/NaN/0.00)    28.1 (0.2/NaN/0.00)      61.8 (0.6/NaN/0.00)   97.5 (0.6/NaN/0.00)

Top unmatched hypothesis words according to METEOR:
[的 x 341, , x 177, 在 x 117, &quot; x 91, 和 x 85, 中 x 84, 到 x 84, 将 x 74, / x 65, 一个 x 65]
[的 x 436, , x 273, 在 x 163, 将 x 85, 中 x 82, 时 x 71, 上 x 65, 以 x 54, 为 x 52, 数据 x 50]
[的 x 400, , x 197, 在 x 139, 一个 x 91, 数据 x 89, 将 x 89, 是 x 85, “ x 85, 和 x 82, 数据域 x 77]
[的 x 369, , x 227, 在 x 151, Domain x 139, Data x 136, 数据 x 115, 上 x 96, 中 x 93, 将 x 86, 消除 x 83]

I have the following questions regarding this issue:
1. What causes this issue?
2. Has anyone else had a similar experience?
3. Is it normal?
4. Which tool do you recommend for MT evaluation?
5. How can I improve the engine according to the MultiEval report?

Any question or suggestion is welcome.

Thanks,
Jun
Re: [Moses-support] Does Moses support a binarised translation table for a factored model?
Hi Koehn,

So the factor separator must be |? I tagged all the data via another tool, whose default separator is _. I also noticed that the separator of the target phrases in the phrase table is |, even though I changed the separator to _ during the training process. I changed all the separators in the phrase table from | to _, and the decoding did work.

-----Original Message-----
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Wednesday, September 05, 2012 4:22 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Does Moses support a binarised translation table for a factored model?

Hi,

this should be working. What seems odd to me is that you are using _ as the factor separator, while it is standard to use |. There is no option in processPhraseTable to change the separator.

-phi

On Tue, Sep 4, 2012 at 6:15 AM, Tan, Jun jun@emc.com wrote:

Hi all,

I built a factored model following the guideline on the Moses web page. In order to speed up decoding, I'm trying to use a binarised phrase table. The binarisation process finished, but when trying to decode with the binarised phrase table, the translation failed: the input and output are the same. Does Moses support a binarised translation table for a factored model? Has anybody else met this issue?
Below are the outputs of the decoding process:

1. Decoding with the binarised phrase table:

[root@Redhat-252 binarised-model]# echo 'the_DT' | /data/moses/moses-smt-mosesdecoder/bin/moses -f moses.ini
Defined parameters (per moses.ini or switch):
  config: moses.ini
  distortion-limit: 6
  factor-delimiter: _
  input-factors: 0
  lmodel-file: 0 0 3 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn
  mapping: 0 T 0
  ttable-file: 1 0 0,1 5 /data/english-chinese_POS_tag/binarised-model/phrase-table
  ttable-limit: 20
  weight-d: 0.6
  weight-l: 0.2500 0.2500
  weight-t: 0.20 0.20 0.20 0.20 0.20
  weight-w: -1
/data/moses/moses-smt-mosesdecoder/bin
Loading lexical distortion models... have 0 models
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : [0.001] seconds
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679: warning: non-zero probability for <unk> in closed-vocabulary LM
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : [7.148] seconds
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46: warning: non-zero probability for <unk> in closed-vocabulary LM
Finished loading LanguageModels : [7.214] seconds
Start loading PhraseTable /data/english-chinese_POS_tag/binarised-model/phrase-table : [7.214] seconds
filePath: /data/english-chinese_POS_tag/binarised-model/phrase-table
Finished loading phrase tables : [7.214] seconds
IO from STDOUT/STDIN
Created input-output object : [7.214] seconds
Translating line 0 in thread id 140249033144064
Translating: the
reading bin ttable
size of OFF_T 8
binary phrasefile loaded, default OFF_T: -1
Line 0: Collecting options took 0.000 seconds
Line 0: Search took 0.000 seconds
the
BEST TRANSLATION: the_UNK_UNK_UNK [1] [total=-111.439] 0.000, -1.000, -100.000, -23.206, -26.549, 0.000, 0.000, 0.000, 0.000, 0.000 0-0
Line 0: Translation took 0.894 seconds total

2. Normal decoding:

[root@Redhat-252 english-chinese_POS_tag]# echo 'the_DT' | /data/moses/moses-smt-mosesdecoder/bin/moses -f train/model/moses.ini
Defined parameters (per moses.ini or switch):
  config: train/model/moses.ini
  distortion-limit: 6
  factor-delimiter: _
  input-factors: 0
  lmodel-file: 0 0 3 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn
  mapping: 0 T 0
  ttable-file: 0 0 0,1 5 /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz
  ttable-limit: 20
  weight-d: 0.6
  weight-l: 0.2500 0.2500
  weight-t: 0.20 0.20 0.20 0.20 0.20
  weight-w: -1
/data/moses/moses-smt-mosesdecoder/bin
Loading lexical distortion models... have 0 models
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : [0.000] seconds
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679: warning: non-zero probability for <unk> in closed-vocabulary LM
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : [4.239] seconds
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46: warning: non-zero probability for <unk> in closed-vocabulary LM
Finished loading LanguageModels : [4.254] seconds
Start loading
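Since processPhraseTable has no option to change the separator, the practical fix is to convert the tagged data to the standard | delimiter before training and binarising. A rough sketch of such a conversion (the helper name is mine; it assumes one factor per token, attached with the rightmost _):

```python
def convert_factor_delimiter(line, old="_", new="|"):
    """Rewrite 'word_TAG' tokens as 'word|TAG'.

    Only the rightmost occurrence of the old delimiter is replaced,
    so words that themselves contain the old delimiter keep it.
    """
    out = []
    for token in line.split():
        head, sep, tail = token.rpartition(old)
        out.append(head + new + tail if sep else token)
    return " ".join(out)
```

For example, "the_DT cat_NN" becomes "the|DT cat|NN", while "data_base_NN" becomes "data_base|NN" with its internal underscore preserved.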
[Moses-support] Does Moses support a binarised translation table for a factored model?
Hi all,

I built a factored model following the guideline on the Moses web page. In order to speed up decoding, I'm trying to use a binarised phrase table. The binarisation process finished, but when trying to decode with the binarised phrase table, the translation failed: the input and output are the same. Does Moses support a binarised translation table for a factored model? Has anybody else met this issue?

Below are the outputs of the decoding process:

1. Decoding with the binarised phrase table:

[root@Redhat-252 binarised-model]# echo 'the_DT' | /data/moses/moses-smt-mosesdecoder/bin/moses -f moses.ini
Defined parameters (per moses.ini or switch):
  config: moses.ini
  distortion-limit: 6
  factor-delimiter: _
  input-factors: 0
  lmodel-file: 0 0 3 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn
  mapping: 0 T 0
  ttable-file: 1 0 0,1 5 /data/english-chinese_POS_tag/binarised-model/phrase-table
  ttable-limit: 20
  weight-d: 0.6
  weight-l: 0.2500 0.2500
  weight-t: 0.20 0.20 0.20 0.20 0.20
  weight-w: -1
/data/moses/moses-smt-mosesdecoder/bin
Loading lexical distortion models... have 0 models
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : [0.001] seconds
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679: warning: non-zero probability for <unk> in closed-vocabulary LM
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : [7.148] seconds
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46: warning: non-zero probability for <unk> in closed-vocabulary LM
Finished loading LanguageModels : [7.214] seconds
Start loading PhraseTable /data/english-chinese_POS_tag/binarised-model/phrase-table : [7.214] seconds
filePath: /data/english-chinese_POS_tag/binarised-model/phrase-table
Finished loading phrase tables : [7.214] seconds
IO from STDOUT/STDIN
Created input-output object : [7.214] seconds
Translating line 0 in thread id 140249033144064
Translating: the
reading bin ttable
size of OFF_T 8
binary phrasefile loaded, default OFF_T: -1
Line 0: Collecting options took 0.000 seconds
Line 0: Search took 0.000 seconds
the
BEST TRANSLATION: the_UNK_UNK_UNK [1] [total=-111.439] 0.000, -1.000, -100.000, -23.206, -26.549, 0.000, 0.000, 0.000, 0.000, 0.000 0-0
Line 0: Translation took 0.894 seconds total

2. Normal decoding:

[root@Redhat-252 english-chinese_POS_tag]# echo 'the_DT' | /data/moses/moses-smt-mosesdecoder/bin/moses -f train/model/moses.ini
Defined parameters (per moses.ini or switch):
  config: train/model/moses.ini
  distortion-limit: 6
  factor-delimiter: _
  input-factors: 0
  lmodel-file: 0 0 3 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn
  mapping: 0 T 0
  ttable-file: 0 0 0,1 5 /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz
  ttable-limit: 20
  weight-d: 0.6
  weight-l: 0.2500 0.2500
  weight-t: 0.20 0.20 0.20 0.20 0.20
  weight-w: -1
/data/moses/moses-smt-mosesdecoder/bin
Loading lexical distortion models... have 0 models
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : [0.000] seconds
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679: warning: non-zero probability for <unk> in closed-vocabulary LM
Start loading LanguageModel /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : [4.239] seconds
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46: warning: non-zero probability for <unk> in closed-vocabulary LM
Finished loading LanguageModels : [4.254] seconds
Start loading PhraseTable /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz : [4.254] seconds
filePath: /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz
Finished loading phrase tables : [4.254] seconds
Start loading phrase table from /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz : [4.254] seconds
Reading /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Finished loading phrase tables : [422.886] seconds
IO from STDOUT/STDIN
Created input-output object : [422.895] seconds
Translating line 0 in thread id 139991742867200
Translating: the
Line 0: Collecting options took 0.061 seconds
Line 0: Search took 0.185 seconds
在
BEST TRANSLATION: 在_P [1] [total=-6.025] 0.000, -1.000, 0.000, -12.496, -9.723, -1.545, -1.590, -2.312, -2.906, 1.000
Line 0: Translation took 0.247 seconds total
Re: [Moses-support] What will happen if training Moses with a duplicated corpus?
Hi Koehn,

The phrase table has too many lines for me to check them all. I did look at both files and found that the corpus is probably not clean enough; there are lots of meaningless phrases.

[root@Redhat-251 tmp]# wc -l phrase-table
19992218 phrase-table
[root@Redhat-251 tmp]# wc -l phrase-table1
21546088 phrase-table1

-----Original Message-----
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Thursday, August 30, 2012 5:02 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] What will happen if training Moses with a duplicated corpus?

Hi,

this is a bit odd - if the phrase table is larger, then it must contain phrase pairs that were not in the original phrase table. However, these were extracted from the same data - why were they not extracted in the first place? Can you check this?

I am not surprised that the language model is larger if you used default settings, since there will be fewer singletons (actually, none) to be pruned out, but I would have expected a bigger increase than 10%.

-phi

On Tue, Aug 28, 2012 at 7:23 PM, Tan, Jun jun@emc.com wrote:

Hi Koehn,

Thanks for your reply. I checked both phrase tables; most entries are the same. The difference is that the phrase table created from the duplicated corpus is about 5% larger than the one from the original corpus. For the language model, the one from the duplicated corpus is 10% larger. I think the tuning processes are the same for both Moses engines; the only change is the training data. The steps and the tuning data are the same for both.

-----Original Message-----
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Wednesday, August 29, 2012 4:31 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] What will happen if training Moses with a duplicated corpus?
Hi,

It is not obvious to me why this would happen due to data duplication - there are things like Good-Turing smoothing that would be affected by count doubling, but that is not turned on by default. Do the phrase translation tables look at all different?

There is a clear effect on language model training if you double the data, because SRILM's ngram-count by default drops higher-order singletons (which would not exist in a doubled corpus).

It may just be due to different tuning runs (which are random processes that add noise). You could check this by re-using the weights from the other run, and vice versa.

-phi

On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun jun@emc.com wrote:

Hi all,

Just like the thread title says, what will happen in that situation? I did an experiment creating two Moses translation models: one from the original corpus, the other from two copies of the same corpus. In the end, I found that the BLEU scores differ slightly between the two models: the model built from two copies of the corpus scores about 1.2% higher than the one built from the original corpus. Can anybody tell me whether this is normal? What is the impact of using many copies of the same corpus to create the model?
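To see why doubling the corpus changes what survives SRILM's default singleton pruning, here is a toy illustration (plain Python counting, not SRILM itself; the min_count=2 threshold stands in for ngram-count's default behaviour of dropping higher-order n-grams seen only once):

```python
from collections import Counter

def surviving_ngrams(corpus, n, min_count=2):
    """N-grams kept if their count reaches min_count (mimicking singleton pruning)."""
    counts = Counter()
    for sent in corpus:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return {g for g, c in counts.items() if c >= min_count}

corpus = ["the cat sat on the mat", "a dog sat on a log"]
single = surviving_ngrams(corpus, 3)       # singletons dropped
doubled = surviving_ngrams(corpus * 2, 3)  # every count is doubled
```

In this toy corpus every trigram occurs exactly once, so all of them would be pruned from the single-copy model; after doubling, every count is 2 and all trigrams survive, which is consistent with the duplicated-corpus language model being larger.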
Re: [Moses-support] What will happen if training Moses with a duplicated corpus?
Hi Koehn,

Thanks for your reply. I checked both phrase tables; most entries are the same. The difference is that the phrase table created from the duplicated corpus is about 5% larger than the one from the original corpus. For the language model, the one from the duplicated corpus is 10% larger. I think the tuning processes are the same for both Moses engines; the only change is the training data. The steps and the tuning data are the same for both.

-----Original Message-----
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Wednesday, August 29, 2012 4:31 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] What will happen if training Moses with a duplicated corpus?

Hi,

It is not obvious to me why this would happen due to data duplication - there are things like Good-Turing smoothing that would be affected by count doubling, but that is not turned on by default. Do the phrase translation tables look at all different?

There is a clear effect on language model training if you double the data, because SRILM's ngram-count by default drops higher-order singletons (which would not exist in a doubled corpus).

It may just be due to different tuning runs (which are random processes that add noise). You could check this by re-using the weights from the other run, and vice versa.

-phi

On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun jun@emc.com wrote:

Hi all,

Just like the thread title says, what will happen in that situation? I did an experiment creating two Moses translation models: one from the original corpus, the other from two copies of the same corpus. In the end, I found that the BLEU scores differ slightly between the two models: the model built from two copies of the corpus scores about 1.2% higher than the one built from the original corpus. Can anybody tell me whether this is normal? What is the impact of using many copies of the same corpus to create the model?
[Moses-support] What will happen if training Moses with a duplicated corpus?
Hi all,

Just like the thread title says, what will happen in that situation? I did an experiment creating two Moses translation models: one from the original corpus, the other from two copies of the same corpus. In the end, I found that the BLEU scores differ slightly between the two models: the model built from two copies of the corpus scores about 1.2% higher than the one built from the original corpus. Can anybody tell me whether this is normal? What is the impact of using many copies of the same corpus to create the model?
[Moses-support] Malformed input issue during decoding when using a factored model
Hi all,

I'm learning about factored models and tried to create one following the guideline on the Moses website. Everything went fine during the creation process, but I got a "Malformed input" error during the first decoding attempt, as below:

Loading lexical distortion models... have 0 models
Start loading LanguageModel /tmp/factored-corpus/english-chinese/1500.en.lm.cn : [0.000] seconds
/tmp/factored-corpus/english-chinese/1500.en.lm.cn: line 5700: warning: non-zero probability for <unk> in closed-vocabulary LM
Start loading LanguageModel /tmp/factored-corpus/english-chinese/1500.en.pos.lm.cn : [0.000] seconds
/tmp/factored-corpus/english-chinese/1500.en.pos.lm.cn: line 42: warning: non-zero probability for <unk> in closed-vocabulary LM
Finished loading LanguageModels : [0.000] seconds
Start loading PhraseTable /tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz : [0.000] seconds
filePath: /tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz
Finished loading phrase tables : [0.000] seconds
Start loading phrase table from /tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz : [0.000] seconds
Reading /tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
[ERROR] Malformed input: '!|PU'
In ' !|PU '
Expected input to have words composed of 2 factor(s) (form FAC1|FAC2|...) but instead received input with 1 factor(s).
Aborted (core dumped)

I searched the moses-support mail archive and got some helpful information from this thread: http://www.mail-archive.com/moses-support@mit.edu/msg03209.html. I found that this issue is caused by the wrong delimiter being used for the target-language phrases in the phrase table. The phrase table looks like this:

!_. ||| !|PU ||| 1 0.545454 0.714286 0.26087 2.718 ||| ||| 5 7
!_. ||| 。|PU ||| 0.00139665 0.0027529 0.285714 0.173913 2.718 ||| ||| 1432 7

When I replace the delimiter "|" with "_", the issue is gone. And here is my question: since I have already used the option "--factor-delimiter=_" during the training process, why is the delimiter for the target-language phrases still the default delimiter "|"? The configuration for the delimiter in moses.ini is as below:

# delimiter between factors in input
[factor-delimiter]
_
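The decoder aborts because it expects every target token to carry exactly the configured number of factors. A quick sanity check over a phrase table's target side can catch a mismatched delimiter before decoding; here is a sketch (the function name is mine, while the 2-factor form and the | delimiter come from the error message above):

```python
def check_factors(phrase, delimiter="|", expected=2):
    """Return the tokens whose factor count differs from `expected`."""
    bad = []
    for token in phrase.split():
        if len(token.split(delimiter)) != expected:
            bad.append(token)
    return bad
```

For instance, checking '!_PU' against a | delimiter flags the token (it parses as a single factor), which mirrors the [ERROR] above where the delimiters in the data and the table disagree.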
Re: [Moses-support] How does Moses handle apostrophes?
Hi Barry,

How do I check the Moses version? I'm sure that the tokeniser for training is the same as for testing. I'm using the Stanford Word Segmenter for the Chinese language.

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Tuesday, August 07, 2012 4:43 PM
To: Tan, Jun
Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

Hi Jun

Is the apostrophe in your source data an ascii apostrophe, or a unicode variant (use xxd to check this)? As Tom said, recent versions of the Moses tokeniser escape apostrophes, so either you're using an old version, or it does not recognise it as an apostrophe. Make sure you are using the same tokeniser in training and test.

cheers - Barry

On 07/08/12 06:38, jun@emc.com wrote:

Yes, I'm using Moses' tokenizer.perl for the English language, and Moses got installed in June, so the version should be relatively new. Do you have any ideas how to fix it?

From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
Sent: Tuesday, August 07, 2012 1:13 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

If you're using Moses' tokenizer.perl script, the English handling separates "the company's" into "company 's". In recent (~2 months) moses github releases, the tokenizer.perl script also escapes the string to "company&apos;s". The English detokenizer unescapes the &apos;s to 's and restores it without the preceding space.

On Tue, 7 Aug 2012 00:33:07 -0400, jun@emc.com wrote:

Hi all,

When I use Moses to translate sentences containing apostrophes, it doesn't work correctly.

Source: EMC Corporation (NYSE:EMC) today reported strong financial results for the second quarter of 2012, marking the company's 10th consecutive quarter of double-digit year-over-year growth for consolidated revenue, GAAP net income, and GAAP and non-GAAP EPS. EMC expects to achieve its full-year 2012 goals for consolidated revenue, non-GAAP EPS and free cash flow.

Translation result: 2012 年 7 月 24 日 — EMC 公司 ( NYSE : EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2 季度 , 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 , 以 实现 整合 的 收入 、 GAAP 净 收入 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、 非 GAAP EPS 和 自由 现金流 。

As we can see, the translation result of "company's" is "公司 's", and the translation of the apostrophe (') and the letter (s) failed. Does anybody know the cause of this issue? Do I need some other module to handle it? Does anybody know how to fix it? Below is an example:

Thanks
Re: [Moses-support] How does Moses handle apostrophes?
Hi Barry,

I think the version is new; below is the relevant section of tokenizer.perl:

#escape special chars
$text =~ s/\&/\&amp;/g;   # escape escape
$text =~ s/\|/\&#124;/g;  # factor separator
$text =~ s/\</\&lt;/g;    # xml
$text =~ s/\>/\&gt;/g;    # xml
$text =~ s/\'/\&apos;/g;  # xml
$text =~ s/\"/\&quot;/g;  # xml
$text =~ s/\[/\&#91;/g;   # syntax non-terminal
$text =~ s/\]/\&#93;/g;   # syntax non-terminal

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Tuesday, August 07, 2012 5:55 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

Hi Jun

Recent versions of the tokeniser have a line like

$text =~ s/\'/\&apos;/g; # xml

to escape apostrophes.

cheers - Barry

On 07/08/12 09:51, Tan, Jun wrote:

Hi Barry,

How do I check the Moses version? I'm sure that the tokeniser for training is the same as for testing. I'm using the Stanford Word Segmenter for the Chinese language.

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Tuesday, August 07, 2012 4:43 PM
To: Tan, Jun
Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

Hi Jun

Is the apostrophe in your source data an ascii apostrophe, or a unicode variant (use xxd to check this)? As Tom said, recent versions of the Moses tokeniser escape apostrophes, so either you're using an old version, or it does not recognise it as an apostrophe. Make sure you are using the same tokeniser in training and test.

cheers - Barry

On 07/08/12 06:38, jun@emc.com wrote:

Yes, I'm using Moses' tokenizer.perl for the English language, and Moses got installed in June, so the version should be relatively new. Do you have any ideas how to fix it?

From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
Sent: Tuesday, August 07, 2012 1:13 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

If you're using Moses' tokenizer.perl script, the English handling separates "the company's" into "company 's". In recent (~2 months) moses github releases, the tokenizer.perl script also escapes the string to "company&apos;s". The English detokenizer unescapes the &apos;s to 's and restores it without the preceding space.

On Tue, 7 Aug 2012 00:33:07 -0400, jun@emc.com wrote:

Hi all,

When I use Moses to translate sentences containing apostrophes, it doesn't work correctly.

Source: EMC Corporation (NYSE:EMC) today reported strong financial results for the second quarter of 2012, marking the company's 10th consecutive quarter of double-digit year-over-year growth for consolidated revenue, GAAP net income, and GAAP and non-GAAP EPS. EMC expects to achieve its full-year 2012 goals for consolidated revenue, non-GAAP EPS and free cash flow.

Translation result: 2012 年 7 月 24 日 — EMC 公司 ( NYSE : EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2 季度 , 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 , 以 实现 整合 的 收入 、 GAAP 净 收入 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、 非 GAAP EPS 和 自由 现金流 。

As we can see, the translation result of "company's" is "公司 's", and the translation of the apostrophe (') and the letter (s) failed. Does anybody know the cause of this issue? Do I need some other module to handle it? Does anybody know how to fix it? Below is an example:

Thanks
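The escaping block quoted above can be mirrored in Python to test inputs quickly (a sketch of the same substitutions, not the Perl script itself; the & rule must run first, otherwise the ampersands introduced by the other entities would themselves be escaped):

```python
# Substitution pairs mirroring tokenizer.perl's "escape special chars" block.
ESCAPES = [
    ("&", "&amp;"),   # must run first
    ("|", "&#124;"),  # factor separator
    ("<", "&lt;"),    # xml
    (">", "&gt;"),    # xml
    ("'", "&apos;"),  # xml
    ('"', "&quot;"),  # xml
    ("[", "&#91;"),   # syntax non-terminal
    ("]", "&#93;"),   # syntax non-terminal
]

def escape_special_chars(text):
    for ch, ent in ESCAPES:
        text = text.replace(ch, ent)
    return text
```

For example, escape_special_chars("company's") yields "company&apos;s", which the detokeniser later restores to 's without the preceding space.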
Re: [Moses-support] How does Moses handle apostrophes?
Hi Barry,

I have checked the source data for training. I found that some of the apostrophes have already been converted to '&apos;', but there are still some characters like ’ and &#91;. As I understand it, the tool you mentioned converts apostrophes from Unicode variants to ASCII, so the tool only works for English-to-Chinese translation. Is that right? The apostrophe in Chinese is two-byte, while in English it is one-byte. If I use the tool (http://www.statmt.org/wmt11/normalize-punctuation.perl), what will the translation result of the apostrophes (’, ‘, ') be?

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Tuesday, August 07, 2012 6:18 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

Hi Jun

If you're using this version of the tokeniser on your source sentence, then I would expect it to convert the apostrophe to &apos;. The fact that there is no &apos; in your output suggests that either the decoder is translating it to ' (unlikely), or the apostrophe in your source is not a regular apostrophe but some unicode variant. So you need to check for that.

This script will normalise a lot of the punctuation: http://www.statmt.org/wmt11/normalize-punctuation.perl

However if you use it, then you should also run it over your training data, and retrain.
cheers - Barry

On 07/08/12 11:00, Tan, Jun wrote:

Hi Barry,

I think the version is new; below is the relevant section of tokenizer.perl:

#escape special chars
$text =~ s/\&/\&amp;/g;   # escape escape
$text =~ s/\|/\&#124;/g;  # factor separator
$text =~ s/\</\&lt;/g;    # xml
$text =~ s/\>/\&gt;/g;    # xml
$text =~ s/\'/\&apos;/g;  # xml
$text =~ s/\"/\&quot;/g;  # xml
$text =~ s/\[/\&#91;/g;   # syntax non-terminal
$text =~ s/\]/\&#93;/g;   # syntax non-terminal

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Tuesday, August 07, 2012 5:55 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

Hi Jun

Recent versions of the tokeniser have a line like

$text =~ s/\'/\&apos;/g; # xml

to escape apostrophes.

cheers - Barry

On 07/08/12 09:51, Tan, Jun wrote:

Hi Barry,

How do I check the Moses version? I'm sure that the tokeniser for training is the same as for testing. I'm using the Stanford Word Segmenter for the Chinese language.

-----Original Message-----
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
Sent: Tuesday, August 07, 2012 4:43 PM
To: Tan, Jun
Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

Hi Jun

Is the apostrophe in your source data an ascii apostrophe, or a unicode variant (use xxd to check this)? As Tom said, recent versions of the Moses tokeniser escape apostrophes, so either you're using an old version, or it does not recognise it as an apostrophe. Make sure you are using the same tokeniser in training and test.

cheers - Barry

On 07/08/12 06:38, jun@emc.com wrote:

Yes, I'm using Moses' tokenizer.perl for the English language, and Moses got installed in June, so the version should be relatively new. Do you have any ideas how to fix it?

From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
Sent: Tuesday, August 07, 2012 1:13 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] How does Moses handle apostrophes?

If you're using Moses' tokenizer.perl script, the English handling separates "the company's" into "company 's". In recent (~2 months) moses github releases, the tokenizer.perl script also escapes the string to "company&apos;s". The English detokenizer unescapes the &apos;s to 's and restores it without the preceding space.

On Tue, 7 Aug 2012 00:33:07 -0400, jun@emc.com wrote:

Hi all,

When I use Moses to translate sentences containing apostrophes, it doesn't work correctly.

Source: EMC Corporation (NYSE:EMC) today reported strong financial results for the second quarter of 2012, marking the company's 10th consecutive quarter of double-digit year-over-year growth for consolidated revenue, GAAP net income, and GAAP and non-GAAP EPS. EMC expects to achieve its full-year 2012 goals for consolidated revenue, non-GAAP EPS and free cash flow.

Translation result: 2012 年 7 月 24 日 — EMC 公司 ( NYSE : EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2 季度 , 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 , 以 实现 整合 的 收入 、 GAAP 净 收入 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、 非 GAAP EPS 和 自由 现金流 。

As we can see, the translation result of "company's" is "公司 's", and the translation of the apostrophe (') and the letter (s) failed. Does anybody know the cause of this issue? Do I need some other module to handle it? Does anybody know how to fix it? Below is an example:

Thanks
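For checking sources like the one above, the typographic-to-ASCII mapping that normalize-punctuation.perl performs for quotes can be sketched as follows (a small representative subset of the script's replacements, not its full list):

```python
# Representative subset of the quote/apostrophe normalisations.
PUNCT_MAP = {
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}

def normalize_punct(text):
    for uni, ascii_ch in PUNCT_MAP.items():
        text = text.replace(uni, ascii_ch)
    return text
```

After normalisation, ’ becomes the ASCII apostrophe that the tokeniser then escapes to &apos;, so training and test data see a consistent token; as Barry notes, the training data must be normalised and the model retrained for this to help.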