Re: [Moses-support] The BLEU score from MultiEval is much lower than the one generated by the Moses mert-moses.pl script

2013-01-24 Thread Tan, Jun
Hi Barry,

Thanks for your information.  
The scores are calculated by MultiEval on the test set, and I used only one 
reference in development.  
I re-calculated the BLEU score via multi-bleu.pl: 
BLEU = 29.02, 65.8/36.2/22.0/13.7 (BP=0.996, ratio=0.996, hyp_len=19684, 
ref_len=19755)

It's very close to the scores calculated by MultiEval now. 
And I'm very interested in the multiple references. Does that mean I need to 
use multiple development sets to tune the MT engine's weights? 
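
(For what it's worth, a sketch of the usual multiple-reference convention, with 
hypothetical file names: as far as I know, both multi-bleu.pl and mert-moses.pl 
accept a reference "stem", and if files stem0, stem1, ... exist they are all 
used as alternative references for the same development set - one dev set, 
several reference translations:

  # dev.ref0, dev.ref1, dev.ref2 = three reference translations of dev.input
  $MOSES/scripts/training/mert-moses.pl dev.input dev.ref \
      $MOSES/bin/moses model/moses.ini
)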

Thanks,
Jun



-Original Message-
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk] 
Sent: Thursday, 24 January 2013 5:44 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] The BLEU score from MultiEval is much lower than 
the one generated by the Moses mert-moses.pl script

Hi Jun

mert-moses.pl is not an evaluation script, it's for tuning the MT 
engine. It will report BLEU scores obtained during tuning, but these are 
on the development set. The scores you're showing from MultiEval are (I 
hope!) on the test set, which would make them different. It's quite a 
big difference between development and test though - are you using 
multiple references in development?

The NaNs in the MultiEval output are a bit strange. I'm not familiar 
with this tool, but Moses contains multi-bleu.pl (in scripts/generic), 
which you can also use to calculate BLEU.
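
For example, a minimal sketch (in current checkouts the script is 
scripts/generic/multi-bleu.perl; the reference can be a single file, or a stem 
with ref0, ref1, ... for multiple references; file names are placeholders):

  $MOSES/scripts/generic/multi-bleu.perl test.ref < test.output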

cheers - Barry

On 24/01/13 02:49, Tan, Jun wrote:
 Hello all,
 I have created an English-Chinese MT engine via Moses. I’m doing a 
 translation quality evaluation regarding this engine. I have an 
 evaluation report created by the MultiEval tool on about 1000 sentences. I 
 found the BLEU score is much lower than the score generated by the 
 mert-moses.pl script. It’s only 0.3 from MultiEval, but 0.65 from 
 mert-moses.pl.
 MultiEval report:

              BLEU (s_sel/s_opt/p)   METEOR (s_sel/s_opt/p)   TER (s_sel/s_opt/p)   Length (s_sel/s_opt/p)
 EMC DATA     29.0 (0.6/NaN/-)       31.7 (0.3/NaN/-)         57.1 (0.7/NaN/-)      100.4 (0.6/NaN/-)
 TAUS DATA    21.8 (0.5/NaN/0.00)    28.1 (0.2/NaN/0.00)      61.8 (0.6/NaN/0.00)   97.5 (0.6/NaN/0.00)

 Top unmatched hypothesis words according to METEOR:
 [的 x 341, , x 177, 在 x 117, &quot; x 91, 和 x 85, 中 x 84, 到 x 84, 将 x 74, / x 65, 一个 x 65]
 [的 x 436, , x 273, 在 x 163, 将 x 85, 中 x 82, 时 x 71, 上 x 65, 以 x 54, 为 x 52, 数据 x 50]
 [的 x 400, , x 197, 在 x 139, 一个 x 91, 数据 x 89, 将 x 89, 是 x 85, “ x 85, 和 x 82, 数据域 x 77]
 [的 x 369, , x 227, 在 x 151, Domain x 139, Data x 136, 数据 x 115, 上 x 96, 中 x 93, 将 x 86, 消除 x 83]
 I have the following questions regarding this issue:

  1. What are the causes of this issue?
  2. Has anyone else had a similar experience?
  3. Is it normal?
  4. Which tool do you recommend for MT evaluation?
  5. How can I improve the engine according to the MultiEval report?

 Any questions or suggestions are welcome ~
 Thanks,
 Jun


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] The BLEU score from MultiEval is much lower than the one generated by the Moses mert-moses.pl script

2013-01-23 Thread Tan, Jun
Hello all,

I have created an English-Chinese MT engine via Moses.  I’m doing a 
translation quality evaluation regarding this engine. I have an evaluation report 
created by the MultiEval tool on about 1000 sentences. I found the BLEU score is 
much lower than the score generated by the mert-moses.pl script.  It’s only 0.3 
from MultiEval, but 0.65 from mert-moses.pl.


MultiEval report:

            BLEU (s_sel/s_opt/p)   METEOR (s_sel/s_opt/p)   TER (s_sel/s_opt/p)   Length (s_sel/s_opt/p)
EMC DATA    29.0 (0.6/NaN/-)       31.7 (0.3/NaN/-)         57.1 (0.7/NaN/-)      100.4 (0.6/NaN/-)
TAUS DATA   21.8 (0.5/NaN/0.00)    28.1 (0.2/NaN/0.00)      61.8 (0.6/NaN/0.00)   97.5 (0.6/NaN/0.00)

Top unmatched hypothesis words according to METEOR:
[的 x 341, , x 177, 在 x 117, &quot; x 91, 和 x 85, 中 x 84, 到 x 84, 将 x 74, / x 65, 一个 x 65]
[的 x 436, , x 273, 在 x 163, 将 x 85, 中 x 82, 时 x 71, 上 x 65, 以 x 54, 为 x 52, 数据 x 50]
[的 x 400, , x 197, 在 x 139, 一个 x 91, 数据 x 89, 将 x 89, 是 x 85, “ x 85, 和 x 82, 数据域 x 77]
[的 x 369, , x 227, 在 x 151, Domain x 139, Data x 136, 数据 x 115, 上 x 96, 中 x 93, 将 x 86, 消除 x 83]


I have the following questions regarding this issue:
1.  What are the causes of this issue?
2.  Has anyone else had a similar experience?
3.  Is it normal?
4.  Which tool do you recommend for MT evaluation?
5.  How can I improve the engine according to the MultiEval report?

Any questions or suggestions are welcome ~

Thanks,
Jun






___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Does Moses support binarised translation tables for factored models?

2012-09-04 Thread Tan, Jun
Hi Koehn,

So the factor separator must be |? 
I tagged all the data via another tool, and its default separator is _. 
I also noticed that the separator of the target phrases in the phrase table is |, 
even though I changed the separator to _ during the training process. I changed all 
the separators in the phrase table from | to _, and the decoding then worked. 
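
(For anyone else hitting this, one way to do the replacement - a sketch that 
rewrites the separator only inside the source and target phrase fields, 
leaving the ||| field delimiters and the score fields alone; file names are 
placeholders:

  zcat phrase-table.gz \
    | awk -F' \\|\\|\\| ' 'BEGIN{OFS=" ||| "} {gsub(/\|/,"_",$1); gsub(/\|/,"_",$2); print}' \
    | gzip > phrase-table.underscore.gz
)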


-Original Message-
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Wednesday, September 05, 2012 4:22 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Does Moses support binarised translation tables for 
factored models?

Hi,

this should be working.

What seems odd to me is that you are using _ as the factor separator, while it 
is standard to use |. There is no option in processPhraseTable to change the 
separator.
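
For reference, the usual binarisation flow looks roughly like this (a sketch - 
paths are placeholders, and -nscores 5 matches the five weight-t values in 
your config):

  zcat model/phrase-table.0-0,1.gz | LC_ALL=C sort > phrase-table.sorted
  $MOSES/bin/processPhraseTable -ttable 0 0 - -nscores 5 \
      -out binarised-model/phrase-table < phrase-table.sorted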

-phi

On Tue, Sep 4, 2012 at 6:15 AM, Tan, Jun jun@emc.com wrote:
 Hi all,



 I built a factored model following the guideline on the Moses web page. In 
 order to speed up decoding, I’m trying to use the binarised phrase 
 table.

 The binarisation process finished, but when I try to decode with the 
 binarised phrase table, the translation fails.  The input and 
 output are the same.

 Does Moses support binarised translation tables for factored models? 
 Has anybody else met this issue?

 Below are the outputs of the decoding process:



 1. Decoding with binarised phrase table:

 [root@Redhat-252 binarised-model]# echo 'the_DT' | 
 /data/moses/moses-smt-mosesdecoder/bin/moses  -f moses.ini

 Defined parameters (per moses.ini or switch):

 config: moses.ini

 distortion-limit: 6

 factor-delimiter: _

 input-factors: 0

 lmodel-file: 0 0 3
 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 
 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn

 mapping: 0 T 0

 ttable-file: 1 0 0,1 5
 /data/english-chinese_POS_tag/binarised-model/phrase-table

 ttable-limit: 20

 weight-d: 0.6

 weight-l: 0.2500 0.2500

 weight-t: 0.20 0.20 0.20 0.20 0.20

 weight-w: -1

 /data/moses/moses-smt-mosesdecoder/bin

 Loading lexical distortion models...have 0 models

 Start loading LanguageModel
 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : 
 [0.001] seconds

 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679:
 warning: non-zero probability for unk in closed-vocabulary LM

 Start loading LanguageModel
 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : 
 [7.148] seconds

 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46:
 warning: non-zero probability for unk in closed-vocabulary LM

 Finished loading LanguageModels : [7.214] seconds

 Start loading PhraseTable
 /data/english-chinese_POS_tag/binarised-model/phrase-table : [7.214] 
 seconds

 filePath: /data/english-chinese_POS_tag/binarised-model/phrase-table

 Finished loading phrase tables : [7.214] seconds

 IO from STDOUT/STDIN

 Created input-output object : [7.214] seconds

 Translating line 0  in thread id 140249033144064

 Translating: the



 reading bin ttable

 size of OFF_T 8

 binary phrasefile loaded, default OFF_T: -1

 Line 0: Collecting options took 0.000 seconds

 Line 0: Search took 0.000 seconds

 the

 BEST TRANSLATION: the_UNK_UNK_UNK [1]  [total=-111.439] 0.000, 
 -1.000, -100.000, -23.206, -26.549, 0.000, 0.000, 0.000, 0.000, 
 0.000 0-0

 Line 0: Translation took 0.894 seconds total



 2. Normal decoding



 [root@Redhat-252 english-chinese_POS_tag]# echo 'the_DT' | 
 /data/moses/moses-smt-mosesdecoder/bin/moses -f train/model/moses.ini

 Defined parameters (per moses.ini or switch):

 config: train/model/moses.ini

 distortion-limit: 6

 factor-delimiter: _

 input-factors: 0

 lmodel-file: 0 0 3
 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 
 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn

 mapping: 0 T 0

 ttable-file: 0 0 0,1 5
 /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz

 ttable-limit: 20

 weight-d: 0.6

 weight-l: 0.2500 0.2500

 weight-t: 0.20 0.20 0.20 0.20 0.20

 weight-w: -1

 /data/moses/moses-smt-mosesdecoder/bin

 Loading lexical distortion models...have 0 models

 Start loading LanguageModel
 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : 
 [0.000] seconds

 /data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679:
 warning: non-zero probability for unk in closed-vocabulary LM

 Start loading LanguageModel
 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : 
 [4.239] seconds

 /data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46:
 warning: non-zero probability for unk in closed-vocabulary LM

 Finished loading LanguageModels : [4.254] seconds

 Start loading

[Moses-support] Does Moses support binarised translation tables for factored models?

2012-09-03 Thread Tan, Jun
Hi all,

I built a factored model following the guideline on the Moses web page. In order 
to speed up decoding, I’m trying to use the binarised phrase table.
The binarisation process finished, but when I try to decode with the binarised 
phrase table, the translation fails.  The input and output are the same.
Does Moses support binarised translation tables for factored models? Has anybody 
else met this issue?
Below are the outputs of the decoding process:

1. Decoding with binarised phrase table:
[root@Redhat-252 binarised-model]# echo 'the_DT' | 
/data/moses/moses-smt-mosesdecoder/bin/moses  -f moses.ini
Defined parameters (per moses.ini or switch):
config: moses.ini
distortion-limit: 6
factor-delimiter: _
input-factors: 0
lmodel-file: 0 0 3 
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn
mapping: 0 T 0
ttable-file: 1 0 0,1 5 
/data/english-chinese_POS_tag/binarised-model/phrase-table
ttable-limit: 20
weight-d: 0.6
weight-l: 0.2500 0.2500
weight-t: 0.20 0.20 0.20 0.20 0.20
weight-w: -1
/data/moses/moses-smt-mosesdecoder/bin
Loading lexical distortion models...have 0 models
Start loading LanguageModel 
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : [0.001] seconds
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679: 
warning: non-zero probability for unk in closed-vocabulary LM
Start loading LanguageModel 
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : [7.148] 
seconds
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46: 
warning: non-zero probability for unk in closed-vocabulary LM
Finished loading LanguageModels : [7.214] seconds
Start loading PhraseTable 
/data/english-chinese_POS_tag/binarised-model/phrase-table : [7.214] seconds
filePath: /data/english-chinese_POS_tag/binarised-model/phrase-table
Finished loading phrase tables : [7.214] seconds
IO from STDOUT/STDIN
Created input-output object : [7.214] seconds
Translating line 0  in thread id 140249033144064
Translating: the

reading bin ttable
size of OFF_T 8
binary phrasefile loaded, default OFF_T: -1
Line 0: Collecting options took 0.000 seconds
Line 0: Search took 0.000 seconds
the
BEST TRANSLATION: the_UNK_UNK_UNK [1]  [total=-111.439] 0.000, -1.000, 
-100.000, -23.206, -26.549, 0.000, 0.000, 0.000, 0.000, 0.000 0-0
Line 0: Translation took 0.894 seconds total

2. Normal decoding

[root@Redhat-252 english-chinese_POS_tag]# echo 'the_DT' | 
/data/moses/moses-smt-mosesdecoder/bin/moses -f train/model/moses.ini
Defined parameters (per moses.ini or switch):
config: train/model/moses.ini
distortion-limit: 6
factor-delimiter: _
input-factors: 0
lmodel-file: 0 0 3 
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn 0 1 3 
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn
mapping: 0 T 0
ttable-file: 0 0 0,1 5 
/data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz
ttable-limit: 20
weight-d: 0.6
weight-l: 0.2500 0.2500
weight-t: 0.20 0.20 0.20 0.20 0.20
weight-w: -1
/data/moses/moses-smt-mosesdecoder/bin
Loading lexical distortion models...have 0 models
Start loading LanguageModel 
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn : [0.000] seconds
/data/english-chinese_POS_tag/chinese-lm/english-chinese.lm.cn: line 125679: 
warning: non-zero probability for unk in closed-vocabulary LM
Start loading LanguageModel 
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn : [4.239] 
seconds
/data/english-chinese_POS_tag/chinese-pos-lm/english-chinese.lm.cn: line 46: 
warning: non-zero probability for unk in closed-vocabulary LM
Finished loading LanguageModels : [4.254] seconds
Start loading PhraseTable 
/data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz : [4.254] 
seconds
filePath: /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz
Finished loading phrase tables : [4.254] seconds
Start loading phrase table from 
/data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz : [4.254] 
seconds
Reading /data/english-chinese_POS_tag/train/model/phrase-table.0-0,1.gz
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

Finished loading phrase tables : [422.886] seconds
IO from STDOUT/STDIN
Created input-output object : [422.895] seconds
Translating line 0  in thread id 139991742867200
Translating: the

Line 0: Collecting options took 0.061 seconds
Line 0: Search took 0.185 seconds
在
BEST TRANSLATION: 在_P [1]  [total=-6.025] 0.000, -1.000, 0.000, -12.496, 
-9.723, -1.545, -1.590, -2.312, -2.906, 1.000
Line 0: Translation took 0.247 seconds total

Re: [Moses-support] What will happen if training Moses with a duplicated corpus?

2012-08-30 Thread Tan, Jun
Hi Koehn,

The phrase table has too many lines for me to check them all. 
I checked both files, and found that the corpus is probably not 
clean enough; there are lots of meaningless phrases.

[root@Redhat-251 tmp]# wc -l phrase-table
19992218 phrase-table
[root@Redhat-251 tmp]# wc -l phrase-table1
21546088 phrase-table1

-Original Message-
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Thursday, August 30, 2012 5:02 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] What will happen if training Moses with a duplicated 
corpus?

Hi,

this is a bit odd -

if the phrase table is larger, then it must contain phrase pairs that were not 
in the original phrase table. However, these were extracted from the same data 
- why were they not extracted in the first place?

Can you check this?
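
For example, something like this would list the phrase pairs (comparing only 
the source and target phrase fields, not the scores) that appear only in the 
second table - a sketch, assuming both tables fit through sort:

  awk -F' \\|\\|\\| ' '{print $1" ||| "$2}' phrase-table  | LC_ALL=C sort -u > pairs1
  awk -F' \\|\\|\\| ' '{print $1" ||| "$2}' phrase-table1 | LC_ALL=C sort -u > pairs2
  LC_ALL=C comm -13 pairs1 pairs2 | head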

I am not surprised that the language model is larger, if you used default 
settings, since there will be fewer singletons (actually, none) to be pruned 
out, but I would have expected a bigger increase than 10%.

-phi

On Tue, Aug 28, 2012 at 7:23 PM, Tan, Jun jun@emc.com wrote:
 Hi Koehn,

 Thanks for your reply.
 I checked both phrase tables; most of the entries are the same. The difference 
 is that the phrase table created from the duplicated corpus is about 5% larger 
 than the one from the original corpus. For the language model, the one from 
 the duplicated corpus is 10% larger than the one from the original corpus.

 I think the tuning processes are the same for both Moses engines; the only 
 change is the training data. The steps and the tuning data are the same for 
 both of them.


 -Original Message-
 From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of 
 Philipp Koehn
 Sent: Wednesday, August 29, 2012 4:31 AM
 To: Tan, Jun
 Cc: moses-support@mit.edu
 Subject: Re: [Moses-support] What will happen if training Moses with a 
 duplicated corpus?

 Hi,

 It is not obvious to me why this would happen due to data duplication - there 
 are things like Good-Turing smoothing that would be affected by count 
 doubling, but that is not turned on by default. Do the phrase translation 
 tables look at all different?

 There is a clear effect on language model training if you double the data, 
 because SRILM's ngram-count by default drops higher-order singletons (which 
 would not exist in a doubled corpus).

 It may just be due to different tuning runs (which are random processes 
 that add noise). You could check this by re-using the weights from the other 
 run, and vice versa.

 -phi

 On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun jun@emc.com wrote:
 Hi all,



 Just like the thread title says, what will happen in that situation?

 I did an experiment to create two Moses translation models, one 
 created from the original corpus, the other created from two copies of the 
 same corpus. In the end, I found that the BLEU score is a little 
 different between the two models.  The model with two copies of the 
 same corpus scores about 1.2% higher than the engine created from the original 
 corpus.



 Can anybody tell me whether it is normal?   What's the impact if I use a
 lot of copies of the same corpus to create the model?


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] What will happen if training Moses with a duplicated corpus?

2012-08-28 Thread Tan, Jun
Hi Koehn,

Thanks for your reply.
I checked both phrase tables; most of the entries are the same. The difference 
is that the phrase table created from the duplicated corpus is about 5% larger 
than the one from the original corpus. For the language model, the one from the 
duplicated corpus is 10% larger than the one from the original corpus. 

I think the tuning processes are the same for both Moses engines; the only 
change is the training data. The steps and the tuning data are the same for 
both of them.


-Original Message-
From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
Sent: Wednesday, August 29, 2012 4:31 AM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] What will happen if training Moses with a duplicated 
corpus?

Hi,

It is not obvious to me why this would happen due to data duplication - there 
are things like Good-Turing smoothing that would be affected by count doubling, 
but that is not turned on by default. Do the phrase translation tables look at 
all different?

There is a clear effect on language model training if you double the data, 
because SRILM's ngram-count by default drops higher-order singletons (which 
would not exist in a doubled corpus).
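
For example (a sketch, with placeholder file names - SRILM's default 
Good-Turing settings drop trigram-and-higher singletons; keeping them has to 
be requested explicitly):

  ngram-count -order 3 -text corpus.cn -lm lm.arpa.cn -gt3min 1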

It may just be due to different tuning runs (which are random processes that 
add noise). You could check this by re-using the weights from the other run, 
and vice versa.
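
For example (a sketch - assuming the tuned weights of each run end up in 
mert-work/moses.ini, and hypothetical run directories runA/ and runB/):

  # pull the tuned weight sections out of run B's configuration
  sed -n '/^\[weight/,/^$/p' runB/mert-work/moses.ini
  # paste them over the corresponding sections of run A's moses.ini,
  # re-decode the test set, and score; then do the same in reverse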

-phi

On Mon, Aug 27, 2012 at 7:11 PM, Tan, Jun jun@emc.com wrote:
 Hi all,



 Just like the thread title says, what will happen in that situation?

 I did an experiment to create two Moses translation models, one 
 created from the original corpus, the other created from two copies of the 
 same corpus. In the end, I found that the BLEU score is a little 
 different between the two models.  The model with two copies of the same 
 corpus scores about 1.2% higher than the engine created from the original corpus.



 Can anybody tell me whether it is normal?   What's the impact if I use a
 lot of copies of the same corpus to create the model?


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support



___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] What will happen if training Moses with a duplicated corpus?

2012-08-27 Thread Tan, Jun
Hi all,

Just like the thread title says, what will happen in that situation?
I did an experiment to create two Moses translation models, one created from the 
original corpus, the other created from two copies of the same corpus. In the 
end, I found that the BLEU score is a little different between the two models. 
The model with two copies of the same corpus scores about 1.2% higher than the 
engine created from the original corpus.

Can anybody tell me whether it is normal?   What's the impact if I use a lot 
of copies of the same corpus to create the model?
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Malformed input error during decoding when using a factored model

2012-08-23 Thread Tan, Jun
Hi all,

I’m learning about the factored model and tried to create a factored model 
following the guideline on the Moses website. Everything went fine during the 
creation process, but I got a “Malformed input” error the first time I 
decoded, as below:
Loading lexical distortion models...have 0 models
Start loading LanguageModel /tmp/factored-corpus/english-chinese/1500.en.lm.cn 
: [0.000] seconds
/tmp/factored-corpus/english-chinese/1500.en.lm.cn: line 5700: warning: 
non-zero probability for unk in closed-vocabulary LM
Start loading LanguageModel 
/tmp/factored-corpus/english-chinese/1500.en.pos.lm.cn : [0.000] seconds
/tmp/factored-corpus/english-chinese/1500.en.pos.lm.cn: line 42: warning: 
non-zero probability for unk in closed-vocabulary LM
Finished loading LanguageModels : [0.000] seconds
Start loading PhraseTable 
/tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz : 
[0.000] seconds
filePath: /tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz
Finished loading phrase tables : [0.000] seconds
Start loading phrase table from 
/tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz : 
[0.000] seconds
Reading /tmp/factored-corpus/english-chinese/train/model/phrase-table.0-0,1.gz
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

[ERROR] Malformed input: '!|PU'
In ' !|PU '
  Expected input to have words composed of 2 factor(s) (form FAC1|FAC2|...)
  but instead received input with 1 factor(s).
Aborted (core dumped)


I searched the moses-support mail archive and got some helpful information from 
this thread: http://www.mail-archive.com/moses-support@mit.edu/msg03209.html . I 
found that this issue is caused by a wrong delimiter in the target-language 
phrases in the phrase table.
The phrase-table looks like below:

!_. ||| !|PU ||| 1 0.545454 0.714286 0.26087 2.718 ||| ||| 5 7
!_. ||| 。|PU ||| 0.00139665 0.0027529 0.285714 0.173913 2.718 ||| ||| 1432 7

When I replace the delimiter “|” with “_”, the issue is gone. And here is my 
question: since I have already used the option “--factor-delimiter=_” during 
the training process, why is the delimiter for the target-language phrases 
still the default delimiter “|”?

The configuration for delimiter in the moses.ini is as below:
# delimiter between factors in input
[factor-delimiter]
_
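
(A quick way to see which separator the training step actually wrote into the 
target side is to peek at the second |||-delimited field - a sketch:

  zcat train/model/phrase-table.0-0,1.gz | head -3 \
    | awk -F' \\|\\|\\| ' '{print $2}'
)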






___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] how does Moses handle apostrophes?

2012-08-07 Thread Tan, Jun
Hi Barry,

How do I check the Moses version?  I'm sure that the tokeniser for training is 
the same as for testing. I'm using the Stanford Word Segmenter for Chinese. 

-Original Message-
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk] 
Sent: Tuesday, August 07, 2012 4:43 PM
To: Tan, Jun
Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
Subject: Re: [Moses-support] how does Moses handle apostrophes?

Hi Jun

Is the apostrophe in your source data an ascii apostrophe, or a unicode variant 
(use xxd to check this)? As Tom said, recent versions of the Moses tokeniser 
escape apostrophes, so either you're using an old version, or it does not 
recognise it as an apostrophe.

Make sure you are using the same tokeniser in training and test.

cheers - Barry

On 07/08/12 06:38, jun@emc.com wrote:
 Yes, I’m using Moses’ tokenizer.perl for the English language, and Moses was 
 installed in June, so the version should be relatively new.
 Do you have any ideas on how to fix it?
 From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
 Sent: Tuesday, August 07, 2012 1:13 PM
 To: Tan, Jun
 Cc: moses-support@mit.edu
 Subject: Re: [Moses-support] how does Moses handle apostrophes?


 If you're using Moses' tokenizer.perl script, the English handling separates 
 "the company's" into "company 's". In recent (~2 months) moses github 
 releases, the tokenizer.perl script also escapes this string to 
 "company &apos;s". The English detokenizer unescapes the "&apos;s" to "'s" and 
 restores it without the preceding space.
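
 A round trip looks roughly like this (a sketch with a made-up input sentence; 
 in recent checkouts both scripts live under scripts/tokenizer/):

   echo "the company's revenue" | perl $MOSES/scripts/tokenizer/tokenizer.perl -l en
   # -> the company &apos;s revenue
   perl $MOSES/scripts/tokenizer/detokenizer.perl -l en < translated.txt
   # -> unescapes &apos;s back to 's with no preceding space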



 On Tue, 7 Aug 2012 00:33:07 -0400, jun@emc.com 
 wrote:
 Hi all,

 When I use Moses to translate some sentences containing apostrophes, it 
 doesn’t work correctly.
 Source:
 EMC Corporation (NYSE:EMC) today reported strong financial results for the 
 second quarter of 2012, marking the company's 10th consecutive quarter of 
 double-digit year-over-year growth for consolidated revenue, GAAP net income, 
 and GAAP and non-GAAP EPS. EMC expects to achieve its full-year 2012 goals 
 for consolidated revenue, non-GAAP EPS and free cash flow.

 Translation result:
 2012 年 7 月 24 日 — EMC 公司 ( NYSE : EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2 
 季度 , 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 , 以 实现 整合 的 收入 、 GAAP 净 收入 
 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、 
 非 GAAP EPS 和 自由 现金流 。

 As we can see, the translation result of “company's” is “公司 's”, and the 
 translation of the apostrophe (') and the letter (s) failed.
 Does anybody know the cause of this issue? Do I need some other module to 
 handle it? Does anybody know how to fix it?  Below is an example:


 Thanks


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


--
The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.



___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] how does Moses handle apostrophes?

2012-08-07 Thread Tan, Jun
Hi Barry,

I think the version is new; below is the escaping code from tokenizer.perl:  
 #escape special chars
  $text =~ s/\&/\&amp;/g;   # escape escape
  $text =~ s/\|/\&#124;/g;  # factor separator
  $text =~ s/\</\&lt;/g;    # xml
  $text =~ s/\>/\&gt;/g;    # xml
  $text =~ s/\'/\&apos;/g;  # xml
  $text =~ s/\"/\&quot;/g;  # xml
  $text =~ s/\[/\&#91;/g;   # syntax non-terminal
  $text =~ s/\]/\&#93;/g;   # syntax non-terminal



-Original Message-
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk] 
Sent: Tuesday, August 07, 2012 5:55 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] how does Moses handle apostrophes?

Hi Jun

Recent versions of the tokeniser have a line like

$text =~ s/\'/\&apos;/g;  # xml

to escape apostrophes.

cheers - Barry

On 07/08/12 09:51, Tan, Jun wrote:
 Hi Barry,

 How do I check the Moses version?  I'm sure that the tokeniser for training is 
 the same as for testing. I'm using the Stanford Word Segmenter for Chinese.

 -Original Message-
 From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
 Sent: Tuesday, August 07, 2012 4:43 PM
 To: Tan, Jun
 Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
 Subject: Re: [Moses-support] how does Moses handle apostrophes?

 Hi Jun

 Is the apostrophe in your source data an ascii apostrophe, or a unicode 
 variant (use xxd to check this)? As Tom said, recent versions of the Moses 
 tokeniser escape apostrophes, so either you're using an old version, or it 
 does not recognise it as an apostrophe.

 Make sure you are using the same tokeniser in training and test.

 cheers - Barry

 On 07/08/12 06:38, jun@emc.com wrote:
 Yes, I’m using Moses’ tokenizer.perl for the English language, and Moses was 
 installed in June, so the version should be relatively new.
 Do you have any ideas on how to fix it?
 From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
 Sent: Tuesday, August 07, 2012 1:13 PM
 To: Tan, Jun
 Cc: moses-support@mit.edu
 Subject: Re: [Moses-support] how does Moses handle apostrophes?


 If you're using Moses' tokenizer.perl script, the English handling separates 
 "the company's" into "company 's". In recent (~2 months) moses github 
 releases, the tokenizer.perl script also escapes this string to 
 "company &apos;s". The English detokenizer unescapes the "&apos;s" to "'s" 
 and restores it without the preceding space.



 On Tue, 7 Aug 2012 00:33:07 -0400, jun@emc.com 
 wrote:
 Hi all,

 When I use Moses to translate some sentences containing apostrophes, it 
 doesn’t work correctly.
 Source:
 EMC Corporation (NYSE:EMC) today reported strong financial results for the 
 second quarter of 2012, marking the company's 10th consecutive quarter of 
 double-digit year-over-year growth for consolidated revenue, GAAP net 
 income, and GAAP and non-GAAP EPS. EMC expects to achieve its full-year 2012 
 goals for consolidated revenue, non-GAAP EPS and free cash flow.

 Translation result:
 2012 年 7 月 24 日 — EMC 公司 ( NYSE : EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2 
 季度 , 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 , 以 实现 整合 的 收入 、 GAAP 净 收入 
 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、 
 非 GAAP EPS 和 自由 现金流 。

 As we can see, the translation result of “company's” is “公司 's”, and the 
 translation of the apostrophe (') and the letter (s) failed.
 Does anybody know the cause of this issue? Do I need some other module to 
 handle it? Does anybody know how to fix it?  Below is an example:


 Thanks


 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

 --
 The University of Edinburgh is a charitable body, registered in Scotland, 
 with registration number SC005336.




--
The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.



___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] how does Moses handle apostrophes?

2012-08-07 Thread Tan, Jun
Hi Barry,

I have checked the source data for training. I found that some of the 
apostrophes were already converted to '&apos;', but there are still some 
characters like ’ and &#91;.
As I understand it, the tool you mentioned will convert the apostrophe from 
Unicode to ASCII, so the tool can only work for the English-Chinese 
translation.  Is that right?  The apostrophe in Chinese is two bytes; in 
English it is one byte. 
If I use the tool (http://www.statmt.org/wmt11/normalize-punctuation.perl), 
what will the translation result of the apostrophes (’, ‘, ') be?


-Original Message-
From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk] 
Sent: Tuesday, August 07, 2012 6:18 PM
To: Tan, Jun
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] how does Moses handle apostrophes?

Hi Jun

If you're using this version of the tokeniser on your source sentence, then I 
would expect it to convert the apostrophe to &apos;. The fact that there is no 
&apos; in your output suggests that either the decoder is translating it to ' 
(unlikely) or the apostrophe in your source is not a regular apostrophe, but 
some unicode variant. So you need to check for that.

This script will normalise a lot of the punctuation: 
http://www.statmt.org/wmt11/normalize-punctuation.perl
However, if you use it, then you should also run it over your training data 
and retrain.
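
For example (a sketch - an ascii apostrophe is the single byte 0x27, while the 
unicode right single quotation mark U+2019 shows up as the three bytes 
e2 80 99 in UTF-8; I am assuming the normalisation script takes the language 
code as its first argument, and corpus.en is a placeholder):

  echo "company’s" | xxd
  perl normalize-punctuation.perl en < corpus.en > corpus.norm.en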

cheers - Barry

On 07/08/12 11:00, Tan, Jun wrote:
 Hi Barry,

 I think the version is new; below is the escaping code from tokenizer.perl:
   #escape special chars
 $text =~ s/\&/\&amp;/g;   # escape escape
 $text =~ s/\|/\&#124;/g;  # factor separator
 $text =~ s/\</\&lt;/g;    # xml
 $text =~ s/\>/\&gt;/g;    # xml
 $text =~ s/\'/\&apos;/g;  # xml
 $text =~ s/\"/\&quot;/g;  # xml
 $text =~ s/\[/\&#91;/g;   # syntax non-terminal
 $text =~ s/\]/\&#93;/g;   # syntax non-terminal



 -Original Message-
 From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
 Sent: Tuesday, August 07, 2012 5:55 PM
 To: Tan, Jun
 Cc: moses-support@mit.edu
 Subject: Re: [Moses-support] how does Moses handle apostrophes?

 Hi Jun

 Recent versions of the tokeniser have a line like

 $text =~ s/\'/\&apos;/g;  # xml

 to escape apostrophes.

 cheers - Barry

 On 07/08/12 09:51, Tan, Jun wrote:
 Hi Barry,

 How do I check the Moses version?  I'm sure that the tokeniser for training is 
 the same as for testing. I'm using the Stanford Word Segmenter for Chinese.

 -Original Message-
 From: Barry Haddow [mailto:bhad...@staffmail.ed.ac.uk]
 Sent: Tuesday, August 07, 2012 4:43 PM
 To: Tan, Jun
 Cc: tah...@precisiontranslationtools.com; moses-support@mit.edu
 Subject: Re: [Moses-support] how does Moses handle apostrophes?

 Hi Jun

 Is the apostrophe in your source data an ascii apostrophe, or a unicode 
 variant (use xxd to check this)? As Tom said, recent versions of the Moses 
 tokeniser escape apostrophes, so either you're using an old version, or it 
 does not recognise it as an apostrophe.

 Make sure you are using the same tokeniser in training and test.

 cheers - Barry

 On 07/08/12 06:38, jun@emc.com wrote:
 Yes, I’m using Moses’ tokenizer.perl for the English language, and Moses 
 was installed in June, so the version should be relatively new.
 Do you have any ideas on how to fix it?
 From: Tom Hoar [mailto:tah...@precisiontranslationtools.com]
 Sent: Tuesday, August 07, 2012 1:13 PM
 To: Tan, Jun
 Cc: moses-support@mit.edu
 Subject: Re: [Moses-support] how does Moses handle apostrophes?


 If you're using Moses' tokenizer.perl script, the English handling 
 separates "the company's" into "company 's". In recent (~2 months) moses 
 github releases, the tokenizer.perl script also escapes this string to 
 "company &apos;s". The English detokenizer unescapes the "&apos;s" to "'s" 
 and restores it without the preceding space.



 On Tue, 7 Aug 2012 00:33:07 -0400, jun@emc.com 
 wrote:
 Hi all,

 When I use Moses to translate some sentences containing apostrophes, it 
 doesn’t work correctly.
 Source:
 EMC Corporation (NYSE:EMC) today reported strong financial results for the 
 second quarter of 2012, marking the company's 10th consecutive quarter of 
 double-digit year-over-year growth for consolidated revenue, GAAP net 
 income, and GAAP and non-GAAP EPS. EMC expects to achieve its full-year 
 2012 goals for consolidated revenue, non-GAAP EPS and free cash flow.

 Translation result:
 2012 年 7 月 24 日 — EMC 公司 ( NYSE : EMC) 今天 报告 了 强有力 的 财务 业绩 2012 年 第 2 
 季度 , 标志 着 公司 's 连续 10 个 季度 实现 两 位 数 的 同比 增长 , 以 实现 整合 的 收入 、 GAAP 净 收入 
 和 GAAP 和 非 GAAP 每 股 收益 。 EMC 预计 到 2012 年 实现 其 目标 的 要求 年 全 年 的 合并 收入 、 
 非 GAAP EPS 和 自由 现金流 。

 As we can see, the translation result of “company's” is “公司 's”, and the 
 translation of the apostrophe (') and the letter (s) failed.
 Does anybody know the cause of this issue? Do I need some other module to 
 handle it? Does anybody know how to fix it?  Below is an example:


 Thanks