Dear all,

I’m building the baseline system, and an error occurred during the last step of the LM training process, as the first attached file shows.

I looked at another report of “Last input should have been poison”, but that one came with a more detailed message, “no space left on device”, whereas mine shows nothing but that one sentence.
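To rule out the disk-space cause mentioned in that other report, the free space on the relevant filesystems could be checked before rerunning. This is only a minimal sketch: `free_kb` is a hypothetical helper name, and it assumes the output is written to the current directory and that temporary files go under `$TMPDIR` or `/tmp`.

```shell
# Hypothetical helper: print the free space (in KiB) of the filesystem holding a directory.
free_kb() {
    # -P keeps each filesystem on one line; column 4 is the "Available" column.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Assumption: the ARPA file is written to the current directory and
# temporary files go under $TMPDIR (or /tmp when it is unset).
for dir in . "${TMPDIR:-/tmp}"; do
    printf '%s\t%s KiB free\n' "$dir" "$(free_kb "$dir")"
done
```

If either number is small compared with the chain sizes reported in the log, disk space would be a plausible suspect.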

The exact command I use for KenLM is:

$MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi

As mosesdecoder is installed in the administrator’s directory rather than my own, “~/mosesdecoder” is replaced by $MOSES.

My corpus (the language pair is Spanish to Finnish) was downloaded from OPUS (http://opus.lingfil.uu.se/OpenSubtitles2013.php) in the Moses format.
 
The downloaded package contains three files: OpenSubtitles2013.es-fi.es, OpenSubtitles2013.es-fi.fi, and OpenSubtitles2013.es-fi.ids.

Tokenization, truecasing and cleaning were all done on the “es” and “fi” files only. Could the error have something to do with the “ids” file?
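One sanity check on the preparation steps is that the “es” and “fi” files still have the same number of lines at every stage. The sketch below assumes the file names used in the commands further down; `same_line_count` is a hypothetical helper, not part of Moses.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}

# Assumption: the corpus lives under ~/es-fi with the names used in this mail.
corpus="$HOME/es-fi/OpenSubtitles2013.es-fi"
for stage in "" .tok .true .clean; do
    es="$corpus$stage.es"
    fi_="$corpus$stage.fi"
    # Skip stages whose files are not present.
    [ -f "$es" ] && [ -f "$fi_" ] || continue
    if same_line_count "$es" "$fi_"; then
        echo "ok: $es / $fi_"
    else
        echo "MISMATCH: $es / $fi_"
    fi
done
```

A mismatch would point at the preparation steps rather than at the “ids” file, which none of the commands read.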

Below are the output of the LM process and the commands I used for corpus preparation.

fangting@phon:~/lm$ $MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi
=== 1/5 Counting and sorting n-grams ===
Reading /home/fangting/es-fi/OpenSubtitles2013.es-fi.true.fi
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2531827712 bytes == 0x2ffce000 @ 
****************************************************************************************************
Unigram tokens 44622536 types 1102478
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:13229736 2:1146485248 3:2149659904
tcmalloc: large alloc 2149662720 bytes == 0x2ffce000 @ 
Statistics:
1 1102478 D1=0.698134 D2=1.05351 D3+=1.36696
2 8382723 D1=0.802872 D2=1.11671 D3+=1.34474
3 17631891 D1=0.75173 D2=1.35452 D3+=1.59193
Memory estimate for binary LM:
type     MB
probing 521 assuming -p 1.5
probing 574 assuming -r models -p 1.5
trie    243 without quantization
trie    148 assuming -q 8 -b 8 quantization 
trie    226 assuming -a 22 array pointer compression
trie    131 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:13229736 2:134123568 3:352637820
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++********************************************************************************####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:13229736 2:134123568 3:352637820
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++****************************************************************************************************
Chain sizes: 1:13229736 2:65424996 3:122671872
=== 5/5 Writing ARPA model ===
Last input should have been poison.
terminate called without an active exception
Aborted
Corpus Preparation 

cd ~/es-fi

Tokenisation
$MOSES/scripts/tokenizer/tokenizer.perl -l es < ~/es-fi/OpenSubtitles2013.es-fi.es > ~/es-fi/OpenSubtitles2013.es-fi.tok.es

$MOSES/scripts/tokenizer/tokenizer.perl -l fi < ~/es-fi/OpenSubtitles2013.es-fi.fi > ~/es-fi/OpenSubtitles2013.es-fi.tok.fi

Training truecaser
$MOSES/scripts/recaser/train-truecaser.perl --model ~/es-fi/truecase-model.es --corpus ~/es-fi/OpenSubtitles2013.es-fi.tok.es

$MOSES/scripts/recaser/train-truecaser.perl --model ~/es-fi/truecase-model.fi --corpus ~/es-fi/OpenSubtitles2013.es-fi.tok.fi

Truecasing
$MOSES/scripts/recaser/truecase.perl --model ~/es-fi/truecase-model.es < ~/es-fi/OpenSubtitles2013.es-fi.tok.es > ~/es-fi/OpenSubtitles2013.es-fi.true.es

$MOSES/scripts/recaser/truecase.perl --model ~/es-fi/truecase-model.fi < ~/es-fi/OpenSubtitles2013.es-fi.tok.fi > ~/es-fi/OpenSubtitles2013.es-fi.true.fi

Cleaning
$MOSES/scripts/training/clean-corpus-n.perl ~/es-fi/OpenSubtitles2013.es-fi.true es fi ~/es-fi/OpenSubtitles2013.es-fi.clean 1 80

Language model training (doesn’t work)
cd lm
/opt/irstlm/bin/add-start-end.sh < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.sb.fi

export IRSTLM=/opt/irstlm; /opt/irstlm/bin/build-lm.sh -i OpenSubtitles2013.es-fi.sb.fi -t ./tmp -p -s improved-kneser-ney -o OpenSubtitles2013.es-fi.lm.fi

Change to KenLM
$MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi
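
Since the abort happens in step 5/5, it may help to see how much of the ARPA file was actually written. A finished ARPA file ends with a `\end\` line, so a rough check (a sketch only; `arpa_is_complete` is a hypothetical helper) could look like this:

```shell
# Hypothetical helper: succeed if the file contains the ARPA end marker line "\end\".
arpa_is_complete() {
    # -F treats the pattern as a fixed string, -x requires a whole-line match.
    grep -qxF '\end\' "$1"
}

# Assumption: run from the directory where lmplz wrote its output.
arpa=OpenSubtitles2013.es-fi.arpa.fi
if [ -f "$arpa" ]; then
    if arpa_is_complete "$arpa"; then
        echo "$arpa: end marker found"
    else
        echo "$arpa: no end marker, the file looks truncated"
    fi
    ls -l "$arpa"   # the size hints at how far the writer got
fi
```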
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
