[Moses-support] KenLM poison

徐同学 Mon, 05 Oct 2015 06:55:40 -0700

Dear all,

I’m building the baseline system, and an error occurred during the last step of LM training, as the first attachment below shows.

I found another report of “Last input should have been poison”, but that one included the more detailed message “no space left on device”, whereas mine prints nothing but that sentence.
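
To rule out the disk-space cause from that other report, free space can be checked with something like the sketch below. It assumes that lmplz keeps its temporary files under /tmp and that the ARPA file is written to the current directory; the helper name free_kb is only for illustration.

```shell
# Print the free space, in kilobytes, on the filesystem holding directory $1.
# df -P gives one POSIX-format line per filesystem; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Assumption: temporary files go to /tmp and the output goes to the
# current directory. Adjust the list if either location is different.
for dir in /tmp .; do
    echo "$dir: $(free_kb "$dir") KB free"
done
```

If either figure is small compared with the chain sizes lmplz reports, disk space would be the first thing to suspect.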

The exact command I use for KenLM is:

$MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi

As mosesdecoder is installed in the administrator’s directory rather than my own, "~/mosesdecoder" is replaced by $MOSES.

My corpus (the language pair is Spanish to Finnish) was downloaded from OPUS (http://opus.lingfil.uu.se/OpenSubtitles2013.php) in the Moses format.

The download contains three files: OpenSubtitles2013.es-fi.es, OpenSubtitles2013.es-fi.fi, and OpenSubtitles2013.es-fi.ids.

Tokenization, truecasing and cleaning were all done on the “es” and “fi” files only. Could the error have something to do with the “ids” file?

The output of the LM training run and the commands I used for corpus preparation are attached below.

fangting@phon:~/lm$ $MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi
=== 1/5 Counting and sorting n-grams ===
Reading /home/fangting/es-fi/OpenSubtitles2013.es-fi.true.fi
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2531827712 bytes == 0x2ffce000 @
****************************************************************************************************
Unigram tokens 44622536 types 1102478
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:13229736 2:1146485248 3:2149659904
tcmalloc: large alloc 2149662720 bytes == 0x2ffce000 @
Statistics:
1 1102478 D1=0.698134 D2=1.05351 D3+=1.36696
2 8382723 D1=0.802872 D2=1.11671 D3+=1.34474
3 17631891 D1=0.75173 D2=1.35452 D3+=1.59193
Memory estimate for binary LM:
type      MB
probing  521 assuming -p 1.5
probing  574 assuming -r models -p 1.5
trie     243 without quantization
trie     148 assuming -q 8 -b 8 quantization
trie     226 assuming -a 22 array pointer compression
trie     131 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:13229736 2:134123568 3:352637820
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++********************************************************************************####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:13229736 2:134123568 3:352637820
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++****************************************************************************************************
Chain sizes: 1:13229736 2:65424996 3:122671872
=== 5/5 Writing ARPA model ===
Last input should have been poison.
terminate called without an active exception
Aborted

Corpus Preparation


cd ~/es-fi

Tokenisation
$MOSES/scripts/tokenizer/tokenizer.perl -l es < ~/es-fi/OpenSubtitles2013.es-fi.es > ~/es-fi/OpenSubtitles2013.es-fi.tok.es

$MOSES/scripts/tokenizer/tokenizer.perl -l fi < ~/es-fi/OpenSubtitles2013.es-fi.fi > ~/es-fi/OpenSubtitles2013.es-fi.tok.fi

Training truecaser
$MOSES/scripts/recaser/train-truecaser.perl --model ~/es-fi/truecase-model.es --corpus ~/es-fi/OpenSubtitles2013.es-fi.tok.es

$MOSES/scripts/recaser/train-truecaser.perl --model ~/es-fi/truecase-model.fi --corpus ~/es-fi/OpenSubtitles2013.es-fi.tok.fi

Truecasing
$MOSES/scripts/recaser/truecase.perl --model ~/es-fi/truecase-model.es < ~/es-fi/OpenSubtitles2013.es-fi.tok.es > ~/es-fi/OpenSubtitles2013.es-fi.true.es

$MOSES/scripts/recaser/truecase.perl --model ~/es-fi/truecase-model.fi < ~/es-fi/OpenSubtitles2013.es-fi.tok.fi > ~/es-fi/OpenSubtitles2013.es-fi.true.fi

Cleaning
$MOSES/scripts/training/clean-corpus-n.perl ~/es-fi/OpenSubtitles2013.es-fi.true es fi ~/es-fi/OpenSubtitles2013.es-fi.clean 1 80
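
A sanity check that could follow the cleaning step, given here only as a sketch: confirm that both sides of the cleaned corpus still have the same number of lines. The helper name same_line_count is made up for illustration, and the paths assume that the cleaning script wrote OpenSubtitles2013.es-fi.clean.es and OpenSubtitles2013.es-fi.clean.fi.

```shell
# Succeed only if files $1 and $2 contain the same number of lines.
same_line_count() {
    a=$(wc -l < "$1")
    b=$(wc -l < "$2")
    # Arithmetic expansion drops any padding that wc may add.
    [ "$((a))" -eq "$((b))" ]
}

# Assumed output names of the cleaning step; skipped if they are absent.
es_side="$HOME/es-fi/OpenSubtitles2013.es-fi.clean.es"
fi_side="$HOME/es-fi/OpenSubtitles2013.es-fi.clean.fi"
if [ -f "$es_side" ] && [ -f "$fi_side" ]; then
    if same_line_count "$es_side" "$fi_side"; then
        echo "line counts match"
    else
        echo "line counts differ"
    fi
fi
```

This only checks the parallel files against each other; the LM step reads the Finnish truecased file alone.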

Language model training (doesn’t work)
cd lm
/opt/irstlm/bin/add-start-end.sh < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.sb.fi

export IRSTLM=/opt/irstlm; /opt/irstlm/bin/build-lm.sh -i OpenSubtitles2013.es-fi.sb.fi -t ./tmp -p -s improved-kneser-ney -o OpenSubtitles2013.es-fi.lm.fi

Change to KenLM
$MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
