Dear all,

I'm building the baseline system, and an error occurred during the last step of the LM training process, as the attached output shows. I found another report of "Last input should have been poison", but that one came with a more detailed message, "no space left on device", while mine shows nothing but that sentence.

The exact command I use for KenLM is:

$MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi

As mosesdecoder is installed in the administrator's directory instead of my own, "~/mosesdecoder" is replaced by $MOSES.

My corpus (the language pair is Spanish to Finnish) was downloaded from OPUS (http://opus.lingfil.uu.se/OpenSubtitles2013.php) in the Moses format. The download contains three files: OpenSubtitles2013.es-fi.es, OpenSubtitles2013.es-fi.fi, and OpenSubtitles2013.es-fi.ids. Tokenization, truecasing and cleaning were all run on the "es" and "fi" files only. Could the error have something to do with the "ids" file?

Below are the output of the LM training run and the commands I used for corpus preparation.
fangting@phon:~/lm$ $MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi
=== 1/5 Counting and sorting n-grams ===
Reading /home/fangting/es-fi/OpenSubtitles2013.es-fi.true.fi
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2531827712 bytes == 0x2ffce000 @
****************************************************************************************************
Unigram tokens 44622536 types 1102478
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:13229736 2:1146485248 3:2149659904
tcmalloc: large alloc 2149662720 bytes == 0x2ffce000 @
Statistics:
1 1102478 D1=0.698134 D2=1.05351 D3+=1.36696
2 8382723 D1=0.802872 D2=1.11671 D3+=1.34474
3 17631891 D1=0.75173 D2=1.35452 D3+=1.59193
Memory estimate for binary LM:
type     MB
probing 521 assuming -p 1.5
probing 574 assuming -r models -p 1.5
trie    243 without quantization
trie    148 assuming -q 8 -b 8 quantization
trie    226 assuming -a 22 array pointer compression
trie    131 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:13229736 2:134123568 3:352637820
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++********************************************************************************####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:13229736 2:134123568 3:352637820
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++****************************************************************************************************
Chain sizes: 1:13229736 2:65424996 3:122671872
=== 5/5 Writing ARPA model ===
Last input should have been poison.
terminate called without an active exception
Aborted
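The log above stops at step 5/5 without the "no space left on device" detail seen in the other report, so the disk-space explanation is neither confirmed nor ruled out by it. A minimal sketch of how that hypothesis could be checked follows. It assumes that the relevant locations are the directory receiving the ARPA file (here the current directory) and /tmp for temporary files; both are assumptions about this setup, and free_kb is a hypothetical helper, not part of Moses or KenLM.

```shell
#!/bin/sh
# Print the free space (in KB) on the filesystem that holds a directory.
# free_kb is a hypothetical helper name introduced for this sketch.
free_kb() {
    # -P gives one POSIX-format line per filesystem, -k reports 1024-byte
    # blocks; the fourth column of the second line is the available space.
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Assumed locations: the output directory (current directory) and /tmp.
for dir in . /tmp; do
    echo "$dir: $(free_kb "$dir") KB free"
done
```

If either location turns out to be nearly full, lmplz accepts -T to put its temporary files under a different prefix, so the run could be repeated with a roomier directory.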
Corpus Preparation
cd ~/es-fi

Tokenisation
$MOSES/scripts/tokenizer/tokenizer.perl -l es < ~/es-fi/OpenSubtitles2013.es-fi.es > ~/es-fi/OpenSubtitles2013.es-fi.tok.es
$MOSES/scripts/tokenizer/tokenizer.perl -l fi < ~/es-fi/OpenSubtitles2013.es-fi.fi > ~/es-fi/OpenSubtitles2013.es-fi.tok.fi

Training truecaser
$MOSES/scripts/recaser/train-truecaser.perl --model ~/es-fi/truecase-model.es --corpus ~/es-fi/OpenSubtitles2013.es-fi.tok.es
$MOSES/scripts/recaser/train-truecaser.perl --model ~/es-fi/truecase-model.fi --corpus ~/es-fi/OpenSubtitles2013.es-fi.tok.fi

Truecasing
$MOSES/scripts/recaser/truecase.perl --model ~/es-fi/truecase-model.es < ~/es-fi/OpenSubtitles2013.es-fi.tok.es > ~/es-fi/OpenSubtitles2013.es-fi.true.es
$MOSES/scripts/recaser/truecase.perl --model ~/es-fi/truecase-model.fi < ~/es-fi/OpenSubtitles2013.es-fi.tok.fi > ~/es-fi/OpenSubtitles2013.es-fi.true.fi

Cleaning
$MOSES/scripts/training/clean-corpus-n.perl ~/es-fi/OpenSubtitles2013.es-fi.true es fi ~/es-fi/OpenSubtitles2013.es-fi.clean 1 80

Language model training (doesn't work)
cd lm
/opt/irstlm/bin/add-start-end.sh < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.sb.fi
export IRSTLM=/opt/irstlm; /opt/irstlm/bin/build-lm.sh -i OpenSubtitles2013.es-fi.sb.fi -t ./tmp -p -s improved-kneser-ney -o OpenSubtitles2013.es-fi.lm.fi

Change to KenLM
$MOSES/bin/lmplz -o 3 < ~/es-fi/OpenSubtitles2013.es-fi.true.fi > OpenSubtitles2013.es-fi.arpa.fi
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support