Hi Patricia,

It looks like you have some odd characters in your corpus, perhaps vertical tabs. You could use xxd on the LM file to try to figure out what they are.
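If the xxd dump is too long to read by eye, a small script can list the suspicious bytes instead. The following is only a sketch under assumptions: the function name `find_control_bytes` and the choice of which bytes count as "odd" (everything below 0x20 except tab and newline, plus DEL) are illustrative, not part of SRILM or IRSTLM. It reads the file byte by byte, so it will be slow on a large LM.

```python
import sys
from collections import Counter

# Bytes we expect in a tokenized corpus or ARPA file: tab and newline.
# Any other control byte (below 0x20, or DEL) is reported as suspicious.
# This choice is an assumption made for illustration.
ALLOWED = {0x09, 0x0A}


def find_control_bytes(path, max_hits=20):
    """Return (counts, hits): occurrences per suspicious byte value, and the
    first few (line_number, column, byte_value) positions where they occur."""
    counts = Counter()
    hits = []
    with open(path, "rb") as handle:
        # Iterating a binary file splits on b"\n" only, so a vertical tab
        # stays inside its line and is reported with the right line number.
        for line_number, line in enumerate(handle, start=1):
            for column, byte in enumerate(line, start=1):
                if (byte < 0x20 or byte == 0x7F) and byte not in ALLOWED:
                    counts[byte] += 1
                    if len(hits) < max_hits:
                        hits.append((line_number, column, byte))
    return counts, hits


if __name__ == "__main__" and len(sys.argv) > 1:
    counts, hits = find_control_bytes(sys.argv[1])
    for byte, count in sorted(counts.items()):
        print(f"0x{byte:02x}: {count} occurrence(s)")
    for line_number, column, byte in hits:
        print(f"line {line_number}, column {column}: 0x{byte:02x}")
```

Run it on the corpus first and then on the .lm file; a vertical tab would show up as 0x0b.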
cheers - Barry

On Tuesday 03 July 2012 16:46:35 Nicholas Ruiz wrote:
> Hi Patricia,
>
> Unfortunately, I'm not so well versed in SRILM, so I'm not sure I can
> answer the question about the blank line appearing in your ARPA file. You
> can also try training your model directly with IRSTLM (in text format) and
> you can see if the blank line also appears.
>
> tlm -tr=<corpus> -lm=[wb|msb] -n=3 -o=complete_fr.truecased_unique_tok_irst.lm
>
> (I'm not sure what your original params were for the SRI model)
> wb=Witten-Bell Smoothing
> msb=Modified Shift-Beta Smoothing
>
> Best,
> Nick
>
> ________________________________
> From: Patricia Helmich [patriciahelm...@hotmail.com]
> Sent: Tuesday, July 03, 2012 5:38 PM
> To: Nicholas Ruiz
> Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
>
> Hi Nick,
>
> ok, here are the first 10 lines of the BLM:
>
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n complete_fr.truecased_unique_tok_clean.blm | head
>      1  blmt 3 1091677 13524189 23061450
>      2  1091677
>      3
>  0
>      4  ! 0
>      5  " 0
>      6  # 0
>      7  $ 0
>      8  % 0
>      9  & 0
>     10  ' 0
>
>
>
> It seems that the third line causes the problems because I deleted it in a
> copy of the BLM
>
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n complete_fr.truecased_unique_tok_clean_copy.blm | head
>      1  blmt 3 1091677 13524189 23061450
>      2  1091677
>      3  ! 0
>      4  " 0
>      5  # 0
>      6  $ 0
>      7  % 0
>      8  & 0
>      9  ' 0
>     10  '00 0
>
> and then I tried to compute the perplexity with the copy of the BLM and it
> worked well:
>
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean_copy.blm --eval /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> inpfile: complete_fr.truecased_unique_tok_clean_copy.blm
> loading up to the LM level 1000 (if any)
> dub: 10000000
> Language Model Type of complete_fr.truecased_unique_tok_clean_copy.blm is 1
> blmt
> loadbin()
> lmtable::loadbin_dict()
> dict->size(): 1091677
> loadbin_level (level 1)
> loading 1091677 1-grams
> done (level1)
> loadbin_level (level 2)
> loading 13524189 2-grams
> done (level2)
> loadbin_level (level 3)
> loading 23061450 3-grams
> done (level3)
> done
> OOV code is 218080
> Start Eval
> OOV code: 218080
> %% Nw=58714 PP=1.03 PPwp=0.03 Nbo=58713 Noov=105 OOV=0.18%
> lmtable class statistics
> levels 3
> lev 1 entries 1091677 used mem 15.62Mb
> lev 2 entries 13524189 used mem 193.47Mb
> lev 3 entries 23061450 used mem 153.95Mb
> total allocated mem 363.03Mb
> total number of get and binary search calls
> level 1 get: 58714 bsearch: 0
> level 2 get: 58713 bsearch: 117425
> level 3 get: 58712 bsearch: 0
>
>
> In the LM, I have also this empty line
>
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n complete_fr.truecased_unique_tok_clean.lm | head
>      1
>      2  \data\
>      3  ngram 1=1091677
>      4  ngram 2=13524189
>      5  ngram 3=23061450
>      6
>      7  \1-grams:
>      8  -7.154682
>  -0.1456359
>      9  -3.339167 ! -1.472732
>     10  -2.43139 " -0.733331
>
> but in the phrase training or the perplexity computation with the LM, this
> does not cause any problems.
>
> Also, I'm wondering why there is an entry for an empty line in the LM
> because I checked my French corpus and it does not contain any empty
> lines.
>
>
> Best, Patricia
>
> > From: nicr...@fbk.eu
> > To: patriciahelm...@hotmail.com
> > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> > Date: Tue, 3 Jul 2012 14:59:57 +0000
> >
> > Hi Patricia,
> >
> > Could you also send me the top 10 lines of your binarized LM?
> >
> > head complete_fr.truecased_unique_tok_clean.blm
> >
> > Thanks,
> > Nick
> >
> > ________________________________
> > From: Patricia Helmich [patriciahelm...@hotmail.com]
> > Sent: Tuesday, July 03, 2012 4:40 PM
> > To: Nicholas Ruiz; moses-support@mit.edu
> > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> >
> > Hi Nick,
> >
> > for
> >
> > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.lm --eval /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> >
> > I get the following output:
> >
> > inpfile: complete_fr.truecased_unique_tok_clean.lm
> > loading up to the LM level 1000 (if any)
> > dub: 10000000
> > Language Model Type of complete_fr.truecased_unique_tok_clean.lm is 1
> > \data\
> > loadtxt_ram()
> > 1-grams: reading 1091677 entries
> > done level1
> > 2-grams: reading 13524189 entries
> > ..done level2
> > 3-grams: reading 23061450 entries
> > ....done level3
> > done
> > OOV code is 218081
> > OOV code is 218081
> > Start Eval
> > OOV code: 218081
> > %% Nw=58714 PP=201.88 PPwp=5.70 Nbo=19233 Noov=105 OOV=0.18%
> > lmtable class statistics
> > levels 3
> > lev 1 entries 1091677 used mem 15.62Mb
> > lev 2 entries 13524189 used mem 193.47Mb
> > lev 3 entries 23061450 used mem 153.95Mb
> > total allocated mem 363.03Mb
> > total number of get and binary search calls
> > level 1 get: 3042 bsearch: 0
> > level 2 get: 58713 bsearch: 23178875
> > level 3 get: 58712 bsearch: 55672
> >
> >
> >
> > For
> >
> > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.blm --eval /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> >
> > I get the same error as in the phrase training:
> >
> > inpfile: complete_fr.truecased_unique_tok_clean.blm
> > loading up to the LM level 1000 (if any)
> > dub: 10000000
> > Language Model Type of complete_fr.truecased_unique_tok_clean.blm is 1
> > blmt
> > loadbin()
> > lmtable::loadbin_dict()
> > dictionary::loadtxt wrong entry was found (0) in position 1
> >
> >
> >
> > Best,
> > Patricia
> >
> > > From: nicr...@fbk.eu
> > > To: patriciahelm...@hotmail.com; moses-support@mit.edu
> > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> > > Date: Tue, 3 Jul 2012 13:29:26 +0000
> > >
> > > Hi Patricia,
> > >
> > > Could you try computing the perplexity of your binarized LM with
> > > compile-lm?
> > >
> > > First on the ARPA format (SRILM):
> > > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.lm --eval <text-to-eval>
> > >
> > > and then on the binarized version (before your symbolic link):
> > > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.blm --eval <text-to-eval>
> > >
> > > It might be easier to debug by first looking at the direct output from
> > > IRSTLM.
> > >
> > > Thanks,
> > > Nick
> > >
> > >
> > > ________________________________
> > > From: moses-support-boun...@mit.edu [moses-support-boun...@mit.edu] on behalf of Patricia Helmich [patriciahelm...@hotmail.com]
> > > Sent: Tuesday, July 03, 2012 3:07 PM
> > > To: moses-support@mit.edu
> > > Subject: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> > >
> > > Hi,
> > > I am using Moses in combination with SRILM and IRSTLM for several
> > > language pairs. After building LMs with SRILM and training the phrase
> > > model, I try to translate a sentence, for example:
> > >
> > > echo "this is a small house" | /home/lingua/smt/moses/bin/moses -f model/moses.ini
> > >
> > > This works well for each language pair.
> > >
> > > Then I produce an IRSTLM binary LM for each language pair, for example:
> > >
> > > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.lm complete_fr.truecased_unique_tok_clean.blm
> > > ln -s complete_fr.truecased_unique_tok_clean.blm complete_fr.truecased_unique_tok_clean.blm.mm
> > >
> > > and I produce binary phrase tables and binary reordering tables:
> > >
> > > gzip -cd fr-en/f_en.e_fr/model/phrase-table.gz | LC_ALL=C sort | /home/lingua/smt/moses/bin/processPhraseTable -ttable 0 0 - -nscores 5 -out fr-en/f_en.e_fr/model/phrase-table
> > > gzip -cd fr-en/f_en.e_fr/model/reordering-table.wbe-msd-bidirectional-fe.gz | LC_ALL=C sort | /home/lingua/smt/moses/bin/processLexicalTable -out fr-en/f_en.e_fr/model/reordering-table
> > >
> > > Then I create a copy of moses.ini (->moses-bin.ini) and set
> > > moses-bin.ini to use the binary files.
> > >
> > >
> > > Now I try to translate a sentence with:
> > >
> > > echo "this is a small house" | TMP=/tmp /home/lingua/smt/moses/bin/moses -v 2 -f model/moses-bin.ini
> > >
> > >
> > > This works well for each language pair, except for the language pair f:
> > > en, e: fr.
> > >
> > > The output is:
> > >
> > > Defined parameters (per moses.ini or switch):
> > > config: model/moses-bin.ini
> > > distortion-file: 0-0 wbe-msd-bidirectional-fe-allff 6 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/reordering-table
> > > distortion-limit: 6
> > > input-factors: 0
> > > lmodel-file: 1 0 3 /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm
> > > mapping: 0 T 0
> > > ttable-file: 1 0 0 5 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/phrase-table
> > > ttable-limit: 20
> > > verbose: 2
> > > weight-d: 0.3 0.3 0.3 0.3 0.3 0.3 0.3
> > > weight-l: 0.5000
> > > weight-t: 0.20 0.20 0.20 0.20 0.20
> > > weight-w: -1
> > > input type is: text input
> > > Loading lexical distortion models...have 1 models
> > > Creating lexical reordering...
> > > weights: 0.300 0.300 0.300 0.300 0.300 0.300
> > > binary file loaded, default OFF_T: -1
> > > Start loading LanguageModel /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm : [0.000] seconds
> > > In LanguageModelIRST::Load: nGramOrder = 3
> > > Language Model Type of /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm is 1
> > > blmt
> > > loadbin()
> > > lmtable::loadbin_dict()
> > > dictionary::loadtxt wrong entry was found (0) in position 1
> > >
> > > I don't understand the reason for this error. Could you help me with
> > > this problem?
> > >
> > > Thank you,
> > > Patricia
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Barry Haddow
University of Edinburgh
+44 (0) 131 651 3173

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.