Hi Patricia

It looks like you have some odd characters in your corpus - perhaps vertical 
tabs. You could use xxd on the lm file to try to figure out what they are.
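
In case it helps, here is a minimal sketch of that kind of check. It assumes
GNU grep and coreutils, and find_ctrl_lines is just a made-up helper name, not
part of IRSTLM or SRILM. It only catches ASCII control characters, so other
oddities (a Unicode line separator, say) would need a different pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of IRSTLM or SRILM): print the lines of a
# file that contain an ASCII control character other than tab or newline.
find_ctrl_lines() {
    # Tabs are removed first because ARPA files use them as field separators.
    # grep then matches the remaining control bytes in the C locale (-a forces
    # text mode), and cat -v renders them visibly: a vertical tab becomes ^K.
    # Non-ASCII bytes on the reported lines will show up in M- notation.
    LC_ALL=C tr -d '\t' < "$1" | LC_ALL=C grep -a -n '[[:cntrl:]]' | cat -v
}

# Usage (file name taken from this thread):
#   find_ctrl_lines complete_fr.truecased_unique_tok_clean.lm | head
#   xxd complete_fr.truecased_unique_tok_clean.lm | head -20
```

The line numbers it reports can then be inspected more closely with xxd.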

cheers - Barry

On Tuesday 03 July 2012 16:46:35 Nicholas Ruiz wrote:
> Hi Patricia,
> 
> Unfortunately, I'm not so well versed in SRILM, so I'm not sure I can
>  answer the question about the blank line appearing in your ARPA file. You
>  can also try training your model directly with IRSTLM (in text format) and
>  you can see if the blank line also appears.
> 
> tlm -tr=<corpus> -lm=[wb|msb] -n=3
>  -o=complete_fr.truecased_unique_tok_irst.lm
> 
> (I'm not sure what your original parameters were for the SRI model)
> wb=Witten-Bell Smoothing
> msb=Modified Shift-Beta Smoothing
> 
> Best,
> Nick
> 
> ________________________________
> From: Patricia Helmich [patriciahelm...@hotmail.com]
> Sent: Tuesday, July 03, 2012 5:38 PM
> To: Nicholas Ruiz
> Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
>  entry was found (0) in position 1
> 
> Hi Nick,
> 
> ok, here are the first 10 lines of the BLM:
> 
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n
>  complete_fr.truecased_unique_tok_clean.blm | head
>      1  blmt 3 1091677 13524189 23061450
>      2  1091677
>      3
>          0
>      4  ! 0
>      5  " 0
>      6  # 0
>      7  $ 0
>      8  % 0
>      9  & 0
>     10  ' 0
> 
> 
> 
> It seems that the third line causes the problem. I deleted it in a
>  copy of the BLM:
> 
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n
>  complete_fr.truecased_unique_tok_clean_copy.blm | head
>      1  blmt 3 1091677 13524189 23061450
>      2  1091677
>      3  ! 0
>      4  " 0
>      5  # 0
>      6  $ 0
>      7  % 0
>      8  & 0
>      9  ' 0
>     10  '00 0
> 
> and then I tried to compute the perplexity with the copy of the BLM and it
>  worked well:
> 
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$
>  /home/lingua/smt/irstlm/bin/compile-lm
>  complete_fr.truecased_unique_tok_clean_copy.blm --eval
>  /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> inpfile: complete_fr.truecased_unique_tok_clean_copy.blm
> loading up to the LM level 1000 (if any)
> dub: 10000000
> Language Model Type of complete_fr.truecased_unique_tok_clean_copy.blm is 1
> blmt
> loadbin()
> lmtable::loadbin_dict()
> dict->size(): 1091677
> loadbin_level (level 1)
> loading 1091677 1-grams
> done (level1)
> loadbin_level (level 2)
> loading 13524189 2-grams
> done (level2)
> loadbin_level (level 3)
> loading 23061450 3-grams
> done (level3)
> done
> OOV code is 218080
> Start Eval
> OOV code: 218080
> %% Nw=58714 PP=1.03 PPwp=0.03 Nbo=58713 Noov=105 OOV=0.18%
> lmtable class statistics
> levels 3
> lev 1 entries 1091677 used mem 15.62Mb
> lev 2 entries 13524189 used mem 193.47Mb
> lev 3 entries 23061450 used mem 153.95Mb
> total allocated mem 363.03Mb
> total number of get and binary search calls
> level 1 get: 58714 bsearch: 0
> level 2 get: 58713 bsearch: 117425
> level 3 get: 58712 bsearch: 0
> 
> 
> In the LM, I also have this empty line:
> 
> lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n
>  complete_fr.truecased_unique_tok_clean.lm | head
>      1
>      2  \data\
>      3  ngram 1=1091677
>      4  ngram 2=13524189
>      5  ngram 3=23061450
>      6
>      7  \1-grams:
>      8  -7.154682
>                                 -0.1456359
>      9  -3.339167       !       -1.472732
>     10  -2.43139        "       -0.733331
> 
> but in the phrase training or the perplexity computation with the LM, this
>  does not cause any problems.
> 
> Also, I'm wondering why there is an entry for an empty line in the LM,
>  because I checked my French corpus and it does not contain any empty
>  lines.
> 
> 
> Best, Patricia
> 
> > From: nicr...@fbk.eu
> > To: patriciahelm...@hotmail.com
> > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > entry was found (0) in position 1
> > Date: Tue, 3 Jul 2012 14:59:57 +0000
> >
> > Hi Patricia,
> >
> > Could you also send me the top 10 lines of your binarized LM?
> >
> > head complete_fr.truecased_unique_tok_clean.blm
> >
> > Thanks,
> > Nick
> >
> > ________________________________
> > From: Patricia Helmich [patriciahelm...@hotmail.com]
> > Sent: Tuesday, July 03, 2012 4:40 PM
> > To: Nicholas Ruiz; moses-support@mit.edu
> > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > entry was found (0) in position 1
> >
> > Hi Nick,
> >
> > for
> >
> > /home/lingua/smt/irstlm/bin/compile-lm
> > complete_fr.truecased_unique_tok_clean.lm --eval
> > /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> >
> > I get the following output:
> >
> > inpfile: complete_fr.truecased_unique_tok_clean.lm
> > loading up to the LM level 1000 (if any)
> > dub: 10000000
> > Language Model Type of complete_fr.truecased_unique_tok_clean.lm is 1
> > \data\
> > loadtxt_ram()
> > 1-grams: reading 1091677 entries
> > done level1
> > 2-grams: reading 13524189 entries
> > ..done level2
> > 3-grams: reading 23061450 entries
> > ....done level3
> > done
> > OOV code is 218081
> > OOV code is 218081
> > Start Eval
> > OOV code: 218081
> > %% Nw=58714 PP=201.88 PPwp=5.70 Nbo=19233 Noov=105 OOV=0.18%
> > lmtable class statistics
> > levels 3
> > lev 1 entries 1091677 used mem 15.62Mb
> > lev 2 entries 13524189 used mem 193.47Mb
> > lev 3 entries 23061450 used mem 153.95Mb
> > total allocated mem 363.03Mb
> > total number of get and binary search calls
> > level 1 get: 3042 bsearch: 0
> > level 2 get: 58713 bsearch: 23178875
> > level 3 get: 58712 bsearch: 55672
> >
> >
> >
> > For
> >
> > /home/lingua/smt/irstlm/bin/compile-lm
> > complete_fr.truecased_unique_tok_clean.blm --eval
> > /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> >
> > I get the same error as in the phrase training:
> >
> > inpfile: complete_fr.truecased_unique_tok_clean.blm
> > loading up to the LM level 1000 (if any)
> > dub: 10000000
> > Language Model Type of complete_fr.truecased_unique_tok_clean.blm is 1
> > blmt
> > loadbin()
> > lmtable::loadbin_dict()
> > dictionary::loadtxt wrong entry was found (0) in position 1
> >
> >
> >
> > Best,
> > Patricia
> >
> > > From: nicr...@fbk.eu
> > > To: patriciahelm...@hotmail.com; moses-support@mit.edu
> > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > > entry was found (0) in position 1
> > > Date: Tue, 3 Jul 2012 13:29:26 +0000
> > >
> > > Hi Patricia,
> > >
> > > Could you try computing the perplexity of your binarized LM with
> > > compile-lm?
> > >
> > > First on the ARPA format (SRILM):
> > > /home/lingua/smt/irstlm/bin/compile-lm
> > > complete_fr.truecased_unique_tok_clean.lm --eval <text-to-eval>
> > >
> > > and then on the binarized version (before your symbolic link):
> > > /home/lingua/smt/irstlm/bin/compile-lm
> > > complete_fr.truecased_unique_tok_clean.blm --eval <text-to-eval>
> > >
> > > It might be easier to debug by first looking at the direct output from
> > > IRSTLM.
> > >
> > > Thanks,
> > > Nick
> > >
> > >
> > > ________________________________
> > > From: moses-support-boun...@mit.edu [moses-support-boun...@mit.edu] on
> > > behalf of Patricia Helmich [patriciahelm...@hotmail.com]
> > > Sent: Tuesday, July 03, 2012 3:07 PM
> > > To: moses-support@mit.edu
> > > Subject: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > > entry was found (0) in position 1
> > >
> > > Hi,
> > > I am using Moses in combination with SRILM and IRSTLM for several
> > > language pairs. After building LMs with SRILM and training the phrase
> > > model, I try to translate a sentence, for example:
> > >
> > > echo "this is a small house" | /home/lingua/smt/moses/bin/moses -f
> > > model/moses.ini
> > >
> > > This works well for each language pair.
> > >
> > > Then I produce an IRSTLM binary LM for each language pair, for example:
> > >
> > > /home/lingua/smt/irstlm/bin/compile-lm
> > > complete_fr.truecased_unique_tok_clean.lm
> > > > complete_fr.truecased_unique_tok_clean.blm
> > > >
> > > > ln -s complete_fr.truecased_unique_tok_clean.blm
> > > complete_fr.truecased_unique_tok_clean.blm.mm
> > >
> > > and I produce binary phrase tables and binary reordering tables:
> > >
> > > gzip -cd fr-en/f_en.e_fr/model/phrase-table.gz | LC_ALL=C sort |
> > > /home/lingua/smt/moses/bin/processPhraseTable -ttable 0 0 - -nscores 5
> > > > -out fr-en/f_en.e_fr/model/phrase-table
> > > >
> > > > gzip -cd fr-en/f_en.e_fr/model/reordering-table.wbe-msd-bidirectional-fe.gz |
> > > LC_ALL=C sort | /home/lingua/smt/moses/bin/processLexicalTable -out
> > > fr-en/f_en.e_fr/model/reordering-table
> > >
> > > Then I create a copy of moses.ini (->moses-bin.ini) and set
> > > moses-bin.ini to use the binary files.
> > >
> > >
> > > Now I try to translate a sentence with:
> > >
> > > echo "this is a small house" | TMP=/tmp
> > > /home/lingua/smt/moses/bin/moses -v 2 -f model/moses-bin.ini
> > >
> > >
> > > This works well for each language pair, except for the language pair f:
> > > en, e: fr.
> > >
> > > The output is:
> > >
> > > Defined parameters (per moses.ini or switch):
> > > config: model/moses-bin.ini
> > > > distortion-file: 0-0 wbe-msd-bidirectional-fe-allff 6 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/reordering-table
> > > > distortion-limit: 6
> > > > input-factors: 0
> > > > lmodel-file: 1 0 3 /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm
> > > > mapping: 0 T 0
> > > > ttable-file: 1 0 0 5 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/phrase-table
> > > > ttable-limit: 20
> > > verbose: 2
> > > weight-d: 0.3 0.3 0.3 0.3 0.3 0.3 0.3
> > > weight-l: 0.5000
> > > weight-t: 0.20 0.20 0.20 0.20 0.20
> > > weight-w: -1
> > > input type is: text input
> > > Loading lexical distortion models...have 1 models
> > > Creating lexical reordering...
> > > weights: 0.300 0.300 0.300 0.300 0.300 0.300
> > > binary file loaded, default OFF_T: -1
> > > > Start loading LanguageModel /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm : [0.000] seconds
> > > > In LanguageModelIRST::Load: nGramOrder = 3
> > > > Language Model Type of /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm is 1
> > > > blmt
> > > loadbin()
> > > lmtable::loadbin_dict()
> > > dictionary::loadtxt wrong entry was found (0) in position 1
> > >
> > > I don't understand the reason for this error. Could you help me with
> > > this problem?
> > >
> > > Thank you,
> > > Patricia
> 
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
> 
 
--
Barry Haddow
University of Edinburgh
+44 (0) 131 651 3173

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

