I built a language model using IRSTLM on 20 million tokenized English sentences and tested it on the following two sentences:
1. Yesterday when I was walking towards home , I saw a kangaroo .
2. smdnbs sadb jghsa sdabasd asasd tsados hasdb , I saw a snake .

As we can see, the first portion of the second sentence is complete gibberish, while the first sentence is perfectly grammatical. I was surprised to see that the second sentence got a higher probability score (-27.887135) than the first one (-28.91925). I guess this happened due to back-off, but I am not sure.

echo 'Yesterday when I was walking towards home , I saw a kangaroo .' | /usr/bin/query english-lcc-ilci-ukwac-tok-20M-n3.blm 2> /tmp/a

Yesterday=126222 2 -4.08843
when=409 3 -2.51627
I=260 3 -0.58336
was=771 3 -0.764257
walking=1624 3 -2.58353
towards=1335 3 -1.95033
home=388 2 -3.910977
,=209 3 -1.15596
I=260 3 -1.55485
saw=4411 3 -2.31963
a=131 3 -0.886832
kangaroo=106652 2 -5.3615108
.=10 3 -1.24128
</s>=11 3 -0.00203508
Total: -28.91925 OOV: 0
Perplexity including OOVs: 116.32170228822577
Perplexity excluding OOVs: 116.32170228822577
OOVs: 0
Tokens: 14

echo 'smdnbs sadb jghsa sdabasd asasd tsados hasdb , I saw a snake .' | /usr/bin/query english-lcc-ilci-ukwac-tok-20M-n3.blm 2> /tmp/a

smdnbs=0 1 -4.0025997
sadb=0 1 -2.23153
jghsa=0 1 -2.23153
sdabasd=0 1 -2.23153
asasd=0 1 -2.23153
tsados=0 1 -2.23153
hasdb=0 1 -2.23153
,=209 1 -1.42496
I=260 2 -1.9045
saw=4411 3 -2.31963
a=131 3 -0.886832
snake=3768 3 -3.16116
.=10 3 -0.793541
</s>=11 3 -0.0047327
Total: -27.887135 OOV: 7
Perplexity including OOVs: 98.16082104257269
Perplexity excluding OOVs: 31.57449745907425
OOVs: 7
Tokens: 14
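For what it's worth, the totals and perplexities above are internally consistent. Assuming the per-token scores are base-10 log probabilities (which is what this query tool reports), perplexity is 10^(-total/tokens), with OOV tokens dropped from both the sum and the count for the "excluding OOVs" figure. A quick sanity check on the second sentence:

```python
# Recompute total and perplexities for sentence 2 from the
# per-token base-10 log probabilities reported by the query tool.

# (word, log10_prob, is_oov); OOV tokens are scored via <unk>
sent2 = [
    ("smdnbs", -4.0025997, True), ("sadb", -2.23153, True),
    ("jghsa", -2.23153, True), ("sdabasd", -2.23153, True),
    ("asasd", -2.23153, True), ("tsados", -2.23153, True),
    ("hasdb", -2.23153, True), (",", -1.42496, False),
    ("I", -1.9045, False), ("saw", -2.31963, False),
    ("a", -0.886832, False), ("snake", -3.16116, False),
    (".", -0.793541, False), ("</s>", -0.0047327, False),
]

total = sum(lp for _, lp, _ in sent2)               # -27.887135
ppl_incl = 10 ** (-total / len(sent2))              # ~98.16
in_vocab = [lp for _, lp, oov in sent2 if not oov]  # 7 known tokens
ppl_excl = 10 ** (-sum(in_vocab) / len(in_vocab))   # ~31.57

print(round(total, 6), round(ppl_incl, 2), round(ppl_excl, 2))
```

Note that each of the 7 OOV tokens gets the unigram <unk> score (about -2.2), which is less of a penalty than rare in-vocabulary words like kangaroo (-5.36) or Yesterday (-4.09) receive, so the gibberish sentence can end up with a higher total.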
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support