I built a language model with IRSTLM on 20 million tokenized English
sentences and tested it on the following two sentences:

1. Yesterday when I was walking towards home , I saw a kangaroo .
2. smdnbs sadb jghsa sdabasd asasd tsados hasdb , I saw a snake .

As we can see, the first portion of the second sentence is complete
garbage, while the first sentence is properly grammatical. I was
surprised to see that the second sentence got a higher probability score
(-27.887135) than the first one (-28.91925).

My guess is that this happens because of back-off, but I am not sure.
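For what it's worth, here is a toy sketch of how a Katz-style back-off lookup behaves. All probabilities and back-off weights below are made up for illustration (not from my model); the point is that a fully OOV word falls back to the single <unk> unigram score, which can easily be cheaper than the score of a rare in-vocabulary word:

```python
# Toy Katz-style back-off lookup (hypothetical log10 values, not from
# the real .blm model). An unseen trigram backs off to the bigram with a
# back-off penalty, then to the unigram; a fully OOV word gets the
# catch-all <unk> unigram probability.

LOG_PROBS = {
    ("a", "kangaroo"): -5.36,   # rare word: low bigram probability
    ("kangaroo",): -5.9,
    ("<unk>",): -2.23,          # unknown-word probability
}
BACKOFFS = {("saw", "a"): -0.3, ("a",): -0.2}  # log10 back-off weights

def score(context, word):
    """Return log10 P(word | context), backing off to shorter n-grams."""
    for i in range(len(context) + 1):
        key = context[i:] + (word,)
        if key in LOG_PROBS:
            # add one back-off penalty per context word dropped
            penalty = sum(BACKOFFS.get(context[j:], 0.0) for j in range(i))
            return LOG_PROBS[key] + penalty
    return LOG_PROBS[("<unk>",)]  # fully OOV: single <unk> score

print(score(("saw", "a"), "kangaroo"))  # backs off to the rare bigram
print(score(("saw", "a"), "smdnbs"))    # OOV: just the <unk> score, -2.23
```

Under these toy numbers the rare word costs more (-5.36 plus a back-off penalty) than the OOV word (-2.23), which is exactly the pattern in the two sentences above.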

echo 'Yesterday when I was walking towards home , I saw a kangaroo .' |
/usr/bin/query english-lcc-ilci-ukwac-tok-20M-n3.blm 2> /tmp/a
Yesterday=126222 2 -4.08843 when=409 3 -2.51627 I=260 3 -0.58336 was=771 3
-0.764257 walking=1624 3 -2.58353 towards=1335 3 -1.95033 home=388 2
-3.910977 ,=209 3 -1.15596 I=260 3 -1.55485 saw=4411 3 -2.31963 a=131 3
-0.886832 kangaroo=106652 2 -5.3615108 .=10 3 -1.24128 </s>=11 3
-0.00203508 Total:
-28.91925 OOV: 0
Perplexity including OOVs: 116.32170228822577
Perplexity excluding OOVs: 116.32170228822577
OOVs: 0
Tokens: 14

echo 'smdnbs sadb jghsa sdabasd asasd tsados hasdb , I saw a snake .' |
/usr/bin/query english-lcc-ilci-ukwac-tok-20M-n3.blm 2> /tmp/a
smdnbs=0 1 -4.0025997 sadb=0 1 -2.23153 jghsa=0 1 -2.23153 sdabasd=0 1
-2.23153 asasd=0 1 -2.23153 tsados=0 1 -2.23153 hasdb=0 1 -2.23153 ,=209 1
-1.42496 I=260 2 -1.9045 saw=4411 3 -2.31963 a=131 3 -0.886832 snake=3768 3
-3.16116 .=10 3 -0.793541 </s>=11 3 -0.0047327 Total: -27.887135 OOV: 7
Perplexity including OOVs: 98.16082104257269
Perplexity excluding OOVs: 31.57449745907425
OOVs: 7
Tokens: 14
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support