On 21.12.15 at 11:36, Steven D'Aprano wrote:
On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:

Apfelkiste:Tests chris$ python score_my.py
-8.74  baby lions at play
-7.63  saturday_morning12
-6.38  Fukushima
-5.72  ImpossibleFork
-10.6  xy39mGWbosjY
-12.9  9sjz7s8198ghwt
-12.1  rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43  bnsip atl ayba loy

Thanks Christian and Peter for the suggestion, I'll certainly investigate
this further.

But the scoring doesn't seem very good. "baby lions at play" is 100% English
words, and ought to have a radically different score from (say)
xy39mGWbosjY which is extremely non-English like. (How many English words
do you know of with W, X, two Y, and J?) And yet they are only two units
apart. "baby lions..." is a score almost as negative as the authentic
gibberish, while Fukushima (a Japanese word) has a much less negative
score.

The cause is the spaces, which do not occur in the training wordlist (I mentioned that above, maybe not prominently enough). /usr/share/dict/words contains one word per line. The underscore _ is probably pulling the score of saturday_morning12 down, just as the spaces pull down "baby lions at play". Using trigraphs:


Apfelkiste:Tests chris$ python score_my.py
-11.5  baby lions at play
-9.88  saturday_morning12
-9.85  Fukushima
-7.68  ImpossibleFork
-13.4  xy39mGWbosjY
-14.2  9sjz7s8198ghwt
-14.2  rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'babylionsatplay'
-8.74  babylionsatplay
Apfelkiste:Tests chris$ python score_my.py 'saturdaymorning12'
-8.93  saturdaymorning12
Apfelkiste:Tests chris$

So for the spaces, either use proper training material (some long corpus from Wikipedia or similar) with punctuation removed, so that the model captures the correct probabilities at word boundaries, or preprocess the input by removing the spaces.
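
To illustrate the second option, here is a rough sketch of stripping separators before scoring. This is not the actual score_my.py; the names normalize, train and score are made up for the example, and the lowercasing, the add-one smoothing and the per-trigraph averaging are my assumptions about how such a scorer could look.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase and drop everything except a-z and 0-9 (assumed preprocessing)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def train(words, n=3):
    """Count character n-graphs over an iterable of training words."""
    counts = Counter()
    for word in words:
        w = normalize(word)
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts


def score(text, counts, n=3):
    """Mean log-probability per n-graph, using add-one smoothing (an assumption)."""
    s = normalize(text)
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    if not grams:
        # Input too short to contain a single n-graph.
        return float("-inf")
    total = sum(counts.values())
    vocab = 36 ** n  # number of possible n-graphs over a-z and 0-9
    logp = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return logp / len(grams)
```

With a real wordlist you would call something like train(open('/usr/share/dict/words')) once and then score each candidate string. Because normalize() runs on both the training words and the input, 'baby lions at play' and 'babylionsatplay' get the same score by construction.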

        Christian
--
https://mail.python.org/mailman/listinfo/python-list
