On 21.12.15 at 11:53, Christian Gollwitzer wrote:
So for the spaces, either use proper training material (some long
corpus from Wikipedia or the like) with the punctuation removed. Then it
will pick up the correct probabilities at word boundaries. Or preprocess
the input by removing the spaces.
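
To make the two preprocessing options concrete, here is a minimal sketch. The function name clean and the keep_spaces flag are made up for illustration, and it assumes plain ASCII English text; it is not taken from the original script.

```python
import string

def clean(text, keep_spaces=True):
    """Lowercase the text and drop punctuation, digits and other symbols.

    keep_spaces=True  -> words separated by single spaces (for training
                         on a corpus with word boundaries intact)
    keep_spaces=False -> all spaces removed as well
    Assumption: only ASCII letters count as "letters" here.
    """
    kept = [ch for ch in text.lower()
            if ch in string.ascii_lowercase or ch.isspace()]
    # split() collapses runs of whitespace left behind by removed characters
    words = "".join(kept).split()
    separator = " " if keep_spaces else ""
    return separator.join(words)

print(clean("Hello, world! It's 2015."))                     # hello world its
print(clean("Hello, world! It's 2015.", keep_spaces=False))  # helloworldits
```

Whichever variant you pick, the training corpus and the strings you score should probably go through the same function, so that both see the same alphabet.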
Christian
PS: The real log-likelihood would become -infinity when some pair does
not appear at all in the training set (pairs containing numbers, for
instance). I used 1/total as the default value in the defaultdict to
mitigate that. You could tweak that value a bit. The larger the corpus,
the more sharply the scores will separate on their own, too.
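
For reference, a rough sketch of what such a 1/total default could look like. This is a reconstruction under assumptions, not the original code: the names train_bigram_model, log_likelihood and floor_scale are made up here, and the corpus is assumed to contain at least two characters.

```python
from collections import Counter, defaultdict
from math import log

def train_bigram_model(corpus, floor_scale=1.0):
    """Estimate character-pair probabilities from a training string.

    Pairs never seen in training get floor_scale / total instead of 0,
    so their logarithm stays finite. floor_scale is a hypothetical knob
    for tweaking that value.
    """
    pairs = Counter(a + b for a, b in zip(corpus, corpus[1:]))
    total = sum(pairs.values())
    floor = floor_scale / total
    probs = defaultdict(lambda: floor)
    for pair, count in pairs.items():
        probs[pair] = count / total
    return probs

def log_likelihood(text, probs):
    """Sum of log-probabilities over all adjacent character pairs."""
    return sum(log(probs[a + b]) for a, b in zip(text, text[1:]))

model = train_bigram_model("the quick brown fox jumps over the lazy dog")
print(log_likelihood("the", model))  # pairs seen in training
print(log_likelihood("xqz", model))  # unseen pairs: lower, but finite
```

A smaller floor_scale punishes unseen pairs harder. With a larger corpus, total grows and the default shrinks by itself.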
Christian