On Aug 12, 7:26 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Aug 12, 12:26 pm, Brandon <[EMAIL PROTECTED]> wrote:
>
> > You are very correct about the Laplace adjustment. However, a more
> > precise statement of my overall problem would involve training and
> > testing which utilizes bigram probabilities derived in part from the
> > Laplace adjustment; as I understand the workflow that I should follow,
> > I can't allow myself to be constrained only to bigrams that actually
> > exist in training or my overall probability when I run through testing
> > will be thrown off to 0 as soon as a test bigram that doesn't exist in
> > training is encountered. Hence my desire to find all possible bigrams
> > in train (having taken steps to ensure proper set relations between
> > train and test). The best way I can currently see to do this is with
> > my current two-dictionary "caper", and by iterating over foo, not
> > bar :)
>
> I can't grok large chunks of the above, especially these troublesome
> test bigrams that don't exist in training but which you desire to find
> in train(ing?).
>
> However let's look at the mechanics: Are you now saying that your
> original assertion "I am certain that all keys in bar belong to foo as
> well" was not quite "precise"? If not, please explain why you think
> you need to iterate (slowly) over foo in order to accomplish your
> stated task.
I was merely trying to be brief. The statement of my certainty about
foo/bar was precise as a stand-alone statement; what I was trying to say
is that, within the context of the larger problem, I need to iterate over
foo. This is actually for a school project, but since I have already
worked out a feasible (if perhaps not entirely optimized) workflow, I
don't feel overly guilty about sharing it or getting some small amount of
input - but certainly none is asked for beyond what you've given me :)

I am tasked with finding the joint probability of a test sequence, using
bigram probabilities derived from train(ing) counts. I have ensured that
all members (unigrams) of test are also members of train, but I have no
idea of the bigram frequencies in test. Thus I need to enumerate all
possible bigrams over train when computing training bigram frequencies,
so that I am prepared for any test bigram I might encounter.

The problem is that without Laplace smoothing, many POTENTIAL bigrams in
train may have an ACTUAL frequency of 0 in train. If one or more of those
zero-frequency bigrams actually turns up in test, the joint probability
of test becomes 0, and that's no fun at all.

So I made a foo dictionary that holds all POTENTIAL training bigrams with
a smoothed frequency of 1, and a bar dictionary whose keys are all ACTUAL
training bigrams with their observed counts. I need to combine the two
dictionaries as a first step towards finding the test sequence
probability: any bigram in test then has at least a smoothed train
frequency of 1, and possibly a smoothed frequency of the existing train
count + 1. After the merge, foo is the dictionary that holds these
smoothed, combined train frequencies (a minimal sketch follows at the end
of this message). I don't see a way to build the combined counts in one
dictionary without keeping the two kinds of counts separate first. Hence
the caper.

Sorry for the small essay.

P.S. I do realize that there are better smoothing methods than Laplace,
but that is what the problem specifies.
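To make the caper concrete, here is a minimal runnable sketch of what I
have in mind (the toy corpus and the helper name bigram_prob are
illustrative only, not from the actual assignment; note that the merge
itself only needs to walk bar, since bar's keys are a subset of foo's,
which I take to be your point about iterating slowly over foo):

from itertools import product
from math import log

# Hypothetical toy data; the real training/test corpora aren't shown here.
train_tokens = "the cat sat on the mat".split()
test_tokens = "the mat sat".split()
vocab = set(train_tokens)
V = len(vocab)

# foo: every POTENTIAL bigram over the training vocabulary, seeded with
# the Laplace count of 1.
foo = dict.fromkeys(product(vocab, repeat=2), 1)

# bar: every ACTUAL training bigram with its observed count.
bar = {}
for pair in zip(train_tokens, train_tokens[1:]):
    bar[pair] = bar.get(pair, 0) + 1

# Merge: every key of bar is also a key of foo, so iterating over the
# (smaller) bar suffices. foo ends up holding count + 1 for observed
# bigrams and 1 everywhere else.
for pair, count in bar.items():
    foo[pair] += count

# Unigram counts, needed in the denominator of the conditional probability.
unigram = {}
for w in train_tokens:
    unigram[w] = unigram.get(w, 0) + 1

# Laplace-smoothed bigram probability:
#   P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)
def bigram_prob(w1, w2):
    return foo[w1, w2] / float(unigram[w1] + V)

# Joint probability of the test sequence, summed in log space so longer
# sequences don't underflow. No unseen bigram can zero this out now.
log_p = sum(log(bigram_prob(w1, w2))
            for w1, w2 in zip(test_tokens, test_tokens[1:]))
print(log_p)

On this toy corpus the result is a finite log probability even though the
test bigram ('mat', 'sat') never occurs in training, which is exactly the
point of the smoothing.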