I understood you. The assignment of "weights" to words (or to other features) happens automatically.
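To make "automatically" concrete, here is a minimal sketch of how a naive Bayes classifier could derive per-category word weights from labelled sentences. It is a toy written for this thread, not OpenNLP's implementation: the class NaiveBayesWeights and its methods are hypothetical names, the add-one smoothing is an assumption, and the training data is just the two example sentences from the original question.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial naive Bayes (hypothetical class, not part of OpenNLP). */
class NaiveBayesWeights {
    // wordCounts.get(category).get(word) = times the word was seen in that category
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Records one labelled training sentence by counting its tokens. */
    public void train(String category, String text) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** The learned "weight": log P(word | category), with add-one smoothing (an assumption). */
    public double weight(String category, String word) {
        Map<String, Integer> counts =
                wordCounts.getOrDefault(category, Collections.emptyMap());
        int count = counts.getOrDefault(word, 0);
        int total = totalWords.getOrDefault(category, 0);
        return Math.log((count + 1.0) / (total + vocabulary.size()));
    }

    /** Picks the category with the highest log prior plus summed word weights. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            for (String token : tokenize(text)) {
                // Words never seen in training carry no evidence, so skip them.
                if (vocabulary.contains(token)) score += weight(category, token);
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9']+");
    }

    public static void main(String[] args) {
        NaiveBayesWeights nb = new NaiveBayesWeights();
        nb.train("Sports", "We played basket ball, under the sun.");
        nb.train("Nature", "The sun is a big ball of fire.");
        for (String word : new String[] {"played", "ball", "fire"}) {
            System.out.printf("%-7s Sports=%.3f Nature=%.3f%n",
                    word, nb.weight("Sports", word), nb.weight("Nature", word));
        }
        System.out.println(nb.classify("played basket ball"));
    }
}
```

Nobody tells the sketch that 'played' matters more for Sports than for Nature; the difference falls out of the counts. In this kind of model, the usual way to shift a word's weight for a category is to add more labelled examples for that category.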
Here's a set of slides on how the naive Bayes classifier learns those
"weights":

http://www.slideshare.net/aiaioo/fun-with-text-hacking-text-analytics
(you may want to start at slide 16)

Does that answer your question?

Cohan

On Wed, Jan 18, 2017 at 7:52 PM, Manoj B. Narayanan <[email protected]> wrote:

> Hi Cohan,
>
> Thanks for the reply. I am not sure my intention was conveyed properly. To
> rephrase it: let us assume that we have three key tokens that decide the
> outcome. Of the three, one token can mean a lot to one outcome, while the
> same token can appear in another outcome with less importance. In this
> case, while computing the overall score, is it possible to boost the
> weight of one particular token for that outcome?
>
> When I have closely related outcomes, the words used in the outcomes will
> overlap. In such a case, I should be able to teach the machine certain
> words that should be given importance when calculating the likelihood for
> a particular outcome and treated normally when calculating the likelihood
> for other outcomes.
>
> For example, the word 'player' is more important in a 'Sport' outcome than
> in a 'Politics' outcome.
> 1. He has been a very popular basket ball player among our country's clubs
> since the 90's. - Sport
> 2. The country's changes made it a very popular player in world politics
> since the 90's. - Politics
>
> While calculating the likelihood of sentence 1 for the 'Sport' outcome,
> the word 'player' will be given more weight than 'player' in the
> 'Politics' outcome.
>
> The worst case will be when I have 3 outcomes and 3 tokens used in all 3
> outcomes, with each outcome giving importance to 1 of the 3 tokens. This
> will be the same worst case as before, where the surrounding words
> determine the outcome. But the best case will improve by a lot.
>
> Say I have a sentence of 10 words. 9/10 words say that the sentence
> belongs to A. 5/10 say that the sentence belongs to B.
> I know that the sentence belongs to B. But A would be chosen over B.
>
> What I suggest is that, when calculating the likelihood for B, I would
> boost one or more of the 5 tokens which say that the sentence belongs to
> B, so that the machine would choose B over A.
>
> I believe I have made my intention clearer.
>
> Manoj.
>
> On Wed, Jan 18, 2017 at 5:11 PM, Cohan Sujay Carlos <[email protected]>
> wrote:
>
> > In machine learning, one learns the weights you're speaking of, Manoj.
> >
> > So, the words that are more important for any category are given higher
> > weightage during classification.
> >
> > However, rather than requiring a user to manually assign these weights, a
> > machine learning system learns the weights from training data.
> >
> > That's what happens when you call, say,
> > DocumentCategorizerME.train("en", sampleStream);
> >
> > The model that the train method returns is just a record of the "weights"
> > that have been learnt.
> >
> > Cohan
> >
> > On Wed, Jan 18, 2017 at 4:18 PM, Manoj B. Narayanan <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > I was wondering if there is a way to assign weights to certain words of a
> > > class in the Document Classifier.
> > >
> > > Some words are important for a particular class. Even though these words
> > > may occur in other classes, the level of importance may vary. So, if
> > > certain words in certain classes are given specific weights, it would
> > > produce more accurate results.
> > >
> > > Let me explain this with an example.
> > >
> > > Say we have 2 classes. Nature and Sports.
> > > Consider these 2 sentences:
> > > 1. We played basket ball, under the sun.
> > > 2. The sun is a big ball of fire.
> > >
> > > In the first sentence, which belongs to the class 'Sports', the words
> > > 'played', 'basket', 'ball' are more important than the word 'sun'. Whereas,
> > > in the second sentence, the words 'sun' and 'fire' are more important than
> > > the word 'ball'.
> > >
> > > The level of importance can be assigned by assigning weights to a few
> > > specific words that are distinct for a class.
> > >
> > > Is there already a way to do this in OpenNLP Document Classifier? If not,
> > > please consider this.
