CC Leah, who added the Bernoulli option to MLlib's NaiveBayes. -Xiangrui

On Wed, Apr 15, 2015 at 4:49 AM, 姜林和 <linhe_ji...@163.com> wrote:
> Dear Meng:
>
> Thanks for the great work on Spark machine learning. I saw the changes
> to the NaiveBayes algorithm, separating it into a multinomial model and
> a Bernoulli model, but something confuses me:
>
> The calculations of
>   P(Ci)   -- pi(i)
>   P(j|Ci) -- theta(i,j)
> should both differ between the multinomial and Bernoulli models, but I
> can only see theta(i,j) being calculated in a different way, not pi(i).
>
> Bernoulli:
> The original feature vector entries must be 0 or 1, where 1 indicates
> that word j exists in document i.
>
>   pi(i) = (number of documents of class C(i) + lambda)
>           / (number of documents of all classes + 2*lambda)
>   theta(i)(j) = (number of documents in which j exists in class C(i) + lambda)
>                 / (number of documents of class C(i) + 2*lambda)
>
> Multinomial:
>
>   pi(i) = (number of words of class C(i) + lambda)
>           / (number of words of all classes + numFeatures*lambda)
>   theta(i)(j) = (number of words j in class C(i) + lambda)
>                 / (number of words in class C(i) + numFeatures*lambda)
>
> Comparison of the two models:
>
>   pi(i)
>     Multinomial definition: number of words of class C(i)
>       math.log(numAllWordsOfC + lambda) - piLogDenom
>     Bernoulli definition: number of documents of class C(i)
>       math.log(n + lambda) - piLogDenom
>
>   piLogDenom
>     Multinomial definition: number of words of all classes
>       math.log(numAllWords + numFeatures*lambda)
>     Bernoulli definition: number of documents of all classes
>       math.log(numDocuments + 2*lambda)
>
>   theta(i)(j)
>     Multinomial definition: number of words j in class C(i)
>       math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
>     Bernoulli definition: number of documents in which j exists in class C(i)
>       math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
>
>   thetaLogDenom
>     Multinomial definition: number of words in class C(i)
>       math.log(numAllWordsOfC + numFeatures*lambda)
>     Bernoulli definition: number of documents of class C(i)
>       math.log(n + 2*lambda)
>
> Best regards!
>
> Linhe Jiang
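For reference, the four estimators quoted above can be sketched on a toy corpus. This is a minimal illustration of the formulas as written in the question, not the actual MLlib implementation; the corpus, `docs`, `labels`, and `lam` are all hypothetical names.

```python
import math

# Hypothetical toy corpus: 3 documents over a 3-word vocabulary.
# Each row is a term-frequency vector; labels[i] is the class of docs[i].
docs = [[2, 0, 1], [0, 1, 0], [1, 1, 1]]
labels = [0, 1, 0]
num_features = 3
lam = 1.0  # the smoothing parameter "lambda" from the formulas above

def multinomial(docs, labels, num_features, lam):
    """pi and theta from word counts (formulas quoted for the multinomial model)."""
    log_pi, log_theta = {}, {}
    total_words = sum(sum(d) for d in docs)
    for c in sorted(set(labels)):
        rows = [d for d, y in zip(docs, labels) if y == c]
        words_c = sum(sum(d) for d in rows)
        # pi(i) = (words of class + lambda) / (words of all classes + numFeatures*lambda)
        log_pi[c] = math.log(words_c + lam) - math.log(total_words + num_features * lam)
        # theta(i)(j) = (count of word j in class + lambda) / (words in class + numFeatures*lambda)
        denom = math.log(words_c + num_features * lam)
        log_theta[c] = [math.log(sum(d[j] for d in rows) + lam) - denom
                        for j in range(num_features)]
    return log_pi, log_theta

def bernoulli(docs, labels, num_features, lam):
    """pi and theta from document counts (formulas quoted for the Bernoulli model)."""
    log_pi, log_theta = {}, {}
    n_docs = len(docs)
    for c in sorted(set(labels)):
        rows = [d for d, y in zip(docs, labels) if y == c]
        n_c = len(rows)
        # pi(i) = (docs of class + lambda) / (all docs + 2*lambda)
        log_pi[c] = math.log(n_c + lam) - math.log(n_docs + 2 * lam)
        # theta(i)(j) = (docs of class containing j + lambda) / (docs of class + 2*lambda)
        denom = math.log(n_c + 2 * lam)
        log_theta[c] = [math.log(sum(1 for d in rows if d[j] > 0) + lam) - denom
                        for j in range(num_features)]
    return log_pi, log_theta

m_pi, m_theta = multinomial(docs, labels, num_features, lam)
b_pi, b_theta = bernoulli(docs, labels, num_features, lam)
```

Note the structural difference the two smoothing denominators imply: with `numFeatures*lambda`, each multinomial `theta(i)` exponentiates to a distribution over the vocabulary (it sums to 1 across j), while with `2*lambda` each Bernoulli `theta(i)(j)` is a per-word presence probability strictly between 0 and 1.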