Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) Page: Bayesian (https://cwiki.apache.org/confluence/display/MAHOUT/Bayesian)
Edited by Grant Ingersoll:
---------------------------------------------------------------------
h1. Intro

Mahout currently has two implementations of Bayesian classifiers. One is the traditional Naive Bayes approach, and the other is called Complementary Naive Bayes.

h1. Implementations

[NaiveBayes] ([MAHOUT-9|http://issues.apache.org/jira/browse/MAHOUT-9])

[Complementary Naive Bayes] ([MAHOUT-60|http://issues.apache.org/jira/browse/MAHOUT-60])

The Naive Bayes implementations in Mahout follow the paper [http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf]

Before we get to the actual algorithm, let's define the terminology.

Given, in an input set of classified documents:
# j = 0 to N features
# k = 0 to L labels

Then:
# The Normalized Frequency for a term (feature) in a document is calculated by dividing the term frequency by the root mean square of the term frequencies in that document.
# The Weight Normalized Tf for a given feature in a given label is the sum of the Normalized Frequency of the feature across all the documents in the label.
# The Weight Normalized Tf-Idf for a given feature in a label is the Tf-Idf calculated using the standard idf multiplied by the Weight Normalized Tf.

Once the Weight Normalized Tf-Idf (W-N-Tf-Idf) is calculated, the final weight matrices for Bayes and CBayes are calculated as follows.

We calculate the sum of W-N-Tf-Idf over all the features in a label. This is called Sigma_k or sumLabelWeight.

For Bayes:

{noformat}
Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N ) ]
{noformat}

For CBayes:

We calculate the sum of W-N-Tf-Idf across all labels for a given feature. We call this sumFeatureWeight or Sigma_j.

We also sum the W-N-Tf-Idf weights over every (feature, label) pair in the training set. Call this Sigma_jSigma_k.

The final weight is calculated as:

{noformat}
Weight = Log [ ( Sigma_j - W-N-Tf-Idf + alpha_i ) / ( Sigma_jSigma_k - Sigma_k + N ) ]
{noformat}

h1. Examples

In Mahout's example code, there are two samples that can be used:
# [Wikipedia Bayes Example] - Classify Wikipedia data.
# [Twenty Newsgroups] - Classify the classic Twenty Newsgroups data.
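
The weight formulas given under Implementations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Mahout's actual code: the function names {{normalized_frequencies}} and {{train_weights}} are hypothetical, the "standard idf" is assumed to be log(number of documents / document frequency), the root mean square is assumed to be taken over the distinct terms of a document, alpha_i is assumed to be a single constant (1.0), and N is taken to be the number of distinct features.

```python
import math
from collections import defaultdict


def normalized_frequencies(term_counts):
    """Divide each term frequency by the root mean square of the
    document's term frequencies (assumed: over its distinct terms).
    The document is assumed to be non-empty."""
    rms = math.sqrt(sum(c * c for c in term_counts.values()) / len(term_counts))
    return {term: count / rms for term, count in term_counts.items()}


def train_weights(docs, alpha=1.0):
    """docs is a list of (label, {term: count}) pairs.
    Returns two dicts keyed by (feature, label): Bayes and CBayes weights."""
    n_features = len({term for _, counts in docs for term in counts})  # N
    labels = sorted({label for label, _ in docs})

    # Weight Normalized Tf: sum of normalized frequencies per (feature, label).
    wn_tf = defaultdict(float)
    doc_freq = defaultdict(int)
    for label, counts in docs:
        for term, nf in normalized_frequencies(counts).items():
            wn_tf[(term, label)] += nf
            doc_freq[term] += 1

    # Weight Normalized Tf-Idf, with idf assumed to be log(num_docs / doc_freq).
    wn_tfidf = {
        (term, label): tf * math.log(len(docs) / doc_freq[term])
        for (term, label), tf in wn_tf.items()
    }

    sigma_k = defaultdict(float)  # sumLabelWeight, per label
    sigma_j = defaultdict(float)  # sumFeatureWeight, per feature
    for (term, label), w in wn_tfidf.items():
        sigma_k[label] += w
        sigma_j[term] += w
    sigma_jk = sum(wn_tfidf.values())  # Sigma_jSigma_k, over all pairs

    bayes, cbayes = {}, {}
    for term in doc_freq:
        for label in labels:
            w = wn_tfidf.get((term, label), 0.0)
            bayes[(term, label)] = math.log(
                (w + alpha) / (sigma_k[label] + n_features)
            )
            cbayes[(term, label)] = math.log(
                (sigma_j[term] - w + alpha)
                / (sigma_jk - sigma_k[label] + n_features)
            )
    return bayes, cbayes


# Toy training set with two labels (made-up data, for illustration only).
docs = [
    ("sports", {"ball": 2, "goal": 1}),
    ("tech", {"code": 3, "ball": 1}),
]
bayes, cbayes = train_weights(docs)
```

In this toy set the term "goal" occurs only under "sports", so its Bayes weight is higher for "sports" than for "tech", while its CBayes weight is lower for "sports", because the complement weight measures how strongly a feature is associated with all the other labels.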
