If a sample does not belong to A, B, or C, then there is effectively another class, D or "Unknown", besides A/B/C. You should include some of these samples in the training set so that naive Bayes can learn the priors for that class.

-Xiangrui
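To illustrate the suggestion of training with an explicit extra class, here is a minimal sketch in plain Python. It is not the MLlib API: the functions `train_nb` and `posteriors`, the label `"unknown"`, and the toy documents are all hypothetical, and the model is a small multinomial naive Bayes with add-one smoothing written only for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, smoothing=1.0):
    """Fit a multinomial naive Bayes model on (label, tokens) pairs."""
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)
    for label, tokens in docs:
        token_counts[label].update(tokens)
    vocab = {t for _, tokens in docs for t in tokens}
    total_docs = sum(class_counts.values())
    # Class priors are estimated from how often each label occurs.
    log_priors = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_likelihoods = {}
    for c in class_counts:
        denom = sum(token_counts[c].values()) + smoothing * len(vocab)
        log_likelihoods[c] = {
            t: math.log((token_counts[c][t] + smoothing) / denom) for t in vocab
        }
    return log_priors, log_likelihoods, vocab

def posteriors(model, tokens):
    """Return normalized posterior probabilities over all trained classes."""
    log_priors, log_likelihoods, vocab = model
    scores = {
        c: log_priors[c] + sum(log_likelihoods[c][t] for t in tokens if t in vocab)
        for c in log_priors
    }
    top = max(scores.values())  # subtract the max for numerical stability
    unnormalized = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

# Hypothetical toy data: three real classes plus background "unknown" samples.
train = [
    ("A", ["spark", "cluster", "rdd"]),
    ("B", ["football", "goal", "league"]),
    ("C", ["recipe", "oven", "flour"]),
    ("unknown", ["weather", "rain", "forecast"]),
    ("unknown", ["stock", "market", "shares"]),
]
model = train_nb(train)

off_topic = posteriors(model, ["rain", "forecast", "market"])
on_topic = posteriors(model, ["spark", "rdd"])
off_topic_label = max(off_topic, key=off_topic.get)
on_topic_label = max(on_topic, key=on_topic.get)
```

With the extra class in the training set, the off-topic document has somewhere to go other than A, B, or C. How well this works in practice depends on how representative the "unknown" samples are of the documents seen at scoring time.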
On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet <jatinpr...@gmail.com> wrote:
> Hi,
>
> I am using MLlib's Naive Bayes classifier to classify textual data. I am
> accessing the posterior probabilities for each class through a hack.
>
> Once I have trained the model, I want to remove documents whose
> classification confidence is low. Say, for a document, if the highest class
> probability is lower than a pre-defined threshold (separate for each class),
> categorize this document as 'unknown'.
>
> Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
> respectively, defined after training and testing. If I score a sample that
> belongs to none of the three categories, I wish to classify it as
> 'unknown'. But the issue is that I can get a probability higher than these
> thresholds for a document that doesn't belong to the trained categories.
>
> Is there any technique I can apply to segregate documents that belong
> to untrained classes with a certain degree of confidence?
>
> Thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
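The per-class threshold rule described in the question can be sketched as follows. This is plain Python, not MLlib code: `classify_with_threshold` is a hypothetical helper, the posterior values are made up for illustration, and how the posteriors are extracted from the model is not shown. The sketch also makes one limitation visible: posteriors normalized over three classes sum to 1, so the winning class always has probability of at least 1/3, which already exceeds the 0.32 and 0.33 thresholds.

```python
def classify_with_threshold(posteriors, thresholds, fallback="unknown"):
    """Return the top class, or `fallback` if its posterior is below
    the threshold defined for that class."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= thresholds[best] else fallback

# Thresholds taken from the question; posterior values below are invented.
thresholds = {"A": 0.35, "B": 0.32, "C": 0.33}

# Clear winner: B is accepted.
confident = classify_with_threshold({"A": 0.10, "B": 0.70, "C": 0.20}, thresholds)

# A wins with 0.34, which is below A's threshold of 0.35, so it is rejected.
borderline = classify_with_threshold({"A": 0.34, "B": 0.33, "C": 0.33}, thresholds)

# B wins with 0.34. A winner among three normalized posteriors is always
# at least 1/3, so B's threshold of 0.32 can never reject anything.
never_rejected = classify_with_threshold({"A": 0.33, "B": 0.34, "C": 0.33}, thresholds)
```

Because the posterior only compares the trained classes against each other, a document from an untrained class can still score well above these thresholds, which is the behavior reported in the question.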