Hi, I am using MLlib's Naive Bayes classifier to classify textual data, and I am accessing the posterior probabilities for each class through a hack.
Once I have trained the model, I want to filter out documents whose classification confidence is low: if a document's highest class probability is below a pre-defined, per-class threshold, I would categorize that document as 'unknown'. For example, say there are three classes A, B, and C, with thresholds 0.35, 0.32, and 0.33 respectively, chosen after training and testing. If I score a sample that belongs to none of the three categories, I want to classify it as 'unknown'. The issue is that a document which does not belong to any of the trained categories can still receive a probability higher than these thresholds. Is there a technique I can apply to segregate, with some degree of confidence, documents that belong to untrained classes?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
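For reference, here is a minimal sketch of the per-class thresholding step I have in mind, written as plain Python and independent of Spark. It assumes the posterior probabilities have already been extracted from the model; the class names and threshold values are just the illustrative ones from above.

```python
# Hypothetical per-class thresholds, chosen after training and testing.
CLASS_THRESHOLDS = {"A": 0.35, "B": 0.32, "C": 0.33}

def classify_with_unknown(posteriors, thresholds=CLASS_THRESHOLDS):
    """Return the most probable class, or 'unknown' if that class's
    posterior probability falls below its own threshold.

    posteriors: dict mapping class label -> posterior probability,
    as obtained (by whatever means) from the trained model.
    """
    best_class = max(posteriors, key=posteriors.get)
    if posteriors[best_class] < thresholds[best_class]:
        return "unknown"
    return best_class

# Example: the winning class "A" clears its threshold, so it is kept.
print(classify_with_unknown({"A": 0.50, "B": 0.30, "C": 0.20}))  # A
# Example: "A" wins with 0.34, below its 0.35 threshold -> 'unknown'.
print(classify_with_unknown({"A": 0.34, "B": 0.33, "C": 0.33}))  # unknown
```

Note that this sketch is exactly the logic that fails in the case I describe: an out-of-category document can still produce a winning posterior above the threshold, since Naive Bayes posteriors are normalized over the trained classes only.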