Hi,

I am using MLlib's Naive Bayes classifier to classify textual data, and I am
accessing the posterior probabilities for each class through a hack.
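For context, this is roughly the kind of thing I mean by the hack -- only a
sketch, assuming the mllib 1.x API where NaiveBayesModel exposes labels, pi
(log class priors) and theta (log conditional probabilities) as public
fields; the posteriors helper name is just for illustration:

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

// Compute P(class | doc) for every class from a trained multinomial model,
// using the public pi and theta fields and log-sum-exp normalisation.
def posteriors(model: NaiveBayesModel, features: Vector): Array[(Double, Double)] = {
  val x = features.toArray
  // un-normalised log posterior per class: log P(c) + sum_j x_j * log P(w_j | c)
  val logScores = model.pi.zip(model.theta).map { case (logPrior, logTheta) =>
    logPrior + logTheta.zip(x).map { case (lt, xi) => lt * xi }.sum
  }
  val max = logScores.max
  val expScores = logScores.map(s => math.exp(s - max))
  val total = expScores.sum
  model.labels.zip(expScores.map(_ / total))
}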

Once I have trained the model, I want to filter out documents whose
classification confidence is low: if a document's highest class probability
is below a pre-defined threshold (separate for each class), I categorize
that document as 'unknown'.

Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
respectively, defined after training and testing. If I score a sample that
belongs to none of the three categories, I wish to classify it as
'unknown'. The issue is that a document which doesn't belong to any of the
trained categories can still score a probability higher than these
thresholds.
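In code, the filtering I have in mind looks roughly like this (again only a
sketch: the thresholds map and the classify helper are hypothetical, labels
0.0/1.0/2.0 stand in for A, B and C, and posteriors is the helper from the
sketch above):

// per-class thresholds chosen after training and testing
val thresholds = Map(0.0 -> 0.35, 1.0 -> 0.32, 2.0 -> 0.33)

// take the most probable class, but fall back to "unknown" when its
// posterior does not clear that class's threshold
def classify(model: NaiveBayesModel, features: Vector): String = {
  val (bestLabel, bestProb) = posteriors(model, features).maxBy(_._2)
  if (bestProb >= thresholds.getOrElse(bestLabel, 1.0)) bestLabel.toString
  else "unknown"
}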

Is there any technique I can apply to segregate, with some degree of
confidence, documents that belong to untrained classes?

Thanks




