If a sample does not belong to A, B, or C, then there is effectively
another class, D or Unknown, besides A/B/C. You should include some
such samples in the training set so that naive Bayes can learn the
priors for that class. -Xiangrui
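
As a rough illustration of the point about priors, here is a minimal
pure-Python sketch (not MLlib code). It assumes additive smoothing of the
class counts; the function name `log_priors`, the `smoothing` parameter,
and the toy label list are hypothetical. It shows that a class only gets
a prior if it appears in the training labels.

```python
import math
from collections import Counter


def log_priors(labels, smoothing=1.0):
    """Estimate smoothed log class priors from a list of training labels.

    Assumption: additive (Laplace-style) smoothing of the class counts.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    log_denom = math.log(total + smoothing * num_classes)
    return {c: math.log(n + smoothing) - log_denom for c, n in counts.items()}


# Toy training labels: three known classes only.
known_only = ["A"] * 4 + ["B"] * 3 + ["C"] * 2
# Same data plus one sample explicitly labelled "unknown".
with_unknown = known_only + ["unknown"]

priors_known = log_priors(known_only)
priors_all = log_priors(with_unknown)

print(sorted(priors_known))  # no "unknown" entry
print(sorted(priors_all))    # includes "unknown"
```

Without any "unknown" samples the model has no prior (and no feature
likelihoods) for that class, so every document is forced into A, B, or C.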

On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet <jatinpr...@gmail.com> wrote:
> Hi,
>
> I am using MLlib's Naive Bayes classifier to classify textual data. I am
> accessing the posterior probabilities for each class through a hack.
>
> Once I have trained the model, I want to remove documents whose
> classification confidence is low. That is, if the highest class
> probability for a document is lower than a pre-defined threshold (set
> separately for each class), I want to categorize the document as 'unknown'.
>
> Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
> respectively, defined after training and testing. If I score a sample that
> belongs to none of the three categories, I wish to classify it as
> 'unknown'. The issue is that I can get a probability higher than these
> thresholds for a document that doesn't belong to the trained categories.
>
> Is there any technique I can apply to segregate documents that belong
> to untrained classes with a certain degree of confidence?
>
> Thanks
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
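
For reference, the per-class threshold rule described in the quoted
question can be sketched as below. This is only an illustration in plain
Python, not MLlib code: the function name `classify_with_thresholds`, the
`fallback` parameter, and the example posteriors are hypothetical, and it
assumes the posteriors are already available as a dict that sums to 1.

```python
def classify_with_thresholds(posteriors, thresholds, fallback="unknown"):
    """Return the top class, or `fallback` if its posterior is below
    the threshold configured for that class."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < thresholds[best]:
        return fallback
    return best


# Thresholds taken from the question above.
THRESHOLDS = {"A": 0.35, "B": 0.32, "C": 0.33}

# Nearly uniform posterior: the top class A falls below its threshold.
low_confidence = classify_with_thresholds(
    {"A": 0.34, "B": 0.33, "C": 0.33}, THRESHOLDS)

# Confident prediction: the top class B clears its threshold.
high_confidence = classify_with_thresholds(
    {"A": 0.1, "B": 0.8, "C": 0.1}, THRESHOLDS)

print(low_confidence, high_confidence)
```

Note that with three classes and posteriors normalized to sum to 1, the
highest posterior is always at least 1/3, so thresholds this close to 1/3
can reject very few documents.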
