If you know there is data that doesn't belong to any existing category, put it into the training set and create a new category for it. This won't help much if the instances from the unknown category are all outliers. In that case, lower the thresholds and tune the parameters to get a lower error rate. -Xiangrui
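
For reference, the per-class thresholding discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions, not MLlib API: the function names, the dictionary of per-class log scores (log prior plus log likelihood) and the sample scores are all hypothetical. The thresholds 0.35/0.32/0.33 are the ones quoted in the original question.

```python
import math

UNKNOWN = "unknown"


def posteriors_from_log_scores(log_scores):
    """Normalize per-class log joint scores into posterior probabilities.

    Subtracting the maximum before exponentiating (log-sum-exp trick)
    avoids underflow when the log scores are large and negative.
    """
    max_score = max(log_scores.values())
    shifted = {c: math.exp(s - max_score) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}


def classify_with_thresholds(log_scores, thresholds):
    """Return (label, probability); label is UNKNOWN when the winning
    class's posterior falls below that class's own threshold."""
    probs = posteriors_from_log_scores(log_scores)
    best = max(probs, key=probs.get)
    if probs[best] < thresholds[best]:
        return UNKNOWN, probs[best]
    return best, probs[best]


# Thresholds quoted in the thread; the log scores below are made up.
thresholds = {"A": 0.35, "B": 0.32, "C": 0.33}

near_tie = {"A": -10.0, "B": -10.02, "C": -10.04}  # hypothetical scores
clear_a = {"A": -5.0, "B": -20.0, "C": -20.0}      # hypothetical scores

print(classify_with_thresholds(near_tie, thresholds))
print(classify_with_thresholds(clear_a, thresholds))
```

Note that with three classes the winning posterior can never be below 1/3, so thresholds at or below 1/3 never reject anything, and 0.35 rejects only near-ties. A sample from an untrained class can still score well above these values, which is why an explicit Unknown class in the training set is suggested above.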

On Thu, Feb 19, 2015 at 8:58 AM, Jatinpreet Singh <jatinpr...@gmail.com> wrote:
> Hi Xiangrui,
>
> Thanks for the answer. The problem is that in my application, I cannot stop
> users from scoring any type of sample against the trained model.
>
> So, even if the class of a completely unknown sample has not been trained,
> the model will put it in one of the categories with high priority. I wish to
> eliminate this with some kind of probability threshold. Is this possible in
> any way with Naive Bayes? Can changing the classification algorithm help in
> this regard?
>
> I appreciate any help on this.
>
> Thanks,
> Jatin
>
> On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> If there exists a sample that does not belong to A/B/C, it means
>> that there exists another class D or Unknown besides A/B/C. You should
>> have some of these samples in the training set in order to let naive
>> Bayes learn the priors. -Xiangrui
>>
>> On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet <jatinpr...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am using MLlib's Naive Bayes classifier to classify textual data. I am
>> > accessing the posterior probabilities for each class through a hack.
>> >
>> > Once I have trained the model, I want to remove documents whose confidence
>> > of classification is low. Say, for a document, if the highest class
>> > probability is lower than a pre-defined threshold (separate for each class),
>> > categorize this document as 'unknown'.
>> >
>> > Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
>> > respectively, defined after training and testing. If I score a sample that
>> > belongs to none of the three categories, I wish to classify it as
>> > 'unknown'. But the issue is I can get a probability higher than these
>> > thresholds for a document that doesn't belong to the trained categories.
>> >
>> > Is there any technique which I can apply to segregate documents that belong
>> > to untrained classes with a certain degree of confidence?
>> >
>> > Thanks
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>
> --
> Regards,
> Jatinpreet Singh