Re: Unknown sample in Naive Baye's

2015-02-19 Thread Xiangrui Meng
If you know that some data do not belong to any existing category,
put them into the training set and create a new category for them. This
won't help much if instances from this unknown category are all
outliers. In that case, lower the thresholds and tune the parameters
to get a lower error rate. -Xiangrui
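
To make the per-class threshold rule discussed in this thread concrete, here is a minimal sketch in plain Python. The function name `classify_with_thresholds` and the dict-based interface are assumptions made for illustration; they are not part of MLlib. The posteriors are assumed to have been obtained separately.

```python
# Hypothetical helper: the name and the dict-based interface are assumptions,
# not an MLlib API. `posteriors` maps class label -> posterior probability.
def classify_with_thresholds(posteriors, thresholds, unknown_label="unknown"):
    """Return the most probable class, or `unknown_label` when its posterior
    falls below the threshold configured for that class."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < thresholds[best]:
        return unknown_label
    return best


# Thresholds taken from the original question in this thread.
thresholds = {"A": 0.35, "B": 0.32, "C": 0.33}

near_uniform = classify_with_thresholds({"A": 0.34, "B": 0.33, "C": 0.33}, thresholds)
confident = classify_with_thresholds({"A": 0.90, "B": 0.05, "C": 0.05}, thresholds)
print(near_uniform, confident)
```

One thing the sketch makes visible: with three classes the posteriors sum to 1, so the largest one is never below 1/3. Thresholds of 0.32 and 0.33 therefore cannot reject a sample on their own, which may be part of why out-of-category documents pass the check.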

On Thu, Feb 19, 2015 at 8:58 AM, Jatinpreet Singh jatinpr...@gmail.com wrote:
 Hi Xiangrui,

 Thanks for the answer. The problem is that in my application, I cannot stop
 users from scoring any type of sample against the trained model.

 So, even if the class of a completely unknown sample has not been trained,
 the model will put it in one of the categories with high probability. I wish to
 eliminate this with some kind of probability threshold. Is this possible in
 any way with Naive Bayes? Can changing the classification algorithm help in
 this regard?

 I appreciate any help on this.

 Thanks,
 Jatin

 On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng men...@gmail.com wrote:

 If there exists a sample that doesn't belong to A/B/C, it means
 that there exists another class D or Unknown besides A/B/C. You should
 have some of these samples in the training set in order to let naive
 Bayes learn the priors. -Xiangrui

 On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet jatinpr...@gmail.com wrote:
  Hi,
 
  I am using MLlib's Naive Bayes classifier to classify textual data. I am
  accessing the posterior probabilities through a hack for each class.
  
  Once I have trained the model, I want to remove documents whose confidence
  of classification is low. Say for a document, if the highest class
  probability is lower than a pre-defined threshold (separate for each class),
  categorize this document as 'unknown'.
  
  Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
  respectively defined after training and testing. If I score a sample that
  belongs to none of the three categories, I wish to classify it as
  'unknown'. But the issue is I can get a probability higher than these
  thresholds for a document that doesn't belong to the trained categories.
  
  Is there any technique which I can apply to segregate documents that belong
  to untrained classes with a certain degree of confidence?
 
  Thanks
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 




 --
 Regards,
 Jatinpreet Singh




Re: Unknown sample in Naive Baye's

2015-02-17 Thread Xiangrui Meng
If there exists a sample that doesn't belong to A/B/C, it means
that there exists another class D or Unknown besides A/B/C. You should
have some of these samples in the training set in order to let naive
Bayes learn the priors. -Xiangrui
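
As an illustration of the suggestion above, here is a minimal multinomial naive Bayes sketch in plain Python that is trained with an explicit "Unknown" class, so that the class prior is learned from the training counts. This is not MLlib code: the function names `train_nb` and `posteriors`, the toy documents, and the smoothing value `alpha=1.0` are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, alpha=1.0):
    """Train multinomial naive Bayes with additive (Laplace) smoothing.
    `docs` is a list of (label, tokens) pairs. Returns log priors and
    per-class log token likelihoods."""
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs)
    # Priors come straight from how often each class appears in training.
    log_prior = {c: math.log(k / n_docs) for c, k in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik


def posteriors(tokens, log_prior, log_lik):
    """Normalized class posteriors; tokens outside the vocabulary are ignored."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
        for c in log_prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


# Toy training set (invented for illustration) with an explicit Unknown class.
docs = [
    ("A", ["spark", "cluster"]),
    ("B", ["goal", "match"]),
    ("C", ["stock", "market"]),
    ("Unknown", ["recipe", "oven"]),
    ("Unknown", ["weather", "rain"]),
]
log_prior, log_lik = train_nb(docs)
post = posteriors(["recipe", "rain"], log_prior, log_lik)
print(max(post, key=post.get))
```

The caveat from the thread still applies: this only helps when the Unknown samples in the training set resemble what users will actually submit. A document unlike anything seen in training is still assigned to whichever class happens to score highest.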

On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet jatinpr...@gmail.com wrote:
 Hi,

 I am using MLlib's Naive Bayes classifier to classify textual data. I am
 accessing the posterior probabilities through a hack for each class.

 Once I have trained the model, I want to remove documents whose confidence
 of classification is low. Say for a document, if the highest class
 probability is lower than a pre-defined threshold (separate for each class),
 categorize this document as 'unknown'.

 Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
 respectively defined after training and testing. If I score a sample that
 belongs to none of the three categories, I wish to classify it as
 'unknown'. But the issue is I can get a probability higher than these
 thresholds for a document that doesn't belong to the trained categories.

 Is there any technique which I can apply to segregate documents that belong
 to untrained classes with a certain degree of confidence?

 Thanks




