If you know there is data that doesn't belong to any existing category,
put it into the training set and create a new category for it. This
won't help much if the instances from this unknown category are all
outliers. In that case, lower the thresholds and tune the parameters
to get a lower error rate. -Xiangrui
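As a rough illustration of the suggestion above, here is a minimal pure-Python multinomial naive Bayes sketch. It is not MLlib code: the helpers `train` and `predict`, the toy documents, and the label name "unknown" are all hypothetical and chosen only to show the effect of adding an extra category to the training set.

```python
import math
from collections import Counter, defaultdict


def train(docs, smoothing=1.0):
    """Fit a toy multinomial naive Bayes model.

    docs is a list of (label, tokens) pairs. Additive smoothing is an
    assumption of this sketch, not a statement about any library.
    """
    class_docs = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)
    for label, tokens in docs:
        token_counts[label].update(tokens)
    vocab = {t for _, tokens in docs for t in tokens}
    n_docs = len(docs)
    log_prior = {c: math.log(k / n_docs) for c, k in class_docs.items()}
    log_like = {}
    for c in class_docs:
        denom = sum(token_counts[c].values()) + smoothing * len(vocab)
        log_like[c] = {
            t: math.log((token_counts[c][t] + smoothing) / denom) for t in vocab
        }
    return log_prior, log_like, vocab


def predict(model, tokens):
    """Return the label with the highest log joint score.

    Tokens never seen in training are ignored.
    """
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy data: two real categories plus off-topic documents.
known = [
    ("A", ["spark", "cluster", "job"]),
    ("A", ["spark", "executor", "memory"]),
    ("B", ["bayes", "prior", "class"]),
    ("B", ["naive", "bayes", "model"]),
]
other = [
    ("unknown", ["recipe", "pasta", "sauce"]),
    ("unknown", ["football", "match", "score"]),
]

model_closed = train(known)        # trained on A and B only
model_open = train(known + other)  # trained with the extra category

print(predict(model_closed, ["pasta", "recipe"]))
print(predict(model_open, ["pasta", "recipe"]))
print(predict(model_open, ["spark", "job"]))
```

The model trained only on A and B has to answer A or B for the off-topic query, while the model that also saw "unknown" examples can assign it to the new category. A query that shares no tokens with any "unknown" training document gets no benefit from the extra class, which is the outlier caveat mentioned above.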
On Thu, Feb 19, 2015 at 8:58 AM, Jatinpreet Singh jatinpr...@gmail.com wrote:
Hi Xiangrui,
Thanks for the answer. The problem is that in my application, I cannot stop
users from scoring any type of sample against the trained model.
So, even if the class of a completely unknown sample has not been trained,
the model will put it in one of the categories with high probability. I wish to
eliminate this with some kind of probability threshold. Is this possible in
any way with Naive Bayes? Can changing the classification algorithm help in
this regard?
I appreciate any help on this.
Thanks,
Jatin
On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng men...@gmail.com wrote:
If there exists a sample that doesn't belong to A/B/C, it means
that there exists another class D or Unknown besides A/B/C. You should
have some of these samples in the training set in order to let naive
Bayes learn the priors. -Xiangrui
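To make the point about priors concrete, the sketch below estimates class priors from training labels. The function `log_priors`, its `smoothing` parameter, and the label counts are hypothetical; additive smoothing is assumed here for illustration and is not claimed to match MLlib's exact formula.

```python
import math
from collections import Counter


def log_priors(labels, smoothing=1.0):
    """Estimate log class priors from label counts with additive smoothing."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(counts)
    return {c: math.log(n + smoothing) - math.log(total) for c, n in counts.items()}


# Hypothetical training labels, with and without examples of class D.
labels_abc = ["A"] * 40 + ["B"] * 30 + ["C"] * 30
labels_abcd = ["A"] * 40 + ["B"] * 30 + ["C"] * 20 + ["D"] * 10

priors_abc = {c: math.exp(lp) for c, lp in log_priors(labels_abc).items()}
priors_abcd = {c: math.exp(lp) for c, lp in log_priors(labels_abcd).items()}

print(sorted(priors_abc))
print(sorted(priors_abcd))
```

A class that never appears in the training labels has no entry in the priors at all, so the classifier cannot predict it. Only when some D samples are present does D receive a prior and become a possible output.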
On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet jatinpr...@gmail.com wrote:
Hi,
I am using MLlib's Naive Bayes classifier to classify textual data. I am
accessing the posterior probabilities for each class through a hack.
Once I have trained the model, I want to remove documents whose confidence
of classification is low. Say, for a document, if the highest class
probability is lower than a pre-defined threshold (separate for each class),
categorize this document as 'unknown'.
Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
respectively, defined after training and testing. If I score a sample that
belongs to none of the three categories, I wish to classify it as
'unknown'. But the issue is that I can get a probability higher than these
thresholds for a document that doesn't belong to the trained categories.
Is there any technique which I can apply to segregate documents that belong
to untrained classes with a certain degree of confidence?
Thanks
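The per-class threshold rule described in this message can be sketched in a few lines of Python. The threshold values come from the message; the function `classify_with_reject` and the posterior values are hypothetical, and the sketch assumes the posteriors have already been obtained and sum to 1.

```python
# Per-class rejection thresholds from the example in the message.
THRESHOLDS = {"A": 0.35, "B": 0.32, "C": 0.33}


def classify_with_reject(posteriors, thresholds=THRESHOLDS, reject_label="unknown"):
    """Return the most probable class, or reject_label when that class's
    posterior is below its own threshold."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < thresholds[best]:
        return reject_label
    return best


# Hypothetical posteriors, for illustration only.
confident = {"A": 0.80, "B": 0.15, "C": 0.05}
ambiguous = {"A": 0.34, "B": 0.33, "C": 0.33}
off_topic = {"A": 0.05, "B": 0.90, "C": 0.05}  # document from an untrained class

print(classify_with_reject(confident))
print(classify_with_reject(ambiguous))
print(classify_with_reject(off_topic))
```

Because the posteriors are normalized over the trained classes only, a document from an untrained class can still receive a high posterior for one of them, as in `off_topic`, and pass the threshold. Also note that with three classes the largest posterior is never below 1/3, so thresholds of 0.32 and 0.33 can never reject a document whose top class is B or C.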
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
--
Regards,
Jatinpreet Singh