If you know there is data that doesn't belong to any existing category,
put it into the training set and make a new category for it. This
won't help much if the instances from this unknown category are all
outliers. In that case, lower the thresholds and tune the parameters
to get a lower error rate. -Xiangrui
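
For readers of the archive, here is a minimal sketch of the two ideas discussed in this thread: training an explicit 'unknown' category from out-of-category samples, and rejecting a prediction whose highest posterior falls below a per-class threshold. It is a small pure-Python multinomial naive Bayes, not MLlib's API. The function names (train_nb, posteriors, classify), the toy documents, and the threshold values are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """Fit a multinomial naive Bayes model with additive smoothing.

    docs: list of (label, list_of_tokens) pairs.
    """
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    model = {}
    for label in class_docs:
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / total),
            "log_lik": {t: math.log((token_counts[label][t] + alpha) / denom)
                        for t in vocab},
        }
    return model

def posteriors(model, tokens):
    """Return normalized class posteriors; out-of-vocabulary tokens are ignored."""
    scores = {}
    for label, params in model.items():
        s = params["log_prior"]
        for t in tokens:
            if t in params["log_lik"]:
                s += params["log_lik"][t]
        scores[label] = s
    top = max(scores.values())  # subtract the max before exp for stability
    exp = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exp.values())
    return {label: v / z for label, v in exp.items()}

def classify(model, tokens, thresholds, unknown="unknown"):
    """Pick the most probable class, falling back to `unknown` when the
    winner is the unknown class or its posterior is below its threshold."""
    probs = posteriors(model, tokens)
    best = max(probs, key=probs.get)
    if best == unknown or probs[best] < thresholds.get(best, 0.0):
        return unknown
    return best

# Toy data (hypothetical): two real categories plus an explicit 'unknown'
# category built from samples that belong to neither.
train = [
    ("A", ["spark", "cluster", "job"]),
    ("A", ["spark", "executor", "job"]),
    ("B", ["goal", "match", "team"]),
    ("B", ["team", "score", "goal"]),
    ("unknown", ["recipe", "oven", "bake"]),
    ("unknown", ["bake", "flour", "recipe"]),
]
model = train_nb(train)
thresholds = {"A": 0.6, "B": 0.6}  # assumed values, tuned on held-out data in practice

print(classify(model, ["spark", "job"], thresholds))    # clearly category A
print(classify(model, ["recipe", "bake"], thresholds))  # matches the unknown category
print(classify(model, ["spark", "goal"], thresholds))   # ambiguous, below threshold
```

Note that the threshold alone only catches ambiguous documents. A document from an untrained class can still receive a high normalized posterior for one of the trained classes, which is why the explicit 'unknown' category in the training set matters.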

On Thu, Feb 19, 2015 at 8:58 AM, Jatinpreet Singh <jatinpr...@gmail.com> wrote:
> Hi Xiangrui,
>
> Thanks for the answer. The problem is that in my application, I cannot stop
> users from scoring any type of sample against the trained model.
>
> So, even if the class of a completely unknown sample has not been trained,
> the model will put it in one of the categories with high probability. I wish to
> eliminate this with some kind of probability threshold. Is this possible in
> any way with Naive Bayes? Can changing the classification algorithm help in
> this regard?
>
> I appreciate any help on this.
>
> Thanks,
> Jatin
>
> On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> If there exists a sample that doesn't belong to A/B/C, it means
>> that there exists another class D or Unknown besides A/B/C. You should
>> have some of these samples in the training set in order to let naive
>> Bayes learn the priors. -Xiangrui
>>
>> On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet <jatinpr...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am using MLlib's Naive Bayes classifier to classify textual data. I am
>> > accessing the posterior probabilities for each class through a hack.
>> >
>> > Once I have trained the model, I want to remove documents whose
>> > classification confidence is low. Say, for a document, if the highest class
>> > probability is less than a pre-defined threshold (separate for each class),
>> > categorize this document as 'unknown'.
>> >
>> > Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33
>> > respectively, defined after training and testing. If I score a sample that
>> > belongs to none of the three categories, I wish to classify it as
>> > 'unknown'. But the issue is that I can get a probability higher than these
>> > thresholds for a document that doesn't belong to the trained categories.
>> >
>> > Is there any technique which I can apply to segregate documents that belong
>> > to untrained classes with a certain degree of confidence?
>> >
>> > Thanks
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>
>
>
>
> --
> Regards,
> Jatinpreet Singh
