[
https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Palumbo updated MAHOUT-1564:
-----------------------------------
Description:
The MapReduce Naive Bayes implementation currently lacks the ability to classify a
new document (outside of the training/holdout corpus). I've begun some work on
a "ClassifyNew" job which will do the following:
1. Vectorize a new text document using the dictionary and document frequencies
from the training/holdout corpus
- assume the original corpus was vectorized using `seq2sparse`; step (1)
will use all of the same parameters.
2. Score and label a new document using a previously trained model.
I think that it will be a useful addition to the NB package. Unfortunately,
this is going to be mostly MR workhorse code and doesn't really introduce much
new logic. I will try to keep any new logic separate from the MR code so that it
can be called from Scala for MAHOUT-1493.
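As a rough illustration of the two steps above, here is a minimal sketch in plain Python. It does not use Mahout's APIs and is not the proposed "ClassifyNew" job: the function names (`vectorize`, `classify`), the IDF formula, the toy dictionary, and the model weights are all assumptions made up for the example. A real job would have to reuse the exact weighting and parameters that `seq2sparse` applied to the training corpus.

```python
import math
from collections import Counter


def vectorize(tokens, dictionary, doc_freq, num_docs):
    """Step 1: map tokens to a sparse TF-IDF vector {term_index: weight}.

    Only the training dictionary and document frequencies are used.
    Tokens missing from the dictionary are dropped, because the trained
    model has no weights for them.
    """
    counts = Counter(t for t in tokens if t in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        # Illustrative IDF formula (an assumption); the real job must
        # match the weighting used when the training vectors were built.
        idf = math.log(num_docs / (1 + doc_freq[idx])) + 1.0
        vector[idx] = tf * idf
    return vector


def classify(vector, log_priors, log_likelihoods):
    """Step 2: score the vector against each label and pick the best."""
    scores = {}
    for label, prior in log_priors.items():
        weights = log_likelihoods[label]
        scores[label] = prior + sum(w * weights[idx] for idx, w in vector.items())
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical artifacts standing in for the training output.
dictionary = {"goal": 0, "match": 1, "election": 2, "vote": 3}
doc_freq = {0: 3, 1: 2, 2: 3, 3: 2}
num_docs = 10
log_priors = {"sports": math.log(0.5), "politics": math.log(0.5)}
log_likelihoods = {
    "sports": {0: math.log(0.4), 1: math.log(0.4), 2: math.log(0.1), 3: math.log(0.1)},
    "politics": {0: math.log(0.1), 1: math.log(0.1), 2: math.log(0.4), 3: math.log(0.4)},
}

new_doc = "the match ended with a late goal".split()
vector = vectorize(new_doc, dictionary, doc_freq, num_docs)
label, scores = classify(vector, log_priors, log_likelihoods)
print(label)  # sports
```

Keeping `vectorize` and `classify` as plain functions, with no job plumbing inside them, mirrors the intent of keeping new logic separate from the MR code.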
was:
MapReduce Naive Bayes implementation currently lacks the ability to classify a
new document (outside of the training/holdout corpus). I've begun some work on
a "ClassifyNew" job which will do the following:
1. Vectorize a new text document using the dictionary and document frequencies
from the training/holdout corpus
- assuming the original corpus was vectorized using `seq2sparse`, step (1)
will use all of the same parameters.
2. Score and Label a new document using a previously trained model.
I think that it will be a useful addition to the NB package. Unfortunately,
this is going to be mostly MR workhorse code and doesn't really introduce much
new logic. I will try to keep any new logic separate from MR code so that it
can be called from scala for MAHOUT-1493.
> Naive Bayes Classifier for New Text Documents
> ---------------------------------------------
>
> Key: MAHOUT-1564
> URL: https://issues.apache.org/jira/browse/MAHOUT-1564
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.9
> Reporter: Andrew Palumbo
> Fix For: 1.0