Re: [jira] Commented: (MAHOUT-479) Streamline classification/ clustering data structures

Ted Dunning Tue, 17 Aug 2010 14:31:42 -0700

I think it is important to be able to load a classification object that
implements something like AbstractVectorClassifier.

The use case I have in mind is real-time classification.  Here, we would
need to accept input, convert it to vector form, and get a classification
output from a model for a single input at a time, typically inside some
kind of web service.
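
A minimal sketch of that request path, for illustration only.  The
VectorClassifier interface, the hashing encoder, and the fixed-weight
logistic model are all hypothetical stand-ins, not Mahout's API; a real
deployment would load a trained model from storage instead of building
one inline.

```java
import java.util.Arrays;

public class RealTimeClassifierSketch {
    /** Hypothetical stand-in for something like AbstractVectorClassifier. */
    interface VectorClassifier {
        /** Returns one score per category for a single input vector. */
        double[] classify(double[] features);
    }

    /** Hypothetical encoder: hashes whitespace-separated tokens into a fixed-size count vector. */
    static double[] encode(String text, int dim) {
        double[] v = new double[dim];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                v[Math.floorMod(token.hashCode(), dim)] += 1.0;
            }
        }
        return v;
    }

    /** Toy two-class logistic model with fixed weights (assumed, not trained). */
    static VectorClassifier logistic(double[] weights, double bias) {
        return features -> {
            if (features.length != weights.length) {
                throw new IllegalArgumentException("feature/weight size mismatch");
            }
            double z = bias;
            for (int i = 0; i < weights.length; i++) {
                z += weights[i] * features[i];
            }
            double p = 1.0 / (1.0 + Math.exp(-z));
            return new double[] {1.0 - p, p};
        };
    }

    /** The per-request path: raw input, then vector, then classification output. */
    static double[] handleRequest(VectorClassifier model, String input, int dim) {
        return model.classify(encode(input, dim));
    }

    public static void main(String[] args) {
        int dim = 8;
        VectorClassifier model = logistic(new double[dim], 0.0);
        System.out.println(Arrays.toString(handleRequest(model, "hello world", dim)));
    }
}
```

The point of the sketch is that the model is loaded once and then
handleRequest() is called per input, with no batch job involved.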

The model could come from supervised learning (classifier) or unsupervised
learning (clusterer).

Clusters are commonly used as features for classifiers.  Classifiers trained
on some external result are also used as features.  Thus we need to be able
to load several models, evaluate some on the raw input and then evaluate
others on the outputs of the first ones as well as the rest of the feature
vector.
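
The chaining described above could be sketched roughly as follows.
Everything here is assumed for illustration: VectorModel, the
nearest-centroid clusterer, and the linear second-stage model are
hypothetical and are not taken from Mahout.

```java
import java.util.Collections;
import java.util.List;

public class StackedModelsSketch {
    /** Hypothetical common interface for classifiers and clusterers. */
    interface VectorModel {
        double[] score(double[] input);
    }

    /** Toy clusterer: one-hot indicator of the closest centroid (squared Euclidean distance). */
    static VectorModel nearestCentroid(double[][] centroids) {
        return input -> {
            double[] out = new double[centroids.length];
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int k = 0; k < centroids.length; k++) {
                double dist = 0.0;
                for (int i = 0; i < input.length; i++) {
                    double d = input[i] - centroids[k][i];
                    dist += d * d;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = k;
                }
            }
            out[best] = 1.0;
            return out;
        };
    }

    /** Toy second-stage classifier: logistic score over the combined feature vector. */
    static VectorModel linear(double[] weights, double bias) {
        return input -> {
            if (input.length != weights.length) {
                throw new IllegalArgumentException("feature/weight size mismatch");
            }
            double z = bias;
            for (int i = 0; i < weights.length; i++) {
                z += weights[i] * input[i];
            }
            return new double[] {1.0 / (1.0 + Math.exp(-z))};
        };
    }

    static double[] concat(double[] a, double[] b) {
        double[] out = new double[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Runs first-stage models on the raw input, then the second stage on raw input plus their outputs. */
    static double[] evaluate(List<VectorModel> firstStage, VectorModel secondStage, double[] raw) {
        double[] features = raw;
        for (VectorModel m : firstStage) {
            features = concat(features, m.score(raw));
        }
        return secondStage.score(features);
    }

    public static void main(String[] args) {
        VectorModel clusters = nearestCentroid(new double[][] {{0.0, 0.0}, {1.0, 1.0}});
        VectorModel top = linear(new double[] {0.0, 0.0, 2.0, -2.0}, 0.0);
        double[] result = evaluate(Collections.singletonList(clusters), top, new double[] {0.0, 0.1});
        System.out.println(result[0]);
    }
}
```

Treating clusterers and classifiers as the same kind of object is what
makes this composition simple: the second stage only sees a longer
feature vector.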

Model learning and clustering are typically done off-line and the current
very shiny and new command-line interface is probably fine for that.

Model deployment is another matter, and there a real-time capability is a
must.

On Tue, Aug 17, 2010 at 2:06 PM, Jeff Eastman <[email protected]> wrote:

> The clusterData() process for most algorithms produces a single,
> most-likely cluster assignment, usually the closest cluster. For Dirichlet
> and FuzzyK, the clustering can be specified to use the most-likely
> assignment (the default) or a pdf threshold can be specified above which
> multiple cluster assignments will be output. All clusterData() processes
> produce WeightedVectorWritable objects in persistent storage which contain a
> probability weight and the input vector. These sequence files are keyed by
> the clusterId and are output to the clusteredPoints directory.
>
> The buildClusters() step is always run from the command line but the
> clusterData step is optional (-cl flag). It would be straightforward to
> support the other use case (clusterData only). Users who instantiate the
> drivers from Java code can call either/both at their discretion now.
>
> I've also implemented an execution method (-xm) parameter on all clustering
> drivers which allows the sequential, in-memory reference implementation to
> be invoked from the command line using the same arguments as the mapreduce
> implementation. The display examples use these now, except Dirichlet which I
> didn't get to before I left.
>
> Given this information, what do you now see as logical next steps?
>
