[jira] Commented: (MAHOUT-479) Streamline classification/ clustering data structures

Jeff Eastman (JIRA) Wed, 18 Aug 2010 08:35:39 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899874#action_12899874 ]


Jeff Eastman commented on MAHOUT-479:
-------------------------------------

I don't see AbstractVectorClassifier as a superclass of these models, since it 
needs to operate on a set of models rather than on an individual model. What I 
see is something more like the code below. I realize classify() does not return 
a vector of the right size, but how else can it be normalized properly, given 
that an arbitrary set of model pdfs won't sum to 1? Also, what about making 
AbstractVectorClassifier work over VectorWritables instead of Vectors? All the 
clustering code uses VWs.

{code}
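// Package names below are assumed from the Mahout trunk of the time; adjust as needed.
import java.util.List;

import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.clustering.Model;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.function.TimesFunction;

/** Classifies a vector against a set of models by normalizing their pdfs. */
public class VectorModelClassifier extends AbstractVectorClassifier {

  private final List<Model<VectorWritable>> models;

  public VectorModelClassifier(List<Model<VectorWritable>> models) {
    this.models = models;
  }

  /** Returns one normalized pdf per model, i.e. numCategories() elements. */
  @Override
  public Vector classify(Vector instance) {
    Vector pdfs = new DenseVector(models.size());
    // Wrap once: Model.pdf() takes a VectorWritable, not a Vector.
    VectorWritable vw = new VectorWritable(instance);
    int i = 0;
    for (Model<VectorWritable> model : models) {
      pdfs.set(i++, model.pdf(vw));
    }
    // Scale so the entries sum to 1; the result is NaN if every pdf is 0.
    return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
  }

  /** Two-model case only: returns the normalized pdf of the first model. */
  @Override
  public double classifyScalar(Vector instance) {
    if (models.size() == 2) {
      VectorWritable vw = new VectorWritable(instance);
      double pdf0 = models.get(0).pdf(vw);
      double pdf1 = models.get(1).pdf(vw);
      return pdf0 / (pdf0 + pdf1);
    }
    throw new IllegalStateException(
        "classifyScalar requires exactly 2 models, got " + models.size());
  }

  @Override
  public int numCategories() {
    return models.size();
  }
}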
{code}
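
To make the vector-size question concrete, here is a minimal plain-Java sketch of the normalization step on its own, with no Mahout types. The class and method names are hypothetical. It assumes the AbstractVectorClassifier contract is for classify() to return numCategories() - 1 scores with the remaining score implied as one minus their sum; both that reading and the choice of dropping the first entry are assumptions, not something confirmed in this thread.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of Mahout.
final class PdfNormalizationSketch {

  /** Scales raw per-model pdf values so that they sum to 1. */
  static double[] normalize(double[] pdfs) {
    double sum = 0.0;
    for (double p : pdfs) {
      sum += p;
    }
    if (sum == 0.0) {
      // No model assigns any density to the instance; nothing to normalize.
      throw new IllegalArgumentException("all pdf values are zero");
    }
    double[] normalized = new double[pdfs.length];
    for (int i = 0; i < pdfs.length; i++) {
      normalized[i] = pdfs[i] / sum;
    }
    return normalized;
  }

  /**
   * Drops the first entry of a normalized vector (assumed convention).
   * The dropped score can be recovered as 1 minus the sum of the rest.
   */
  static double[] dropFirst(double[] normalized) {
    return Arrays.copyOfRange(normalized, 1, normalized.length);
  }

  public static void main(String[] args) {
    double[] rawPdfs = {1.0, 3.0, 4.0};      // arbitrary densities, sum is 8
    double[] full = normalize(rawPdfs);
    System.out.println(Arrays.toString(full));            // [0.125, 0.375, 0.5]
    System.out.println(Arrays.toString(dropFirst(full))); // [0.375, 0.5]
  }
}
```

Normalizing needs all n pdf values, so the full n-element vector has to be computed either way; returning n - 1 of them afterwards is only a matter of which view of the result is handed back.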

> Streamline classification/ clustering data structures
> -----------------------------------------------------
>
>                 Key: MAHOUT-479
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-479
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.1, 0.2, 0.3, 0.4
>            Reporter: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our 
> classification and clustering algorithms to make integration for users easier 
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was 
> that our classification (and clustering) stuff was all over the map in terms 
> of data structures.  Driving that to rest and getting those comments even 
> vaguely as plug and play as our much more advanced recommendation components 
> would be very, very helpful.
> {quote}
> This issue probably also relates to MAHOUT-287 (intention there is to make 
> naive bayes run on vectors as input).
> Ted, Jake, Robin: Would be great if one of you could add a comment on 
> some of the issues you discussed "the other evening" and (if applicable) any 
> minor or major changes you think could help solve this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
