[ https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912284#action_12912284 ]
Ted Dunning commented on MAHOUT-479:
------------------------------------
I recently committed a change that introduced ModelDissector, which is useful
for reverse engineering feature-hashed models.
The idea is that the hashed encoders have the option of having a trace
dictionary. This tells us where each feature is hashed to, or each
feature/value combination in the case of word-like values. Using this
dictionary, we can put values into a synthetic feature vector in just the
locations specified by a single feature or interaction. Then we can push this
through a linear model to see the contribution of that input. This works for
any generalized linear model, such as logistic regression, because such a model
has a linear part that can be evaluated on its own.
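As a rough illustration of the idea above, here is a minimal Python sketch. It
is not Mahout code: the names hashed_location, encode and contribution, the use
of CRC32 as the hash, and the single hash location per feature are all
assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_FEATURES = 16  # size of the hashed feature vector (arbitrary for this sketch)


def hashed_location(name, value, num_features=NUM_FEATURES):
    """Map a feature/value pair to one slot of the hashed vector (hypothetical hash)."""
    key = f"{name}={value}".encode("utf-8")
    return zlib.crc32(key) % num_features


def encode(name, value, vector, trace=None):
    """Add a feature to the vector and, if a trace dictionary is given, record its slot."""
    slot = hashed_location(name, value, len(vector))
    vector[slot] += 1.0
    if trace is not None:
        trace[f"{name}={value}"].add(slot)
    return slot


def contribution(slots, weights):
    """Score a synthetic vector that is 1.0 in `slots` and 0.0 everywhere else."""
    synthetic = [0.0] * len(weights)
    for slot in slots:
        synthetic[slot] = 1.0
    return sum(w * x for w, x in zip(weights, synthetic))


# Encode one word-like feature while tracing where it lands.
trace = defaultdict(set)
vector = [0.0] * NUM_FEATURES
encode("word", "cheap", vector, trace)

# A made-up linear model whose only non-zero weight sits in that slot.
weights = [0.0] * NUM_FEATURES
weights[hashed_location("word", "cheap")] = 2.5

cheap_contribution = contribution(trace["word=cheap"], weights)
```

Only the linear part of the model is evaluated here; for logistic regression
the link function would be applied afterwards and is not needed to rank
features by their contribution.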
The ModelDissector accepts a trace dictionary and a model in an update method.
Then, in a flush method, the biggest weights are returned. This update/flush
style is used so that the trace dictionary doesn't have to grow to an enormous
size, but can instead be cleared between updates.
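The update/flush pattern described above could look roughly like the following
Python sketch. DissectorSketch is a hypothetical stand-in, not the actual
ModelDissector class; the method signatures, the plain list used as the model,
and the choice to rank by absolute weight are assumptions.

```python
import heapq


class DissectorSketch:
    """Hypothetical stand-in for the update/flush pattern, not Mahout's class."""

    def __init__(self):
        # feature name -> summed model weight over the slots it hashes to
        self._weights = {}

    def update(self, trace, model_weights):
        """Record the weight of every feature currently in the trace dictionary."""
        for feature, slots in trace.items():
            self._weights[feature] = sum(model_weights[s] for s in slots)

    def flush(self, n):
        """Return the n features with the largest absolute weight."""
        return heapq.nlargest(n, self._weights.items(), key=lambda kv: abs(kv[1]))


model_weights = [0.0, 3.0, -4.0, 0.5]  # made-up linear model
dissector = DissectorSketch()

# First batch of traced features, then clear the dictionary so it stays small.
trace = {"word=cheap": {1}, "word=hello": {3}}
dissector.update(trace, model_weights)
trace.clear()

# Second batch reuses the same, now empty, dictionary.
trace["word=refund"] = {2}
dissector.update(trace, model_weights)
trace.clear()

top = dissector.flush(2)
```

Because the dissector keeps only one number per feature, the caller is free to
clear the trace dictionary after every update.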
> Streamline classification/ clustering data structures
> -----------------------------------------------------
>
> Key: MAHOUT-479
> URL: https://issues.apache.org/jira/browse/MAHOUT-479
> Project: Mahout
> Issue Type: Improvement
> Components: Classification, Clustering
> Affects Versions: 0.1, 0.2, 0.3, 0.4
> Reporter: Isabel Drost
> Assignee: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our
> classification and clustering algorithms to make integration for users easier
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was
> that our classification (and clustering) stuff was all over the map in terms
> of data structures. Driving that to rest and getting those comments even
> vaguely as plug and play as our much more advanced recommendation components
> would be very, very helpful.
> {quote}
> This issue probably also relates to MAHOUT-287 (intention there is to make
> naive bayes run on vectors as input).
> Ted, Jake, Robin: It would be great if one of you could add a comment on
> some of the issues you discussed "the other evening" and (if applicable) any
> minor or major changes you think could help solve this issue.