[
https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898504#action_12898504
]
Ted Dunning commented on MAHOUT-479:
------------------------------------
Unification of the resulting models is probably much easier than the
unification of the model building process itself.
Some of the problems I have seen include:
a) all of our clustering and classification models should be able to accept
vectors and produce either a "most-likely" category or a vector of scores for
all possible categories. Unfortunately, there is no uniform way to load a
model from a file, no uniform object structure for these models, and no
consistent way to call them.
b) most of our learning algorithms would be happy with vectors, but there is a
pretty fundamental difference between good ways to call hadoop-based and
sequential training algorithms. The sequential stuff is traditional java so
the interface is very easy. The parallel stuff is considerably harder to make
into a really good interface. We may learn some tricks with Plume or we may be
able to use the Distributed Row Matrix, but neither is an obvious answer.
c) in some cases, the vectors are noticeably larger than the original data.
This occurs when the original data is very sparse and we are looking at lots of
interaction variables. Again, for sequential algorithms, this is pretty easy
to deal with, but for parallel ones, it really might be better to store the
original data and pass in a function that handles the vectorization on the fly.
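
To make points (a) and (c) concrete, here is a minimal sketch in plain Java. It
is not an existing Mahout API: VectorModel, NearestCentroidModel and
VectorizingModel are hypothetical names, plain double[] arrays stand in for
Mahout vectors, and the nearest-centroid scoring is only a toy placeholder for
a real model. The idea is that every model exposes a score vector and a
"most-likely" category through one interface, and that a wrapper takes a
vectorization function so raw records can be encoded on the fly.

```java
import java.util.function.Function;

/** Hypothetical uniform scoring interface; an assumption, not Mahout code. */
interface VectorModel {
    /** Returns one score per category for the given feature vector. */
    double[] scores(double[] features);

    /** Returns the index of the highest-scoring ("most-likely") category. */
    default int mostLikely(double[] features) {
        double[] s = scores(features);
        int best = 0;
        for (int i = 1; i < s.length; i++) {
            if (s[i] > s[best]) {
                best = i;
            }
        }
        return best;
    }
}

/** Toy model: a category's score is the negative squared distance to its centroid. */
final class NearestCentroidModel implements VectorModel {
    private final double[][] centroids;

    NearestCentroidModel(double[][] centroids) {
        this.centroids = centroids;
    }

    @Override
    public double[] scores(double[] features) {
        double[] out = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            double d2 = 0.0;
            for (int j = 0; j < features.length; j++) {
                double diff = features[j] - centroids[c][j];
                d2 += diff * diff;
            }
            out[c] = -d2;
        }
        return out;
    }
}

/** Wraps a model with a vectorizer so raw records are encoded only when scored. */
final class VectorizingModel<T> {
    private final Function<T, double[]> vectorizer;
    private final VectorModel model;

    VectorizingModel(Function<T, double[]> vectorizer, VectorModel model) {
        this.vectorizer = vectorizer;
        this.model = model;
    }

    int mostLikely(T record) {
        return model.mostLikely(vectorizer.apply(record));
    }

    double[] scores(T record) {
        return model.scores(vectorizer.apply(record));
    }
}
```

The vectorizer is where interaction variables would be expanded, so only the
compact original records need to be stored. This sketch covers the sequential
case only. How to pass such a function cleanly into a hadoop-based trainer is
still the open question from (b).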
> Streamline classification/ clustering data structures
> -----------------------------------------------------
>
> Key: MAHOUT-479
> URL: https://issues.apache.org/jira/browse/MAHOUT-479
> Project: Mahout
> Issue Type: Improvement
> Components: Classification, Clustering
> Affects Versions: 0.1, 0.2, 0.3, 0.4
> Reporter: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our
> classification and clustering algorithms to make integration for users easier
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was
> that our classification (and clustering) stuff was all over the map in terms
> of data structures. Driving that to rest and getting those comments even
> vaguely as plug and play as our much more advanced recommendation components
> would be very, very helpful.
> {quote}
> This issue probably also relates to MAHOUT-287 (intention there is to make
> naive bayes run on vectors as input).
> Ted, Jake, Robin: It would be great if one of you could add a comment on
> some of the issues you discussed "the other evening" and (if applicable) any
> minor or major changes you think could help solve this issue.