[jira] Commented: (MAHOUT-479) Streamline classification/ clustering data structures

Ted Dunning (JIRA) Fri, 13 Aug 2010 19:04:43 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898504#action_12898504
 ]


Ted Dunning commented on MAHOUT-479:
------------------------------------

Unification of the resulting models is probably much easier than the 
unification of the model building process itself.

Some of the problems I have seen include:

a) All of our clustering and classification models should be able to accept 
vectors and produce either a "most-likely" category or a vector of scores for 
all possible categories.  Unfortunately, there is no uniform way to load a 
model from a file, no uniform object structure for these models, and no 
consistent way to call them.

b) Most of our learning algorithms would be happy with vectors, but there is a 
pretty fundamental difference between good ways to call Hadoop-based and 
sequential training algorithms.  The sequential stuff is traditional Java, so 
the interface is very easy.  The parallel stuff is considerably harder to make 
into a really good interface.  We may learn some tricks from Plume or we may be 
able to use the Distributed Row Matrix, but neither is an obvious answer.

c) in some cases, the vectors are noticeably larger than the original data.  
This occurs when the original data is very sparse and we are looking at lots of 
interaction variables.  Again, for sequential algorithms, this is pretty easy 
to deal with, but for parallel ones, it really might be better to store the 
original data and pass in a function that handles the vectorization on the fly.
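
To make (a) and (c) concrete, here is a minimal sketch of what a uniform 
model interface and an on-the-fly vectorizer might look like.  Every name in 
it (VectorModel, NearestCentroidModel, VectorizingModel, 
InteractionVectorizer) is hypothetical and is not an existing Mahout class, 
and plain double[] arrays stand in for Mahout vectors to keep the example 
self-contained.

```java
import java.util.function.Function;

// Hypothetical interface (not an existing Mahout API): any trained model,
// classifier or clusterer, scores a vector against every category.
interface VectorModel {
    // One score per category; a larger score means a more likely category.
    double[] scores(double[] features);

    // Index of the highest-scoring ("most-likely") category.
    default int mostLikely(double[] features) {
        double[] s = scores(features);
        int best = 0;
        for (int i = 1; i < s.length; i++) {
            if (s[i] > s[best]) {
                best = i;
            }
        }
        return best;
    }
}

// Toy model for illustration: the score for a category is the negative
// squared Euclidean distance to that category's centroid.
final class NearestCentroidModel implements VectorModel {
    private final double[][] centroids;

    NearestCentroidModel(double[][] centroids) {
        this.centroids = centroids;
    }

    @Override
    public double[] scores(double[] x) {
        double[] out = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int i = 0; i < x.length; i++) {
                double diff = x[i] - centroids[c][i];
                d += diff * diff;
            }
            out[c] = -d;
        }
        return out;
    }
}

// Pairs a model with a vectorization function so that callers store and
// pass the original records; the expanded vector exists only transiently.
final class VectorizingModel<T> {
    private final Function<T, double[]> vectorizer;
    private final VectorModel model;

    VectorizingModel(Function<T, double[]> vectorizer, VectorModel model) {
        this.vectorizer = vectorizer;
        this.model = model;
    }

    int mostLikely(T record) {
        return model.mostLikely(vectorizer.apply(record));
    }
}

// Hypothetical vectorizer: expands a two-field record (a, b) into
// (a, b, a*b), i.e. it adds one interaction variable on the fly.
final class InteractionVectorizer {
    static double[] withInteraction(double[] raw) {
        return new double[] {raw[0], raw[1], raw[0] * raw[1]};
    }
}
```

A caller would build something like new 
VectorizingModel<>(InteractionVectorizer::withInteraction, model) and hand it 
raw records.  This only covers the sequential case; how to pass such a 
function cleanly into a Hadoop job is exactly the open question in (b).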
 

> Streamline classification/ clustering data structures
> -----------------------------------------------------
>
>                 Key: MAHOUT-479
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-479
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.1, 0.2, 0.3, 0.4
>            Reporter: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our 
> classification and clustering algorithms to make integration for users easier 
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was 
> that our classification (and clustering) stuff was all over the map in terms 
> of data structures.  Driving that to rest and getting those components even 
> vaguely as plug and play as our much more advanced recommendation components 
> would be very, very helpful.
> {quote}
> This issue probably also relates to MAHOUT-287 (the intention there is to 
> make naive Bayes run on vectors as input).
> Ted, Jake, Robin: It would be great if one of you could add a comment on 
> some of the issues you discussed "the other evening" and (if applicable) any 
> minor or major changes you think could help solve this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
