[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853182#action_12853182
 ] 

Robin Anil commented on MAHOUT-332:
-----------------------------------

Converting arbitrary data in a database to vectors would follow the same 
lines as the conversion of the ARFF format to vectors. You can find that code 
under trunk/utils. It treats boolean, enum, numeric and string datatypes 
separately. That code may still need some tweaking before the entire ARFF 
spec is supported, but it's a good starting point for understanding how data 
is converted to vectors. Also look at SparseVectorsFromSequenceFiles to 
understand how text documents in a SequenceFile (you need to understand this 
format as well) are converted to vectors using tf-idf based weighting. In 
short, there could be many weighting strategies. It would be really nice if 
you could make this pluggable, so that users of the library can write custom 
weighting techniques for each field. 
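
To make the "pluggable weighting per field" idea concrete, here is a minimal 
sketch in plain Java. It does not use any Mahout classes: FieldWeighter, 
RowVectorizer and Weighters are hypothetical names invented for this 
illustration, and the output is a plain double[] rather than a Mahout vector. 
The tfIdf helper uses the textbook form tf * ln(N / df), which is an 
assumption and not necessarily the formula the Mahout code applies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical plug-in point: turns one raw field value into a numeric weight. */
interface FieldWeighter {
    double weigh(Object rawValue);
}

/** Hypothetical converter: maps one database row to a dense array, one slot per field. */
final class RowVectorizer {
    // Insertion order of the registered fields defines their index in the vector.
    private final Map<String, FieldWeighter> weighters = new LinkedHashMap<>();

    RowVectorizer register(String field, FieldWeighter weighter) {
        weighters.put(field, weighter);
        return this;
    }

    double[] vectorize(Map<String, Object> row) {
        double[] out = new double[weighters.size()];
        int i = 0;
        for (Map.Entry<String, FieldWeighter> e : weighters.entrySet()) {
            Object raw = row.get(e.getKey());
            // Missing or NULL columns are encoded as 0.0 in this sketch.
            out[i++] = (raw == null) ? 0.0 : e.getValue().weigh(raw);
        }
        return out;
    }
}

/** A few example strategies, one per datatype mentioned above. */
final class Weighters {
    private Weighters() {}

    static final FieldWeighter NUMERIC = v -> ((Number) v).doubleValue();
    static final FieldWeighter BOOLEAN = v -> ((Boolean) v) ? 1.0 : 0.0;

    /** Encodes an enum-like string as its position in the list of allowed values. */
    static FieldWeighter enumOf(String... allowed) {
        return v -> {
            for (int i = 0; i < allowed.length; i++) {
                if (allowed[i].equals(v)) {
                    return i;
                }
            }
            throw new IllegalArgumentException("Unknown enum value: " + v);
        };
    }

    /** Textbook tf-idf: term frequency times ln(numDocs / docFreq). */
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }
}
```

A user would register one weighter per column, for example 
new RowVectorizer().register("age", Weighters.NUMERIC), and could supply a 
custom lambda for any field that needs a different strategy. 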

> Create adapters for  MYSQL and NOSQL(hbase, cassandra) to access data for all 
> the algorithms to use
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-332
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-332
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Robin Anil
>
> A student with a good proposal 
> - should be free to work for Mahout in the summer and should be thrilled to 
> work in this area :)
> - should be able to program in Java and be comfortable with data structures 
> and algorithms
> - must explore SQL and NOSQL implementations, and design a framework with 
> which data from them could be fetched and converted to Mahout format or used 
> directly as a matrix transparently
> - should have a plan to make it high performance with ample caching 
> strategies or the ability to use it on a map/reduce job
> - should focus more on getting a working version than on implementing every 
> feature, so it's recommended that you divide features into milestones
> - must have clear deadlines and pace the work evenly across the span of 3 
> months.
> If you can do something extra it counts, but make sure the plan is reasonable 
> within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
