[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853174#action_12853174
 ] 

Robin Anil commented on MAHOUT-332:
-----------------------------------

Hi Necati, take a look at the matrix and vector classes in Mahout, and read up 
on how Mahout converts text into vectors. We need a generic framework in which 
data from databases can be iterated over as vectors, so that algorithms can use 
it seamlessly. The current VectorWritable could be extended into, say, a 
database-backed vector that reads each field and converts it to a vector on the 
fly using a pre-populated dictionary. The Mahout algorithms could then consume 
it easily. The database-backed vector should be configurable enough that the 
fields to use can be selected. I am sure there are frameworks that already do 
this. Drew Farris is working on a document structure for Mahout using Avro. I 
am sure he will have more input on how these adapters should fit with his 
structure. 
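
To make the idea concrete, here is a minimal sketch of the field-to-vector 
conversion, assuming a row has already been fetched as a map of column names to 
values. It does not use Mahout's VectorWritable or any database driver; the 
class RowVectorizer, the "field=token" dictionary key scheme, and the plain 
index-to-weight map standing in for a sparse vector are all hypothetical and 
chosen only for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: turns one database row into a sparse vector. */
final class RowVectorizer {
    // Pre-populated dictionary: feature name -> vector index (assumed layout).
    private final Map<String, Integer> dictionary;
    // Configurable selection of the fields that contribute to the vector.
    private final List<String> selectedFields;

    RowVectorizer(Map<String, Integer> dictionary, List<String> selectedFields) {
        this.dictionary = dictionary;
        this.selectedFields = selectedFields;
    }

    /** Returns index -> weight, a stand-in for a sparse Mahout vector. */
    Map<Integer, Double> vectorize(Map<String, Object> row) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String field : selectedFields) {
            Object value = row.get(field);
            if (value == null) {
                continue; // missing or NULL column contributes nothing
            }
            if (value instanceof Number) {
                // Numeric field: one dimension keyed by the field name.
                Integer index = dictionary.get(field);
                if (index != null) {
                    vector.merge(index, ((Number) value).doubleValue(), Double::sum);
                }
            } else {
                // Text field: one dimension per known "field=token" entry,
                // weighted by how often the token occurs.
                String text = value.toString().toLowerCase(Locale.ROOT);
                for (String token : text.split("\\s+")) {
                    Integer index = dictionary.get(field + "=" + token);
                    if (index != null) {
                        vector.merge(index, 1.0, Double::sum);
                    }
                }
            }
        }
        return vector;
    }
}
```

Fields that are not selected, and tokens that are missing from the dictionary, 
are skipped, so the vector dimensions stay stable across rows. A real adapter 
would presumably wrap the result in a Mahout vector type rather than a map.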

> Create adapters for MySQL and NoSQL (HBase, Cassandra) to access data for all 
> the algorithms to use
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-332
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-332
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Robin Anil
>
> A student with a good proposal 
> - should be free to work for Mahout in the summer and should be thrilled to 
> work in this area :)
> - should be able to program in Java and be comfortable with data structures 
> and algorithms
> - must explore SQL and NoSQL implementations, and design a framework with 
> which data from them could be fetched and converted to Mahout format or used 
> directly as a matrix transparently
> - should have a plan to make it high performance with ample caching 
> strategies or the ability to use it on a map/reduce job
> - should focus more on getting a working version than on implementing every 
> feature. So it's recommended that you divide features into milestones
> - must have clear deadlines and pace it evenly across the span of 3 months.
> If you can do something extra it counts, but make sure the plan is reasonable 
> within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
