[ https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853174#action_12853174 ]
Robin Anil commented on MAHOUT-332:
-----------------------------------

Hi Necati,

Take a look at the matrix and vector classes in Mahout, and read up on how Mahout converts text into vectors. We need a generic framework in which data from databases can be iterated over as vectors, so that the algorithms can use it seamlessly. The current VectorWritable could be extended into, say, a database-backed vector that reads each field and converts it to a vector on the fly using a pre-populated dictionary. The Mahout algorithms could then consume it easily. The database-backed vector should be configurable enough that individual fields can be selected.

I am sure there are frameworks that already do this. Drew Farris is working on a document structure for Mahout using Avro. I am sure he will have more input on how these adapters should fit with his structure.

> Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-332
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-332
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Robin Anil
>
> A student with a good proposal
> - should be free to work for Mahout in the summer and should be thrilled to work in this area :)
> - should be able to program in Java and be comfortable with data structures and algorithms
> - must explore SQL and NOSQL implementations, and design a framework with which data from them could be fetched and converted to Mahout format or used directly as a matrix transparently
> - should have a plan to make it high performance with ample caching strategies or the ability to use it in a map/reduce job
> - should focus more on getting a working version than on implementing all functionality, so it is recommended that you divide features into milestones
> - must have clear deadlines and pace it evenly across the span of 3 months.
> If you can do something extra it counts, but make sure the plan is reasonable within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
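
The database-backed vector idea in the comment above (read each selected field, look it up in a pre-populated dictionary, emit a vector on the fly) could be sketched roughly as follows. This is only an illustration in plain Java, not Mahout code: the class name DictionaryRowVectorizer, the "field=token" dictionary key format, and the use of a Map as the row and as the sparse vector are all assumptions made for the example. A real adapter would presumably wrap a JDBC result row and produce a Mahout vector instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical sketch, not part of Mahout: converts one database row into a
 * sparse vector using a dictionary that was populated ahead of time.
 */
class DictionaryRowVectorizer {
    // Maps a "field=token" key to its vector index (assumed key format).
    private final Map<String, Integer> dictionary;
    // Columns to include in the vector; all other columns are ignored.
    private final List<String> selectedFields;

    DictionaryRowVectorizer(Map<String, Integer> dictionary, List<String> selectedFields) {
        this.dictionary = dictionary;
        this.selectedFields = selectedFields;
    }

    /** Converts a row (column name to value) into sparse index-to-count pairs. */
    Map<Integer, Double> vectorize(Map<String, String> row) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String field : selectedFields) {
            String value = row.get(field);
            if (value == null) {
                continue; // column absent or NULL in this row
            }
            // Naive whitespace tokenization; a real adapter would plug in an analyzer.
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                Integer index = dictionary.get(field + "=" + token);
                if (index != null) {
                    // Count occurrences; tokens missing from the dictionary are dropped.
                    vector.merge(index, 1.0, Double::sum);
                }
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        dictionary.put("title=mahout", 0);
        dictionary.put("title=vectors", 1);
        dictionary.put("body=mahout", 2);

        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "42");                       // not selected, so ignored
        row.put("title", "Mahout vectors mahout");
        row.put("body", "unknown words");

        DictionaryRowVectorizer vectorizer =
                new DictionaryRowVectorizer(dictionary, List.of("title", "body"));
        System.out.println(vectorizer.vectorize(row));
    }
}
```

Keeping the field selection and the dictionary as constructor arguments is one way to get the configurability the comment asks for: the same class can serve different tables by changing only what is passed in.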