[ https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853182#action_12853182 ]
Robin Anil commented on MAHOUT-332:
-----------------------------------

Converting arbitrary data in a database to vectors would follow the same lines as converting the ARFF format to vectors. You can find that code under trunk/utils. It treats boolean, enum, numeric and string datatypes separately. The code may still need some tweaking before the entire ARFF spec is supported, but it's a good starting point for understanding how data is converted to vectors.

Also look at SparseVectorsFromSequenceFiles to understand how text documents in a SequenceFile (you need to understand this format as well) are converted to vectors using tf-idf based weighting.

In short, there could be many weighting strategies. It would be really nice if you could make this pluggable, so that users of the library can write custom weighting techniques for each field.

> Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-332
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-332
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Robin Anil
>
> A student with a good proposal
> - should be free to work for Mahout in the summer and should be thrilled to work in this area :)
> - should be able to program in Java and be comfortable with data structures and algorithms
> - must explore SQL and NOSQL implementations, and design a framework with which data from them could be fetched and converted to mahout format or used directly as a matrix transparently
> - should have a plan to make it high performance with ample caching strategies or the ability to use it on a map/reduce job
> - should focus more on getting a working version than on implementing all functionalities. So it's recommended that you divide features into milestones
> - must have clear deadlines and pace it evenly across the span of 3 months.
> If you can do something extra it counts, but make sure the plan is reasonable within the specified time frame.
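
To illustrate the pluggable per-field weighting suggested in the comment, here is a minimal sketch in plain Java. It does not use any Mahout API: `FieldWeigher`, `NumericWeigher`, `BooleanWeigher`, `EnumWeigher`, `RowVectorizer` and the field names are all hypothetical names invented for this example, and a plain `double[]` stands in for a real vector type. It assumes one database row arrives as a map from field name to raw value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical strategy interface: turns one raw field value into a weight. */
interface FieldWeigher {
    double weigh(Object rawValue);
}

/** Numeric fields are used as-is. */
final class NumericWeigher implements FieldWeigher {
    public double weigh(Object rawValue) {
        return ((Number) rawValue).doubleValue();
    }
}

/** Boolean fields map to 1.0 (true) or 0.0 (anything else, including null). */
final class BooleanWeigher implements FieldWeigher {
    public double weigh(Object rawValue) {
        return Boolean.TRUE.equals(rawValue) ? 1.0 : 0.0;
    }
}

/** Enum-like fields map to the index of their label in a fixed label list. */
final class EnumWeigher implements FieldWeigher {
    private final List<String> labels;

    EnumWeigher(List<String> labels) {
        this.labels = labels;
    }

    public double weigh(Object rawValue) {
        int index = labels.indexOf(String.valueOf(rawValue));
        if (index < 0) {
            throw new IllegalArgumentException("Unknown label: " + rawValue);
        }
        return index;
    }
}

/** Applies one registered weigher per field, in registration order. */
final class RowVectorizer {
    private final Map<String, FieldWeigher> weighers = new LinkedHashMap<>();

    RowVectorizer register(String field, FieldWeigher weigher) {
        weighers.put(field, weigher);
        return this;
    }

    double[] vectorize(Map<String, Object> row) {
        double[] vector = new double[weighers.size()];
        int i = 0;
        for (Map.Entry<String, FieldWeigher> entry : weighers.entrySet()) {
            vector[i++] = entry.getValue().weigh(row.get(entry.getKey()));
        }
        return vector;
    }
}

final class FieldWeightingDemo {
    public static void main(String[] args) {
        RowVectorizer vectorizer = new RowVectorizer()
                .register("age", new NumericWeigher())
                .register("active", new BooleanWeigher())
                .register("color", new EnumWeigher(Arrays.asList("red", "green", "blue")))
                // A user-supplied custom weighting, plugged in as a lambda.
                .register("visits", value -> Math.log1p(((Number) value).doubleValue()));

        Map<String, Object> row = new HashMap<>();
        row.put("age", 30);
        row.put("active", true);
        row.put("color", "blue");
        row.put("visits", 9);

        System.out.println(Arrays.toString(vectorizer.vectorize(row)));
    }
}
```

Keeping the weigher behind a single-method interface means a library user can supply a custom weighting for any field without touching the vectorizer. A tf-idf style weigher for string fields would need corpus-wide statistics and is deliberately left out of this sketch.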