[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795114#action_12795114 ]

Jake Mannix commented on MAHOUT-220:
------------------------------------

Robin,

To really be scalable here, I'm down with the M/R approach for the classifiers. The random-access nature of the current Datastore interface definitely seems limiting: even using HBase this way means we're making lots of remote calls, while a traditional Hadoop job would do the nice "put the code where the data lives" instead.

Switching over to SparseVectors and processing the data set sequentially from SequenceFiles of them definitely seems like the way I'd see this going. Is that what your current hadoopified version of this does?

bq. I am currently writing a Map/reduce job to convert text documents to vectors without relying on Lucene.

What is the way you're doing this? Is this a bag-of-words representation? What form of tf are you using, and how are you putting in idf if it's fully distributed?

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup with
> the following exceptions:
> 1. Line length used is 120 instead of 80.
> 2. static final log is kept as is, not LOG.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.