[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795114#action_12795114
 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

Robin,

  To really be scalable here, I'm down with the M/R approach for the 
classifiers.  The random-access nature of the current Datastore interface 
definitely seems limiting: even with HBase, using it this way means making lots 
of remote calls, while a traditional Hadoop job would do the nice "put the 
code where the data lives" thing instead.

Switching over to SparseVectors, and processing the data set sequentially from 
SequenceFiles of those vectors, definitely seems like the way I'd see this 
going.  Is that what your current hadoopified version of this does?
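
As a rough illustration of the sequential style (plain Python rather than 
Mahout or Hadoop APIs; the record layout and the function name are assumptions 
made up for this sketch), a single pass over (label, sparse vector) records can 
accumulate per-label feature sums without any random-access lookups:

```python
from collections import defaultdict


def sum_weights_by_label(records):
    """Accumulate per-label feature sums in one sequential pass.

    Each record is (label, sparse_vector), where sparse_vector is a dict
    {feature_index: weight}. This stands in for a SparseVector read from a
    SequenceFile; the layout is hypothetical.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for label, vector in records:
        # Only the non-zero entries are stored, so only they are visited.
        for index, weight in vector.items():
            totals[label][index] += weight
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {label: dict(features) for label, features in totals.items()}


records = [
    ("spam", {0: 1.0, 3: 2.0}),
    ("ham", {1: 1.0}),
    ("spam", {3: 1.0, 4: 0.5}),
]
label_sums = sum_weights_by_label(records)
```

In a real job the inner loop would presumably live in a mapper or combiner and 
the per-label merge in a reducer, but the access pattern is the same: each 
record is read once, in order.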

bq. I am currently writing a Map/reduce job to convert text documents to vectors 
without relying on Lucene.

How are you doing this?  Is it a bag-of-words representation?  If so, what 
form of tf are you using, and how are you computing idf if the job is fully 
distributed?
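
To make the idf question concrete: idf depends on corpus-wide document 
frequencies, so a distributed job typically needs one pass to count them before 
a second pass can weight each document.  A minimal in-memory sketch of that 
two-pass shape follows (plain Python, not the job under discussion; raw-count 
tf and log(N/df) idf are assumed here as one common choice, and all function 
names are hypothetical):

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Pass 1: for each term, count how many documents contain it."""
    df = Counter()
    for tokens in docs:
        # set() so a term repeated within one document counts once.
        df.update(set(tokens))
    return df


def tfidf_vectors(docs, df):
    """Pass 2: weight each document's raw term counts by log(N / df)."""
    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


docs = [["hadoop", "map", "map"], ["hadoop", "reduce"]]
df = document_frequencies(docs)
vectors = tfidf_vectors(docs, df)
```

Here "hadoop" appears in every document, so its weight is zero, while "map" 
keeps a positive weight.  In a Map/Reduce setting the first function would be 
its own counting job whose output is made available to the second.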

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup, with 
> the following exceptions:
> 1.  Line length used is 120 instead of 80. 
> 2.  The static final log is kept as is, not renamed to LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
