[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795050#action_12795050
 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

Datastore is an interface that lets you pick a named vector or matrix and 
look up a cell. The Bayes classifier code is based entirely on tokens, not 
SparseVectors, so cells are addressed by strings. The names of the matrix, 
the rows, and the columns are up to the implementation. For the CBayes/Bayes 
algorithms, we have HBaseBayesDatastore.java and InMemoryBayesDatastore.java. 

{code}
  double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException;
  double getWeight(String vectorName, String index) throws InvalidDatastoreException;
{code}

For the SGD algorithm, I suggest you define your own matrix names, row indices, 
and column indices that your algorithm and datastore agree upon.
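
A minimal sketch of that suggestion, not code from the patch. It assumes a 
hypothetical in-memory store (InMemorySgdDatastore), made-up names 
("sgd.weights", "sgd.bias"), and a stand-in InvalidDatastoreException so the 
example compiles on its own. Only the two getWeight signatures come from the 
interface quoted above. Integer feature indices are passed as strings.

```java
import java.util.HashMap;
import java.util.Map;

public class DatastoreSketch {

  /** Stand-in for Mahout's exception of the same name, defined so the sketch is self-contained. */
  public static class InvalidDatastoreException extends Exception {
    public InvalidDatastoreException(String message) {
      super(message);
    }
  }

  /** The two lookups quoted in the comment above. */
  public interface Datastore {
    double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException;

    double getWeight(String vectorName, String index) throws InvalidDatastoreException;
  }

  /** Hypothetical in-memory store; the names below are whatever the algorithm and store agree on. */
  public static class InMemorySgdDatastore implements Datastore {
    public static final String WEIGHTS = "sgd.weights"; // assumed matrix name: label x feature
    public static final String BIAS = "sgd.bias";       // assumed vector name: one entry per label

    private final Map<String, Double> matrixCells = new HashMap<>();
    private final Map<String, Double> vectorCells = new HashMap<>();

    public void setWeight(String matrixName, String row, String column, double value) {
      matrixCells.put(key(matrixName, row, column), value);
    }

    public void setWeight(String vectorName, String index, double value) {
      vectorCells.put(key(vectorName, index), value);
    }

    @Override
    public double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException {
      return lookup(matrixCells, key(matrixName, row, column));
    }

    @Override
    public double getWeight(String vectorName, String index) throws InvalidDatastoreException {
      return lookup(vectorCells, key(vectorName, index));
    }

    // Assumption: a missing cell is an error. A real store might return 0.0 instead.
    private static double lookup(Map<String, Double> cells, String key) throws InvalidDatastoreException {
      Double value = cells.get(key);
      if (value == null) {
        throw new InvalidDatastoreException("No value stored for " + key.replace('\u0000', '/'));
      }
      return value;
    }

    // Joins the name parts with a NUL separator, which should not occur in ordinary names.
    private static String key(String... parts) {
      return String.join("\u0000", parts);
    }
  }

  public static void main(String[] args) throws InvalidDatastoreException {
    InMemorySgdDatastore store = new InMemorySgdDatastore();
    // Integer feature index 42 has to be converted to a String column name.
    store.setWeight(InMemorySgdDatastore.WEIGHTS, "spam", Integer.toString(42), 0.5);
    store.setWeight(InMemorySgdDatastore.BIAS, "spam", -1.25);

    System.out.println(store.getWeight(InMemorySgdDatastore.WEIGHTS, "spam", "42"));
    System.out.println(store.getWeight(InMemorySgdDatastore.BIAS, "spam"));
  }
}
```

The algorithm only ever sees the Datastore interface, so the same naming 
agreement would carry over to an HBase-backed store.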

I know this creates a limitation if you want integer-based column and row 
names, since the names are strings. Maybe we can parameterize it, or change 
the Bayes package to use Vectors instead of the current string-token-based 
implementation. 

I am currently writing a MapReduce job to convert text documents to vectors 
without relying on Lucene. Once that is done, I will overhaul the classifier 
package to use SparseVectors. 

Before that, I need to know whether this patch is OK in terms of code style. 
I will then apply it and start on the enhancements.


> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanups, with 
> the following exceptions:
> 1.  Line length used is 120 instead of 80. 
> 2.  The static final logger is kept as log, not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
