[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795117#action_12795117
 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

A caching layer is implemented in HbaseDatastore, and you can set the cache size. 
Take a look at MAHOUT-124 for more details.

I am just porting the feature mapper and tfidf mapper over from the bayes 
classifier common code to make the new text vectorizer. Take a look at them. It's 
a fully distributed way of doing tf.idf in 2 map/reduces. 

For the vector converter, here is the idea in steps:

M/R1: Count the frequencies of words, tokenized using a configurable Lucene 
Analyzer.
SEQ1: Read the frequency list, prune words that occur fewer than minSupport 
times, and create the dictionary file (string => long) and the frequency file 
(string => long).
Then do the map/reduce in chunks, keeping one block of the dictionary file in 
memory at a time:
   repeat - M/R2: Run over the input documents, replacing each string with its 
   integer id, and create (docid => sparsevector). This sparse vector has TF as 
   its weights, but it is incomplete because it only covers the words in the 
   current dictionary block.
Now run a map/reduce over the incomplete sparse vectors, grouping by docid. In 
the reducer, merge the sparse vectors. 
The initial SparseVectors dataset is then ready.
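
To make the steps above concrete, here is a small in-memory sketch in plain 
Python. It is not Mahout or Hadoop code: the whitespace tokenizer stands in for 
the Lucene Analyzer, and every function name (tokenize, build_dictionary, 
partial_vectors and so on) is made up for illustration. Each function roughly 
corresponds to one job in the plan.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Stand-in for a configurable Lucene Analyzer (assumption): lowercase + split.
    return text.lower().split()

def count_words(docs):
    # M/R1: total frequency of every token across the corpus.
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts

def build_dictionary(counts, min_support):
    # SEQ1: prune words below min_support, then assign integer ids.
    kept = sorted(w for w, c in counts.items() if c >= min_support)
    dictionary = {w: i for i, w in enumerate(kept)}
    frequencies = {w: counts[w] for w in kept}
    return dictionary, frequencies

def chunks(dictionary, chunk_size):
    # Split the dictionary into blocks small enough to keep in memory.
    items = sorted(dictionary.items(), key=lambda kv: kv[1])
    for start in range(0, len(items), chunk_size):
        yield dict(items[start:start + chunk_size])

def partial_vectors(docs, block):
    # M/R2, one pass per block: emit (docid, incomplete TF vector).
    for doc_id, text in docs.items():
        vec = Counter(block[t] for t in tokenize(text) if t in block)
        if vec:
            yield doc_id, dict(vec)

def merge_partials(pairs):
    # Last job: group by docid and merge the incomplete vectors.
    merged = defaultdict(dict)
    for doc_id, vec in pairs:
        merged[doc_id].update(vec)  # blocks are disjoint, so ids never collide
    return dict(merged)

def tf_vectors(docs, min_support=2, chunk_size=2):
    dictionary, _ = build_dictionary(count_words(docs), min_support)
    pairs = []
    for block in chunks(dictionary, chunk_size):
        pairs.extend(partial_vectors(docs, block))
    return dictionary, merge_partials(pairs)
```

Calling tf_vectors({"d1": "a b a c", "d2": "b c c d"}) drops "d" (it occurs only 
once), processes the dictionary in two blocks, and returns one merged TF vector 
per document.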

function multiplyIDF(){
M/R3: Calculate DF from the SparseVector dataset
M/R4: Run over the SparseVector TF dataset and multiply each weight by its IDF.
}
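
As a rough illustration of multiplyIDF, the two jobs could look like this in 
plain Python. The dict-of-dicts vector layout and the function names are made up 
for the sketch, and the weighting log(N / df) is only one common IDF variant, 
assumed here because the plan does not say which formula is used.

```python
import math
from collections import Counter

def document_frequencies(tf_vectors):
    # M/R3: for each term id, count the documents whose vector contains it.
    df = Counter()
    for vec in tf_vectors.values():
        df.update(vec.keys())
    return df

def multiply_idf(tf_vectors):
    # M/R4: rescale every TF weight by idf = log(N / df) (assumed variant).
    n_docs = len(tf_vectors)
    df = document_frequencies(tf_vectors)
    return {
        doc_id: {term: tf * math.log(n_docs / df[term]) for term, tf in vec.items()}
        for doc_id, vec in tf_vectors.items()
    }
```

With this variant a term that appears in every document ends up with weight 0, 
which is the usual motivation for smoothing the formula in practice.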


This is the first plan, at least once I finish it. The second is to convert each 
document into a stream of integers using the dictionary file. Subsequent 
functions can then run M/R jobs to calculate LLR and make bigrams. 
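
A rough sketch of that second plan, again in plain Python with made-up function 
names. It assumes pruned words are simply skipped when encoding (which makes 
their neighbours adjacent), and it scores bigrams with the standard 
log-likelihood ratio over a 2x2 contingency table; the real jobs may do either 
differently.

```python
import math
from collections import Counter

def encode(tokens, dictionary):
    # Turn a token list into a stream of integer ids; skip unknown words.
    return [dictionary[t] for t in tokens if t in dictionary]

def bigram_counts(streams):
    # Count adjacent id pairs over all encoded documents.
    counts = Counter()
    for stream in streams:
        counts.update(zip(stream, stream[1:]))
    return counts

def llr(k11, k12, k21, k22):
    # G^2 = 2 * sum(k * ln(k * N / (row_total * col_total))) over non-empty cells.
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    return 2.0 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)

def bigram_llr(counts, bigram):
    # Build the 2x2 table for (a, b): a then b, a then other, other then b, rest.
    a, b = bigram
    total = sum(counts.values())
    k11 = counts[bigram]
    k12 = sum(c for (x, _), c in counts.items() if x == a) - k11
    k21 = sum(c for (_, y), c in counts.items() if y == b) - k11
    k22 = total - k11 - k12 - k21
    return llr(k11, k12, k21, k22)
```

A table with no association, such as llr(1, 1, 1, 1), scores 0, and the score 
grows as the two words co-occur more often than chance would predict.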

For this, the sparse vector merge MapReduce function should be generic enough. 
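
One way to read "generic enough" is that the merge should not hard-code how 
colliding entries are combined. A minimal sketch in plain Python, where the 
function name and the combine parameter are assumptions made for illustration:

```python
from collections import defaultdict

def merge_sparse_vectors(pairs, combine=lambda a, b: a + b):
    # Reducer-style merge: group (key, sparse vector) pairs by key and fold
    # the vectors together, using `combine` when two vectors share an index.
    merged = defaultdict(dict)
    for key, vec in pairs:
        target = merged[key]
        for index, value in vec.items():
            if index in target:
                target[index] = combine(target[index], value)
            else:
                target[index] = value
    return dict(merged)
```

The default adds weights, which suits TF counts; passing combine=max or another 
function lets the same merge serve later jobs, for example bigram vectors.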

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions:
> 1.  The line length used is 120 instead of 80. 
> 2.  The static final log is kept as is, not renamed to LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
