[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795117#action_12795117 ]
Robin Anil commented on MAHOUT-220:
-----------------------------------

A caching layer is implemented in HbaseDatastore, and you can set the cache size. Take a look at MAHOUT-124 for more details.

I am just porting the feature mapper and tfidf mapper from the Bayes classifier's common code over to make the new text vectorizer. Take a look at them. It's a fully distributed way of doing tf.idf in 2 map/reduces.

For the vector converter, here is the idea in steps:

M/R1: Count frequencies of words tokenized using a configurable Lucene Analyzer.
SEQ1: Read the frequency list, prune words with frequency less than minSupport, and create the dictionary file (string => long) and the frequency file (string => long).

Do the map/reduce in chunks by keeping a block of the dictionary file in memory.
repeat-
M/R2: Run over the input documents, replacing each string with its integer id, and create (docid => sparsevector). This sparsevector has TF as its weights, but it is incomplete, since each pass only sees the dictionary block held in memory.

Now run a map/reduce over the incomplete sparse vectors. Group by docid. In the reducer, merge the sparse vectors.

The initial SparseVectors dataset is ready.

function multiplyIDF(){
  M/R3: Calculate DF from the SparseVector dataset.
  M/R4: Run over the SparseVector TF dataset and get IDF.
}

This is the first plan, at least when I finish.

The second is to convert each document into a stream of integers using the dictionary file. Subsequent functions can then run M/R jobs to calculate LLR and make bigrams. For this, the sparsevector merge MapReduce function should be generic enough.

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup with
> the following exceptions:
> 1. Line length used is 120 instead of 80.
> 2. static final log is kept as is, not LOG.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
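
For readers following the vectorizer plan in the comment above (M/R1 through M/R4), here is a minimal single-machine sketch in Python. It is not the Mahout code and not a MapReduce job. All function names are hypothetical, the whitespace tokenizer only stands in for a configurable Lucene Analyzer, and the IDF formula log(N / df) is an assumption, since the comment does not say which weighting is used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Stand-in for a configurable Lucene Analyzer (assumption): lowercase split.
    return text.lower().split()


def build_dictionary(docs, min_support):
    # M/R1 + SEQ1: count word frequencies, prune rare words, assign integer ids.
    freq = Counter(tok for text in docs.values() for tok in tokenize(text))
    kept = sorted(w for w, c in freq.items() if c >= min_support)
    dictionary = {w: i for i, w in enumerate(kept)}
    return dictionary, {w: freq[w] for w in kept}


def partial_tf_vectors(docs, dictionary_block):
    # M/R2: one pass per dictionary block. Each pass knows only some words,
    # so the (docid => sparsevector) pairs it emits are incomplete.
    for doc_id, text in docs.items():
        vec = Counter(dictionary_block[t] for t in tokenize(text)
                      if t in dictionary_block)
        if vec:
            yield doc_id, dict(vec)


def merge_vectors(partials):
    # Reducer step: group by doc id and merge the partial sparse vectors.
    merged = defaultdict(dict)
    for doc_id, vec in partials:
        for term_id, weight in vec.items():
            merged[doc_id][term_id] = merged[doc_id].get(term_id, 0) + weight
    return dict(merged)


def multiply_idf(tf_vectors, num_docs):
    # M/R3: document frequency per term id.
    df = Counter(term_id for vec in tf_vectors.values() for term_id in vec)
    # M/R4: scale each TF weight by IDF; log(N / df) is an assumed formula.
    return {
        doc_id: {t: tf * math.log(num_docs / df[t]) for t, tf in vec.items()}
        for doc_id, vec in tf_vectors.items()
    }


def vectorize(docs, min_support=1, block_size=2):
    dictionary, _ = build_dictionary(docs, min_support)
    words = list(dictionary)
    partials = []
    # Keep only block_size dictionary entries "in memory" per pass.
    for start in range(0, len(words), block_size):
        block = {w: dictionary[w] for w in words[start:start + block_size]}
        partials.extend(partial_tf_vectors(docs, block))
    tf_vectors = merge_vectors(partials)
    return dictionary, tf_vectors, multiply_idf(tf_vectors, len(docs))
```

Calling vectorize({"d1": "apple banana apple", "d2": "banana cherry"}, min_support=1) returns the dictionary, the merged TF vectors, and the IDF-weighted vectors. Keeping merge_vectors independent of how the partial vectors were produced mirrors the point in the comment that the merge step should be generic enough to reuse.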