[ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606947#action_12606947 ]
Karl Wettin commented on MAHOUT-61:
-----------------------------------

Tokenization is currently geared more toward classification problems than clustering problems. I wanted to add shingles but could not find the class in the Lucene distributions, not even in a snapshot.

So far this code just creates a matrix. It needs to be written to file so it can be read by the algorithms that want to use it.

I have not really tested this; it is an early beta just to get some feedback.

> Text problem matrix builder
> ----------------------------
>
>                 Key: MAHOUT-61
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-61
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-61.txt
>
>
> A set of classes that build matrices from text.
> Currently the API consists of TokenMatrixBuilder and TokenInstanceBuilder.
> Should be thread safe.
> PostReader imports 20news-bydate. This takes several GB of heap. It would be
> nice to bounce the data via JDBM or perhaps using the PersistentHashMap in
> MAHOUT-19.
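
As a rough illustration of the kind of builder the issue describes (text in, token-count matrix out, safe to call from several threads), here is a minimal sketch. The class name SimpleTokenMatrixBuilder, its two methods, and the naive regex tokenization are assumptions made for this example; they are not the TokenMatrixBuilder or TokenInstanceBuilder API from the attachment, and a real version would presumably tokenize with a Lucene analyzer. Thread safety here is nothing more than coarse synchronized methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not the TokenMatrixBuilder from the MAHOUT-61 attachment. */
final class SimpleTokenMatrixBuilder {

  // token -> column index, assigned in order of first appearance
  private final Map<String, Integer> columns = new HashMap<String, Integer>();

  // one sparse row (column index -> token count) per added text
  private final List<Map<Integer, Integer>> rows =
      new ArrayList<Map<Integer, Integer>>();

  /** Tokenizes the text naively, appends one row, and returns the row index. */
  public synchronized int addInstance(String text) {
    Map<Integer, Integer> row = new HashMap<Integer, Integer>();
    // Assumption: lower-case and split on anything that is not a letter or digit.
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
      if (token.isEmpty()) {
        continue; // split() can yield an empty leading token
      }
      Integer column = columns.get(token);
      if (column == null) {
        column = columns.size();
        columns.put(token, column);
      }
      Integer count = row.get(column);
      row.put(column, count == null ? 1 : count + 1);
    }
    rows.add(row);
    return rows.size() - 1;
  }

  /** Returns a dense (rows x columns) snapshot of the token counts. */
  public synchronized double[][] toMatrix() {
    double[][] matrix = new double[rows.size()][columns.size()];
    for (int i = 0; i < rows.size(); i++) {
      for (Map.Entry<Integer, Integer> cell : rows.get(i).entrySet()) {
        matrix[i][cell.getKey()] = cell.getValue();
      }
    }
    return matrix;
  }
}
```

Adding "the cat sat on the mat" and then "the dog sat" and calling toMatrix() gives a 2 x 6 array in which the column for "the" holds 2 in the first row and 1 in the second. Rows are kept sparse until the snapshot is taken, but everything still lives on the heap, so this does not address the memory concern raised for 20news-bydate.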