[ https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982427#comment-13982427 ]
Drew Farris commented on MAHOUT-1252:
-------------------------------------

I'm able to build an FST dictionary that can be used for subsequent vector calculation, with the TF based on the document being vectorized and the IDF of the terms in the corpus at the time of dictionary generation. The vectors look great and have been used successfully as input to the Bayes classification job. See:

https://github.com/tamingtext/book/tree/drew_upgrade_wip/src/main/java/com/tamingtext/classifier/bayes
https://github.com/tamingtext/book/tree/drew_upgrade_wip/src/test/java/com/tamingtext/classifier/bayes

Currently this all operates on the text dictionary format emitted by seq2sparse. I haven't spent any time working on integrating this into the existing seq2sparse workflow.

I wonder what the future holds for the existing vectorization workflow with regard to the move to Spark. Any opinions as to whether I should spend time looking at the existing MR jobs, or whether I should look into re-implementing the vectorization workflow at this point? I'm tending towards the latter, but for me that's all very experimental at this point.

> Add support for Finite State Transducers (FST) as a DictionaryType.
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-1252
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1252
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType; this
> should result in an order-of-magnitude speedup of seq2sparse.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
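The weighting scheme described above (TF taken from the document being vectorized, IDF frozen at dictionary-generation time) can be sketched as follows. This is a hypothetical, minimal illustration, not Drew's actual code: it uses a plain in-memory map where the real work uses an FST-backed dictionary, and the class and method names are invented for this example.

```java
import java.util.*;

// Hypothetical sketch of TF-IDF vectorization against a frozen dictionary.
// The dictionary (term -> index, term -> IDF) is fixed when it is built from
// the corpus; only the TF counts come from the document being vectorized.
public class TfIdfSketch {
    final Map<String, Integer> index = new HashMap<>(); // term -> vector slot
    final Map<String, Double> idf = new HashMap<>();    // term -> frozen IDF

    TfIdfSketch(List<List<String>> corpus) {
        // Document frequency per term (TreeMap for a stable index order).
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : corpus)
            for (String t : new HashSet<>(doc))
                df.merge(t, 1, Integer::sum);
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            index.put(e.getKey(), index.size());
            idf.put(e.getKey(), Math.log((double) corpus.size() / e.getValue()));
        }
    }

    // Each occurrence of a term contributes its frozen IDF weight, so the
    // result is tf(term, doc) * idf(term) per vector slot.
    double[] vectorize(List<String> doc) {
        double[] vec = new double[index.size()];
        for (String t : doc) {
            Integer i = index.get(t); // terms absent from the dictionary are dropped
            if (i != null) vec[i] += idf.get(t);
        }
        return vec;
    }

    public static void main(String[] args) {
        TfIdfSketch dict = new TfIdfSketch(Arrays.asList(
            Arrays.asList("mahout", "vector", "fst"),
            Arrays.asList("mahout", "bayes"),
            Arrays.asList("vector", "bayes", "fst")));
        System.out.println(Arrays.toString(
            dict.vectorize(Arrays.asList("mahout", "mahout", "fst"))));
    }
}
```

The point of freezing the IDF table at dictionary-build time is that vectorizing a new document needs no corpus pass at all, only lookups into the dictionary, which is exactly the access pattern an FST serves compactly.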