[ https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919336#comment-13919336 ]
Drew Farris commented on MAHOUT-1252:
-------------------------------------

Hi Suneel,

I have the start of some code that will transform text vector dictionaries into FSTs that could be used for something like this. I wanted to check in to see if you've had a start on this and discuss some of the options here.

A couple of questions:

1) Would this be a third dictionary output format in addition to text and sequence files? I'm assuming we would attempt to generate the FSTs as part of the vectorization process instead of generating them in a separate transformation step once vectorization is complete.

2) Would one goal of this be the ability to generate vectors for new documents based on an existing term dictionary once the initial vectorization run for a corpus is complete? For example, I can think of a case where I've built a Bayes model using some training data, and then I want to classify new documents as they arrive using that model. That's not so easy to do today unless you use something like random projection, because I don't believe there are any tools to generate text vectors on a somewhat ad-hoc basis.

3) What would need to be stored in the FST as output? I'm thinking term index and df might be sufficient.

> Add support for Finite State Transducers (FST) as a DictionaryType.
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-1252
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1252
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType, this
> should result in an order of magnitude speedup of seq2sparse.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
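
To make questions 2 and 3 in the comment concrete, here is a minimal sketch of vectorizing a new document against an existing term dictionary whose per-term output is a term index plus df. Nothing here is Mahout code: `DictionaryVectorizer`, `TermEntry`, and `vectorize` are hypothetical names, a `TreeMap` stands in for the FST (both map a term to an output value), and the weight `tf * ln(numDocs / df)` is an assumed TF-IDF variant, not necessarily the one seq2sparse uses.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch, not Mahout code. A TreeMap stands in for the FST. */
final class DictionaryVectorizer {

    /** Per-term output: the "term index and df" pair suggested in question 3. */
    static final class TermEntry {
        final int index;     // position of the term in the vector
        final long docFreq;  // number of corpus documents containing the term (assumed >= 1)

        TermEntry(int index, long docFreq) {
            this.index = index;
            this.docFreq = docFreq;
        }
    }

    private final SortedMap<String, TermEntry> dictionary;
    private final long numDocs;

    DictionaryVectorizer(SortedMap<String, TermEntry> dictionary, long numDocs) {
        this.dictionary = dictionary;
        this.numDocs = numDocs;
    }

    /**
     * Builds a sparse vector (term index -> weight) for an already tokenized document.
     * Tokens missing from the dictionary are skipped, since they have no index.
     */
    SortedMap<Integer, Double> vectorize(String[] tokens) {
        // Count term frequencies for tokens the dictionary knows about.
        Map<String, Integer> termCounts = new TreeMap<>();
        for (String token : tokens) {
            if (dictionary.containsKey(token)) {
                termCounts.merge(token, 1, Integer::sum);
            }
        }

        // Weight each term with the assumed formula tf * ln(numDocs / df).
        SortedMap<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            TermEntry entry = dictionary.get(e.getKey());
            double idf = Math.log((double) numDocs / entry.docFreq);
            vector.put(entry.index, e.getValue() * idf);
        }
        return vector;
    }
}
```

Storing only the index would be enough for plain term-frequency vectors; the df value is what lets a new document be weighted without another pass over the original corpus, provided the corpus document count is also kept somewhere.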