[ https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919336#comment-13919336 ]

Drew Farris commented on MAHOUT-1252:
-------------------------------------

Hi Suneel,

I have the start of some code that will transform text vector dictionaries into 
FSTs that could be used for something like this. I wanted to check whether 
you've made a start on this and to discuss some of the options here. 

A couple questions: 

1) Would this be a third dictionary output format in addition to text and 
sequence files? I'm assuming we would attempt to generate the FSTs as a part of 
the process of vectorization instead of generating them as a separate 
transformation step once vectorization is complete.

2) Would one goal of this be the ability to generate vectors for new documents 
based on an existing term dictionary once the initial vectorization run for a 
corpus is complete? For example, suppose I've built a Bayes model using some 
training data and then want to classify new documents as they arrive using 
that model. That's not so easy to do today unless you use something like 
random projection, because I don't believe there are any tools to generate 
text vectors on a somewhat ad-hoc basis.
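
To make that use case concrete, here is a rough sketch in plain Java of the 
kind of lookup I have in mind. A HashMap stands in for the FST, the tokenizer 
is a naive regex split, and the class and method names are made up for 
illustration; none of this is an existing Mahout API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not an existing Mahout class.
final class AdHocVectorizer {
    // term -> index, as produced by the original vectorization run.
    // A plain Map stands in for the FST here.
    private final Map<String, Integer> dictionary;

    AdHocVectorizer(Map<String, Integer> dictionary) {
        this.dictionary = dictionary;
    }

    // Builds a sparse term-frequency vector (term index -> count) for a new
    // document. Terms that are not in the dictionary are skipped.
    Map<Integer, Integer> vectorize(String text) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            Integer index = dictionary.get(token);
            if (index != null) {
                vector.merge(index, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("mahout", 0);
        dict.put("vector", 1);
        dict.put("bayes", 2);
        AdHocVectorizer vectorizer = new AdHocVectorizer(dict);
        System.out.println(
            vectorizer.vectorize("Mahout builds a vector, then another vector"));
    }
}
```

The point is only that the dictionary is fixed after the first run, so a new 
document can be mapped into the same vector space without touching the corpus.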

3) What would need to be stored in the FST as output? I'm thinking the term 
index and document frequency (df) might be sufficient.
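
If it really is just those two values, one option would be to pack them into a 
single non-negative long per term, which as far as I know is the kind of 
output Lucene's FST handles via PositiveIntOutputs. The sketch below is plain 
Java bit packing only; the class name and layout (index in the high 32 bits, 
df in the low 32 bits) are assumptions for discussion, not existing code.

```java
// Hypothetical sketch, not an existing Mahout or Lucene class.
final class TermOutput {
    private TermOutput() {}

    // Packs a term index and a document frequency into one long.
    // Both must be non-negative so the packed value is non-negative too.
    static long pack(int termIndex, int docFreq) {
        if (termIndex < 0 || docFreq < 0) {
            throw new IllegalArgumentException("termIndex and docFreq must be >= 0");
        }
        return ((long) termIndex << 32) | (docFreq & 0xFFFFFFFFL);
    }

    // Recovers the term index from the high 32 bits.
    static int termIndex(long packed) {
        return (int) (packed >>> 32);
    }

    // Recovers the document frequency from the low 32 bits.
    static int docFreq(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long packed = TermOutput.pack(42, 7);
        System.out.println("index=" + TermOutput.termIndex(packed)
            + " df=" + TermOutput.docFreq(packed));
    }
}
```

Anything beyond these two values would not fit this layout and would need a 
different output type.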

> Add support for Finite State Transducers (FST) as a DictionaryType.
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-1252
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1252
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType. This 
> should result in an order-of-magnitude speedup of seq2sparse.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
