[
https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919336#comment-13919336
]
Drew Farris commented on MAHOUT-1252:
-------------------------------------
Hi Suneel,
I have the start of some code that will transform text vector dictionaries into
FSTs that could be used for something like this. I wanted to check in to see
whether you've made a start on this and to discuss some of the options here.
A couple of questions:
1) Would this be a third dictionary output format in addition to text and
sequence files? I'm assuming we would attempt to generate the FSTs as a part of
the process of vectorization instead of generating them as a separate
transformation step once vectorization is complete.
2) Would one goal of this be the ability to generate vectors for new documents
based on an existing term dictionary once the initial vectorization run for a
corpus is complete? For example, suppose I've built a Bayes model using some
training data, and then I want to be able to classify new documents as they
arrive using that model. That's not so easy to do today unless you use
something like random projection, because I don't believe there are any tools
to generate text vectors on a somewhat ad-hoc basis.
3) What would need to be stored in the FST as output? I'm thinking term index
and df might be sufficient.
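
To make questions 2 and 3 a bit more concrete, here is a minimal sketch of the
idea, not a proposed implementation. It assumes the per-term output is a single
long that packs the term index (upper 32 bits) and df (lower 32 bits), and it
uses a plain Map as a stand-in for the FST lookup. The class name
DictionarySketch, its method names, and the plain tf * ln(N / df) weighting are
all hypothetical and chosen for illustration; they are not existing Mahout
APIs and not necessarily the weighting seq2sparse applies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a Map<String, Long> stands in for the FST, with each
// term mapped to one long that carries both the term index and the df.
final class DictionarySketch {

    // Pack the term index into the upper 32 bits and df into the lower 32 bits.
    static long pack(int termIndex, int df) {
        return ((long) termIndex << 32) | (df & 0xFFFFFFFFL);
    }

    // Recover the term index from a packed value.
    static int termIndex(long packed) {
        return (int) (packed >>> 32);
    }

    // Recover the document frequency from a packed value.
    static int df(long packed) {
        return (int) packed;
    }

    // Build a sparse vector (term index -> weight) for a new document using
    // only an existing dictionary. Tokens missing from the dictionary are
    // skipped. Assumes every df in the dictionary is greater than zero.
    static Map<Integer, Double> vectorize(String[] tokens,
                                          Map<String, Long> dictionary,
                                          int numDocs) {
        // Count term frequencies for in-vocabulary tokens only.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            if (dictionary.containsKey(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Weight each term by tf * ln(numDocs / df), keyed by term index.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            long packed = dictionary.get(entry.getKey());
            double idf = Math.log((double) numDocs / df(packed));
            vector.put(termIndex(packed), entry.getValue() * idf);
        }
        return vector;
    }
}
```

With something along these lines, classifying a newly arrived document would
only need the dictionary lookup structure and the corpus document count, not a
second vectorization pass over the training corpus.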
> Add support for Finite State Transducers (FST) as a DictionaryType.
> -------------------------------------------------------------------
>
> Key: MAHOUT-1252
> URL: https://issues.apache.org/jira/browse/MAHOUT-1252
> Project: Mahout
> Issue Type: Improvement
> Components: Integration
> Affects Versions: 0.7
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType, this
> should result in an order of magnitude speedup of seq2sparse.