[ 
https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921924#comment-13921924
 ] 

Drew Farris commented on MAHOUT-1252:
-------------------------------------

Hi Suneel, here's some code to get started:
https://github.com/tamingtext/book/commit/bccb23fd17c376f48955d98ce0f6cf9a94c7617f

This is pretty rudimentary in its current form, but it's a step.

In this particular example I use {{UpToTwoPositiveIntOutputs}}, which means we 
end up with an {{FST<Object>}}, because each output can be either a single 
value or a pair. Despite the name, {{UpToTwoPositiveIntOutputs}} takes 
{{long}}s. The term index from the dictionary is the first long of the pair, 
and the document frequency is the second. 

For reference: 
http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/util/fst/package-summary.html

As Ted pointed out, this approach won't work in many cases. First and foremost, 
as new terms are encountered, we are unable to represent them in the vector 
space until we've somehow added them to the dictionary. I would not want to 
convert a corpus piecemeal using this approach. For example, I wouldn't use it 
to add another 1000 document vectors to an existing collection. For something 
like that I'd want to take the entire collection of documents into account to 
obtain the appropriate thresholds, cutoffs, word co-occurrences, etc.

However, this approach should be fine for generating a vector for a document to 
be classified against an existing model, such as when performing ad hoc Bayes 
classification. 
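
To make the classification-time case concrete, here is a minimal sketch. It is 
not the code from the commit above: a plain {{Map}} stands in for the FST, the 
names {{DictionaryVectorizer}}, {{TermInfo}} and {{vectorize}} are made up for 
illustration, and the tf * ln(N / df) weighting is an assumption rather than 
Mahout's exact formula. It only shows how a (term index, document frequency) 
pair per term is enough to vectorize one document, and how terms missing from 
the dictionary are dropped.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: vectorizes one document against a fixed dictionary. */
final class DictionaryVectorizer {

    /** The two longs kept per term: dictionary index and document frequency. */
    public static final class TermInfo {
        public final long index;
        public final long docFreq;

        public TermInfo(long index, long docFreq) {
            this.index = index;
            this.docFreq = docFreq;
        }
    }

    /**
     * Returns a sparse vector (term index -> weight) for the given tokens.
     * Tokens missing from the dictionary are skipped, since they have no
     * dimension in the existing vector space.
     */
    public static Map<Long, Double> vectorize(List<String> tokens,
                                              Map<String, TermInfo> dictionary,
                                              long numDocs) {
        // Count occurrences of known terms only.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String token : tokens) {
            if (dictionary.containsKey(token)) {
                termFreq.merge(token, 1, Integer::sum);
            }
        }
        // Assumed weighting: tf * ln(N / df). Not Mahout's exact formula.
        Map<Long, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            TermInfo info = dictionary.get(e.getKey());
            double idf = Math.log((double) numDocs / info.docFreq);
            vector.put(info.index, e.getValue() * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Toy dictionary; a real one would be loaded from the existing model.
        Map<String, TermInfo> dictionary = new TreeMap<>();
        dictionary.put("apache", new TermInfo(0, 10));
        dictionary.put("mahout", new TermInfo(1, 5));

        // "lucene" is not in the dictionary, so it does not appear in the vector.
        List<String> tokens = Arrays.asList("mahout", "mahout", "lucene");
        System.out.println(vectorize(tokens, dictionary, 100));
    }
}
```

Swapping the {{Map}} for an FST lookup would change only the two dictionary 
calls in {{vectorize}}; the handling of unknown terms stays the same.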

For incremental dictionary management, a Lucene index-based approach could be a 
good option that would manage memory effectively, but I think that's probably 
more suitable for a different JIRA issue.


> Add support for Finite State Transducers (FST) as a DictionaryType.
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-1252
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1252
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType. This 
> should result in an order of magnitude speedup of seq2sparse.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
