[ 
https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982427#comment-13982427
 ] 

Drew Farris commented on MAHOUT-1252:
-------------------------------------

I'm able to build an FST dictionary that can be used for subsequent vector 
calculation, with the TF taken from the document being vectorized and the IDF 
taken from the corpus terms at the time of dictionary generation. The vectors 
look great and have been used successfully as input to the Bayes 
classification job:

See:
https://github.com/tamingtext/book/tree/drew_upgrade_wip/src/main/java/com/tamingtext/classifier/bayes
https://github.com/tamingtext/book/tree/drew_upgrade_wip/src/test/java/com/tamingtext/classifier/bayes
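
For anyone curious about the approach, here's a minimal sketch (not the code 
linked above; the class and helper names are made up) of building a 
term-to-ordinal FST with the Lucene 4.x-era org.apache.lucene.util.fst API, 
keeping the per-term IDF weights in a parallel array indexed by the FST 
output. Exact signatures differ a bit across Lucene versions:

{code:java}
// Sketch only: term -> ordinal FST, with IDF weights in a parallel array
// indexed by the ordinal, so TF can come from the document at vectorization
// time while IDF is fixed at dictionary-generation time.
import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstDictionarySketch {

  /**
   * Builds an FST mapping each term to its ordinal. Terms must be added in
   * sorted order; natural String ordering via SortedMap is used here for
   * simplicity. idfByOrdinal must be sized to termToIdf.size() by the caller.
   */
  public static FST<Long> buildFst(SortedMap<String, Double> termToIdf,
                                   double[] idfByOrdinal) throws IOException {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    long ordinal = 0;
    for (Map.Entry<String, Double> e : termToIdf.entrySet()) {
      idfByOrdinal[(int) ordinal] = e.getValue();   // IDF frozen at build time
      builder.add(Util.toIntsRef(new BytesRef(e.getKey()), scratch), ordinal);
      ordinal++;
    }
    return builder.finish();
  }

  /** Looks up a term's ordinal; returns -1 if the term isn't in the dictionary. */
  public static long lookup(FST<Long> fst, String term) throws IOException {
    Long ordinal = Util.get(fst, new BytesRef(term));
    return ordinal == null ? -1L : ordinal.longValue();
  }
}
{code}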

Currently this is all operating on the text dictionary format emitted by 
seq2sparse. I haven't spent any time working on integrating this into the 
existing seq2sparse workflow.
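
For reference, a rough sketch of reading that dictionary might look like the 
following, assuming the dictionary.file-* SequenceFile of Text terms and 
IntWritable indexes that seq2sparse typically writes; the class name here is 
made up and this isn't the linked code:

{code:java}
// Sketch only: dump the seq2sparse dictionary (term -> index pairs).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class Seq2SparseDictionaryReader {

  /** Prints each term and its index; this is where an FST builder would plug in. */
  public static void dump(String dictionaryPath) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(dictionaryPath);        // e.g. <output>/dictionary.file-0
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        System.out.println(term + "\t" + index.get());
      }
    } finally {
      reader.close();
    }
  }
}
{code}

Note that an FST builder needs its input in sorted order, so the dictionary 
entries would have to be sorted (or confirmed sorted) before the FST is built.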

I wonder what the future holds for the existing vectorization workflow with 
regard to the move to Spark. Any opinions as to whether I should spend time 
looking at the existing MR jobs, or whether I should look into re-implementing 
the vectorization workflow at this point? I'm tending towards the latter, but 
for me that's all very experimental at this point.





> Add support for Finite State Transducers (FST) as a DictionaryType.
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-1252
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1252
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType; this 
> should result in an order-of-magnitude speedup of seq2sparse.



