[ 
https://issues.apache.org/jira/browse/MAHOUT-285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-285:
-------------------------------

    Attachment: MAHOUT-285.patch

First pass at the integration patch. It includes the following:

* Input is now a SequenceFile<Text, StringTuple> of tokenized documents, where 
the key is the document id (ignored) and the value is an array of tokens. 
Analysis is no longer needed in this code, so I factored out NGramCollector, 
moved its code back into CollocMapper, and removed the associated command-line 
options. This input can be produced by the SparseVectorsFromSequenceFiles 
task: the DocumentProcessor class emits it to the tokenized-documents 
directory under that task's output directory.
* Output is now a SequenceFile<Text, DoubleWritable>, where the key is the 
collocation and the value is its LLR (log-likelihood ratio) score.
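
To make the input/output contract above concrete, here is a rough sketch in 
plain Java of how token arrays could be turned into LLR-scored bigrams. It is 
not the code in the patch: the class and method names (CollocLlrSketch, 
scoreBigrams) are hypothetical, it uses in-memory collections instead of 
SequenceFiles and a MapReduce job, and it assumes the standard Dunning 
log-likelihood ratio over a 2x2 contingency table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the CollocMapper from the patch.
class CollocLlrSketch {

  // x * ln(x), with the usual convention that 0 * ln(0) = 0.
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy of a set of counts.
  static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long c : counts) {
      sum += c;
      parts += xLogX(c);
    }
    return xLogX(sum) - parts;
  }

  // Dunning's LLR for a 2x2 table:
  //   k11 = count(a b), k12 = count(a, not b),
  //   k21 = count(not a, b), k22 = count(not a, not b).
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values caused by floating-point round-off.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
  }

  // Each inner list stands in for one StringTuple of tokens.
  // Returns collocation -> LLR, mirroring the <Text, DoubleWritable> output.
  static Map<String, Double> scoreBigrams(List<List<String>> documents) {
    Map<String, Long> headCounts = new HashMap<>();
    Map<String, Long> tailCounts = new HashMap<>();
    Map<List<String>, Long> bigramCounts = new HashMap<>();
    long total = 0;

    for (List<String> tokens : documents) {
      for (int i = 0; i + 1 < tokens.size(); i++) {
        String a = tokens.get(i);
        String b = tokens.get(i + 1);
        headCounts.merge(a, 1L, Long::sum);
        tailCounts.merge(b, 1L, Long::sum);
        bigramCounts.merge(Arrays.asList(a, b), 1L, Long::sum);
        total++;
      }
    }

    Map<String, Double> scores = new HashMap<>();
    for (Map.Entry<List<String>, Long> e : bigramCounts.entrySet()) {
      String a = e.getKey().get(0);
      String b = e.getKey().get(1);
      long k11 = e.getValue();
      long k12 = headCounts.get(a) - k11;
      long k21 = tailCounts.get(b) - k11;
      long k22 = total - k11 - k12 - k21;
      scores.put(a + " " + b, logLikelihoodRatio(k11, k12, k21, k22));
    }
    return scores;
  }

  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("new", "york", "city"),
        Arrays.asList("new", "york", "times"));
    scoreBigrams(docs).forEach((colloc, llr) ->
        System.out.println(colloc + "\t" + llr));
  }
}
```

A higher score means the two tokens co-occur more often than their individual 
frequencies would suggest. A distributed version would have to aggregate the 
head, tail, and bigram counts across mappers before scoring, which this 
sketch skips.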

Tested with 20news and Alice in Wonderland.

Remaining work:

* Wrap up a driver that combines the DocumentProcessor and Colloc tasks.
* Add the ability to pass through unigrams so that output from this job can be 
used as input for the DictionaryVectorizer task.

> Wrap up collocation and dictionary vectorizer integration
> ---------------------------------------------------------
>
>                 Key: MAHOUT-285
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-285
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>             Fix For: 0.3
>
>         Attachments: MAHOUT-285.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Final bit of work to integrate collocations into 0.3
> * Modify collocation finder to use dictionary vectorizer output as input 
> (saves analysis step)
> * Generate input dictionary for dictionary vectorizer that includes unigrams 
> and collocations.
> Chatted with Robin this morning. I know what needs to be done; it is just a 
> matter of grinding out the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
