[ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-242:
-------------------------------

    Attachment: MAHOUT-242.patch

Log-likelihood collocation identifier in patch form. The code lives in 
o.a.m.nlp.collocations.llr.

I think there are some improvements that can be made, but if possible it would 
be nice to review and commit this version, then add on to it later through 
additional patches. More specifically, I'd like to see this:

* include the ability to avoid forming collocations around sentence boundaries 
and other boundaries per: 
http://www.lucidimagination.com/search/document/d259def498803ffe/collocation_clarification#29fbb050cf5fa64
* work for non-whitespace-delimited languages, i.e. anything an analyzer can 
produce tokens for.
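
A minimal sketch of the first item, assuming the text has already been split 
into sentences and tokenized upstream (for example by an analyzer). This is 
not code from the patch; the class and method names (BoundaryBigrams, bigrams) 
are hypothetical. Bigrams are generated within each sentence only, so no 
collocation candidate spans a sentence boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of MAHOUT-242.patch.
class BoundaryBigrams {

    // Emits space-joined bigrams for each sentence independently.
    // The last token of one sentence is never paired with the first
    // token of the next one.
    static List<String> bigrams(List<List<String>> sentences) {
        List<String> out = new ArrayList<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i + 1 < sentence.size(); i++) {
                out.add(sentence.get(i) + " " + sentence.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two already-tokenized sentences (made-up sample input).
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("new", "york"),
                Arrays.asList("big", "apple", "pie"));
        for (String bigram : bigrams(sentences)) {
            System.out.println(bigram);
        }
    }
}
```

The same loop could treat other boundaries (paragraphs, punctuation, stop 
words) as separate units, provided the upstream tokenizer reports them.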

I removed the ability to read in files from a directory; Robin's document -> 
sequence file work fits into this well.


> LLR Collocation Identifier
> --------------------------
>
>                 Key: MAHOUT-242
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-242
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, 
> mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the 
> LogLikelihoodRatio calculation. 
> As discussed in: 
> * 
> http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * 
> http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as 
> usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" 
> -Dexec.args="--input src/test/resources/article --colloc target/colloc 
> --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-00000
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work 
> to get this into patch state and integrate with Robin's document vectorizer 
> work in MAHOUT-237.
> Some basic TODOs/FIXMEs include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.
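
For reference, the log-likelihood ratio score mentioned in the issue 
description is commonly computed from a 2x2 contingency table of counts for a 
bigram "A B": k11 = A followed by B, k12 = A followed by something other than 
B, k21 = something other than A followed by B, k22 = neither. The following is 
a minimal, self-contained sketch of that calculation, not the code in the 
patch; the class and method names (LlrSketch, logLikelihoodRatio) are 
hypothetical.

```java
// Hypothetical illustration, not part of MAHOUT-242.patch.
class LlrSketch {

    // x * ln(x), with 0 * ln(0) taken as 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum(k ln k),
    // where N is the sum of the counts.
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long k : counts) {
            parts += xLogX(k);
            sum += k;
        }
        return xLogX(sum) - parts;
    }

    // G^2 statistic for a 2x2 table: twice the sum over cells of
    // observed * ln(observed / expected), written in entropy form.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    public static void main(String[] args) {
        // Made-up counts: an independent pair and a strongly associated pair.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

Higher scores indicate that the two tokens co-occur more often than 
independence would predict, which is why sorting the output by score in 
descending order surfaces the more interesting collocations first.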

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
