[jira] Commented: (MAHOUT-242) LLR Collocation Identifier

2010-01-29 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806357#action_12806357
 ] 

Drew Farris commented on MAHOUT-242:


bq. Hey Drew, I'm not much of a maven guy - what's the maven-foo you use to get 
this running?

Jake, my mistake, I should have included updated docs when I updated the patch. 

I eliminated the code to read plain text files from a directory, so you would 
need to begin by producing a SequenceFile<Text,Text> (document id, document) as 
input. Robin's utility in mahout-examples, 
o.a.m.text.SequenceFilesFromDirectory, can do this. Run the following from the 
'examples' directory:

{code}
mvn -e exec:java \
  -Dexec.mainClass=org.apache.mahout.text.SequenceFilesFromDirectory \
  -Dexec.args="--parent (...input directory..) --outputDir (..output directory..) --charset UTF-8"
{code}

Once you have the sequence file to use as input, run the following from the 
'examples' directory as well.

{code}
mvn -e exec:java \
  -Dexec.mainClass=org.apache.mahout.nlp.collocations.llr.CollocDriver \
  -Dexec.args="--input (..path-to-input..) --output (..path-to-output..) -w"
{code}

Once the driver class is run, the collocations will be in 
(output-directory)/colloc/part-0 as plaintext. They can be sorted by LLR 
score using the same sort command I included in the previous comment above (I 
have a question about this below).

FWIW, I need to re-submit the patch to clean up some of the pom changes now that 
MAHOUT-215 has been applied.

A couple follow up questions while you're looking at this:

Currently I just dump the results of the second pass to a file, not sorted by 
LLR score. I could sort by LLR by sending it through another pass with an 
identity mapper and a reducer but I suspect that's probably pretty inefficient. 
Is there a better way to sort the output of the second pass by LLR?

If I only wanted to emit the top 1-10% of the collocs (user configurable), how 
would I tell the reducer to stop emitting results at a certain point (or is yet 
another pass needed to achieve something like this?)
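
One possible approach to the top-N question, sketched in plain Java rather than against the Hadoop API: keep a bounded min-heap of the best-scoring collocations seen so far and only emit its contents at the end. The class and method names here (TopCollocations, ScoredNgram, topK) are hypothetical and are not part of the patch. Turning "top 1-10%" into a concrete k assumes the total number of collocations is already known, for example from a counter in an earlier pass. In a MapReduce job this would presumably require a single reducer, or a per-reducer top-k that is merged afterwards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopCollocations {

  /** Hypothetical holder for one scored n-gram; not a class from the patch. */
  static final class ScoredNgram {
    final String ngram;
    final double llr;

    ScoredNgram(String ngram, double llr) {
      this.ngram = ngram;
      this.llr = llr;
    }
  }

  /** Returns the k highest-scoring entries, best first, using O(k) memory. */
  static List<ScoredNgram> topK(Iterable<ScoredNgram> scored, int k) {
    if (k <= 0) {
      return new ArrayList<>();
    }
    Comparator<ScoredNgram> byLlr = Comparator.comparingDouble((ScoredNgram s) -> s.llr);
    // Min-heap: the head is always the weakest entry currently retained.
    PriorityQueue<ScoredNgram> heap = new PriorityQueue<>(k, byLlr);
    for (ScoredNgram s : scored) {
      if (heap.size() < k) {
        heap.add(s);
      } else if (s.llr > heap.peek().llr) {
        heap.poll();   // drop the weakest retained entry
        heap.add(s);
      }
    }
    List<ScoredNgram> result = new ArrayList<>(heap);
    result.sort(byLlr.reversed());
    return result;
  }
}
```

Because the heap already yields the survivors in score order, this would also sidestep the separate sort for the truncated output, at the cost of holding k entries in memory.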

Would it be better to emit a SequenceFile<LongWritable,Text> instead of a text 
file as the output from the final pass?



 LLR Collocation Identifier
 --

 Key: MAHOUT-242
 URL: https://issues.apache.org/jira/browse/MAHOUT-242
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Jake Mannix
Priority: Minor
 Attachments: MAHOUT-242.patch, MAHOUT-242.patch, 
 mahout-colloc.tar.gz, mahout-colloc.tar.gz


 Identifies interesting Collocations in text using ngrams scored via the 
 LogLikelihoodRatio calculation. 
 As discussed in: 
 * 
 http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
 * 
 http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
 Current form is a tar of a maven project that depends on mahout. Build as 
 usual with 'mvn clean install', can be executed using:
 {noformat}
 mvn -e exec:java -Dexec.mainClass=org.apache.mahout.colloc.CollocDriver \
   -Dexec.args="--input src/test/resources/article --colloc target/colloc --output target/output -w"
 {noformat}
 Output will be placed in target/output and can be viewed nicely using:
 {noformat}
 sort -rn -k1 target/output/part-0
 {noformat}
 Includes rudimentary unit tests. Please review and comment. Needs more work 
 to get this into patch state and integrate with Robin's document vectorizer 
 work in MAHOUT-237
 Some basic TODO/FIXME's include:
 * use mahout math's ObjectInt map implementation when available
 * make the analyzer configurable
 * better input validation + negative unit tests.
 * more flexible ways to generate units of analysis (n-1)grams.
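
For readers unfamiliar with the score mentioned in the description, the following is a minimal sketch of the log-likelihood ratio for a 2x2 contingency table, written in the usual entropy form. It is an illustration only and is not necessarily how the patch computes it. The class name Llr and the interpretation of the counts are assumptions: k11 is taken to be the number of times the two terms occur together, k12 and k21 the number of times one occurs without the other, and k22 the number of n-grams containing neither.

```java
class Llr {

  /** x * ln(x), with the conventional value 0 at x = 0. */
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  /** Unnormalized Shannon entropy: N ln N - sum(x ln x), where N = sum(x). */
  static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long c : counts) {
      parts += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - parts;
  }

  /** LLR (G^2) of a 2x2 table of counts; larger means stronger association. */
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Rounding error can leave a tiny negative value for independent tables.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
  }
}
```

A table whose rows and columns are independent scores approximately zero, while a table in which the two terms only ever appear together scores high.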

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



dependency question: mahout-examples - watchmaker-swing - jfreechart - jcommons?

2010-01-29 Thread Drew Farris
I spent some time looking at the licenses for the dependencies
included in the binary release built as a part of MAHOUT-215, and I'm
wondering if anyone knows whether code in mahout-examples uses,
directly or indirectly, any of the jfreechart code that is included as a
transitive dependency of the watchmaker-swing library.

The issue at hand is that jfreechart pulls in something called
jcommons, which appears to be licensed under GPL. It is my
understanding that mahout shouldn't include GPL licensed dependencies
in a binary release.

So, if mahout doesn't use jfreechart in any way via watchmaker-swing,
I can set an exclusion for it in the dependency declaration and thus
prevent the inclusion of jcommons. Mahout builds and its tests complete
fine with this exclusion set, but that's not the whole story of
course.

Drew


Re: dependency question: mahout-examples - watchmaker-swing - jfreechart - jcommons?

2010-01-29 Thread deneche abdelhakim
The only example that actually uses watchmaker-swing is Travelling
Salesman, mainly because it was a direct port of an existing
watchmaker example. And if I remember well, it does not actually use
JFreeChart...so I think it's safe to exclude it.
