[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803361#action_12803361 ]
Ted Dunning commented on MAHOUT-242: ------------------------------------ {quote} Are we really worried about individual documents which are bigger than hundreds of MB? {quote} I am not worried about them at this point. Even in the most skewed distributions, this would be very, very rare. How many people on linked in have 100 million friends? The quadratic cost of unwindowed collocation detection would make this infeasible first. For windowed collocation, it isn't that hard to scan the input, but reading the whole shebang isn't such a bad thing. > LLR Collocation Identifier > -------------------------- > > Key: MAHOUT-242 > URL: https://issues.apache.org/jira/browse/MAHOUT-242 > Project: Mahout > Issue Type: New Feature > Affects Versions: 0.3 > Reporter: Drew Farris > Priority: Minor > Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, > mahout-colloc.tar.gz > > > Identifies interesting Collocations in text using ngrams scored via the > LogLikelihoodRatio calculation. > As discussed in: > * > http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2 > * > http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e > Current form is a tar of a maven project that depends on mahout. Build as > usual with 'mvn clean install', can be executed using: > {noformat} > mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" > -Dexec.args="--input src/test/resources/article --colloc target/colloc > --output target/output -w" > {noformat} > Output will be placed in target/output and can be viewed nicely using: > {noformat} > sort -rn -k1 target/output/part-00000 > {noformat} > Includes rudimentary unit tests. Please review and comment. Needs more work > to get this into patch state and integrate with Robin's document vectorizer > work in MAHOUT-237 > Some basic TODO/FIXME's include: > * use mahout math's ObjectInt map implementation when available > * make the analyzer configurable > * better input validation + negative unit tests. > * more flexible ways to generate units of analysis (n-1)grams. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.