[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-242: ------------------------------- Attachment: MAHOUT-242.patch Thanks for the review Isabel, here's an updated patch. {quote} CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right? {quote} Yes, you are correct. In addition to what Ted and Jake have already said, reading parts documents is a little tricky because there would have to be some overlap to ensure that the collocations around the split were picked up properly. Perhaps there will be an opportunity to do something like this once the collocs are windowed based on sentence boundary? {quote} Gram, Line 192: You can omit the "else" clauses, in case the "if" already returns its result to the caller, however this is a question of style." {quote} I prefer the form you describe, easier to read. {quote} I was wondering, why in line 177 you did not write "this.position != other.position"? {quote} I'm not certain about what you're referring to here, before edits, line 177 is in the middle of the write() method. {quote} NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated apis instead of suppressing deprecation warnings? {quote} I've removed the SuppressWarnings("deprecated"). The only reason they are there is that I've used the hadoop 0.19 api here and I tend to get distracted by the deprecation warnings. This brings up a question of policy however. Should I be using the newest hadoop api for new work in Mahout such as this, or should we continue to make efforts to retain backwards compatibility with older hadoop installations? I assumed the policy was the latter, but relish the opportunity to learn the new api -- I just didn't think Mahout ready to commit to the new api yet. {quote} LLRReducer, Line 143: How about making the method package private if it should be used in unit tests only anyway? Line 106: Would be nice to have an additional counter for the skipped grams. {quote} done and done, great idea to use the counter to track skipped grams. > LLR Collocation Identifier > -------------------------- > > Key: MAHOUT-242 > URL: https://issues.apache.org/jira/browse/MAHOUT-242 > Project: Mahout > Issue Type: New Feature > Affects Versions: 0.3 > Reporter: Drew Farris > Priority: Minor > Attachments: MAHOUT-242.patch, MAHOUT-242.patch, > mahout-colloc.tar.gz, mahout-colloc.tar.gz > > > Identifies interesting Collocations in text using ngrams scored via the > LogLikelihoodRatio calculation. > As discussed in: > * > http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2 > * > http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e > Current form is a tar of a maven project that depends on mahout. Build as > usual with 'mvn clean install', can be executed using: > {noformat} > mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" > -Dexec.args="--input src/test/resources/article --colloc target/colloc > --output target/output -w" > {noformat} > Output will be placed in target/output and can be viewed nicely using: > {noformat} > sort -rn -k1 target/output/part-00000 > {noformat} > Includes rudimentary unit tests. Please review and comment. Needs more work > to get this into patch state and integrate with Robin's document vectorizer > work in MAHOUT-237 > Some basic TODO/FIXME's include: > * use mahout math's ObjectInt map implementation when available > * make the analyzer configurable > * better input validation + negative unit tests. > * more flexible ways to generate units of analysis (n-1)grams. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.