Re: Cosine and Tanimoto Similarity
One thing I found very irritating when using cosine, or any measure in the range [0, 1], is that sometimes two distinct items have very small values of distance when you inspect them. I am always worried that float precision is not enough to capture the small detail that makes the difference between accept and reject.

On the other hand, log-likelihood similarity seems to produce values in the range of 100+, sometimes even 1000+, for strong likelihoods, while very unlikely events have small values near 1.0.

In practice this kind of holds: as the number of documents increases, I usually have to scale cosine to a larger range or switch to some hybrid similarity metric to get good clustering.

What about you guys? Both of you have worked on huge data sets; what insights can you share about what works and what doesn't?
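For concreteness, the two measures being compared in this thread can be computed as below (a minimal sketch in plain Java; the class and method names are illustrative, not any Mahout API):

```java
public class SimilarityDemo {
    // Cosine similarity: dot(a, b) / (|a| * |b|); in [0, 1] for non-negative vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Tanimoto (extended Jaccard): dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)).
    static double tanimoto(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (na + nb - dot);
    }

    public static void main(String[] args) {
        // Two distinct but nearly parallel term-frequency vectors.
        double[] a = {3, 1, 0, 2};
        double[] b = {3, 1, 1, 2};
        System.out.println("cosine   = " + cosine(a, b));   // ~0.966
        System.out.println("tanimoto = " + tanimoto(a, b)); // ~0.933
    }
}
```

This illustrates the complaint above: for these two clearly distinct vectors both scores crowd into the narrow band just below 1.0, so the "accept or reject" decision hinges on the third decimal place.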
[jira] Commented: (MAHOUT-221) Implementation of FP-Bonsai Pruning for fast pattern mining
[ https://issues.apache.org/jira/browse/MAHOUT-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794677#action_12794677 ]

Robin Anil commented on MAHOUT-221:
-----------------------------------

Ping! Need a code review.

Implementation of FP-Bonsai Pruning for fast pattern mining
-----------------------------------------------------------

                Key: MAHOUT-221
                URL: https://issues.apache.org/jira/browse/MAHOUT-221
            Project: Mahout
         Issue Type: New Feature
         Components: Frequent Itemset/Association Rule Mining
   Affects Versions: 0.3
           Reporter: Robin Anil
           Assignee: Robin Anil
        Attachments: MAHOUT-FPGROWTH.patch, MAHOUT-FPGROWTH.patch

FP-Bonsai is a method to prune long chained FP-trees for faster growth: http://win.ua.ac.be/~adrem/bibrem/pubs/fpbonsai.pdf

This implementation also adds a transaction-preprocessing map/reduce job which converts a list of transactions {1, 2, 4, 5}, {1, 2, 3}, {1, 2} into a tree structure and thus saves space during the FPGrowth map/reduce. The tree formed from the transactions above is:

    (1,3) - (2,3) - (4,1) - (5,1)
              |
              - (3,1)

For typical data sets this reduces storage space by a great amount and thus saves time during shuffle and sort.

Also added a reducer to PFPGrowth (not part of the original paper) which does this compression and saves space.

This patch also adds an example transaction-dataset generator from the Flickr and Delicious data sets: https://www.uni-koblenz.de/FB4/Institutes/IFI/AGStaab/Research/DataSets/PINTSExperimentsDataSets/

Both of them are gigabytes of tag data, where each record gives date, userid, itemid, tag. The example maker creates a transaction from all the unique tags a user has tagged on an item.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
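The transaction-to-tree compression described in the issue can be sketched as a simple prefix trie in which each node carries the count of transactions sharing that prefix (a minimal illustration, not the patch's actual code; class and method names are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class TransactionTrie {
    // One node per distinct prefix item; 'count' is how many transactions pass through it.
    static class Node {
        final Map<Integer, Node> children = new HashMap<>();
        int count = 0;
    }

    final Node root = new Node();

    // Insert one transaction, with items already sorted into a fixed order.
    void insert(int[] transaction) {
        Node cur = root;
        for (int item : transaction) {
            cur = cur.children.computeIfAbsent(item, k -> new Node());
            cur.count++;
        }
    }

    // Total number of nodes below n (excluding n itself).
    int nodeCount(Node n) {
        int total = 0;
        for (Node child : n.children.values()) {
            total += 1 + nodeCount(child);
        }
        return total;
    }

    public static void main(String[] args) {
        TransactionTrie trie = new TransactionTrie();
        // The three transactions from the issue description.
        trie.insert(new int[]{1, 2, 4, 5});
        trie.insert(new int[]{1, 2, 3});
        trie.insert(new int[]{1, 2});
        System.out.println("tree nodes = " + trie.nodeCount(trie.root)); // 5
    }
}
```

The three example transactions contain nine items in total but collapse into just five tree nodes, with counts matching the (item, count) pairs in the diagram above; that shrinkage is where the shuffle-and-sort savings come from.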
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794678#action_12794678 ]

Robin Anil commented on MAHOUT-220:
-----------------------------------

I will update a new patch. I am reverting all these changes, will stick to the 80-column format and the new Lucene code formatter, and will start re-working from the latest trunk.

Mahout Bayes Code cleanup
-------------------------

                Key: MAHOUT-220
                URL: https://issues.apache.org/jira/browse/MAHOUT-220
            Project: Mahout
         Issue Type: Improvement
         Components: Classification
   Affects Versions: 0.3
           Reporter: Robin Anil
           Assignee: Robin Anil
            Fix For: 0.3
        Attachments: MAHOUT-BAYES.patch

Following Isabel's checkstyle, I am adding a whole slew of code cleanup, with the following exceptions:
1. Line length used is 120 instead of 80.
2. The static final "log" is kept as is, not renamed to LOG.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-231) Upgrade QM reports to use Clover 2.6
Upgrade QM reports to use Clover 2.6
------------------------------------

                Key: MAHOUT-231
                URL: https://issues.apache.org/jira/browse/MAHOUT-231
            Project: Mahout
         Issue Type: Task
         Components: Website
   Affects Versions: 0.3
           Reporter: Isabel Drost
           Priority: Minor
            Fix For: 0.3

Atlassian has donated a license for a new Clover version. The reports provide more information and are easier to read. We should upgrade the site reports to use that version.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: Cosine and Tanimoto Similarity
Floating-point precision is not an issue with any of these metrics, since the counts you are dealing with are never large enough for the statistical uncertainty (roughly sqrt(number of observations)) to be outweighed by the numerical accuracy (roughly 10^-7 for float, 10^-16 for double).

A much larger problem is actually with small counts, where coincidence can give you a cosine similarity of 1 or very close to it. Log-likelihood ratio testing is a fine way to mask away the measures unlikely to have good results. Even better is not to train those numbers on actual cooccurrence, but to use corpus frequency to weight events that have an interesting LLR. Another way to deal with statistical noise is to use a regularized learning system. One way or another, you need to deal with the problem.

On Sun, Dec 27, 2009 at 1:51 AM, Robin Anil robin.a...@gmail.com wrote:
> One thing I found very irritating when using cosine or numbers in the
> range [0, 1] is that sometimes two distinct items have very small values
> of distance when you inspect them. [...]

--
Ted Dunning, CTO
DeepDyve
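The log-likelihood ratio being discussed is Dunning's G² statistic over a 2×2 cooccurrence table. A minimal sketch using the usual entropy formulation (illustrative code, not Mahout's actual class):

```java
public class Llr {
    // x * ln(x), with the 0 * ln(0) = 0 convention.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double xLogXSum = 0.0;
        for (long c : counts) {
            sum += c;
            xLogXSum += xLogX(c);
        }
        return xLogX(sum) - xLogXSum;
    }

    // G^2 for a 2x2 table: k11 = both events together, k12/k21 = one event
    // without the other, k22 = neither. 0 means independence; large values
    // mean strong association.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against tiny negative values from rounding
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // Independent counts score 0; strong cooccurrence scores high, which
        // matches the 100+ / 1000+ values mentioned earlier in the thread.
        System.out.println(logLikelihoodRatio(1, 1, 1, 1));           // ~0
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));         // ~27.7
        System.out.println(logLikelihoodRatio(1000, 10, 10, 100000)); // large
    }
}
```

Because G² grows with the amount of evidence rather than being squeezed into [0, 1], a perfect 10-vs-10 cooccurrence scores about 27.7 while the same pattern backed by thousands of observations scores orders of magnitude higher, which is the behavior described above.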