Re: Cosine and Tanimoto Similarity

2009-12-27 Thread Robin Anil
One thing I have found very irritating when using cosine, or any measure whose
values fall in the range [0, 1], is that two distinct items sometimes end up
with a very small distance between them when you inspect them. I am always
worried that float precision is not enough to capture the small detail that
makes the difference between accept and reject.  Log-likelihood similarity,
on the other hand, seems to take values of 100+, sometimes even 1000+, for
strong likelihoods, while very unlikely events have small values, below 1.0.
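
Just to make the range concrete, here is a rough sketch of the two measures in
the subject line (plain Java on dense count vectors, not the actual Mahout
classes); both values land in [0, 1]:

// Rough illustration only: cosine and Tanimoto on dense count vectors.
public final class SimilaritySketch {

  static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  // cosine(a, b) = a.b / (|a| * |b|); in [0, 1] for non-negative counts.
  static double cosine(double[] a, double[] b) {
    return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
  }

  // Tanimoto (extended Jaccard): a.b / (|a|^2 + |b|^2 - a.b); also in [0, 1].
  static double tanimoto(double[] a, double[] b) {
    double ab = dot(a, b);
    return ab / (dot(a, a) + dot(b, b) - ab);
  }

  public static void main(String[] args) {
    double[] x = {3, 0, 1, 2};
    double[] y = {2, 1, 0, 2};
    System.out.println("cosine   = " + cosine(x, y));   // ~0.89
    System.out.println("tanimoto = " + tanimoto(x, y)); // ~0.77
  }
}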

In practice this seems to hold: as the number of documents increases, I
usually have to scale cosine to a larger range, or switch to some hybrid
similarity metric, to get good clustering. What about you guys? Both of you
have worked on huge data sets; what kind of insights can you share about
what works and what doesn't?


[jira] Commented: (MAHOUT-221) Implementation of FP-Bonsai Pruning for fast pattern mining

2009-12-27 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794677#action_12794677
 ] 

Robin Anil commented on MAHOUT-221:
---

Ping! Need a code review

 Implementation of FP-Bonsai Pruning for fast pattern mining
 ---

 Key: MAHOUT-221
 URL: https://issues.apache.org/jira/browse/MAHOUT-221
 Project: Mahout
  Issue Type: New Feature
  Components: Frequent Itemset/Association Rule Mining
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Attachments: MAHOUT-FPGROWTH.patch, MAHOUT-FPGROWTH.patch


 FP-Bonsai is a method to prune long-chained FP-Trees for faster growth: 
 http://win.ua.ac.be/~adrem/bibrem/pubs/fpbonsai.pdf
 This implementation also adds a transaction preprocessing map/reduce job 
 which converts a list of transactions such as {1, 2, 4, 5}, {1, 2, 3}, {1, 2} 
 into a tree structure, and thus saves space during the fpgrowth map/reduce. 
 For typical data this reduces the storage space by a large amount and so 
 saves time during shuffle and sort. The tree formed from the transactions 
 above is:
 (1,3) - (2,3) - (4,1) - (5,1)
               \ (3,1)
 Also added a reducer to PFPgrowth (not part of the original paper) which does 
 this compression and saves on space. 
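 As a rough illustration of the compression (just a sketch, not the actual 
 patch code), the three transactions above collapse into counted prefix 
 paths like this: 
 
 // Sketch only: compress transactions into a counted prefix tree.
 import java.util.LinkedHashMap;
 import java.util.Map;
 
 final class PrefixTreeSketch {
 
   static final class Node {
     final int item;
     int count;
     final Map<Integer, Node> children = new LinkedHashMap<Integer, Node>();
     Node(int item) { this.item = item; }
   }
 
   // Items in each transaction are assumed already sorted by frequency.
   static Node build(int[][] transactions) {
     Node root = new Node(-1); // sentinel root
     for (int[] txn : transactions) {
       Node cur = root;
       for (int item : txn) {
         Node child = cur.children.get(item);
         if (child == null) {
           child = new Node(item);
           cur.children.put(item, child);
         }
         child.count++;
         cur = child;
       }
     }
     return root;
   }
 
   static void print(Node n, String indent) {
     for (Node c : n.children.values()) {
       System.out.println(indent + "(" + c.item + "," + c.count + ")");
       print(c, indent + "  ");
     }
   }
 
   public static void main(String[] args) {
     int[][] txns = {{1, 2, 4, 5}, {1, 2, 3}, {1, 2}};
     print(build(txns), "");
     // Prints:
     // (1,3)
     //   (2,3)
     //     (4,1)
     //       (5,1)
     //     (3,1)
   }
 }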
 This patch also adds an example transaction dataset generator for the flickr 
 and delicious data sets: 
 https://www.uni-koblenz.de/FB4/Institutes/IFI/AGStaab/Research/DataSets/PINTSExperimentsDataSets/
 Both of them are gigabytes of tag data, where each record gives date, userid, 
 itemid and tag. The example maker creates a transaction from all the unique 
 tags a user has applied to an item. 
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-27 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794678#action_12794678
 ] 

Robin Anil commented on MAHOUT-220:
---

I will upload a new patch. I am reverting all these changes, will stick to the 
80-column format and the new Lucene code formatter, and will start re-working 
from the latest trunk.

 Mahout Bayes Code cleanup
 -

 Key: MAHOUT-220
 URL: https://issues.apache.org/jira/browse/MAHOUT-220
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-BAYES.patch


 Following Isabel's checkstyle, I am adding a whole slew of code cleanups, with 
 the following exceptions: 
 1. The line length used is 120 instead of 80. 
 2. The static final log variable is kept as is, not renamed to LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-231) Upgrade QM reports to use Clover 2.6

2009-12-27 Thread Isabel Drost (JIRA)
Upgrade QM reports to use Clover 2.6


 Key: MAHOUT-231
 URL: https://issues.apache.org/jira/browse/MAHOUT-231
 Project: Mahout
  Issue Type: Task
  Components: Website
Affects Versions: 0.3
Reporter: Isabel Drost
Priority: Minor
 Fix For: 0.3


Atlassian has donated a license for a new Clover version. The reports provide 
more information and are easier to read. We should upgrade the site reports to 
use that version.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Cosine and Tanimoto Similarity

2009-12-27 Thread Ted Dunning
Floating point precision is not an issue with any of these metrics, since the
counts you are dealing with are never large enough for the numerical accuracy
(roughly 10^-7 for float, 10^-17 for double) to matter next to the statistical
uncertainty (roughly sqrt(number of observations) on the counts themselves).
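To put rough numbers on that: with a million observations, a count around 10^6
carries statistical noise on the order of sqrt(10^6) = 10^3, i.e. a relative
error of about 10^-3, some four orders of magnitude larger than float
round-off.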

A much larger problem actually comes with small counts, where coincidence can
give you a cosine similarity of 1 or very close to it.  Log-likelihood ratio
testing is a fine way to mask away the measures unlikely to give good
results.  Even better is not to train on those raw cooccurrence numbers, but
to use corpus frequency to weight the events that have an interesting LLR.
Another way to deal with statistical noise is to use a regularized learning
system.

You need to deal with the problem one way or another.
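
For reference, the LLR test here boils down to the G^2 statistic on the 2x2
cooccurrence table; a self-contained sketch (not the Mahout implementation,
and the numbers in main are only illustrative) looks roughly like this:

// Sketch of the log-likelihood ratio (G^2) on a 2x2 cooccurrence table.
// Large values flag cooccurrences that are unlikely to be coincidence.
public final class LlrSketch {

  // k11: A and B together, k12: A only, k21: B only, k22: neither.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + colEntropy - matEntropy);
  }

  // N*log(N) - sum(k*log(k)), with 0*log(0) taken as 0.
  static double entropy(long... counts) {
    double sum = 0.0;
    long total = 0;
    for (long k : counts) {
      if (k > 0) {
        sum += k * Math.log(k);
      }
      total += k;
    }
    return total * Math.log(total) - sum;
  }

  public static void main(String[] args) {
    // Seen together 100 times out of ~100k observations: LLR in the hundreds.
    System.out.println(logLikelihoodRatio(100, 1000, 1000, 100000));
    // Two rare items seen together exactly once: cosine would be 1.0,
    // but the LLR is only around 20, so a threshold masks it out.
    System.out.println(logLikelihoodRatio(1, 0, 0, 10000));
  }
}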

On Sun, Dec 27, 2009 at 1:51 AM, Robin Anil robin.a...@gmail.com wrote:

 One thing I have found very irritating when using cosine, or any measure whose
 values fall in the range [0, 1], is that two distinct items sometimes end up
 with a very small distance between them when you inspect them. I am always
 worried that float precision is not enough to capture the small detail that
 makes the difference between accept and reject.  Log-likelihood similarity,
 on the other hand, seems to take values of 100+, sometimes even 1000+, for
 strong likelihoods, while very unlikely events have small values, below 1.0.

 In practice this seems to hold: as the number of documents increases, I
 usually have to scale cosine to a larger range, or switch to some hybrid
 similarity metric, to get good clustering. What about you guys? Both of you
 have worked on huge data sets; what kind of insights can you share about
 what works and what doesn't?




-- 
Ted Dunning, CTO
DeepDyve