log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Phoenix Bai
Hi, the counts for two events are:

|                      | Event A  | Everything but A |
|----------------------|----------|------------------|
| **Event B**          | k11 = 7  | k12 = 8          |
| **Everything but B** | k21 = 13 | k22 = 300,000    |

According to the code, I will get: rowEntropy = entropy(7, 8) + entropy(13, 300,000) = 222; colEntropy = entropy(7, 13) + entropy(8, 300,000) = 152
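For anyone following along, this is roughly what the computation looks like in Python. It uses the standard G² decomposition over marginal counts, 2·(rowEntropy + colEntropy − matrixEntropy), which may group the entropies differently from the Mahout code being quoted above, so treat it as a sketch of the statistic rather than a reproduction of that code:

```python
import math

def x_log_x(x):
    """x * ln(x), taken to be 0 at x = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Un-normalized Shannon entropy over raw counts: N*ln(N) - sum x*ln(x)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table of raw counts."""
    row = entropy(k11 + k12, k21 + k22)   # entropy of the row sums
    col = entropy(k11 + k21, k12 + k22)   # entropy of the column sums
    mat = entropy(k11, k12, k21, k22)     # entropy of the whole table
    return max(0.0, 2.0 * (row + col - mat))

score = llr(7, 8, 13, 300000)  # the counts from the table above
```

Note the statistic is computed directly on counts, not probabilities, which is also how the Mahout implementation works.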

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Sean Owen
These events do sound 'similar'. They occur together about half the time either one of them occurs. You might have many pairs that end up being similar for the same reason, and this is not surprising. They're all really similar. The mapping here from LLR's range of [0,inf) to [0,1] is pretty
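The mapping in question (as I recall it from Mahout's LogLikelihoodSimilarity; treat the exact formula as an assumption) squashes [0, ∞) into [0, 1) like this:

```python
def llr_similarity(llr_value):
    """Map an LLR in [0, inf) onto [0, 1); large LLR values saturate near 1."""
    return 1.0 - 1.0 / (1.0 + llr_value)
```

Because the mapping saturates quickly, many strongly associated pairs all land very close to 1.0, which is exactly the effect described above.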

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Phoenix Bai
Good point. BTW, why use counts instead of probabilities? For ease and efficiency of implementation? Also, do you think the similarity score computed from counts might differ much from one computed from probabilities? Thank you very much for your prompt reply. On Wed, Apr 10, 2013 at 5:50 PM, Sean Owen

Re: In-memory kmeans clustering

2013-04-10 Thread Dan Filimon
Thanks! I actually didn't know you can do that. :) On Tue, Apr 9, 2013 at 7:22 PM, Johannes Schulte johannes.schu...@gmail.com wrote: dataPoints can be in memory or from disk, and you can sample the dataPoints for initialClusters. On Tue, Apr 9, 2013 at 6:16 PM, Johannes Schulte

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Sean Owen
Yes, using counts is more efficient. Certainly it makes the LLR value different, since the results are not normalized; all the input values are N times larger (N = sum of the k's), and so the LLR is N times larger. 2x more events in the same ratio will make the LLR 2x larger too. That's just fine if you're
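The scaling behavior is easy to verify numerically. A self-contained sketch (re-deriving G² from raw counts via the standard decomposition, which is an assumption about the exact formula in use):

```python
import math

def x_log_x(x):
    # x * ln(x), taken to be 0 at x = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # un-normalized entropy of raw counts: N*ln(N) - sum x*ln(x)
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # G^2 = 2 * (rowEntropy + colEntropy - matrixEntropy)
    return 2.0 * (entropy(k11 + k12, k21 + k22)
                  + entropy(k11 + k21, k12 + k22)
                  - entropy(k11, k12, k21, k22))

base = llr(7, 8, 13, 300000)
doubled = llr(14, 16, 26, 600000)  # same ratios, twice the events
# doubled equals 2 * base (up to floating-point precision)
```

This is because each entropy term is homogeneous of degree 1 in the counts, so scaling every cell by N scales the whole statistic by N.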

Re: Could OpenNLP use Mahout for classification?

2013-04-10 Thread Isabel Drost-Fromm
Hi Jörn, On Tuesday, April 09, 2013 10:12:47 PM Jörn Kottmann wrote: Logistic Regression (is that similar to our maxent?) Online Passive Aggressive HMM The datasets we are training OpenNLP on are usually rather small and can easily be processed with a single CPU; does Mahout support

Using lucene.vector

2013-04-10 Thread Vineet Krishnan
Hi all, I'm following the tutorials at http://searchhub.org/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/ to generate a vector file and a dictionary file from a solr index. Does anyone know or have any links to a resource that describes how to use these files for

Re: cross recommender

2013-04-10 Thread Pat Ferrel
BTW, I have this working on trivial data and am in the process of measuring its results on some real-world data. It does a lot with DistributedRowMatrix, so I'll be interested to see how it performs with a larger data set. Does anyone know of a public data set that provides things like views

RE: Classification Algorithms in Mahout

2013-04-10 Thread Bhattacharjee, Rohan
Doesn't the random part of random forest defend against overfitting? -Original Message- From: ey-chih chow [mailto:eyc...@gmail.com] Sent: Saturday, April 06, 2013 5:45 PM To: user@mahout.apache.org Subject: Re: Classification Algorithms in Mahout I actually got a lot of overfitting.

Re: In-memory kmeans clustering

2013-04-10 Thread Ahmet Ylmaz
Thanks, we will try the MapReduce version as you described. From: Dan Filimon dangeorge.fili...@gmail.com To: user@mahout.apache.org Sent: Wednesday, April 10, 2013 1:19 PM Subject: Re: In-memory kmeans clustering Thanks! I actually didn't know you can do that.

Re: Could OpenNLP use Mahout for classification?

2013-04-10 Thread Jörn Kottmann
Thanks for your response, I will give it a try, our follow up jira issue is here: https://issues.apache.org/jira/browse/OPENNLP-574 Jörn On 04/10/2013 05:04 PM, Isabel Drost-Fromm wrote: Hi Jörn, On Tuesday, April 09, 2013 10:12:47 PM Jörn Kottmann wrote: Logistic Regression (is that

Re: Could OpenNLP use Mahout for classification?

2013-04-10 Thread Suneel Marthi
FWIW, this paper discusses the equivalence of Logistic Regression and MaxEnt http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf From: Jörn Kottmann kottm...@gmail.com To: user@mahout.apache.org Sent: Wednesday, April 10, 2013 4:23 PM Subject:

Fold-in for ALSWR

2013-04-10 Thread Chloe
Hi everyone, I am reaching out to the list requesting some help/advice on implementing fold-in with the Alternating Least Squares algo in Mahout, a problem on which I am stumped. I've read other posts on the list and over on SO, like:

Re: Fold-in for ALSWR

2013-04-10 Thread Sean Owen
For simplicity let's consider a brand-new user first, not a new rating for existing user. I'll use the notation from my slides that you mention, A = X * Y'. To clarify, I think you mean you have a new A_u row, and want to know X_u. The two expressions are not alternatives, they're the same thing,
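To make that concrete, here is a minimal numpy sketch of folding a new user row A_u into the factorization A = X * Y'. The shapes, random stand-in factors, and variable names are illustrative assumptions, not Mahout's API:

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, k = 50, 5

# Y: item-factor matrix from an already-finished ALS run (stand-in values here)
Y = rng.normal(size=(n_items, k))

# A_u: the new user's row of the rating matrix, mostly zeros
A_u = np.zeros(n_items)
A_u[[3, 17, 29]] = [4.0, 5.0, 3.0]

# Fold-in: X_u = A_u * Y * (Y'Y)^-1, i.e. the least-squares solution of
# Y @ X_u ~= A_u, computed with a linear solve rather than an explicit inverse
X_u = np.linalg.solve(Y.T @ Y, Y.T @ A_u)

# Score every item against the folded-in user factor
scores = Y @ X_u
```

A real ALS-WR fold-in would presumably also add the regularization term (lambda times the user's rating count, on the diagonal of Y'Y) before solving, matching whatever the training step used.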

Re: cross recommender

2013-04-10 Thread Ted Dunning
On Wed, Apr 10, 2013 at 10:38 AM, Pat Ferrel p...@occamsmachete.com wrote: Does anyone know of a public data set that provides things like views and purchases? I don't.

Re: cross recommender

2013-04-10 Thread Koobas
Retail data may be hard or impossible to get, but one can improvise. It seems to be fairly common to use Wikipedia articles (Myrrix, GraphLab). Another idea is to use StackOverflow tags (Myrrix examples), although those are only good for emulating implicit feedback. On Wed, Apr 10, 2013 at 6:48 PM, Ted

Re: cross recommender

2013-04-10 Thread Pat Ferrel
I have retail data but can't publish results from it. If I could get a public sample I'd share how the technique worked out. Not sure how to simulate this data. It has the important characteristic that every purchase is also a view but not the other way around and Ted's technique is a way to