Re: log-likelihood ratio value in item similarity calculation
I got 168 because I used log base 2 instead of e. If memory serves, the definition of entropy I read said people normally use base 2, so I just assumed the code did too (my bad). I now have a better understanding, so thank you both for the explanation.

On Fri, Apr 12, 2013 at 6:01 AM, Sean Owen <sro...@gmail.com> wrote:

Yes, I also get (er, Mahout gets) 117 (116.69), FWIW. I think the second question concerned counts vs. relative frequencies -- normalized or not, i.e. whether you divide all the counts by their sum. For a fixed set of observations that does change the LLR, because it is unnormalized, not because the situation has changed. Obviously you're right that the changing situations you describe do entail a change in LLR!

On Thu, Apr 11, 2013 at 10:52 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

These numbers don't match what I get. I get LLR = 117. This is wildly anomalous, so this pair should definitely be connected. Both items are quite rare (15/300,000 or 20/300,000 rates) but they occur together most of the time that they appear.

On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai <baizh...@gmail.com> wrote:

Hi, the counts for the two events are:

                     Event A    Everything but A
  Event B            k11 = 7    k12 = 8
  Everything but B   k21 = 13   k22 = 300,000

According to the code, I get:

  rowEntropy    = entropy(7, 8) + entropy(13, 300,000)  = 222
  colEntropy    = entropy(7, 13) + entropy(8, 300,000)  = 152
  matrixEntropy = entropy(7, 8, 13, 300,000)            = 458

thus:

  LLR = 2.0 * (458 - 222 - 152) = 168
  similarityScore = 1 - 1/(1 + 168) = 0.994

My problem is that the similarity scores I get for all the items are this high, which makes it hard to identify the genuinely similar ones. As you can see, the counts for events A and B are quite small while the total count k22 is quite high, and this pattern is common in my dataset. So my question is: what kind of adjustment could I make to bring the similarity scores down to a more reasonable range? Please shed some light -- thanks in advance!
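For concreteness, here is a minimal, self-contained Java sketch of the unnormalized-entropy calculation described in this thread. It is written from the formulas quoted above, not copied from the Mahout source, so treat it as an illustration. Run as-is it reproduces both figures: ~116.69 with natural logs, as Ted and Sean report, and ~168 when converted to base 2.

  public class LlrSketch {

      // Unnormalized entropy over raw counts: -sum_i k_i * ln(k_i / N),
      // where N is the sum of the counts (no division by N up front).
      static double entropy(long... counts) {
          long n = 0;
          for (long k : counts) {
              n += k;
          }
          double h = 0.0;
          for (long k : counts) {
              if (k > 0) {
                  h -= k * Math.log((double) k / n);
              }
          }
          return h;
      }

      public static void main(String[] args) {
          long k11 = 7, k12 = 8, k21 = 13, k22 = 300000;
          double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
          double colEntropy = entropy(k11, k21) + entropy(k12, k22);
          double matrixEntropy = entropy(k11, k12, k21, k22);
          double llr = 2.0 * (matrixEntropy - rowEntropy - colEntropy);

          System.out.printf("LLR (natural log): %.2f%n", llr);               // ~116.69
          System.out.printf("LLR (base 2):      %.2f%n", llr / Math.log(2)); // ~168
          System.out.printf("similarity:        %.4f%n", 1.0 - 1.0 / (1.0 + llr));
      }
  }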
Re: log-likelihood ratio value in item similarity calculation
Yes, that's true, it is more usually bits. Here it's natural log / nats. Since it's unnormalized anyway, another constant factor doesn't hurt, and it means not having to change the base.

On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai <baizh...@gmail.com> wrote:

I got 168 because I used log base 2 instead of e. If memory serves, the definition of entropy I read said people normally use base 2, so I just assumed the code did too (my bad). I now have a better understanding, so thank you both for the explanation.
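A quick arithmetic check of the base conversion, using only the figures already in this thread: the two numbers are the same statistic in different units, since 168 bits x ln 2 ~= 168 x 0.6931 ~= 116.5 nats, matching the 116.69 reported above up to rounding in the 168.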
Java Code for PCA
I am having trouble understanding whether the following code is sufficient for running PCA. I have a sequence file of dense vectors that I am reading in, and then I am trying to run the following code:

  SSVDSolver pcaFactory = new SSVDSolver(conf, new Path(vectorsFolder),
      new Path(pcaOutput), 18, 5, 3, 10);
  pcaFactory.setPcaMeanPath(pcaFactory.getPcaMeanPath());
  pcaFactory.run();

Is this enough for PCA, or does anyone have example code they are willing to share that shows how PCA works using the SSVD solver?
Re: cross recommender
That looks like the best shortcut. It is one of the few places where the rows of one matrix and the columns of the other are seen together. Now I know why you transpose the first input :-)

But I have begun to wonder whether it is the right thing to do for a cross recommender, because you are comparing a purchase vector to all view vectors -- like comparing Honey Crisp to Cameo (couldn't resist an apples-to-apples joke). Co-occurrence makes sense, but does cosine or log-likelihood? Maybe...

On Apr 11, 2013, at 10:49 AM, Sebastian Schelter <s...@apache.org> wrote:

> Do I have to create a SimilarityJob(matrixB, matrixA, similarityType) to get this, or have I missed something already in Mahout?

It could be worth investigating whether MatrixMultiplicationJob could be extended to compute similarities instead of dot products.

Best,
Sebastian
Re: Java Code for PCA
No, this is not right. I will explain later when I have a moment.

On Apr 12, 2013 8:08 AM, Chirag Lakhani <clakh...@zaloni.com> wrote:

I am having trouble understanding whether the following code is sufficient for running PCA. I have a sequence file of dense vectors that I am reading in, and then I am trying to run the following code:

  SSVDSolver pcaFactory = new SSVDSolver(conf, new Path(vectorsFolder),
      new Path(pcaOutput), 18, 5, 3, 10);
  pcaFactory.setPcaMeanPath(pcaFactory.getPcaMeanPath());
  pcaFactory.run();

Is this enough for PCA, or does anyone have example code they are willing to share that shows how PCA works using the SSVD solver?
Re: Java Code for PCA
On Fri, Apr 12, 2013 at 8:42 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> No, this is not right. I will explain later when I have a moment.
>
> On Apr 12, 2013 8:08 AM, Chirag Lakhani <clakh...@zaloni.com> wrote:
>
> I am having trouble understanding whether the following code is sufficient for running PCA. I have a sequence file of dense vectors that I am reading in, and then I am trying to run the following code:
>
>   SSVDSolver pcaFactory = new SSVDSolver(conf, new Path(vectorsFolder),
>       new Path(pcaOutput), 18, 5, 3, 10);
>   pcaFactory.setPcaMeanPath(pcaFactory.getPcaMeanPath());

The SSVD solver doesn't compute the PCA mean -- it requires it. This line therefore achieves nothing. SSVDCli.java computes the PCA mean using DistributedRowMatrix and passes it over to SSVDSolver; this behavior is switched on by the -pca option. See the SSVDCli code for details.

-d

>   pcaFactory.run();
>
> Is this enough for PCA, or does anyone have example code they are willing to share that shows how PCA works using the SSVD solver?
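To make the fix concrete, here is a rough sketch of the flow Dmitriy describes, modeled on what SSVDCli does. Treat it as an outline, not a drop-in replacement: helper names such as columnMeans() and SSVDHelper.saveVector(), the solver's constructor signature, the temp-path handling, and the numRows/numCols placeholders are assumptions that may differ across Mahout versions, so check SSVDCli.java in your version. The key change from the original snippet is that the mean is computed first and only then handed to the solver:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;
  import org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper;
  import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

  Configuration conf = new Configuration();
  Path input = new Path(vectorsFolder);
  Path output = new Path(pcaOutput);
  Path tmp = new Path(pcaOutput, "tmp");
  Path meanPath = new Path(tmp, "xi");  // where the column-mean vector is persisted

  // 1) Compute the column means of the input -- this is the "PCA mean" that
  //    SSVDCli computes with DistributedRowMatrix when -pca is given.
  DistributedRowMatrix a = new DistributedRowMatrix(input, tmp, numRows, numCols);
  a.setConf(conf);
  Vector xi = a.columnMeans();
  SSVDHelper.saveVector(xi, meanPath, conf);  // persist the mean for the solver

  // 2) Run SSVD with the mean supplied; implicitly subtracting it is what
  //    turns the SSVD into a PCA decomposition.
  SSVDSolver solver = new SSVDSolver(conf, new Path[] {input}, output, 18, 5, 3, 10);
  solver.setPcaMeanPath(meanPath);
  solver.run();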
Re: log-likelihood ratio value in item similarity calculation
The only virtue of using the natural base is that you get a nice asymptotic distribution for random data (with natural logs, the statistic 2 * (matrixEntropy - rowEntropy - colEntropy) is asymptotically chi-squared with one degree of freedom under independence).

On Fri, Apr 12, 2013 at 1:10 AM, Sean Owen <sro...@gmail.com> wrote:

Yes, that's true, it is more usually bits. Here it's natural log / nats. Since it's unnormalized anyway, another constant factor doesn't hurt, and it means not having to change the base.

On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai <baizh...@gmail.com> wrote:

I got 168 because I used log base 2 instead of e. If memory serves, the definition of entropy I read said people normally use base 2, so I just assumed the code did too (my bad). I now have a better understanding, so thank you both for the explanation.
Re: cross recommender
Log-likelihood similarity is a bit of a force-fit of the concept of the LLR. It is basically a binarizing and sparsifying filter applied to cooccurrence counts. As such, it is eminently suited to implementation using a matrix multiply.

On Fri, Apr 12, 2013 at 8:35 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

That looks like the best shortcut. It is one of the few places where the rows of one matrix and the columns of the other are seen together. Now I know why you transpose the first input :-)

But I have begun to wonder whether it is the right thing to do for a cross recommender, because you are comparing a purchase vector to all view vectors -- like comparing Honey Crisp to Cameo (couldn't resist an apples-to-apples joke). Co-occurrence makes sense, but does cosine or log-likelihood? Maybe...

On Apr 11, 2013, at 10:49 AM, Sebastian Schelter <s...@apache.org> wrote:

> Do I have to create a SimilarityJob(matrixB, matrixA, similarityType) to get this, or have I missed something already in Mahout?

It could be worth investigating whether MatrixMultiplicationJob could be extended to compute similarities instead of dot products.

Best,
Sebastian
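A toy illustration of "binarizing and sparsifying": the matrix multiply produces the raw cross-cooccurrence count, and the LLR is used only to decide whether each entry survives. All names here are hypothetical and the mapping of purchases and views onto the 2x2 table is one plausible bookkeeping, not Mahout's implementation; entropy(...) is the helper from the LlrSketch example earlier in this digest.

  // Given the cooccurrence count for one (item A, item B) cell, decide
  // whether to keep it. Keeping yields a 1 in the filtered matrix (binarize),
  // dropping removes the entry entirely (sparsify).
  static boolean keep(long cooccur, long purchasesOfA, long viewsOfB,
                      long totalUsers, double threshold) {
      long k11 = cooccur;                                        // purchased A and viewed B
      long k12 = purchasesOfA - cooccur;                         // purchased A only
      long k21 = viewsOfB - cooccur;                             // viewed B only
      long k22 = totalUsers - purchasesOfA - viewsOfB + cooccur; // neither
      double llr = 2.0 * (entropy(k11, k12, k21, k22)
          - entropy(k11, k12) - entropy(k21, k22)    // row entropy
          - entropy(k11, k21) - entropy(k12, k22));  // column entropy
      return llr > threshold;
  }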
Feature reduction for LibLinear weights
Hi all,

We're (ab)using LibLinear (a linear SVM) as a multi-class classifier, with 200+ labels and 400K features. This results in a model that's 800MB, which is a bit unwieldy. Unfortunately, LibLinear uses a full array of weights (nothing sparse), being a port from the C version.

I could do feature reduction (removing rows from the matrix) with Mahout prior to training the model, but I'd prefer to reduce the (in-memory) n x m array of weights. Any suggestions for approaches to take?

Thanks,

-- Ken

Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
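No reply appears in this digest, so the following is only an illustrative sketch of the in-memory reduction being asked about, not an answer from the thread: prune near-zero entries of the dense weight array into a sparse per-label structure. The size figure checks out, for scale: 200 labels x 400K features x 8 bytes per double is already ~640MB before JVM overhead. The feature-major weight layout (w[f * numLabels + label], as in C LIBLINEAR) is an assumption to verify against the Java port, and the threshold and class names are hypothetical.

  import java.util.ArrayList;
  import java.util.List;

  final class SparseWeights {
      final int[][] indices;    // surviving feature indices per label, ascending
      final double[][] values;  // matching weights per label

      SparseWeights(double[] dense, int numLabels, int numFeatures, double eps) {
          indices = new int[numLabels][];
          values = new double[numLabels][];
          for (int label = 0; label < numLabels; label++) {
              List<Integer> kept = new ArrayList<>();
              for (int f = 0; f < numFeatures; f++) {
                  // assumed feature-major layout: w[f * numLabels + label]
                  if (Math.abs(dense[f * numLabels + label]) > eps) {
                      kept.add(f);
                  }
              }
              indices[label] = new int[kept.size()];
              values[label] = new double[kept.size()];
              for (int i = 0; i < kept.size(); i++) {
                  int f = kept.get(i);
                  indices[label][i] = f;
                  values[label][i] = dense[f * numLabels + label];
              }
          }
      }

      // Decision value for one label given a sparse input (featIdx sorted ascending).
      double score(int label, int[] featIdx, double[] featVal) {
          double s = 0.0;
          int[] ind = indices[label];
          double[] val = values[label];
          // two-pointer walk over the two sorted index lists
          for (int i = 0, j = 0; i < ind.length && j < featIdx.length; ) {
              if (ind[i] == featIdx[j]) {
                  s += val[i] * featVal[j];
                  i++;
                  j++;
              } else if (ind[i] < featIdx[j]) {
                  i++;
              } else {
                  j++;
              }
          }
          return s;
      }
  }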