Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Phoenix Bai
I got 168 because I used log base 2 instead of e.
If memory serves right, I read in the definition of entropy that people
normally use base 2, so I just assumed the code used base 2. (My bad.)

And now I have a better understanding, so thank you both for the
explanation.


On Fri, Apr 12, 2013 at 6:01 AM, Sean Owen sro...@gmail.com wrote:

 Yes I also get (er, Mahout gets) 117 (116.69), FWIW.

 I think the second question concerned counts vs. relative frequencies --
 normalized or not, i.e. whether you divide all the counts by their sum.
 For a fixed set of observations, normalizing does change the LLR -- because
 the statistic works on raw counts -- not because the situation has changed.

 Obviously you're right that the changing situations you describe do
 entail a change in LLR!

 On Thu, Apr 11, 2013 at 10:52 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  These numbers don't match what I get.
 
  I get LLR = 117.
 
  This is wildly anomalous so this pair should definitely be connected.
  Both
  items are quite rare (15/300,000 or 20/300,000 rates) but they occur
  together most of the time that they appear.
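 
  (Concretely, from those counts: event A occurs 7 + 13 = 20 times and
  event B occurs 7 + 8 = 15 times out of roughly 300,028 observations, so
  under independence you would expect them to co-occur about
  20 * 15 / 300,028 = 0.001 times. They actually co-occur 7 times, several
  thousand times more often than chance would predict.)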
 
 
 
  On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai baizh...@gmail.com wrote:
 
  Hi,
 
  the counts for two events are:
                       Event A       Everything but A
  Event B              k11 = 7       k12 = 8
  Everything but B     k21 = 13      k22 = 300,000
  according to the code, I will get:
 
  rowEntropy = entropy(7, 8) + entropy(13, 300,000) = 222
  colEntropy = entropy(7, 13) + entropy(8, 300,000) = 152
  matrixEntropy = entropy(7, 8, 13, 300,000) = 458
 
  thus,
 
  LLR = 2.0 * (458 - 222 - 152) = 168
  similarityScore = 1 - 1/(1 + 168) = 0.994
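 
  For reference, here is a small self-contained Java sketch of the same
  computation (not the Mahout class itself -- the class and method names are
  just for illustration). It uses natural log; dividing the result by ln(2)
  gives the base-2 value:
 
  public final class LlrSketch {
 
    // "Unnormalized" Shannon entropy over raw counts: H = N*ln(N) - sum(x*ln(x))
    static double entropy(long... counts) {
      long sum = 0;
      double sumXLogX = 0.0;
      for (long x : counts) {
        sum += x;
        if (x > 0) {
          sumXLogX += x * Math.log(x);
        }
      }
      return sum > 0 ? sum * Math.log(sum) - sumXLogX : 0.0;
    }
 
    // LLR = 2 * (matrixEntropy - rowEntropy - colEntropy), as in the mail above
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
      double colEntropy = entropy(k11, k21) + entropy(k12, k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      return 2.0 * (matrixEntropy - rowEntropy - colEntropy);
    }
 
    public static void main(String[] args) {
      double llr = logLikelihoodRatio(7, 8, 13, 300000);
      System.out.println("LLR (nats) = " + llr);               // ~116.7
      System.out.println("LLR (bits) = " + llr / Math.log(2)); // ~168.4
      System.out.println("similarity = " + (1.0 - 1.0 / (1.0 + llr))); // ~0.99
    }
  }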
 
  So, my problem is:
  the similarity scores I get for the items are all this high, which makes it
  hard to identify the truly similar ones.
 
  As you can see, the counts of events A and B are quite small while the
  total count k22 is quite high, and this pattern is quite common in
  my dataset.
 
  So, my question is:
  what kind of adjustment could I make to bring the similarity scores down to
  a more reasonable range?
 
  Please shed some light on this; thanks in advance!
 



Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Sean Owen
Yes, that's true, it is more usually bits. Here it's natural log / nats.
Since it's unnormalized anyway, another constant factor doesn't hurt, and it
means not having to change the base.
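
(Concretely, the two numbers in this thread differ only by that constant
factor: 116.7 nats / ln(2) = 116.7 / 0.693 ~ 168.4 bits.)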


On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai baizh...@gmail.com wrote:

 I got 168 because I used log base 2 instead of e.
 If memory serves right, I read in the definition of entropy that people
 normally use base 2, so I just assumed the code used base 2. (My bad.)

 And now I have a better understanding, so thank you both for the
 explanation.




Java Code for PCA

2013-04-12 Thread Chirag Lakhani
I am having trouble understanding whether the following code is sufficient
for running PCA

I have a sequence file of dense vectors that I am calling and then I am
trying to run the following code

SSVDSolver pcaFactory = new SSVDSolver(conf, new Path(vectorsFolder), new
Path(pcaOutput),18,5,3,10);


pcaFactory.setPcaMeanPath(pcaFactory.getPcaMeanPath());

pcaFactory.run();


Is this enough for PCA, or does anyone have example code they are willing to
share that shows how PCA works using the SSVD solver?


Re: cross recommender

2013-04-12 Thread Pat Ferrel
That looks like the best shortcut. It is one of the few places where the rows 
of one and the columns of the other are seen together. Now I know why you 
transpose the first input :-)

But, I have begun to wonder whether it is the right thing to do for a cross 
recommender because you are comparing a purchase vector to all view 
vectors--like comparing Honey Crisp to Cameo (couldn't resist an apples to 
apples joke). Co-occurrence makes sense but does cosine or log-likelihood? 
Maybe...

 
On Apr 11, 2013, at 10:49 AM, Sebastian Schelter s...@apache.org wrote:

 Do I have to create a SimilarityJob( matrixB, matrixA, similarityType
) to get this or have I missed something already in Mahout?

It could be worth investigating whether MatrixMultiplicationJob could
be extended to compute similarities instead of dot products.

Best,
Sebastian



Re: Java Code for PCA

2013-04-12 Thread Dmitriy Lyubimov
No, this is not right.

I will explain later when I have a moment.
On Apr 12, 2013 8:08 AM, Chirag Lakhani clakh...@zaloni.com wrote:

 I am having trouble understanding whether the following code is sufficient
 for running PCA

 I have a sequence file of dense vectors that I am calling and then I am
 trying to run the following code

 SSVDSolver pcaFactory = new SSVDSolver(conf, new Path(vectorsFolder), new
 Path(pcaOutput),18,5,3,10);


 pcaFactory.setPcaMeanPath(pcaFactory.getPcaMeanPath());

 pcaFactory.run();


 Is this enough for PCA, or does anyone have example code they are willing to
 share that shows how PCA works using the SSVD solver?



Re: Java Code for PCA

2013-04-12 Thread Dmitriy Lyubimov
On Fri, Apr 12, 2013 at 8:42 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 No, this is not right.

 I will explain later when I have a moment.
 On Apr 12, 2013 8:08 AM, Chirag Lakhani clakh...@zaloni.com wrote:

 I am having trouble understanding whether the following code is sufficient
 for running PCA

 I have a sequence file of dense vectors that I am calling and then I am
 trying to run the following code

 SSVDSolver pcaFactory = new SSVDSolver(conf, new Path(vectorsFolder), new
 Path(pcaOutput),18,5,3,10);


 pcaFactory.setPcaMeanPath(pcaFactory.getPcaMeanPath());

The SSVD solver doesn't compute the PCA mean -- it requires it. This line
therefore achieves nothing.

SSVDCli.java computes the PCA mean using DistributedRowMatrix and passes it
over to the SSVD solver. This behavior is switched on by the -pca option. See
the SSVDCli code for details.
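
Roughly, the call sequence would need to look something like the sketch below
(mirroring the constructor call from the snippet above; the mean path and the
job that writes it are placeholders -- in practice SSVDCli takes care of this
for you when -pca is given):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

public class PcaWithSsvd {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path(args[0]);    // sequence file of dense vectors
    Path pcaOutput = new Path(args[1]);  // output directory

    // Placeholder: a column-mean vector that has already been computed and
    // written out by a separate job. SSVDSolver does not compute this itself.
    Path meanPath = new Path(args[2]);

    SSVDSolver solver = new SSVDSolver(conf, new Path[] { vectors }, pcaOutput,
        18, 5, 3, 10);  // numeric arguments as in the snippet above; the input
                        // may be a single Path in some Mahout versions
    solver.setPcaMeanPath(meanPath);  // hand the solver the precomputed mean
    solver.run();
  }
}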

-d


 pcaFactory.run();


 Is this enough for PCA or does anyone have example code they are willing
 to
 share to see how PCA works using the SSVD solver.




Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Ted Dunning
The only virtue of using the natural base is that you get a nice asymptotic
distribution for random data.
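
(With the natural log and the factor of 2, this is the classical G-test
statistic, which for a 2x2 table is asymptotically chi-squared with one
degree of freedom under independence; changing the base just rescales it
and loses that interpretation.)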




On Fri, Apr 12, 2013 at 1:10 AM, Sean Owen sro...@gmail.com wrote:

 Yes, that's true, it is more usually bits. Here it's natural log / nats.
 Since it's unnormalized anyway, another constant factor doesn't hurt, and it
 means not having to change the base.


 On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai baizh...@gmail.com wrote:

 I got 168 because I used log base 2 instead of e.
 If memory serves right, I read in the definition of entropy that people
 normally use base 2, so I just assumed the code used base 2. (My bad.)

 And now I have a better understanding, so thank you both for the
 explanation.




Re: cross recommender

2013-04-12 Thread Ted Dunning
Log-likelihood similarity is a bit of a force-fit of the concept of the
LLR.  It is basically a binarizing and sparsifying filter applied to
cooccurrence counts.

As such, it is eminently suited to implementation using a matrix multiply.
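
A toy in-memory sketch of that idea (the method name and threshold here are
illustrative only, and logLikelihoodRatio() is the helper sketched earlier in
the thread): once the matrix multiply has produced a raw cooccurrence count
for an item pair, score it against its margins and keep it only if it is
anomalously large.

  // cooc       = times the two items were seen together
  // rowTotal   = total occurrences of the first item
  // colTotal   = total occurrences of the second item
  // grandTotal = total number of observations
  static boolean keep(long cooc, long rowTotal, long colTotal,
                      long grandTotal, double threshold) {
    long k11 = cooc;                                      // both together
    long k12 = rowTotal - cooc;                           // first without second
    long k21 = colTotal - cooc;                           // second without first
    long k22 = grandTotal - rowTotal - colTotal + cooc;   // neither
    return logLikelihoodRatio(k11, k12, k21, k22) > threshold;  // binarize + sparsify
  }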


On Fri, Apr 12, 2013 at 8:35 AM, Pat Ferrel p...@occamsmachete.com wrote:

 That looks like the best shortcut. It is one of the few places where the
 rows of one and the columns of the other are seen together. Now I know why
 you transpose the first input :-)

 But, I have begun to wonder whether it is the right thing to do for a
 cross recommender because you are comparing a purchase vector to all view
 vectors--like comparing Honey Crisp to Cameo (couldn't resist an apples to
 apples joke). Co-occurrence makes sense but does cosine or log-likelihood?
 Maybe...


 On Apr 11, 2013, at 10:49 AM, Sebastian Schelter s...@apache.org wrote:

  Do I have to create a SimilarityJob( matrixB, matrixA, similarityType
 ) to get this or have I missed something already in Mahout?

 It could be worth investigating whether MatrixMultiplicationJob could
 be extended to compute similarities instead of dot products.

 Best,
 Sebastian




Feature reduction for LibLinear weights

2013-04-12 Thread Ken Krugler
Hi all,

We're (ab)using LibLinear (linear SVM) as a multi-class classifier, with 200+ 
labels and 400K features.

This results in a model that's around 800MB, which is a bit unwieldy. Unfortunately
LibLinear uses a full array of weights (nothing sparse), being a port from the
C version.

I could do feature reduction (removing rows from the matrix) with Mahout prior 
to training the model, but I'd prefer to reduce the (in memory) nxm array of 
weights.

Any suggestions for approaches to take?

Thanks,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr