Re: Plotting cluster quality

2013-02-24 Thread David Murgatroyd
What does color mean here? What about width of the box?
FWIW, I infer color is solely for visual distinction -- rotating through
orange, red, yellow, pink from left to right. I infer width is proportional
to count of items in each cluster, though apparently not linearly.

I agree that a single plot comparing the algorithms is important since the
purpose of the plot is to compare the algorithms rather than better
understand the data on which they've been run. I haven't thought of a good
way to do that while still having a cluster-by-cluster visual element.

On Fri, Feb 22, 2013 at 12:47 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 What does color mean here?

 What about width of the box?

 When you say median or mean of all cluster distances, do you mean across
 that single run?

 I think that this plot is fine as it is except that it needs a legend that
 explains all of these issues.  My general rule of thumb is that most
 figures should have what I call a Kipling caption.  See the caption of
 the first image here: http://www.boop.org/jan/justso/butter.htm to see
 what
 I mean by this.  Imagine that there is a very mathematically inclined
 four-year-old who is looking at your diagram and quizzing you about every
 part.  Answer all their questions in the caption and you have a Kipling
 caption.

 For comparing different runs of the clustering or different algorithms, I
 think that a cumulative distribution plot (using plot.ecdf) with all of the
 different algorithms on one plot would be the best comparison tool.
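[plot.ecdf is an R function; the same comparison can be sketched in Python. The algorithm names and distance values below are invented for illustration -- the real data would come from the per-cluster averages in the CSV.]

```python
import numpy as np

def ecdf(values):
    """Return sorted values and cumulative fractions for an empirical CDF."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical per-cluster average distances for two algorithms.
distances = {
    "km":   [0.9, 1.1, 1.4, 1.6, 2.0],
    "bskm": [0.7, 0.8, 1.0, 1.1, 1.3],
}

for name, d in distances.items():
    x, y = ecdf(d)
    # One step curve per algorithm, all on the same axes
    # (e.g. matplotlib's plt.step(x, y, where="post", label=name)).
    print(name, x[0], x[-1])
```

A curve that sits further left/higher dominates: more of its clusters have small average distance, which is exactly the one-plot comparison Ted describes.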

 On Fri, Feb 22, 2013 at 8:33 AM, Dan Filimon dangeorge.fili...@gmail.com
 wrote:

  As most of the regulars know, I'm working with Ted Dunning on a new
  clustering framework for Mahout that should land in 0.8.
 
  Part of my work is comparing the clustering quality of the new code
  with the existing Mahout implementation.
 
  I compiled a CSV of the quality data [1]. I ran 5 runs of the
  clustering on the 20 newsgroups data set comparing Mahout KMeans (km),
  Ball KMeans (bkm), Streaming KMeans (skm) and Streaming KMeans
  followed by Ball KMeans (bskm).
 
  I'm now looking at making some appealing plots for the data. For
  instance, I think I want to make box plots of individual clustering
  runs. Here's an example [2] of what a clustering looks like for one
  run of Mahout's standard k-means.
 
  There's a box for each cluster, the mean distance is the thick line,
  the limits are the 1st and 3rd quartiles and the whiskers are the min
  and max distances.
  The blue horizontal line is the mean of all average cluster distances.
  The green horizontal line is the median of all average cluster distances.
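[The statistics behind such a plot can be computed directly; here is a rough Python sketch. The `clusters` dict of point-to-centroid distances is entirely made up -- the real values would come from the clustering runs.]

```python
import numpy as np

# Hypothetical per-cluster point-to-centroid distances for one run.
clusters = {
    0: [0.2, 0.5, 0.7, 1.1],
    1: [0.3, 0.4, 0.9],
    2: [0.6, 0.8, 1.0, 1.4, 1.9],
}

stats = {}
for cid, d in clusters.items():
    d = np.asarray(d, dtype=float)
    stats[cid] = {
        "mean": d.mean(),                  # thick line in the box
        "q1": np.percentile(d, 25),        # lower box limit
        "q3": np.percentile(d, 75),        # upper box limit
        "whisker_lo": d.min(),             # whiskers are min and max
        "whisker_hi": d.max(),
    }

cluster_means = [s["mean"] for s in stats.values()]
blue_line = np.mean(cluster_means)     # mean of all average cluster distances
green_line = np.median(cluster_means)  # median of all average cluster distances
```

Note this differs from a default R boxplot, where the thick line is the per-group median rather than the mean.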
 
  I intend on making similar plots for the other runs and then
  aggregating the means of the runs into box plots for the different
  classes of k-means.
  The main result is that streaming k-means + ball k-means (as done
  in the MR) gives a high-quality clustering.
 
  How do you feel about this plot? Is it too dense? Too colorful? Should
  I not draw the median any more?
  What are some other good ways of plotting the quality given the data set?
 
  Thanks!
 
  [1]
 
 https://github.com/dfilimon/mahout/blob/skm/examples/src/main/resources/kmeans-comparison-nospace.csv
  [2]
 
 http://swarm.cs.pub.ro/~dfilimon/skm-mahout/Mahout%20KMeans%20Run%201.pdf
 



Re: Plotting cluster quality

2013-02-24 Thread Ted Dunning
I spoke off-line to Dan and he confirmed your inference.  The color is just
there for visual aesthetics.

On Sun, Feb 24, 2013 at 6:18 AM, David Murgatroyd dmu...@gmail.com wrote:

 What does color mean here? What about width of the box?
 FWIW, I infer color is solely for visual distinction -- rotating through
 orange, red, yellow, pink from left to right. I infer width is proportional
 to count of items in each cluster, though apparently not linearly.



Re: Cross recommendation

2013-02-24 Thread Sean Owen
I may not be 100% following the thread, but:

Similarity metrics won't care whether some items are really actions and
some items are items. The math is the same. The problem which you may be
alluding to is the one I mentioned earlier -- there is no connection
between item and item-action in the model, when there plainly is in real
life. The upside is what Ted mentioned: you get to treat actions like views
separately from purchases, and yes, it's certain those aren't the same
thing in real life. YMMV.

The piece of code you're playing with has nothing to do with latent factor
models and won't learn weights. It's going to assume by default that all
items (+actions) are equal.

(user+action,item) doesn't make sense. You compute item-item similarity
from (user,item+action) data. Some of the results are really item-action
similarities or action-action. It may be useful, maybe not, to know these
things too but you can just look at item-item if you want.
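[Sean's point -- that the similarity math treats item and item+action columns uniformly -- can be sketched numerically. The user-by-(item,action) matrix and column labels below are invented:]

```python
import numpy as np

# Rows are users; columns are (item, action) pairs.
cols = ["A:view", "A:buy", "B:view", "B:buy"]
M = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Cosine similarity between columns. The math never distinguishes
# "real" items from actions; some entries are item-action or
# action-action similarities, and you can simply ignore those rows
# and columns if you only want item-item.
norms = np.linalg.norm(M, axis=0)
sim = (M.T @ M) / np.outer(norms, norms)

print(dict(zip(cols, sim[cols.index("A:view")])))
```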



On Sun, Feb 24, 2013 at 4:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Yes I understand that you need (user, item+action) input for user based
 recs returned from recommender.recommend(userID, n).

 But can you expect item similarity to work with the same input? I am fuzzy
 about how item similarity is calculated in cf/taste.

 I was expecting to train one recommender with (user, item+action) and call
 recommender1.recommend(userID, n) to get recs but also train another
 recommender with (user+action, item) to get recommender2.mostSimilarItems(
 itemID, n). I realize it's a hack, but that aside, is this second
 recommender required? I'd expect it to return items that use all actions to
 calculate similarity and therefore will use view information to improve the
 similarity calculation.

 No?


 On Feb 23, 2013, at 10:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 No.

 It is uniformly better to have (item+action, user).  In fact, I would
 prefer to have it the other way around when describing it to match the
 matrix row x column convention.

 (user, item+action) where action is binary leads to A = [A_1 | A_2] = user
 by 2xitem.  The alternative of (user+action, item) leads to

     [ A_1 ]
 A = [     ] = 2x user by item
     [ A_2 ]

 This last form doesn't have a uniform set of users to connect the items
 together.  When you compute the cooccurrence matrix you get A_1' A_1 + A_2'
 A_2, which gives you recommendations 1=>1 and 2=>2, but no
 recommendations 1=>2 or 2=>1.  Thus, no cross recommendations.



 On Sat, Feb 23, 2013 at 10:39 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  But the discussion below led me to realize that cf/taste is doing
  something in addition to [B'B] h_p, which returns user-history-based
 recs.
  I'm getting better results currently from item similarity based recs,
 which
  I blend with user-history based recs. To get item similarity based recs
  cf/taste is using a similarity metric and I'd guess that it uses the
 input
  matrix to get these results (something like the dot product for cosine).
  For item similarity should I create a training set of (item,
 user+action)?




Re: Naive Bayes Classifier - Scores

2013-02-24 Thread Seetha


Ramprakash Ramamoorthy youngestachiever at gmail.com writes:

 
 Dear all,
 
  I am performing a sentiment analysis using the naive Bayes
 classifier in Apache Mahout. Every time I get a result, I get a
 category and a score corresponding to that category.
 
  Can someone here enlighten me on the score? For instance, what is the
 maximum score for a given category (I have two categories - positive and
 negative)? Based on the score, can I categorize results as very positive,
 very negative, etc.? Any input regarding this would be helpful.
 Thank you.
 

Hi Ramprakash,

Have you resolved the problem you mentioned earlier?

Thank you.