Re: Plotting cluster quality
What does color mean here? What about width of the box?

FWIW, I infer color is solely for visual distinction -- rotating through orange, red, yellow, pink from left to right. I infer width is proportional to the count of items in each cluster, though apparently not linearly.

I agree that a single plot comparing the algorithms is important, since the purpose of the plot is to compare the algorithms rather than to better understand the data on which they've been run. I haven't thought of a good way to do that while still keeping a cluster-by-cluster visual element.

On Fri, Feb 22, 2013 at 12:47 PM, Ted Dunning ted.dunn...@gmail.com wrote:

> What does color mean here? What about width of the box? When you say median or mean of all cluster distances, do you mean across that single run?
>
> I think that this plot is fine as it is, except that it needs a legend that explains all of these issues. My general rule of thumb is that most figures should have what I call a Kipling caption. See the caption of the first image here: http://www.boop.org/jan/justso/butter.htm to see what I mean by this. Imagine that there is a very mathematically inclined 4 year old who is looking at your diagram and quizzing you about every part. Answer all their questions in the caption and you have a Kipling caption.
>
> For comparing different runs of the clustering or different algorithms, I think that a cumulative distribution plot (using plot.ecdf) with all of the different algorithms on one plot would be the best comparison tool.
>
> On Fri, Feb 22, 2013 at 8:33 AM, Dan Filimon dangeorge.fili...@gmail.com wrote:
>
>> As most of the regulars know, I'm working with Ted Dunning on a new clustering framework for Mahout that should land in 0.8. Part of my work is comparing the clustering quality of the new code with the existing Mahout implementation. I compiled a CSV of the quality data [1].
>>
>> I ran 5 runs of the clustering on the 20 newsgroups data set, comparing Mahout KMeans (km), Ball KMeans (bkm), Streaming KMeans (skm), and Streaming KMeans followed by Ball KMeans (bskm). I'm now looking at making some appealing plots of the data. For instance, I think I want to make box plots of individual clustering runs. Here's an example [2] of what a clustering looks like for one run of Mahout's standard k-means. There's a box for each cluster: the thick line is the mean distance, the box limits are the 1st and 3rd quartiles, and the whiskers are the min and max distances. The blue horizontal line is the mean of all average cluster distances. The green horizontal line is the median of all average cluster distances.
>>
>> I intend to make similar plots for the other runs and then aggregate the means of the runs into box plots for the different classes of k-means, the main result being that streaming k-means + ball k-means (as done in the MR version) gives a high-quality clustering.
>>
>> How do you feel about this plot? Is it too dense? Too colorful? Should I stop drawing the median? What are some other good ways of plotting the quality, given the data set? Thanks!
>>
>> [1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/resources/kmeans-comparison-nospace.csv
>> [2] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/Mahout%20KMeans%20Run%201.pdf
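Ted's plot.ecdf suggestion refers to the R function; the same idea can be sketched in Python. This is only an illustration -- the hard-coded distance lists below are made-up placeholders, not values from the quality CSV, and the algorithm names are just the abbreviations used in the thread:

```python
import numpy as np

def ecdf(values):
    """Return sorted values and cumulative fractions, i.e. the empirical CDF."""
    xs = np.sort(np.asarray(values, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

# Hypothetical per-cluster average distances for two of the algorithms
# (in practice these would be read from the CSV in [1]).
km_dist = [0.82, 0.75, 0.91, 0.78, 0.85]
bskm_dist = [0.70, 0.66, 0.74, 0.69, 0.72]

for name, dist in [("km", km_dist), ("bskm", bskm_dist)]:
    xs, ys = ecdf(dist)
    print(name, xs, ys)
    # With matplotlib, all algorithms could share one axis, mirroring
    # R's plot.ecdf overlay:
    # plt.step(xs, ys, where="post", label=name)
```

Putting every algorithm's curve on one axis gives the single comparison plot discussed above: a curve that sits further left has uniformly smaller cluster distances.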
Re: Plotting cluster quality
I spoke off-line to Dan and he confirmed your inference. Color was just there for visual esthetics.

On Sun, Feb 24, 2013 at 6:18 AM, David Murgatroyd dmu...@gmail.com wrote:

> FWIW, I infer color is solely for visual distinction -- rotating through orange, red, yellow, pink from left to right. I infer width is proportional to the count of items in each cluster, though apparently not linearly.
Re: Cross recommendation
I may not be 100% following the thread, but: similarity metrics won't care whether some items are really actions and some items are items. The math is the same. The problem you may be alluding to is the one I mentioned earlier -- there is no connection between item and item-action in the model, when there plainly is in real life. The upside is what Ted mentioned: you get to treat actions like views separately from purchases, and yes, it's also certain those aren't the same thing in real life. YMMV.

The piece of code you're playing with has nothing to do with latent factor models and won't learn weights. It's going to assume by default that all items (+actions) are equal.

(user+action, item) doesn't make sense. You compute item-item similarity from (user, item+action) data. Some of the results are really item-action or action-action similarities. It may or may not be useful to know these things too, but you can just look at item-item if you want.

On Sun, Feb 24, 2013 at 4:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

> Yes, I understand that you need (user, item+action) input for user-based recs returned from recommender.recommend(userID, n). But can you expect item similarity to work with the same input? I am fuzzy about how item similarity is calculated in cf/taste.
>
> I was expecting to train one recommender with (user, item+action) and call recommender1.recommend(userID, n) to get recs, but also train another recommender with (user+action, item) to get recommender2.mostSimilarItems(itemID, n). I realize it's a hack, but that aside, is this second recommender required? I'd expect it to return items that use all actions to calculate similarity and therefore use view information to improve the similarity calculation. No?
>
> On Feb 23, 2013, at 10:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
>
>> No. It is uniformly better to have (item+action, user). In fact, I would prefer to have it the other way around when describing it, to match the matrix row x column convention.
>> (user, item+action), where action is binary, leads to
>>
>>     A = [ A_1 | A_2 ]    (user by 2 x item)
>>
>> The alternative, (user+action, item), leads to
>>
>>     A = [ A_1 ]
>>         [ A_2 ]          (2 x user by item)
>>
>> This last form doesn't have a uniform set of users to connect the items together. When you compute the cooccurrence matrix you get A_1' A_1 + A_2' A_2, which gives you recommendations from 1->1 and from 2->2, but no recommendations 1->2 or 2->1. Thus, no cross recommendations.
>>
>> On Sat, Feb 23, 2013 at 10:39 AM, Pat Ferrel pat.fer...@gmail.com wrote:
>>
>>> But the discussion below led me to realize that cf/taste is doing something in addition to [B'B] h_p, which returns user-history-based recs. I'm getting better results currently from item-similarity-based recs, which I blend with user-history-based recs. To get item-similarity-based recs, cf/taste is using a similarity metric, and I'd guess that it uses the input matrix to get these results (something like the dot product for cosine). For item similarity, should I create a training set of (item, user+action)?
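Ted's block-matrix argument can be checked numerically. A small sketch with NumPy (the random binary matrices A1 and A2 are made-up stand-ins for two action types, e.g. purchases and views):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 6, 4

# A1 = user-by-item matrix for action 1, A2 = the same users for action 2.
A1 = rng.integers(0, 2, size=(n_users, n_items))
A2 = rng.integers(0, 2, size=(n_users, n_items))

# (user, item+action): stack horizontally -> user by 2*item.
A_wide = np.hstack([A1, A2])
cooc_wide = A_wide.T @ A_wide          # (2*item) x (2*item) cooccurrence
cross = cooc_wide[:n_items, n_items:]  # off-diagonal block
assert np.array_equal(cross, A1.T @ A2)  # cross recommendations 1->2 exist

# (user+action, item): stack vertically -> 2*user by item.
A_tall = np.vstack([A1, A2])
cooc_tall = A_tall.T @ A_tall          # item x item cooccurrence
# Only the sum of the within-action terms survives; no 1->2 block anywhere.
assert np.array_equal(cooc_tall, A1.T @ A1 + A2.T @ A2)
```

The horizontal stacking keeps a single shared user axis, so the cooccurrence matrix contains the A_1' A_2 cross block; the vertical stacking collapses to A_1' A_1 + A_2' A_2 and loses it, exactly as described above.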
Re: Naive Bayes Classifier - Scores
Ramprakash Ramamoorthy youngestachiever at gmail.com writes:

> Dear all, I am performing sentiment analysis using the naive Bayes classifier on Apache Mahout. Every time I get a result, I get a category and a score corresponding to that category. Can someone here enlighten me on the score? What is the maximum score for a given category (I have two categories -- positive and negative)? Based on the score, can I categorize documents as very positive, very negative, etc.? Any input regarding the same would be helpful. Thank you.

Hi Ramprakash, have you resolved the problem you mentioned earlier? Thank you.
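The thread doesn't answer the original question, so here is one common approach, sketched in Python. If the per-class scores are log-likelihood-style values (as is typical for naive Bayes), they have no fixed maximum; to compare documents you can normalize the scores into probabilities and bucket by confidence. The 0.9/0.1 cutoffs below are arbitrary illustrations, not Mahout defaults:

```python
import math

def to_probabilities(log_scores):
    """Softmax over per-class log scores: a numerically stable normalization
    of unnormalized log-likelihoods into class probabilities."""
    m = max(log_scores.values())
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def bucket(p_positive):
    """Map the positive-class probability to a coarse sentiment label.
    Thresholds are illustrative only."""
    if p_positive >= 0.9:
        return "very positive"
    if p_positive >= 0.5:
        return "positive"
    if p_positive > 0.1:
        return "negative"
    return "very negative"

# Hypothetical raw scores for one document.
probs = to_probabilities({"positive": -120.3, "negative": -125.8})
print(bucket(probs["positive"]))  # prints: very positive
```

Note this only works if the raw scores really are comparable log scores for the same document; check what your classifier's score actually represents before relying on the probabilities.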