Re: Call to action – Mahout needs your help

2013-04-04 Thread Sebastian Schelter
Great to hear that you use Mahout in production! If you want to start working on it, you can either browse our jira issues or propose some issue to work on yourself. If you need some input, it would be awesome to enhance our ALS recommenders with cross-validation and tooling for finding a good reg

Re: Call to action – Mahout needs your help

2013-04-04 Thread Ted Dunning
We would love to have you! I will let others answer about things to do since I have to fly. On Fri, Apr 5, 2013 at 1:56 AM, Andrew Musselman wrote: > In case this thread is still a good place to reply with an offer to help, > I'd love to pitch in. I have built a few production recommenders, m

[jira] [Commented] (MAHOUT-998) Fix up the cluster-reuters.sh script

2013-04-04 Thread Suneel Marthi (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623338#comment-13623338 ] Suneel Marthi commented on MAHOUT-998: -- Grant, would you like me to take a stab on th

Re: Call to action – Mahout needs your help

2013-04-04 Thread Andrew Musselman
In case this thread is still a good place to reply with an offer to help, I'd love to pitch in. I have built a few production recommenders, most recently using Mahout at a large retailer along with my partner where we used ALS, with a pipeline of transforming transactions in XML into vectors using

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Ted Dunning
All of this doesn't normally matter when cosine distance is used since usually it is used with normalized vectors. For that set of vectors it is a measure. On Thu, Apr 4, 2013 at 11:25 PM, Andrew Musselman < andrew.mussel...@gmail.com> wrote: > I agree 1 is wrong :) > > > On Thu, Apr 4, 2013 at
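A minimal sketch of the point about normalized vectors, under stated assumptions: the class and method names below are hypothetical and are not Mahout's API. For unit-length vectors the cosine distance 1 - u.v equals half the squared Euclidean distance, so it is zero only when the two vectors are identical. A zero vector cannot be normalized, so the corner case discussed in this thread does not arise.

```java
// Hypothetical sketch; names are illustrative and not taken from Mahout.
final class UnitVectorCosineSketch {

  // Scales v to unit Euclidean length. A zero vector has no direction, so it is rejected.
  static double[] normalize(double[] v) {
    double sumOfSquares = 0.0;
    for (double x : v) {
      sumOfSquares += x * x;
    }
    if (sumOfSquares == 0.0) {
      throw new IllegalArgumentException("cannot normalize a zero vector");
    }
    double norm = Math.sqrt(sumOfSquares);
    double[] unit = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      unit[i] = v[i] / norm;
    }
    return unit;
  }

  // Cosine distance for inputs that are already unit length: 1 - u.v
  static double cosineDistance(double[] u, double[] v) {
    double dot = 0.0;
    for (int i = 0; i < u.length; i++) {
      dot += u[i] * v[i];
    }
    return 1.0 - dot;
  }

  // Half of the squared Euclidean distance; matches cosineDistance on unit vectors.
  static double halfSquaredEuclidean(double[] u, double[] v) {
    double sum = 0.0;
    for (int i = 0; i < u.length; i++) {
      double diff = u[i] - v[i];
      sum += diff * diff;
    }
    return sum / 2.0;
  }

  public static void main(String[] args) {
    double[] u = normalize(new double[] {3, 4});
    double[] v = normalize(new double[] {1, 0});
    // The two printed values should agree up to floating-point rounding.
    System.out.println(cosineDistance(u, v));
    System.out.println(halfSquaredEuclidean(u, v));
  }
}
```

Rejecting the zero vector in normalize() is one possible design choice; a caller that must handle empty rows would need its own convention.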

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Andrew Musselman
I agree 1 is wrong :) On Thu, Apr 4, 2013 at 2:22 PM, Dan Filimon wrote: > Ah, okay then. :) > I thought that you depend on the current convention that it returns 1. So, > disclaimers aside, you're fine with the change? > > > On Fri, Apr 5, 2013 at 12:20 AM, Sebastian Schelter < > ssc.o...@googl

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Sebastian Schelter
On 04.04.2013 23:22, Dan Filimon wrote: > Ah, okay then. :) > I thought that you depend on the current convention that it returns 1. So, > disclaimers aside, you're fine with the change? Yes, I concur that the distance between two identical vectors should be zero. > > > On Fri, Apr 5, 2013 at 1

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Dan Filimon
Ah, okay then. :) I thought that you depend on the current convention that it returns 1. So, disclaimers aside, you're fine with the change? On Fri, Apr 5, 2013 at 12:20 AM, Sebastian Schelter wrote: > You can ignore the recommender stuff for the DistanceMeasure classes, as > the recommenders u

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Sebastian Schelter
You can ignore the recommender stuff for the DistanceMeasure classes, as the recommenders use their own distance/similarity implementations. I just wanted to comment on the example that Andrew gave, to mention that there are some common pitfalls with modeling ratings/interactions. On 04.04.2013

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Dan Filimon
Right, that's fair. So, you're saying there needs to be a special value when both vectors are 0 for the recommender system to work? And that 0 means dislike, which isn't in fact accurate. You want to convey lack of information. But now, the code returns 1. Is that a special value? I'd guess it mea

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Sebastian Schelter
In recommender systems, it's dangerous to interpret "no interaction" as dislike. Think of all movies you never watched, do you really dislike them all? :) On 04.04.2013 23:03, Andrew Musselman wrote: > I agree; I mis-spoke before if I said "dislike". Zero to me means > literally nothing. No int

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Dan Filimon
I'm not familiar with the recommender code at all. I was only thinking of the clustering. How is dislike related to the cosine distance? Also, CosineDistanceMeasure isn't really behaving like a measure in this case (the whole d(x, x) = 0 thing). Maybe it makes sense to have a specific subclass spe

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Andrew Musselman
I agree; I mis-spoke before if I said "dislike". Zero to me means literally nothing. No interaction. Which could be either "don't like", "don't like today", "dislike", etc. Which adds to the meaninglessness of it. On Thu, Apr 4, 2013 at 2:00 PM, Sebastian Schelter wrote: > I think that in ou

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Sebastian Schelter
I think that in our recommender code, 0 should mean no rating or no interaction observed. I think modeling dislike with 0 creates a lot of unnecessary problems. On 04.04.2013 22:56, Andrew Musselman wrote: > I see the arguments for having it defined, just raising the point that it's > a very strange

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Andrew Musselman
I see the arguments for having it defined, just raising the point that it's a very strange spot to be in. If all users are zero except for one person who likes the lentil soup, then the other users are equally different from that person. The problem for me is the discontinuity Sean mentions, wher

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Sebastian Schelter
Dislike should not be modeled by a zero rating IMHO. This might also create problems with the iterateNonZero() method in our vectors. On 04.04.2013 22:40, Andrew Musselman wrote: > I think it should return an "undefined" symbol. There is no angle between > two zero vectors. > > In a practical

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Dan Filimon
While I agree that it's fairly meaningless mathematically, this ensures that the distance between two identical vectors is always 0. Think of yourself using this class through the DistanceMeasure interface. The implicit expectation [1] here is that d(x, y) = 0 iff x = y. [1] http://e

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Sean Owen
It is a good argument, and cosine distance is discontinuous at 0. In the context here they're trying to define a distance metric rather than actually care about the angle in question, and 0 is probably a better way to define it than anything else. I think it's OK to say that two users for whom you

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Andrew Musselman
I think it should return an "undefined" symbol. There is no angle between two zero vectors. In a practical sense, taking two zero vectors to be equivalent in the context of user-item vectors, say, is dodgy in my opinion. That is akin to saying "If we both hate everything on this restaurant's men

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Dan Filimon
Suneel is right. :) Let me explain how this came up: - When clustering, and assigning a point to a cluster, the centroid needs to be updated. - To update the centroid in the nearest neighbor searcher classes, the centroid must first be removed. - To remove the centroid, we get the closest vector (

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Suneel Marthi
Code from CosineDistanceMeasure:
    // correct for zero-vector corner case
    if (denominator == 0 && dotProduct == 0) {
      return 1;
    }
Seems like a bug to me, agree with Dan it should be 0 (and not 1). From: Dan Filimon To: dev@mahout.apache.org
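The change proposed in this thread can be sketched as follows. This is a hypothetical stand-alone class working on plain double arrays, not Mahout's actual CosineDistanceMeasure. Returning 0 when both vectors are zero follows the thread's proposal; keeping 1 when only one vector is zero is an assumption made here for illustration.

```java
// Hypothetical sketch; not the real Mahout CosineDistanceMeasure implementation.
final class CosineDistanceSketch {

  // Returns 1 - cos(angle between a and b). Both arrays must have the same length.
  static double distance(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    double dotProduct = 0.0;
    double squaredNormA = 0.0;
    double squaredNormB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dotProduct += a[i] * b[i];
      squaredNormA += a[i] * a[i];
      squaredNormB += b[i] * b[i];
    }
    double denominator = Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB);

    // Zero-vector corner case: the angle is undefined, so a convention is needed.
    if (denominator == 0.0) {
      // Two zero vectors are identical, so d(x, x) = 0 as proposed in the thread.
      // Assumption: a zero vector against a non-zero one keeps the old value of 1.
      return (squaredNormA == 0.0 && squaredNormB == 0.0) ? 0.0 : 1.0;
    }

    // Clamp so that rounding cannot push the cosine outside [-1, 1].
    double cosine = Math.max(-1.0, Math.min(1.0, dotProduct / denominator));
    return 1.0 - cosine;
  }

  public static void main(String[] args) {
    System.out.println(distance(new double[] {0, 0}, new double[] {0, 0}));
    System.out.println(distance(new double[] {1, 0}, new double[] {0, 1}));
  }
}
```

With this convention a vector looked up against itself always comes back at distance 0, which is what the nearest-neighbor removal step described elsewhere in the thread relies on.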

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-04-04 Thread Shannon Quinn
According to the GSoC calendar, accepted organizations aren't posted until April 8 (Monday), at which point (assuming Apache is accepted...I can't imagine it wouldn't be) slots will be doled out internally. This will probably take at least a day or two, so probably by middle of next week we'll

Re: CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Sean Owen
It sounds pretty undefined, but I would tend to define the distance as 0 in this case of course. And that means defining the cosine as 1. Which class in particular? There are a few implementations of this distance measure. On Thu, Apr 4, 2013 at 7:42 PM, Dan Filimon wrote: > In the case where bot

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-04-04 Thread Dan Filimon
Any news on this front? Did we get approved/assigned a slot/anything? On Fri, Mar 29, 2013 at 7:44 PM, Dan Filimon wrote: > Ok, updated! > > > On Fri, Mar 29, 2013 at 7:36 PM, Andy Twigg wrote: > >> Dan, >> >> I think what you've written is fine (I wanted to edit to remove the >> '?' around ran

CosineDistanceMeasure for 2 zero vectors?

2013-04-04 Thread Dan Filimon
In the case where both vectors are all zeros, the angle between them is 0, so the cosine is 1 and the distance returned should be 0 (unless I misunderstood what the distance does). In Mahout, when calling distance() however, if both the denominator and dotProduct are 0 (which is t

[jira] [Commented] (MAHOUT-1161) Unable to run CJKAnalyzer for conversion of a sequence file to sparse vector due to instantiation exception.

2013-04-04 Thread Sebastian Schelter (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622351#comment-13622351 ] Sebastian Schelter commented on MAHOUT-1161: @rohit did you apply the patch f

[jira] [Commented] (MAHOUT-1161) Unable to run CJKAnalyzer for conversion of a sequence file to sparse vector due to instantiation exception.

2013-04-04 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622190#comment-13622190 ] Grant Ingersoll commented on MAHOUT-1161: - Sebastian, that's a reasonable approac

[jira] [Commented] (MAHOUT-1161) Unable to run CJKAnalyzer for conversion of a sequence file to sparse vector due to instantiation exception.

2013-04-04 Thread Rohit Haritash (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622171#comment-13622171 ] Rohit Haritash commented on MAHOUT-1161: Hi Sebastian , Tested with the jars i

[jira] [Commented] (MAHOUT-1161) Unable to run CJKAnalyzer for conversion of a sequence file to sparse vector due to instantiation exception.

2013-04-04 Thread Sebastian Schelter (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622088#comment-13622088 ] Sebastian Schelter commented on MAHOUT-1161: Hi Rohit, can you test whether t

[jira] [Commented] (MAHOUT-1161) Unable to run CJKAnalyzer for conversion of a sequence file to sparse vector due to instantiation exception.

2013-04-04 Thread Rohit Haritash (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622086#comment-13622086 ] Rohit Haritash commented on MAHOUT-1161: Hi when invoking the CJK Analyser gettin

Re: Welcome Suneel Marthi and Dan Filimon

2013-04-04 Thread Shannon Quinn
Congratulations! :) On 4/4/13 6:30 AM, Grant Ingersoll wrote: In recognition of the contributions of Suneel Marthi and Dan Filimon to the Mahout project, the PMC is pleased to announce both have accepted our invitations to join the Mahout project as committers. As is customary, I will leave i

Welcome Suneel Marthi and Dan Filimon

2013-04-04 Thread Grant Ingersoll
In recognition of the contributions of Suneel Marthi and Dan Filimon to the Mahout project, the PMC is pleased to announce both have accepted our invitations to join the Mahout project as committers. As is customary, I will leave it to Suneel and Dan to provide a little bit of background on who

Seq2Sparse performance

2013-04-04 Thread Grant Ingersoll
Has anyone looked at seq2sparse performance in recent memory? I'm wondering if anyone has any ideas for improving it. Based on my reading of the code, it is likely slow due to the sheer number of steps it has to do, but I'm hoping there are some other cheaper wins hiding in there. (I am awa