[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794036#action_12794036
 ] 

Ankur commented on MAHOUT-103:
--

I skimmed through your version and what's present in the .item package. A few 
immediate things that come to mind are:
1. Moving to AbstractJob
2. Refactoring into separate map, reduce, and job classes. Personally I hate that 
because the code base just bloats as the number of M/R jobs increases.

I have been trying to set up IDEA using the IntelliJ.codestyle.xml provided on 
the Mahout cwiki. I placed the file under idea_home/config/codeStyles and restarted 
IDEA, but it still does not show an import option in File-Settings-Code Style. 
IDEA shows the following messages in the background:
Field not copied JAVA_INDENT_OPTIONS
Field not copied JSP_INDENT_OPTIONS
Field not copied XML_INDENT_OPTIONS
Field not copied OTHER_INDENT_OPTIONS
Field not copied FIELD_TYPE_TO_NAME
Field not copied STATIC_FIELD_TYPE_TO_NAME
Field not copied PARAMETER_TYPE_TO_NAME
Field not copied LOCAL_VARIABLE_TYPE_TO_NAME
Field not copied PACKAGES_TO_USE_IMPORT_ON_DEMAND
Field not copied IMPORT_LAYOUT_TABLE

I am using v8.1.4.

Once I set this up, I am going to look at your version and what's there in the 
.item package.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: MAHOUT-103.patch, mahout-103.patch.v1, 
 mahout-103.patch.v2, prepare.pl, run.sh


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794064#action_12794064
 ] 

Sean Owen commented on MAHOUT-103:
--

I'm not worried about integrating with AbstractJob just this second but at some 
point all our MR jobs ought to have some consistent approach. I also don't feel 
too strongly about Mapper/Reducers as inner classes. It's the same amount of 
code either way, so I don't think that's the difference, and I prefer the 
clarity of top-level classes in general unless a class is truly intricately 
associated to another class. But don't change that now.

Am I understanding you're roughly OK with this form of the patch, and I should 
submit, or..?

Looking at .item can come later too. I'm mostly thinking of small/local changes 
at the moment.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794133#action_12794133
 ] 

Ankur commented on MAHOUT-103:
--

Your changes don't look too invasive and yes, roughly speaking, I am OK with 
them. Since we are talking about committing this, I would like to say that I 
tested it for correctness on a very small hand-coded data set and then ran it 
on the Netflix data. However, I couldn't verify its correctness on the Netflix 
data, though I am pretty confident it works correctly. That is why I was hoping 
to have a couple of unit tests:

1. To verify that similar items are identified correctly.
2. To verify that none of the seen items are recommended for a user.

But since this is not the final version, can you suggest any other approach to 
be 100% sure of correctness? I don't want something to be committed only to 
discover a silly issue later just because we didn't take extra care.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793617#action_12793617
 ] 

Sean Owen commented on MAHOUT-103:
--

My only significant concern is that this overlaps a lot with what I've already 
committed under .item. However I'm not against committing this as this part of 
the code is still in an experimental phase where we should have room to play 
and see what 'sticks'.

There are, I think, a lot of small changes that need to be made here to fit code 
style, etc. I'm willing to commit after those sorts of things are addressed, 
and also volunteer to do them.

If you're reasonably willing to evolve and integrate this code along with the 
surrounding code, it's fine by me to get this in and continue working that way.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, 
 run.sh





[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793628#action_12793628
 ] 

Ankur commented on MAHOUT-103:
--

Evolving the code to integrate better with the existing stuff is fine with me. 
I am in general OK with throwing away code if it can be replaced by existing 
stuff that is better. 

However, I don't think it's a good idea to try to come up with a unified 
approach to generating Hadoop-based recommendations. I am afraid we'll create 
more problems than we'd solve.

I see recommendations in the Hadoop world as the following linear chain of M/R 
jobs:

Data-Formatting -> Data Filter -> Core Recommender -> Post Processor

The last 2 jobs can themselves be comprised of 1 or more M/R jobs.

 There is I think .

Let me come up with the unit tests and code documentation. After that you can 
start making the changes. Thanks a lot for the help, I appreciate it :-)

BTW, did you have a chance to actually run it on the Netflix data?




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793631#action_12793631
 ] 

Sean Owen commented on MAHOUT-103:
--

Yep, fine by me.

I think it would be far easier to unit test / document the 'final' version 
rather than something that will change notably. Really, I'm talking about 
formatting, style, use of libraries, etc. How about I post my version of the 
patch now?

I have not run it. I trust you're on top of that.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793676#action_12793676
 ] 

Jake Mannix commented on MAHOUT-103:


I've actually got a bunch of variations of this in the .item package as well, 
but I haven't got them all fully working yet.  I'm hoping they're a bit faster 
than what's in there now.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-17 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791971#action_12791971
 ] 

Ankur commented on MAHOUT-103:
--

OK, so here is the next version, which I again rewrote completely :-( for 
performance reasons. This version now computes item similarity and uses that to 
generate recommendations in true Hadoop fashion. In a nutshell, the 
recommendations are generated in 2 steps:

1. Join item-similarity data (generated by analyzing co-occurrence) with 
user-click data.
2. Group the output of step 1 on the user key so that a reducer receives all 
potential candidates for a user, along with all items the user has already 
clicked/seen, so that those can be excluded from the final recommendation set.

Also attached:
1. Perl script to convert the Netflix data into the required format (userId \t 
movieId) 
2. Bash script used to run it on a 50-node Hadoop cluster. The recommendations 
are generated for all the users in less than 45 min.
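
A minimal in-memory sketch of the two steps above, for illustration only. It is 
not the code in the patch: the function name recommend, the dict/list input 
shapes, the top_k parameter, and the choice to sum scores per candidate are all 
assumptions made here, and a real run would express the same join and grouping 
as M/R jobs.

```python
from collections import defaultdict

def recommend(similarities, clicks, top_k=10):
    """similarities: item -> [(similar_item, score)]; clicks: [(user, item)].

    Hypothetical helper: mirrors the join-then-group flow in plain Python.
    """
    # Step 1: join item-similarity rows with user clicks on the item key.
    joined = []                # (user, candidate, score) rows
    seen = defaultdict(set)    # items each user has already clicked
    for user, item in clicks:
        seen[user].add(item)
        for candidate, score in similarities.get(item, []):
            joined.append((user, candidate, score))

    # Step 2: group by user, drop already-seen items, and accumulate scores
    # (summing is an assumption; the comment above does not specify it).
    per_user = defaultdict(lambda: defaultdict(float))
    for user, candidate, score in joined:
        if candidate not in seen[user]:
            per_user[user][candidate] += score

    # Highest score first; ties broken by item name for determinism.
    return {
        user: sorted(cands.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        for user, cands in per_user.items()
    }

example_similarities = {"m1": [("m2", 3), ("m3", 1)], "m2": [("m1", 3), ("m3", 2)]}
example_clicks = [("u1", "m1"), ("u1", "m2"), ("u2", "m1")]
example_recs = recommend(example_similarities, example_clicks)
```

Here u1 has already seen m1 and m2, so only m3 survives as a candidate for that 
user.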

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch, mahout-103.patch.v1





[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-24 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781838#action_12781838
 ] 

Ankur commented on MAHOUT-103:
--

For this co-occurrence based recommender I am planning to write a set of 
map-reduce jobs that compute recommendations for users as follows:

1. Take the user's item history.
2. For each item in the history, fetch the top-N similar items (similarity 
based on co-occurrence).
3. Add the co-occurrence scores if an item appears more than once (NOT a weighted 
avg). Consider, e.g., user history { M1, M2, M3 } and the top-3 similar movies 
for each of these, along with co-occurrence scores:

M1 - (A, 5), (B, 4), (C, 2)
M2 - (D, 6), (E, 3), (F, 2)
M3 - (G, 8), (C, 5), (B, 2)  

So the final scores in decreasing order will look like
(G, 8)
(C, 7)
(B, 6)
(D, 6)
(A, 5)
(E, 3)
(F, 2)

The idea I want to capture is that a candidate item gets a higher score if it's 
similar to more items in the user's click history.
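
The worked example above can be reproduced with a short sketch. The function 
name sum_scores and the tie-break by item name are assumptions for 
illustration; only the data and the add-the-scores rule come from the comment.

```python
from collections import defaultdict

def sum_scores(similar_items, history):
    """Add up co-occurrence scores of every candidate across the history."""
    totals = defaultdict(int)
    for item in history:
        for candidate, score in similar_items.get(item, []):
            totals[candidate] += score  # plain sum, NOT a weighted average
    # Decreasing score; equal scores ordered by name (an assumption).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

similar = {
    "M1": [("A", 5), ("B", 4), ("C", 2)],
    "M2": [("D", 6), ("E", 3), ("F", 2)],
    "M3": [("G", 8), ("C", 5), ("B", 2)],
}
ranked = sum_scores(similar, ["M1", "M2", "M3"])
```

C and B each appear under two history items, which is why they move ahead of D 
and A.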

Do you see any issue with this approach? Any better approach that you can 
think of?

As for the precision-recall test, I am still trying to see how to divide the 
data into 'train' and 'test' for a fair evaluation. How do we do it in the 
existing code?




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781858#action_12781858
 ] 

Sean Owen commented on MAHOUT-103:
--

Yes, this is basically item-based recommendation. With some superficial 
changes, it would exactly fit that model. Co-occurrence here is like a 
similarity metric, which is ultimately used as a weighting. Canonically this 
value would be in [-1,1], and you can easily map [1,...) into that range of 
course.

Next you're sort of estimating preferences when you add up co-occurrence 
values. Canonically, you'd be doing a weighted average over M1 - M3. This is 
the same thing -- you're just not dividing by 3.
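
A small sketch of the point about dividing by 3, using the summed scores from 
the earlier example in this thread. The helper names average_scores and squash 
are hypothetical, and the 1 - 1/count mapping is just one assumed way to put a 
count from [1, ...) into the canonical range; nothing here is Mahout API.

```python
def average_scores(summed_scores, history_size):
    """Divide each summed score by the number of history items."""
    return {item: score / history_size for item, score in summed_scores.items()}

def squash(count):
    """Assumed mapping of a co-occurrence count in [1, inf) into [0, 1)."""
    return 1.0 - 1.0 / count

summed = {"G": 8, "C": 7, "B": 6, "D": 6, "A": 5, "E": 3, "F": 2}
averaged = average_scores(summed, 3)

# Dividing by a constant does not change the ranking for a single user.
order_by_sum = sorted(summed, key=lambda i: (-summed[i], i))
order_by_avg = sorted(averaged, key=lambda i: (-averaged[i], i))
```

The two orderings agree for one user; the scale only matters when scores are 
compared across users with different history sizes.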

The result is conceptually the same, though different approaches would yield 
slightly different results. I'm not necessarily suggesting you change the 
algorithm. At the same time I am also about to implement this very same thing 
-- the more 'canonical' form, to go hand-in-hand with the existing 
GenericItemBasedRecommender. I'd rather avoid duplication, and would like to 
make the Hadoop-based implementation as analogous to the existing code as 
possible. All I'd say is, go ahead, and maybe we look at generalizing it or 
shifting these concepts towards the canonical setup later.

Look at GenericIRStatsEvaluator and subclass for precision-recall approaches.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778838#action_12778838
 ] 

Sean Owen commented on MAHOUT-103:
--

It looks fine to me, with one request -- put it in a subpackage of the .hadoop 
package? I'm going to rearrange the bits already in there into subpackages as 
they no longer share one related purpose.

From there if I have any other comments we can look at those after it's 
committed. I'm certain it's good enough to get into the repo now, if you 
believe it's ready.

In about 2 weeks I am beginning writing the recommenders-in-Hadoop chapter, so 
this is very timely. I've been worried since my Hadoop-related code has been 
stymied by a Hadoop bug that's still not fixed. I am hoping your approach has a 
way around it.

I'd like to synthesize one approach to using Hadoop 0.20+ based on our 
implementations and then look to making the whole project consistent in this 
regard.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778911#action_12778911
 ] 

Ankur commented on MAHOUT-103:
--

Thanks for the quick look, appreciate that :-).

Putting it in a subpackage, sure. For now I'll just leave all the main code under 
one subpackage (how about 'bigram') until you have it sorted out. 

As for the code, once I have the test code ready for the Netflix dataset and at 
least one unit test, it will be good to go. One question: how do we apply 
precision-recall, RMSE, or any other evaluation technique to the results, since 
all we are doing is counting co-occurrence?

Do you have the JIRA for this Hadoop-related bug? 




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778950#action_12778950
 ] 

Sean Owen commented on MAHOUT-103:
--

I don't have a JIRA, just some threads on the mailing list. I'm going to dig 
back in to this next week and if I still see the problem formally report the 
bug. I hope it's cleared up in the latest code.

Yeah this code just implements counting co-occurrence, which isn't a complete 
recommender. Ideally each of these subpackages is a Hadoop-based recommender 
system that outputs recommendations.

When the input has rating values, you will output estimated rating values too, 
I'd imagine. That's good, but not quite what's needed to conduct an RMSE 
evaluation of the estimates. For that we'd want to hold back, say, 5% of the 
data and see how well it estimated the ratings of that 5%. But the output is 
recommendations, which don't necessarily include that 5% of test data.

(You could write a separate job that does it, sure.)

But it seems relatively easier to conduct a precision-recall test. Identify for 
each user some good recommendations, perhaps their favorite items. Remove 
those from the input. See how much of it gets recommended back in the output. 
From that you can compute precision and recall figures on the recommendations.
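
A minimal sketch of that per-user computation, assuming the held-out "good" 
items and the recommender's output are both available as plain lists. The 
function name precision_recall and the convention of returning 0.0 for an empty 
set are assumptions, not the existing evaluator code.

```python
def precision_recall(held_out, recommended):
    """held_out: items removed from the input; recommended: items output."""
    held_out, recommended = set(held_out), set(recommended)
    hits = len(held_out & recommended)  # held-out items recommended back
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall

# One of two held-out items comes back among four recommendations.
example_pr = precision_recall(["a", "b"], ["a", "c", "d", "e"])
```

Averaging these per-user figures over all users would give overall precision 
and recall for a run.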

It's all a nice-to-have -- to start I'd like to have a couple end-to-end, 
consistent recommenders based on Hadoop. I can see three right now:

- Your co-occurrence-based system
- The, er, dot-product-based item-based recommender Ted sketched, which I'll 
write. Not sure what to call it.
- The pseudo-distributed system already in there




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776921#action_12776921
 ] 

Sean Owen commented on MAHOUT-103:
--

Re-post an updated patch and happy to give my comments on it. The more the 
merrier. If it's basically sound I'd like to mention it in the forthcoming book 
which I'm writing now.

I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the 
rating. The framework can do this automatically too if you like in the 
DataModel.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch





[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776939#action_12776939
 ] 

Ankur commented on MAHOUT-103:
--

> Re-post an updated patch 

Sure, I'll have the updated code coming by early next week.

> If it's basically sound I'd like to mention it 

+10. The more people know about it, the better the chances it has of being used :-)  

> I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop 
> the rating ...

Simply dropping the rating might introduce too much noise. I was thinking of 
keeping only those that have ratings above 2.5 (or above 2, to be more liberal). 




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776951#action_12776951
 ] 

Sean Owen commented on MAHOUT-103:
--

That last point is interesting. Another school of thought is that rating 
something, even negatively, suggests you have a closer association to that 
thing than to the millions of other things you've never heard of.

Let's say you rate Bach a 5 and Brahms a 4 and Mendelssohn a 1.5. Would you 
rather recommend a Mendelssohn recording to this person, or death metal?

This is my understanding of the intuition I've gotten from Ted, and seems to 
bear out somewhat in practice, that ratings have a lot less info than one would 
think.

Well it's obviously something one can evaluate within the framework with the 
evaluator code to decide for sure.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776966#action_12776966
 ] 

Ankur commented on MAHOUT-103:
--

In that case dropping ratings might not be such a good idea and may lead to bad 
results. Consider the following movies that a user might have seen with the 
scores

Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2

Assuming that a lot of people have watched these movies and didn't like the 
subsequent two versions, they still will get high similarity scores w.r.t 
Matrix going purely by co-occurrence. IMHO, that leaves us with the following 
2 alternatives :-

1. Add the ratings when counting co-occurrence and hope that better ones will 
stand out even if they co-occur less.
2. Apply a Re-scorer that re-ranks the similar items for a given item 
based on their average scores.

Point 1 is something I am thinking of trying out.   
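
One possible reading of point 1, sketched under loud assumptions: instead of 
adding 1 for each co-occurring pair (a, b), add the rating the user gave to b. 
The comment does not say how the ratings would be combined, so this 
interpretation, the function name rating_weighted_cooccurrence, and the input 
shape are all hypothetical.

```python
from collections import defaultdict

def rating_weighted_cooccurrence(ratings):
    """ratings: iterable of (user, item, rating) triples."""
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating

    scores = defaultdict(float)
    for items in by_user.values():
        for a in items:
            for b, rating_b in items.items():
                if a != b:
                    # Assumed rule: the pair (a, b) earns b's rating, not 1.
                    scores[(a, b)] += rating_b
    return dict(scores)

matrix_ratings = [
    (user, title, rating)
    for user in ("u1", "u2")
    for title, rating in (("Matrix", 4.5),
                          ("Matrix Reloaded", 2.5),
                          ("Matrix Revolutions", 2.0))
]
weighted = rating_weighted_cooccurrence(matrix_ratings)
```

Under this reading the score becomes asymmetric: the sequels point strongly 
towards Matrix, while Matrix points only weakly towards the sequels.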
 




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776986#action_12776986
 ] 

Sean Owen commented on MAHOUT-103:
--

What's the problem in this example? Two people that have both seen all three 
Matrix films are probably similar. All the more so if they've rated the first 
one highly and the other two poorly. You'd correctly identify them as similar 
with or without ratings here.

The issue, I suppose, comes up when you encounter someone who didn't like the 
first one and liked the other two (strange, I know). Without pref values, we'd 
draw the same conclusion -- they have some similarity. With pref values, most 
metrics would say they are very dissimilar.

I actually think that's the wrong conclusion! The fact that two people bothered 
to watch all three says much more about their similarities than the variance in 
ratings says about their differences. I'd still guess they're sorta-similar, 
and metrics without pref values would tend to draw the more correct conclusion.


Of course there's no one right answer, and we can easily construct situations 
where throwing out pref values indeed hurts the result. I'm only asserting that 
it's entirely possible, in real data sets, for ratings to *hurt* on the whole. 


Let's start by adding the basic approach and then keep going to look at 
variations. I at least have some global knowledge of how the framework is set 
up and could help design in these variations in a way that's consistent with 
the framework.






Re: [jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-23 Thread Ted Dunning
The classic technique is to get total counts for B and for all events and
assume that these are effectively *B and *B' (that is, the number of times
that B exists is the number of times that either A or A' occurred with B).
This is usually just fine.  Then k(A'B) = k(B) - k(AB) and k(A'B') = k(*) -
k(B) - k(AB').  It is commonly necessary to compute the total and the
individual item counts like k(B) in a separate pass through the data, but
some storage schemes may make that unnecessary.
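
As a rough illustration of the two identities above, the following sketch fills in the missing cells from the counts that are directly available. The function name and the sample numbers are assumptions, not part of any existing code.

```python
def complete_table(k_ab, k_ab_not, k_b, k_total):
    """Return (k(AB), k(AB'), k(A'B), k(A'B')) from the directly available counts."""
    k_not_a_b = k_b - k_ab                    # k(A'B)  = k(B) - k(AB)
    k_not_a_not_b = k_total - k_b - k_ab_not  # k(A'B') = k(*) - k(B) - k(AB')
    return k_ab, k_ab_not, k_not_a_b, k_not_a_not_b


# Hypothetical counts: A and B together 3 times, A without B twice,
# B seen 10 times in all, 100 events overall.
cells = complete_table(k_ab=3, k_ab_not=2, k_b=10, k_total=100)
print(cells)
print(sum(cells))  # the four cells should add up to the total
```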

To get this right requires a careful touch.  I try to keep in mind that if
we start with a binary occurrence matrix U = users x items, then the item
cooccurrence matrix should be exactly U' U = items x items.  If U is
represented as a set of (user, item) pairs, then we can compute U' U by a
join to self followed by counting.  This usually results in a list of (item,
item, count) triples.
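
A small in-memory sketch of the join-to-self followed by counting might look like the following. The data and the function name are made up for illustration, and a real map-reduce job would do the grouping in the shuffle rather than in a dictionary.

```python
from collections import Counter, defaultdict

# U as a set of (user, item) pairs; hypothetical toy data.
pairs = {("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u2", "c"), ("u3", "c")}


def cooccurrence_triples(user_item_pairs):
    """Self-join on user, then count: returns {(item, item): count}, the entries of U' U."""
    items_by_user = defaultdict(set)
    for user, item in user_item_pairs:
        items_by_user[user].add(item)
    counts = Counter()
    for items in items_by_user.values():
        for left in items:        # the join pairs every item of a user ...
            for right in items:   # ... with every item of the same user
                counts[(left, right)] += 1
    return counts


triples = cooccurrence_triples(pairs)
for (left, right), count in sorted(triples.items()):
    print(left, right, count)
```

The diagonal (an item paired with itself) is kept here because it is part of U' U; whether to keep it when scoring is a separate design choice.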

The item counts are then the rowsums of U' U and the overall sum of U' U is
the total count.  These can easily be computed from U' U and take much less
time than computing U' U.  Note that U' U is symmetrical so you should only
need to compute the rowsums.  Checking that the column sums are the same is
a nice exercise, but not necessary in production.  The rowsums are
represented as (item, count) pairs and the overall total is just a count.
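
For illustration, a sketch of the rowsum and total computation over a small hand-written list of triples (the triples are hypothetical data; the column sums are computed only for the sanity check):

```python
from collections import Counter

# Hypothetical (item, item, count) triples for a small symmetric U' U.
triples = [
    ("a", "a", 2), ("a", "b", 2), ("a", "c", 1),
    ("b", "a", 2), ("b", "b", 2), ("b", "c", 1),
    ("c", "a", 1), ("c", "b", 1), ("c", "c", 2),
]

row_sums = Counter()
col_sums = Counter()
total = 0
for left, right, count in triples:
    row_sums[left] += count
    col_sums[right] += count  # only needed for the symmetry check below
    total += count

print(dict(row_sums), total)
print(dict(row_sums) == dict(col_sums))  # symmetry check
```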

The final scoring of the cooccurrence counts requires that you join the
triples that form U' U once with the rowsums against the first member of the
triple and then again with the rowsums against the second member of the triple.  You
can carry around the overall total separately, but there is an implied join
against this single number as well.

This gives you tuples of the form (a, b, count_ab, count_a, count_b,
total).  The counts you need for the contingency table are

count_ab,            count_a - count_ab
count_b - count_ab,  total - count_a - count_b + count_ab
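
A minimal sketch of the two joins and the resulting table, using dictionary lookups in place of real joins. All names and numbers here are illustrative assumptions.

```python
# Hypothetical inputs: a few (item, item, count) triples, the rowsums and the total.
triples = [("a", "b", 2), ("a", "c", 1), ("b", "c", 1)]
row_sums = {"a": 5, "b": 5, "c": 4}
total = 14


def join_counts(triples, row_sums, total):
    """Attach count_a, count_b and total to each triple (two joins plus the implied one)."""
    for a, b, count_ab in triples:
        yield a, b, count_ab, row_sums[a], row_sums[b], total


def contingency(count_ab, count_a, count_b, total):
    """2x2 table as ((k11, k12), (k21, k22)) in the layout given above."""
    return (
        (count_ab, count_a - count_ab),
        (count_b - count_ab, total - count_a - count_b + count_ab),
    )


for a, b, count_ab, count_a, count_b, n in join_counts(triples, row_sums, total):
    print(a, b, contingency(count_ab, count_a, count_b, n))
```

In a map-reduce setting the dictionary lookups would become actual joins keyed on the first and then the second item.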

Does this answer your question?

I can make the log-likelihood code have a method that takes count_ab,
count_a, count_b and total as arguments.
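
Such a method could be sketched as below. This is the standard G^2 log-likelihood ratio statistic for a 2x2 table, written out as an assumption about what the method would compute; the function name is hypothetical and this is not the existing log-likelihood code.

```python
from math import log


def log_likelihood_ratio(count_ab, count_a, count_b, total):
    """G^2 statistic for the 2x2 table implied by the four counts."""
    table = (
        (count_ab, count_a - count_ab),
        (count_b - count_ab, total - count_a - count_b + count_ab),
    )
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += observed * log(observed / expected)
    return 2.0 * g2


# Independent items score zero; items that always occur together score high.
print(log_likelihood_ratio(10, 20, 50, 100))
print(log_likelihood_ratio(20, 20, 20, 100))
```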

On Sun, Mar 22, 2009 at 10:32 PM, Ankur Goel ankur.g...@corp.aol.com wrote:

 As you can see from each entry we can get the values for AB and AB'.
 A'B and A'B' are not available directly in a single result and need to be
 computed.

 It would be nice to hear how you are planning to implement this in map-red.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)


Re: [jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-19 Thread Ted Dunning
Ankur,

What form will the counts be in when you need this function?

Four integers separately available?

Values in a view of a matrix?

I will be happy to adapt some code to compute the measure you need.

On Wed, Mar 18, 2009 at 9:45 PM, Ankur (JIRA) j...@apache.org wrote:

 map-red implementation of the log-likelihood ratio test as described in
 Ted's paper.






[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-19 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683516#action_12683516
 ] 

Ted Dunning commented on MAHOUT-103:



Hmmm I actually think of a click as the relation that connects a user to an 
item.  As such, it is distinct from either.

And I routinely do recommendation-like computations that involve users, network 
entities, query terms, documents, and other things that you would call users as 
they relate (by abstract clicks) to users, query terms, videos, music, web 
pages, network entities, words and other things that you would call items.

There is a horrible tension here between naming things by their most common 
usage and expecting programmers to realize that they really are abstract 
entities, or naming things in a totally abstract way and risking that no 
programmers ever catch on.  An example of an abstract naming that derives from 
linguistic terminology might be Agent (instead of User), Relation (instead of 
Click) and Target (instead of Item).  This makes the general interaction be 
Relation \subseteq Agent \times Target.  I wouldn't recommend this, however, 
because (as you say) people generally describe social algorithms in excessively 
concrete ways.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683170#action_12683170
 ] 

Sean Owen commented on MAHOUT-103:
--

1. How do you feel about, therefore, changing to use more abstract objects 
rather than, say, Click? These objects could be the existing ones, or 
modified or new ones. I think as you say the existing objects are about what is 
needed. That way the solution is that much more reusable. Same with the job -- 
the more it uses abstract/standard classes, the more reusable I think it looks.

2. Yeah the two interfaces are nearly identical: provide a method that takes 
two items as input and a numerical score as output. I suppose it just makes 
sense to use the existing ItemSimilarity interface in this section of the code.

3. Good question, here is my brief digression:

The code was originally written with an on-line model in mind -- 
recommendations happen in real-time. Over time that has proved inefficient or 
impractical for large data sets, though it remains quite nice for small- to 
medium-size data sets. Hence I have attempted to preserve the real-time model 
at the core, and build a batch-oriented extension around it using Hadoop.

The two are a bit separate, and that is fine. So in this section of the code, I 
don't mind attaching Hadoop-related jobs that are not intimately connected to 
the core code. I am trying to keep them as consistent as possible so that the 
original on-line and newer off-line models don't evolve into two separate 
worlds within this part of the code.

To be specific... well I don't know, I don't have a problem with adding this 
job actually. Ideally we build a bit more around it: it takes as input the 
standard preference-file format as used by FileDataModel, and outputs a file 
format that can be read by a new ItemSimilarity implementation that would 
read and cache all these results. That would be a nice step towards integrating 
with the core code.
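
As a rough sketch of that idea: a reader that loads precomputed item-item scores once and then answers "two items in, one score out" lookups from memory. The "itemA,itemB,score" file layout, the class name and the NaN-for-unknown convention are all assumptions made for illustration; none of them is an existing format or API.

```python
import io


class CachedItemSimilarity:
    """Loads precomputed item-item scores once and answers lookups from memory."""

    def __init__(self, lines):
        self._scores = {}
        for line in lines:
            line = line.strip()
            if not line:
                continue
            item_a, item_b, score = line.split(",")
            # Store under an order-independent key; similarity is assumed symmetric.
            self._scores[frozenset((item_a, item_b))] = float(score)

    def item_similarity(self, item_a, item_b):
        """Two items in, one numeric score out; NaN when the pair was not precomputed."""
        return self._scores.get(frozenset((item_a, item_b)), float("nan"))


# Stand-in for the job's output file (hypothetical "itemA,itemB,score" layout).
job_output = io.StringIO("101,102,0.75\n101,103,0.20\n")
similarity = CachedItemSimilarity(job_output)
print(similarity.item_similarity("102", "101"))
```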

This is something I have been remiss in - I wrote a job to do the 
pre-computation of item-item diffs for slope one but never wrote an 
implementation of DiffStorage that would read this output and operate based on 
those results. This would close the loop. 

How about we make #3 my part of this issue, to complete the connection between 
this job and the core code a bit more?




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683232#action_12683232
 ] 

Ted Dunning commented on MAHOUT-103:


  1. How do you feel about, therefore, changing to use more abstract objects 
  rather than, say, Click? 

How is click more or less abstract than the term user?






[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683238#action_12683238
 ] 

Sean Owen commented on MAHOUT-103:
--

The comparison would be to Item. You could say that's as domain-specific as 
Click; I'd suggest that User/Item are the 'abstract' concepts in this context 
since collaborative filtering is invariably explained in terms of users and 
items, though of course your user or item can be whatever you like.

At least, there is no need to have both Click and Item -- unless this 
particular context requires one to store more information about a click as an 
item, in which case it should at least implement Item. But I don't think that's 
the case.

The good news is that this work doesn't seem to only apply to processing click 
logs, so, I'm suggesting it might be even more useful to express it in terms of 
the 'abstract' concepts in this context.






[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-17 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682915#action_12682915
 ] 

Ankur commented on MAHOUT-103:
--

Hey Sean, thanks for the review comments. Some specific questions:

1. This is indeed doing approximately the same thing as 
TanimotoCoefficientSimilarity and BooleanPreferenceUser. The difference is 
that the similarity computation is parallelized in map-reduce.

2. The idea of introducing a FitnessEvaluator was to allow people to apply 
domain-specific logic when comparing a preference. Are you suggesting the 
replacement of FitnessEvaluator with ItemSimilarity?

3. The Hadoop job was written to run this thing stand-alone. What modifications 
do you feel would be appropriate for integration into the framework?





[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-01-29 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668475#action_12668475
 ] 

Ankur commented on MAHOUT-103:
--

I am hoping to make the above improvements after I get some review comments.
