[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794036#action_12794036
]

Ankur commented on MAHOUT-103:
--

I skimmed through your version and what's present in the .item package. A few
immediate things that come to mind are
1. Moving to AbstractJob
2. Refactoring to separate map, reduce and job classes. Personally I hate that
because the code base just bloats when the number of M/R jobs increases.

I have been trying to set up my IDEA using the IntelliJ.codestyle.xml provided by
the Mahout cwiki. I placed the file under idea_home/config/codeStyles and restarted
IDEA but it still does not show an import option in File -> Settings -> Code Style.
IDEA shows the following messages in the background:
Field not copied JAVA_INDENT_OPTIONS
Field not copied JSP_INDENT_OPTIONS
Field not copied XML_INDENT_OPTIONS
Field not copied OTHER_INDENT_OPTIONS
Field not copied FIELD_TYPE_TO_NAME
Field not copied STATIC_FIELD_TYPE_TO_NAME
Field not copied PARAMETER_TYPE_TO_NAME
Field not copied LOCAL_VARIABLE_TYPE_TO_NAME
Field not copied PACKAGES_TO_USE_IMPORT_ON_DEMAND
Field not copied IMPORT_LAYOUT_TABLE

I am using v 8.1.4

Once I set this up, I am gonna look at your version and what's there in the .item
package.

Co-occurence based nearest neighbourhood

Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-103.patch, mahout-103.patch.v1,
mahout-103.patch.v2, prepare.pl, run.sh

Nearest neighborhood type queries for users/items can be answered efficiently
and effectively by analyzing the co-occurrence model of a user/item w.r.t
another. This patch aims at providing an implementation for answering such
queries based upon simple co-occurrence counts.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794064#action_12794064
]

Sean Owen commented on MAHOUT-103:
--

I'm not worried about integrating with AbstractJob just this second but at some
point all our MR jobs ought to have some consistent approach. I also don't feel
too strongly about Mapper/Reducers as inner classes. It's the same amount of
code either way, so I don't think that's the difference, and I prefer the
clarity of top-level classes in general unless a class is truly intricately
associated to another class. But don't change that now.

Am I understanding you're roughly OK with this form of the patch, and I should
submit, or..?

Looking at .item can come later too. I'm mostly thinking of small/local changes
at the moment.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794133#action_12794133
]

Ankur commented on MAHOUT-103:
--

Your changes don't look too mutating and yes, roughly speaking, I am ok. Since we
are talking about committing this I would like to say that I tested this for
correctness on a very small hand-coded data set and then ran it on the Netflix data.
However I couldn't verify its correctness over the Netflix data, though I am pretty
confident it works correctly. That is why I was hoping to have a couple of unit
tests:

1. To verify that similar items are identified correctly.
2. To verify that none of the seen items are recommended for a user.

But since this is not the final version, can you suggest any other approach to
be 100% sure of correctness? I don't want something to be committed only to
discover a silly issue later just because we didn't take extra care.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793617#action_12793617
]

Sean Owen commented on MAHOUT-103:
--

My only significant concern is that this overlaps a lot with what I've already
committed under .item. However I'm not against committing this as this part of
the code is still in an experimental phase where we should have room to play
and see what 'sticks'.

There are, I think, a lot of small changes that need to be made here to fit code
style, etc. I'm willing to commit after those sorts of things are addressed,
and also volunteer to do them.

If you're reasonably willing to evolve and integrate this code along with the
surrounding code, it's fine by me to get this in and continue working that way.

Co-occurence based nearest neighbourhood

Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl,
run.sh


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793628#action_12793628
]

Ankur commented on MAHOUT-103:
--

Evolving the code to integrate better with the existing stuff is fine with me.
I am in general ok with throwing away code if it can be replaced by existing
stuff that is better.

However, I don't think it's a good idea to try to come up with a unified
approach of generating hadoop based recommendations. I am afraid we'll create
more problems than we'd solve.

I see recommendations in hadoop world as the following linear chain of M/R jobs

Data-Formatting --> Data Filter --> Core Recommender --> Post Processor

The last 2 jobs can themselves be comprised of 1 or more M/R jobs.
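
The chain above can be sketched in plain Python as a sequence of stages, each
consuming the previous stage's output. This is only an illustration of the
shape of the pipeline, not the patch's code: all four function names are
hypothetical, and the "core recommender" here is a deliberately trivial
placeholder (item popularity) standing in for one or more real M/R jobs.

```python
from collections import Counter, defaultdict

def format_data(lines):
    # Data-Formatting: parse "user<TAB>item" lines into (user, item) pairs.
    return [tuple(line.rstrip("\n").split("\t")) for line in lines if line.strip()]

def filter_data(pairs):
    # Data Filter: drop duplicate (user, item) pairs.
    return sorted(set(pairs))

def core_recommender(pairs):
    # Core Recommender (placeholder, an assumption for illustration only):
    # score every item a user has not seen by its overall popularity.
    popularity = Counter(item for _, item in pairs)
    seen = defaultdict(set)
    for user, item in pairs:
        seen[user].add(item)
    return {
        user: {i: c for i, c in popularity.items() if i not in items}
        for user, items in seen.items()
    }

def post_process(scored, top_n=2):
    # Post Processor: keep the top_n items per user, highest score first.
    return {
        user: [i for i, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for user, s in scored.items()
    }

def run_chain(lines):
    # Run the stages strictly in order, feeding each output to the next stage.
    data = lines
    for stage in (format_data, filter_data, core_recommender, post_process):
        data = stage(data)
    return data
```

In a real Hadoop setup each stage would read and write files on HDFS rather
than pass in-memory values.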

There is I think .

Let me come up with the unit-tests and code documentation. After that you can
start doing the changes. Thanks a lot for help. Appreciate that :-)

BTW, did you have a chance to actually run it on the Netflix data?


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793631#action_12793631
 ] 

Sean Owen commented on MAHOUT-103:
--

Yep, fine by me.

I think it would be far easier to unit test / document the 'final' version
rather than something that will change notably. Really I'm talking about
formatting, style, use of libraries, etc. How about I post my version of the
patch now?

I have not run it. I trust you're on top of that.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Jake Mannix (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793676#action_12793676
 ] 

Jake Mannix commented on MAHOUT-103:


I've actually got a bunch of variations of this in the .item package as well, 
but I haven't got them all fully working yet.  I'm hoping they're a bit faster 
than what's in there now.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-17 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791971#action_12791971
]

Ankur commented on MAHOUT-103:
--

Ok, so here is the next version which I again re-wrote completely :-( for
performance reasons. The version now computes item similarity and uses that to
generate recommendations in truly Hadoop fashion. In a nutshell the
recommendations are generated in 2 steps:

1. Join item-similarity data (generated via analyzing co-occurrence) with
user-click data.
2. Group the output of step 1 on the user key so that a reducer receives all
potential candidates for a user and also all items already clicked/seen by
him, so that they can be excluded from the final recommendations set.
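
The two steps can be sketched in memory as follows. This is a minimal
illustration, not the code in the patch: the function name, the tuple layouts
and the choice to sum scores per candidate are assumptions, and a real run
would express the join and the grouping as M/R jobs.

```python
from collections import defaultdict

def recommend(item_similarity, user_clicks, top_k=10):
    """item_similarity: iterable of (item, similar_item, score).
    user_clicks: iterable of (user, item).
    Returns {user: [(candidate, score), ...]} with seen items excluded."""
    # Index the similarity data by item; the item id is the join key.
    similar_by_item = defaultdict(list)
    for item, other, score in item_similarity:
        similar_by_item[item].append((other, score))

    # Step 1: join user clicks with item-similarity data on the item id.
    joined = []                # (user, candidate, score)
    seen = defaultdict(set)    # user -> items already clicked
    for user, item in user_clicks:
        seen[user].add(item)
        for other, score in similar_by_item.get(item, []):
            joined.append((user, other, score))

    # Step 2: group on the user key, combine scores (summed here, which is
    # an assumption) and drop everything the user has already seen.
    candidates = defaultdict(lambda: defaultdict(float))
    for user, candidate, score in joined:
        if candidate not in seen[user]:
            candidates[user][candidate] += score

    return {
        user: sorted(candidates[user].items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        for user in seen
    }
```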

Also attached:
1. Perl script to convert the Netflix data into the required format (userId \t
movieId)
2. Bash script used to run it on a 50-node Hadoop cluster. The recommendations
are generated for all the users in less than 45 min.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-24 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781838#action_12781838
 ] 

Ankur commented on MAHOUT-103:
--

For this co-occurrence based recommender I am planning to write a set of
map-reduce jobs that compute recommendations for users as follows:

1. Take the user's item history.
2. For each item in his history fetch the top-N similar items (similarity
based on co-occurrence).
3. Add the co-occurrence scores if an item appears more than once (NOT weighted
avg). Consider e.g. user-history { M1, M2, M3 } and the top-3 similar movies
for each of these along with co-occurrence scores:

M1 - (A, 5), (B, 4), (C, 2)
M2 - (D, 6), (E, 3), (F, 2)
M3 - (G, 8), (C, 5), (B, 2)  

So the final scores in decreasing order will look like
(G, 8)
(C, 7)
(B, 6)
(D, 6)
(A, 5)
(E, 3)
(F, 2)

The idea I want to capture is that a candidate item gets a higher score if it's
similar to more items in the user's click history.
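
The scoring in the example can be reproduced with a few lines of Python. This
is just a sketch of the summation described above; the function name and the
dictionary layout are made up for illustration.

```python
from collections import Counter

def score_candidates(similar_items, history):
    """Sum co-occurrence scores per candidate over the user's history."""
    totals = Counter()
    for item in history:
        for candidate, score in similar_items.get(item, []):
            totals[candidate] += score
    # Highest total first; ties keep the order in which candidates appeared.
    return sorted(totals.items(), key=lambda kv: -kv[1])

# Top-3 similar movies per history item, taken from the example above.
similar_items = {
    "M1": [("A", 5), ("B", 4), ("C", 2)],
    "M2": [("D", 6), ("E", 3), ("F", 2)],
    "M3": [("G", 8), ("C", 5), ("B", 2)],
}
ranking = score_candidates(similar_items, ["M1", "M2", "M3"])
```

C and B move up because they are similar to two of the three movies in the
history, which is exactly the effect the summation is meant to capture.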

Do you see any issue with this approach? Any other better approach that you
can think of?

As for the precision-recall test, I am still trying to see how to divide the
data into 'train' and 'test' for a fair evaluation. How do we do it in the
existing code?

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch, mahout-103.patch.v1



[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-24 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781858#action_12781858
]

Sean Owen commented on MAHOUT-103:
--

Yes, this is basically item-based recommendation. With some superficial
changes, it would exactly fit that model. Co-occurrence here is like a
similarity metric, which is ultimately used as a weighting. Canonically this
value would be in [-1,1], and you can easily map [1,...) into that range of
course.

Next you're sort of estimating preferences when you add up co-occurrence
values. Canonically, you'd be doing a weighted average over M1 - M3. This is
the same thing -- you're just not dividing by 3.
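
A quick way to see the "not dividing by 3" point: dividing every summed score
by the same history size changes the values but not the order. The sketch
below uses the summed scores from the earlier M1 - M3 example; the helper
names are hypothetical, and the plain division is a simplification of a true
weighted average.

```python
def rank(scores):
    """Return item ids ordered by descending score, ties broken by id."""
    return [item for item, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

def average_over_history(summed_scores, history_size):
    """Divide each summed score by the number of items in the history."""
    return {item: score / history_size for item, score in summed_scores.items()}

summed = {"G": 8, "C": 7, "B": 6, "D": 6, "A": 5, "E": 3, "F": 2}
averaged = average_over_history(summed, history_size=3)
same_order = rank(summed) == rank(averaged)
```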

The result is conceptually the same, though different approaches would yield
slightly different results. I'm not necessarily suggesting you change the
algorithm. At the same time I am also about to implement this very same thing
-- the more 'canonical' form, to go hand-in-hand with the existing
GenericItemBasedRecommender. I'd rather avoid duplication, and would like to
make the Hadoop-based implementation as analogous to the existing code as
possible. All I'd say is, go ahead, and maybe we look at generalizing it or
shifting these concepts towards the canonical setup later.

Look at GenericIRStatsEvaluator and subclass for precision-recall approaches.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778838#action_12778838
]

Sean Owen commented on MAHOUT-103:
--

It looks fine to me, with one request -- put it in a subpackage of the .hadoop
package? I'm going to rearrange the bits already in there into subpackages as
they no longer share one related purpose.

From there if I have any other comments we can look at those after it's
committed. I'm certain it's good enough to get into the repo now, if you
believe it's ready.

In about 2 weeks I am beginning writing the recommenders-in-Hadoop chapter, so
this is very timely. I've been worried since my Hadoop-related code has been
stymied by a Hadoop bug that's still not fixed. I am hoping your approach has a
way around it.

I'd like to synthesize one approach to using Hadoop 0.20+ based on our
implementations and then look to making the whole project consistent in this
regard.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778911#action_12778911
 ] 

Ankur commented on MAHOUT-103:
--

Thanks for the quick look, appreciate that :-).

Putting it in a subpackage, sure. For now I'll just leave all the main code under
one subpackage (how about 'bigram') until you have it sorted out.

As for the code, once I have the test code ready for the Netflix dataset and at
least one unit test, it will be good to go. One question: how do we apply
precision-recall or RMSE or any other evaluation technique to the results since
all we are doing is counting co-occurrence?

Do you have the JIRA for this Hadoop-related bug?


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778950#action_12778950
]

Sean Owen commented on MAHOUT-103:
--

I don't have a JIRA, just some threads on the mailing list. I'm going to dig
back in to this next week and if I still see the problem formally report the
bug. I hope it's cleared up in the latest code.

Yeah this code just implements counting co-occurrence, which isn't a complete
recommender. Ideally each of these subpackages is a Hadoop-based recommender
system that outputs recommendations.

When the input has rating values, you will output estimated rating values too,
I'd imagine. That's good, but not quite what's needed to conduct an RMSE
evaluation of the estimates. For that we'd want to hold back, say, 5% of the
data and see how well it estimated the ratings of that 5%. But the output is
recommendations, which don't necessarily include that 5% of test data.

(You could write a separate job that does it, sure.)

But it seems relatively easier to conduct a precision-recall test. Identify for
each user some good recommendations, perhaps their favorite items. Remove
those from the input. See how much of it gets recommended back in the output.
From that you can compute precision and recall figures on the recommendations.
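
For a single user, the computation could look like the sketch below. It is not
the GenericIRStatsEvaluator code; the function name and arguments are
assumptions, with held_out being the favorite items removed from the input and
recommended being what the recommender returned for that user.

```python
def precision_recall(held_out, recommended):
    """Precision and recall of one user's recommendations against the
    items that were held out of the input for that user."""
    held_out = set(held_out)
    recommended = list(dict.fromkeys(recommended))  # drop repeats, keep order
    hits = sum(1 for item in recommended if item in held_out)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall
```

Averaging these two figures over all users gives overall precision and recall
for the recommender.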

It's all a nice-to-have -- to start I'd like to have a couple end-to-end,
consistent recommenders based on Hadoop. I can see three right now:

- Your co-occurrence-based system
- The, er, dot-product-based item-based recommender Ted sketched, which I'll
write. Not sure what to call it.
- The pseudo-distributed system already in there


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776921#action_12776921
 ] 

Sean Owen commented on MAHOUT-103:
--

Re-post an updated patch and happy to give my comments on it. The more the 
merrier. If it's basically sound I'd like to mention it in the forthcoming book 
which I'm writing now.

I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the 
rating. The framework can do this automatically too if you like in the 
DataModel.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch



[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776939#action_12776939
 ] 

Ankur commented on MAHOUT-103:
--

> Re-post an updated patch

Sure, I'll have the updated code coming by early next week.

> If it's basically sound I'd like to mention it

+10. The more people know about it, the better chances it has of being used :-)

> I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop
> the rating ...

Simply dropping the rating might introduce too much noise. I was thinking of
keeping only those that have ratings > 2.5 (or 2 to be more liberal).
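
As a sketch of that idea, the filter could be applied while converting rated
data into the (user, item) form used for co-occurrence counting. The function
name and the (user, item, rating) tuple layout are assumptions, and the
threshold is left as a parameter.

```python
def keep_liked(ratings, threshold=2.5):
    """Turn (user, item, rating) triples into (user, item) pairs,
    keeping only ratings strictly above the threshold."""
    return [(user, item) for user, item, rating in ratings if rating > threshold]
```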


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776951#action_12776951
]

Sean Owen commented on MAHOUT-103:
--

That last point is interesting. Another school of thought is that rating
something, even negatively, suggests you have a closer association to that
thing than to the millions of other things you've never heard of.

Let's say you rate Bach a 5 and Brahms a 4 and Mendelssohn a 1.5. Would you
rather recommend a Mendelssohn recording to this person, or death metal?

This is my understanding of the intuition I've gotten from Ted, and seems to
bear out somewhat in practice, that ratings have a lot less info than one would
think.

Well it's obviously something one can evaluate within the framework with the
evaluator code to decide for sure.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776966#action_12776966
]

Ankur commented on MAHOUT-103:
--

In that case dropping ratings might not be such a good idea and may lead to bad
results. Consider the following movies that a user might have seen with the
scores

Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2

Assuming that a lot of people have watched these movies and didn't like the
subsequent two versions, they still will get high similarity scores w.r.t
Matrix going purely by co-occurrence. IMHO, that leaves us with the following
2 alternatives :-

1. Add the ratings when counting co-occurrence and hope that better ones will
stand out even if they co-occur less.
2. Apply a Re-scorer that re-ranks the similar items for a given item
based on their average scores.

Point 1 is something I am thinking of trying out.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776986#action_12776986
]

Sean Owen commented on MAHOUT-103:
--

What's the problem in this example? Two people that have both seen all three
Matrix films are probably similar. All the more so if they've rated the first
one highly and the other two poorly. You'd correctly identify them as similar
with or without ratings here.

The issue, I suppose, comes up when you encounter someone who didn't like the
first one and liked the other two (strange, I know). Without pref values, we'd
draw the same conclusion -- they have some similarity. With pref values, most
metrics would say they are very dissimilar.

I actually think that's the wrong conclusion! The fact that two people bothered
to watch all three says much more about their similarities than the variance in
ratings says about their differences. I'd still guess they're sorta-similar,
and metrics without pref values would tend to draw the more correct conclusion.

Of course there's no one right answer, and we can easily construct situations
where throwing out pref values indeed hurts the result. I'm only asserting that
it's entirely possible, in real data sets, for ratings to *hurt* on the whole.

Let's start by adding the basic approach and then keep going to look at
variations. I at least have some global knowledge of how the framework is set
up and could help design in these variations in a way that's consistent with
the framework.


Re: [jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-23 Thread Ted Dunning

The classic technique is to get total counts for B and for all events and
assume that these are effectively *B and *B' (that is, the number of times
that B exists is the number of times that either A or A' occurred with B).
This is usually just fine.  Then k(A'B) = k(B) - k(AB) and k(A'B') = k(*) -
k(B) - k(AB').  It is commonly necessary to compute the total and the
individual item counts like k(B) in a separate pass through the data, but
some storage schemes may make that unnecessary.

To get this right requires a careful touch.  I try to keep in mind that if
we start with a binary occurrence matrix U = users x items, then the item
cooccurrence matrix should be exactly U' U = items x items.  If U is
represented as a set of (user, item) pairs, then we can compute U' U by a
join to self followed by counting.  This usually results in a list of (item,
item, count) triples.
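
A small in-memory sketch of the self-join followed by counting is shown here.
It assumes binary occurrence (each (user, item) pair counted once) and keeps
the diagonal so that the result matches U' U exactly; the function name is
made up, and a real implementation would do the join in map-reduce.

```python
from collections import Counter, defaultdict
from itertools import product

def cooccurrence(pairs):
    """Compute U'U from (user, item) pairs.
    Returns a Counter mapping (item_a, item_b) -> count."""
    # Group items by user; a set makes the occurrence matrix binary.
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)

    # Self-join on user: every ordered pair of items a user touched
    # contributes 1 to the corresponding cell of U'U.
    counts = Counter()
    for items in items_by_user.values():
        for a, b in product(items, repeat=2):
            counts[(a, b)] += 1
    return counts
```

Each entry of the returned Counter corresponds to one (item, item, count)
triple.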

The item counts are then the rowsums of U' U and the overall sum of U' U is
the total count.  These can easily be computed from U' U and take much less
time than computing U' U.  Note that U' U is symmetrical so you should only
need to compute the rowsums.  Checking that the column sums are the same is
a nice exercise, but not necessary in production.  The rowsums are
represented as (item, count) pairs and the overall total is just a count.

The final scoring of the cooccurrence counts requires that you join the
triples that form U' U once with the rowsums against the first member of the
triple and then again with the rowsums against the second member of the triple.  You
can carry around the overall total separately, but there is an implied join
against this single number as well.

This gives you tuples of the form (a, b, count_ab, count_a, count_b,
total).  The counts you need for the contingency table are

count_ab              count_a - count_ab
count_b - count_ab    total - count_a - count_b + count_ab

Does this answer your question?

I can make the log-likelihood code have a method that takes count_ab,
count_a, count_b and total as arguments.
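
For illustration, such a method might look like the sketch below. It builds
the contingency table from the four counts and applies the standard G^2
formulation of the log-likelihood ratio, 2 * sum k_ij * ln(k_ij * N /
(row_i * col_j)). The function names are hypothetical and this is not
necessarily how the Mahout code is written.

```python
import math

def contingency(count_ab, count_a, count_b, total):
    """Build the 2x2 table (k11, k12, k21, k22) from joined counts."""
    k11 = count_ab
    k12 = count_a - count_ab
    k21 = count_b - count_ab
    k22 = total - count_a - count_b + count_ab
    return k11, k12, k21, k22

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table (0 means independence)."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g = 0.0
    for k, i, j in cells:
        if k > 0:  # empty cells contribute nothing (limit of k * ln k is 0)
            g += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * g

def llr_from_counts(count_ab, count_a, count_b, total):
    """Score one (a, b, count_ab, count_a, count_b, total) tuple."""
    return log_likelihood_ratio(*contingency(count_ab, count_a, count_b, total))
```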

On Sun, Mar 22, 2009 at 10:32 PM, Ankur Goel ankur.g...@corp.aol.com wrote:

 As you can see from each entry we can get the values for AB and AB'.
 A'B and A'B' are not available directly in a single result and need to be
 computed.

 It would be nice to hear how you are planning to implement this in map-red.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)

Re: [jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-19 Thread Ted Dunning

Ankur,

What form will the counts be in when you need this function?

Four integers separately available?

Values in a view of a matrix?

I will be happy to adapt some code to compute the measure you need.

On Wed, Mar 18, 2009 at 9:45 PM, Ankur (JIRA) j...@apache.org wrote:

 map-red implementation of the log-likelihood ratio test as described in
 Ted's paper.




-- 
Ted Dunning, CTO
DeepDyve

[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-19 Thread Ted Dunning (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683516#action_12683516
]

Ted Dunning commented on MAHOUT-103:

Hmmm I actually think of a click as the relation that connects a user to an
item. As such, it is distinct from either.

And I routinely do recommendation like computations that involve users, network
entities, query terms, documents, and other things that you would call users as
they relate (by abstract clicks) to users, query terms, videos, music, web
pages, network entities, words, query terms and other things that you would
call items.

There is a horrible tension here between naming things by their most common
usage and expecting programmers to realize that they really are abstract
entities or naming things in a total abstract way and risking that no
programmers ever catch on. An example of an abstract naming that derives from
linguistic terminology might be Agent (instead of User), Relation (instead of
Click) and Target (instead of Item). This makes the general interaction be
Relation \subsetof Agent x Target. I wouldn't recommend this, however, because
(as you say) people generally describe social algorithms in excessively
concrete ways.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683170#action_12683170
]

Sean Owen commented on MAHOUT-103:
--

1. How do you feel about, therefore, changing to use more abstract objects
rather than, say, Click? These objects could be the existing ones, or
modified or new ones. I think as you say the existing objects are about what is
needed. That way the solution is that much more reusable. Same with the job --
the more it uses abstract/standard classes, the more reusable I think it looks.

2. Yeah the two interfaces are nearly identical: provide a method that takes
two items as input and a numerical score as output. I suppose it just makes
sense to use the existing ItemSimilarity interface in this section of the code.

3. Good question, here is my brief digression:

The code was originally written with an on-line model in mind --
recommendations happen in real-time. Over time that has proved inefficient or
impractical for large data sets, though it remains quite nice for small- to
medium-size data sets. Hence I have attempted to preserve the real-time model
at the core, and build a batch-oriented extension around it using Hadoop.

The two are a bit separate, and that is fine. So in this section of the code, I
don't mind attaching Hadoop-related jobs that are not intimately connected to
the core code. I am trying to keep them as consistent as possible so that the
original on-line and newer off-line models don't evolve into two separate
worlds within this part of the code.

To be specific... well I don't know, I don't have a problem with adding this
job actually. Ideally we build a bit more around it: takes as input the
standard preference-file format as used by FileDataModel, and outputs a file
format that can be read by a new ItemSimilarity implementation that would
read and cache all these results. That would be a nice step towards integrating
with the core code.

This is something I have been remiss in - I wrote a job to do the
pre-computation of item-item diffs for slope one but never wrote an
implementation of DiffStorage that would read this output and operate based on
those results. This would close the loop.

How about we make #3 my part of this issue, to complete the connection between
this job and the core code a bit more?
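
As a rough illustration of points 2 and 3, the sketch below shows the shape of
such a component in Python: a lookup that takes two items and returns a
numerical score, backed by pre-computed results that are read once and cached.
This is not Mahout's ItemSimilarity interface. The class name, the
"itemA,itemB,score" line format, and the default scores (1.0 for identical
items, 0.0 for unknown pairs) are all assumptions made for the example.

```python
import csv
import io


class PrecomputedItemSimilarity:
    """Hypothetical stand-in for an ItemSimilarity-style lookup."""

    def __init__(self, lines):
        # Read every "itemA,itemB,score" row once and cache it in memory.
        self._scores = {}
        for row in csv.reader(lines):
            if not row:
                continue  # tolerate blank lines in the job output
            item_a, item_b, score = row
            # frozenset makes the lookup symmetric: (A, B) == (B, A).
            self._scores[frozenset((item_a, item_b))] = float(score)

    def item_similarity(self, item_a, item_b):
        """Two items in, one numerical score out."""
        if item_a == item_b:
            return 1.0  # assumption: an item is fully similar to itself
        return self._scores.get(frozenset((item_a, item_b)), 0.0)


# Stand-in for the output file of the batch job (assumed format).
job_output = io.StringIO("A,B,0.5\nA,C,0.25\n")
similarity = PrecomputedItemSimilarity(job_output)
```

Caching everything up front trades memory for lookup speed, which suits the
real-time model as long as the pre-computed result set fits in memory.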


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Ted Dunning (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683232#action_12683232
 ] 

Ted Dunning commented on MAHOUT-103:


  1. How do you feel about, therefore, changing to use more abstract objects 
  rather than, say, Click? 

How is click more or less abstract than the term user?



 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683238#action_12683238
]

Sean Owen commented on MAHOUT-103:
--

The comparison would be to Item. You could say that's as domain-specific as
Click; I'd suggest that User/Item are the 'abstract' concepts in this context
since collaborative filtering is invariably explained in terms of users and
items, though of course your user or item can be whatever you like.

At least, there is no need to have both Click and Item -- unless this
particular context requires one to store more information about a click as an
item, in which case it should at least implement Item. But I don't think that's
the case.

The good news is that this work doesn't seem to only apply to processing click
logs, so, I'm suggesting it might be even more useful to express it in terms of
the 'abstract' concepts in this context.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683251#action_12683251
]

Sean Owen commented on MAHOUT-103:
--

The good news is that this work doesn't seem to only apply to processing click
logs, so, I'm suggesting it might be even more useful to express it in terms of
the 'abstract' concepts in this context.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-17 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682915#action_12682915
 ] 

Ankur commented on MAHOUT-103:
--

Hey Sean, thanks for the review comments. Some specific questions:

1. This is indeed doing approximately the same thing as
TanimotoCoefficientSimilarity and BooleanPreferenceUser. The difference is
that the similarity computation is parallelized in map-reduce.

2. The idea of introducing a FitnessEvaluator was to allow people to apply
domain-specific things when comparing a preference. Are you suggesting
replacing FitnessEvaluator with ItemSimilarity?

3. The Hadoop job was written to run this thing stand-alone. What modifications 
do you feel would be appropriate for integration into the framework?
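
For readers unfamiliar with the computation mentioned in point 1, here is a
small Python sketch of a Tanimoto coefficient derived from co-occurrence
counts, with the map and reduce phases simulated in memory. It is not the
code from the patch and does not use Hadoop; the function names and sample
preferences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_user(items):
    """Map phase: for one user's items, emit per-item and per-pair counts."""
    unique = sorted(set(items))
    for item in unique:
        yield ("item", item), 1
    for pair in combinations(unique, 2):
        yield ("pair", pair), 1


def reduce_counts(emitted):
    """Reduce phase: sum the emitted values for each key."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals


def tanimoto(totals, a, b):
    """|users with both| / |users with either|, computed from the totals."""
    both = totals[("pair", tuple(sorted((a, b))))]
    either = totals[("item", a)] + totals[("item", b)] - both
    return both / either if either else 0.0


# Hypothetical boolean preferences: user -> items the user interacted with.
prefs = {"u1": ["A", "B"], "u2": ["A", "B", "C"], "u3": ["A"]}
emitted = (kv for items in prefs.values() for kv in map_user(items))
totals = reduce_counts(emitted)
```

Because each user is mapped independently and the reducer only sums counts,
the work can be spread across machines, which is the point of moving the
similarity computation into map-reduce.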



[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-01-29 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668475#action_12668475
 ] 

Ankur commented on MAHOUT-103:
--

I am hoping to make the above improvements after I get some review comments.

