from:"Ankur \(JIRA\)"

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860882#action_12860882
]

Ankur commented on MAHOUT-305:
--

CooccurrenceCombiner caches items internally and increments counts whenever it
sees a new value. This might lead to memory issues with really big datasets.
Moreover, for every (item-id, count) cached, a new object is created to apply
a simple procedure. Looks like overkill to me.

With the secondary sort, (item1, item2) pairs are already sorted so that for
each key (item1) all the (item1, item2) pairs appear before (item1, item3),
assuming item2 < item3. With this we simply increment the count each time we
see item2 and put the (item2, count) entry into a priority queue as soon as we
see item3 or something else. The size of the priority queue can be limited to
N. Check out ItemSimilarityEstimator.java.
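
The streaming approach described above can be sketched as follows. This is a
minimal Python illustration, not the code in ItemSimilarityEstimator.java: the
function name top_n_cooccurrences and the sample pairs are hypothetical, and it
assumes the pairs arrive already sorted by (item1, item2), as a secondary sort
would deliver them to a reducer. Only the running count for the current item2
and a heap of at most N entries are held in memory.

```python
import heapq
from itertools import groupby


def top_n_cooccurrences(sorted_pairs, n):
    """Yield (item1, [(item2, count), ...]) from pairs sorted by (item1, item2)."""
    for item1, group in groupby(sorted_pairs, key=lambda p: p[0]):
        heap = []  # min-heap of (count, item2), never larger than n
        for item2, run in groupby(group, key=lambda p: p[1]):
            count = sum(1 for _ in run)  # length of the run of identical pairs
            if len(heap) < n:
                heapq.heappush(heap, (count, item2))
            elif (count, item2) > heap[0]:
                # New entry beats the current smallest, so swap it in.
                heapq.heapreplace(heap, (count, item2))
        # Emit the survivors, highest count first.
        yield item1, [(i, c) for c, i in sorted(heap, reverse=True)]


# Hypothetical input, sorted the way a secondary sort would present it.
pairs = sorted([("a", "x"), ("a", "y"), ("a", "x"), ("a", "z"),
                ("a", "z"), ("a", "z"), ("b", "x")])
result = dict(top_n_cooccurrences(pairs, 2))
```

With n = 2 only the two most frequent partners of each item survive, so memory
per key is bounded by N rather than by the number of distinct co-occurring
items.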

Agreed we need better facilities for pruning, something like support-count (any
other?).

About merging, I feel CooccurrenceCombiner would be better with secondary sort.
Also it would be good if we can retain TupleWritable for future use. Other than
these, I have no issues with throwing away the code under
o.a.m.cf.taste.hadoop.cooccurrence.

Combine both cooccurrence-based CF M/R jobs
---

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make
recommendations based on item co-occurrence:
org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be
merged. Not sure exactly how to approach that but noting this in JIRA, per
Ankur.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860939#action_12860939
]

Ankur commented on MAHOUT-305:
--

> But the answer is the partitioner ?
Yes

> Am I right that (item1, item2) -> count is all that's needed ?
Yes

> And why is the priority queue needed ...

You could use both a co-occurrence count (your favorite) and a maximum number
of co-occurring pairs (say 1000). I have chosen a size of 100, so for any given
item the top 100 co-occurring items (by count) would be output. Although this
limits the size, it can still cause an explosion if there are very long
histories. Recall the users in the netflix dataset who have rated more than 10K
movies. One way of taking care of them is to apply 'sessionization', i.e.
output a co-occurrence pair only if both items are part of a session or satisfy
some other constraint. But that is not implemented yet.

> TupleWritable ...
Not really. I have a specialized implementation for my own purpose using
GenericWritable that wraps each object of TupleWritable.


[jira] Commented: (MAHOUT-344) Minhash based clustering

2010-03-31 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851756#action_12851756
]

Ankur commented on MAHOUT-344:
--

Drew, thanks for pitching in as I've been running super busy with some crap :-)

@Cristi
That's right but it's totally unnecessary as each of the mappers can do their
own initialization of hash functions. They will be the same hash functions if
they use the same seed for java.util.Random(). So the distributed cache can be
removed altogether with that change. The code will be shorter and simpler.
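
A small sketch of why the distributed cache is not needed: if every mapper
seeds its random generator with the same value, each one derives an identical
family of hash functions locally. The example uses Python's random.Random in
place of java.util.Random; the names make_hash_functions and
minhash_signature, the linear form (a*x + b) mod p, and the seed value are
assumptions for illustration, not taken from the patch.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, used here as the modulus


def make_hash_functions(num_functions, seed):
    """Build hash functions whose parameters depend only on the seed."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_functions)]
    # Bind a and b as defaults so each lambda keeps its own parameters.
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def minhash_signature(item_ids, hash_functions):
    """One minimum per hash function over the item's feature ids."""
    return [min(h(x) for x in item_ids) for h in hash_functions]


# Two "mappers" initialise independently but with the same seed.
mapper_a = make_hash_functions(4, seed=42)
mapper_b = make_hash_functions(4, seed=42)
sig_a = minhash_signature({3, 17, 256}, mapper_a)
sig_b = minhash_signature({3, 17, 256}, mapper_b)
```

Both signatures come out identical, so nothing has to be shipped to the mappers
except the seed and the number of hash functions.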

What is the min-cluster size you are using? How many hash functions? How
many hashes are grouped together?
We will need some tests to show how good the clusters are. As a start we can
compute a simple metric like the average similarity of items within a cluster,
aggregated over all clusters.
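
One way to compute such a metric is sketched below. It assumes each clustered
item is represented as a set of feature ids and uses Jaccard similarity, which
is the measure minhash approximates; both the representation and the function
names are assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_intra_cluster_similarity(clusters):
    """Mean over clusters of the mean pairwise similarity inside each cluster."""
    per_cluster = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a singleton cluster has no pairs to compare
        per_cluster.append(sum(jaccard(a, b) for a, b in pairs) / len(pairs))
    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0


# Hypothetical clusters: the first is perfectly tight, the second is loose.
clusters = [
    [{1, 2, 3}, {1, 2, 3}],
    [{1, 2}, {2, 3}, {4}],
]
score = average_intra_cluster_similarity(clusters)
```

A score near 1 means members of a cluster look alike; comparing it across
different choices of min-cluster size and number of hash functions gives a
first rough quality signal.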

Minhash based clustering
-

Key: MAHOUT-344
URL: https://issues.apache.org/jira/browse/MAHOUT-344
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.3
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-344-v1.patch

Minhash clustering performs probabilistic dimension reduction of high
dimensional data. The essence of the technique is to hash each item using
multiple independent hash functions such that the probability of collision of
similar items is higher. Multiple such hash tables can then be constructed
to answer near neighbor type of queries efficiently.


[jira] Created: (MAHOUT-344) Minhash based clustering

2010-03-22 Thread Ankur (JIRA)

Minhash based clustering 
-

 Key: MAHOUT-344
 URL: https://issues.apache.org/jira/browse/MAHOUT-344
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Reporter: Ankur


Minhash clustering performs probabilistic dimension reduction of high 
dimensional data. The essence of the technique is to hash each item using 
multiple independent hash functions such that the probability of collision of 
similar items is higher. Multiple such hash tables can then be constructed  to 
answer near neighbor type of queries efficiently.


[jira] Updated: (MAHOUT-344) Minhash based clustering

2010-03-22 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-344:
-

Affects Version/s: 0.3
 Assignee: Ankur


[jira] Updated: (MAHOUT-344) Minhash based clustering

2010-03-22 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-344:
-

Attachment: MAHOUT-344-v1.patch

As per Yonik's law of patches, I am submitting my implementation. Please feel
free to provide ideas for improvement or even submit an improved patch.


[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.

2010-03-03 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840601#action_12840601
 ] 

Ankur commented on MAHOUT-320:
--

Binary comparison looks more or less the same in both the classes. It's the data
serialization/deserialization where Bigram scores over IntPairWritable. Bigram
encodes/decodes the data in VInt format, which uses zero-compressed encoding;
for more info see o.a.h.io.WritableUtils.java. The encoding can give
considerable savings when serializing huge amounts of numeric data.
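
To illustrate where the savings come from, here is a Python sketch of a
zero-compressed integer encoding. It follows the general scheme of Hadoop's
WritableUtils (small values in a single byte, otherwise a marker byte carrying
sign and length followed by the significant bytes); the exact byte layout here
is an assumption and should be checked against the Hadoop source before relying
on it. The function names are hypothetical.

```python
def encode_vint(value):
    """Encode an integer using as few bytes as its magnitude needs."""
    if -112 <= value <= 127:
        return bytes([value & 0xFF])  # small values fit in one byte
    marker = -112
    if value < 0:
        value = ~value  # one's complement, so the payload is non-negative
        marker = -120
    num_bytes = (value.bit_length() + 7) // 8
    marker -= num_bytes  # the marker encodes both sign and payload length
    return bytes([marker & 0xFF]) + value.to_bytes(num_bytes, "big")


def decode_vint(data):
    """Inverse of encode_vint for a single encoded value."""
    first = data[0] - 256 if data[0] > 127 else data[0]  # signed byte
    if first >= -112:
        return first
    negative = first < -120
    num_bytes = -(first + 120) if negative else -(first + 112)
    magnitude = int.from_bytes(data[1:1 + num_bytes], "big")
    return ~magnitude if negative else magnitude


samples = [0, 5, 127, 128, -112, -113, 300, -300, 2**31 - 1, -2**31]
sizes = {v: len(encode_vint(v)) for v in samples}
```

A fixed-width int always costs 4 bytes, while here values between -112 and 127
cost 1 byte and 300 costs 3. Data dominated by small ids or counts therefore
shrinks noticeably, at the price of one extra byte for the largest values.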

 Modify IntPairWritable in LDA implementation to be binary comparable to 
 improve performance.
 

 Key: MAHOUT-320
 URL: https://issues.apache.org/jira/browse/MAHOUT-320
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Robin Anil
Priority: Minor
 Attachments: MAHOUT-320.patch


 Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to 
 be binary comparable will improve the performance of the comparison 
 operations during a sort because no marshaling will need to occur to compare 
 IntPairWritable instances.


[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.

2010-03-03 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840619#action_12840619
 ] 

Ankur commented on MAHOUT-320:
--

And yes, I see the issue with the (firstb1 - firstb2) thing in Bigram. This
definitely needs to be fixed. I don't mind replacing either one with the other.
Just that we should be using VInt format for ser/de on the wire.


[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.

2010-03-03 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840636#action_12840636
 ] 

Ankur commented on MAHOUT-320:
--

Robin, Can you update your revision and create a fresh patch ?

 Modify IntPairWritable in LDA implementation to be binary comparable to 
 improve performance.
 

 Key: MAHOUT-320
 URL: https://issues.apache.org/jira/browse/MAHOUT-320
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Robin Anil
Priority: Minor
 Attachments: MAHOUT-320.patch, MAHOUT-320.patch, MAHOUT-320.patch


 Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to 
 be binary comparable will improve the performance of the comparison 
 operations during a sort because no marshaling will need to occur to compare 
 IntPairWritable instances.


[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.

2010-03-03 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841059#action_12841059
 ] 

Ankur commented on MAHOUT-320:
--

It still complains that it cannot find the file to patch -
core/src/main/java/org/apache/mahout/common/IntPairWritable.java. Also it looks
like the unit test for IntPairWritable is still lying under
o.a.m.clustering.lda.

 Modify IntPairWritable in LDA implementation to be binary comparable to 
 improve performance.
 

 Key: MAHOUT-320
 URL: https://issues.apache.org/jira/browse/MAHOUT-320
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Robin Anil
Priority: Minor
 Attachments: MAHOUT-320.patch, MAHOUT-320.patch, MAHOUT-320.patch, 
 MAHOUT-320.patch


 Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to 
 be binary comparable will improve the performance of the comparison 
 operations during a sort because no marshaling will need to occur to compare 
 IntPairWritable instances.


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837192#action_12837192
 ] 

Ankur commented on MAHOUT-305:
--

Just picking a random N% of the data for each user, calculating avg precision
and recall across all users in the test data, and then repeating the test K
times to take the average across all runs should be a reasonably fair
assessment IMHO.

Mahouters, your opinion here would be valuable.


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837198#action_12837198
 ] 

Ankur commented on MAHOUT-305:
--

I am not proposing that we choose a random subset over all movies. Rather,
choose a random N% of movie ratings from EACH user and use it as test data to
get precision and recall across this test set. Also repeat this procedure X
times to get a fair assessment. They seem to do it the same way -
http://www2007.org/papers/paper570.pdf
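
The procedure can be sketched like this: hold out a random fraction of each
user's ratings, evaluate, and average over several runs. The function names,
the evaluate callback and the seed are hypothetical; the callback stands in for
whatever computes precision or recall from a train/test pair.

```python
import random


def split_per_user(ratings_by_user, test_fraction, rng):
    """Hold out a random fraction of EACH user's rated movies as test data."""
    train, test = {}, {}
    for user, items in ratings_by_user.items():
        items = sorted(items)  # fixed order, so the shuffle depends only on rng
        rng.shuffle(items)
        cut = int(len(items) * test_fraction)
        test[user] = set(items[:cut])
        train[user] = set(items[cut:])
    return train, test


def repeated_evaluation(ratings_by_user, test_fraction, runs, evaluate, seed=0):
    """Average evaluate(train, test) over several independent random splits."""
    rng = random.Random(seed)
    scores = [evaluate(*split_per_user(ratings_by_user, test_fraction, rng))
              for _ in range(runs)]
    return sum(scores) / len(scores)


# Hypothetical data: u1 rated 10 movies, u2 rated 20.
data = {"u1": set(range(10)), "u2": set(range(20))}
train, test = split_per_user(data, 0.2, random.Random(1))
# Placeholder metric: total number of held-out ratings.
avg_held_out = repeated_evaluation(
    data, 0.2, runs=3, evaluate=lambda tr, te: sum(len(v) for v in te.values()))
```

Splitting per user keeps every user represented in both sets, which a single
random cut over all ratings would not guarantee.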


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837205#action_12837205
 ] 

Ankur commented on MAHOUT-305:
--

Well! Not factoring ratings into the similarity metric but having them
influence the train/test data for evaluation doesn't sound fair to me. So I
don't think both of us agree on the evaluation methodology.


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837230#action_12837230
 ] 

Ankur commented on MAHOUT-305:
--

*smile* There we go. 
Our last steps are essentially different. I don't do any multiplication, 
instead I just join (user, movie) on 'movie'  with co-occurrence set followed 
by a group on 'user' to calculate recommendations. I guess while joining I 
should multiply ratings with co-occurrence counts for better evaluation.
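
The join-and-group step with the rating weighting suggested above might look
like the following sketch. The data layout (nested dicts), the function name
and the toy values are assumptions; in the real jobs the join and the grouping
would be separate M/R phases.

```python
from collections import defaultdict


def recommend(user_ratings, cooccurrence):
    """user_ratings: {user: {movie: rating}}; cooccurrence: {movie: {other: count}}."""
    scores = defaultdict(lambda: defaultdict(float))
    for user, ratings in user_ratings.items():
        for movie, rating in ratings.items():  # join on 'movie'
            for other, count in cooccurrence.get(movie, {}).items():
                scores[user][other] += rating * count  # rating-weighted vote
    # Group on 'user': rank each user's candidates, best score first.
    return {user: sorted(cands.items(), key=lambda kv: (-kv[1], kv[0]))
            for user, cands in scores.items()}


# Toy data: u1 loved m1 and disliked m2.
user_ratings = {"u1": {"m1": 5, "m2": 1}}
cooccurrence = {"m1": {"m3": 2, "m4": 1}, "m2": {"m4": 3}}
recs = recommend(user_ratings, cooccurrence)
```

Here m3 scores 5*2 = 10 and m4 scores 5*1 + 1*3 = 8, so m3 ranks first. With
unweighted counts m4 (1 + 3 = 4) would have beaten m3 (2), which shows how
multiplying by ratings can change the ranking.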

Can you give a small illustrative example with dummy data to describe your last 
steps? 


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836543#action_12836543
]

Ankur commented on MAHOUT-305:
--

Sean, thanks for filing the jira. Noting points from our discussion here.

1. Need to decide on the dataset to run both the implementations on. I have the
netflix dataset in mind, but a strange thing I observed during my tests with it
is that there were 2 - 3 users who rated more than 10,000 movies! This seemed a
little odd to me. Can you or someone else who has had experience with the
dataset validate my observation ?

2. Both the implementations need to run on the dataset in an identical
environment to gauge performance and accuracy. For accuracy I believe we need
to do a Precision-Recall test. My understanding of it is that

a) Do an 80-20 split of the data (80% train and 20% test) with the split
happening on a timeline.
b) Feed training data to the algorithm and generate recommendations for a
subset of users from the training data.
c) Compare those recommendations with items actually present in the
history of the user in the test data.
d) Calculate precision = tp / (tp + fp) = (recommendations actually
present in user's history) / (total items recommended)
e) Calculate recall = tp / (tp + fn) = (recommendations actually
present in user's history) / (total items in user's history)
f) Finally take a simple avg of both across all the users to get approx
global precision/recall.

Please feel free to correct any of the steps above if I misunderstood anything.
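
Steps c) to f) can be sketched as follows, assuming recommendations and test
histories are available per user. The function names and the toy data are
hypothetical, and the averaging is a simple per-user mean, as step f) suggests.

```python
def precision_recall(recommended, actual):
    """Steps c) to e) for one user; both arguments hold unique item ids."""
    hits = len(set(recommended) & set(actual))  # true positives
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


def average_precision_recall(recs_by_user, test_by_user):
    """Step f): simple mean of per-user precision and recall."""
    results = [precision_recall(recs_by_user.get(user, []), items)
               for user, items in test_by_user.items()]
    n = len(results) or 1  # avoid dividing by zero on empty input
    return (sum(p for p, _ in results) / n,
            sum(r for _, r in results) / n)


# Toy data: u1 gets 2 of 4 recommendations right, u2 gets none.
recs = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y"]}
test = {"u1": {"a", "b", "z"}, "u2": {"q"}}
avg_precision, avg_recall = average_precision_recall(recs, test)
```

For u1 precision is 2/4 and recall is 2/3; for u2 both are 0, so the averages
are 0.25 and 1/3.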


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1283#action_1283
 ] 

Ankur commented on MAHOUT-305:
--

Hey Sean,
   Have you played with the netflix dataset? Are there really users who
have rated more than 10,000 movies? For the PR test do we have something
already that will work in this case, or is some coding required ? Any other
thoughts ?

 Combine both cooccurrence-based CF M/R jobs
 ---

 Key: MAHOUT-305
 URL: https://issues.apache.org/jira/browse/MAHOUT-305
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Priority: Minor

 We have two different but essentially identical MapReduce jobs to make 
 recommendations based on item co-occurrence: 
 org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
 merged. Not sure exactly how to approach that but noting this in JIRA, per 
 Ankur.


[jira] Assigned: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur reassigned MAHOUT-305:


Assignee: Ankur


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836725#action_12836725
]

Ankur commented on MAHOUT-305:
--

Typically when doing a train-test data split, we divide the data on a timeline.
So as a simple example, if we have 10 days of data then we would keep the last
2 days of data as test data and the remainder as training data. If we remove
all 5 star ratings the crude way, we may not be able to ensure this condition,
not a hard one but still a best practice AFAIK. Also I am not sure if 5 star
ratings would be 20 or even 10% of the total data.

The crude way you mentioned is ok for a start but I am not sure if it's a fair
evaluation or not. Also with this we would effectively be calculating precision
and recall as
precision = (5 star recommendations actually present in user's history) /
(total 5 star recommendations)
recall = (5 star recommendations actually present in user's history) / (total
5 star items in user's history)

Is that what you mean?


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837123#action_12837123
]

Ankur commented on MAHOUT-305:
--

With co-occurrence analysis we are dropping ratings. So if a lot of the people
who watched Harry Potter also watched Maid in Manhattan, it will have a higher
chance of getting recommended regardless of ratings.

I am trying not to be influenced too much by ratings as that is not the
strength of this algorithm. Where it really shines is when you have lots and
lots of sparse user click data where a click may be present or absent.
Something like an online book store or a shopping site. We are sticking with
netflix as there is no such publicly available dataset AFAIK.

Ok, so moving forward with the action plan, here is what I propose to do.
Please feel free to suggest modifications.

1. For each user take out the most recent movies that he has rated 3 or 4 or 5
as TEST data. Use the remaining as TRAIN data.
2. Run both implementations in an identical environment on test data and record
runtimes and results.
3. Join recommendation results with TEST data on 'user' key and calculate
precision and recall.
4. Report average precision and recall.

Ok, so when separating top ratings as TEST data, for each user:

Precision @ 10 = (3,4,5 rating movies recommended actually present) / 10
Recall @ 10 = (3,4,5 rating movies recommended actually present) / (all
3,4,5 movies seen by user)

Hope this was more clear.
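
A sketch of the split in step 1 and of the @10 metrics, under stated
assumptions: each user's history is a list of (timestamp, movie, rating)
tuples, the number of held-out movies per user is a parameter, and recall is
taken over the held-out movies. All names are hypothetical.

```python
def split_recent_liked(history, holdout, min_rating=3):
    """Hold out the most recent movies rated >= min_rating as TEST data."""
    liked = sorted((e for e in history if e[2] >= min_rating), reverse=True)
    test = {movie for _, movie, _ in liked[:holdout]}  # newest liked movies
    train = [e for e in history if e[1] not in test]
    return train, test


def precision_recall_at_k(recommended, test, k=10):
    """Precision@k and recall@k against the held-out movies of one user."""
    hits = len(set(recommended[:k]) & test)
    return hits / k, (hits / len(test) if test else 0.0)


# Toy history for one user: (timestamp, movie, rating).
history = [(1, "a", 5), (2, "b", 2), (3, "c", 4), (4, "d", 3), (5, "e", 1)]
train, test = split_recent_liked(history, holdout=2)
p_at_10, r_at_10 = precision_recall_at_k(["c", "x", "y"], test, k=10)
```

In the toy data the two most recent movies rated 3 or higher are d and c, so
they become TEST data. One of them is recommended, giving precision@10 = 1/10
and recall@10 = 1/2.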


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794036#action_12794036
]

Ankur commented on MAHOUT-103:
--

I skimmed through your version and what's present in .item package. A few
immediate things that come to mind are
1. Moving to AbstractJob
2. Re-factoring to separate map, reduce and job classes. Personally I hate that
coz the code base just bloats when the number of M/R jobs increases.

I have been trying to set up my IDEA using the IntelliJ.codestyle.xml provided
by the mahout cwiki. I placed the file under idea_home/config/codeStyles and
restarted IDEA but it still does not show an import option in
File->Settings->Code Style.
IDEA shows the following messages in the background
Field not copied JAVA_INDENT_OPTIONS
Field not copied JSP_INDENT_OPTIONS
Field not copied XML_INDENT_OPTIONS
Field not copied OTHER_INDENT_OPTIONS
Field not copied FIELD_TYPE_TO_NAME
Field not copied STATIC_FIELD_TYPE_TO_NAME
Field not copied PARAMETER_TYPE_TO_NAME
Field not copied LOCAL_VARIABLE_TYPE_TO_NAME
Field not copied PACKAGES_TO_USE_IMPORT_ON_DEMAND
Field not copied IMPORT_LAYOUT_TABLE

I am using v 8.1.4

Once I set this up, I am gonna look at your version and what's there in .item
package.

Co-occurence based nearest neighbourhood

Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-103.patch, mahout-103.patch.v1,
mahout-103.patch.v2, prepare.pl, run.sh

Nearest neighborhood type queries for users/items can be answered efficiently
and effectively by analyzing the co-occurrence model of a user/item w.r.t
another. This patch aims at providing an implementation for answering such
queries based upon simple co-occurrence counts.


[jira] Issue Comment Edited: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794036#action_12794036
]

Ankur edited comment on MAHOUT-103 at 12/23/09 12:41 PM:
-

I skimmed through your version and what's present in .item package. Few
immediate things that come to mind are
1. Moving to AbstractJob
2. Re-factoring to separate map,reduce and job classes. Personally I hate that
coz the code base just bloats when number of M/R jobs increase.

I am using v 8.1.4

Once i set this up, I am gonna look at your version and what's there in .item
package

was (Author: ankur):
I skimmed through your version and what's present in .item package. Few
immediate things that come to mind are
1. Moving to AbstractJob
2. Re-factoring to separate map,reduce and job classes. Personally I hate that
coz it the code base just bloats when number of M/R jobs increase.

I am using v 8.1.4

Once i set this up, I am gonna look at your version and what's there in .item
package


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-23 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794133#action_12794133
]

Ankur commented on MAHOUT-103:
--

Your changes don't look too mutating and yes, roughly speaking, I am ok. Since
we are talking about committing this, I would like to say that I tested this
for correctness on a very small hand-coded data-set and then ran it on
netflix-data. However I couldn't verify its correctness over netflix data,
though I am pretty confident it works correctly. That is why I was hoping to
have a couple of unit tests:

1. To verify that similar items are identified correctly.
2. None of the seen items are recommended for a user.

But since this is not the final version, can you suggest any other approach to
be 100% sure of correctness? I don't want something to be committed only to
discover a silly issue later just because we didn't take extra care.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-22 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793628#action_12793628
]

Ankur commented on MAHOUT-103:
--

Evolving the code to integrate better with the existing stuff is fine with me.
I am in general ok with throwing away code if it can be replaced by existing
stuff that is better.

However, I don't think it's a good idea to try to come up with a unified
approach of generating hadoop based recommendations. I am afraid we'll create
more problems than we'd solve.

I see recommendations in the hadoop world as the following linear chain of M/R
jobs

Data-Formatting --> Data Filter --> Core Recommender --> Post Processor

The last 2 jobs can themselves be comprised of 1 or more M/R jobs.

There is I think .

Let me come up with the unit-tests and code documentation. After that you can
start doing the changes. Thanks a lot for help. Appreciate that :-)

BTW did u have a chance to actually run it on netflix-data ?

Co-occurence based nearest neighbourhood

Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl,
run.sh


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-17 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791971#action_12791971
]

Ankur commented on MAHOUT-103:
--

Ok, so here is the next version, which I again re-wrote completely :-( for
performance reasons. The version now computes item similarity and uses that to
generate recommendations in true Hadoop fashion. In a nutshell, the
recommendations are generated in 2 steps:

1. Join item-similarity data (generated by analyzing co-occurrence) with
user-click data.
2. Group the output of step 1 on the user key so that a reducer receives all
potential candidates for a user along with all items already clicked/seen by
him, so that they can be excluded from the final recommendation set.
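
The two steps can be sketched in memory as follows. This is an illustrative Python sketch, not the attached patch: the function name, the input shapes, and the choice to sum scores per candidate are assumptions (the summing follows the scoring idea described in another comment on this issue).

```python
from collections import defaultdict

def recommend(similar_items, user_clicks, top_n=10):
    """similar_items: {item: [(similar_item, score), ...]}
    user_clicks: iterable of (user, item) pairs."""
    # Step 1: join each click with the similar items of the clicked item.
    joined = []                # (user, candidate, score) records
    seen = defaultdict(set)    # items each user has already clicked
    for user, item in user_clicks:
        seen[user].add(item)
        for candidate, score in similar_items.get(item, []):
            joined.append((user, candidate, score))

    # Step 2: group by user, sum scores, and drop already-seen items.
    scores = defaultdict(lambda: defaultdict(float))
    for user, candidate, score in joined:
        if candidate not in seen[user]:
            scores[user][candidate] += score

    # Rank candidates by descending score, breaking ties by item id.
    return {
        user: sorted(cands.items(), key=lambda p: (-p[1], p[0]))[:top_n]
        for user, cands in scores.items()
    }
```

In a real job the `seen` set and the candidate records would arrive together in the reducer for a given user key, which is what makes the exclusion possible without a separate lookup.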

Also attached:
1. Perl script to convert the Netflix data into the required format (userId \t
movieId).
2. Bash script used to run it on a 50-node Hadoop cluster. The recommendations
are generated for all the users in less than 45 min.

Co-occurence based nearest neighbourhood


[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-17 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-103:
-

Attachment: run.sh
prepare.pl
mahout-103.patch.v2

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch, mahout-103.patch.v1, 
 mahout-103.patch.v2, prepare.pl, run.sh


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.
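
To make "simple co-occurrence counts" concrete, here is a small in-memory Python sketch. It is only an illustration of the idea, not the patch itself; the function names and the tie-breaking rule are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_counts(pairs):
    """Count, for every unordered item pair, how many users have both items."""
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)
    counts = Counter()
    for items in items_by_user.values():
        # sorted() gives each unordered pair one canonical (a, b) key.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

def nearest_neighbours(counts, item, n=3):
    """Return the n items that co-occur most often with `item`."""
    neighbours = []
    for (a, b), count in counts.items():
        if a == item:
            neighbours.append((b, count))
        elif b == item:
            neighbours.append((a, count))
    # Highest count first; ties broken by item id (an arbitrary choice).
    return sorted(neighbours, key=lambda p: (-p[1], p[0]))[:n]
```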


[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-12-17 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-103:
-

Attachment: (was: jira-103.patch)

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, 
 run.sh


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-24 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781838#action_12781838
 ] 

Ankur commented on MAHOUT-103:
--

For this co-occurrence based recommender I am planning to write a set of
map-reduce jobs that compute recommendations for users as follows:

1. Take the user's item history.
2. For each item in his history, fetch the top-N similar items (similarity
based on co-occurrence).
3. Add the co-occurrence scores if an item appears more than once (NOT a weighted
avg). Consider, e.g., user history { M1, M2, M3 } and the top-3 similar movies
for each of these along with co-occurrence scores:

M1 - (A, 5), (B, 4), (C, 2)
M2 - (D, 6), (E, 3), (F, 2)
M3 - (G, 8), (C, 5), (B, 2)  

So the final scores in decreasing order will look like
(G, 8)
(C, 7)
(B, 6)
(D, 6)
(A, 5)
(E, 3)
(F, 2)

The idea I want to capture is that a candidate item gets a higher score if it's
similar to more items in the user's click history.
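
The worked example above can be reproduced with a few lines of Python. This is a sketch of the summing rule only; the function name is hypothetical, and breaking ties by item id is an assumption (the example lists B before D, which this rule happens to match).

```python
from collections import defaultdict

def aggregate_scores(history, top_similar):
    """Sum co-occurrence scores of candidates across all items in the history."""
    totals = defaultdict(int)
    for item in history:
        for candidate, score in top_similar.get(item, []):
            totals[candidate] += score
    # Decreasing score; ties broken by item id.
    return sorted(totals.items(), key=lambda p: (-p[1], p[0]))

top_similar = {
    "M1": [("A", 5), ("B", 4), ("C", 2)],
    "M2": [("D", 6), ("E", 3), ("F", 2)],
    "M3": [("G", 8), ("C", 5), ("B", 2)],
}
ranked = aggregate_scores(["M1", "M2", "M3"], top_similar)
```

Note that this sketch does not remove items the user has already seen; that exclusion would be a separate step.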

Do you see any issue with this approach ? Any other better approach that you 
can think of ?

As for the precision-recall test, I am still trying to see how to divide the 
data in 'train' and 'test' for a fair evaluation. How do we do it in the 
existing code ?
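
One common way to set up such a test, sketched here in Python, is to hold out part of each user's history and check how many held-out items show up in the recommendations. This is a generic illustration and makes no claim about how the existing Mahout code does it; the function names, the 30% hold-out fraction, and the fixed seed are assumptions.

```python
import random

def split_history(items, test_fraction=0.3, seed=0):
    """Split one user's items into (train, test); at least one item is held out."""
    rng = random.Random(seed)
    shuffled = sorted(items)   # sort first so the split is reproducible
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]

def precision_recall(recommended, held_out):
    """Precision and recall of a recommendation list against held-out items."""
    hits = len(set(recommended) & set(held_out))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall
```

The recommender would be trained on the `train` part only, and `precision_recall` would then be averaged over users.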

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch, mahout-103.patch.v1


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-103:
-

Attachment: mahout-103.patch.v1

Ok,  so here's the revised version of the algorithm that this jira proposes to 
implement.  I have tried to make the code as clean and readable as possible. 
Next I plan to write some test code for preparing and running on Netflix prize 
dataset. As a part of data preparation the 'dates' and 'ratings' will be 
dropped and algo will run on (user-id, item-id) pairs. 
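
As a rough picture of this preparation step, the Python sketch below turns the lines of one Netflix Prize training file into tab-separated user/movie pairs. It assumes the usual layout of those files (a `movieId:` header line followed by `customerId,rating,date` lines); it is not the test code mentioned above, and the function name is hypothetical.

```python
def netflix_to_pairs(lines):
    """Yield 'userId<TAB>movieId' strings, dropping ratings and dates."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Header line such as "1:" names the movie for the rows below it.
            movie_id = line[:-1]
            continue
        if movie_id is None:
            raise ValueError("rating line found before any 'movieId:' header")
        user_id, _rating, _date = line.split(",")
        yield f"{user_id}\t{movie_id}"
```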

Not sure how we can include age-related decay/boost when counting
co-occurrence. Maybe others can pitch in once we have the basic stuff working
fine.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch, mahout-103.patch.v1


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-17 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778911#action_12778911
 ] 

Ankur commented on MAHOUT-103:
--

Thanks for the quick lookup, appreciate that :-).

Putting in a subpackage, sure, for now I'll just leave all the main code under 
one subpackage (how about 'bigram') until u have it sorted out. 

As for the code, once I have the test code ready for the Netflix dataset and at
least one unit test, it will be good to go. One question: how do we apply
precision-recall or RMSE or any other evaluation technique to the results, since
all we are doing is counting co-occurrence?

Do u have the JIRA for this hadoop related bug? 

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch, mahout-103.patch.v1


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776939#action_12776939
 ] 

Ankur commented on MAHOUT-103:
--

> Re-post an updated patch

Sure I'll have the updated code coming by early next week.

> If it's basically sound I'd like to mention it

+10, The more people know about it the better chances it has of being used :-)

> I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop
> the rating ...

Simply dropping the rating might introduce too much noise. I was thinking of
keeping only those that have ratings  2.5 (or 2 to be more liberal).
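
A threshold filter of this kind could look like the Python sketch below. The comparison operator is missing from the archived text, so the strict "greater than" used here is an assumption, as are the function name and the tuple layout.

```python
def filter_by_rating(ratings, threshold=2.5):
    """Keep (user, item) pairs whose rating exceeds the threshold.

    ratings: iterable of (user, item, rating) tuples.
    Assumption: strict '>' comparison; the original operator is not recoverable.
    """
    return [(user, item) for user, item, rating in ratings if rating > threshold]
```

Lowering `threshold` to 2 gives the more liberal variant mentioned above.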

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776966#action_12776966
]

Ankur commented on MAHOUT-103:
--

In that case dropping ratings might not be such a good idea and may lead to bad
results. Consider the following movies that a user might have seen with the
scores

Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2

Assuming that a lot of people have watched these movies and didn't like the
two sequels, the sequels will still get high similarity scores w.r.t.
Matrix going purely by co-occurrence. IMHO, that leaves us with the following
2 alternatives:

1. Add the ratings when counting co-occurrence and hope that better ones will
stand out even if they co-occur less.
2. Apply a re-scorer that re-ranks the similar items for a given item
based on their average scores.

Point 1 is something I am thinking of trying out.
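
Both alternatives can be sketched in a few lines of Python. The exact weighting is not spelled out in this comment, so the rule used for alternative 1 (each co-occurrence of item b with item a contributes b's rating instead of 1) is an assumption, as are the function names.

```python
from collections import defaultdict
from itertools import permutations

def rating_weighted_cooccurrence(ratings):
    """Alternative 1. ratings: iterable of (user, item, rating).
    Returns {item: {other_item: weight}}."""
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating
    weights = defaultdict(lambda: defaultdict(float))
    for prefs in by_user.values():
        for a, b in permutations(prefs, 2):
            # Assumption: b's rating is added to the (a, b) weight.
            weights[a][b] += prefs[b]
    return weights

def rescore_by_average(similar, ratings):
    """Alternative 2. Re-rank a list of similar item ids by average rating."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _user, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
    # Highest average first; items without ratings get key 0.0 and sort last.
    return sorted(similar, key=lambda i: -(totals[i] / counts[i]) if counts[i] else 0.0)
```

With two users who rate the sequel poorly and one who also rates another film highly, the other film can outweigh the sequel under alternative 1 even though it co-occurs with Matrix only once.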

Co-occurence based nearest neighbourhood


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-17 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682915#action_12682915
 ] 

Ankur commented on MAHOUT-103:
--

Hey Sean, Thanks for review comments. Some specific questions

1. This indeed is doing approximately the same thing as
TanimotoCoefficientSimilarity and BooleanPreferenceUser. The difference is
that the similarity computation is parallelized in map-reduce.

2. The idea of introducing a FitnessEvaluator was to allow people to apply 
domain specific things when comparing a preference. Are you suggesting the 
replacement of FitnessEvaluator with ItemSimilarity ?

3. The Hadoop job was written to run this thing stand-alone. What modifications 
do you feel would be appropriate for integration into the framework?
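
For reference on point 1, the Tanimoto coefficient on boolean preferences is the size of the intersection divided by the size of the union of the two sets of users. The Python sketch below shows only that formula; it is not Mahout's implementation and does not show the map-reduce parallelization.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient between two sets of user ids."""
    a, b = set(users_a), set(users_b)
    union = len(a | b)
    # Two empty sets are treated as having zero similarity.
    return len(a & b) / union if union else 0.0
```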


 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Assigned: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-01-29 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur reassigned MAHOUT-103:


Assignee: Ankur

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur

 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-01-29 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-103:
-

Attachment: jira-103.patch

Ok here is a quick patch with just enough documentation and no unit tests or 
dummy data. The code works but following things can be improved...

1. The code can be better structured and integrated.
2. Logging needs to be added.
3.  Documentation can be more informative. 
4. Dummy data and unit tests need to be added. 


 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-01-29 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668475#action_12668475
 ] 

Ankur commented on MAHOUT-103:
--

I am hoping to make the above improvements after I get some review comments.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.


[jira] Created: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-01-20 Thread Ankur (JIRA)

Co-occurence based nearest neighbourhood


 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur


Nearest neighborhood type queries for users/items can be answered efficiently 
and effectively by analyzing the co-occurrence model of a user/item w.r.t 
another. This patch aims at providing an implementation for answering such 
queries based upon simple co-occurrence counts.


[jira] Commented: (MAHOUT-19) Hierarchial clusterer

2009-01-13 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12663319#action_12663319
]

Ankur commented on MAHOUT-19:
-

Hi Karl, Welcome back :-)
Can you share the following few things about this patch?

1. Assuming you are training the tree top-down, what division criterion are
you using?
2. How well does it scale?
3. Was the data on which this was tried sparse?
4. What distance metric has been used?

Basically I have a use case wherein I have a set of 5 - 10 million urls which
have an inherent hierarchical relationship and a set of user clicks on them. I
would like to cluster them in a tree and use the model to answer the near
neighborhood type queries, i.e. which urls are related to which other urls. I did
implement a sequential bottom-up hierarchical clustering algorithm but the
complexity is too high for my data set. I then thought about implementing a
top-down hierarchical clustering algorithm using the Jaccard coefficient as my
distance measure and came across this patch.

Can you suggest if this patch will help?

Hierarchial clusterer
-

Key: MAHOUT-19
URL: https://issues.apache.org/jira/browse/MAHOUT-19
Project: Mahout
Issue Type: New Feature
Components: Clustering
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor
Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt,
MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png

In a hierarchial clusterer the instances are the leaf nodes in a tree where
branch nodes contain the mean features of and the distance between their
children.
For performance reasons I always trained trees from the top-down. I have
been told that it can cause various effects I never encountered. And I
believe Huffman solved his problem by training bottom-up? The thing is, I
don't think it is possible to train the tree top-down using map reduce. I do
however think it is possible to train it bottom-up. I would very much
appreciate any thoughts on this.
Once this tree is trained one can extract clusters in various ways. The mean
distance between all instances is usually a good maximum distance to allow
between nodes when navigating the tree in search for a cluster.
Navigating the tree and gather nodes that are not too far away from each
other is usually instant if the tree is available in memory or persisted in a
smart way. In my experience there is not much to win from extracting all
clusters from start. Also, it usually makes sense to allow for the user to
modify the cluster boundary variables in real time using a slider or perhaps
present the named summary of neighbouring clusters, blacklist paths in the
tree, etc. It is also not too bad to use secondary classification on the
instances to create worm holes in the tree. I always thought it would be cool
to visualize it using Touchgraph.
My focus is on clustering text documents for an instant "more like this" feature
in search engines, using Tanimoto similarity on the vector spaces to
calculate the distance.
See LUCENE-1025 for a single threaded all in memory proof of concept of a
hierarchial clusterer.


[jira] Updated: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)

2008-04-10 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-4:
---

Attachment: (was: Mahout_EM.patch)

 Simple prototype for Expectation Maximization (EM)
 --

 Key: MAHOUT-4
 URL: https://issues.apache.org/jira/browse/MAHOUT-4
 Project: Mahout
  Issue Type: New Feature
Reporter: Ankur
 Attachments: dp-cluster.r


 Create a simple prototype implementing Expectation Maximization - EM that 
 demonstrates the algorithm functionality given a set of (user, click-url) 
 data.
 The prototype should be functionally complete and should serve as a basis for 
 the Map-Reduce version of the EM algorithm.


[jira] Updated: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)

2008-02-21 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-4:
---

Attachment: (was: PLSI_EM.patch)

 Simple prototype for Expectation Maximization (EM)
 --

 Key: MAHOUT-4
 URL: https://issues.apache.org/jira/browse/MAHOUT-4
 Project: Mahout
  Issue Type: New Feature
Reporter: Ankur

 Create a simple prototype implementing Expectation Maximization - EM that 
 demonstrates the algorithm functionality given a set of (user, click-url) 
 data.
 The prototype should be functionally complete and should serve as a basis for 
 the Map-Reduce version of the EM algorithm.


[jira] Updated: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)

2008-02-21 Thread Ankur (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-4:
---

Attachment: Mahout_EM.patch

Oops! Looks like my Subversive Eclipse plugin did something wacky while
generating the patch. Sorry for that. Please find the rectified patch file.
Hope this goes through fine.

 Simple prototype for Expectation Maximization (EM)
 --

 Key: MAHOUT-4
 URL: https://issues.apache.org/jira/browse/MAHOUT-4
 Project: Mahout
  Issue Type: New Feature
Reporter: Ankur
 Attachments: Mahout_EM.patch


 Create a simple prototype implementing Expectation Maximization - EM that 
 demonstrates the algorithm functionality given a set of (user, click-url) 
 data.
 The prototype should be functionally complete and should serve as a basis for 
 the Map-Reduce version of the EM algorithm.

