[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860882#action_12860882 ]

Ankur commented on MAHOUT-305:
------------------------------

CooccurrenceCombiner caches items internally and increments counts whenever it sees a new value. This might lead to memory issues with some really big datasets. Moreover, for every (item-id, count) cached, a new object is created to apply a simple procedure. That looks like overkill to me.

With the secondary sort, (item1, item2) pairs are already sorted, so for each key (item1) all the (item1, item2) pairs appear before the (item1, item3) pairs, assuming item2 < item3. With this we simply increment the count each time we see item2 and put the (item2, count) entry into a priority queue as soon as we see item3 or something else. The size of the priority queue can be limited to N. Check out ItemSimilarityEstimator.java.

Agreed, we need better facilities for pruning, something like support-count (any other?).

About merging, I feel CooccurrenceCombiner would be better with secondary sort. Also, it would be good if we could retain TupleWritable for future use. Other than these I have no issues with throwing away the code under o.a.m.cf.taste.hadoop.cooccurrence.

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
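The counting scheme in the comment above (count runs of identical pairs in sorted order, then keep only the N largest counts per item) can be sketched outside Hadoop as follows. This is a minimal illustration in plain Python, not the code in ItemSimilarityEstimator.java; the function names `top_n_cooccurrences` and `_offer` are hypothetical, and the input is assumed to be already sorted the way a secondary sort would deliver it.

```python
import heapq
from itertools import groupby


def _offer(heap, entry, n):
    """Keep at most n entries in a min-heap, evicting the smallest."""
    if len(heap) < n:
        heapq.heappush(heap, entry)
    elif entry > heap[0]:
        heapq.heapreplace(heap, entry)


def top_n_cooccurrences(sorted_pairs, n):
    """Count co-occurrences from (item1, item2) pairs sorted by item1, then item2.

    Only one running counter and a heap of at most n entries are held
    per item1, so memory does not grow with the number of distinct item2.
    Returns {item1: [(item2, count), ...]} ordered by count, highest first.
    """
    result = {}
    for item1, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        heap = []                  # min-heap of (count, item2)
        current, count = None, 0   # the run of identical item2 being counted
        for _, item2 in group:
            if item2 != current:
                # A new item2 means the previous run is complete.
                if current is not None:
                    _offer(heap, (count, current), n)
                current, count = item2, 0
            count += 1
        if current is not None:
            _offer(heap, (count, current), n)
        result[item1] = [(item, c) for c, item in sorted(heap, reverse=True)]
    return result
```

For example, `top_n_cooccurrences([(1, 2), (1, 2), (1, 3), (1, 4), (1, 4), (1, 4)], 2)` keeps only items 4 and 2 for item 1 and drops item 3, which has the smallest count.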
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860939#action_12860939 ]

Ankur commented on MAHOUT-305:
------------------------------

> But the answer is the partitioner ?

Yes.

> Am I right that (item1, item2) -> count is all that's needed ?

Yes.

> And why is the priority queue needed ...

You could use both a co-occurrence count (your favorite) and a maximum number of co-occurring pairs (say 1000). I have chosen a size of 100, so for any given item the top-100 co-occurring items (by count) would be output. Although the size is limited, this can still cause an explosion if there are very long histories. Recall the users in the Netflix dataset who have rated more than 10K movies. One way of taking care of them is to apply 'sessionization', i.e. output a co-occurrence pair only if both items are part of a session or satisfy some other constraint. But that is not implemented yet.

> TupleWritable ...

Not really. I have a specialized implementation for my own purpose using GenericWritable that wraps each object of TupleWritable.

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
[jira] Commented: (MAHOUT-344) Minhash based clustering
[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851756#action_12851756 ]

Ankur commented on MAHOUT-344:
------------------------------

Drew, thanks for pitching in as I've been running super busy with some crap :-)

@Cristi That's right, but it's totally unnecessary, as each of the mappers can do its own initialization of the hash functions. They will be the same hash functions if the mappers use the same seed for java.util.Random(). So the distributed cache can be removed altogether with that change. The code will be shorter and simpler.

What is the min-cluster size you are using? How many hash functions? How many hashes are grouped together?

We will need some tests to show how good the clusters are. As a start we can compute a simple metric like the average similarity of items within a cluster, aggregated over all clusters.

Minhash based clustering
------------------------

Key: MAHOUT-344
URL: https://issues.apache.org/jira/browse/MAHOUT-344
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.3
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-344-v1.patch

Minhash clustering performs probabilistic dimension reduction of high dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that the probability of collision of similar items is higher. Multiple such hash tables can then be constructed to answer near neighbor type of queries efficiently.
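The point about seeding in the comment above can be illustrated with a small sketch: if every mapper builds its hash functions from the same seed, all mappers end up with identical functions and nothing needs to be shipped through the distributed cache. The sketch below uses Python's `random.Random` in place of java.util.Random and assumes a simple linear hash family `(a*x + b) mod p`; neither choice comes from the patch, and all function names are hypothetical.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the assumed linear hash family


def make_hash_functions(num_hashes, seed):
    """Derive (a, b) parameters deterministically from a shared seed."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Minimum hash value of a set of non-negative ints under each function."""
    return tuple(min((a * x + b) % PRIME for x in items) for a, b in params)


def cluster_keys(signature, group_size):
    """Group consecutive hashes; each group acts as one cluster key."""
    last_start = len(signature) - group_size
    return [signature[i:i + group_size]
            for i in range(0, last_start + 1, group_size)]
```

Two "mappers" calling `make_hash_functions(10, seed=42)` independently get the same parameters, so they compute the same signature for the same item set. The number of hash functions and the group size are exactly the tuning knobs asked about in the comment.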
[jira] Created: (MAHOUT-344) Minhash based clustering
Minhash based clustering
------------------------

Key: MAHOUT-344
URL: https://issues.apache.org/jira/browse/MAHOUT-344
Project: Mahout
Issue Type: Bug
Components: Clustering
Reporter: Ankur

Minhash clustering performs probabilistic dimension reduction of high dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that the probability of collision of similar items is higher. Multiple such hash tables can then be constructed to answer near neighbor type of queries efficiently.
[jira] Updated: (MAHOUT-344) Minhash based clustering
[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated MAHOUT-344:
-------------------------

Affects Version/s: 0.3
Assignee: Ankur

Minhash based clustering
------------------------

Key: MAHOUT-344
URL: https://issues.apache.org/jira/browse/MAHOUT-344
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.3
Reporter: Ankur
Assignee: Ankur

Minhash clustering performs probabilistic dimension reduction of high dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that the probability of collision of similar items is higher. Multiple such hash tables can then be constructed to answer near neighbor type of queries efficiently.
[jira] Updated: (MAHOUT-344) Minhash based clustering
[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated MAHOUT-344:
-------------------------

Attachment: MAHOUT-344-v1.patch

As per Yonik's law of patches, I am submitting my implementation. Please feel free to provide ideas for improvement or even submit an improved patch.

Minhash based clustering
------------------------

Key: MAHOUT-344
URL: https://issues.apache.org/jira/browse/MAHOUT-344
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.3
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-344-v1.patch

Minhash clustering performs probabilistic dimension reduction of high dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that the probability of collision of similar items is higher. Multiple such hash tables can then be constructed to answer near neighbor type of queries efficiently.
[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
[ https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840601#action_12840601 ]

Ankur commented on MAHOUT-320:
------------------------------

Binary comparison looks more or less the same in both the classes. It's the data serialization/deserialization where Bigram scores over IntPairWritable. Bigram encodes/decodes the data in VInt format, which uses zero-compressed encodings; for more info see o.a.h.io.WritableUtils.java. The encoding can give considerable savings when serializing huge amounts of numeric data.

Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
---------------------------------------------------------------------------------------------

Key: MAHOUT-320
URL: https://issues.apache.org/jira/browse/MAHOUT-320
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Robin Anil
Priority: Minor
Attachments: MAHOUT-320.patch

Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to be binary comparable will improve the performance of the comparison operations during a sort because no marshaling will need to occur to compare IntPairWritable instances.
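To show why a variable-length integer format saves space for small values, here is a generic base-128 varint sketch. It is an illustration of the general idea only: the byte layout below is an assumption chosen for simplicity and is not claimed to match what Hadoop's WritableUtils writes. The function names are hypothetical, and only non-negative integers are handled.

```python
def encode_varint(value):
    """Encode a non-negative int using 7 payload bits per byte.

    The high bit of each byte marks "more bytes follow", so small
    values need one byte instead of a fixed four.
    """
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Decode one varint; returns (value, number_of_bytes_consumed)."""
    result, shift = 0, 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, index + 1
        shift += 7
    raise ValueError("truncated varint")
```

Under this scheme a pair of small ids such as (5, 300) takes 3 bytes in total, against 8 bytes for two fixed-width 32-bit ints, which is the kind of saving the comment refers to when large volumes of numeric data are serialized.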
[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
[ https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840619#action_12840619 ]

Ankur commented on MAHOUT-320:
------------------------------

And yes, I see the issue with the (firstb1 - firstb2) thing in Bigram. This definitely needs to be fixed. I don't mind replacing either one with the other. Just that we should be using VInt format for ser/de on the wire.

Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
---------------------------------------------------------------------------------------------

Key: MAHOUT-320
URL: https://issues.apache.org/jira/browse/MAHOUT-320
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Robin Anil
Priority: Minor
Attachments: MAHOUT-320.patch

Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to be binary comparable will improve the performance of the comparison operations during a sort because no marshaling will need to occur to compare IntPairWritable instances.
[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
[ https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840636#action_12840636 ]

Ankur commented on MAHOUT-320:
------------------------------

Robin, can you update your revision and create a fresh patch?

Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
---------------------------------------------------------------------------------------------

Key: MAHOUT-320
URL: https://issues.apache.org/jira/browse/MAHOUT-320
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Robin Anil
Priority: Minor
Attachments: MAHOUT-320.patch, MAHOUT-320.patch, MAHOUT-320.patch

Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to be binary comparable will improve the performance of the comparison operations during a sort because no marshaling will need to occur to compare IntPairWritable instances.
[jira] Commented: (MAHOUT-320) Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
[ https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841059#action_12841059 ]

Ankur commented on MAHOUT-320:
------------------------------

It still complains that it cannot find the file to patch: core/src/main/java/org/apache/mahout/common/IntPairWritable.java. Also, it looks like the unit test for IntPairWritable is still lying under o.a.m.clustering.lda.

Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
---------------------------------------------------------------------------------------------

Key: MAHOUT-320
URL: https://issues.apache.org/jira/browse/MAHOUT-320
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Robin Anil
Priority: Minor
Attachments: MAHOUT-320.patch, MAHOUT-320.patch, MAHOUT-320.patch, MAHOUT-320.patch

Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to be binary comparable will improve the performance of the comparison operations during a sort because no marshaling will need to occur to compare IntPairWritable instances.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837192#action_12837192 ]

Ankur commented on MAHOUT-305:
------------------------------

Just picking a random N% of the data for each user, calculating average precision and recall across all users in the test data, and then repeating the test K times to take the average across all runs should be a reasonably fair assessment IMHO. Mahouters, your opinion here would be valuable.

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837198#action_12837198 ]

Ankur commented on MAHOUT-305:
------------------------------

I am not proposing that we choose a random subset over all movies. Rather, choose a random N% of the movie ratings from EACH user and use it as test data to get precision and recall across this test set. Also repeat this procedure X times to get a fair assessment. They seem to do it the same way - http://www2007.org/papers/paper570.pdf

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
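The per-user random hold-out proposed in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions: the data is a plain dict of user to rated items, `evaluate` is a hypothetical callback that trains on the first argument and returns one score (for example average precision) against the second, and none of these names exist in Mahout.

```python
import random


def split_per_user(ratings_by_user, test_fraction, rng):
    """Hold out a random fraction of EACH user's items as test data."""
    train, test = {}, {}
    for user, items in ratings_by_user.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test[user] = set(shuffled[:cut])
        train[user] = set(shuffled[cut:])
    return train, test


def repeated_evaluation(ratings_by_user, test_fraction, runs, evaluate, seed=0):
    """Average the score of evaluate(train, test) over several random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = split_per_user(ratings_by_user, test_fraction, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / runs
```

Because the split is drawn per user, every user contributes test items in proportion to the length of their own history, which is the difference from sampling a random subset over all movies.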
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837205#action_12837205 ]

Ankur commented on MAHOUT-305:
------------------------------

Well! Not factoring ratings into the similarity metric but having them influence the train/test data for evaluation doesn't sound fair to me. So I don't think both of us agree on the evaluation methodology.

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837230#action_12837230 ]

Ankur commented on MAHOUT-305:
------------------------------

*smile* There we go. Our last steps are essentially different. I don't do any multiplication; instead I just join (user, movie) on 'movie' with the co-occurrence set, followed by a group on 'user' to calculate recommendations. I guess while joining I should multiply ratings with co-occurrence counts for better evaluation.

Can you give a small illustrative example with dummy data to describe your last steps?

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836543#action_12836543 ]

Ankur commented on MAHOUT-305:
------------------------------

Sean, thanks for filing the jira. Noting points from our discussion here.

1. We need to decide on the dataset to run both the implementations on. I have the Netflix dataset in mind, but a strange thing I observed during my tests with it is that there were 2 - 3 users who rated more than 10,000 movies! This seemed a little odd to me. Can you or someone else who has had experience with the dataset validate my observation?

2. Both the implementations need to run on the dataset in an identical environment to gauge performance and accuracy. For accuracy I believe we need to do a Precision-Recall test. My understanding of it is:
   a) Do an 80-20 split of the data (80% train and 20% test) with the split happening on a timeline.
   b) Feed the training data to the algorithm and generate recommendations for a subset of users from the training data.
   c) Compare those recommendations with the items actually present in the history of the user in the test data.
   d) Calculate precision = tp / (tp + fp) = (recommendations actually present in user's history) / (total items recommended)
   e) Calculate recall = tp / (tp + fn) = (recommendations actually present in user's history) / (total items in user's history)
   f) Finally take a simple avg of both across all the users to get approx global precision/recall.

Please feel free to correct any of the steps above if I misunderstood anything.

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
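Steps d) to f) of the comment above translate into a few lines of code. The sketch below is a minimal illustration, assuming recommendations and test histories are available as plain Python collections keyed by user; the function names are hypothetical and not part of Mahout.

```python
def precision_recall(recommended, history):
    """Precision and recall of one user's recommendations.

    tp = recommended items that really appear in the user's test history.
    """
    recommended, history = set(recommended), set(history)
    tp = len(recommended & history)
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(history) if history else 0.0
    return precision, recall


def average_precision_recall(recs_by_user, history_by_user):
    """Simple (unweighted) average over all users present in both inputs."""
    users = [u for u in recs_by_user if u in history_by_user]
    if not users:
        return 0.0, 0.0
    pairs = [precision_recall(recs_by_user[u], history_by_user[u]) for u in users]
    avg_precision = sum(p for p, _ in pairs) / len(pairs)
    avg_recall = sum(r for _, r in pairs) / len(pairs)
    return avg_precision, avg_recall
```

For a user who was recommended items 1, 2, 3 and 4 and whose test history holds 2, 4 and 9, tp is 2, so precision is 2/4 and recall is 2/3. The unweighted average treats a user with a short history the same as one with a long history, which matches the "simple avg" in step f).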
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1283#action_1283 ]

Ankur commented on MAHOUT-305:
------------------------------

Hey Sean, have you played with the Netflix dataset? Are there really users who have rated more than 10,000 movies? For the PR test, do we have something already that will work in this case, or is some coding required? Any other thoughts?

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
[jira] Assigned: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur reassigned MAHOUT-305:
----------------------------

Assignee: Ankur

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836725#action_12836725 ]

Ankur commented on MAHOUT-305:
------------------------------

Typically when doing a train-test data split, we divide the data on a timeline. So as a simple example, if we have 10 days of data then we would keep the last 2 days as test data and the remainder as training data. If we remove all 5-star ratings the crude way, we may not be able to ensure this condition; it is not a hard requirement but still a best practice AFAIK. Also, I am not sure if 5-star ratings would be 20 or even 10% of the total data.

The crude way you mentioned is ok for a start, but I am not sure if it's a fair evaluation or not. Also, with this we would effectively be calculating

precision = (5-star recommendations actually present in user's history) / (total 5-star recommendations)
recall = (5-star recommendations actually present in user's history) / (total 5-star items in user's history)

Is that what you mean?

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837123#action_12837123 ]

Ankur commented on MAHOUT-305:
------------------------------

With co-occurrence analysis we are dropping ratings. So if a lot of people who watched Harry Potter also watched Maid in Manhattan, it will have a higher chance of getting recommended regardless of ratings. I am trying not to be influenced too much by ratings, as that is not the strength of this algorithm. Where it really shines is when you have lots and lots of sparse user click data where a click may be present or absent, something like an online book store or a shopping site. We are sticking with Netflix as there is no such publicly available dataset AFAIK.

Ok, so moving forward with the action plan, here is what I propose to do. Please feel free to suggest modifications.

1. For each user, take out the most recent movies that he has rated 3 or 4 or 5 as TEST data. Use the remaining as TRAIN data.
2. Run both implementations in an identical environment on test data and record runtimes and results.
3. Join recommendation results with TEST data on the 'user' key and calculate precision and recall.
4. Report average precision and recall.

Ok, so when separating top ratings as TEST data, for each user

Precision @ 10 = (3,4,5 rating movies recommended actually present) / 10
Recall @ 10 = (3,4,5 rating movies recommended actually present) / (all 3,4,5 movies seen by user)

Hope this was more clear.

Combine both cooccurrence-based CF M/R jobs
-------------------------------------------

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.
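Step 1 and the Precision @ 10 / Recall @ 10 formulas from the plan above can be sketched as follows. This is an illustration under assumptions the comment does not state: ratings arrive as (user, movie, rating, timestamp) tuples, the number of recent movies held out per user is a free parameter (`holdout`, hypothetical), and recall is measured against the held-out liked movies.

```python
def split_recent_liked(ratings, holdout, min_rating=3):
    """Hold out each user's most recent movies rated >= min_rating as TEST.

    ratings: iterable of (user, movie, rating, timestamp).
    Returns (train, test) as dicts of user -> set of movies.
    """
    by_user = {}
    for user, movie, rating, timestamp in ratings:
        by_user.setdefault(user, []).append((timestamp, movie, rating))
    train, test = {}, {}
    for user, rows in by_user.items():
        rows.sort(reverse=True)  # newest first
        liked = [movie for _, movie, rating in rows if rating >= min_rating]
        test[user] = set(liked[:holdout])
        train[user] = {movie for _, movie, _ in rows} - test[user]
    return train, test


def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k divides by k; recall@k divides by the relevant set size."""
    hits = len(set(recommended[:k]) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    return hits / k, recall
```

Note that precision here divides by k even when fewer than k items were recommended, following the "/ 10" in the formula above; dividing by the number of items actually returned would be an equally defensible variant.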
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794036#action_12794036 ]

Ankur commented on MAHOUT-103:
------------------------------

I skimmed through your version and what's present in the .item package. A few immediate things that come to mind are:

1. Moving to AbstractJob.
2. Re-factoring to separate map, reduce and job classes. Personally I hate that coz the code base just bloats when the number of M/R jobs increases.

I have been trying to set up my IDEA using the IntelliJ.codestyle.xml provided by the mahout cwiki. I placed the file under idea_home/config/codeStyles and restarted IDEA, but it still does not show an import option in File->Settings->Code Style. IDEA shows the following messages in the background:

Field not copied JAVA_INDENT_OPTIONS
Field not copied JSP_INDENT_OPTIONS
Field not copied XML_INDENT_OPTIONS
Field not copied OTHER_INDENT_OPTIONS
Field not copied FIELD_TYPE_TO_NAME
Field not copied STATIC_FIELD_TYPE_TO_NAME
Field not copied PARAMETER_TYPE_TO_NAME
Field not copied LOCAL_VARIABLE_TYPE_TO_NAME
Field not copied PACKAGES_TO_USE_IMPORT_ON_DEMAND
Field not copied IMPORT_LAYOUT_TABLE

I am using v 8.1.4.

Once I set this up, I am gonna look at your version and what's there in the .item package.

Co-occurence based nearest neighbourhood
----------------------------------------

Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-103.patch, mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, run.sh

Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts.
[jira] Issue Comment Edited: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794036#action_12794036 ]

Ankur edited comment on MAHOUT-103 at 12/23/09 12:41 PM:
---------------------------------------------------------

I skimmed through your version and what's present in the .item package. A few immediate things that come to mind are:

1. Moving to AbstractJob.
2. Re-factoring to separate map, reduce and job classes. Personally I hate that coz the code base just bloats when the number of M/R jobs increases.

I have been trying to set up my IDEA using the IntelliJ.codestyle.xml provided by the mahout cwiki. I placed the file under idea_home/config/codeStyles and restarted IDEA, but it still does not show an import option in File->Settings->Code Style. IDEA shows the following messages in the background:

Field not copied JAVA_INDENT_OPTIONS
Field not copied JSP_INDENT_OPTIONS
Field not copied XML_INDENT_OPTIONS
Field not copied OTHER_INDENT_OPTIONS
Field not copied FIELD_TYPE_TO_NAME
Field not copied STATIC_FIELD_TYPE_TO_NAME
Field not copied PARAMETER_TYPE_TO_NAME
Field not copied LOCAL_VARIABLE_TYPE_TO_NAME
Field not copied PACKAGES_TO_USE_IMPORT_ON_DEMAND
Field not copied IMPORT_LAYOUT_TABLE

I am using v 8.1.4.

Once I set this up, I am gonna look at your version and what's there in the .item package.

Co-occurence based nearest neighbourhood
----------------------------------------

Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-103.patch, mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, run.sh

Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794133#action_12794133 ]

Ankur commented on MAHOUT-103:
------------------------------

Your changes don't look too mutating and yes, roughly speaking I am ok.

Since we are talking about committing this, I would like to say that I tested this for correctness on a very small hand-coded dataset and then ran it on the Netflix data. However, I couldn't verify its correctness over the Netflix data, though I am pretty confident it works correctly. That is why I was hoping to have a couple of unit tests:

1. To verify that similar items are identified correctly.
2. To verify that none of the seen items are recommended for a user.

But since this is not the final version, can you suggest any other approach to be 100% sure of correctness? I don't want something to be committed only to discover a silly issue later just because we didn't take extra care.

Co-occurence based nearest neighbourhood
----------------------------------------

Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-103.patch, mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, run.sh

Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793628#action_12793628 ] Ankur commented on MAHOUT-103: --
Evolving the code to integrate better with the existing stuff is fine with me. I am in general OK with throwing away code if it can be replaced by existing stuff that is better. However, I don't think it's a good idea to try to come up with a unified approach to generating Hadoop-based recommendations. I am afraid we'll create more problems than we'd solve. I see recommendations in the Hadoop world as the following linear chain of M/R jobs:
Data Formatting -> Data Filter -> Core Recommender -> Post Processor
The last 2 jobs can themselves be comprised of 1 or more M/R jobs. There is I think . Let me come up with the unit tests and code documentation. After that you can start making the changes. Thanks a lot for the help. Appreciate that :-) BTW, did you have a chance to actually run it on the Netflix data?
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, run.sh Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791971#action_12791971 ] Ankur commented on MAHOUT-103: --
OK, so here is the next version, which I again rewrote completely :-( for performance reasons. This version computes item similarity and uses it to generate recommendations in true Hadoop fashion. In a nutshell, the recommendations are generated in 2 steps:
1. Join the item-similarity data (generated by analyzing co-occurrence) with the user-click data.
2. Group the output of step 1 on the user key so that a reducer receives all potential candidates for a user, along with all items already clicked/seen by that user, so that the seen items can be excluded from the final recommendation set.
Also attached:
1. A Perl script to convert the Netflix data into the required format (userId \t movieId).
2. The Bash script used to run it on a 50-node Hadoop cluster.
The recommendations are generated for all the users in less than 45 min.
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch, mahout-103.patch.v1 Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
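The two steps in the comment above (join item-similarity data with user clicks, then group by user and drop items the user has already seen) can be illustrated with a small single-process sketch. This is not the code from the attached patch: the class `JoinAndGroupSketch`, the `Similarity` row type, and the sample data are all hypothetical, and plain in-memory maps stand in for the M/R join and the reducer-side grouping. Scoring of candidates is left out on purpose.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class JoinAndGroupSketch {

    /** Hypothetical similarity row: 'item' co-occurred with 'similarItem' 'count' times. */
    static final class Similarity {
        final String item;
        final String similarItem;
        final int count; // carried along for later scoring; not used in this sketch

        Similarity(String item, String similarItem, int count) {
            this.item = item;
            this.similarItem = similarItem;
            this.count = count;
        }
    }

    /** Returns, per user, the candidate items that the user has not already clicked. */
    static Map<String, Set<String>> candidates(List<String[]> clicks, List<Similarity> similarities) {
        // Index similarity rows by item id; the item id acts as the join key.
        Map<String, List<Similarity>> byItem = new HashMap<>();
        for (Similarity s : similarities) {
            byItem.computeIfAbsent(s.item, k -> new ArrayList<>()).add(s);
        }

        Map<String, Set<String>> seen = new TreeMap<>();
        Map<String, Set<String>> result = new TreeMap<>();
        for (String[] click : clicks) {
            String user = click[0];
            String item = click[1];
            seen.computeIfAbsent(user, k -> new TreeSet<>()).add(item);
            // Step 1: join this click with the similarity rows of the clicked item.
            // Step 2: group the joined rows under the user key.
            for (Similarity s : byItem.getOrDefault(item, Collections.<Similarity>emptyList())) {
                result.computeIfAbsent(user, k -> new TreeSet<>()).add(s.similarItem);
            }
        }
        // Exclude everything the user has already clicked/seen.
        for (Map.Entry<String, Set<String>> e : result.entrySet()) {
            e.getValue().removeAll(seen.get(e.getKey()));
        }
        return result;
    }

    /** Made-up (userId, itemId) click pairs, for illustration only. */
    static List<String[]> sampleClicks() {
        List<String[]> clicks = new ArrayList<>();
        clicks.add(new String[] {"u1", "M1"});
        clicks.add(new String[] {"u1", "M2"});
        clicks.add(new String[] {"u2", "M3"});
        return clicks;
    }

    /** Made-up similarity rows, for illustration only. */
    static List<Similarity> sampleSimilarities() {
        List<Similarity> sims = new ArrayList<>();
        sims.add(new Similarity("M1", "M2", 3));
        sims.add(new Similarity("M1", "M3", 2));
        sims.add(new Similarity("M2", "M1", 3));
        sims.add(new Similarity("M3", "M1", 2));
        return sims;
    }

    public static void main(String[] args) {
        System.out.println(candidates(sampleClicks(), sampleSimilarities()));
    }
}
```

In a real M/R job the index in `byItem` would not fit in memory for large data; the join would instead be done by keying both inputs on the item id.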
[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated MAHOUT-103: - Attachment: run.sh prepare.pl mahout-103.patch.v2 Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch, mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, run.sh Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated MAHOUT-103: - Attachment: (was: jira-103.patch) Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, run.sh Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781838#action_12781838 ] Ankur commented on MAHOUT-103: --
For this co-occurrence based recommender I am planning to write a set of map-reduce jobs that compute recommendations for users as follows:
1. Take the user's item history.
2. For each item in the history, fetch the top-N similar items (similarity based on co-occurrence).
3. Add the co-occurrence scores if an item appears more than once (NOT a weighted avg).
Consider, e.g., user history { M1, M2, M3 } and the top-3 similar movies for each of these, along with co-occurrence scores:
M1 - (A, 5), (B, 4), (C, 2)
M2 - (D, 6), (E, 3), (F, 2)
M3 - (G, 8), (C, 5), (B, 2)
So the final scores in decreasing order will look like:
(G, 8) (C, 7) (B, 6) (D, 6) (A, 5) (E, 3) (F, 2)
The idea I want to capture is that a candidate item gets a higher score if it is similar to more items in the user's click history. Do you see any issue with this approach? Is there a better approach that you can think of? As for the precision-recall test, I am still trying to see how to divide the data into 'train' and 'test' sets for a fair evaluation. How do we do it in the existing code?
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch, mahout-103.patch.v1 Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
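The score-summing scheme from the comment above can be checked with a few lines of plain Java. This is only a single-machine sketch of step 3, not the planned map-reduce jobs; the class and method names (`CooccurrenceScoring`, `sumScores`, `rank`, `exampleSimilarities`) are hypothetical, and breaking ties by item name is an assumption added so that the ordering is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceScoring {

    /** Sums the co-occurrence scores of each candidate over all items in the user's history. */
    static Map<String, Integer> sumScores(List<String> history,
                                          Map<String, Map<String, Integer>> topSimilar) {
        Map<String, Integer> totals = new HashMap<>();
        for (String item : history) {
            Map<String, Integer> similar =
                topSimilar.getOrDefault(item, Collections.<String, Integer>emptyMap());
            for (Map.Entry<String, Integer> e : similar.entrySet()) {
                // A candidate that is similar to several history items accumulates their scores.
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    /** Orders candidates by decreasing total score; ties are broken by item name (an assumption). */
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> totals) {
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return ranked;
    }

    /** The top-3 lists for M1, M2 and M3 from the example in the comment. */
    static Map<String, Map<String, Integer>> exampleSimilarities() {
        Map<String, Map<String, Integer>> sims = new HashMap<>();
        sims.put("M1", scores("A", 5, "B", 4, "C", 2));
        sims.put("M2", scores("D", 6, "E", 3, "F", 2));
        sims.put("M3", scores("G", 8, "C", 5, "B", 2));
        return sims;
    }

    private static Map<String, Integer> scores(String k1, int v1, String k2, int v2,
                                               String k3, int v3) {
        Map<String, Integer> m = new HashMap<>();
        m.put(k1, v1);
        m.put(k2, v2);
        m.put(k3, v3);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals =
            sumScores(Arrays.asList("M1", "M2", "M3"), exampleSimilarities());
        // Prints [G=8, C=7, B=6, D=6, A=5, E=3, F=2]
        System.out.println(rank(totals));
    }
}
```

C ends up with 2 + 5 = 7 and B with 4 + 2 = 6, which matches the final ordering given in the comment.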
[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated MAHOUT-103: - Attachment: mahout-103.patch.v1
OK, so here's the revised version of the algorithm that this JIRA proposes to implement. I have tried to make the code as clean and readable as possible. Next I plan to write some test code for preparing and running on the Netflix Prize dataset. As part of data preparation, the 'dates' and 'ratings' will be dropped and the algorithm will run on (user-id, item-id) pairs. I am not sure how we can include age-related decay/boost when counting co-occurrence. Maybe others can pitch in once we have the basic stuff working fine.
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch, mahout-103.patch.v1 Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778911#action_12778911 ] Ankur commented on MAHOUT-103: --
Thanks for the quick look, appreciate that :-). Putting it in a subpackage: sure, for now I'll just leave all the main code under one subpackage (how about 'bigram'?) until you have it sorted out. As for the code, once I have the test code ready for the Netflix dataset and at least one unit test, it will be good to go. One question: how do we apply precision-recall or RMSE or any other evaluation technique to the results, since all we are doing is counting co-occurrence? Do you have the JIRA for this Hadoop-related bug?
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch, mahout-103.patch.v1 Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776939#action_12776939 ] Ankur commented on MAHOUT-103: --
> Re-post an updated patch
Sure, I'll have the updated code coming by early next week.
> If it's basically sound I'd like to mention it
+10. The more people know about it, the better chances it has of being used :-)
> I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the rating ...
Simply dropping the rating might introduce too much noise. I was thinking of keeping only those that have ratings 2.5 (or 2 to be more liberal).
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776966#action_12776966 ] Ankur commented on MAHOUT-103: --
In that case dropping ratings might not be such a good idea and may lead to bad results. Consider the following movies that a user might have seen, with their ratings:
Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2
Assuming that a lot of people have watched these movies and didn't like the two sequels, the sequels will still get high similarity scores w.r.t. Matrix going purely by co-occurrence. IMHO, that leaves us with the following 2 alternatives:
1. Add the ratings when counting co-occurrence and hope that the better ones will stand out even if they co-occur less.
2. Apply a re-scorer that re-ranks the similar items for a given item based on their average scores.
Point 1 is something I am thinking of trying out.
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
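Alternative 1 in the comment above is not spelled out in detail, so the following is only one possible reading of it: for every pair of items (a, b) rated by the same user, add the user's rating of b to the weight of b as a neighbour of a, instead of adding 1. The class `RatingWeightedCooccurrence`, the method `weigh`, and the user id are hypothetical; the ratings are the ones from the Matrix example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RatingWeightedCooccurrence {

    /**
     * For each user and each ordered pair (a, b) of distinct items that user rated,
     * adds the user's rating of b to weights[a][b]. Plain co-occurrence counting
     * would add 1 instead of the rating.
     */
    static Map<String, Map<String, Double>> weigh(Map<String, Map<String, Double>> ratingsByUser) {
        Map<String, Map<String, Double>> weights = new TreeMap<>();
        for (Map<String, Double> ratings : ratingsByUser.values()) {
            for (String a : ratings.keySet()) {
                for (Map.Entry<String, Double> b : ratings.entrySet()) {
                    if (a.equals(b.getKey())) {
                        continue; // an item is not its own neighbour
                    }
                    weights.computeIfAbsent(a, k -> new TreeMap<>())
                           .merge(b.getKey(), b.getValue(), Double::sum);
                }
            }
        }
        return weights;
    }

    /** A single hypothetical user with the ratings from the example in the comment. */
    static Map<String, Map<String, Double>> exampleRatings() {
        Map<String, Double> ratings = new HashMap<>();
        ratings.put("Matrix", 4.5);
        ratings.put("Matrix Reloaded", 2.5);
        ratings.put("Matrix Revolutions", 2.0);
        Map<String, Map<String, Double>> byUser = new HashMap<>();
        byUser.put("user-1", ratings);
        return byUser;
    }

    public static void main(String[] args) {
        // Neighbours of "Matrix", each weighted by the rating it received.
        System.out.println(weigh(exampleRatings()).get("Matrix"));
    }
}
```

Under this reading a poorly rated sequel contributes less per co-occurrence than a well-liked film, but a very frequent low-rated item can still accumulate a large total, which is the risk the comment hints at with "hope that the better ones will stand out".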
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682915#action_12682915 ] Ankur commented on MAHOUT-103: --
Hey Sean, thanks for the review comments. Some specific points and questions:
1. This indeed does approximately the same thing as TanimotoCoefficientSimilarity and BooleanPreferenceUser. The difference is that the similarity computation is parallelized in map-reduce.
2. The idea of introducing a FitnessEvaluator was to allow people to apply domain-specific logic when comparing a preference. Are you suggesting replacing FitnessEvaluator with ItemSimilarity?
3. The Hadoop job was written to run this thing stand-alone. What modifications do you feel would be appropriate for integration into the framework?
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur reassigned MAHOUT-103: Assignee: Ankur Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated MAHOUT-103: - Attachment: jira-103.patch
OK, here is a quick patch with just enough documentation and no unit tests or dummy data. The code works, but the following things can be improved:
1. The code can be better structured and integrated.
2. Logging needs to be added.
3. Documentation can be more informative.
4. Dummy data and unit tests need to be added.
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668475#action_12668475 ] Ankur commented on MAHOUT-103: -- I am hoping to make the above improvements after I get some review comments. Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-103) Co-occurence based nearest neighbourhood
Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-19) Hierarchial clusterer
[ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12663319#action_12663319 ] Ankur commented on MAHOUT-19: -
Hi Karl, welcome back :-) Can you share the following few things about this patch?
1. Assuming you are training the tree top-down, what division criterion are you using?
2. How well does it scale?
3. Was the data on which this was tried sparse?
4. What distance metric has been used?
Basically, I have a use case in which I have a set of 5 - 10 million URLs that have an inherent hierarchical relationship, and a set of user clicks on them. I would like to cluster them in a tree and use the model to answer near-neighborhood type queries, i.e. which URLs are related to which other URLs. I did implement a sequential bottom-up hierarchical clustering algorithm, but its complexity is too high for my data set. I then thought about implementing a top-down hierarchical clustering algorithm using the Jaccard coefficient as my distance measure, and came across this patch. Can you suggest whether this patch will help?
Hierarchial clusterer - Key: MAHOUT-19 URL: https://issues.apache.org/jira/browse/MAHOUT-19 Project: Mahout Issue Type: New Feature Components: Clustering Reporter: Karl Wettin Assignee: Karl Wettin Priority: Minor Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png In a hierarchical clusterer the instances are the leaf nodes in a tree where branch nodes contain the mean features of and the distance between their children. For performance reasons I always trained trees from the top down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom-up? The thing is, I don't think it is possible to train the tree top-down using map reduce. I do however think it is possible to train it bottom-up.
I would very much appreciate any thoughts on this. Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search of a cluster. Navigating the tree and gathering nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from the start. Also, it usually makes sense to allow the user to modify the cluster boundary variables in real time using a slider, or perhaps to present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not too bad to use secondary classification on the instances to create wormholes in the tree. I always thought it would be cool to visualize it using Touchgraph. My focus is on clustering text documents for an instant "more like this" feature in search engines, using Tanimoto similarity on the vector spaces to calculate the distance. See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a hierarchical clusterer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
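The comment above mentions the Jaccard coefficient as a distance measure, and the issue description mentions Tanimoto similarity; for plain sets the two coincide. As a rough illustration (not taken from the MAHOUT-19 patch), the sketch below treats each URL as the set of users who clicked it and computes 1 - |A ∩ B| / |A ∪ B|. The class name, the representation of a URL as a set of user ids, and the choice to return 0 for two empty sets are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDistance {

    /** Returns 1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical (an assumption). */
    static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // |a ∪ b| = |a| + |b| - |a ∩ b|, which avoids building the union set.
        int unionSize = a.size() + b.size() - intersection.size();
        return 1.0 - (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Hypothetical user ids of people who clicked each of two URLs.
        Set<String> clickersOfUrl1 = new HashSet<>(Arrays.asList("u1", "u2", "u3"));
        Set<String> clickersOfUrl2 = new HashSet<>(Arrays.asList("u2", "u3", "u4"));
        // Intersection has 2 users, union has 4, so the distance is 1 - 2/4 = 0.5.
        System.out.println(distance(clickersOfUrl1, clickersOfUrl2));
    }
}
```

Because the measure only needs set sizes and one intersection, it works directly on sparse click data without building dense vectors.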
[jira] Updated: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)
[ https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated MAHOUT-4: --- Attachment: (was: Mahout_EM.patch) Simple prototype for Expectation Maximization (EM) -- Key: MAHOUT-4 URL: https://issues.apache.org/jira/browse/MAHOUT-4 Project: Mahout Issue Type: New Feature Reporter: Ankur Attachments: dp-cluster.r Create a simple prototype implementing Expectation Maximization - EM that demonstrates the algorithm functionality given a set of (user, click-url) data. The prototype should be functionally complete and should serve as a basis for the Map-Reduce version of the EM algorithm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)
[ https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated MAHOUT-4: --- Attachment: (was: PLSI_EM.patch) Simple prototype for Expectation Maximization (EM) -- Key: MAHOUT-4 URL: https://issues.apache.org/jira/browse/MAHOUT-4 Project: Mahout Issue Type: New Feature Reporter: Ankur Create a simple prototype implementing Expectation Maximization - EM that demonstrates the algorithm functionality given a set of (user, click-url) data. The prototype should be functionally complete and should serve as a basis for the Map-Reduce version of the EM algorithm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)
[ https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur updated MAHOUT-4: --- Attachment: Mahout_EM.patch Oops! Looks like my Subversive Eclipse plugin did something wacky while generating the patch. Sorry for that. Please find the rectified patch file attached. Hope this goes through fine. Simple prototype for Expectation Maximization (EM) -- Key: MAHOUT-4 URL: https://issues.apache.org/jira/browse/MAHOUT-4 Project: Mahout Issue Type: New Feature Reporter: Ankur Attachments: Mahout_EM.patch Create a simple prototype implementing Expectation Maximization - EM that demonstrates the algorithm functionality given a set of (user, click-url) data. The prototype should be functionally complete and should serve as a basis for the Map-Reduce version of the EM algorithm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.