Re: Regarding Online Recommenders
OK, I tested the MatrixBackedDataModel, and the heap size is reduced to 7G for the Netflix data, still large. The same history is encoded in 2 SparseRowMatrices, one row-indexed by users and one by items. It has serious concurrency issues at several places, though (sets and removes need to be thread-safe). Best, Gokhan

On Sat, Jul 20, 2013 at 12:15 AM, Peng Cheng pc...@uowmail.edu.au wrote: Hi, just one simple question: is the org.apache.mahout.math.BinarySearch.binarySearch() function an optimized version of Arrays.binarySearch()? If it is not, why implement it again? Yours, Peng

On 13-07-17 06:31 PM, Sebastian Schelter wrote: You are completely right, the simple interface would only be usable for read-only / batch-updatable recommenders. Online recommenders might need something different. I tried to widen the discussion here to discuss all kinds of API changes in the recommenders that would be necessary in the future.

2013/7/17 Peng Cheng pc...@uowmail.edu.au: One thing that suddenly comes to my mind is that, for a simple interface like FactorizablePreferences, maybe sequential READ in real time is possible, but sequential WRITE in O(1) time is utopia, because you need to flush out the old preference with the same user and item IDs (in the worst case it could be an interpolation search); otherwise you are permitting a user to rate an item twice with different values. Considering how FileDataModel is supposed to work (new files flush old files), maybe using the simple interface has fewer advantages than we used to believe.

On 13-07-17 04:58 PM, Sebastian Schelter wrote: Hi Peng, I never wanted to discard the old interface, I just wanted to split it up. I want to have a simple interface that only supports sequential access (and allows for very memory-efficient implementations, e.g. by the use of primitive arrays). DataModel should *extend* this interface and provide sequential and random access (basically what it already does).
Then a recommender such as SGD could state that it only needs sequential access to the preferences, and you could either feed it a DataModel (so we don't break backwards compatibility) or a memory-efficient sequential-access thingy. Does that make sense to you?

2013/7/17 Peng Cheng pc...@uowmail.edu.au: I see, OK, so we shouldn't use the old implementation. But I mean, the old interface doesn't have to be discarded. The discrepancy between your FactorizablePreferences and DataModel is that your model supports getPreferences(), which returns all preferences as an iterator, while DataModel supports a few old functions that return preferences for an individual user or item. My point is that it is not hard for each of them to implement what the other lacks: the old DataModel can implement getPreferences() just by a loop in an abstract class, and your new FactorizablePreferences can implement those old functions by a binary search that takes O(log n) time, or an interpolation search that takes O(log log n) time on average. The same goes for the online update. It will just be a matter of different speed and space, not a different interface standard; we can use the old unit tests, old examples, old everything. And we will be more flexible in writing ensemble recommenders. Just a few thoughts, I'll have to validate the idea first before creating a new JIRA ticket. Yours, Peng

On 13-07-16 02:51 PM, Sebastian Schelter wrote: I completely agree, Netflix is less than one gigabyte in a smart representation; 12x more memory is a no-go. The techniques used in FactorizablePreferences allow a much more memory-efficient representation, tested on the KDD Music dataset, which is approx. 2.5 times Netflix and fits into 3GB with that approach.

2013/7/16 Ted Dunning ted.dunn...@gmail.com: Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100 million observations should take 1GB.
On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid; we should separate time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replaced the hash map (the culprit of high memory consumption) with a list/linked list, it would simply degrade the time complexity of a linear search to O(n), not too bad either. The current DataModel is the result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it.
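The MatrixBackedDataModel itself isn't shown in this thread, but the idea Gokhan describes — keeping the same history twice, once keyed by user and once by item, with thread-safe sets and removes — can be sketched roughly as below. Class and method names are hypothetical, not Mahout's; ConcurrentHashMap stands in for the SparseRowMatrices to address the concurrency issue he mentions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: dual-indexed preference store. The same rating is
// written under both keys so that per-user and per-item access are both
// fast; set/remove must touch both sides, hence the concurrent maps.
public class DualIndexedPrefs {
    private final Map<Long, Map<Long, Float>> byUser = new ConcurrentHashMap<>();
    private final Map<Long, Map<Long, Float>> byItem = new ConcurrentHashMap<>();

    public void set(long userId, long itemId, float value) {
        byUser.computeIfAbsent(userId, k -> new ConcurrentHashMap<>()).put(itemId, value);
        byItem.computeIfAbsent(itemId, k -> new ConcurrentHashMap<>()).put(userId, value);
    }

    public void remove(long userId, long itemId) {
        Map<Long, Float> items = byUser.get(userId);
        if (items != null) items.remove(itemId);
        Map<Long, Float> users = byItem.get(itemId);
        if (users != null) users.remove(userId);
    }

    public Float get(long userId, long itemId) {
        Map<Long, Float> items = byUser.get(userId);
        return items == null ? null : items.get(itemId);
    }
}
```

Note that re-rating an item simply overwrites the old value on both indexes, which is exactly the "flush out the old preference" behavior Peng raises above.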
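Peng's point — that a flat, sorted, memory-efficient sequential representation can still serve the old per-user/per-item lookups in O(log n) — can be sketched with parallel primitive arrays and a binary search over the (userId, itemId) key order. This is a hypothetical illustration, not the actual FactorizablePreferences API.

```java
// Hypothetical sketch: preferences stored as parallel primitive arrays,
// sorted by (userId, itemId). Sequential iteration is a plain loop over
// index k; random access is a binary search, i.e. O(log n) per lookup.
public class SortedPrefs {
    private final long[] userIds;
    private final long[] itemIds;
    private final float[] values;

    // Arrays must already be sorted by (userId, itemId).
    public SortedPrefs(long[] userIds, long[] itemIds, float[] values) {
        this.userIds = userIds;
        this.itemIds = itemIds;
        this.values = values;
    }

    /** Returns the rating, or Float.NaN if the pair is absent; O(log n). */
    public float get(long userId, long itemId) {
        int lo = 0, hi = userIds.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;  // overflow-safe midpoint
            int cmp = Long.compare(userIds[mid], userId);
            if (cmp == 0) cmp = Long.compare(itemIds[mid], itemId);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return values[mid];
        }
        return Float.NaN;
    }
}
```

An interpolation search over the same arrays would bring the average cost down to O(log log n) for roughly uniform IDs, as Peng notes, at the price of a slightly trickier probe step.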
Re: Regarding Online Recommenders
On Jul 17, 2013, at 1:19 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi Pat, please see my response inline. Best, Gokhan

On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote: May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factor model is still calculated offline in batch mode, then there are periodic updates? How are the updates handled?

If you are referring to the recommender under discussion here, no, updating the model can be done with a single preference, using stochastic gradient descent, by updating the particular user and item factors simultaneously.

Aren't there two different things needed to truly update the model: 1) add the new preference to the lower-dimensional space, 2) refactorize all the preferences? #2 only needs to be done periodically--afaik. #1 would be super fast and could be done at runtime. Am I wrong, or are you planning to incrementally refactorize the entire preference array with every new preference?
Re: Regarding Online Recommenders
For a low-rank matrix factorization based recommender, a new preference is not stored as itself, but as a dot product of two vectors in the low-dimensional space, so it needs no projection. The user and item vectors, however, may need to be projected into a lower-dimensional space, if and only if you want to reduce the rank of the preference matrix. The refactorization step in SGD is super fast--that's the charm of SGD. So, yes, we will refactorize on every update. Yours, Peng

On 13-07-18 11:34 AM, Pat Ferrel wrote: On Jul 17, 2013, at 1:19 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi Pat, please see my response inline. Best, Gokhan On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote: May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factor model is still calculated offline in batch mode, then there are periodic updates? How are the updates handled? If you are referring to the recommender under discussion here, no, updating the model can be done with a single preference, using stochastic gradient descent, by updating the particular user and item factors simultaneously. Aren't there two different things needed to truly update the model: 1) add the new preference to the lower-dimensional space, 2) refactorize all the preferences? #2 only needs to be done periodically--afaik. #1 would be super fast and could be done at runtime. Am I wrong, or are you planning to incrementally refactorize the entire preference array with every new preference?
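The single-preference SGD update Gokhan and Peng describe can be sketched as follows. One observed (user, item, rating) triple moves only the two touched factor vectors toward a smaller squared error; the learning rate and regularization constants are illustrative, and this is not Mahout's actual implementation.

```java
// Hypothetical sketch of an online SGD step for matrix factorization:
// prediction is the dot product of the user and item factor vectors,
// and both vectors are nudged along the gradient of the squared error,
// with L2 regularization weighted by lambda.
public class SgdStep {
    public static void update(double[] userFactors, double[] itemFactors,
                              double rating, double learningRate, double lambda) {
        double prediction = 0.0;
        for (int f = 0; f < userFactors.length; f++) {
            prediction += userFactors[f] * itemFactors[f];
        }
        double err = rating - prediction;
        for (int f = 0; f < userFactors.length; f++) {
            double uf = userFactors[f];  // read old values before writing
            double vf = itemFactors[f];
            userFactors[f] += learningRate * (err * vf - lambda * uf);
            itemFactors[f] += learningRate * (err * uf - lambda * vf);
        }
    }
}
```

This is why the update is O(numFactors) per preference: no other rows of the factor matrices are touched, which is the sense in which "refactorizing on every update" is cheap.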
Re: Regarding Online Recommenders
Yes, what Myrrix does is good. My last aside was a wish for an item-based online recommender, not only a factorized one. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works, but it does require a bit of tinkering and doesn't have quite the same set of options--no LLR similarity, for instance.

On the same subject, I recently attended a workshop in Seattle for UAI 2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor number past where it would perform well. Along the way they saw performance increase as measured by offline precision. They eventually gave up on a factorized solution. This decision seems odd, but anyway… In the case of Walmart, as with our data set, the data is quite diverse. The best idea is probably to create different recommenders for separate parts of the catalog, but if you create one model on all items, our intuition is that item-based works better than factorized. Again, caveat--no A/B tests to support this yet.

Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores, so we moved to the Hadoop version. Now we have use cases for an online recommender that handles anonymous new users, and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote: Hi Pat, I think we should provide simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as an argument. For item-based recommenders, it's straightforward to compute recommendations; for user-based, you have to search through all users once; for latent factor models, you have to fold the user vector into the low-dimensional space.
I think Sean already added this method in Myrrix, and I have some code for my kornakapi project (a simple web layer for Mahout). Would such a method fit your needs? Best, Sebastian

2013/7/17 Pat Ferrel pat.fer...@gmail.com: May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factor model is still calculated offline in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update, maybe by just transforming new data into the LF space already in place, then doing a full refactorization every so often in batch mode?

By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow passing the entire history vector or userID + incremental new history to the recommender? I hope so.

For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse e-commerce source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types. I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results. Furthermore, we don't have A/B test results yet to validate the offline metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote: Peng, this is the reason I separated out the DataModel and only put the learner stuff there.
The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored. I kind of agree with the multi-level DataModel approach: one for iterating over all preferences, one for when one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach.) Gokhan

On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote: I completely agree, Netflix is less than one gigabyte in a smart representation; 12x more memory is a no-go. The techniques used in FactorizablePreferences allow a much more memory-efficient representation, tested on the KDD Music dataset, which is approx. 2.5 times Netflix and fits into 3GB with that approach.

2013/7/16 Ted Dunning ted.dunn...@gmail.com: Netflix is a small dataset. 12G for that seems quite excessive.
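The fold-in Sebastian mentions for recommendToAnonymous() on a latent factor model could look roughly like this: fit a throwaway user vector against the anonymous history while holding the item factors fixed, then score unseen items with it. Here the fit is done with a few plain SGD passes (a least-squares solve is the other common choice); the class name, constants, and pass count are hypothetical.

```java
// Hypothetical sketch: fold an anonymous user's history into the
// low-dimensional space. itemFactors holds the trained item vectors;
// ratedItems/ratings are the anonymous history. Only the fresh user
// vector u is updated, so the trained model is left untouched.
public class FoldIn {
    public static double[] foldIn(double[][] itemFactors, int[] ratedItems,
                                  double[] ratings, int numFactors) {
        double[] u = new double[numFactors];
        java.util.Arrays.fill(u, 0.1);  // small non-zero start
        double learningRate = 0.05;
        for (int pass = 0; pass < 100; pass++) {
            for (int k = 0; k < ratedItems.length; k++) {
                double[] v = itemFactors[ratedItems[k]];
                double pred = 0.0;
                for (int f = 0; f < numFactors; f++) pred += u[f] * v[f];
                double err = ratings[k] - pred;
                // gradient step on u only; item factors stay fixed
                for (int f = 0; f < numFactors; f++) u[f] += learningRate * err * v[f];
            }
        }
        return u;
    }
}
```

Scoring is then the usual dot product of the folded-in vector with each candidate item's factors, which also covers Pat's "old user with new history" case: pass the full history and fold in from scratch.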
Re: Regarding Online Recommenders
Well, with item-based the only problem is new items. New users can immediately be served by the model (although this is not well supported by the API in Mahout). For the majority of use cases I have seen, it is perfectly fine to have a short delay until new items enter the recommender; usually this happens after a retraining in batch. You have to handle cold-start and collect some interactions anyway.

2013/7/18 Pat Ferrel pat.fer...@gmail.com: Yes, what Myrrix does is good. My last aside was a wish for an item-based online recommender, not only a factorized one. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works, but it does require a bit of tinkering and doesn't have quite the same set of options--no LLR similarity, for instance. On the same subject, I recently attended a workshop in Seattle for UAI 2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor number past where it would perform well. Along the way they saw performance increase as measured by offline precision. They eventually gave up on a factorized solution. This decision seems odd, but anyway… In the case of Walmart, as with our data set, the data is quite diverse. The best idea is probably to create different recommenders for separate parts of the catalog, but if you create one model on all items, our intuition is that item-based works better than factorized. Again, caveat--no A/B tests to support this yet. Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores, so we moved to the Hadoop version. Now we have use cases for an online recommender that handles anonymous new users, and that takes the story full circle.
On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote: Hi Pat I think we should provide a simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as argument. For itembased recommenders, its straightforward to compute recommendations, for userbased you have to search through all users once, for latent factor models, you have to fold the user vector into the low dimensional space. I think Sean already added this method in myrrix and I have some code for my kornakapi project (a simple weblayer for mahout). Would such a method fit your needs? Best, Sebastian 2013/7/17 Pat Ferrel pat.fer...@gmail.com May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is calculated offline still in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update by maybe just transforming new data into the LF space already in place then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow for passing the entire history vector or userID+incremental new history to the recommender? I hope so. For what it's worth we did a comparison of Mahout Item based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of data. The data was purchase data from a diverse ecom source with a large variety of products from electronics to clothes. We found Item based CF did far better than ALS. As we increased the number of latent factors the results got better but were never within 10% of item based (we used MAP as the offline metric). 
Not sure why, but maybe it has to do with the diversity of the item types. I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results. Furthermore, we don't have A/B test results yet to validate the offline metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote: Peng, this is the reason I separated out the DataModel and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored. I kind of agree with the multi-level DataModel approach: one for iterating over all preferences, one for when one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach.)
Re: Regarding Online Recommenders
If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm not sure if it can be used in an item-based recommender, but this is definitely something I would like to pursue. It's probably the only advantage a non-Hadoop implementation can offer in the future. Many non-Hadoop recommenders are pretty fast, but the existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO they are the culprit of the scalability problem. May I ask about the scale of your dataset? How many ratings does it have? Yours, Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote: Well, with item-based the only problem is new items. New users can immediately be served by the model (although this is not well supported by the API in Mahout). For the majority of use cases I saw, it is perfectly fine to have a short delay until new items enter the recommender; usually this happens after a retraining in batch. You have to care for cold-start and collect some interactions anyway.

2013/7/18 Pat Ferrel pat.fer...@gmail.com: Yes, what Myrrix does is good. My last aside was a wish for an item-based online recommender, not only a factorized one. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works, but it does require a bit of tinkering and doesn't have quite the same set of options--no LLR similarity, for instance. On the same subject, I recently attended a workshop in Seattle for UAI 2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor number past where it would perform well. Along the way they saw performance increase as measured by offline precision. They eventually gave up on a factorized solution. This decision seems odd, but anyway… In the case of Walmart, as with our data set, the data is quite diverse.
The best idea is probably to create different recommenders for separate parts of the catalog but if you create one model on all items our intuition is that item-based works better than factorized. Again caveat--no A/B tests to support this yet. Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores so we moved to the hadoop version. Now we have use-cases for an online recommender that handles anonymous new users and that takes the story full circle. On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote: Hi Pat I think we should provide a simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as argument. For itembased recommenders, its straightforward to compute recommendations, for userbased you have to search through all users once, for latent factor models, you have to fold the user vector into the low dimensional space. I think Sean already added this method in myrrix and I have some code for my kornakapi project (a simple weblayer for mahout). Would such a method fit your needs? Best, Sebastian 2013/7/17 Pat Ferrel pat.fer...@gmail.com May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is calculated offline still in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update by maybe just transforming new data into the LF space already in place then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. 
This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow passing the entire history vector or userID + incremental new history to the recommender? I hope so. For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse e-commerce source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types. I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results. Furthermore, we don't have A/B test results yet to validate the offline metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote: Peng, this is the reason I separated out the DataModel and only put the learner stuff there.
Re: Regarding Online Recommenders
It was about 2.5M users and 500K items with 25M actions over 6 months of data. On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote: If I remember right, a highlight of 0.8 release is an online clustering algorithm. I'm not sure if it can be used in item-based recommender, but this is definitely I would like to pursue. It's probably the only advantage a non-hadoop implementation can offer in the future. Many non-hadoop recommenders are pretty fast. But existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes, IMHO they are the culprit of scalability problem. May I ask about the scale of your dataset? how many rating does it have? Yours Peng On 13-07-18 12:14 PM, Sebastian Schelter wrote: Well, with itembased the only problem is new items. New users can immediately be served by the model (although this is not well supported by the API in Mahout). For the majority of usecases I saw, it is perfectly fine to have a short delay until new items enter the recommender, usually this happens after a retraining in batch. You have to care for cold-start and collect some interactions anyway. 2013/7/18 Pat Ferrel pat.fer...@gmail.com Yes, what Myrrix does is good. My last aside was a wish for an item-based online recommender not only factorized. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works but it does require a bit of tinkering and doesn't have quite the same set of options--no llr similarity for instance. On the same subject I recently attended a workshop in Seattle for UAI2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor number past where it would perform well. Along the way they saw increasing performance measuring precision offline. They eventually gave up on a factorized solution. This decision seems odd but anyway… In the case of Walmart and our data set they are quite diverse. 
The best idea is probably to create different recommenders for separate parts of the catalog but if you create one model on all items our intuition is that item-based works better than factorized. Again caveat--no A/B tests to support this yet. Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores so we moved to the hadoop version. Now we have use-cases for an online recommender that handles anonymous new users and that takes the story full circle. On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote: Hi Pat I think we should provide a simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as argument. For itembased recommenders, its straightforward to compute recommendations, for userbased you have to search through all users once, for latent factor models, you have to fold the user vector into the low dimensional space. I think Sean already added this method in myrrix and I have some code for my kornakapi project (a simple weblayer for mahout). Would such a method fit your needs? Best, Sebastian 2013/7/17 Pat Ferrel pat.fer...@gmail.com May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is calculated offline still in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update by maybe just transforming new data into the LF space already in place then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. 
This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow passing the entire history vector or userID + incremental new history to the recommender? I hope so. For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse e-commerce source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types. I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results. Furthermore, we don't have A/B test results yet to validate the offline metric.
Re: Regarding Online Recommenders
Thanks for the encouragement, guys, and my apologies for deviating from the discussion again. As suggested by Sebastian, I looked into Mahout's recommenders and implemented a user-based recommender as well. IMHO, I now have a fair understanding of the recommenders. In order to be a part of the ASF-ICFOSS mentorship programme (https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme), I am required to submit a proposal discussing my plan for working on this idea. My understanding, based on the discussion on this thread so far, is that we are still in the process of finalizing an approach and have yet to start working on the idea. I need your help on the following: 1) Could you please suggest the sub-features that I would be able to contribute to, so that I can mention them under expected deliverables? 2) Also, I would be really grateful if one of you could be my mentor (http://community.apache.org/mentoringprogramme.html) as a part of this programme (as a mentee I would need only some pointers/directions to proceed further). Please let me know. I am really excited and looking forward to a positive response to work on this feature. Thanks & Regards, Abhishek Sharma http://www.linkedin.com/in/abhi21 https://github.com/abhi21

On Thu, Jul 18, 2013 at 2:39 AM, Peng Cheng pc...@uowmail.edu.au wrote: Awesome! Your reinforcements are highly appreciated.

On 13-07-17 01:29 AM, Abhishek Sharma wrote: Sorry to interrupt, guys, but I just wanted to bring it to your notice that I am also interested in contributing to this idea. I am planning to participate in the ASF-ICFOSS mentorship programme (https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme).
(This is very similar to GSOC.) I do have strong concepts in machine learning (I have done the ML course by Andrew Ng on Coursera); also, I am good at programming (2.5 yrs of work experience). I am not really sure how I can approach this problem (but I do have a strong interest to work on it), hence I would like to pair up on this. I am currently working as a research intern at the Indian Institute of Science (IISc), Bangalore, India and can put in 15-20 hrs per week. Please let me know your thoughts on whether I can be a part of this. Thanks, Regards, Abhishek Sharma http://www.linkedin.com/in/abhi21 https://github.com/abhi21

On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan gkhn...@gmail.com wrote: Peng, this is the reason I separated out the DataModel, and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors of them, and does not care where preferences are stored. I, kind of, agree with the multi-level DataModel approach: one for iterating over all preferences, one for when one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach.) Gokhan

On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote: I completely agree, Netflix is less than one gigabyte in a smart representation; 12x more memory is a no-go. The techniques used in FactorizablePreferences allow a much more memory-efficient representation; tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits into 3GB with that approach.

2013/7/16 Ted Dunning ted.dunn...@gmail.com: Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100 million observations should take 1GB.
On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid; we should separate the time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G of heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replace the hash map (the culprit of the high memory consumption) with a list/linked list, it would simply degrade lookups to an O(n) linear search, not too bad either. The current DataModel is a result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it.
-- Abhishek Sharma, ThoughtWorks
Re: Regarding Online Recommenders
Strange, it's just a little larger than the libimseti dataset (17M ratings). Did you encounter an OutOfMemory or GC-overhead-limit exception? Allocating more heap space usually helps. Yours, Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote: It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote: If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm not sure if it can be used in an item-based recommender, but this is definitely something I would like to pursue. It's probably the only advantage a non-hadoop implementation can offer in the future. Many non-hadoop recommenders are pretty fast. But the existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO they are the culprit of the scalability problem. May I ask about the scale of your dataset? How many ratings does it have? Yours, Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote: Well, with item-based the only problem is new items. New users can immediately be served by the model (although this is not well supported by the API in Mahout). For the majority of use cases I saw, it is perfectly fine to have a short delay until new items enter the recommender; usually this happens after a retraining in batch. You have to care for cold-start and collect some interactions anyway.

2013/7/18 Pat Ferrel pat.fer...@gmail.com: Yes, what Myrrix does is good. My last aside was a wish for an item-based online recommender, not only factorized. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works, but it does require a bit of tinkering and doesn't have quite the same set of options--no LLR similarity, for instance. On the same subject, I recently attended a workshop in Seattle for UAI 2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor number past where it would perform well.
Along the way they saw increasing performance, measuring precision offline. They eventually gave up on a factorized solution. This decision seems odd, but anyway… In the case of Walmart, and in our data set, the items are quite diverse. The best idea is probably to create different recommenders for separate parts of the catalog, but if you create one model on all items, our intuition is that item-based works better than factorized. Again, caveat--no A/B tests to support this yet. Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores, so we moved to the hadoop version. Now we have use cases for an online recommender that handles anonymous new users, and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote: Hi Pat, I think we should provide simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as an argument. For item-based recommenders, it's straightforward to compute recommendations; for user-based, you have to search through all users once; for latent factor models, you have to fold the user vector into the low-dimensional space. I think Sean already added this method in Myrrix, and I have some code for my kornakapi project (a simple web layer for Mahout). Would such a method fit your needs? Best, Sebastian

2013/7/17 Pat Ferrel pat.fer...@gmail.com: May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factor model is still calculated offline in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update?
Or perform some partial update, maybe by just transforming new data into the LF space already in place, then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow passing the entire history vector, or userID + incremental new history, to the recommender? I hope so. For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse ecom source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types. I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results. Furthermore, we don't have A/B test results yet to validate the offline metric.
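Sebastian's recommendToAnonymous() idea for latent factor models (folding the anonymous user's history into the low-dimensional space) could look roughly like the sketch below. The names and the ridge-regression fold-in are assumptions for illustration, not code from Mahout, Myrrix, or kornakapi:

```java
// Sketch: given fixed item factors, solve a small ridge regression
// (V_I^T V_I + lambda*I) u = V_I^T r for the anonymous user's latent vector u,
// then score unseen items by dot product.
public class AnonymousFoldIn {

    // Solve (a + lambda*I) x = b for a small k x k system by Gauss-Jordan elimination.
    static double[] solve(double[][] a, double[] b, double lambda) {
        int k = b.length;
        double[][] m = new double[k][k + 1];
        for (int i = 0; i < k; i++) {
            System.arraycopy(a[i], 0, m[i], 0, k);
            m[i][i] += lambda; // ridge term also keeps the system non-singular
            m[i][k] = b[i];
        }
        for (int p = 0; p < k; p++) {
            int max = p; // partial pivoting
            for (int r = p + 1; r < k; r++) {
                if (Math.abs(m[r][p]) > Math.abs(m[max][p])) max = r;
            }
            double[] tmp = m[p]; m[p] = m[max]; m[max] = tmp;
            for (int r = 0; r < k; r++) {
                if (r == p) continue;
                double factor = m[r][p] / m[p][p];
                for (int c = p; c <= k; c++) m[r][c] -= factor * m[p][c];
            }
        }
        double[] x = new double[k];
        for (int i = 0; i < k; i++) x[i] = m[i][k] / m[i][i];
        return x;
    }

    // items/ratings hold the anonymous user's history against known item factors.
    static double[] foldIn(double[][] itemFactors, int[] items, double[] ratings, double lambda) {
        int k = itemFactors[0].length;
        double[][] vtv = new double[k][k];
        double[] vtr = new double[k];
        for (int n = 0; n < items.length; n++) {
            double[] v = itemFactors[items[n]];
            for (int i = 0; i < k; i++) {
                vtr[i] += v[i] * ratings[n];
                for (int j = 0; j < k; j++) vtv[i][j] += v[i] * v[j];
            }
        }
        return solve(vtv, vtr, lambda);
    }

    static double score(double[] user, double[] item) {
        double dot = 0;
        for (int f = 0; f < user.length; f++) dot += user[f] * item[f];
        return dot;
    }

    public static void main(String[] args) {
        double[][] itemFactors = { {1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0} };
        // The anonymous user rated item 0 highly; item 2 shares its direction.
        double[] u = foldIn(itemFactors, new int[] {0}, new double[] {5.0}, 0.01);
        System.out.println(score(u, itemFactors[2]) > score(u, itemFactors[1])); // prints "true"
    }
}
```

This answers Pat's question in one direction: the model itself stays fixed between batch refactorizations, and only the anonymous user's vector is computed on the fly from whatever history is passed in.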
Re: Regarding Online Recommenders
No, it was CPU-bound, not memory. I gave it something like 14G of heap. It was running, just too slow to be of any real use. We switched to the hadoop version and stored precalculated recs in a db for every user.
Re: Regarding Online Recommenders
I see, sorry, I was too presumptuous. I only recently worked with and tested SVDRecommender; I never could have known its efficiency relative to an item-based recommender. Maybe there is space for algorithmic optimization. The online recommender Gokhan is working on is also an SVDRecommender. An online user-based or item-based recommender based on a clustering technique would definitely be critical, but we need an expert to volunteer :) Perhaps Dr Dunning can say a few words? He announced the online clustering component. Yours, Peng
Re: Regarding Online Recommenders
I just started to implement a matrix-backed data model and pushed it, to check the performance and memory considerations. I believe I can try it on some data tomorrow. Best, Gokhan
Re: Regarding Online Recommenders
Wow, that's lightning fast. Is it a SparseMatrix or a DenseMatrix?
Re: Regarding Online Recommenders
It is 2 SparseRowMatrices, Peng. But I don't want to comment on it before actually trying it. This is essentially a first step for me to choose my side in the DataModel implementation discussion :) Gokhan
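The dual-matrix layout Gokhan describes (the same history kept once row-indexed by user and once by item) can be illustrated with a minimal stand-in. This is not Mahout's SparseRowMatrix; plain hash maps are used here just to show the dual-index write path and the locking needed to keep both views consistent, which is the concurrency concern raised elsewhere in the thread:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a matrix-backed DataModel: every preference lives in
// two sparse structures so that both per-user and per-item access are fast.
public class DualIndexDataModel {
    private final Map<Long, Map<Long, Float>> byUser = new HashMap<>();
    private final Map<Long, Map<Long, Float>> byItem = new HashMap<>();

    // Both views are updated under one lock so readers never see a half-applied write.
    public synchronized void setPreference(long userId, long itemId, float value) {
        byUser.computeIfAbsent(userId, k -> new HashMap<>()).put(itemId, value);
        byItem.computeIfAbsent(itemId, k -> new HashMap<>()).put(userId, value);
    }

    public synchronized void removePreference(long userId, long itemId) {
        Map<Long, Float> row = byUser.get(userId);
        if (row != null) row.remove(itemId);
        Map<Long, Float> col = byItem.get(itemId);
        if (col != null) col.remove(userId);
    }

    public synchronized Float getPreferenceValue(long userId, long itemId) {
        Map<Long, Float> row = byUser.get(userId);
        return row == null ? null : row.get(itemId);
    }

    public synchronized int getNumUsersWhoPreferred(long itemId) {
        Map<Long, Float> col = byItem.get(itemId);
        return col == null ? 0 : col.size();
    }

    public static void main(String[] args) {
        DualIndexDataModel model = new DualIndexDataModel();
        model.setPreference(1L, 100L, 4.5f);
        model.setPreference(2L, 100L, 3.0f);
        model.removePreference(1L, 100L);
        System.out.println(model.getNumUsersWhoPreferred(100L)); // prints 1
    }
}
```

The cost of the fast dual access is exactly what the thread flags: every set and remove touches two structures, so it must be atomic across both, and the memory is paid twice.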
Re: Regarding Online Recommenders
May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is still calculated offline in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update, maybe by just transforming new data into the LF space already in place, then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow for passing the entire history vector or userID+incremental new history to the recommender? I hope so. For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse e-commerce source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types. I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results. Furthermore, we don't have A/B test results yet to validate the offline metric. On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote: Peng, This is the reason I separated out the DataModel, and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored.
I, kind of, agree with the multi-level DataModel approach: one for iterating over all preferences, one for when one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach.) Gokhan On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote: I completely agree, Netflix is less than one gigabyte in a smart representation; 12x more memory is a no-go. The techniques used in FactorizablePreferences allow a much more memory-efficient representation, tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits into 3GB with that approach. 2013/7/16 Ted Dunning ted.dunn...@gmail.com Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100 million observations should take 1GB. On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid; we should separate time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replace the hash map (the culprit of high memory consumption) with a list/linked list, it would simply degrade a lookup to an O(n) linear search, not too bad either. The current DataModel is a result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it.
Re: Regarding Online Recommenders
Hi Pat, please see my response inline. Best, Gokhan On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote: May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is still calculated offline in batch mode, then there are periodic updates? How are the updates handled? If you are referring to the recommender under discussion here, no, updating the model can be done with a single preference, using stochastic gradient descent, by updating the particular user and item factors simultaneously. Do you plan to require batch model refactorization for any update? Or perform some partial update, maybe by just transforming new data into the LF space already in place, then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow for passing the entire history vector or userID+incremental new history to the recommender? I hope so. For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse e-commerce source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types. My first question: are those actions only positive, like the purchases you mentioned? I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results.
Furthermore, we don't have A/B test results yet to validate the offline metric. I personally think an A/B test is the best way to evaluate a recommender, and if you are able to share it, I look forward to seeing the results. I believe that would be a great contribution for some future decisions.
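Gokhan's point above, that an SGD-based factorizer can absorb a single new (user, item, rating) observation by touching only the affected user and item factor rows, with no batch refactorization, can be sketched roughly as follows. The class name, learning rate, and regularization constant are illustrative assumptions, not Mahout code:

```java
// Minimal online matrix factorization sketch: one stochastic gradient step
// per observed preference, updating the two touched factor rows simultaneously.
class SgdFactorizerSketch {
    final double[][] userFactors, itemFactors;
    final double learningRate = 0.01, lambda = 0.05; // illustrative hyperparameters

    SgdFactorizerSketch(int numUsers, int numItems, int k) {
        userFactors = new double[numUsers][k];
        itemFactors = new double[numItems][k];
        java.util.Random rng = new java.util.Random(42);
        for (double[] row : userFactors) for (int f = 0; f < k; f++) row[f] = 0.1 * rng.nextGaussian();
        for (double[] row : itemFactors) for (int f = 0; f < k; f++) row[f] = 0.1 * rng.nextGaussian();
    }

    double predict(int u, int i) {
        double dot = 0;
        for (int f = 0; f < userFactors[u].length; f++) dot += userFactors[u][f] * itemFactors[i][f];
        return dot;
    }

    /** Online update: adjust user u's and item i's factor rows from one rating. */
    void train(int u, int i, double rating) {
        double err = rating - predict(u, i);
        for (int f = 0; f < userFactors[u].length; f++) {
            double uf = userFactors[u][f], vf = itemFactors[i][f];
            userFactors[u][f] += learningRate * (err * vf - lambda * uf);
            itemFactors[i][f] += learningRate * (err * uf - lambda * vf);
        }
    }
}
```

Each train() call is O(k), independent of the number of stored preferences, which is what makes the update viable online.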
Re: Regarding Online Recommenders
I see, OK, so we shouldn't use the old implementation. But I mean, the old interface doesn't have to be discarded. The discrepancy between your FactorizablePreferences and DataModel is that your model supports getPreferences(), which returns all preferences as an iterator, while DataModel supports a few old functions that return preferences for an individual user or item. My point is that it is not hard for each of them to implement what the other lacks: the old DataModel can implement getPreferences() just by a loop in an abstract class. Your new FactorizablePreferences can implement those old functions by a binary search that takes O(log n) time, or an interpolation search that takes O(log log n) time on average. So can the online update. It will just be a matter of different speed and space, not a different interface standard; we can use the old unit tests, old examples, old everything. And we will be more flexible in writing ensemble recommenders. Just a few thoughts, I'll have to validate the idea first before creating a new JIRA ticket. Yours Peng
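Peng's argument, that a sequential, array-backed store can still serve "all preferences of user u" via an O(log n) binary search over a user-sorted array, might look like this in outline. The class and method names are hypothetical, not part of Mahout:

```java
// Sketch: parallel primitive arrays sorted by user ID, answering per-user
// lookups by binary search plus a widening scan over the equal-key run.
class SequentialPreferencesSketch {
    final long[] userIds;   // sorted ascending by user
    final long[] itemIds;
    final float[] values;

    SequentialPreferencesSketch(long[] userIds, long[] itemIds, float[] values) {
        this.userIds = userIds; this.itemIds = itemIds; this.values = values;
    }

    /** Index range [from, to) of the preferences belonging to user u, or [0, 0) if none. */
    int[] userRange(long u) {
        int pos = java.util.Arrays.binarySearch(userIds, u);
        if (pos < 0) return new int[]{0, 0}; // user not present
        int from = pos, to = pos + 1;
        while (from > 0 && userIds[from - 1] == u) from--;
        while (to < userIds.length && userIds[to] == u) to++;
        return new int[]{from, to};
    }
}
```

Three primitive arrays cost roughly 20 bytes per rating, versus the per-entry object overhead of a hash-map-of-objects layout, which is the memory gap the thread is discussing.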
Re: Regarding Online Recommenders
Hi Peng, I never wanted to discard the old interface, I just wanted to split it up. I want to have a simple interface that only supports sequential access (and allows for very memory-efficient implementations, e.g. by the use of primitive arrays). DataModel should *extend* this interface and provide sequential and random access (basically what it already does). Then a recommender such as SGD could state that it only needs sequential access to the preferences, and you can either feed it a DataModel (so we don't break backwards compatibility) or a memory-efficient sequential access thingy. Does that make sense to you?
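A minimal sketch of the split Sebastian describes, with hypothetical interface names (none of these are existing Mahout types): a sequential-access contract that primitive-array-backed stores can satisfy cheaply, which a random-access interface then extends:

```java
// The narrow contract: iterate over all preferences once, in storage order.
interface SequentialPreferences {
    long numPreferences();
    void forEachPreference(PreferenceVisitor visitor);
    interface PreferenceVisitor { void visit(long userId, long itemId, float value); }
}

// The wider contract extends the narrow one, as DataModel would extend the
// sequential interface in the proposal.
interface RandomAccessPreferences extends SequentialPreferences {
    float getPreferenceValue(long userId, long itemId);
}

// A memory-lean implementation of the narrow contract over parallel arrays.
class ArrayBackedPreferences implements SequentialPreferences {
    private final long[] users, items;
    private final float[] values;

    ArrayBackedPreferences(long[] users, long[] items, float[] values) {
        this.users = users; this.items = items; this.values = values;
    }
    public long numPreferences() { return users.length; }
    public void forEachPreference(PreferenceVisitor v) {
        for (int i = 0; i < users.length; i++) v.visit(users[i], items[i], values[i]);
    }
}
```

An SGD trainer declaring its input as SequentialPreferences could then accept either a full DataModel or the lean array-backed store, which is the backwards-compatibility point being made.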
Re: Regarding Online Recommenders
Mmm, you are right, the simplest solution is usually the best. I'm creating a new JIRA ticket. Yours Peng
Re: Regarding Online Recommenders
Awesome! Your reinforcements are highly appreciated. On 13-07-17 01:29 AM, Abhishek Sharma wrote: Sorry to interrupt, guys, but I just wanted to bring to your notice that I am also interested in contributing to this idea. I am planning to participate in the ASF-ICFOSS mentorship programme https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme (this is very similar to GSoC). I have strong concepts in machine learning (I have done the ML course by Andrew Ng on Coursera); also, I am good at programming (I have 2.5 years of work experience). I am not really sure how I can approach this problem (but I do have a strong interest in working on it), hence I would like to pair up on this. I am currently working as a research intern at the Indian Institute of Science (IISc), Bangalore, India, and can put in 15-20 hrs per week. Please let me know your thoughts on whether I can be a part of this. Thanks Regards, Abhishek Sharma http://www.linkedin.com/in/abhi21 https://github.com/abhi21
Re: Regarding Online Recommenders
One thing that suddenly comes to my mind is that, for a simple interface like FactorizablePreferences, maybe sequential READ in real time is possible, but sequential WRITE in O(1) time is utopia. Because you need to flush out the old preference with the same user and item ID (in the worst case it could be an interpolation search), otherwise you are permitting a user to rate an item twice with different values. Considering how FileDataModel is supposed to work (new files flush old files), maybe using the simple interface has fewer advantages than we used to believe.
Re: Regarding Online Recommenders
You are completely right, the simple interface would only be usable for read-only / batch-updatable recommenders. Online recommenders might need something different. I tried to widen the discussion here to cover all kinds of API changes in the recommenders that would be necessary in the future.
Re: Regarding Online Recommenders
Yeah, setPreference() and removePreference() shouldn't be there, but injecting a Recommender back into the DataModel is a rather strong dependency, which may intermingle components with different concerns. Maybe we can do something with the RefreshHelper class? E.g. push something into a swap field so that the downstream of a refreshable chain can read it out. I have read Gokhan's UpdateAwareDataModel, and feel that it's probably too heavyweight for a model selector, as every time he changes the algorithm he has to re-register it. The second idea is indeed splendid; we should separate time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replace the hash map (the culprit of high memory consumption) with a list/linked list, it would simply degrade a lookup to an O(n) linear search, not too bad either. The current DataModel is a result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it. All the best, Yours Peng On 13-07-16 01:05 AM, Sebastian Schelter wrote: Hi Gokhan, I like your proposals and I think this is an important discussion. Peng is also interested in working on online recommenders, so we should try to team up our efforts. I'd like to extend the discussion a little to related API changes that I think are necessary. What do you think about completely removing the setPreference() and removePreference() methods from Recommender? I think they don't belong there, for two reasons: first, they duplicate functionality from DataModel, and second, a lot of recommenders are read-only/train-once and cannot handle single preference updates anyway.
I think we should have a DataModel implementation that can be updated, and an online learning recommender should be able to register to be notified of updates. We should furthermore split up the DataModel interface into a hierarchy of three parts: first, a simple read-only interface that allows sequential access to the data (similar to FactorizablePreferences). This allows us to create memory-efficient implementations. E.g. Cheng reported in MAHOUT-1272 that the current DataModel needs 12GB of heap for the Netflix dataset (100M ratings), which is unacceptable. I was able to fit the KDD Music dataset (250M ratings) into 3GB with FactorizablePreferences. The second interface would extend the read-only interface and should resemble what DataModel is today: an easy-to-use in-memory implementation that trades high memory consumption for convenient random access. And finally, the third interface would extend the second and provide tooling for online updates of the data. What do you think of that? Does it sound reasonable? --sebastian The DataModel I imagine would follow the current API, where the underlying preference storage is replaced with a matrix. A Recommender would then use the DataModel and the OnlineLearner, where Recommender#setPreference is delegated to DataModel#setPreference (like it does now), and DataModel#setPreference triggers OnlineLearner#train.
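The notification wiring described above (DataModel#setPreference triggering OnlineLearner#train) could be sketched as a listener registration, roughly as follows. Apart from the setPreference/train method names taken from the proposal, every name here is an assumption, not an existing Mahout class:

```java
// Observer sketch: an updatable model that pushes each stored preference to
// every registered online learner, so training happens as data arrives.
interface OnlineLearner { void train(long userId, long itemId, float value); }

class UpdatableModelSketch {
    private final java.util.List<OnlineLearner> learners = new java.util.ArrayList<>();
    private final java.util.Map<Long, java.util.Map<Long, Float>> prefs = new java.util.HashMap<>();

    void addListener(OnlineLearner learner) { learners.add(learner); }

    /** Store (or overwrite) the preference, then notify every registered learner. */
    void setPreference(long userId, long itemId, float value) {
        prefs.computeIfAbsent(userId, k -> new java.util.HashMap<>()).put(itemId, value);
        for (OnlineLearner l : learners) l.train(userId, itemId, value);
    }
}
```

This keeps the dependency one-directional (model notifies learner) rather than injecting the Recommender back into the DataModel, which is the coupling Peng objects to above.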
Re: Regarding Online Recommenders
Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100 million observations should take 1GB. On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid; we should separate time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replace the hash map (the culprit of high memory consumption) with a list/linked list, it would simply degrade a lookup to an O(n) linear search, not too bad either. The current DataModel is a result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it.
Re: Regarding Online Recommenders
I completely agree, Netflix is less than one gigabyte in a smart representation; 12x more memory is a no-go. The techniques used in FactorizablePreferences allow a much more memory-efficient representation, tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits into 3GB with that approach.
Re: Regarding Online Recommenders
Peng, This is the reason I separated out the DataModel and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored. I, kind of, agree with the multi-level DataModel approach: one for iterating over all preferences, and one for when one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach.) Gokhan On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote:
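The matrix-backed idea Gokhan later reports on (the same history encoded in two row-indexed structures, one by user and one by item) can be sketched as follows. This is an illustrative sketch, not Mahout code: it uses plain maps for brevity where the real experiment used SparseRowMatrix, and it shows why sets and removes must be thread-safe, since every write touches both views:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: the same rating history kept in two row-indexed
// views so that both "preferences of a user" and "preferences for an
// item" are single row lookups. Writes must update both views
// atomically -- the concurrency issue mentioned in the thread.
public class DualIndexedPreferences {
    private final Map<Long, Map<Long, Float>> byUser = new HashMap<>();
    private final Map<Long, Map<Long, Float>> byItem = new HashMap<>();

    public synchronized void setPreference(long userID, long itemID, float value) {
        byUser.computeIfAbsent(userID, k -> new HashMap<>()).put(itemID, value);
        byItem.computeIfAbsent(itemID, k -> new HashMap<>()).put(userID, value);
    }

    public synchronized void removePreference(long userID, long itemID) {
        Map<Long, Float> userRow = byUser.get(userID);
        if (userRow != null) userRow.remove(itemID);
        Map<Long, Float> itemRow = byItem.get(itemID);
        if (itemRow != null) itemRow.remove(userID);
    }

    public synchronized Map<Long, Float> preferencesFromUser(long userID) {
        return byUser.getOrDefault(userID, new HashMap<>());
    }

    public synchronized Map<Long, Float> preferencesForItem(long itemID) {
        return byItem.getOrDefault(itemID, new HashMap<>());
    }

    public static void main(String[] args) {
        DualIndexedPreferences model = new DualIndexedPreferences();
        model.setPreference(1L, 100L, 4.0f);
        model.setPreference(2L, 100L, 3.5f);
        model.setPreference(1L, 101L, 5.0f);
        model.removePreference(1L, 100L);
        System.out.println(model.preferencesFromUser(1L).size() + " "
            + model.preferencesForItem(100L).size());
    }
}
```

The cost of the dual index is that every preference is stored twice, which is consistent with Gokhan's later report that the matrix-backed model still needed a sizable heap for Netflix.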
Re: Regarding Online Recommenders
Sorry to interrupt guys, but I just wanted to bring to your notice that I am also interested in contributing to this idea. I am planning to participate in the ASF-ICFOSS mentorship programme (https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme), which is very similar to GSoC. I have a strong grounding in machine learning (I have done the ML course by Andrew Ng on Coursera), and I am good at programming (2.5 yrs of work experience). I am not really sure how I can approach this problem (but I do have a strong interest in working on it), hence I would like to pair up on this. I am currently working as a research intern at the Indian Institute of Science (IISc), Bangalore, India and can put in 15-20 hrs per week. Please let me know your thoughts on whether I can be a part of this. Thanks Regards, Abhishek Sharma http://www.linkedin.com/in/abhi21 https://github.com/abhi21 On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan gkhn...@gmail.com wrote: -- Abhishek Sharma ThoughtWorks
Re: Regarding Online Recommenders
Hi Abhishek, Great to hear that you're willing to put some work into this! Have you ever worked with Mahout's recommenders before? If not, then a good first step would be to get familiar with them and code up a few examples. Best, Sebastian On 17.07.2013 07:29, Abhishek Sharma wrote:
Re: Regarding Online Recommenders
Hi Gokhan, I like your proposals and I think this is an important discussion. Peng is also interested in working on online recommenders, so we should try to team up our efforts.

I'd like to extend the discussion a little to related API changes that I think are necessary. What do you think about completely removing the setPreference() and removePreference() methods from Recommender? I think they don't belong there for two reasons: first, they duplicate functionality from DataModel, and second, a lot of recommenders are read-only/train-once and cannot handle single preference updates anyway. I think we should have a DataModel implementation that can be updated, and an online learning recommender should be able to register to be notified of updates.

We should furthermore split up the DataModel interface into a hierarchy of three parts. First, a simple read-only interface that allows sequential access to the data (similar to FactorizablePreferences). This allows us to create memory-efficient implementations; e.g., Peng Cheng reported in MAHOUT-1272 that the current DataModel needs 12GB of heap for the Netflix dataset (100M ratings), which is unacceptable. I was able to fit the KDD Music dataset (250M ratings) into 3GB with FactorizablePreferences. The second interface would extend the read-only interface and should resemble what DataModel is today: an easy-to-use in-memory implementation that trades high memory consumption for convenient random access. And finally, the third interface would extend the second and provide tooling for online updates of the data.

What do you think of that? Does it sound reasonable? --sebastian

The DataModel I imagine would follow the current API, where the underlying preference storage is replaced with a matrix. A Recommender would then use the DataModel and the OnlineLearner, where Recommender#setPreference is delegated to DataModel#setPreference (like it does now), and DataModel#setPreference triggers OnlineLearner#train.
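The three-level hierarchy and the update-notification idea proposed above can be sketched in Java as follows. All interface and method names here are hypothetical illustrations of the proposal, not actual Mahout interfaces:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of the proposed DataModel hierarchy. Names are illustrative.
public class DataModelHierarchySketch {

    // Level 1: read-only, sequential access (allows primitive-array backends).
    interface SequentialPreferences {
        interface Visitor { void visit(long userID, long itemID, float value); }
        void forEachPreference(Visitor v);
    }

    // Level 2: adds convenient random access (roughly what DataModel is today).
    interface RandomAccessPreferences extends SequentialPreferences {
        Float getPreferenceValue(long userID, long itemID);
    }

    // Level 3: adds online updates plus listener registration, so an online
    // learner is notified on each setPreference and can train incrementally.
    interface UpdatablePreferences extends RandomAccessPreferences {
        interface UpdateListener { void onPreference(long userID, long itemID, float value); }
        void setPreference(long userID, long itemID, float value);
        void addUpdateListener(UpdateListener listener);
    }

    // Minimal in-memory implementation to show the notification flow.
    static class InMemoryPreferences implements UpdatablePreferences {
        private final Map<Long, Map<Long, Float>> byUser = new HashMap<>();
        private final List<UpdateListener> listeners = new ArrayList<>();

        public void forEachPreference(Visitor v) {
            byUser.forEach((u, row) -> row.forEach((i, val) -> v.visit(u, i, val)));
        }

        public Float getPreferenceValue(long userID, long itemID) {
            Map<Long, Float> row = byUser.get(userID);
            return row == null ? null : row.get(itemID);
        }

        public void setPreference(long userID, long itemID, float value) {
            byUser.computeIfAbsent(userID, k -> new HashMap<>()).put(itemID, value);
            for (UpdateListener l : listeners) l.onPreference(userID, itemID, value);
        }

        public void addUpdateListener(UpdateListener listener) {
            listeners.add(listener);
        }
    }

    public static void main(String[] args) {
        InMemoryPreferences prefs = new InMemoryPreferences();
        int[] trainCalls = {0};
        // Stand-in for an online learner reacting to each update.
        prefs.addUpdateListener((u, i, v) -> trainCalls[0]++);
        prefs.setPreference(1L, 100L, 4.0f);
        prefs.setPreference(2L, 100L, 3.5f);
        System.out.println(trainCalls[0] + " " + prefs.getPreferenceValue(1L, 100L));
    }
}
```

Under this split, a batch recommender such as SGD would declare a dependency on SequentialPreferences only, so it can be fed either a full DataModel or a compact sequential implementation, which matches the backwards-compatibility goal discussed earlier in the thread.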