Hi Abhishek,

Great to hear that you're willing to put some work into this! Have you ever worked with Mahout's recommenders before? If not, then a good first step would be to get familiar with them and code up a few examples.

Best,
Sebastian

On 17.07.2013 07:29, Abhishek Sharma wrote:
> Sorry to interrupt guys, but I just wanted to bring it to your notice that
> I am also interested in contributing to this idea. I am planning to
> participate in the ASF-ICFOSS mentorship
> programme <https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme>
> (this is very similar to GSoC).
>
> I have a strong grounding in machine learning (I have done the ML course by
> Andrew Ng on Coursera), and I am good at programming (2.5 yrs of work
> experience). I am not really sure how to approach this problem (but I
> do have a strong interest in working on it), hence I would like to pair
> up on this. I am currently working as a research intern at the Indian Institute
> of Science (IISc), Bangalore, India, and can put in 15-20 hrs per week.
>
> Please let me know your thoughts on whether I can be a part of this.
>
> Thanks & Regards,
> Abhishek Sharma
> http://www.linkedin.com/in/abhi21
> https://github.com/abhi21
>
>
> On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan <[email protected]> wrote:
>
>> Peng,
>>
>> This is the reason I separated out the DataModel, and only put the learner
>> stuff there. The learner I mentioned yesterday just stores the
>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>> where preferences are stored.
>>
>> I, kind of, agree with the multi-level DataModel approach:
>> one for iterating over "all" preferences, and one for when one wants to deploy a
>> recommender and perform a lot of top-N recommendation tasks.
>>
>> (Or one DataModel with a strategy that might reduce existing memory
>> consumption, while still providing fast access, I am not sure. Let me try a
>> matrix-backed DataModel approach.)
>>
>> Gokhan
>>
>>
>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]> wrote:
>>
>>> I completely agree, Netflix is less than one gigabyte in a smart
>>> representation, 12x more memory is a no-go. The techniques used in
>>> FactorizablePreferences allow a much more memory-efficient representation,
>>> tested on the KDD Music dataset, which is approx. 2.5 times Netflix and fits into
>>> 3GB with that approach.
>>>
>>>
>>> 2013/7/16 Ted Dunning <[email protected]>
>>>
>>>> Netflix is a small dataset. 12G for that seems quite excessive.
>>>>
>>>> Note also that this is before you have done any work.
>>>>
>>>> Ideally, 100 million observations should take << 1GB.
>>>>
>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]> wrote:
>>>>
>>>>> The second idea is indeed splendid, we should separate time-complexity-first
>>>>> and space-complexity-first implementations. What I'm not quite sure about
>>>>> is whether we really need to create two interfaces instead of one.
>>>>> Personally, I think 12G heap space is not that high, right? Most new laptops
>>>>> can already handle that (emphasis on laptop). And if we replace the hash map
>>>>> (the culprit of high memory consumption) with list/linkedList, it would
>>>>> simply degrade lookups to a linear search, O(n), not too bad
>>>>> either. The current DataModel is the result of careful thought and has
>>>>> undergone extensive testing; it is easier to expand on top of it instead of
>>>>> subverting it.
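
As a rough check on the memory figures quoted in this thread, the arithmetic can be sketched in a few lines of plain Java. This is only a back-of-the-envelope estimate under stated assumptions, not a description of how FactorizablePreferences or any Mahout DataModel actually lays out its data. The class and method names are hypothetical, the user and item counts are the published Netflix Prize figures (not taken from this thread), 20 latent factors is an arbitrary choice, and the compact layout assumes one 4-byte item ID plus one 1-byte rating per preference, held in primitive arrays.

```java
// Back-of-the-envelope memory estimates for the figures quoted in the thread.
// All names here are hypothetical; nothing in this class is Mahout API.
public class MemoryEstimate {

    // Bytes needed for the learner's parameters alone:
    // (noOfUsers + noOfItems) * noOfLatentFactors values, stored as doubles.
    public static long factorBytes(long noOfUsers, long noOfItems, long noOfLatentFactors) {
        return (noOfUsers + noOfItems) * noOfLatentFactors * Double.BYTES;
    }

    // Bytes needed for the preferences when they are held in primitive arrays,
    // assuming a fixed number of bytes per item ID and per rating and ignoring
    // the comparatively small per-user offset index.
    public static long compactPreferenceBytes(long noOfPreferences,
                                              int bytesPerItemId,
                                              int bytesPerRating) {
        return noOfPreferences * (bytesPerItemId + bytesPerRating);
    }

    public static void main(String[] args) {
        long users = 480_189L;             // assumption: Netflix Prize user count
        long items = 17_770L;              // assumption: Netflix Prize movie count
        long preferences = 100_000_000L;   // "100 million observations" from the thread
        long latentFactors = 20L;          // assumption: arbitrary illustrative value

        long factors = factorBytes(users, items, latentFactors);
        long prefs = compactPreferenceBytes(preferences, Integer.BYTES, Byte.BYTES);

        System.out.println("factor parameters: " + factors + " bytes");
        System.out.println("compact preferences: " + prefs + " bytes");
        System.out.println("preferences under 1 GiB: " + (prefs < (1L << 30)));
    }
}
```

Under these assumptions the preferences dominate and the factor parameters are comparatively small, which is consistent with keeping the learner independent of where the preferences are stored.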
