That might slow down the job enormously for certain nasty inputs. The more I think about it, the more convinced I am that there should be a post-processing pass to enforce things like not recommending input items. The recommendation algorithm itself should not be distorted to do this if it is unnatural. Forcing a user to turn off sampling is a great example: sampling and filtering out input items should be two separate controls.
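To make the post-processing idea concrete, here is a rough sketch in Python. It is not Mahout code; the function names, the in-memory data layout, and the sample data are all made up for illustration. It assumes the preferences arrive as user,item[,value] lines and the recommendations as a mapping from user to (item, score) pairs.

```python
from collections import defaultdict


def load_seen_items(pref_lines):
    """Map each user to the set of items they already expressed a preference for."""
    seen = defaultdict(set)
    for line in pref_lines:
        line = line.strip()
        if not line:
            continue
        # Lines look like "user,item" or "user,item,value"; the value is ignored.
        user, item = line.split(",")[:2]
        seen[user].add(item)
    return seen


def filter_recommendations(recs, seen):
    """Drop every recommended item that the user has already expressed a preference for."""
    return {
        user: [(item, score) for item, score in items
               if item not in seen.get(user, set())]
        for user, items in recs.items()
    }


# Hypothetical sample data, only to show the shape of the inputs.
prefs = ["1,101,5.0", "1,102,3.0", "2,101,4.0"]
recs = {
    "1": [("101", 4.8), ("103", 4.1)],
    "2": [("102", 3.9), ("101", 3.5)],
}

filtered = filter_recommendations(recs, load_seen_items(prefs))
print(filtered)
```

The point of doing it this way is that the filter only looks at the final output, so it behaves the same whether or not the similarity computation sampled the preferences.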
I think that the original point is also correct, however. The user should not be forced to implement this very common step. As such I think that the recommender code should still support doing this, but it really ought to be as an output filter.

On Wed, Aug 7, 2013 at 9:19 AM, Sebastian Schelter <s...@apache.org> wrote:

> if you also set --maxPrefsPerUserInItemSimilarity to a number higher than the max preferences per user, no sampling should occur. This might slow down the job however.
>
> 2013/8/7 Rafal Lukawiecki <ra...@projectbotticelli.com>
>
> > Is there a set of parameters which I could pass to RecommenderJob to avoid that random sampling, in order to create a test case for the issue I have experienced? Would setting --maxSimilaritiesPerItem and/or --maxPrefsPerUserInItemSimilarity help? Many thanks.
> >
> > On 7 Aug 2013, at 16:12, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
> >
> > It could affect the results even in this case, as we also sample the preferences when computing similar items.
> >
> > On 07.08.2013 17:07, Rafal Lukawiecki wrote:
> > > Thank you, Sebastian. Would the random sampling affect the results of RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the actual, maximum number of preferences expressed by every user.
> > >
> > > Rafal
> > >
> > > On 7 Aug 2013, at 15:48, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
> > >
> > > The code in trunk allows you to specify a randomSeed, the older versions don't unfortunately.
> > >
> > > On 07.08.2013 16:35, Rafal Lukawiecki wrote:
> > >> Hi Sebastian,
> > >>
> > >> The quantity of returned "duplicates" is much too large to be caused just by sampling's randomness. I wonder if this could be related to something that is platform-specific, as in Windows vs. *nix representation of input files, data types etc.
> > >>
> > >> For argument's sake, is it possible to fix the seed of the random aspect of the sampling so I could feed the same input through two platforms and compare the results?
> > >>
> > >> Rafal
> > >>
> > >> On 7 Aug 2013, at 15:20, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
> > >>
> > >> Hi Rafal,
> > >>
> > >> this sounds really strange, the bug should not have anything to do with the version of Hadoop that you are running. You could sometimes not see it due to the random sampling of the preferences.
> > >>
> > >> --sebastian
> > >>
> > >> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
> > >>> Sebastian,
> > >>>
> > >>> I've been doing a little more digging regarding the issue of preferences being calculated for already preferred items. I re-ran the jobs using the same data and the same parameters on a different installation of Hadoop, and the problem seems to have gone away. For now it looks like the issue arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does not show up, yet, in my tests under Hadoop 1.2.1 compiled for OS X. I will work a little more to ensure my results, but if they stood up, should I still report it as a Mahout issue?
> > >>>
> > >>> Rafal
> > >>> --
> > >>> Rafal Lukawiecki
> > >>> Strategic Consultant and Director
> > >>> Project Botticelli Ltd
> > >>>
> > >>> On 1 Aug 2013, at 17:31, Sebastian Schelter <s...@apache.org> wrote:
> > >>>
> > >>> Setting it to the maximum number should be enough. Would be great if you can share your dataset and tests.
> > >>>
> > >>> 2013/8/1 Rafal Lukawiecki <ra...@projectbotticelli.com>
> > >>>
> > >>>> Should I have set that parameter to a value much much larger than the maximum number of actually expressed preferences by a user?
> > >>>>
> > >>>> I'm working on an anonymised data set.
> > >>>> If it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's.
> > >>>>
> > >>>> Rafal
> > >>>> --
> > >>>> Rafal Lukawiecki
> > >>>> Pardon brevity, mobile device.
> > >>>>
> > >>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <s...@apache.org> wrote:
> > >>>>
> > >>>>> Ok, please file a bug report detailing what you've tested and what results you got.
> > >>>>>
> > >>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not help? That surprises me.
> > >>>>>
> > >>>>> 2013/8/1 Rafal Lukawiecki <ra...@projectbotticelli.com>
> > >>>>>
> > >>>>>> Hi Sebastian,
> > >>>>>>
> > >>>>>> I've rechecked the results, and I'm afraid that the issue has not gone away, contrary to my enthusiastic response yesterday. Using 0.8 I have retested with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file, without the preference value, that is as: user,item (one per line) as a --filterFile, with and without --maxPrefsPerUser, and I am afraid we are also seeing recommendations for items the user has expressed a prior preference for.
> > >>>>>>
> > >>>>>> I suppose I need to file a bug report.
> > >>>>>>
> > >>>>>> Rafal
> > >>>>>> --
> > >>>>>> Rafal Lukawiecki
> > >>>>>> Pardon my brevity, sent from a telephone.
> > >>>>>>
> > >>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <ra...@projectbotticelli.com> wrote:
> > >>>>>>
> > >>>>>>> Dear Sebastian,
> > >>>>>>>
> > >>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue in our case. It seems that the most preferences a user had was just about 5000, so I doubled it just in case, but when I operationalise this model, I will make sure to calculate the actual max number of preferences and set the parameter accordingly. I will double-check the resultset to make sure the issue is really gone, as I have only checked the few cases where we have spotted a recommendation of a previously preferred item.
> > >>>>>>>
> > >>>>>>> Would you like me to file a bug, and would you like me to test it on 0.8 or another version? I am using 0.7.
> > >>>>>>>
> > >>>>>>> Thanks for your kind support.
> > >>>>>>> Rafal
> > >>>>>>> --
> > >>>>>>> Rafal Lukawiecki
> > >>>>>>> Strategic Consultant and Director
> > >>>>>>> Project Botticelli Ltd
> > >>>>>>>
> > >>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
> > >>>>>>>
> > >>>>>>> Hi Rafal,
> > >>>>>>>
> > >>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number of interactions per user and see if you still get the error?
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Sebastian
> > >>>>>>>
> > >>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
> > >>>>>>>> Thank you Sebastian. The data set is not that large, as we are running tests on a subset. It is about 24k users, 40k items, the preference file has 65k preferences as triples. This was using Similarity Cooccurrence.
> > >>>>>>>>
> > >>>>>>>> I can see if I could anonymise the data set to share if that would be helpful.
> > >>>>>>>>
> > >>>>>>>> Thanks for your kind help.
> > >>>>>>>>
> > >>>>>>>> Rafal
> > >>>>>>>> --
> > >>>>>>>> Rafal Lukawiecki
> > >>>>>>>> Pardon my brevity, sent from a telephone.
> > >>>>>>>>
> > >>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <s...@apache.org> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Rafal,
> > >>>>>>>>>
> > >>>>>>>>> can you issue a ticket for this problem at https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that check whether this happens and currently they work fine. I can only imagine that the problem occurs in larger datasets where we sample the data in some places. Can you describe a scenario/dataset where this happens?
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Sebastian
> > >>>>>>>>>
> > >>>>>>>>> 2013/7/30 Rafal Lukawiecki <ra...@projectbotticelli.com>
> > >>>>>>>>>
> > >>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working on an amazing piece of software, thank you for building Mahout and for your support. My apologies if this is not the right place to ask the question. I have searched for the issue, and I can see this problem has been reported here:
> > >>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
> > >>>>>>>>>>
> > >>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not found a way, yet, to get an answer from them, without asking you.
> > >>>>>>>>>>
> > >>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I am finding that it is recommending items that the user has already expressed a preference for in their input file. I understand that this should not be happening, and I am not sure if there is a known fix or if I should be looking for a workaround (such as using the entire input as the filterFile).
> > >>>>>>>>>>
> > >>>>>>>>>> I will double-check that there is no error on my side, but so far it does not seem that way.
> > >>>>>>>>>>
> > >>>>>>>>>> Many thanks and my regards from Ireland,
> > >>>>>>>>>> Rafal Lukawiecki
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Rafal Lukawiecki
> > >>>>>>>>>> Strategic Consultant and Director
> > >>>>>>>>>> Project Botticelli Ltd