That might slow down the job enormously for certain nasty inputs.

The more I think about this, the more convinced I am that there
should be a post-processing pass to enforce things like not recommending
input items.  The recommendation algorithm itself should not be distorted
to do this if it is unnatural.  Forcing a user to turn off sampling is a
great example: sampling and input-item filtering should be two separate
controls.

I think that the original point is also correct, however.  The user should
not be forced to implement this very common step.  As such I think that the
recommender code should still support doing this, but it really ought to be
as an output filter.
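
To make the idea concrete, here is a rough sketch of what such an output
filter could look like.  It is written in Python for brevity and is not
Mahout code; the names load_seen_items and filter_recommendations and the
in-memory data layout are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical layout: a recommendation is an (item_id, score) pair.
Recommendation = Tuple[str, float]


def load_seen_items(prefs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (user, item) input preferences into a per-user set of items."""
    seen: Dict[str, Set[str]] = {}
    for user, item in prefs:
        seen.setdefault(user, set()).add(item)
    return seen


def filter_recommendations(
    recs: Dict[str, List[Recommendation]],
    seen: Dict[str, Set[str]],
    top_n: int,
) -> Dict[str, List[Recommendation]]:
    """Drop items the user already expressed a preference for, keep top_n."""
    filtered: Dict[str, List[Recommendation]] = {}
    for user, items in recs.items():
        already = seen.get(user, set())
        kept = [(item, score) for item, score in items if item not in already]
        # Highest score first; the filter runs after scoring, so sampling
        # in the recommender itself is left untouched.
        kept.sort(key=lambda rec: rec[1], reverse=True)
        filtered[user] = kept[:top_n]
    return filtered
```

Because the filter runs on the output, the recommender would have to emit
more than top_n candidates per user so that enough remain once the input
items are removed.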



On Wed, Aug 7, 2013 at 9:19 AM, Sebastian Schelter <s...@apache.org> wrote:

> if you also set --maxPrefsPerUserInItemSimilarity to a number higher than
> the max preferences per user, no sampling should occur. This might slow
> down the job however.
>
> 2013/8/7 Rafal Lukawiecki <ra...@projectbotticelli.com>
>
> > Is there a set of parameters which I could pass to RecommenderJob to
> avoid
> > that random sampling, in order to create a test case for the issue I have
> > experienced? Would setting --maxSimilaritiesPerItem and/or
> > --maxPrefsPerUserInItemSimilarity help? Many thanks.
> >
> > On 7 Aug 2013, at 16:12, Sebastian Schelter <ssc.o...@googlemail.com>
> >  wrote:
> >
> > It could affect the results even in this case, as we also sample the
> > preferences when computing similar items.
> >
> > On 07.08.2013 17:07, Rafal Lukawiecki wrote:
> > > Thank you, Sebastian. Would the random sampling affect the results of
> > RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the
> > actual, maximum number of preferences expressed by every user.
> > >
> > > Rafal
> > >
> > > On 7 Aug 2013, at 15:48, Sebastian Schelter <ssc.o...@googlemail.com>
> > > wrote:
> > >
> > > The code in trunk allows you to specify a randomSeed; the older
> > > versions don't, unfortunately.
> > >
> > > On 07.08.2013 16:35, Rafal Lukawiecki wrote:
> > >> Hi Sebastian,
> > >>
> > >> The quantity of returned "duplicates" is much too large to be caused
> > just by sampling's randomness. I wonder if this could be related to
> > something that is platform-specific, as in Windows vs. *nix
> representation
> > of input files, data types etc.
> > >>
> > >> For argument's sake, is it possible to fix the seed of the random
> > aspect of the sampling so I could feed the same input through two
> platforms
> > and compare the results?
> > >>
> > >> Rafal
> > >>
> > >> On 7 Aug 2013, at 15:20, Sebastian Schelter <ssc.o...@googlemail.com>
> > >> wrote:
> > >>
> > >> Hi Rafal,
> > >>
> > >> this sounds really strange, the bug should not have anything to do
> with
> > >> the version of Hadoop that you are running. You could sometimes not
> see
> > >> it due to the random sampling of the preferences.
> > >>
> > >> --sebastian
> > >>
> > >> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
> > >>> Sebastian,
> > >>>
> > >>> I've been doing a little more digging regarding the issue of
> > preferences being calculated for already preferred items. I re-ran the
> jobs
> > using the same data and the same parameters on a different installation
> of
> > Hadoop, and the problem seems to have gone away. For now it looks like
> the
> > issue arises when I run it under Mahout 0.7 and 0.8 using HDP
> (Hortonworks
> > Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does
> not
> > show up, yet in my tests, under Hadoop 1.2.1 compiled for OS X. I will
> work
> > a little more to ensure my results, but if they stood up, should I still
> > report it as a Mahout issue?
> > >>>
> > >>> Rafal
> > >>> --
> > >>> Rafal Lukawiecki
> > >>> Strategic Consultant and Director
> > >>> Project Botticelli Ltd
> > >>>
> > >>> On 1 Aug 2013, at 17:31, Sebastian Schelter <s...@apache.org> wrote:
> > >>>
> > >>> Setting it to the maximum number should be enough. Would be great if
> > you
> > >>> can share your dataset and tests.
> > >>>
> > >>> 2013/8/1 Rafal Lukawiecki <ra...@projectbotticelli.com>
> > >>>
> > >>>> Should I have set that parameter to a value much much larger than
> the
> > >>>> maximum number of actually expressed preferences by a user?
> > >>>>
> > >>>> I'm working on an anonymised data set. If it works as an error test
> > case,
> > >>>> I'd be happy to share it for your re-test. I am still hoping it is
> my
> > >>>> error, not Mahout's.
> > >>>>
> > >>>> Rafal
> > >>>> --
> > >>>> Rafal Lukawiecki
> > >>>> Pardon brevity, mobile device.
> > >>>>
> > >>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <s...@apache.org>
> wrote:
> > >>>>
> > >>>>> Ok, please file a bug report detailing what you've tested and what
> > >>>> results
> > >>>>> you got.
> > >>>>>
> > >>>>> Just to clarify, setting maxPrefsPerUser to a high number still
> does
> > not
> > >>>>> help? That surprises me.
> > >>>>>
> > >>>>>
> > >>>>> 2013/8/1 Rafal Lukawiecki <ra...@projectbotticelli.com>
> > >>>>>
> > >>>>>> Hi Sebastian,
> > >>>>>>
> > >>>>>> I've rechecked the results, and, I'm afraid that the issue has not
> > gone
> > >>>>>> away, contrary to my yesterday's enthusiastic response. Using 0.8
> I
> > have
> > >>>>>> retested with and without --maxPrefsPerUser 9000 parameter (no
> user
> > has
> > >>>>>> more than 5000 prefs). I have also supplied the prefs file,
> without
> > the
> > >>>>>> preference value, that is as: user,item (one per line) as a
> > >>>> --filterFile,
> > >>>>>> with and without the --maxPrefsPerUser, and I am afraid we are also
> > >>>> seeing
> > >>>>>> recommendations for items the user has expressed a prior
> preference
> > for.
> > >>>>>>
> > >>>>>> I suppose I need to file a bug report.
> > >>>>>>
> > >>>>>> Rafal
> > >>>>>> --
> > >>>>>> Rafal Lukawiecki
> > >>>>>> Pardon my brevity, sent from a telephone.
> > >>>>>>
> > >>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <
> > >>>> ra...@projectbotticelli.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Dear Sebastian,
> > >>>>>>>
> > >>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the
> > issue
> > >>>> in
> > >>>>>> our case—it seems that the most preferences a user had was just
> > about
> > >>>> 5000,
> > >>>>>> so I doubled it just-in-case, but when I operationalise this
> model,
> > I
> > >>>> will
> > >>>>>> make sure to calculate the actual max number of preferences and
> set
> > the
> > >>>>>> parameter accordingly. I will double-check the resultset to make
> > sure
> > >>>> the
> > >>>>>> issue is really gone, as I have only checked the few cases where
> we
> > have
> > >>>>>> spotted a recommendation of a previously preferred item.
> > >>>>>>>
> > >>>>>>> Would you like me to file a bug, and would you like me to test it
> > on
> > >>>> 0.8
> > >>>>>> or another version? I am using 0.7.
> > >>>>>>>
> > >>>>>>> Thanks for your kind support.
> > >>>>>>> Rafal
> > >>>>>>> --
> > >>>>>>> Rafal Lukawiecki
> > >>>>>>> Strategic Consultant and Director
> > >>>>>>> Project Botticelli Ltd
> > >>>>>>>
> > >>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <
> > ssc.o...@googlemail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>> Hi Rafal,
> > >>>>>>>
> > >>>>>>> can you try to set the option --maxPrefsPerUser to the maximum
> > number
> > >>>> of
> > >>>>>>> interactions per user and see if you still get the error?
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Sebastian
> > >>>>>>>
> > >>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
> > >>>>>>>> Thank you Sebastian. The data set is not that large, as we are
> > running
> > >>>>>> tests on a subset. It is about 24k users, 40k items, the
> preference
> > file
> > >>>>>> has 65k preferences as triples. This was using Similarity
> > Cooccurrence.
> > >>>>>>>>
> > >>>>>>>> I can see if I could anonymise the data set to share if that
> > would be
> > >>>>>> helpful.
> > >>>>>>>>
> > >>>>>>>> Thanks for your kind help.
> > >>>>>>>>
> > >>>>>>>> Rafal
> > >>>>>>>> --
> > >>>>>>>> Rafal Lukawiecki
> > >>>>>>>> Pardon my brevity, sent from a telephone.
> > >>>>>>>>
> > >>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <s...@apache.org>
> > >>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Rafal,
> > >>>>>>>>>
> > >>>>>>>>> can you issue a ticket for this problem at
> > >>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have
> > unit-tests
> > >>>> that
> > >>>>>>>>> check whether this happens and currently they work fine. I can
> > only
> > >>>>>> imagine
> > >>>>>>>>> that the problem occurs in larger datasets where we sample the
> > data
> > >>>> in
> > >>>>>> some
> > >>>>>>>>> places. Can you describe a scenario/dataset where this happens?
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Sebastian
> > >>>>>>>>>
> > >>>>>>>>> 2013/7/30 Rafal Lukawiecki <ra...@projectbotticelli.com>
> > >>>>>>>>>
> > >>>>>>>>>> I'm new here, just registered. Many thanks to everyone for
> > working
> > >>>> on
> > >>>>>> an
> > >>>>>>>>>> amazing piece of software, thank you for building Mahout and
> for
> > >>>> your
> > >>>>>>>>>> support. My apologies if this is not the right place to ask
> the
> > >>>>>> question—I
> > >>>>>>>>>> have searched for the issue, and I can see this problem has
> been
> > >>>>>> reported
> > >>>>>>>>>> here:
> > >>>>>>
> > >>>>
> >
> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
> > >>>>>>>>>>
> > >>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have
> not
> > >>>>>> found a
> > >>>>>>>>>> way, yet, to get an answer from them, without asking you.
> > >>>>>>>>>>
> > >>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout
> > 0.7,
> > >>>>>> and I
> > >>>>>>>>>> am finding that it is recommending items that the user has
> > already
> > >>>>>>>>>> expressed a preference for in their input file. I understand
> > that
> > >>>> this
> > >>>>>>>>>> should not be happening, and I am not sure if there is a known
> > fix or
> > >>>>>> if I
> > >>>>>>>>>> should be looking for a workaround (such as using the entire
> > input
> > >>>> as
> > >>>>>> the
> > >>>>>>>>>> filterFile).
> > >>>>>>>>>>
> > >>>>>>>>>> I will double-check that there is no error on my side, but so
> > far it
> > >>>>>> does
> > >>>>>>>>>> not seem that way.
> > >>>>>>>>>>
> > >>>>>>>>>> Many thanks and my regards from Ireland,
> > >>>>>>>>>> Rafal Lukawiecki
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>>
> > >>>>>>>>>> Rafal Lukawiecki
> > >>>>>>>>>>
> > >>>>>>>>>> Strategic Consultant and Director
> > >>>>>>>>>>
> > >>>>>>>>>> Project Botticelli Ltd
> > >>>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
> >
> >
>
