Re: Mahout performance issues

Daniel Zohar Sun, 04 Dec 2011 05:05:05 -0800

I assume the parameter does not affect the possibleItemIDs because of the
following line:


    max = (int) Math.max(defaultMaxPrefsPerItemConsidered,
        userItemCountMultiplier *
            Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems())));
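
To make the effect concrete, here is a small sketch that evaluates the same expression in isolation. The class name, the int parameter types and the model sizes are assumptions for illustration only; they are not taken from the Mahout source. It suggests that the first constructor argument acts as a floor on max rather than a ceiling.

```java
public class EffectiveMaxSketch {

    // Same expression as the line quoted above, with the data model sizes
    // passed in directly. Parameter types are assumed, not confirmed.
    static int effectiveMax(int defaultMaxPrefsPerItemConsidered,
                            int userItemCountMultiplier,
                            int numUsers,
                            int numItems) {
        return (int) Math.max(defaultMaxPrefsPerItemConsidered,
            userItemCountMultiplier * Math.log(Math.max(numUsers, numItems)));
    }

    public static void main(String[] args) {
        // Hypothetical model sizes; the thread does not state the real ones.
        System.out.println(effectiveMax(10, 2, 1_000_000, 50_000));
        System.out.println(effectiveMax(10, 2, 100, 50));
    }
}
```

With 10,2 and a million users the logarithmic term dominates and max comes out as 27; only for a very small model (100 users here) does the configured 10 win.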

On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <disso...@gmail.com> wrote:

> Sean, your impl. is indeed better than mine, but for some reason when I ran
> it for a user with a lot of interactions, I got 2023 possibleItemIDs
> (although I used 10,2 in the constructor).
>
> Sebastian, I will also try to experiment with your patch. I would just
> like to add that in my opinion, as long as 'killing items' has to be done
> manually, it is not scalable by definition. I personally would always
> prefer to avoid this kind of solution. Also, in my case, even the most
> popular item has been interacted with by only 3% of the users, so I suppose
> that doesn't exactly apply to my case either.
>
>
> On Sun, Dec 4, 2011 at 2:30 PM, Sebastian Schelter <s...@apache.org> wrote:
>
>> Hi Daniel,
>>
>> My view is this: I think you can pretty safely down-sample power users
>> like it is done in https://issues.apache.org/jira/browse/MAHOUT-914
>> I did some experiments on the movielens1M dataset that showed that you
>> get a negligible error given you look at enough interactions per user:
>>
>> https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png
>>
>> I could also verify this on the movielens10M dataset. I think this kind
>> of sampling works because the distribution of interactions with items in
>> the power-users and in the whole dataset is very similar. Therefore you
>> don't really learn anything new from the 'power-users'. The
>> 'power-users' might also be crawlers or people sharing accounts in
>> practice.
>>
>> However, I am not sure what happens when you also sample the number of
>> items you look at. If I had to decide, I'd rather follow Ted's advice
>> and kill super-popular items, as they are not helpful per-se.
>>
>> But if the additional item sampling helps in your use case, I don't
>> oppose including it in Mahout. I think it's good to have a variety of
>> candidate item strategies. You should however do some experimenting to
>> see how much the sampling affects quality. An A/B test in a real
>> application would be the best thing to do.
>>
>> --sebastian
>>
>>
>>
>> On 04.12.2011 13:12, Daniel Zohar wrote:
>> > Actually I was referring to Sebastian's. I haven't seen that you committed
>> > anything to SamplingCandidateItemsStrategy. Can you tell me in which class
>> > the change appears?
>> >
>> > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <sro...@gmail.com> wrote:
>> >
>> >> Are you referring to my patch, MAHOUT-910?
>> >>
>> >> It does let you specify a hard cap, really -- if you place a limit of X,
>> >> then at most X^2 item-item associations come out. Before you could not
>> >> bound the result, really, since one user could rate a lot of items.
>> >>
>> >> I think it's slightly more efficient and unbiased as users with few ratings
>> >> will not have their ratings sampled out, and all users are equally likely
>> >> to be sampled out.
>> >>
>> >> What do you think?
>> >> Yes you could easily add a secondary cap though as a final filter.
>> >>
>> >> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <disso...@gmail.com> wrote:
>> >>
>> >>> Combining the latest commits with my
>> >>> optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
>> >>> I achieved satisfying results. All the queries were under one second.
>> >>>
>> >>> Sebastian, I took a look at your patch and I think it's more practical than
>> >>> the current SamplingCandidateItemsStrategy, however it still doesn't put a
>> >>> strict cap on the number of possible item IDs like my implementation does.
>> >>> Perhaps there is room for both implementations?
>> >>>
>> >>>
>> >>>
>> >>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <s...@apache.org> wrote:
>> >>>
>> >>>> I created a jira to supply a non-distributed counterpart of the
>> >>>> sampling that is done in the distributed item similarity computation:
>> >>>>
>> >>>> https://issues.apache.org/jira/browse/MAHOUT-914
>> >>>>
>> >>>>
>> >>>> 2011/12/2 Sean Owen <sro...@gmail.com>:
>> >>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar changes in
>> >>>>> other files. Ideally, just svn update to get all recent changes.
>> >>>>>
>> >>>>> On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <disso...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Sean, can you tell me which files you have committed the changes to? Thanks
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>
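
For readers following the hard-cap discussion in the quoted thread, here is a rough sketch of one way to read it: sample at most X of the user's items, take at most X co-occurring items for each (so at most X^2 candidates), then apply a secondary cap as a final filter. Everything in it (the class, the `related` map, the method names) is hypothetical and only illustrates the idea; it is not the code from MAHOUT-910, MAHOUT-914 or the pastebin.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class CandidateCapSketch {

    // Returns at most `limit` elements of `source`, picked uniformly at random.
    static <T> List<T> sample(Collection<T> source, int limit, Random random) {
        List<T> copy = new ArrayList<>(source);
        if (copy.size() <= limit) {
            return copy;
        }
        Collections.shuffle(copy, random);
        return new ArrayList<>(copy.subList(0, limit));
    }

    // userItems: items the target user interacted with.
    // related:   hypothetical lookup from an item to items co-occurring with it.
    // x:         per-step sampling limit, giving at most x * x candidates.
    // hardCap:   secondary cap applied as a final filter.
    static Set<Long> candidates(Collection<Long> userItems,
                                Map<Long, List<Long>> related,
                                int x, int hardCap, Random random) {
        Set<Long> result = new LinkedHashSet<>();
        for (Long item : sample(userItems, x, random)) {
            List<Long> neighbours = related.getOrDefault(item, Collections.<Long>emptyList());
            result.addAll(sample(neighbours, x, random));
        }
        return new LinkedHashSet<>(sample(result, hardCap, random));
    }

    public static void main(String[] args) {
        // Synthetic data: 50 user items, each with 100 distinct neighbours.
        List<Long> userItems = new ArrayList<>();
        Map<Long, List<Long>> related = new HashMap<>();
        for (long i = 0; i < 50; i++) {
            userItems.add(i);
            List<Long> neighbours = new ArrayList<>();
            for (long j = 0; j < 100; j++) {
                neighbours.add(1000 + i * 100 + j);
            }
            related.put(i, neighbours);
        }
        System.out.println(candidates(userItems, related, 5, 20, new Random(42)).size());
    }
}
```

The per-step limit bounds the work done per request, while the final filter gives the strict upper bound on the number of possible item IDs regardless of how many interactions the user has.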
