Try LoglikelihoodSimilarity.
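
For reference, here is a minimal sketch of how a log-likelihood ratio
similarity can be computed from boolean data alone. It uses only the
co-occurrence counts for two groups, which is why it still works when every
"rating" is 1. This is not Mahout's code: the class name LlrSketch and its
method names are hypothetical, and the final mapping of the ratio into
[0, 1) is an assumption about one reasonable normalization.

```java
// Hypothetical sketch: log-likelihood ratio (LLR) similarity over boolean data.
// Class and method names are illustrative and are not Mahout's API.
final class LlrSketch {

    // x * ln(x), with the convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // N * H for a set of counts, where N is their sum and H is the
    // Shannon entropy (in nats) of the corresponding proportions.
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    // LLR for a 2x2 contingency table:
    //   k11 = users who chose both A and B, k12 = A only,
    //   k21 = B only,                       k22 = neither.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumed normalization: squash the LLR into [0, 1).
    static double similarity(long both, long countA, long countB, long numUsers) {
        double llr = logLikelihoodRatio(
                both,
                countA - both,
                countB - both,
                numUsers - countA - countB + both);
        return 1.0 - 1.0 / (1.0 + llr);
    }

    public static void main(String[] args) {
        // 20 users; groups A and B are chosen by exactly the same 10 users.
        System.out.println(similarity(10, 10, 10, 20));
        // 4 users; A and B overlap exactly as often as chance predicts.
        System.out.println(similarity(1, 2, 2, 4));
    }
}
```

The first call in main should print a value close to 1 and the second a
value close to 0, since the score rewards overlap beyond what chance alone
would produce rather than raw co-occurrence counts.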

Where do you run into memory issues? Did you change worker heap
settings from the default?

On Sat, Jun 23, 2012 at 10:24 PM, Something Something
<mailinglist...@gmail.com> wrote:
> Thank you so much Sean.  It was great to get confirmation from you
> regarding the choice of algorithm.
>
> As suggested, I used the following params:
>
>            similarityJob.run(new String[]{
>                "--tempDir", tmpDir.getAbsolutePath(),
>                "--similarityClassname", CooccurrenceCountSimilarity.class.getName(),
>                "--booleanData", String.valueOf(Boolean.TRUE)});
>
> and got output!!!!  Hooray.
>
> Question:  Is CooccurrenceCountSimilarity best in this case?
>
>
> Anyway, now I am going to try on our production cluster with billions of
> lines.  Last time I tried, I ran into OutOfMemoryErrors.  Any
> suggestions regarding memory settings?
>
> Thanks once again for your help.
>
>
> On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <sro...@gmail.com> wrote:
>
>> Using 1 is just fine for the reasons you give. You would be surprised how
>> OK it is to use this even for dislikes. In fact just omit the third field
>> in your CSV.
>>
>> However, you need to set the boolean data flag and choose a similarity
>> metric that is defined over such data. Pearson and cosine are not, for
>> example, since every value is 1. This is why there is no output.
>> On Jun 23, 2012 1:33 AM, "Something Something" <mailinglist...@gmail.com>
>> wrote:
>>
>> > I tested my setup of ItemSimilarityJob using the MovieLens dataset & got
>> > the expected results.  It looks like my setup is good.
>> >
>> > Here's what I have:
>> >
>> > I have data coming in the following format: UserId, GroupId, Frequency
>> > (how many times the user chose the group), Max timestamp (the last time
>> > the user chose the group).
>> >
>> > Based on this dataset we need to figure out which groups look alike. I
>> > decided to use "item based collaborative filtering" but I have 3
>> > concerns:
>> >
>> > 1)  We don't have any knowledge of "Dislikes"; we only know which groups
>> > users "Like".
>> > 2)  We don't really have ratings. In other words, users don't rate the
>> > group. Either they choose OR they don't.
>> > 3)  Frequency doesn't really imply interest level.
>> >
>> >
>> > I decided to try 'ItemSimilarityJob' by using a CSV file in the following
>> > format:
>> >
>> > UserId, GroupId, "1"
>> >
>> > In other words, the rating value is always 1.  There are NO rows with
>> > value "0".  This is producing NO OUTPUT, but the job finishes
>> > successfully.
>> >
>> > Is this the right way to solve the problem?  Is there some other
>> > algorithm that I should be using?  Thanks for the help.
>> >
>>
