Try LoglikelihoodSimilarity. Where do you run into memory issues? Did you change worker heap settings from the default?
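
For context on what a log-likelihood measure computes over boolean data, here is a minimal standalone sketch of the log-likelihood ratio for a 2x2 co-occurrence table. It is not Mahout's source: the class name `LlrSketch` and its method names are hypothetical, and the final mapping of the ratio to a score in [0, 1) via 1 - 1/(1 + LLR) is an assumption about how such a value might be normalized. Here k11 is the number of users who chose both groups, k12 and k21 the users who chose only one of them, and k22 the users who chose neither.

```java
public final class LlrSketch {
    private LlrSketch() {}

    // x * ln(x), using the convention 0 * ln(0) = 0.
    public static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum(k ln k).
    public static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio (G^2 statistic) of a 2x2 contingency table.
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumed normalization: maps LLR in [0, inf) to a score in [0, 1).
    public static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }
}
```

Unlike a raw co-occurrence count, this score stays near zero when two groups co-occur only as often as chance would predict, so very popular groups do not dominate the results simply because of their size.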
On Sat, Jun 23, 2012 at 10:24 PM, Something Something <mailinglist...@gmail.com> wrote:

> Thank you so much Sean. It was great to get confirmation from you
> regarding the choice of algorithm.
>
> As suggested, I used the following params:
>
> similarityJob.run(new String[]{"--tempDir",
>     tmpDir.getAbsolutePath(), "--similarityClassname",
>     CooccurrenceCountSimilarity.class.getName(), "--booleanData",
>     String.valueOf(Boolean.TRUE)});
>
> and got output!!!! Hooray.
>
> Question: Is CooccurrenceCountSimilarity best in this case?
>
> Anyway, now I am going to try on our production cluster with billions of
> lines. Last time I tried, I ran into OutOfMemoryExceptions. Any
> suggestions regarding memory settings?
>
> Thanks once again for your help.
>
> On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <sro...@gmail.com> wrote:
>
>> Using 1 is just fine for the reasons you give. You would be surprised how
>> OK it is to use this even for dislikes. In fact, just omit the third field
>> in your CSV.
>>
>> However, you need to set the boolean data flag and choose a similarity
>> metric that is defined over such data. Pearson / cosine is not, for
>> example, since every value is 1. This is why there is no output.
>>
>> On Jun 23, 2012 1:33 AM, "Something Something" <mailinglist...@gmail.com>
>> wrote:
>>
>> > I tested my setup of ItemSimilarityJob using the MovieLens dataset & got
>> > the expected results. It looks like my setup is good.
>> >
>> > Here's what I have:
>> >
>> > I have data coming in the following format: UserId, GroupId, Frequency
>> > (how many times the user chose the group), Max timestamp (the last time
>> > the user chose the group).
>> >
>> > Based on this dataset we need to figure out which groups look alike. I
>> > decided to use "item based collaborative filtering" but I have 3
>> > concerns:
>> >
>> > 1) We don't have any knowledge of "Dislikes"; we only know which groups
>> > users "Like".
>> > 2) We don't really have ratings. In other words, users don't rate the
>> > group. Either they choose OR they don't.
>> > 3) Frequency doesn't really imply interest level.
>> >
>> > I decided to try 'ItemSimilarityJob' by using a CSV file in the following
>> > format:
>> >
>> > UserId, GroupId, "1"
>> >
>> > In other words, the rating value is always 1. There are NO rows with
>> > value "0". This is producing NO OUTPUT, but the job finishes successfully.
>> >
>> > Is this the right way to solve the problem? Is there some other
>> > algorithm that I should be using? Thanks for the help.