Re: Scaling up spark item similarity on big data sets

2017-05-01 Thread Pat Ferrel
I just ran into the opposite case Sebastian mentions, where a very large percentage of users have only one interaction. They come from social media or search, see only one thing, and leave. Processing this data turned into a huge job but led to virtually no change in the model, since users with very few interactions contribute almost nothing to it.
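
A minimal sketch of such a pre-filter, assuming the interactions live in an RDD of (userId, itemId) string pairs; the function and threshold names are illustrative, not part of Mahout's API:

```scala
import org.apache.spark.rdd.RDD

// Drop users below a minimum interaction count before model building.
// Users with a single interaction add cost to the cooccurrence job but
// contribute almost nothing to the resulting model.
def filterSparseUsers(
    interactions: RDD[(String, String)], // (userId, itemId)
    minInteractions: Int = 2): RDD[(String, String)] = {
  // Count interactions per user, keep users at or above the cut.
  val keepUsers = interactions
    .mapValues(_ => 1)
    .reduceByKey(_ + _)
    .filter { case (_, n) => n >= minInteractions }

  // Inner join discards interactions from the users that were cut.
  interactions.join(keepUsers).mapValues { case (item, _) => item }
}
```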

Re: Scaling up spark item similarity on big data sets

2016-06-23 Thread Ted Dunning
This actually sounds like a very small problem. My guess is that there are bad settings for the interaction and frequency cuts.

Re: Scaling up spark item similarity on big data sets

2016-06-23 Thread Pat Ferrel
In addition to increasing downsampling, there are some other things to note. The original OOM was caused by the use of BiMaps to store your row and column IDs. Their memory footprint grows with the number of IDs, since two hashmaps are kept per ID type. With only 16g of memory you may have very little room left for anything else.
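
A back-of-envelope estimate of why the ID dictionaries alone can exhaust a 16g heap; the per-entry cost and ID counts below are illustrative assumptions, not measurements:

```scala
// Rough JVM cost per hashmap entry (entry object, table slot, boxing)
// plus a short String; real numbers vary by JVM and ID length.
val bytesPerEntry = 100L
val numUsers = 100000000L // 100M users, illustrative
val numItems = 1000000L   // 1M items, illustrative

// BiMaps keep two hashmaps (id -> index, index -> id) per ID type.
val dictionaryBytes = 2L * (numUsers + numItems) * bytesPerEntry
println(f"~${dictionaryBytes / 1e9}%.1f GB for the ID dictionaries alone")
// ~20.2 GB: already past a 16g heap before any computation starts.
```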

Re: Scaling up spark item similarity on big data sets

2016-06-23 Thread Sebastian
Hi, pairwise similarity is a quadratic problem, and it's very easy to reach a problem size that no longer scales, especially with so many items. Our code downsamples the input data to help with this. One thing you can do is decrease the argument maxNumInteractions to a lower value, which downsamples more aggressively; see the sketch below.
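
A sketch of the call with tighter downsampling, assuming an IndexedDataset named interactionsIDS built as in the original post below; the parameter names and the default of 500 match recent Mahout releases but may differ in yours:

```scala
import org.apache.mahout.math.cf.SimilarityAnalysis

// interactionsIDS: IndexedDataset built from the interaction data.
// Lowering maxNumInteractions downsamples heavy users/items harder.
val similarityMatrices = SimilarityAnalysis.cooccurrencesIDSs(
  Array(interactionsIDS),
  maxInterestingItemsPerThing = 50, // top similar items kept per item
  maxNumInteractions = 100          // down from the default of 500
)
```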

Scaling up spark item similarity on big data sets

2016-06-23 Thread jelmer
Hi, I am trying to build a simple recommendation engine using spark item similarity (e.g. with org.apache.mahout.math.cf.SimilarityAnalysis.cooccurrencesIDSs). Things work fine on a comparatively small dataset, but I am having difficulty scaling it up. The input I am using is CSV data containing user and item interaction pairs.
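
For reference, a sketch of the pipeline in question, assuming a two-column user,item CSV; the path is a placeholder, and the IndexedDatasetSpark constructor shown matches recent Mahout releases but may need adjusting for your version:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

implicit val sc = new SparkContext(
  new SparkConf().setAppName("item-similarity").setMaster("local[*]"))

// Parse the CSV into (userId, itemId) pairs; column layout is assumed.
val interactions = sc.textFile("hdfs:///path/to/interactions.csv")
  .map(_.split(","))
  .map(cols => (cols(0), cols(1)))

// Build the BiMap-backed IndexedDataset (the dictionaries discussed
// above) and run the cooccurrence analysis with default downsampling.
val interactionsIDS = IndexedDatasetSpark(interactions)(sc)
val similarityMatrices =
  SimilarityAnalysis.cooccurrencesIDSs(Array(interactionsIDS))
```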