I think you can get what you need through the --maxPrefsPerUser flag. Any user with more preferences than that will be down-sampled to a random sample of that size.
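For context, here is a minimal sketch of the sampling behavior that flag implies: leave a user's preference list alone when it is at or under the cap, otherwise keep a uniform random subset of exactly the cap size. This is illustrative plain Java, not the actual Mahout implementation; the class and method names (PrefDownsampling, downsample) are my own.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    /** Illustrative only: uniform down-sampling of one user's preferences. */
    public class PrefDownsampling {

      /** Returns prefs unchanged if within the cap, else a random sample of cap size. */
      static <T> List<T> downsample(List<T> prefs, int maxPrefsPerUser, Random rng) {
        if (prefs.size() <= maxPrefsPerUser) {
          return prefs;
        }
        List<T> shuffled = new ArrayList<>(prefs);
        Collections.shuffle(shuffled, rng);          // uniform random permutation
        return shuffled.subList(0, maxPrefsPerUser); // keep exactly the cap
      }

      public static void main(String[] args) {
        List<Integer> prefs = new ArrayList<>();
        for (int item = 0; item < 100_000; item++) {
          prefs.add(item);
        }
        List<Integer> kept = downsample(prefs, 300, new Random(42));
        System.out.println("kept " + kept.size() + " of " + prefs.size()); // kept 300 of 100000
      }
    }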
On Jun 18, 2013, at 23:27, Ted Dunning <ted.dunn...@gmail.com> wrote:

> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
>
> The issue is that if a single user has 100,000 items in their history, we
> learn nothing more than if we picked 300 of those, while the former would
> result in processing 10 billion cooccurrences and the latter would result
> in roughly 100,000. This factor of roughly 100,000 is so large that it can
> make a big difference in performance.
>
> I had thought that the code had this down-sampling in place.
>
> If not, I can add row-based down-sampling quite easily.
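To make the arithmetic in Ted's example concrete (a back-of-the-envelope check, not Mahout code): a row with n non-zero entries contributes on the order of n^2 cooccurrence pairs, so capping n is a quadratic win.

    /** Back-of-the-envelope check of the cooccurrence counts in the example above. */
    public class CooccurrenceCost {
      public static void main(String[] args) {
        long full = 100_000L * 100_000L;   // ~10 billion pairs from one prolific user
        long sampled = 300L * 300L;        // ~90,000 pairs after sampling down to 300
        System.out.println("full:    " + full);
        System.out.println("sampled: " + sampled);
        System.out.println("ratio:   " + full / sampled); // ~111,111, a factor of ~100,000
      }
    }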