I was reading the RowSimilarityJob code and it doesn't appear to do
down-sampling on the original data to limit the performance impact of
perversely prolific users.

The issue is that if a single user has 100,000 items in their history, we
learn essentially nothing more than if we picked 300 of those items, yet
the former results in processing about 10 billion cooccurrences while the
latter results in about 100,000.  A factor of 100,000 is large enough to
make a big difference in performance.

I had thought that the code had this down-sampling in place.

If not, I can add row-based down-sampling quite easily.
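
To be concrete, the idea is just to cap the number of interactions kept
per row before counting cooccurrences.  Here is a minimal sketch in plain
Java (the class name, method name and the cap value are illustrative, not
part of Mahout's API), using reservoir sampling so it works in a single
streaming pass:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  public class RowDownSampler {

    // Keep at most maxPerUser item ids from one user's history, chosen
    // uniformly at random in a single pass (reservoir sampling).
    public static List<Long> downSample(Iterable<Long> itemIds,
                                        int maxPerUser,
                                        Random rng) {
      List<Long> reservoir = new ArrayList<Long>(maxPerUser);
      int seen = 0;
      for (long itemId : itemIds) {
        seen++;
        if (reservoir.size() < maxPerUser) {
          // fill the reservoir until we hit the cap
          reservoir.add(itemId);
        } else {
          // afterwards, replace a random slot with probability maxPerUser / seen
          int slot = rng.nextInt(seen);
          if (slot < maxPerUser) {
            reservoir.set(slot, itemId);
          }
        }
      }
      return reservoir;
    }
  }

With the cap set to 300, a user with 100,000 items contributes at most
about 300^2 = 90,000 cooccurrences instead of 10 billion, and users below
the cap pass through unchanged.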
