Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-49149396

Let me close this PR for now; I will fork or wrap as necessary. Keep it in mind, and maybe it can be revisited in a 2.x release. (Matei, I ran into more problems with the `Rating` class retrofit anyway.)

Yes, storage is the downside. Your comments on JIRA about the effects of serialization in compressing away the difference are promising. I completely agree with using `Float` for ratings and even for feature vectors.

Yes, I understand why random projections are helpful. But this doesn't help accuracy; at best it only trivially hurts it in return for some performance gain. If I have just one rating, it doesn't make my recommendations better to arbitrarily add your ratings to mine. Sure, the input is denser, and maybe there's less overfitting, but it's fitting the wrong input for both of us.

A collision here and there is probably acceptable. One in a million customers? OK. 1%? Maybe a problem. I agree you'd have to quantify this to decide. As an end user of MLlib bringing even millions of things to my model, I'd have to make that call, and if collisions were a problem, I'd have to maintain a lookup table to work around them. It seemed simplest to moot the problem with a much bigger key space and engineer around the storage issue. A bit more memory is cheap; accuracy and engineer time are expensive.
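The "one in a million vs. 1%" question above can be quantified with a standard birthday-problem estimate: hashing n distinct IDs uniformly into a key space of 2^bits values, the expected fraction of IDs that land on an already-used hash follows from the expected number of distinct hash values. This is a back-of-the-envelope sketch (illustrative Python, not code from the PR; the function name is made up here) comparing a 32-bit key space against a smaller and a larger one:

```python
import math

def expected_collision_fraction(n: int, bits: int) -> float:
    """Expected fraction of n IDs that collide when hashed
    uniformly into a key space of size m = 2**bits.

    Uses E[distinct values] = m * (1 - (1 - 1/m)**n),
    computed via log1p/expm1 for numerical stability."""
    m = 2 ** bits
    # -m * expm1(n * log1p(-1/m)) == m * (1 - (1 - 1/m)**n)
    expected_distinct = -m * math.expm1(n * math.log1p(-1.0 / m))
    return 1.0 - expected_distinct / n

n = 1_000_000  # a million customers, as in the comment
for bits in (24, 32, 64):
    print(bits, expected_collision_fraction(n, bits))
```

With a million IDs, a 32-bit hash space gives roughly a 0.01% collision fraction, a 24-bit space pushes it near the problematic 1% range, and a 64-bit space makes collisions negligible, which is the trade-off behind preferring a bigger key space and paying the storage cost.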