GitHub user srowen commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-49013972

Yes, you could also tell callers to track their own user-ID mapping and maintain it consistently everywhere, but callers then have to share that state somehow. Hashing is easier, and 64 bits makes it work for practical purposes. A caller has to do something like one of these to deal with real-world identifiers, because an `Int` ID API by itself doesn't quite work.

This is an instance of a meta-concern I have: whether an API that (from my perspective) is going to be problematic at scale is already unchangeable before battle-testing. (I actually thought all of MLlib was de facto `@Experimental`?)

That said, you can layer other APIs on top to fix it, or use `@deprecated` in cases like this to keep existing methods while adding new signatures. I think that would be the simplest solution to this particular concern.

The question of serialized size is still open, and it is worth weighing in on.
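To make the hashing point concrete, here is a minimal sketch of mapping arbitrary string identifiers to signed 64-bit values, along with a birthday-bound estimate of collision risk at 32 versus 64 bits. It is illustrative only: the helper names `hash_id_64` and `collision_probability` are hypothetical, SHA-256 is an assumed choice of hash rather than anything proposed in the pull request, and the estimate assumes uniformly distributed hash values.

```python
import hashlib
import math


def hash_id_64(raw_id: str) -> int:
    """Map an arbitrary string ID to a signed 64-bit integer.

    Hypothetical helper: SHA-256 is an assumed choice, not the PR's method.
    """
    digest = hashlib.sha256(raw_id.encode("utf-8")).digest()
    # Keep the first 8 bytes and read them as a signed big-endian integer,
    # so the result fits in a JVM Long.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday-bound approximation of the chance that at least two of
    n_ids uniformly distributed hashes collide in a space of 2**bits values."""
    space = 2.0 ** bits
    # 1 - exp(-n(n-1) / (2 * space)), computed with expm1 for accuracy
    # when the probability is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


if __name__ == "__main__":
    print(hash_id_64("user-1"))
    print(collision_probability(1_000_000, 32))
    print(collision_probability(1_000_000, 64))
```

Under these assumptions, one million distinct IDs are all but certain to collide somewhere in a 32-bit space, while in a 64-bit space the chance of any collision is roughly 3 in 100 million. Because the hash is deterministic, every caller derives the same ID independently and no shared mapping state is needed.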