[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224124#comment-16224124
 ] 

Matteo Cossu commented on SPARK-2465:
-------------------------------------

For example, with this limitation it is not possible to use 
_monotonically_increasing_id_ to generate the IDs, since it returns longs. 
Instead, one has to fall back to the RDD API and use zipWithIndex.
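
To illustrate why the generated IDs do not fit, here is a minimal plain-Python sketch. It does not call Spark; it only imitates the layout that the Spark documentation describes for monotonically_increasing_id (partition index in the upper 31 bits, record number within the partition in the lower 33 bits). The helper names and the partition sizes are made up for the example.

```python
# Plain-Python imitation of two ID schemes; no Spark required.
# Assumption: the layout follows the Spark docs for monotonically_increasing_id
# (partition index in the upper 31 bits, record number in the lower 33 bits).
# All function names here are hypothetical helpers, not Spark APIs.

INT_MAX = 2**31 - 1  # largest value a 32-bit signed int ID can hold


def monotonic_ids(partition_sizes):
    """IDs shaped like monotonically_increasing_id output."""
    return [
        (partition_index << 33) | row
        for partition_index, size in enumerate(partition_sizes)
        for row in range(size)
    ]


def zip_with_index_ids(partition_sizes):
    """Consecutive 0-based indices, shaped like RDD.zipWithIndex output."""
    return list(range(sum(partition_sizes)))


def fits_in_int(ids):
    """True if every ID could be stored as a 32-bit signed int."""
    return all(0 <= i <= INT_MAX for i in ids)


if __name__ == "__main__":
    sizes = [3, 2]  # two partitions with 3 and 2 rows
    print(monotonic_ids(sizes), fits_in_int(monotonic_ids(sizes)))
    print(zip_with_index_ids(sizes), fits_in_int(zip_with_index_ids(sizes)))
```

Even with only five rows, the first row of the second partition already gets an ID of 2^33, which is outside the int range. The consecutive indices stay small as long as there are fewer than 2^31 rows, although zipWithIndex itself also returns longs, so they would still have to be cast to int.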

> Use long as user / item ID for ALS
> ----------------------------------
>
>                 Key: SPARK-2465
>                 URL: https://issues.apache.org/jira/browse/SPARK-2465
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.1
>            Reporter: Sean Owen
>            Priority: Minor
>         Attachments: ALS using MEMORY_AND_DISK.png, ALS using 
> MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason is that identifiers are generally not numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.
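
The collision estimates in the description can be checked with the usual birthday approximation. The sketch below assumes an ideal, uniformly distributed hash, which real hash functions only approximate; the function name is made up for the example.

```python
import math

# Birthday approximation for hash collisions.
# Assumption: IDs are hashed uniformly at random into 2**bits buckets.
# collision_probability is a hypothetical helper name.


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate probability that at least two of n_ids share a hash."""
    space = 2.0 ** bits
    # P ~= 1 - exp(-n(n-1) / (2N)); expm1 keeps precision for tiny values.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (100_000, 500_000):
        print(f"32-bit, {n:>13,} ids: {collision_probability(n, 32):.3f}")
    for n in (1_000_000_000, 5_000_000_000):
        print(f"64-bit, {n:>13,} ids: {collision_probability(n, 64):.3f}")
```

Under that assumption, a collision among 32-bit hashes is more likely than not at around a hundred thousand IDs, while 64-bit hashes reach comparable odds only at a few billion IDs.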



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

