[ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063159#comment-14063159 ]
Xiangrui Meng edited comment on SPARK-2465 at 7/16/14 5:17 AM:
---------------------------------------------------------------

[~srowen] One interesting trade-off is to use MEMORY_AND_DISK_SER with "spark.rdd.compress=true" instead of MEMORY_AND_DISK for user/product in/out links (and maybe ratings as well). It slows down ALS a little bit but saves a lot of memory. See the screenshots attached. (Note: kryo was not used.) I'm thinking about whether we should switch to MEMORY_AND_DISK_SER by default or provide an option to set the storage level. This is a little off topic, but I'm interested to see how much more memory we need for long ids if we use SER and rdd.compress.

was (Author: mengxr):
[~srowen] One interesting trade-off is to use MEMORY_AND_DISK_SER with "spark.rdd.compress=true" instead of MEMORY_AND_DISK for user/product in/out links (and maybe ratings as well). It slows down ALS a little bit and saves a lot. See the screenshots attached. (Note: kryo was not used.)

> Use long as user / item ID for ALS
> ----------------------------------
>
>                 Key: SPARK-2465
>                 URL: https://issues.apache.org/jira/browse/SPARK-2465
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.1
>            Reporter: Sean Owen
>            Priority: Minor
>         Attachments: ALS using MEMORY_AND_DISK.png, ALS using
> MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user
> and product IDs in the ALS implementation.
> The main reason is that identifiers are not generally numeric at all, and
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits
> means collisions are likely after hundreds of thousands of users and items,
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
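
The collision estimate in the issue description (32-bit hashes colliding after hundreds of thousands of IDs, 64-bit hashes after billions) can be sanity-checked with the standard birthday approximation. The following is a minimal sketch, not part of the issue or of Spark; the function names are hypothetical, and it assumes the hash spreads IDs uniformly over all 2^bits values.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate probability that at least two of n_ids distinct keys
    share a hash value, assuming a uniform hash over 2**bits values
    (birthday approximation: 1 - exp(-n(n-1) / (2N)))."""
    buckets = 2 ** bits
    # -expm1(-x) computes 1 - exp(-x) accurately even when x is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * buckets))


def ids_for_probability(p: float, bits: int) -> float:
    """Approximate number of distinct keys at which the collision
    probability reaches p, under the same uniform-hash assumption."""
    buckets = 2 ** bits
    return math.sqrt(2 * buckets * math.log(1 / (1 - p)))


if __name__ == "__main__":
    for bits in (32, 64):
        n_half = ids_for_probability(0.5, bits)
        print(f"{bits}-bit hash: 50% collision chance near {n_half:,.0f} IDs")
    p = collision_probability(100_000, 32)
    print(f"100,000 IDs hashed to 32 bits: collision chance about {p:.2f}")
```

Under these assumptions, a 32-bit hash reaches a 50% chance of at least one collision at roughly 77,000 IDs, while a 64-bit hash reaches it at about 5 billion, which matches the orders of magnitude quoted in the description.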