[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2017-10-29 Thread Matteo Cossu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224124#comment-16224124
 ] 

Matteo Cossu commented on SPARK-2465:
-

For example, with this limitation it is not possible to use 
_monotonically_increasing_id_ to generate the IDs, since it produces longs. 
Instead, one has to fall back to the RDD API and use zipWithIndex.
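
To illustrate why those generated IDs do not fit in an int, here is a minimal 
sketch in plain Python (no Spark needed). It assumes the layout described in 
the Spark documentation for monotonically_increasing_id: partition index in the 
upper 31 bits, record number within the partition in the lower 33 bits. The 
helper names monotonic_id and fits_in_int are hypothetical, for illustration 
only.

```python
# Sketch of the 64-bit ID layout documented for monotonically_increasing_id:
# partition index in the upper 31 bits, per-partition record number in the
# lower 33 bits. Helper names are illustrative, not Spark APIs.

INT_MAX = 2**31 - 1  # same value as Scala's Int.MaxValue


def monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Re-create the ID for a given partition and row position."""
    return (partition_index << 33) + row_in_partition


def fits_in_int(value: int) -> bool:
    """True if the value can be used as a non-negative 32-bit int ID."""
    return 0 <= value <= INT_MAX


# First ID produced in partitions 0, 1 and 2.
first_ids = [monotonic_id(p, 0) for p in range(3)]
print(first_ids)
print([fits_in_int(i) for i in first_ids])

# zipWithIndex, by contrast, yields consecutive indices 0..n-1, which fit
# in an int as long as there are at most INT_MAX + 1 rows.
n_rows = 5
zip_with_index_ids = list(range(n_rows))
print(all(fits_in_int(i) for i in zip_with_index_ids))
```

Under this layout, every ID from the second partition onward is already larger 
than Int.MaxValue, regardless of how few rows the data set has.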

> Use long as user / item ID for ALS
> --
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.0.1
> Reporter: Sean Owen
> Priority: Minor
> Attachments: ALS using MEMORY_AND_DISK.png, ALS using 
> MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason is that identifiers are not generally numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2017-07-19 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094106#comment-16094106
 ] 

Peng Meng commented on SPARK-2465:
--

I think it is time to revisit this now. Some of our customers, such as JD.com, 
have asked us to support Long IDs in ALS. They actually have more than 
Int.MaxValue products, so Long IDs in ALS are a necessity for them. 
What do you think about reopening your PR? [~srowen]




[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2014-07-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063344#comment-14063344
 ] 

Sean Owen commented on SPARK-2465:
--

Yeah, that's a good separate point. My hunch is that serialization would remove 
much of the difference.
I think this may need to be closed for now as there's not quite enough support; 
I'll comment separately on the PR.

--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2014-07-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063159#comment-14063159
 ] 

Xiangrui Meng commented on SPARK-2465:
--

[~srowen] One interesting trade-off is to use MEMORY_AND_DISK_SER with 
spark.rdd.compress=true instead of MEMORY_AND_DISK for user/product in/out 
links (and maybe ratings as well). It slows down ALS a little bit and saves a 
lot of space. See the screenshots attached. (Note: Kryo was not used.)



[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2014-07-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060474#comment-14060474
 ] 

Sean Owen commented on SPARK-2465:
--

Yeah, more data is a cost for sure. I had thought it was mostly an in-memory 
cost since these data structures get cached in memory, and/or you are 
processing parsed blocks rather than individual Rating objects. It might be 
~25% more raw data?

This kind of densification is an error though, right? I treat two users as if 
they're the same when they are not. I think 0.1% collision is quite high -- 
every thousandth user or so is getting recommendations for some other person. 
You also end up reporting that certain users and items have no presence in the 
model.

If my estimate is right, you end up with that kind of collision rate with ~2 
million users or products and a 32-bit hash. It's 'safe' up to a hundred 
thousand or so (0 collisions is more probable than 1 or more).
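
Estimates of this kind can be sanity-checked with the standard birthday 
approximation. The sketch below is plain Python and assumes an ideal, 
uniformly distributed hash, which real hash functions only approximate; the 
function names are hypothetical, and the figures it prints are approximations 
that need not match the rough numbers quoted above.

```python
import math


def p_any_collision(n_ids: int, hash_bits: int) -> float:
    """Approximate probability that at least two of n_ids share a hash.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2^bits)).
    """
    space = 2.0 ** hash_bits
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


def colliding_fraction(n_ids: int, hash_bits: int) -> float:
    """Approximate fraction of IDs whose hash is shared with another ID.

    For one ID, the chance that any of the other n-1 IDs lands on the
    same value is about 1 - exp(-(n-1) / 2^bits).
    """
    space = 2.0 ** hash_bits
    return -math.expm1(-(n_ids - 1) / space)


# Compare 32-bit and 64-bit hashes at the scales discussed in the thread.
for n in (100_000, 2_000_000):
    for bits in (32, 64):
        print(
            f"n={n:>9,} bits={bits}: "
            f"P(any collision)={p_any_collision(n, bits):.3g}, "
            f"colliding fraction={colliding_fraction(n, bits):.3g}"
        )
```

Both quantities matter here: the first says whether any two users are merged 
at all, the second says what share of users end up sharing an ID with someone 
else.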

Did you mean you've benchmarked this to see the blow-up or can do so? I agree 
that it really depends on the cost. I had thought it was probably not bad 
compared to being able to hash more reliably.



[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060127#comment-14060127
 ] 

Sean Owen commented on SPARK-2465:
--

https://github.com/apache/spark/pull/1393
