Hi Xiangrui,

Thanks for the reply! I will explore the suggested solutions.

-Nishanth
Hi Nishanth,

Just found out where you work :). We had some discussion in
https://issues.apache.org/jira/browse/SPARK-2465 . Having long IDs
will increase the communication cost, which may not be worth the benefit.
Not many companies have more than 1 billion users. If they do, maybe
they can mirror the implementation for their use cases. I can suggest
several possible solutions:

1. Hash user IDs into integers before training. If the collision rate
is high and it is crucial for your business, you can recompute user
features from product features by solving least squares after
training. This works when the product IDs can be mapped to integers.
2. Make type aliases in ALS, so that you can easily mirror the
implementation to use long IDs and track future changes.
3. Make ALS implementation use generic ID types. This would be the
best solution, but it requires some refactoring of the code.
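A rough sketch of option 1 in plain Python (not Spark code; the `to_int_id` helper and the sample IDs are made up for illustration): fold large IDs into the non-negative 32-bit range that MLlib's ALS expects, and check how many distinct long IDs collide on the same int.

```python
# Sketch only: hash 64-bit (or larger) IDs down to 31-bit non-negative ints.
# Two distinct long IDs can map to the same int, so collisions must be counted.
def to_int_id(big_id):
    # Mask to the signed 32-bit non-negative range (0 .. 2**31 - 1).
    return hash(big_id) & 0x7FFFFFFF

# Hypothetical user IDs that overflow a 32-bit int.
user_ids = [10_000_000_007, 10_000_000_008, 9_876_543_210_123]

mapping = {}      # int id -> original long id (for joining back after training)
collisions = []   # pairs of long ids that hashed to the same int
for uid in user_ids:
    small = to_int_id(uid)
    if small in mapping and mapping[small] != uid:
        collisions.append((uid, mapping[small]))
    mapping[small] = uid

print(len(mapping), len(collisions))
```

Keeping the `mapping` dict around lets you translate the int IDs in the trained model back to the original long IDs; if `collisions` is non-empty and accuracy matters, that is where recomputing user features from product features via least squares comes in.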

Best,
Xiangrui

On Wed, Jan 14, 2015 at 1:04 PM, Nishanth P S <nishant...@gmail.com> wrote:
> Yes, we are close to having more than 2 billion users. In this case, what is
> the best way to handle this?
>
> Thanks,
> Nishanth
>
> On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Do you have more than 2 billion users/products? If not, you can pair
>> each user/product id with an integer (check RDD.zipWithUniqueId), use
>> them in ALS, and then join the original bigInt IDs back after
>> training. -Xiangrui
>>
>> On Fri, Jan 9, 2015 at 5:12 PM, nishanthps <nishant...@gmail.com> wrote:
>> > Hi,
>> >
>> > The userIds and productIds in my data are bigInts. What is the best way
>> > to run collaborative filtering on this data? Should I modify MLlib's
>> > implementation to support more types, or is there an easier way?
>> >
>> > Thanks!
>> > Nishanth
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-BigInteger-for-userId-and-productId-in-collaborative-Filtering-tp21072.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>
>
