Re: why use the job 'itemIDIndex' to convert the itemid to index?

2011-09-20 Thread Sean Owen
It is a problem, but it should be rare. IDs are hashed to 31-bit
integers, so the probability that any given pair collides is small.
However, you don't have to have too many items before it's probable
that some two have collided. (IIRC, that's about 2^(31/2), or roughly
46,000 items.)
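
The 2^(31/2) figure is the usual birthday-bound estimate. A minimal sketch of the arithmetic, assuming the hash spreads IDs uniformly over the 2^31 possible indexes (real ID distributions may behave differently); the function names are made up for illustration:

```python
import math


def collision_probability(n_items: int, n_buckets: int = 2**31) -> float:
    """Approximate chance that at least two of n_items share a bucket."""
    # Standard birthday approximation: 1 - exp(-n(n-1) / (2m)).
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * n_buckets))


def items_for_even_odds(n_buckets: int = 2**31) -> int:
    """Approximate item count at which a collision is as likely as not."""
    return math.ceil(math.sqrt(2.0 * n_buckets * math.log(2.0)))


for n in (1_000, 46_341, 100_000, 1_000_000):
    print(f"{n:>9} items -> P(collision) ~ {collision_probability(n):.4f}")
print("even odds at about", items_for_even_odds(), "items")
```

Under that assumption, a catalogue of a few thousand items is very unlikely to see any collision, while one with a million items almost certainly contains at least one colliding pair.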

In practice it doesn't hurt much. It just means that data from two
different items has been mixed up and treated as if it was all from
one item. That's not ideal, but has a tiny overall effect on
recommendations.

Another practical tip: if your item IDs all already fit into a
non-negative 31-bit int (0 to 2^31 - 1), then the hash function won't
mix them up at all, as all of them will hash to themselves.
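
To illustrate why small IDs map to themselves, here is a sketch of one plausible hash: fold the 64-bit ID down to 32 bits with an XOR, then keep the low 31 bits. This is an assumption made for illustration, not a copy of the Mahout source, and `id_to_index` is a hypothetical name.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF
MASK_31 = 0x7FFFFFFF


def id_to_index(item_id: int) -> int:
    """Hypothetical ID hash: XOR-fold 64 bits to 32, then keep the low 31 bits."""
    u = item_id & MASK_64  # view the ID as a 64-bit two's-complement value
    return (u ^ (u >> 32)) & MASK_31


# IDs already in the range 0 .. 2^31 - 1 come back unchanged.
print(id_to_index(42), id_to_index(2**31 - 1))

# Larger IDs can land on an index that a small ID already occupies.
print(id_to_index(2**32), id_to_index(1))
```

With this particular fold, the upper 32 bits of a small non-negative ID are all zero, so the XOR changes nothing and the mask keeps every bit.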

2011/9/20 张玉东 :
> I am having trouble with this problem: if two item IDs are mapped to the
> same index, how is the similarity between them computed?


Re: why use the job 'itemIDIndex' to convert the itemid to index?

2011-09-20 Thread 张玉东
Yes, the probability of collision is quite small. But my point is that this
step does not seem necessary; I cannot see how it helps the subsequent
computations.



Re: why use the job 'itemIDIndex' to convert the itemid to index?

2011-09-20 Thread Sean Owen
It is necessary. We want to support input where IDs are possibly
64-bit longs, for consistency with the non-distributed code.
But, 64-bit values are too large to be used as indexes into a Vector.
So they are hashed and then un-hashed by a dictionary lookup.
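
A rough sketch of the hash-then-un-hash idea: each ID is reduced to a 31-bit index, and a dictionary from index back to the original ID is kept so that results can be reported with the real IDs. The hash and the helper names are hypothetical, and the sketch assumes that the last ID written for an index wins; the actual job may resolve collisions differently.

```python
def id_to_index(item_id: int) -> int:
    """Hypothetical ID hash: XOR-fold 64 bits to 32, then keep the low 31 bits."""
    u = item_id & 0xFFFFFFFFFFFFFFFF
    return (u ^ (u >> 32)) & 0x7FFFFFFF


def build_index_dictionary(item_ids):
    """Map each vector index back to an original 64-bit item ID."""
    index_to_id = {}
    for item_id in item_ids:
        # If two IDs share an index, the later one replaces the earlier one here.
        index_to_id[id_to_index(item_id)] = item_id
    return index_to_id


item_ids = [7, 2**40 + 5, 2**32 + 6]
index_to_id = build_index_dictionary(item_ids)

for item_id in item_ids:
    index = id_to_index(item_id)
    print(f"id {item_id} -> index {index} -> id {index_to_id[index]}")
```

In this example the IDs 7 and 2^32 + 6 share index 7, so the lookup returns the same original ID for both. That is the "mixed up" case described earlier in the thread.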

On Tue, Sep 20, 2011 at 11:44 AM, 张玉东  wrote:
> Yes, the probability of collision is quite small. But my point is that this
> step does not seem necessary; I cannot see how it helps the subsequent
> computations.


Re: why use the job 'itemIDIndex' to convert the itemid to index?

2011-09-20 Thread 张玉东
Thanks, I understand. I am not familiar with the algorithms of the
non-distributed method.
