Ted,

Thanks again for your detailed response.

For the sake of this discussion let's assume that there is a finite set of
interests (1..n) and *interested_in*(i) is either 0 or 1
A user is either interested in a subject or not, which is why I thought of
placing a one or zero in the dimension representing the specific interest.

If I understand correctly, in normalization option #2 you mean that each
interest is encoded to a value so that the sum of all interests is 1?
Also, what do you mean by "normalize the interests to have unit vector
magnitude"?




On Sat, Jan 14, 2012 at 7:30 AM, Raviv Pavel <ra...@gigya-inc.com> wrote:

> Regarding 1 to n mapping, are you referring to the idea of coding each
> interest in the super set as an integer N and placing a 1 in the Nth
> dimension if a user is interested in it?
>

[Ted] Yes.  But, not necessarily a 1.


> So assuming 5 interests, users will be vectorized as follows:
>
> g   a    x      y      z      i1  i2  i3  i4  i5
> -------------------------------------------------
> 1   40   -2650  -4471  3678   0   1   0   0   0
> 0   25   -1641  -4055  4626   1   1   1   0   1
>
> Wouldn't the distance measure be skewed if users have different numbers of
> interests, as in the example above?
>

[Ted]
Yes.  It is common in these cases to pick one of the following strategies:

- use a 1 plus a bit of statistics to decide whether the amount of
cooccurrence is anomalously interesting.  Where the interest marking truly
is a count rather than a yes/no answer from the user, this can be a good
approach, since it handles the fact that you have three states (similar,
different, and don't know) to analyze.

- normalize the interests to sum to 1.  This is nice when you want to
pretend that the interests are a probability.

- normalize the interests to have unit vector magnitude.  This makes
cosines interesting and useful.

The last two approaches have the problem that they don't distinguish
between the three cases.  These are bad when the interests really are
counts because they lose important information.

If the interests truly are 1 of N rather than k of N, then any
normalization gives about the same effect.

As a first attempt, I would vector normalize the interests (third approach).
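A minimal sketch of that third option (unit-vector normalization of the interest dimensions); the function name is illustrative, not a Mahout API:

```python
import math

def normalize_interests(interests):
    """Scale a 0/1 interest vector to unit L2 magnitude.

    A user with k interests gets 1/sqrt(k) in each interested dimension,
    so users are not farther from everyone else merely because they
    ticked more boxes.
    """
    norm = math.sqrt(sum(v * v for v in interests))
    if norm == 0:
        return list(interests)  # no interests: leave the all-zero vector
    return [v / norm for v in interests]

# The two example users from the table above:
u1 = normalize_interests([0, 1, 0, 0, 0])  # one interest
u2 = normalize_interests([1, 1, 1, 0, 1])  # four interests
# Both now have magnitude 1, so their dot product is their cosine similarity.
```

With both vectors at unit length, Euclidean distance and cosine similarity become interchangeable for comparing users, which is why this makes "cosines interesting and useful."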

> Is the distance you outlined implemented by Mahout's
> WeightedEuclideanDistanceMeasure?
>

[Ted]
I believe so, but don't have time to check for sure.

--Raviv



On Fri, Jan 13, 2012 at 10:49 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> I usually prefer to represent location as an xyz triple on a unit sphere.
>  That allows Euclidean distance to be useful.
>
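A sketch of that lat/lon-to-unit-sphere mapping (assuming coordinates come in as degrees; the function name is illustrative):

```python
import math

def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to an (x, y, z) point on the
    unit sphere.  Euclidean (chord) distance between two such points
    grows monotonically with great-circle distance, so it is a usable
    proxy for geographic closeness."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))
```

(The x/y/z values in the example table above appear to use an Earth-radius scale rather than a unit sphere; either works as long as the scale is accounted for in the weighting.)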
> On the 1 of n encoded values, Euclidean works as well.  For gender, it also
> works fine.
>
> The only issue is how to combine these with reasonable weightings.  An easy
> way to do this is to have a weighted Euclidean distance which looks like
>
>    distance(A, B) = sqrt {\sum_i w_i (a_i - b_i)^2}
>
> Figuring out the weights is a bit tricky, but not horrendously hard.
>
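The weighted Euclidean distance above can be sketched directly (plain Python, not the Mahout implementation):

```python
import math

def weighted_euclidean(a, b, w):
    """distance(A, B) = sqrt( sum_i w_i * (a_i - b_i)^2 )

    The weights w_i let heterogeneous dimensions (gender, age,
    location, interests) contribute on comparable scales.
    """
    return math.sqrt(sum(wi * (ai - bi) ** 2
                         for wi, ai, bi in zip(w, a, b)))
```

With all weights equal to 1 this reduces to ordinary Euclidean distance; raising a dimension's weight makes disagreement there count for more.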
> On Fri, Jan 13, 2012 at 12:02 PM, Sean Owen <sro...@gmail.com> wrote:
>
> > Certainly not the only solution. As I've been saying: what would it
> > mean to have n distance measures -- how would you combine them?
> >
> > If you can answer that, you can likely just as easily transform the
> > input so that the result is meaningful when all dimensions are
> > combined by one metric.
> >
> > This is certainly the basic idea, and it isn't necessarily simple.
> > But it's straightforward and I think strictly less hard than what you
> > are contemplating.
> >
> >
> > On Fri, Jan 13, 2012 at 4:57 PM, Raviv Pavel <ra...@gigya-inc.com>
> > wrote:
> > > I think the only solution would be to develop a custom distance
> > > measure that's aware of the "meaning" of each dimension(s) and
> > > returns the distance accordingly.
> > > Unless there is a way to vectorize user profiles in such a way that
> > > will allow me to use one of the built-in distance measures.
> > >
> >
>
