Re: reproducibility

2013-03-18 Thread Sebastian Schelter
> KNN does not have a stochastic element. I think you would get the same
> results on one platform, unless I'm missing something.

These also have a stochastic element, as the Hadoop-based recommenders
randomly down-sample the interaction histories of power users. However,
this should only have a small impact on the result and can also be made
deterministic by fixing the seed of the RNG.
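For the non-distributed recommenders this is a one-liner; a minimal sketch
(it relies on Mahout's RandomUtils.useTestSeed(); for the Hadoop jobs the
seed would have to be fixed inside the job code itself, since the sampling
runs in separate mapper JVMs):

import org.apache.mahout.common.RandomUtils;

public class DeterministicRun {
  public static void main(String[] args) throws Exception {
    // Force RandomUtils.getRandom() onto a fixed seed so that any random
    // down-sampling done in this JVM is repeatable across runs.
    RandomUtils.useTestSeed();
    // ... build the DataModel / recommender and run as usual ...
  }
}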

> On Sun, Mar 17, 2013 at 1:43 PM, Koobas  wrote:
> 
>> I am asking the basic reproducibility question.
>> If I run twice on the same dataset, with the same hardware setup, will I
>> always get the same results?
>> Or is there any chance that on two different runs, the same user will get
>> slightly different suggestions?
>> I am mostly revolving in the space of numerical libraries, where
>> reproducibility is, sort of, a big deal.
>> Maybe it's not much of a concern in machine learning.
>> I am just curious.
>>
>>
>> On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen  wrote:
>>
>>> What's your question? ALS has a random starting point which changes the
>>> results a bit. Not sure about KNN though.
>>>
>>>
>>
>>> On Sun, Mar 17, 2013 at 3:03 AM, Koobas  wrote:
>>>
 Can anybody shed any light on the issue of reproducibility in Mahout,
 with and without Hadoop, specifically in the context of kNN and ALS
 recommenders?

>>>
>>
> 



Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Agata Filiana
Hi,

Thanks, Sean, for the response. I like the idea of multiplying the similarity
metric based on user properties with the one based on CF data.
I understand that I have to create a separate similarity metric - can I do
this with the help of Mahout, or does this have to be done separately, as in
I have to implement my own similarity measure? It would be great if there
is some clue on how to get started.
Is this somehow similar to the subject of *Injecting domain-specific
information* in the book Mahout in Action (with the example of the
gender-based item similarity metric)?

And also, how can I multiply the two results - will this affect the
evaluation of the recommender system? Or should it be normalized in
some way?

Thank you and sorry for the basic questions.

Regards,

Agata Filiana


On 16 March 2013 13:41, Sean Owen  wrote:

> There are many ways to think about combining these two types of data.
>
> If you can make some similarity metric based on age, gender and interests,
> then you can use it as the similarity metric in
> GenericBooleanPrefUserBasedRecommender. You would be using both data sets
> in some way. Of course this means learning a whole different similarity
> metric somehow. A variant on this is to make a similarity metric based on
> user properties, and also use one based on CF data, and multiply them
> together to make a new combined similarity metric for this approach. This
> might work OK.
>
> It can also work to treat age and gender and other features as categorical
> features, and then model them as 'items' that the user interacts with. They
> would not have much of an effect here given how many items there are. In
> other models like ALS-WR you can weight these pseudo-items much more highly
> and get the desired effect to a degree.
>
>
>
> On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana  >wrote:
>
> > Hi,
> >
> > I'm fairly new to Mahout. Right now I am experimenting Mahout by trying
> to
> > build a simple recommendation system. What I have is just a boolean data
> > set, with only the userID and itemID. I understand that for this case I
> > have to use GenericBooleanPrefUserBasedRecommender - which I have and
> works
> > fine.
> >
> > Apart from the userID and itemID data, I also have the user's attributes
> > (their age, gender, list of interests). I would like to combine this into
> > the recommendation system to increase the performance of the recommender.
> > Is this possible to do or am I trying something that does not make sense?
> >
> > It would be great if you can give me any inputs or ideas for this. (Or
> any
> > good read based on this matter)
> >
> > Thank you!
> >
> > Regards,
> >
> > *Agata Filiana*
> > Erasmus Mundus Student
> >
>



-- 
*Agata Filiana*


Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
You would have to make up the similarity metric separately since it depends
entirely on how you want to define it.
The part of the book you are talking about concerns rescoring, which is not
the same thing.
Combine the similarity metrics, I mean, not make two recommenders. Make a
metric that is the product of two other metrics. Normalize both of those
metrics to the range [0,1].
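As a rough illustration, such a product metric could look like the sketch
below (ProductUserSimilarity is just a made-up name, and it assumes both
component metrics already return values in [0,1]):

import java.util.Collection;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ProductUserSimilarity implements UserSimilarity {

  private final UserSimilarity cfSimilarity;        // e.g. Tanimoto over the boolean data
  private final UserSimilarity attributeSimilarity; // your own metric over age/gender/interests

  public ProductUserSimilarity(UserSimilarity cfSimilarity, UserSimilarity attributeSimilarity) {
    this.cfSimilarity = cfSimilarity;
    this.attributeSimilarity = attributeSimilarity;
  }

  @Override
  public double userSimilarity(long userID1, long userID2) throws TasteException {
    // Both factors are assumed to lie in [0,1], so their product does too.
    return cfSimilarity.userSimilarity(userID1, userID2)
        * attributeSimilarity.userSimilarity(userID1, userID2);
  }

  @Override
  public void setPreferenceInferrer(PreferenceInferrer inferrer) {
    // Not meaningful for boolean data; ignored.
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    cfSimilarity.refresh(alreadyRefreshed);
    attributeSimilarity.refresh(alreadyRefreshed);
  }
}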

Sean


On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana wrote:

> Hi,
>
> Thank Sean for the response. I like the idea of multiplying the similarity
> metric based on
> user properties with the one based on CF data.
> I understand that I have to create a seperate similarity metric - can I do
> this with the help of Mahout or does this have to be done seperately, as in
> I have to implement my own similarity measure? It would be great if there
> is some clue on how I get this started.
> Is this somehow similar to the subject of *Injecting domain-specific
> information* in the book Mahout in Action (with the example of the
> gender-based item similarity metric)?
>
> And also how can I multiply the two results - will this affect the result
> of the evaluation of the recommender system? Or it should be normalized in
> a way?
>
> Thank you and sorry for the basic questions.
>
> Regards,
>
> Agata Filiana
>
>
> On 16 March 2013 13:41, Sean Owen  wrote:
>
> > There are many ways to think about combining these two types of data.
> >
> > If you can make some similarity metric based on age, gender and
> interests,
> > then you can use it as the similarity metric in
> > GenericBooleanPrefUserBasedRecommender. You would be using both data sets
> > in some way. Of course this means learning a whole different similarity
> > metric somehow. A variant on this is to make a similarity metric based on
> > user properties, and also use one based on CF data, and multiply them
> > together to make a new combined similarity metric for this approach. This
> > might work OK.
> >
> > It can also work to treat age and gender and other features as
> categorical
> > features, and then model them as 'items' that the user interacts with.
> They
> > would not have much of an effect here given how many items there are. In
> > other models like ALS-WR you can weight these pseudo-items much more
> highly
> > and get the desired effect to a degree.
> >
> >
> >
> > On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana  > >wrote:
> >
> > > Hi,
> > >
> > > I'm fairly new to Mahout. Right now I am experimenting Mahout by trying
> > to
> > > build a simple recommendation system. What I have is just a boolean
> data
> > > set, with only the userID and itemID. I understand that for this case I
> > > have to use GenericBooleanPrefUserBasedRecommender - which I have and
> > works
> > > fine.
> > >
> > > Apart from the userID and itemID data, I also have the user's
> attributes
> > > (their age, gender, list of interests). I would like to combine this
> into
> > > the recommendation system to increase the performance of the
> recommender.
> > > Is this possible to do or am I trying something that does not make
> sense?
> > >
> > > It would be great if you can give me any inputs or ideas for this. (Or
> > any
> > > good read based on this matter)
> > >
> > > Thank you!
> > >
> > > Regards,
> > >
> > > *Agata Filiana*
> > > Erasmus Mundus Student
> > >
> >
>
>
>
> --
> *Agata Filiana
> *
>


Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Agata Filiana
I understand how it works logically. However, I am having a problem
understanding the implementation and how to get the final outcome.
Say the user's attribute is hobbies: hobby1, hobby2, hobby3.
So I would make a similarity metric from the users and their hobbies.

Then for the CF part, I would use Mahout's GenericBooleanPrefUserBasedRecommender
with the boolean data set (userID and itemID).

Then somehow combine the two?

But in the end, my goal is to recommend the items in the second data set
(the itemIDs - not to recommend the hobbies) - does this make sense? Or am I
confusing myself?

Agata


On 18 March 2013 14:23, Sean Owen  wrote:

> You would have to make up the similarity metric separately since it depends
> entirely on how you want to define it.
> The part of the book you are talking about concerns rescoring, which is not
> the same thing.
> Combine the similarity metrics, I mean, not make two recommenders. Make a
> metric that is the product of two other metrics. Normalize both of those
> metrics to the range [0,1].
>
> Sean
>
>
> On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana  >wrote:
>
> > Hi,
> >
> > Thank Sean for the response. I like the idea of multiplying the
> similarity
> > metric based on
> > user properties with the one based on CF data.
> > I understand that I have to create a seperate similarity metric - can I
> do
> > this with the help of Mahout or does this have to be done seperately, as
> in
> > I have to implement my own similarity measure? It would be great if there
> > is some clue on how I get this started.
> > Is this somehow similar to the subject of *Injecting domain-specific
> > information* in the book Mahout in Action (with the example of the
> > gender-based item similarity metric)?
> >
> > And also how can I multiply the two results - will this affect the result
> > of the evaluation of the recommender system? Or it should be normalized
> in
> > a way?
> >
> > Thank you and sorry for the basic questions.
> >
> > Regards,
> >
> > Agata Filiana
> >
> >
> > On 16 March 2013 13:41, Sean Owen  wrote:
> >
> > > There are many ways to think about combining these two types of data.
> > >
> > > If you can make some similarity metric based on age, gender and
> > interests,
> > > then you can use it as the similarity metric in
> > > GenericBooleanPrefUserBasedRecommender. You would be using both data
> sets
> > > in some way. Of course this means learning a whole different similarity
> > > metric somehow. A variant on this is to make a similarity metric based
> on
> > > user properties, and also use one based on CF data, and multiply them
> > > together to make a new combined similarity metric for this approach.
> This
> > > might work OK.
> > >
> > > It can also work to treat age and gender and other features as
> > categorical
> > > features, and then model them as 'items' that the user interacts with.
> > They
> > > would not have much of an effect here given how many items there are.
> In
> > > other models like ALS-WR you can weight these pseudo-items much more
> > highly
> > > and get the desired effect to a degree.
> > >
> > >
> > >
> > > On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana  > > >wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm fairly new to Mahout. Right now I am experimenting Mahout by
> trying
> > > to
> > > > build a simple recommendation system. What I have is just a boolean
> > data
> > > > set, with only the userID and itemID. I understand that for this
> case I
> > > > have to use GenericBooleanPrefUserBasedRecommender - which I have and
> > > works
> > > > fine.
> > > >
> > > > Apart from the userID and itemID data, I also have the user's
> > attributes
> > > > (their age, gender, list of interests). I would like to combine this
> > into
> > > > the recommendation system to increase the performance of the
> > recommender.
> > > > Is this possible to do or am I trying something that does not make
> > sense?
> > > >
> > > > It would be great if you can give me any inputs or ideas for this.
> (Or
> > > any
> > > > good read based on this matter)
> > > >
> > > > Thank you!
> > > >
> > > > Regards,
> > > >
> > > > *Agata Filiana*
> > > > Erasmus Mundus Student
> > > >
> > >
> >
> >
> >
> > --
> > *Agata Filiana
> > *
> >
>



-- 
*Agata Filiana*


Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
There is a difference between the recommender and the similarity metric it
uses. My suggestion was to either use your item data with the recommender
and hobby data with the similarity metric, or, use both in the similarity
metric by making a combined metric.
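To make the wiring concrete, a sketch under some assumptions (the file names
are made up; the hobby data is modeled as a second boolean DataModel of
userID,hobbyID pairs so Tanimoto can serve as the attribute metric, and
ProductUserSimilarity is the combined metric sketched earlier in the thread;
users should appear in both files):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class HobbyBoostedRecommender {
  public static void main(String[] args) throws Exception {
    // The items you actually want to recommend: boolean userID,itemID data.
    DataModel itemModel = new FileDataModel(new File("user_item.csv"));
    // The hobby data, expressed as boolean userID,hobbyID pairs.
    DataModel hobbyModel = new FileDataModel(new File("user_hobby.csv"));

    // CF similarity over the item data; Tanimoto already lies in [0,1].
    UserSimilarity cf = new TanimotoCoefficientSimilarity(itemModel);
    // Attribute similarity over the hobby data, also in [0,1].
    UserSimilarity hobbies = new TanimotoCoefficientSimilarity(hobbyModel);
    // Combined metric: the product of the two.
    UserSimilarity combined = new ProductUserSimilarity(cf, hobbies);

    // The recommender still runs over the item data, so it recommends items, not hobbies.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, combined, itemModel);
    Recommender recommender =
        new GenericBooleanPrefUserBasedRecommender(itemModel, neighborhood, combined);

    List<RecommendedItem> recommendations = recommender.recommend(123L, 10);
    System.out.println(recommendations);
  }
}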


On Mon, Mar 18, 2013 at 9:44 AM, Agata Filiana wrote:

> I understand how it works logically. However I am having problem
> understanding about the implementation of it and how to get the final
> outcome.
> Say the user's attribute is Hobbies: hobby1,hobby2,hobby3
> So I would make the similarity metric of the users and hobbies.
>
> Then for the CF, using Mahout's GenericBooleanPrefUserBasedRecommender with
> the boolean data set (userID and itemID).
>
> Then somehow combine the two?
>
> But at the end, my goal is to recommend the items in the second data set
> (the itemID, not recommend the hobbies) - does this make sense? Or am I
> confusing myself?
>
> Agata
>
>
> On 18 March 2013 14:23, Sean Owen  wrote:
>
> > You would have to make up the similarity metric separately since it
> depends
> > entirely on how you want to define it.
> > The part of the book you are talking about concerns rescoring, which is
> not
> > the same thing.
> > Combine the similarity metrics, I mean, not make two recommenders. Make a
> > metric that is the product of two other metrics. Normalize both of those
> > metrics to the range [0,1].
> >
> > Sean
> >
> >
> > On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana  > >wrote:
> >
> > > Hi,
> > >
> > > Thank Sean for the response. I like the idea of multiplying the
> > similarity
> > > metric based on
> > > user properties with the one based on CF data.
> > > I understand that I have to create a seperate similarity metric - can I
> > do
> > > this with the help of Mahout or does this have to be done seperately,
> as
> > in
> > > I have to implement my own similarity measure? It would be great if
> there
> > > is some clue on how I get this started.
> > > Is this somehow similar to the subject of *Injecting domain-specific
> > > information* in the book Mahout in Action (with the example of the
> > > gender-based item similarity metric)?
> > >
> > > And also how can I multiply the two results - will this affect the
> result
> > > of the evaluation of the recommender system? Or it should be normalized
> > in
> > > a way?
> > >
> > > Thank you and sorry for the basic questions.
> > >
> > > Regards,
> > >
> > > Agata Filiana
> > >
> > >
> > > On 16 March 2013 13:41, Sean Owen  wrote:
> > >
> > > > There are many ways to think about combining these two types of data.
> > > >
> > > > If you can make some similarity metric based on age, gender and
> > > interests,
> > > > then you can use it as the similarity metric in
> > > > GenericBooleanPrefUserBasedRecommender. You would be using both data
> > sets
> > > > in some way. Of course this means learning a whole different
> similarity
> > > > metric somehow. A variant on this is to make a similarity metric
> based
> > on
> > > > user properties, and also use one based on CF data, and multiply them
> > > > together to make a new combined similarity metric for this approach.
> > This
> > > > might work OK.
> > > >
> > > > It can also work to treat age and gender and other features as
> > > categorical
> > > > features, and then model them as 'items' that the user interacts
> with.
> > > They
> > > > would not have much of an effect here given how many items there are.
> > In
> > > > other models like ALS-WR you can weight these pseudo-items much more
> > > highly
> > > > and get the desired effect to a degree.
> > > >
> > > >
> > > >
> > > > On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana <
> a.filian...@gmail.com
> > > > >wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm fairly new to Mahout. Right now I am experimenting Mahout by
> > trying
> > > > to
> > > > > build a simple recommendation system. What I have is just a boolean
> > > data
> > > > > set, with only the userID and itemID. I understand that for this
> > case I
> > > > > have to use GenericBooleanPrefUserBasedRecommender - which I have
> and
> > > > works
> > > > > fine.
> > > > >
> > > > > Apart from the userID and itemID data, I also have the user's
> > > attributes
> > > > > (their age, gender, list of interests). I would like to combine
> this
> > > into
> > > > > the recommendation system to increase the performance of the
> > > recommender.
> > > > > Is this possible to do or am I trying something that does not make
> > > sense?
> > > > >
> > > > > It would be great if you can give me any inputs or ideas for this.
> > (Or
> > > > any
> > > > > good read based on this matter)
> > > > >
> > > > > Thank you!
> > > > >
> > > > > Regards,
> > > > >
> > > > > *Agata Filiana*
> > > > > Erasmus Mundus Student
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > *Agata Filiana
> > > *
> > >
> >
>
>
>
> --
> *Agata Filiana
> *
>


Re: Reg Algorithms.

2013-03-18 Thread Danny Busch
Hi,
typically, multi-category classification can also be achieved by training
multiple binary classifiers. You loop through each of them, and for each
classifier that fires, you add its category.
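A minimal sketch of that loop (BinaryClassifier here is a made-up interface
for illustration, not a Mahout API):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OneVsAll {

  /** Hypothetical per-category binary classifier: true if the document belongs to its category. */
  public interface BinaryClassifier {
    boolean matches(String document);
  }

  /** Run every binary classifier; each one that fires contributes its category. */
  public static List<String> classify(String document, Map<String, BinaryClassifier> classifiers) {
    List<String> categories = new ArrayList<String>();
    for (Map.Entry<String, BinaryClassifier> entry : classifiers.entrySet()) {
      if (entry.getValue().matches(document)) {
        categories.add(entry.getKey());
      }
    }
    return categories;
  }
}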

best
Danny

http://www.klickbrett.de


VIGNESH S  wrote on 18 March 2013 at 14:43:
> Hi,
>
> What are the best algorithms suitable for multi category
> classification, not binary classification?
>
> As far as I understand, Bayesian and SVD are not suitable for multi
> category classification.
>
> Please kindly help..
>
> --
> Thanks and Regards
> Vignesh Srinivasan

Re: Reg Algorithms.

2013-03-18 Thread Kim Falk Jørgensen
Hi,

You could use "one vs all" classification, where you basically run the
classifications in series, asking for each category whether the instance is
inside that category or in some other. Then you can use any of the binary
classifiers, like naive Bayes
(http://en.wikipedia.org/wiki/Multiclass_classification). I am
not sure if that is what Danny Busch is also referring to.

Cheers
Kim



On Mon, Mar 18, 2013 at 2:56 PM, Danny Busch  wrote:

> Hi,
> typically, multi category classification can also be achieved by training
> multiple binary classificators. You loop through each of them and for each
> hitting classificator, you add a category.
>
> best
> Danny
>
> http://www.klickbrett.de
>
>
> VIGNESH S  hat am 18. März 2013 um 14:43
> geschrieben:
> > Hi,
> >
> > What are the best algorithms suitable for multi category
> > classification not binary classifcation.
> >
> > As Far as my understanding,Bayesian,SVD not suitable for multi
> > category classification..
> >
> > Please kindly help..
> >
> > --
> > Thanks and Regards
> > Vignesh Srinivasan
>



-- 
Best Regards

Kim Falk Jorgensen


Re: Reg Algorithms.

2013-03-18 Thread Danny Busch
Hi,

yes, this is what I was trying to point out. "One vs all" is what we usually
call that :-)

best,
  Danny

--
http://www.klickbrett.de


"Kim Falk Jørgensen"  hat am 18. März 2013 um
15:33 geschrieben:
> Hi,
>
> You could use the "one vs all" classification, where you basically
> serialize the classifications asking for each category if its inside the
> category or in some other. then you can use all the binary classifiers like
> naive Bayes (http://en.wikipedia.org/wiki/Multiclass_classification). I am
> not sure if that is what Danny Busch is also referring to.
>
> Cheers
> Kim
>
>
>
> On Mon, Mar 18, 2013 at 2:56 PM, Danny Busch  wrote:
>
> > Hi,
> > typically, multi category classification can also be achieved by training
> > multiple binary classificators. You loop through each of them and for each
> > hitting classificator, you add a category.
> >
> > best
> > Danny
> >
> > http://www.klickbrett.de
> >
> >
> > VIGNESH S  hat am 18. März 2013 um 14:43
> > geschrieben:
> > > Hi,
> > >
> > > What are the best algorithms suitable for multi category
> > > classification not binary classifcation.
> > >
> > > As Far as my understanding,Bayesian,SVD not suitable for multi
> > > category classification..
> > >
> > > Please kindly help..
> > >
> > > --
> > > Thanks and Regards
> > > Vignesh Srinivasan
> >
>
>
>
> --
> Best Regards
>
> Kim Falk Jorgensen

ALS-WR on Million Song dataset

2013-03-18 Thread Han JU
Hi,

I'm wondering whether someone has tried the ParallelALS with implicit feedback
job on the Million Song dataset? Any pointers on alpha and lambda?

In the paper alpha is 40 and lambda is 150, but I don't know what their
r values in the matrix are. They said it is based on the number of time units
that users have watched the show, so maybe it's big.

Many thanks!
-- 
*JU Han*

UTC   -  Université de Technologie de Compiègne
* **GI06 - Fouille de Données et Décisionnel*

+33 061960


Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
One word of caution: there are at least two papers on ALS, and they
define lambda differently. I think you are talking about "Collaborative
Filtering for Implicit Feedback Datasets".

I've been working with some folks who point out that alpha=40 seems to be
too high for most data sets. After running some tests on common data sets,
alpha=1 looks much better. YMMV.

In the end you have to evaluate these two parameters, and the # of
features, across a range to determine what's best.
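Something like the crude sweep below is usually enough; evaluate() is only a
placeholder for whatever held-out metric you trust (precision@k, AUC, ...),
not a Mahout API:

public class ParameterSweep {

  // Placeholder: plug in your own held-out evaluation here.
  static double evaluate(int numFeatures, double lambda, double alpha) {
    return 0.0;
  }

  public static void main(String[] args) {
    double best = Double.NEGATIVE_INFINITY;
    for (int numFeatures : new int[] {10, 20, 50}) {
      for (double lambda : new double[] {0.01, 0.1, 1.0}) {
        for (double alpha : new double[] {1, 5, 40}) {
          double score = evaluate(numFeatures, lambda, alpha);
          if (score > best) {
            best = score;
            System.out.printf("best so far: features=%d lambda=%.2f alpha=%.1f score=%.4f%n",
                numFeatures, lambda, alpha, score);
          }
        }
      }
    }
  }
}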

Is this data set not a bunch of audio features? I am not sure it works for
ALS, not naturally at least.


On Mon, Mar 18, 2013 at 12:39 PM, Han JU  wrote:

> Hi,
>
> I'm wondering has someone tried ParallelALS with implicite feedback job on
> million song dataset? Some pointers on alpha and lambda?
>
> In the paper alpha is 40 and lambda is 150, but I don't know what are their
> r values in the matrix. They said is based on time units that users have
> watched the show, so may be it's big.
>
> Many thanks!
> --
> *JU Han*
>
> UTC   -  Université de Technologie de Compiègne
> * **GI06 - Fouille de Données et Décisionnel*
>
> +33 061960
>


Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sebastian Schelter
JU,

are you refering to this dataset?

http://labrosa.ee.columbia.edu/millionsong/tasteprofile

On 18.03.2013 17:47, Sean Owen wrote:
> One word of caution, is that there are at least two papers on ALS and they
> define lambda differently. I think you are talking about "Collaborative
> Filtering for Implicit Feedback Datasets".
> 
> I've been working with some folks who point out that alpha=40 seems to be
> too high for most data sets. After running some tests on common data sets,
> alpha=1 looks much better. YMMV.
> 
> In the end you have to evaluate these two parameters, and the # of
> features, across a range to determine what's best.
> 
> Is this data set not a bunch of audio features? I am not sure it works for
> ALS, not naturally at least.
> 
> 
> On Mon, Mar 18, 2013 at 12:39 PM, Han JU  wrote:
> 
>> Hi,
>>
>> I'm wondering has someone tried ParallelALS with implicite feedback job on
>> million song dataset? Some pointers on alpha and lambda?
>>
>> In the paper alpha is 40 and lambda is 150, but I don't know what are their
>> r values in the matrix. They said is based on time units that users have
>> watched the show, so may be it's big.
>>
>> Many thanks!
>> --
>> *JU Han*
>>
>> UTC   -  Université de Technologie de Compiègne
>> * **GI06 - Fouille de Données et Décisionnel*
>>
>> +33 061960
>>
> 



Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Agata Filiana
In this case, would it be correct if I somehow "loop" through the item data
and the hobby data and then combine the scores for a pair of users?

I am having trouble figuring out how to combine both similarities into one
metric - could you possibly give me a clue?

Thank you

On 18 March 2013 14:54, Sean Owen  wrote:

> There is a difference between the recommender and the similarity metric it
> uses. My suggestion was to either use your item data with the recommender
> and hobby data with the similarity metric, or, use both in the similarity
> metric by making a combined metric.
>
>
> On Mon, Mar 18, 2013 at 9:44 AM, Agata Filiana  >wrote:
>
> > I understand how it works logically. However I am having problem
> > understanding about the implementation of it and how to get the final
> > outcome.
> > Say the user's attribute is Hobbies: hobby1,hobby2,hobby3
> > So I would make the similarity metric of the users and hobbies.
> >
> > Then for the CF, using Mahout's GenericBooleanPrefUserBasedRecommender
> with
> > the boolean data set (userID and itemID).
> >
> > Then somehow combine the two?
> >
> > But at the end, my goal is to recommend the items in the second data set
> > (the itemID, not recommend the hobbies) - does this make sense? Or am I
> > confusing myself?
> >
> > Agata
> >
> >
> > On 18 March 2013 14:23, Sean Owen  wrote:
> >
> > > You would have to make up the similarity metric separately since it
> > depends
> > > entirely on how you want to define it.
> > > The part of the book you are talking about concerns rescoring, which is
> > not
> > > the same thing.
> > > Combine the similarity metrics, I mean, not make two recommenders.
> Make a
> > > metric that is the product of two other metrics. Normalize both of
> those
> > > metrics to the range [0,1].
> > >
> > > Sean
> > >
> > >
> > > On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana  > > >wrote:
> > >
> > > > Hi,
> > > >
> > > > Thank Sean for the response. I like the idea of multiplying the
> > > similarity
> > > > metric based on
> > > > user properties with the one based on CF data.
> > > > I understand that I have to create a seperate similarity metric -
> can I
> > > do
> > > > this with the help of Mahout or does this have to be done seperately,
> > as
> > > in
> > > > I have to implement my own similarity measure? It would be great if
> > there
> > > > is some clue on how I get this started.
> > > > Is this somehow similar to the subject of *Injecting domain-specific
> > > > information* in the book Mahout in Action (with the example of the
> > > > gender-based item similarity metric)?
> > > >
> > > > And also how can I multiply the two results - will this affect the
> > result
> > > > of the evaluation of the recommender system? Or it should be
> normalized
> > > in
> > > > a way?
> > > >
> > > > Thank you and sorry for the basic questions.
> > > >
> > > > Regards,
> > > >
> > > > Agata Filiana
> > > >
> > > >
> > > > On 16 March 2013 13:41, Sean Owen  wrote:
> > > >
> > > > > There are many ways to think about combining these two types of
> data.
> > > > >
> > > > > If you can make some similarity metric based on age, gender and
> > > > interests,
> > > > > then you can use it as the similarity metric in
> > > > > GenericBooleanPrefUserBasedRecommender. You would be using both
> data
> > > sets
> > > > > in some way. Of course this means learning a whole different
> > similarity
> > > > > metric somehow. A variant on this is to make a similarity metric
> > based
> > > on
> > > > > user properties, and also use one based on CF data, and multiply
> them
> > > > > together to make a new combined similarity metric for this
> approach.
> > > This
> > > > > might work OK.
> > > > >
> > > > > It can also work to treat age and gender and other features as
> > > > categorical
> > > > > features, and then model them as 'items' that the user interacts
> > with.
> > > > They
> > > > > would not have much of an effect here given how many items there
> are.
> > > In
> > > > > other models like ALS-WR you can weight these pseudo-items much
> more
> > > > highly
> > > > > and get the desired effect to a degree.
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana <
> > a.filian...@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm fairly new to Mahout. Right now I am experimenting Mahout by
> > > trying
> > > > > to
> > > > > > build a simple recommendation system. What I have is just a
> boolean
> > > > data
> > > > > > set, with only the userID and itemID. I understand that for this
> > > case I
> > > > > > have to use GenericBooleanPrefUserBasedRecommender - which I have
> > and
> > > > > works
> > > > > > fine.
> > > > > >
> > > > > > Apart from the userID and itemID data, I also have the user's
> > > > attributes
> > > > > > (their age, gender, list of interests). I would like to combine
> > this
> > > > into
> > > > > > the recommendation system to increase the performance of the
> > > > recommender.

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Han JU
Thanks for quick responses.

Yes, it's that dataset. What I'm using is triplets of "user_id song_id
play_times", for ~1m users. No audio things, just plain text triples.

It seems to me that the "implicit feedback" paper matches this dataset well:
no explicit ratings, but the number of times a song was listened to.

Thank you Sean for the alpha value; I think they use big numbers because
their values in the R matrix are big.


2013/3/18 Sebastian Schelter 

> JU,
>
> are you refering to this dataset?
>
> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>
> On 18.03.2013 17:47, Sean Owen wrote:
> > One word of caution, is that there are at least two papers on ALS and
> they
> > define lambda differently. I think you are talking about "Collaborative
> > Filtering for Implicit Feedback Datasets".
> >
> > I've been working with some folks who point out that alpha=40 seems to be
> > too high for most data sets. After running some tests on common data
> sets,
> > alpha=1 looks much better. YMMV.
> >
> > In the end you have to evaluate these two parameters, and the # of
> > features, across a range to determine what's best.
> >
> > Is this data set not a bunch of audio features? I am not sure it works
> for
> > ALS, not naturally at least.
> >
> >
> > On Mon, Mar 18, 2013 at 12:39 PM, Han JU  wrote:
> >
> >> Hi,
> >>
> >> I'm wondering has someone tried ParallelALS with implicite feedback job
> on
> >> million song dataset? Some pointers on alpha and lambda?
> >>
> >> In the paper alpha is 40 and lambda is 150, but I don't know what are
> their
> >> r values in the matrix. They said is based on time units that users have
> >> watched the show, so may be it's big.
> >>
> >> Many thanks!
> >> --
> >> *JU Han*
> >>
> >> UTC   -  Université de Technologie de Compiègne
> >> * **GI06 - Fouille de Données et Décisionnel*
> >>
> >> +33 061960
> >>
> >
>
>


-- 
*JU Han*

Software Engineer Intern @ KXEN Inc.
UTC   -  Université de Technologie de Compiègne
* **GI06 - Fouille de Données et Décisionnel*

+33 061960


Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sebastian Schelter
You should also be aware that the alpha parameter comes from a formula
the authors introduce to measure the "confidence" in the observed values:

confidence = 1 + alpha * observed_value

You can also change that formula in the code to something that fits your
data better; the paper even suggests alternative variants.
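For illustration, the linear form above and the log-based alternative the
paper mentions look roughly like this (epsilon is a tuning constant from the
paper):

public class Confidence {

  /** c = 1 + alpha * r, the default linear form. */
  static double linear(double r, double alpha) {
    return 1.0 + alpha * r;
  }

  /** c = 1 + alpha * log(1 + r / epsilon), the log variant the paper suggests. */
  static double logarithmic(double r, double alpha, double epsilon) {
    return 1.0 + alpha * Math.log(1.0 + r / epsilon);
  }
}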

Best,
Sebastian


On 18.03.2013 18:06, Han JU wrote:
> Thanks for quick responses.
> 
> Yes it's that dataset. What I'm using is triplets of "user_id song_id
> play_times", of ~ 1m users. No audio things, just plein text triples.
> 
> It seems to me that the paper about "implicit feedback" matchs well this
> dataset: no explicit ratings, but times of listening to a song.
> 
> Thank you Sean for the alpha value, I think they use big numbers is because
> their values in the R matrix is big.
> 
> 
> 2013/3/18 Sebastian Schelter 
> 
>> JU,
>>
>> are you refering to this dataset?
>>
>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>
>> On 18.03.2013 17:47, Sean Owen wrote:
>>> One word of caution, is that there are at least two papers on ALS and
>> they
>>> define lambda differently. I think you are talking about "Collaborative
>>> Filtering for Implicit Feedback Datasets".
>>>
>>> I've been working with some folks who point out that alpha=40 seems to be
>>> too high for most data sets. After running some tests on common data
>> sets,
>>> alpha=1 looks much better. YMMV.
>>>
>>> In the end you have to evaluate these two parameters, and the # of
>>> features, across a range to determine what's best.
>>>
>>> Is this data set not a bunch of audio features? I am not sure it works
>> for
>>> ALS, not naturally at least.
>>>
>>>
>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU  wrote:
>>>
 Hi,

 I'm wondering has someone tried ParallelALS with implicite feedback job
>> on
 million song dataset? Some pointers on alpha and lambda?

 In the paper alpha is 40 and lambda is 150, but I don't know what are
>> their
 r values in the matrix. They said is based on time units that users have
 watched the show, so may be it's big.

 Many thanks!
 --
 *JU Han*

 UTC   -  Université de Technologie de Compiègne
 * **GI06 - Fouille de Données et Décisionnel*

 +33 061960

>>>
>>
>>
> 
> 



Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
Yes that's fine input then.

Large alpha should go with small R values, not large R. Really alpha
controls how much observed input (R != 0) is weighted towards 1 versus how
much unobserved input (R=0) is weighted to 0. I scale lambda by alpha to
complete this effect.
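For reference, my reading of the cost function in the implicit-feedback
formulation, written in the same style as Sebastian's formula (x_u and y_i are
the user and item feature vectors; double-check against the paper):

cost(X, Y) = sum over all (u, i) of c_ui * (p_ui - x_u . y_i)^2
             + lambda * (sum_u ||x_u||^2 + sum_i ||y_i||^2)

where p_ui = 1 if r_ui > 0 and 0 otherwise, and c_ui = 1 + alpha * r_ui

So alpha only raises the weight on the observed cells, while every unobserved
cell is still pulled toward 0 with weight 1.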


On Mon, Mar 18, 2013 at 1:06 PM, Han JU  wrote:

> Thanks for quick responses.
>
> Yes it's that dataset. What I'm using is triplets of "user_id song_id
> play_times", of ~ 1m users. No audio things, just plein text triples.
>
> It seems to me that the paper about "implicit feedback" matchs well this
> dataset: no explicit ratings, but times of listening to a song.
>
> Thank you Sean for the alpha value, I think they use big numbers is because
> their values in the R matrix is big.
>
>
> 2013/3/18 Sebastian Schelter 
>
> > JU,
> >
> > are you refering to this dataset?
> >
> > http://labrosa.ee.columbia.edu/millionsong/tasteprofile
> >
> > On 18.03.2013 17:47, Sean Owen wrote:
> > > One word of caution, is that there are at least two papers on ALS and
> > they
> > > define lambda differently. I think you are talking about "Collaborative
> > > Filtering for Implicit Feedback Datasets".
> > >
> > > I've been working with some folks who point out that alpha=40 seems to
> be
> > > too high for most data sets. After running some tests on common data
> > sets,
> > > alpha=1 looks much better. YMMV.
> > >
> > > In the end you have to evaluate these two parameters, and the # of
> > > features, across a range to determine what's best.
> > >
> > > Is this data set not a bunch of audio features? I am not sure it works
> > for
> > > ALS, not naturally at least.
> > >
> > >
> > > On Mon, Mar 18, 2013 at 12:39 PM, Han JU 
> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm wondering has someone tried ParallelALS with implicite feedback
> job
> > on
> > >> million song dataset? Some pointers on alpha and lambda?
> > >>
> > >> In the paper alpha is 40 and lambda is 150, but I don't know what are
> > their
> > >> r values in the matrix. They said is based on time units that users
> have
> > >> watched the show, so may be it's big.
> > >>
> > >> Many thanks!
> > >> --
> > >> *JU Han*
> > >>
> > >> UTC   -  Université de Technologie de Compiègne
> > >> * **GI06 - Fouille de Données et Décisionnel*
> > >>
> > >> +33 061960
> > >>
> > >
> >
> >
>
>
> --
> *JU Han*
>
> Software Engineer Intern @ KXEN Inc.
> UTC   -  Université de Technologie de Compiègne
> * **GI06 - Fouille de Données et Décisionnel*
>
> +33 061960
>


Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
I'm not sure what you mean. The only things I am suggesting you combine are
the two similarity metrics, not data or recommendations.
You combine metrics by multiplying their values.


On Mon, Mar 18, 2013 at 12:54 PM, Agata Filiana wrote:

> In this case, would be correct if I somehow "loop" through the item data
> and the hobby data and then combine the score for a pair of users?
>
> I am having trouble in how to combine both similarity into one metric,
> could you possibly point me out a clue?
>
> Thank you
>
> On 18 March 2013 14:54, Sean Owen  wrote:
>
> > There is a difference between the recommender and the similarity metric
> it
> > uses. My suggestion was to either use your item data with the recommender
> > and hobby data with the similarity metric, or, use both in the similarity
> > metric by making a combined metric.
> >
> >
> > On Mon, Mar 18, 2013 at 9:44 AM, Agata Filiana  > >wrote:
> >
> > > I understand how it works logically. However I am having problem
> > > understanding about the implementation of it and how to get the final
> > > outcome.
> > > Say the user's attribute is Hobbies: hobby1,hobby2,hobby3
> > > So I would make the similarity metric of the users and hobbies.
> > >
> > > Then for the CF, using Mahout's GenericBooleanPrefUserBasedRecommender
> > with
> > > the boolean data set (userID and itemID).
> > >
> > > Then somehow combine the two?
> > >
> > > But at the end, my goal is to recommend the items in the second data
> set
> > > (the itemID, not recommend the hobbies) - does this make sense? Or am I
> > > confusing myself?
> > >
> > > Agata
> > >
> > >
> > > On 18 March 2013 14:23, Sean Owen  wrote:
> > >
> > > > You would have to make up the similarity metric separately since it
> > > depends
> > > > entirely on how you want to define it.
> > > > The part of the book you are talking about concerns rescoring, which
> is
> > > not
> > > > the same thing.
> > > > Combine the similarity metrics, I mean, not make two recommenders.
> > Make a
> > > > metric that is the product of two other metrics. Normalize both of
> > those
> > > > metrics to the range [0,1].
> > > >
> > > > Sean
> > > >
> > > >
> > > > On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana <
> a.filian...@gmail.com
> > > > >wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thank Sean for the response. I like the idea of multiplying the
> > > > similarity
> > > > > metric based on
> > > > > user properties with the one based on CF data.
> > > > > I understand that I have to create a seperate similarity metric -
> > can I
> > > > do
> > > > > this with the help of Mahout or does this have to be done
> seperately,
> > > as
> > > > in
> > > > > I have to implement my own similarity measure? It would be great if
> > > there
> > > > > is some clue on how I get this started.
> > > > > Is this somehow similar to the subject of *Injecting
> domain-specific
> > > > > information* in the book Mahout in Action (with the example of the
> > > > > gender-based item similarity metric)?
> > > > >
> > > > > And also how can I multiply the two results - will this affect the
> > > result
> > > > > of the evaluation of the recommender system? Or it should be
> > normalized
> > > > in
> > > > > a way?
> > > > >
> > > > > Thank you and sorry for the basic questions.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Agata Filiana
> > > > >
> > > > >
> > > > > On 16 March 2013 13:41, Sean Owen  wrote:
> > > > >
> > > > > > There are many ways to think about combining these two types of
> > data.
> > > > > >
> > > > > > If you can make some similarity metric based on age, gender and
> > > > > interests,
> > > > > > then you can use it as the similarity metric in
> > > > > > GenericBooleanPrefUserBasedRecommender. You would be using both
> > data
> > > > sets
> > > > > > in some way. Of course this means learning a whole different
> > > similarity
> > > > > > metric somehow. A variant on this is to make a similarity metric
> > > based
> > > > on
> > > > > > user properties, and also use one based on CF data, and multiply
> > them
> > > > > > together to make a new combined similarity metric for this
> > approach.
> > > > This
> > > > > > might work OK.
> > > > > >
> > > > > > It can also work to treat age and gender and other features as
> > > > > categorical
> > > > > > features, and then model them as 'items' that the user interacts
> > > with.
> > > > > They
> > > > > > would not have much of an effect here given how many items there
> > are.
> > > > In
> > > > > > other models like ALS-WR you can weight these pseudo-items much
> > more
> > > > > highly
> > > > > > and get the desired effect to a degree.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana <
> > > a.filian...@gmail.com
> > > > > > >wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm fairly new to Mahout. Right now I am experimenting Mahout
> by
> > > > trying
> > > > > > to
> > > > > > > build a simple recommendation system.