Blog post about setting up a scalable recommender system with mahout

2011-04-21 Thread Sebastian Schelter
Hi, I'm hereby shamelessly advertising a blog post I've written up today about setting up a recommender system with mahout and hadoop :) http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/ --sebastian

Re: Custom analyzers for seq2sparse

2011-04-21 Thread Camilo Lopez
OK, that did work for mahout, thanks! But now hadoop cannot load the class, even when the jar containing it has been added to the hadoop classpath:
hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ echo $HADOOP_CLASSPATH
/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-core-3.

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-21 Thread Ted Dunning
It is definitely a reasonable idea to convert data to hashed feature vectors using map-reduce. And yes, you can pick a vector length that is long enough so that you don't have to worry about collisions. You need to examine your data to decide how large that needs to be, but it isn't hard to do.
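One way to "examine your data" when picking a vector length is to compare the collision rate you would expect under uniform hashing with the rate you actually observe on your distinct feature names. The sketch below is only an illustration in plain Python, not Mahout code: the helper names (`bucket`, `expected_collision_fraction`, `observed_collision_fraction`), the use of MD5 as the hash, and the synthetic `user=<id>` features are all assumptions made for the example.

```python
import hashlib


def bucket(feature: str, size: int) -> int:
    """Map a feature name to an index in [0, size) using a stable hash."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def expected_collision_fraction(n_features: int, size: int) -> float:
    """Expected fraction of distinct features that land in an already
    occupied slot, assuming the hash spreads features uniformly."""
    expected_occupied = size * (1.0 - (1.0 - 1.0 / size) ** n_features)
    return 1.0 - expected_occupied / n_features


def observed_collision_fraction(features, size: int) -> float:
    """Same quantity, measured on an actual (non-empty) set of feature names."""
    distinct = set(features)
    occupied = {bucket(f, size) for f in distinct}
    return 1.0 - len(occupied) / len(distinct)


# Hypothetical data: 10,000 distinct categorical features.
features = [f"user={i}" for i in range(10_000)]
for size in (2 ** 14, 2 ** 17, 2 ** 20):
    print(size,
          round(expected_collision_fraction(len(features), size), 4),
          round(observed_collision_fraction(features, size), 4))
```

If the observed fraction at a candidate length is small enough for your purposes, a longer vector buys little; if it is large, increase the length and measure again.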

Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

2011-04-21 Thread Ted Dunning
This code doesn't look right for category features. Those features are usually described either as strings or as integers. Either case can be handled as strings as long as you don't have any surprises like leading 0's. The best way to handle these features is to encode them using word encoders.
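To make the idea concrete, here is a minimal sketch of hashing a category value as a string into a fixed-length vector. It is written in plain Python and is not Mahout's word encoder; the function name `encode_category`, the `name:value` key format, the MD5 hash, and the `zip` feature are all assumptions made for illustration. It also shows why leading zeros are a surprise: the string "00501" and the integer 501 are the same category only if you normalize them before encoding.

```python
import hashlib


def encode_category(name: str, value: str, vector: list, weight: float = 1.0) -> int:
    """Hash the string "name:value" to one slot of `vector` and add `weight` there.
    Returns the slot index so callers can inspect where the value landed."""
    key = f"{name}:{value}".encode("utf-8")
    index = int.from_bytes(hashlib.md5(key).digest()[:8], "big") % len(vector)
    vector[index] += weight
    return index


vector = [0.0] * 1024

# The same string always lands in the same slot.
first = encode_category("zip", "00501", vector)
second = encode_category("zip", "00501", vector)
print(first == second)

# Parsing the value as an integer drops the leading zeros, so the
# resulting key is a different string and is hashed independently.
as_int_text = str(int("00501"))
print(as_int_text)
encode_category("zip", as_int_text, vector)
```

Treating every category value as a string avoids a numeric parse entirely, provided the same category is always written the same way in the input.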

Re: How could I set a loss function in SGD?

2011-04-21 Thread Ted Dunning
On Tue, Apr 19, 2011 at 11:02 PM, Stanley Xu wrote:
> What make me still a little confused is that, when training the model, I probably knew the errors, could we thought that the penalty I wanted was already counted in a loss function?
It could be, but usually isn't.
> And for weight the

Re: Is any more detailed documentation about the sgd logistic regression example.

2011-04-21 Thread Ted Dunning
The trainlogistic command is (as Stanley says) only a simple example. You will need to write a program something like TrainNewsGroups for your modelers to use. I agree that the API-oriented code in Mahout is not what those users need. It was, however, what my users needed. It would be great if y

Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

2011-04-21 Thread Stanley Xu
Hi Ted, I knew I have to change the encoder and parse it as a String(or byte array). I am wondering even parse it as a byte array, it is still cost lot of time in feature hashing, and both as you said and per hour test, the time spent on feature hashing and parsing are normally dominate the SGD tr

Re: How could I set a loss function in SGD?

2011-04-21 Thread Stanley Xu
Hi Ted, I thought I got it but wanted to confirm once again for I am not a native English speaker. The add weight you mean here is re-define a train method, add a weight parameter and adjust the learning rate of currentLearningRate() with this param. Not the weight parameter already exist in the

Re: How could I set a loss function in SGD?

2011-04-21 Thread Ted Dunning
On Thu, Apr 21, 2011 at 7:09 PM, Stanley Xu wrote:
> The add weight you mean here is re-define a train method, add a weight parameter and adjust the learning rate of currentLearningRate() with this param. Not the weight parameter already exist in the features. Am I correct?
Yes. But you

Re: Anyway to speedup the category feature parsing and encoding in the SGD algorithm?

2011-04-21 Thread Ted Dunning
On Thu, Apr 21, 2011 at 7:05 PM, Stanley Xu wrote:
> Hi Ted,
>
> I knew I have to change the encoder and parse it as a String(or byte array). I am wondering even parse it as a byte array, it is still cost lot of time in feature hashing, and both as you said and per hour test, the time spent