Re: Click probability prediction using Mahout. From model output to probability

2012-12-27 Thread Johannes Schulte
Hi Pavel, first of all, I would include an intercept term in the model. The intercept learns the proportion of positive examples in the training set. Second, for getting calibrated probabilities out of the downsampled model, I can think of two ways: 1. Use another set of input data to measure the observed maximum
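For readers following the thread, here is a minimal sketch of one common way to map scores from a model trained on downsampled negatives back to the original scale. It is not necessarily either of the two ways Johannes had in mind, since the message is cut off. The names `recalibrate`, `corrected_intercept`, and `keep_rate` are hypothetical, and the sketch assumes only negatives were downsampled, uniformly at random, keeping a fraction `keep_rate` of them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def recalibrate(p_sampled, keep_rate):
    """Map a probability predicted on downsampled data to the original scale.

    Assumption: only negatives were downsampled, each kept with probability
    keep_rate, so the odds on the sampled data are 1 / keep_rate times the
    odds on the original data.
    """
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


def corrected_intercept(intercept, keep_rate):
    """Equivalent fix for a logistic model: shift the intercept by log(keep_rate)."""
    return intercept + math.log(keep_rate)


# A score of 0.5 from a model trained with 1% of the negatives kept
p = recalibrate(0.5, 0.01)
```

Both functions express the same odds correction, so shifting the intercept once and scoring with the shifted model should give the same probability as recalibrating each prediction.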

Re: Click probability prediction using Mahout. From model output to probability

2012-12-27 Thread Johannes Schulte
Oops, hit enter too early... Just wanted to say that those are the two ways I'm thinking of right now, since I have a similar challenge. I'm thankful for any suggestions or comments. Cheers, Johannes On Thu, Dec 27, 2012 at 3:13 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Hi Pavel,

Re: Vectorizing 20 newsgroups

2012-12-27 Thread Ted Dunning
Random low-dimensional projections tend to look like normal distributions. This is the law of large numbers at work. I think it is hard to diagnose anything from this. On the other hand, projections against the principal components tend to show more structure. On Thu, Dec 27, 2012 at 11:53 AM,
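The effect described above can be illustrated with a small experiment. This is only a sketch on synthetic data, not on the 20 newsgroups vectors: it assumes independent binary features, projects them onto one random direction, and compares excess kurtosis (0 for a normal distribution) before and after. The helper names are hypothetical.

```python
import random


def excess_kurtosis(xs):
    """Excess kurtosis from population moments; 0 for a normal distribution."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


def random_projection(points, rng):
    """Project each point onto a single random Gaussian direction."""
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [sum(w * x for w, x in zip(direction, p)) for p in points]


rng = random.Random(0)
# Synthetic, clearly non-normal data: 2000 points with 200 binary features each
points = [[float(rng.random() < 0.5) for _ in range(200)] for _ in range(2000)]

raw_kurt = excess_kurtosis([p[0] for p in points])             # one raw coordinate
proj_kurt = excess_kurtosis(random_projection(points, rng))    # random projection
```

A single binary coordinate is far from normal, while the projection mixes many coordinates and should come out much closer to normal, which is why such plots reveal little.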

Re: Click probability prediction using Mahout. From model output to probability

2012-12-27 Thread Ted Dunning
This paper is probably of interest for this problem: http://research.microsoft.com/apps/pubs/default.aspx?id=122779 On Thu, Dec 27, 2012 at 6:14 AM, Johannes Schulte johannes.schu...@gmail.com wrote: Oops, hit enter too early... Just wanted to say that those are the two ways I'm thinking

Re: time based price predictions

2012-12-27 Thread Ted Dunning
You have a sort of regression problem here. Add features of each item if you can. Then add day-of-week, weekend or holiday features. Fit your regression. Can you say how large your data is? On Thu, Dec 27, 2012 at 7:26 AM, Matt Mitchell goodie...@gmail.com wrote: I'm looking for a way to
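The calendar features suggested above could be built roughly as follows. This is a sketch only: the function name `calendar_features` and the `HOLIDAYS` set are hypothetical, and the holiday list is a placeholder that would have to come from the actual market or region.

```python
from datetime import date

# Hypothetical holiday list; replace with the calendar relevant to the data
HOLIDAYS = {date(2012, 12, 25), date(2013, 1, 1)}


def calendar_features(d, holidays=HOLIDAYS):
    """Return day-of-week, weekend, and holiday indicators for one date."""
    # One-hot day of week; date.weekday() returns 0 for Monday through 6 for Sunday
    features = {"dow_%d" % i: 0.0 for i in range(7)}
    features["dow_%d" % d.weekday()] = 1.0
    features["is_weekend"] = 1.0 if d.weekday() >= 5 else 0.0
    features["is_holiday"] = 1.0 if d in holidays else 0.0
    return features


row = calendar_features(date(2012, 12, 27))
```

These indicators would be concatenated with the per-item features before fitting. With an intercept in the model, one day-of-week column is redundant, so dropping one or using regularization is a reasonable choice.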

Re: Vectorizing 20 newsgroups

2012-12-27 Thread Ted Dunning
I have fixed the vectorizer in knn. Available from [0] as org.apache.mahout.knn.Vectorize20NewsGroups. Typical invocation would have these command line options: lic subject false 1000 ../20news-bydate-train These are: - term weighting code. lic = log(tf) * IDF with cosine normalization -
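As an illustration of the "lic" weighting named above (log term frequency, times IDF, with cosine normalization), here is a small sketch. It is not the code in Vectorize20NewsGroups, whose exact formulas are not shown in this message. It assumes 1 + log(tf) for the term-frequency part, so that a term occurring once is not zeroed out, and log(N / df) for IDF. The function name `lic_vectors` is hypothetical.

```python
import math
from collections import Counter


def lic_vectors(docs):
    """Weight tokenized documents by (1 + log tf) * log(N / df), then L2-normalize.

    docs is a list of token lists; the result is one {term: weight} dict per doc.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        weights = {t: (1.0 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        if norm > 0.0:
            weights = {t: v / norm for t, v in weights.items()}
        vectors.append(weights)
    return vectors


vecs = lic_vectors([["a", "b", "b"], ["a", "c"]])
```

In this toy input the term "a" appears in every document, so its IDF is zero and each vector keeps only its distinguishing term.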