So, my first thought is that it sounds like you have dense variables rather than sparse, which may affect the behavior of the Mahout system. If you have some text-like features of the ads, you may get cleaner results.
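To make the dense-versus-sparse point concrete: Mahout's SGD classifiers are built around hashed sparse encodings (see the feature encoder classes in org.apache.mahout.vectorizer.encoders, e.g. StaticWordValueEncoder). The sketch below shows the hashing trick in plain Java with no Mahout dependency; the class and method names are hypothetical, not Mahout's API.

```java
import java.util.Arrays;

// Minimal sketch of hashed sparse encoding for categorical or text-like
// features. Each "name=value" pair is hashed to one slot of a large,
// mostly-zero feature vector; collisions simply add together.
public class HashedEncoder {
    private final int numFeatures;

    public HashedEncoder(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Add one categorical value (e.g. region=US, or a word from the ad
    // text) to the feature vector.
    public void addToVector(String featureName, String value, double[] vector) {
        int hash = (featureName + ":" + value).hashCode();
        int index = Math.floorMod(hash, numFeatures);
        vector[index] += 1.0;
    }

    public static void main(String[] args) {
        HashedEncoder enc = new HashedEncoder(1000);
        double[] v = new double[1000];
        enc.addToVector("region", "US", v);
        enc.addToVector("os", "android", v);
        for (String word : "cheap flights to paris".split(" ")) {
            enc.addToVector("adText", word, v);
        }
        // Six values were hashed in; total mass is 6.0 regardless of
        // collisions, because colliding values add rather than overwrite.
        System.out.println(Arrays.stream(v).sum());
    }
}
```

The same trick gives you interaction features almost for free: hash the concatenation of two values (e.g. `addToVector("user_x_ad", userId + "_" + adId, v)`) and it becomes one more sparse feature.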
Secondly, I don't see any interaction features. With as much training data as you have, interactions with user id are probably warranted.

Regarding the predicted click-through, it is very hard to say whether these results are plausible purely from the predicted scores. Logistic regression as used here may or may not produce calibrated scores even when it is working well. In your case we clearly have calibration issues, but I think that a lift chart or Lorenz plot would be more useful for determining whether you are actually getting reasonable results. In general, to do off-line evaluation of an ad-targeting system, you need to include a random component in your current ad-targeting system so that you don't get a grotesquely biased result. So the real question is not so much whether the score accurately predicts click-through, but whether high scores correlate well with clicks (and vice versa).

2012/4/11 LiLeqiang <frelankie_...@hotmail.com>

> Hey guys,
> I have a problem using Mahout's logistic regression classifiers.
> I'm doing some study on ad click prediction. I have a large collection of
> ad impression and click data; its size is in the hundreds of millions. My
> goal is to predict the CTR (click-through rate) of a collection of ad
> creatives, given the id of the ad creative impressed, the region, time, OS
> and device type of the user, network type of the user, and some other
> features.
> To achieve this, I treat all of the above as predictor variables. The id
> of the ad creative is treated as categorical, and region and time (in
> hours) are also categorical.
> My target variable is whether the user would click on the impressed ad
> creative, so it has two values: 0 means would not click, and 1 means would
> click.
> Here is the trick: suppose the model is already trained. When an ad
> request comes in, I first extract the predictor variables from the
> request.
> Then, for each of the advertisements available, I append the id of the
> current ad to the predictors and call the model's classifyScalar method to
> get the probability that the current ad will be clicked, which is my goal:
> the CTRs.
>
> I first used the OnlineLogisticRegression class to do the job, with the
> learning rate initially chosen to be 0.2. I found that when the number of
> training passes is large enough, the model converges, but the CTRs
> predicted by the model are unexpectedly high, over 20% for most ads in my
> test cases. This is unacceptable, since the actual CTR in my data is only
> around 1%-2%. I then tried different learning rates and saw the same
> phenomenon. I see the other parameters of OnlineLogisticRegression, but do
> not know exactly how they work.
> Next, I tried AdaptiveLogisticRegression. I found that with a larger
> averaging window and interval the algorithm performs better, but the final
> result is just like the above: unexpectedly high.
> I thought there must be something wrong, so I tried some feature
> interactions (just concatenating the string values of features to get new
> features; all features are categorical), and the result was even worse.
> There must be something wrong, but I cannot figure it out. My choice of
> target variable may not be suitable, or feature selection may not be done
> properly, or maybe I should try other approaches such as linear
> regression.
> Has anybody encountered a situation like this? I would appreciate it if
> someone could give me some advice.
>
> Thanks a lot,
> Frelankie Lee
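To illustrate the lift-chart evaluation suggested above: rank impressions by predicted score, take the top fraction of traffic, and compare the share of clicks captured there against what a random ranking would capture. A minimal plain-Java sketch (no Mahout dependency; the scores and clicks below are invented for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a single point on a lift chart: what fraction of all clicks
// falls in the top `topFraction` of impressions when sorted by score?
public class LiftChart {
    // scores[i] is the model's predicted CTR; clicks[i] is 1 if clicked.
    static double liftAt(double[] scores, int[] clicks, double topFraction) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort indices by descending score.
        Arrays.sort(order, Comparator.comparingDouble(i -> -scores[i]));

        int totalClicks = Arrays.stream(clicks).sum();
        int cutoff = (int) Math.ceil(topFraction * scores.length);
        int captured = 0;
        for (int k = 0; k < cutoff; k++) captured += clicks[order[k]];

        // Lift = (fraction of clicks captured) / (fraction of traffic kept).
        // A random ranker scores 1.0; higher is better.
        return (captured / (double) totalClicks) / topFraction;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05};
        int[] clicks =    {1,   1,   0,   1,   0,   0,   0,   0,   1,   0};
        // Top 20% of traffic (2 impressions) captures 2 of 4 clicks,
        // so lift is about (2/4) / 0.2 = 2.5.
        System.out.println(liftAt(scores, clicks, 0.2));
    }
}
```

Note that the absolute scores never enter this computation, only their ordering, which is exactly why a lift chart can show the model is useful for ranking ads even when its probabilities are badly calibrated.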