Thanks for the quick response. I stripped out most of the features and only left the features I have experimented with. I changed the category names to 1, 2, 3... or A, B, C and the feature names to f1, f2... . It is still real-world data with some missing features; if needed, I can work on anonymizing the other features as well. I have played around with OnlineLogisticRegression for a bit, changing alpha and lambda, but couldn't quite figure out how to control the learning rate. The learning rate seems to hit 0 after the 70,000th sample. I will spend some more time on it later today. Any suggestions are welcome :).
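For what it's worth, here is a toy sketch of why the rate can collapse to 0. If I remember the Mahout source correctly, OnlineLogisticRegression anneals the step size roughly as a geometric term times a polynomial term, mu0 * alpha^k * (k + stepOffset)^decayExponent; any alpha strictly below 1 decays geometrically, so by step 70,000 it has underflowed to essentially zero no matter what the other terms do. The parameter values below are made up for illustration, not Mahout defaults:

```python
# Hypothetical annealing schedule of the kind I believe Mahout's
# OnlineLogisticRegression uses (parameter values are illustrative only):
#   rate(k) = mu0 * alpha**k * (k + step_offset)**decay_exponent
def rate(k, mu0=1.0, alpha=0.999, decay_exponent=-0.5, step_offset=100):
    return mu0 * alpha**k * (k + step_offset) ** decay_exponent

print(rate(0))        # modest starting rate
print(rate(70_000))   # geometric term alpha**70000 ~ e**-70 has wiped it out
print(rate(70_000, alpha=1.0))  # with alpha=1 only the gentle polynomial decay remains
```

If this reading is right, setting alpha to 1.0 (or very close to it) and controlling decay through decayExponent alone might be one way to keep the rate alive past 70k samples.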
Thanks for your help,
Seda

http://dl.dropbox.com/u/24423903/mahout-anonymized.csv

On Jul 15, 2012, at 1:40 PM, Ted Dunning wrote:

> It is possibly sparseness, but more likely this is the known pathology of
> the adaptive logistic regression, in which it gets over-confident and locks
> down the training rate too early.
>
> I have a few suggestions:
>
> 1) Try the OnlineLogisticRegression. I think that you can find decent
> training parameters pretty easily, and that would avoid the issue I
> mentioned (at some human cost).
>
> 2) Post some anonymized data and I will try a few different techniques and
> post back comparisons. Notably, for small data like this, glmnet is
> probably the gold standard to compare against. You shouldn't need to do
> the step-wise stuff because you will get an entire plot of which variables
> are significant with different amounts of regularization.
>
> If you can do (2), it would be fabulous if you could actually allow use of
> the data as a test case. That would have the highest benefit to you since
> it would mean that Mahout won't ever forget your needs. :-)
>
> On Sun, Jul 15, 2012 at 9:42 AM, Seda Sinangil <[email protected]> wrote:
>
>> I am running adaptive logistic regression on a data set consisting of
>> 250k training examples for click-through rate prediction (in this sample
>> there are 350 clicks). For starting out, I am trying each feature alone by
>> itself to see how much it correlates with the data set. I have two problems.
>>
>> First, my results are not consistent. I run my program with the same input
>> and configuration back to back, but the results it produces vary a lot.
>> Sometimes my weights are around -3.3xxxx (which makes the most sense),
>> sometimes around the -1.xxxx mark, but mostly around 0.000xx.
>>
>> Second, when I use one of my simple features with three categories and
>> compare the regression results with the actual rates, sometimes the results
>> do not correlate. Results usually give coefficients in favor of the wrong
>> features. And sometimes, when the order is okay, the suggested results seem
>> to overestimate the actual ones.
>>
>> I have tried:
>>
>> 1) changing the number of passes between 1 and 20 (as far as I have learned
>> so far, with my data set size, for adaptive logistic regression one pass
>> should theoretically be enough)
>>
>> 2) playing with window size and interval (I'm not exactly sure how these
>> are supposed to impact the results; larger window and interval sizes seemed
>> to produce better results up to a certain point: window size 5000,
>> interval 8000)
>>
>> 3) shuffling the data set before each pass, which didn't really change the
>> results
>>
>> 4) downsampling of non-click samples, which made things even worse
>>
>> My questions are:
>>
>> Is it normal that I get inconsistent results even though I don't have any
>> random part on my side of the code?
>> Can this be happening because my data is too sparse?
>> What else can I try tweaking?
>> Can you think of anything I might be missing?
>>
>> Thank you,
>> Seda
