I am running adaptive logistic regression on a data set of 250k training examples for click-through rate prediction (this sample contains 350 clicks). To start out, I am trying each feature on its own to see how strongly it correlates with the clicks. I have two problems:

First, my results are not consistent. I run my program with the same input and configuration back to back, but the results it produces vary a lot. Sometimes my weights are around -3.3xxxx (which makes the most sense), sometimes around the -1.xxxx mark, but mostly around 0.000xx.

Second, when I use one of my simple features with three categories and compare the regression results with the actual rates, the results sometimes do not correlate. The coefficients usually favor the wrong features. And sometimes, when the order is okay, the suggested results seem to overestimate the actual ones.

I have tried:
1) changing the number of passes between 1 and 20 (as far as I have learned so far, with my data set size, 1 pass should theoretically be enough for adaptive logistic regression)
2) playing with the window size and interval (I'm not exactly sure how these are supposed to affect the results; larger window and interval sizes seemed to produce better results up to a certain point: window size 5000, interval 8000)
3) shuffling the data set before each pass, which didn't really change the results
4) downsampling the non-click samples, which made things even worse

My questions are:
Is it normal that I get inconsistent results even though I don't have any random part on my side of the code?
Can this be happening because my data is too sparse?
What else can I try tweaking?
Can you think of anything I might be missing?

Thank you,
Seda
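
The comparison between regression coefficients and actual rates mentioned above can be sketched as follows. This is only a minimal illustration, not the code used in this post: the totals (250k examples, 350 clicks) come from the question, but the three category names and their per-category counts are made-up placeholders chosen to add up to those totals. It assumes an unregularized model with an intercept and one indicator per non-reference category, in which case the fitted intercept should approach the log-odds of the reference category and each coefficient should approach the difference in log-odds from it. With regularization the coefficients would be pulled toward zero, so an exact match should not be expected.

```python
import math


def log_odds(clicks, impressions):
    """Return log(p / (1 - p)) for a click rate given as raw counts."""
    if clicks <= 0 or clicks >= impressions:
        raise ValueError("log-odds is undefined when the rate is 0 or 1")
    return math.log(clicks / (impressions - clicks))


# Totals taken from the question.
TOTAL_EXAMPLES = 250_000
TOTAL_CLICKS = 350

# Log-odds of the overall click rate: roughly what an intercept-only
# model would be expected to learn.
baseline = log_odds(TOTAL_CLICKS, TOTAL_EXAMPLES)

# HYPOTHETICAL (clicks, impressions) per category. These are placeholder
# values for illustration only, not data from the question.
category_counts = {
    "a": (100, 100_000),
    "b": (150, 100_000),
    "c": (100, 50_000),
}

# Use category "a" as the reference level (an arbitrary choice).
reference = log_odds(*category_counts["a"])

# Expected coefficient for each category relative to the reference.
offsets = {
    name: log_odds(clicks, impressions) - reference
    for name, (clicks, impressions) in category_counts.items()
}

if __name__ == "__main__":
    print(f"baseline log-odds: {baseline:.4f}")
    for name, value in offsets.items():
        print(f"category {name}: offset {value:+.4f}")
```

If the ordering of the fitted coefficients disagrees with the ordering of these offsets on the real counts, that points at the training procedure rather than at the feature itself.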
