Thanks for the quick response,

I stripped most of the features and kept only the ones I have experimented 
with. I changed the category names to 1,2,3... or A,B,C... and the feature 
names to f1,f2... . It is still real-world data with some missing features; 
if needed, I can work on anonymizing the other features as well. I have played 
around with online logistic regression for a bit, changing alpha and lambda, 
but couldn't quite figure out how to control the learning rate - it seems to 
hit 0 after the 70000th sample. I will spend some more time on it later today. 
Any suggestions are welcome :).  
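For what it's worth, here is a tiny sketch of what I suspect is happening, assuming a multiplicative per-example annealing schedule rate(n) = mu0 * alpha^n (the names mu0/alpha and the exact formula are my assumption - Mahout's actual schedule may have extra terms):

```python
# Sketch: why a multiplicative learning-rate decay can be ~0 by sample 70000.
# Assumption: rate(n) = mu0 * alpha**n, with mu0 and alpha values made up here
# purely for illustration; Mahout's real annealing formula may differ.

def learning_rate(n, mu0=1.0, alpha=1.0 - 1e-4):
    """Learning rate after n training examples under per-example decay."""
    return mu0 * alpha ** n

for n in (0, 10_000, 70_000):
    print(n, learning_rate(n))
```

If that is roughly the schedule, even a tiny per-example decay drives the rate to effectively zero long before 250k samples, which would match what I'm seeing.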

Thanks for your help,
Seda
http://dl.dropbox.com/u/24423903/mahout-anonymized.csv


On Jul 15, 2012, at 1:40 PM, Ted Dunning wrote:

> It is possibly sparseness, but more likely this is the known pathology of
> the adaptive logistic regression in which it gets overconfident and locks
> down the training rate too early.
> 
> I have a few suggestions:
> 
> 1) try the OnlineLogisticRegression.  I think that you can find decent
> training parameters pretty easily and that would avoid the issue I
> mentioned (at some human cost)
> 
> 2) post some anonymized data and I will try a few different techniques and
> post back comparisons.  Notably, for small data like this, glmnet is
> probably the gold standard to compare against.  You shouldn't need to do
> the step-wise stuff because you will get an entire plot of which variables
> are significant with different amounts of regularization.
> 
> If you can do (2), it would be fabulous if you could actually allow use of
> the data as a test case.  That would have the highest benefit to you since
> it would mean that Mahout won't ever forget your needs.  :-)
> 
> On Sun, Jul 15, 2012 at 9:42 AM, Seda Sinangil <[email protected]> wrote:
> 
>> I am running adaptive logistic regression on a data set consisting of
>> 250k training examples for click-through rate prediction (this sample
>> contains 350 clicks). To start out, I am trying each feature by itself
>> to see how much it correlates with the outcome. I have two problems:
>> 
>> First, my results are not consistent. I run my program with the same input
>> and configuration back to back, but the results it produces vary a lot.
>> Sometimes my weights are around -3.3xxxx (which makes the most sense),
>> sometimes around the -1.xxxx mark, but mostly around 0.000xx.
>> 
>> Second, when I use one of my simple features with three categories and
>> compare the regression results with the actual rates, sometimes the results
>> do not correlate: the coefficients usually favor the wrong features. And
>> sometimes, when the order is okay, the suggested rates seem overestimated
>> compared to the actual ones.
>> 
>> I have tried:
>> 1) changing the number of passes between 1 and 20 (as far as I have
>> learned, with my data set size, one pass should theoretically be enough
>> for adaptive logistic regression)
>> 
>> 2) playing with window size and interval (I'm not exactly sure how these
>> are supposed to impact the results - a larger window and interval seemed
>> to produce better results up to a certain point - window size: 5000,
>> interval: 8000)
>> 
>> 3) shuffling the data set before each pass, which didn't really change
>> the results
>> 
>> 4) downsampling the non-click samples, which made things even worse
>> 
>> My questions are:
>> 
>> Is it normal that I get inconsistent results even though I don't have any
>> random component in my part of the code?
>> Could this be happening because my data is too sparse?
>> What else can I try tweaking?
>> Can you think of anything I might be missing?
>> 
>> Thank you,
>> Seda

