Re: OnlineLogisticRegression: Are my settings sensible

2013-11-08 Thread Andreas Bauer
Ok, I'll have a look. Thanks! I know Mahout is intended for large-scale 
machine learning, but I guess it shouldn't have problems with such small data 
either.



Ted Dunning wrote:
>On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer  wrote:
>
>> Hi,
>>
>> Thanks for your comments.
>>
>> I modified the examples from the Mahout in Action book, therefore I used
>> the hashed approach and that's why I used 100 features. I'll adjust the
>> number.
>>
>
>Makes sense.  But the book was doing sparse features.
>
>
>
>> You say that I'm using the same CVE for all features, so you mean I
>> should create 12 separate CVEs for adding features to the vector like
>> this?
>>
>
>Yes.  Otherwise you don't get different hashes.  With a CVE, the hashing
>pattern is generated from the name of the variable.  For a word encoder,
>the hashing pattern is generated by the name of the variable (specified at
>construction of the encoder) and the word itself (specified at encode
>time).  Text is just repeated words, except that the weights aren't
>necessarily linear in the number of times a word appears.
>
>In your case, you could have used a goofy trick with a word encoder where
>the "word" is the variable name and the value of the variable is passed as
>the weight of the word.
>
>But all of this hashing is really just extra work for you.  Easier to just
>pack your data into a dense vector.
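To make the dense-vector suggestion concrete, here is a minimal sketch (a plain Java `double[]` stands in for Mahout's `DenseVector`, and the feature values are made up; the 13-slot layout with a bias constant in slot 0 follows the suggestion above):

```java
// Sketch: skip hashed encoding entirely and pack continuous features
// into a dense vector, with a constant bias term in slot 0.
public class DensePacker {
    static double[] pack(double[] features) {
        double[] v = new double[features.length + 1];
        v[0] = 1.0;  // constant bias / intercept term
        System.arraycopy(features, 0, v, 1, features.length);
        return v;
    }

    public static void main(String[] args) {
        // Three made-up feature values; a real sample would have twelve.
        double[] v = pack(new double[]{0.3, -1.2, 0.7});
        System.out.println(v[0] + " " + v[1] + " " + v.length);  // 1.0 0.3 4
    }
}
```

Each feature then gets its own fixed slot, so nothing can collide or sum together.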
>
>
>> Finally, I thought online logistic regression meant that it is an online
>> algorithm so it's fine to train only once. Does it mean I should invoke
>> the train method over and over again with the same training sample until
>> the next one arrives, or how should I make the model converge (or at
>> least try to with the few samples)?
>>
>
>What online really implies is that training data is measured in terms of
>number of input records instead of in terms of passes through the data.  To
>converge, you have to see enough data.  If that means you need to pass
>through the data several times to fool the learner ... well, it means you
>have to pass through the data several times.
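A toy illustration of "several passes" with an online learner (this is a hand-rolled one-feature SGD logistic learner with made-up data and learning rate, not Mahout's OnlineLogisticRegression; the point is only that the same buffered samples are replayed over many epochs):

```java
// A single pass over six samples would leave this learner nearly untrained;
// replaying the same buffer for many epochs lets it converge.
public class EpochTrainer {
    double w = 0.0, b = 0.0;  // weight and intercept
    final double rate = 0.5;  // made-up fixed learning rate

    double predict(double x) {
        return 1.0 / (1.0 + Math.exp(-(w * x + b)));
    }

    void train(double x, int target) {
        double err = target - predict(x);  // SGD step on logistic loss
        w += rate * err * x;
        b += rate * err;
    }

    public static void main(String[] args) {
        double[] xs = {-2.0, -1.5, -1.0, 1.0, 1.5, 2.0};  // buffered samples
        int[] ys = {0, 0, 0, 1, 1, 1};
        EpochTrainer t = new EpochTrainer();
        for (int epoch = 0; epoch < 200; epoch++)  // several passes
            for (int i = 0; i < xs.length; i++)
                t.train(xs[i], ys[i]);
        System.out.println(t.predict(2.0) > 0.9 && t.predict(-2.0) < 0.1);  // true
    }
}
```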
>
>Some online learners are exact in that they always have the exact result at
>hand for all the data they have seen.  Welford's algorithm for computing
>sample mean and variance is like that.  Others approximate an answer.  Most
>systems which are estimating some property of a distribution are
>necessarily approximate.  In fact, even Welford's method for means is
>really only approximating the mean of the distribution based on what it has
>seen so far.  It happens that it gives you the best possible estimate so
>far, but that is just because computing a mean is simple enough.  With
>regularized logistic regression, the estimation is trickier and you can
>only say that the algorithm will converge to the correct result eventually,
>rather than say that the answer is always as good as it can be.
>
>Another way to say it is that the key property of on-line learning is that
>the learning takes a fixed amount of time and no additional memory for each
>input example.
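Welford's algorithm mentioned above is a compact example of that property: each sample costs constant time, the state is just three numbers, and yet the running mean and variance are exact for the data seen so far. A small sketch:

```java
// Welford's online algorithm: running mean and sample variance updated
// in constant time and constant memory per observation.
public class Welford {
    long n = 0;
    double mean = 0.0, m2 = 0.0;  // m2 = sum of squared deviations

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);  // note: uses the updated mean
    }

    double variance() {
        return n > 1 ? m2 / (n - 1) : 0.0;
    }

    public static void main(String[] args) {
        Welford w = new Welford();
        for (double x : new double[]{2, 4, 4, 4, 5, 5, 7, 9}) w.add(x);
        // Exact sample mean is 5, exact sample variance is 32/7.
        System.out.println(Math.abs(w.mean - 5.0) < 1e-9
                && Math.abs(w.variance() - 32.0 / 7) < 1e-9);  // true
    }
}
```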
>
>
>> What would you suggest to use for incremental training instead of OLR?
>> Is Mahout perhaps the wrong library?
>>
>
>Well, for thousands of examples, anything at all will work quite well, even
>R.  Just keep all the data around and fit the data whenever requested.
>
>Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
>learner.  A quick experiment indicates that it will handle 200K samples of
>the sort you are looking at in about a second, with multiple levels of
>lambda thrown into the bargain.  Versions are available in R, Matlab and
>Fortran (at least).
>
>http://www-stat.stanford.edu/~tibs/glmnet-matlab/
>
>This kind of in-memory, single-machine problem is just not what Mahout is
>intended to solve.



Re: OnlineLogisticRegression: Are my settings sensible

2013-11-07 Thread Andreas Bauer
Hi,

Thanks for your comments.

I modified the examples from the Mahout in Action book, therefore I used the 
hashed approach and that's why I used 100 features. I'll adjust the number.

You say that I'm using the same CVE for all features, so you mean I should 
create 12 separate CVEs for adding features to the vector like this?


BIAS.addToVector((byte[]) null, 1, denseVector);

this.cve1.addToVector((byte[]) null,
sample.getFeatureValue1(), denseVector);
...
this.cve12.addToVector((byte[]) null,
sample.getFeatureValue12(), denseVector);

The 12/15 is only a typo; it should be getFeatureValue12.

Finally, I thought online logistic regression meant that it is an online 
algorithm so it's fine to train only once. Does it mean I should invoke the 
train method over and over again with the same training sample until the next 
one arrives, or how should I make the model converge (or at least try to with 
the few samples)?

What would you suggest to use for incremental training instead of OLR?  Is 
Mahout perhaps the wrong library?

Many thanks,

Andreas



Ted Dunning wrote:
>Why is FEATURE_NUMBER != 13?
>
>With 12 features that are already lovely and continuous, just stick them in
>elements 1..12 of a 13-long vector and put a constant value at the
>beginning of it.  Hashed encoding is good for sparse stuff, but confusing
>for your case.
>
>Also, it looks like you only pass through the (very small) training set
>once.  The OnlineLogisticRegression is unlikely to converge very well with
>such a small number of examples.
>
>Finally, in the hashed representation that you are using, you use exactly
>the same CVE to put all 15 (12?) of the variables into the vector.  Since
>you are using the same CVE, all of these values will be put into exactly
>the same location, which is going to kill performance since you will get
>the effect of summing all your variables together.
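A toy illustration of why reusing one encoder collapses the features (the hashing here is invented for the sketch and Mahout's real probing scheme differs, but the name-determines-slot behavior is the point):

```java
// In hashed encoding, the slot index is derived from the encoder's name,
// so two values encoded under the same name land in, and sum into, the
// same slot of the feature vector.
public class HashCollision {
    static void addToVector(String name, double value, double[] vec) {
        int slot = Math.abs(name.hashCode()) % vec.length;  // index depends only on the name
        vec[slot] += value;                                 // same name => same slot => values sum
    }

    public static void main(String[] args) {
        double[] same = new double[100];
        addToVector("cve", 0.3, same);
        addToVector("cve", 0.5, same);       // collides with the first feature

        double[] distinct = new double[100];
        addToVector("cve1", 0.3, distinct);  // different names land in
        addToVector("cve2", 0.5, distinct);  // different slots

        int nonZeroSame = 0, nonZeroDistinct = 0;
        for (double x : same) if (x != 0) nonZeroSame++;
        for (double x : distinct) if (x != 0) nonZeroDistinct++;
        System.out.println(nonZeroSame + " " + nonZeroDistinct);  // 1 2
    }
}
```

With one shared encoder, twelve features collapse into a single summed value, which is why the model can no longer tell them apart.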
>
>
>
>
>
>On Thu, Nov 7, 2013 at 1:48 PM, Andreas Bauer  wrote:
>
>> Hi,
>>
>> I’m trying to use OnlineLogisticRegression for a two-class
>classification
>> problem, but as my classification results are not very good, I wanted
>to
>> ask for support to find out if my settings are correct and if I’m
>using
>> Mahout correctly. Because if I’m doing it correctly then probably my
>> features are crap...
>>
>> In total I have 12 features. All are continuous values and all are
>> normalized/standardized (this has no effect on the classification
>> performance at the moment).
>>
>> Training samples keep flowing in at a constant rate (i.e. incremental
>> training), but in total it won't be more than a few thousand (class split
>> pos/negative 30:70).
>>
>> My performance measures do not really get good, e.g. with approx. 3600
>> training samples I get
>>
>> f-measure(beta=0.5): 0.38
>> precision: 0.33
>> recall: 0.47
>>
>> The parameters I use are
>>
>> lambda=0.0001
>> offset=1000
>> alpha=1
>> decay_exponent=0.9
>> learning_rate=50
>>
>>
>> FEATURE_NUMBER = 100;
>> CATEGORIES_NUMBER = 2;
>>
>>
>>
>> Java code snip:
>>
>> private OnlineLogisticRegression olr;
>> private ContinuousValueEncoder continousValueEncoder;
>>
>> private static final FeatureVectorEncoder BIAS = new
>> ConstantValueEncoder("Intercept");
>>
>> …
>> public Training() {
>>olr = new OnlineLogisticRegression(CATEGORIES_NUMBER,
>> FEATURE_NUMBER,new L1()); //L2 or ElasticBandPrior do not affect the
>> performance
>>
>>
>olr.lambda(lambda).learningRate(learning_rate).stepOffset(offset).decayExponent(decay_exponent);
>>this.continousValueEncoder = new
>> ContinuousValueEncoder("ContinuousValueEncoder");
>>this.continousValueEncoder.setProbes(20);
>>   ….
>> }
>>
>>
>> public void train(TrainingSample sample, int target){
>> DenseVector denseVector = new DenseVector(FEATURE_NUMBER);
>> //sample.getFeatureValue1-15() return a double value
>> this.continousValueEncoder.addToVector((byte[]) null,
>> sample.getFeatureValue1(), denseVector);
>> ….
>> this.continousValueEncoder.addToVector((byte[]) null,
>> sample.getFeatureValue15(), denseVector);
>> BIAS.addToVector((byte[]) null, 1, denseVector);
>> olr.train(target, denseVector);
>> }
>>
>> It is also interesting to notice that when I use the model, both testing
>> and classification always yield probabilities of 1.0 or 0.99xxx for
>> either class.
>>
>> result = this.olr.classifyFull(input);
>> LOGGER.debug("TrainingSink test: classify real category:"
>> + realCategory + " olr classifier result: "
>> + result.maxValueIndex() + " prob: " + result.maxValue());
>>
>>
>>
>>
>> Would be great if you could give me some advice.
>>
>> Many thanks,
>>
>> Andreas
>>
>>
>>


OnlineLogisticRegression: Are my settings sensible

2013-11-07 Thread Andreas Bauer
Hi,  

I’m trying to use OnlineLogisticRegression for a two-class classification 
problem, but as my classification results are not very good, I wanted to ask 
for support to find out if my settings are correct and if I’m using Mahout 
correctly. Because if I’m doing it correctly then probably my features are 
crap...

In total I have 12 features. All are continuous values and all are 
normalized/standardized (this has no effect on the classification performance 
at the moment).

Training samples keep flowing in at a constant rate (i.e. incremental 
training), but in total it won't be more than a few thousand (class split 
pos/negative 30:70).

My performance measures do not really get good, e.g. with approx. 3600 
training samples I get

f-measure(beta=0.5): 0.38
precision: 0.33
recall: 0.47

The parameters I use are

lambda=0.0001
offset=1000
alpha=1
decay_exponent=0.9
learning_rate=50


FEATURE_NUMBER = 100;
CATEGORIES_NUMBER = 2;



Java code snip:

private OnlineLogisticRegression olr;
private ContinuousValueEncoder continousValueEncoder;

private static final FeatureVectorEncoder BIAS = new 
ConstantValueEncoder("Intercept");

…
public Training() {
   olr = new OnlineLogisticRegression(CATEGORIES_NUMBER, FEATURE_NUMBER,new 
L1()); //L2 or ElasticBandPrior do not affect the performance
   
olr.lambda(lambda).learningRate(learning_rate).stepOffset(offset).decayExponent(decay_exponent);
   this.continousValueEncoder = new 
ContinuousValueEncoder("ContinuousValueEncoder");
   this.continousValueEncoder.setProbes(20);
  ….
}


public void train(TrainingSample sample, int target){
DenseVector denseVector = new DenseVector(FEATURE_NUMBER);
//sample.getFeatureValue1-15() return a double value
this.continousValueEncoder.addToVector((byte[]) null, 
sample.getFeatureValue1(), denseVector);
….
this.continousValueEncoder.addToVector((byte[]) null, 
sample.getFeatureValue15(), denseVector);
BIAS.addToVector((byte[]) null, 1, denseVector);
olr.train(target, denseVector);
}

It is also interesting to notice that when I use the model, both testing and 
classification always yield probabilities of 1.0 or 0.99xxx for either class.

result = this.olr.classifyFull(input);
LOGGER.debug("TrainingSink test: classify real category:"
+ realCategory + " olr classifier result: "
+ result.maxValueIndex() + " prob: " + result.maxValue());




Would be great if you could give me some advice.

Many thanks,

Andreas