Hi all,

I am on a long weekend in the mountains, but the weather is crappy, and
I am in the valley rather than in a hut. So I am quickly sending this
email, which has been sitting in my mailbox for a while. Food for thought,
I hope. I'll debrief this on Wednesday, when I land at work, and try to
catch up with everything.

Jacques, Alex and I spent a little while looking at the scale_C issue in
more detail. We are not finished investigating it, but our ideas are
starting to become clearer, and it appears that we have been saying a
lot of bullshit. Let me try to summarize the issues, where we stand in
terms of understanding, and what we think the next steps could be.

The setting
============

To fix the notation and the problem, what we are interested in is a
risk minimization that can be written as:

estimates = argmin sum_samples loss(sample, model parameters)
                  + alpha * penalization(model parameters)

Here, I have called 'loss' the individual error per sample. Thus, as you
can see, the data-fit term, which is the sum of the errors over samples,
grows as we add more samples. By contrast, the penalization term does
not grow.
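To make this concrete, here is a toy numpy sketch (squared loss, l2 penalty, arbitrary fixed weights; nothing from our actual experiments) showing that the data-fit term grows with n_samples while the penalty term stays put:

```python
import numpy as np

rng = np.random.RandomState(0)
w = rng.randn(5)  # arbitrary fixed model parameters
alpha = 1.0

results = {}
for n_samples in (100, 1000):
    X = rng.randn(n_samples, 5)
    y = X.dot(w) + rng.randn(n_samples)
    data_fit = np.sum((X.dot(w) - y) ** 2)  # sum over samples: grows with n
    penalty = alpha * np.sum(w ** 2)        # fixed: does not depend on n
    results[n_samples] = (data_fit, penalty)
```

With 10x more samples the data-fit term is roughly 10x larger, while the penalty is unchanged, so the effective amount of regularization shrinks.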

The different problems at stake
===============================

* The core problem that bothers Alex is that, as we do cross-validation
  to set the amount of regularization 'alpha', the number of samples
  differs between the problems that we use for model selection and the
  final problem on which we want to do training. A question is then:

    How do we optimally adjust 'alpha' to account for a different number
    of training samples?

* A problem that bothers me is that, between the various folds, and when
  switching to the full training data, the properties of the data, such
  as the conditioning of the design matrix, may vary. This happens to me
  when I deal with very crappy data.
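To illustrate that second point, here is a small numpy sketch (a made-up, nearly collinear design, nothing real) showing that the conditioning of the design matrix differs between the CV folds' training sets and the full data:

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples = 50
x0 = rng.randn(n_samples)
# Two nearly collinear columns -> an ill-conditioned design matrix
X = np.column_stack([x0, x0 + 1e-3 * rng.randn(n_samples)])

full_cond = np.linalg.cond(X)

# Condition number of the training design of each of 5 CV folds
fold_conds = []
for k in range(5):
    train = np.ones(n_samples, dtype=bool)
    train[k::5] = False  # hold out every 5th sample starting at k
    fold_conds.append(np.linalg.cond(X[train]))
```

The condition numbers of the per-fold designs differ from each other and from the full design, so an 'alpha' tuned on the folds is not tuned for the same effective problem as the final fit.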

Possible solutions and some insights
=====================================

* As the number of samples varies, to keep the ratio of data fit to
 regularization constant in the optimization problem above, the penalty
 parameter should increase as n_samples.

 It turns out that, for l2 penalizations, theory tells us that for
 prediction consistency (i.e. that, under given hypotheses, the
 estimator learned predicts as well as a model knowing the true
 distribution) the penalty parameter should be kept constant as the
 number of samples grows. In other words, scale_C=False, the
 libSVM/liblinear choice. Alex and I were wrong, and I must apologize
 for it.

 For l1 penalizations, the theory says that prediction consistency is
 not possible, because of the bias of the l1 penalty. However, it does
 say that model consistency, in terms of finding the right set of
 non-zero parameters as well as their signs, is possible, and is
 achieved with alpha scaling as sqrt(n).

 Simulations, as done by Jacques following Andy's work
 (https://gist.github.com/2470665), show that for l2 models the theory
 is well verified: the libSVM/liblinear choice is the right one. For l1
 models, it seems that scaling as 1/n (scale_C=True) works perfectly,
 which is still a bit mysterious to us... A bit more work remains to be
 done here to fully understand what is going on.

 Results for SVM can be seen here:
    http://www.flickr.com/photos/79425020@N04/7118421803/in/photostream/
    http://www.flickr.com/photos/79425020@N04/6972343692/in/photostream/
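 The l2 part of the check is easy to reproduce; here is a quick sketch
 (my own toy separable data and settings, not Jacques's gist) where C is
 kept constant as n grows, and the test accuracy of a LinearSVC
 stabilizes near perfect, which is what prediction consistency predicts:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
w_true = rng.randn(10)

def make_data(n):
    """Noiseless linearly separable labels from a fixed hyperplane."""
    X = rng.randn(n, 10)
    y = (X.dot(w_true) > 0).astype(int)
    return X, y

X_test, y_test = make_data(5000)
scores = {}
for n in (100, 1000, 10000):
    X, y = make_data(n)
    clf = LinearSVC(C=1.0).fit(X, y)  # C held constant, no scaling with n
    scores[n] = clf.score(X_test, y_test)
```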


* In practice, if we are interested in solving problem 2, I have found
 empirically that what works really well is to scale the regularization
 parameter by a data-dependent factor. For l1 penalties, the alpha_max
 (aka C_min) above which the model has no non-zero coefficients appears
 to work really well. For l2 penalties, this scaling seems less
 important, but when I have used it, I have used heuristics such as the
 max singular value of np.dot(X.T, X), or its mean.
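 To be concrete about alpha_max: for the Lasso as formulated in the
 scikit, 1/(2 * n) * ||y - X w||^2 + alpha * ||w||_1 (no intercept here
 to keep it simple), it is the smallest alpha for which the solution is
 all zeros, and it is cheap to compute. A quick sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples, n_features = 200, 50
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)

# Smallest alpha for which all Lasso coefficients are zero
# (from the subgradient condition at w = 0)
alpha_max = np.max(np.abs(X.T.dot(y))) / n_samples

# Just above alpha_max: no active coefficients; below it: some appear
n_nonzero_at = {}
for frac in (1.01, 0.5):
    lasso = Lasso(alpha=frac * alpha_max, fit_intercept=False).fit(X, y)
    n_nonzero_at[frac] = int(np.sum(lasso.coef_ != 0))
```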

Plan of action
===============

It appears clearly to me that the current situation with scale_C in SVMs
is a very bad idea. I apologize for pushing it forward; I was motivated
by my experience with l1 models. I think that we should remove it right
now, as keeping it causes more confusion than anything, and we need to
think of better solutions to address the problems above. Removing it
would solve the problem of prediction consistency for l2-penalized
models (at least in theory; I want more empirical evidence to be
convinced in practice). We found out that there is no scaling by
n_samples in Ridge, for example.
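On the Ridge point: since alpha is not scaled by n_samples there, a fixed
alpha's influence washes out as n grows. A quick sketch (toy data,
fit_intercept=False for simplicity) showing the Ridge solution
approaching the unpenalized least-squares one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
w_true = rng.randn(5)

gaps = {}
for n in (50, 5000):
    X = rng.randn(n, 5)
    y = X.dot(w_true) + rng.randn(n)
    ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)
    ols = LinearRegression(fit_intercept=False).fit(X, y)
    # Distance between the penalized and unpenalized solutions
    gaps[n] = np.linalg.norm(ridge.coef_ - ols.coef_)
```

With 100x more samples, the Ridge estimate is much closer to the OLS
estimate: the constant penalty becomes negligible next to the growing
data-fit term, which is exactly the scale_C=False behavior.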

For l1-penalized models, we could scale alpha by 1/sqrt(n_samples),
although the theory tells us that the choices that optimize recovery
(model consistency) and prediction (prediction consistency) are
different. Another possible problem is that it might surprise the user:
if I am using the lasso to do sparse coding, I want my sparsity to be
independent of the number of samples, and thus to scale alpha by
1/n_samples.
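On that sparse-coding point: the scikit's Lasso objective already divides
the data fit by n_samples (1/(2 * n) * ||y - X w||^2 + alpha * ||w||_1),
so with a fixed alpha the sparsity is indeed roughly independent of n. A
toy sketch (my own synthetic data and alpha choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
w_true = np.zeros(30)
w_true[:3] = 2.0  # only 3 informative features

sparsity = {}
for n in (100, 1000):
    X = rng.randn(n, 30)
    y = X.dot(w_true) + rng.randn(n)
    coef = Lasso(alpha=0.2, fit_intercept=False).fit(X, y).coef_
    sparsity[n] = int(np.sum(coef != 0))
```

In both cases the number of selected features stays close to the 3 true
ones rather than growing with n, which is what a sparse-coding user
expects.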

Empirical scalings by data-dependent terms work very well in practice
for me, but they don't rely on theory. They seem like something that we
should make possible, but only as an option. Also, they raise the
question of how to scale other terms.

Thus, to sum up the paragraphs above, there are several choices that
make sense. There is probably not one good answer for everyone. I am
confused, and I don't know at all what the right behavior or API is. We
could add a keyword argument, but right now I am not sure what its
semantics should be.

In the light of the nearing release, my gut feeling is to avoid doing
anything in the short term: remove scale_C, which is a fiasco as applied
to SVMs with l2 penalties, confuses everybody, and doesn't improve
anything. It's a SERIOUS BUG :}. I'd like to remove it, and not leave a
useless keyword argument in the release. It might be that we add it back
for liblinear, as it can be used with an l1 penalty, but I'd rather not
have a parameter that we might want to remove later and then have to
worry about backward compatibility.

There still is an actual open problem that I am confronted with fairly
often, which is l1-penalized logistic regression. In the current
situation, it does not work well at all. One option that Alex proposed
would be to add a 'SparseLogistic' estimator, formulated in terms of
alpha = n_samples/C. This formulation would be closer to that of the
other sparse linear models, and thus easier on the eye of people who
don't know the dual formulation. I wouldn't mind actually deprecating
'C' in logistic regression to replace it with alpha = 1/C, but that's a
separate issue.

What do people think?

Gael

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
