Re: [Scikit-learn-general] GraphLasso pull request and feature

Gael Varoquaux Wed, 09 Nov 2011 08:17:19 -0800

On Wed, Nov 09, 2011 at 10:05:53AM -0500, josef.p...@gmail.com wrote:
> graph_lasso(X,....) takes the data array as an argument, but except
> calculating the empirical_covariance at the beginning X is not used
> anymore, as far as I could see.


> The algorithm looks very interesting, but I would have cases where I
> need to calculate the empirical_covariance myself (e.g. long run
> covariance which is a weighted average of covariance and covariance
> with lags).

> Would it be possible to use an empirical covariance instead of X as
> the main argument, or would you get design inconsistencies?

That's a very good remark, and there are other situations in which it arises.
Indeed, the empirical covariance matrix is a sufficient statistic for the
population covariance matrix in the case of Gaussian models, so there are
many models in which the situation arises, for instance the oracle
approximate shrinkage.
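
To make the sufficiency statement concrete, here is the standard Gaussian log-likelihood, written under the assumption (mine, for brevity) that the n samples x_i are zero-mean or already centered, with p features, precision matrix \Theta = \Sigma^{-1}, and empirical covariance S. The data enter only through S:

```latex
S = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^{\top},
\qquad
\ell(\Theta) = \frac{n}{2} \left[ \log\det\Theta - \operatorname{tr}(S\,\Theta) \right]
             - \frac{n p}{2} \log(2\pi)
```

Any estimator that maximizes this likelihood, with or without a penalty on \Theta, therefore needs S and n but not the individual rows of X.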

On the other hand, some models don't rely on the Gaussian assumption.
Therefore, they use the full X data, and not just the empirical
covariance. The Ledoit-Wolf estimator is one example.

My gut feeling is that the estimator object should really take X by
default, but I don't see why the function itself could not take a
covariance matrix as an input. Of course, people can misuse it, and put
in a shrunk covariance matrix (my guess is that they will), and we just
have to accept it.

Actually, I would almost favor an optional argument to the estimator so
that it can take a covariance matrix as an input. This would be similar
to the behavior of the kernel PCA with kernel='precomputed'. I used to
have a 'data_is_cov' boolean keyword argument in my codebase. I could
turn it into a 'X_is_cov' one.
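
As a rough illustration of what such a flag could look like, here is a minimal sketch. The class name `CovInputSketch` and the helper `empirical_covariance` defined here are hypothetical, written only for this example; this is not the code of the pull request, and the solver itself is left out. Only the input handling for 'X_is_cov' is shown.

```python
import numpy as np


def empirical_covariance(X):
    """Maximum-likelihood covariance of the rows of X (samples x features)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    return X_centered.T @ X_centered / X.shape[0]


class CovInputSketch:
    """Hypothetical estimator skeleton; only the input handling is shown."""

    def __init__(self, X_is_cov=False):
        self.X_is_cov = X_is_cov

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        if self.X_is_cov:
            # X is taken to be a covariance matrix: check shape and symmetry.
            if X.ndim != 2 or X.shape[0] != X.shape[1]:
                raise ValueError("a precomputed covariance must be square")
            if not np.allclose(X, X.T):
                raise ValueError("a precomputed covariance must be symmetric")
            emp_cov = X
        else:
            # X is raw data: compute the empirical covariance from it.
            emp_cov = empirical_covariance(X)
        # A real estimator would run its solver on emp_cov at this point.
        self.emp_cov_ = emp_cov
        return self
```

With this layout, `CovInputSketch().fit(X)` and `CovInputSketch(X_is_cov=True).fit(empirical_covariance(X))` end up with the same `emp_cov_`, which is the behavior the flag is meant to provide. Note that nothing stops a caller from passing a shrunk or otherwise modified covariance.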

There are situations in which I would be interested in using the estimator
object and, like you, I cannot afford carrying around the full time
series. This can be useful for instance to use the cross-validated
estimator, which carries a fair amount of logic to do the parameter
search, or to compare different estimators. This sort of breaks the
cross-validation in the scikit, but not completely, as tricks such as
passing in lists of empirical covariances can be used.
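
One possible reading of that trick, as a sketch only: keep one empirical covariance per fold, train on the average of the other folds, and score the held-out fold with the Gaussian log-likelihood. All function names below are hypothetical. The averaging assumes equal-size folds centered the same way, and a simple shrinkage toward a scaled identity stands in for the actual estimator, which is not implemented here.

```python
import numpy as np


def gaussian_score(emp_cov, precision):
    """Per-sample Gaussian log-likelihood, up to an additive constant."""
    sign, logdet = np.linalg.slogdet(precision)
    if sign <= 0:
        raise ValueError("precision must be positive definite")
    return 0.5 * (logdet - np.trace(emp_cov @ precision))


def shrunk_precision(emp_cov, shrinkage):
    """Stand-in estimator: invert a covariance shrunk toward a scaled identity."""
    p = emp_cov.shape[0]
    mu = np.trace(emp_cov) / p
    shrunk = (1.0 - shrinkage) * emp_cov + shrinkage * mu * np.eye(p)
    return np.linalg.inv(shrunk)


def cv_score_from_covs(fold_covs, shrinkage):
    """Leave-one-fold-out score computed from per-fold covariances only."""
    scores = []
    for i, test_cov in enumerate(fold_covs):
        train = [c for j, c in enumerate(fold_covs) if j != i]
        # Assumption: equal-size folds, so the pooled covariance is the mean.
        train_cov = np.mean(train, axis=0)
        precision = shrunk_precision(train_cov, shrinkage)
        scores.append(gaussian_score(test_cov, precision))
    return float(np.mean(scores))
```

A parameter search would then call `cv_score_from_covs` for each candidate value and keep the one with the highest score, without ever touching the full time series.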

What do people think? Should I:

 1. change graph_lasso to take the empirical covariance as an input

 2. add an 'X_is_cov' parameter to the estimators

Gael

PS: As noted by Josef, cov_init doesn't answer this use case.
