Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-09 Thread Olivier Grisel
> So, it seems like, at best, the docs could mention which kind of standard
> deviation is in use and why it probably doesn't matter. Anyway, I learned a
> lot about scaling!

Pull Requests to improve the documentation are always very much appreciated :)

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-06 Thread Doug Coleman
Yes, I just realized that it doesn't work out unless you divide by the std. It seems like using the population or sample standard deviation is not important in this case since it's not easy to get the unbiased sample std. I came across some other techniques for scaling described in section "Class

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-06 Thread Robert Kern
On Tue, Nov 6, 2012 at 4:17 PM, Doug Coleman wrote: > Actually, from the numpy docs, the ddof=1 for np.std doesn't make it > unbiased. There's a whole wikipedia article on calculating the unbiased > standard deviation, and it seems to be different for the normal distribution > than for others and

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-06 Thread Doug Coleman
Actually, from the numpy docs, the ddof=1 for np.std doesn't make it unbiased. There's a whole wikipedia article on calculating the unbiased standard deviation, and it seems to be different for the normal distribution than for others and involves the gamma function--the advice from the wiki is not
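
To illustrate the point about the gamma function: for independent, normally distributed samples, the expected value of the ddof=1 standard deviation is c4(n) * sigma, where c4(n) = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2). The following is a minimal sketch of that correction factor; the function name `c4` is chosen here for illustration, and the factor does not carry over to other distributions.

```python
import math


def c4(n):
    """Bias factor E[s] / sigma for the ddof=1 std of n normal samples."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Use log-gamma so the ratio stays finite for large n.
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


# The factor is below 1 (s underestimates sigma on average) and
# approaches 1 as n grows.
for n in (2, 5, 10, 100):
    print(n, c4(n))
```

Dividing the ddof=1 estimate by c4(n) would remove the bias under the normality assumption only.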

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-06 Thread Lars Buitinck
2012/11/6 Olivier Grisel:
> None, False: no stdev
> True, "pop": population stdev
> "sample": sample stdev
>
> +1 but with "population" instead of "pop".

Alright :)

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
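
The overloading proposed in this exchange could look roughly like the sketch below. This is not scikit-learn code: `ddof_from_with_std` and `scale` are hypothetical helpers written only to show how the agreed values (None/False, True/"population", "sample") might map onto numpy's `ddof`.

```python
import numpy as np


def ddof_from_with_std(with_std):
    """Map the proposed with_std values to a numpy ddof (hypothetical helper)."""
    if with_std is None or with_std is False:
        return None  # no scaling by the standard deviation
    if with_std is True or with_std == "population":
        return 0     # population stdev, divide by n
    if with_std == "sample":
        return 1     # sample stdev, divide by n - 1
    raise ValueError("unsupported with_std value: %r" % (with_std,))


def scale(X, with_std=True):
    """Center each column and optionally divide by its stdev (sketch only)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    ddof = ddof_from_with_std(with_std)
    if ddof is None:
        return centered
    # No guard against constant columns (zero stdev) in this sketch.
    return centered / X.std(axis=0, ddof=ddof)
```

For example, `scale(X, with_std="sample")` would give columns whose `np.std(..., ddof=1)` is one, while the default keeps `np.std(...)` at one.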

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-06 Thread Olivier Grisel
None, False: no stdev True, "pop": population stdev "sample": sample stdev +1 but with "population" instead of "pop". 2012/11/6 Lars Buitinck : > 2012/11/6 Gael Varoquaux : >> That said, I am OK adding an additional parameter, if people think that >> it is important. The one used in numpy, "ddof"

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-06 Thread Lars Buitinck
2012/11/6 Gael Varoquaux:
> That said, I am OK adding an additional parameter, if people think that
> it is important. The one used in numpy, "ddof", is somewhat cryptic,
> though.

How about overloading with_std to take...

None, False: no stdev
True, "pop": population stdev
"sample": sample stde

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-06 Thread Robert Kern
On Tue, Nov 6, 2012 at 6:48 AM, Gael Varoquaux wrote:
> I am actually -1 on this, because the consequence would be that np.std(X,
> axis=-1) would no longer be one. I am afraid that it would confuse the
> users.
>
> I believe that the n/(n - 1) difference is completely irrelevant for
> machine lea

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-05 Thread Gael Varoquaux
I am actually -1 on this, because the consequence would be that np.std(X, axis=-1) would no longer be one. I am afraid that it would confuse the users. I believe that the n/(n - 1) difference is completely irrelevant for machine learning purposes. If a quantity is relevant, it is the norm of the fe
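
The concern about np.std no longer being one can be checked numerically. The sketch below assumes features are stored in columns (so it uses axis=0 rather than the axis=-1 written above) and uses synthetic data. It scales once with the population stdev and once with the sample stdev, then measures the result with numpy's default np.std.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(50, 4))  # synthetic data
n = X.shape[0]

mean = X.mean(axis=0)
X_pop = (X - mean) / X.std(axis=0)             # population stdev (ddof=0)
X_sample = (X - mean) / X.std(axis=0, ddof=1)  # sample stdev (ddof=1)

std_pop = np.std(X_pop, axis=0)        # exactly one per feature
std_sample = np.std(X_sample, axis=0)  # sqrt((n - 1) / n) per feature

print(std_pop)
print(std_sample, np.sqrt((n - 1) / n))
```

With n = 50 the two scalings differ by a factor of sqrt(49/50), about 1%, which is the n/(n - 1) effect under discussion.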

Re: [Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-05 Thread Lars Buitinck
2012/11/5 Doug Coleman : > It seems this is rarely the case in machine learning, so perhaps it would be > better to scale using the sample standard deviation, which numpy already > supports, or to make it a flag. +1 Since we renamed Scaler since the last release (?), we can make population stdev

[Scikit-learn-general] preprocessing.scaler uses population standard deviation

2012-11-05 Thread Doug Coleman
preprocessing.scaler calls numpy's default standard deviation, which is the population standard deviation (delta-degrees-of-freedom is 0). This is usually reserved for when you have the entire set of data. It seems this is rarely the case in machine learning, so perhaps it would be better to scale
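
For reference, the difference between the two estimators in numpy comes down to the `ddof` argument of `np.std`. A small sketch on a made-up array:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

pop_std = np.std(x)             # default ddof=0: divide by n
sample_std = np.std(x, ddof=1)  # ddof=1: divide by n - 1

# The two differ only by the constant factor sqrt(n / (n - 1)).
ratio = sample_std / pop_std

print(pop_std, sample_std, ratio)
```

Because the ratio depends only on n, switching estimators rescales every feature by the same constant.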