Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-25 Thread Cory Dolphin
I think that is a great idea. I made a quick change and submitted a pull request. Thanks, Cory On Fri, Apr 25, 2014 at 7:45 AM, Gael Varoquaux <gael.varoqu...@normalesup.org> wrote: > On Fri, Apr 25, 2014 at 01:40:32PM +0200, Arnaud Jol…

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-23 Thread Cory Dolphin
Good point. I still think shuffle=True is both the most common use case and the most obvious default. Are there any arguments against this, and if so, how does the rest of the community feel? Regarding the textbook definition, I don't mean to push the point, but I know I personally lost 5…

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-22 Thread Cory Dolphin
…= np.arange(n); if shuffle: random_state.shuffle(self.idxs). Is that too complicated? On Tue, Apr 22, 2014 at 6:20 AM, Lars Buitinck wrote: > 2014-04-21 20:58 GMT+02:00 Cory Dolphin: > > I propose that: if a random_state is specified, I believe shuffling should…
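The index-setup fragment quoted above can be sketched as a standalone helper. This is a hypothetical illustration of the proposed behavior (shuffle only when requested, deterministically under a seed), not the actual scikit-learn implementation:

```python
import numpy as np

def make_fold_indices(n, shuffle=False, random_state=None):
    """Sketch of the proposed KFold index setup: start from arange(n),
    and shuffle in place only if shuffle=True, using a seeded RNG so
    the split is reproducible. Hypothetical helper, not sklearn code."""
    idxs = np.arange(n)
    if shuffle:
        rng = np.random.RandomState(random_state)
        rng.shuffle(idxs)
    return idxs

# Unshuffled: identity order. Shuffled with a fixed seed: a reproducible permutation.
print(make_fold_indices(5))
print(make_fold_indices(5, shuffle=True, random_state=0))
```

Passing the same `random_state` twice yields the same permutation, which is the property the thread is arguing should gate the shuffling.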

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-21 Thread Cory Dolphin
Mathieu, in my case the order of the data is also an implementation detail. The default of not shuffling produced an unfair split. Because my data was sorted as a way of organizing the querying and packing of the data, models were trained and tested on data that were almost as disparate…

[Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-16 Thread Cory Dolphin
Hello, Using cross_validation.KFold, I was surprised to see that the shuffle parameter defaults to False. This default made it difficult to diagnose why my folds were performing so much worse than similar train_test_split runs. I expected that shuffle=True would be the default, and that passing in a no…
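The failure mode described above (sorted data plus unshuffled folds) can be seen with a minimal numpy sketch. With contiguous folds, each test fold covers a range of target values that the training folds never see, which is exactly why the folds score far worse than a shuffled train/test split:

```python
import numpy as np

# Targets sorted by value, standing in for data that was organized
# (e.g. by query) before being loaded.
y = np.arange(12)

# shuffle=False behavior: three contiguous, non-overlapping index blocks.
folds = np.array_split(np.arange(len(y)), 3)
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    # Each test fold's value range is disjoint from what was trained on.
    print("test:", y[test_idx].min(), "-", y[test_idx].max(),
          "| train:", y[train_idx].min(), "-", y[train_idx].max())
```

The first fold, for instance, tests on values 0-3 while training only on 4-11: the model is evaluated entirely outside the range it saw.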

Re: [Scikit-learn-general] RidgeRegression sample_weights, sparse solvers

2014-04-03 Thread Cory Dolphin
…to use SGDRegressor(loss="squared"), which > should readily support sample_weight. > > HTH, > Mathieu > > On Wed, Apr 2, 2014 at 9:18 AM, Cory Dolphin wrote: >> Hello, I am trying to perform ridge regression on a relatively large data s…

[Scikit-learn-general] RidgeRegression sample_weights, sparse solvers

2014-04-01 Thread Cory Dolphin
Hello, I am trying to perform ridge regression on a relatively large data set: 70 million examples with 24 million very sparse features. E.g., I have created an X matrix with dimensions (73725855, 24652292), an associated y vector with dimensions (73725855,), and a sample_weights vector with identical d…
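What sample_weight means for ridge regression can be made concrete with a small dense sketch. This is the closed-form weighted ridge solve on a toy problem, purely to illustrate the objective; it is obviously not a solver for a 73M x 24M sparse matrix (for that scale the thread's suggestion of SGDRegressor with squared loss applies):

```python
import numpy as np

def weighted_ridge(X, y, w, alpha):
    """Closed-form weighted ridge: solve (X^T W X + alpha*I) beta = X^T W y,
    where W = diag(w). Small-scale illustration only."""
    XtW = X.T * w  # broadcasting scales each sample (column of X^T) by its weight
    A = XtW @ X + alpha * np.eye(X.shape[1])
    b = XtW @ y
    return np.linalg.solve(A, b)

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true
print(weighted_ridge(X, y, np.ones(50), alpha=1e-8))  # close to beta_true
```

One useful sanity check: giving a sample weight 2 is exactly equivalent to duplicating that row, which is the usual semantics of sample_weight for a squared-loss objective.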

Re: [Scikit-learn-general] Overflow error dumping large GaussianProcess estimator with joblib.

2014-03-05 Thread Cory Dolphin
I believe I experienced a similar issue, reported on GitHub as #122. It seems to be a zlib limitation in Python, not something that can or will be fixed in joblib. I would love to know if a workaround was found; it is almost ironic that compression fails fo…
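Historically, handing zlib a single buffer larger than it can address in one call raised OverflowError. A workaround sketch (my suggestion, not a joblib patch) is to stream the data through zlib.compressobj in bounded chunks, so no individual call sees an oversized buffer:

```python
import io
import zlib

def compress_stream(data, chunk_size=64 * 1024 * 1024):
    """Compress a large bytes-like object incrementally. Each call to
    comp.compress() sees at most chunk_size bytes, sidestepping
    single-buffer size limits. Workaround sketch only."""
    comp = zlib.compressobj()
    out = io.BytesIO()
    view = memoryview(data)
    for start in range(0, len(view), chunk_size):
        out.write(comp.compress(view[start:start + chunk_size]))
    out.write(comp.flush())  # emit any buffered trailing output
    return out.getvalue()

blob = b"scikit-learn " * 1000
assert zlib.decompress(compress_stream(blob)) == blob
```

The output is a standard zlib stream, so zlib.decompress can read it back whole (or decompressobj can stream it back out in chunks as well).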

Re: [Scikit-learn-general] Weighted logistic regression

2014-02-05 Thread Cory Dolphin
1. Use SGDClassifier(loss='log').fit(X, y, sample_weight). > 2. Use the branch of PR 2784 <https://github.com/scikit-learn/scikit-learn/pull/2784>, which > ports sample_weight to scikit-learn's liblinear. > > - Joel > > On 5 February 2014 11:34, Cory Dolph…

[Scikit-learn-general] Weighted logistic regression

2014-02-04 Thread Cory Dolphin
Hello, First, thanks for this wonderful library; I am an undergraduate engineering student and this tool has opened my mind to ML! I have a problem with repeated samples on which I wish to perform logistic regression. I expected to be able to pass a vector of weights to associate with the repeat…
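The intuition behind passing weights for repeated samples can be checked directly: for logistic log-loss, an integer sample_weight of k gives exactly the same gradient as repeating that row k times. A small numpy sketch (hypothetical helper, not library code):

```python
import numpy as np

def logistic_grad(X, y, beta, sample_weight):
    """Gradient of the weighted logistic log-loss w.r.t. beta:
    X^T (w * (sigmoid(X beta) - y)). Illustration only."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (sample_weight * (p - y))

rng = np.random.RandomState(0)
X = rng.randn(6, 2)
y = np.array([0, 1, 0, 1, 1, 0], dtype=float)
beta = np.array([0.3, -0.7])

# Weight 3 on the first sample ...
w = np.ones(6); w[0] = 3.0
g_weighted = logistic_grad(X, y, beta, w)

# ... equals repeating that sample three times with unit weights.
X_rep = np.vstack([X, X[:1], X[:1]])
y_rep = np.concatenate([y, y[:1], y[:1]])
g_repeated = logistic_grad(X_rep, y_rep, beta, np.ones(8))
print(np.allclose(g_weighted, g_repeated))
```

Since the gradients agree at every beta, any gradient-based fit with sample_weight reaches the same solution as fitting on the expanded, row-repeated data set.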