Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-25 Thread Cory Dolphin
I think that is a great idea. I made a quick change and submitted a pull request.
Thanks, Cory
On Fri, Apr 25, 2014 at 7:45 AM, Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:
> On Fri, Apr 25, 2014 at 01:40:32PM +0200, Arnaud Jol

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-25 Thread Gael Varoquaux
On Fri, Apr 25, 2014 at 01:40:32PM +0200, Arnaud Joly wrote:
> > One solution would be to deprecate the "shuffle" option from KFold and add
> > a new class ShuffleKFold.
> > The documentation should clarify the difference between ShuffleKFold and
> > ShuffleSplit: in the latter you need to specif

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-25 Thread Arnaud Joly
On 23 Apr 2014, at 08:17, Mathieu Blondel wrote:
> One solution would be to deprecate the "shuffle" option from KFold and add a
> new class ShuffleKFold.
> The documentation should clarify the difference between ShuffleKFold and
> ShuffleSplit: in the latter you need to specify the split size

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-24 Thread Andy
On 04/23/2014 09:45 PM, Cory Dolphin wrote:
> Good point.
>
> I still support shuffle=True being both the most common use case, and
> the most obvious default. Are there any arguments against this, and if
> so, how does the rest of the community feel?
> Regarding the text book definition,
> > > I

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-23 Thread Cory Dolphin
Good point. I still support shuffle=True being both the most common use case, and the most obvious default. Are there any arguments against this, and if so, how does the rest of the community feel? Regarding the text book definition, I don't mean to push the point, but I know I personally lost 5

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-23 Thread Gael Varoquaux
> I think it would be better not to add new ways of parsing that input that are
> specific to a single class.
Agreed. Explicit is better than implicit. Let's not change conventions for this class.

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-22 Thread Mathieu Blondel
I agree with Lars that in scikit-learn we use the convention random_state=None to mean "use a random seed". One solution would be to deprecate the "shuffle" option from KFold and add a new class ShuffleKFold. The documentation should clarify the difference between ShuffleKFold and ShuffleSplit: in

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-22 Thread Robert Layton
What is the behaviour scikit-learn wide? Usually you would just pass the value of random_state to the sklearn.utils.validation.check_random_state function. It may be possible to extend *that* function to accept "False" as an input, in which case it would not shuffle. (For example, calling ran

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-22 Thread Cory Dolphin
Lars, random_state=None would maintain current behavior (which is to not shuffle). As for shuffle=False and random_state=1, good point. I would expect not shuffling. I would handle this with a default None, and a check, such as:
def __init__(self, n, n_folds=3, indices=None, shuffle=None,
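The message is cut off at this point, so what follows is not Cory's code. It is a hypothetical sketch of the kind of check the visible text describes: shuffle defaults to None, an explicit shuffle value always wins, and an unset shuffle is inferred from whether a random_state was given. The helper name `resolve_shuffle` is invented for illustration and is not part of scikit-learn.

```python
def resolve_shuffle(shuffle=None, random_state=None):
    """Hypothetical helper: decide whether to shuffle (illustration only).

    - An explicit shuffle=True/False is respected as given.
    - shuffle=None means "not specified": shuffle only if a random_state
      was supplied, so the old no-shuffle default is preserved.
    """
    if shuffle is None:
        return random_state is not None
    return bool(shuffle)


# The three cases discussed in the message.
default_case = resolve_shuffle()                              # nothing given
seed_only = resolve_shuffle(random_state=1)                   # seed implies shuffle
explicit_off = resolve_shuffle(shuffle=False, random_state=1) # explicit flag wins
```

Elsewhere in the thread, Lars Buitinck and Mathieu Blondel point out that this conflicts with the scikit-learn convention that random_state=None means "use a random seed", which is one reason the implicit rule was not favoured.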

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-22 Thread Lars Buitinck
2014-04-21 20:58 GMT+02:00 Cory Dolphin:
> I propose that:
> If a random_state is specified, I believe shuffling should be enabled. (I
> cannot imagine any other reason to specify a random state).
>
> This will:
> 1. Keep backwards compatibility
It does break with conventions. Usually, random_sta

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-21 Thread Skipper Seabold
On Thu, Apr 17, 2014 at 9:56 AM, Mathieu Blondel wrote:
> It seems to me that you assume the order in which a dataset is laid out is
> meaningful. I think there are cases when this order might be completely
> artificial and not reflect the true distribution of the data. For me, the
> order is an i

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-21 Thread Cory Dolphin
Mathieu, in my case the order of the data is also an implementation detail. The default of not shuffling produced an unfair split. Because my data was sorted as a way of organizing the querying and packing of the data, models were trained and tested on data that were almost as disparat

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Mathieu Blondel
It seems to me that you assume the order in which a dataset is laid out is meaningful. I think there are cases when this order might be completely artificial and not reflect the true distribution of the data. For me, the order is an implementation detail: it depends on the way the samples were fetc

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Olivier Grisel
If the mean validation score is significantly higher when you shuffle, it means that you have some dependency structure in your samples that got broken by the shuffling. Having a dependency structure in your samples means that your samples do not follow the i.i.d. assumption. Shuffling by default
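A toy illustration of this point, under stated assumptions: the samples form a noiseless linear trend over time (so neighbouring samples are strongly dependent), and the model is a 1-nearest-neighbour predictor on the time index. The helpers `kfold_indices` and `mean_abs_error_1nn` are hypothetical, written with the standard library only, and are not scikit-learn code.

```python
import random


def kfold_indices(n_samples, n_folds, shuffle=False, seed=None):
    """Yield (train, test) index lists; assumes n_samples % n_folds == 0."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test


def mean_abs_error_1nn(times, values, shuffle):
    """Cross-validated error of a 1-nearest-neighbour predictor in time."""
    errors = []
    for train, test in kfold_indices(len(times), 3, shuffle=shuffle, seed=0):
        for i in test:
            # Predict with the value of the closest training sample in time.
            nearest = min(train, key=lambda j: abs(times[j] - times[i]))
            errors.append(abs(values[nearest] - values[i]))
    return sum(errors) / len(errors)


# Time-ordered samples with a strong dependency between neighbours.
times = list(range(60))
values = [0.1 * t for t in times]

contiguous_mae = mean_abs_error_1nn(times, values, shuffle=False)
shuffled_mae = mean_abs_error_1nn(times, values, shuffle=True)
```

With shuffling, almost every test sample has a training sample right next to it in time, so `shuffled_mae` comes out well below `contiguous_mae`. The better score reflects the broken dependency structure rather than a better model.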

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Robert Layton
I think having it off by default is a good thing. Generally, you want as little as possible to happen when you use the defaults. For example, if you "preshuffle" the data for some reason, you just want the KFold to split it up for you.
On 17 April 2014 19:57, Mathieu Blondel wrote:
> I think the main reaso

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Mathieu Blondel
I think the main reason that shuffle=False is the default is historical: the option was added later on and so shuffle=False was chosen for backward compatibility. The time-oriented data use case sounds pretty minor, and the textbook definition doesn't shuffle because data are usually assu

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Gael Varoquaux
On Thu, Apr 17, 2014 at 01:08:10PM +1000, Joel Nothman wrote:
> I think Olivier brought up before that one reason for keeping folds
> contiguous is that in some real datasets the input ordering includes
> some amount of correlation between near-adjacent samples, for example,
> time oriented data su

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-16 Thread Joel Nothman
Hi Cory, I think Olivier brought up before that one reason for keeping folds contiguous is that in some real datasets the input ordering includes some amount of correlation between near-adjacent samples, for example, time oriented data such as a news feed. In this case, shuffling makes your traini

[Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-16 Thread Cory Dolphin
Hello, Using cross_validation.KFold, I was surprised to see that the shuffle parameter defaults to False. This default made it difficult to diagnose why my folds were performing so much worse than similar train_test_splits. I expect that shuffle=True would be the default, and that passing in a no
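A minimal sketch of the failure mode described in this post, assuming the data happen to be sorted by class. It uses only the Python standard library; `kfold_indices` is a hypothetical helper written for illustration and is not scikit-learn's KFold implementation.

```python
import random


def kfold_indices(n_samples, n_folds, shuffle=False, seed=None):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Toy labels sorted by class, as if the data had been packed group by group.
labels = [0] * 4 + [1] * 4 + [2] * 4

# Without shuffling, every test fold holds exactly one class, and that class
# never appears in the matching training set.
unseen = [
    {labels[i] for i in test} - {labels[i] for i in train}
    for train, test in kfold_indices(len(labels), n_folds=3)
]
# unseen == [{0}, {1}, {2}]
```

Passing `shuffle=True` (optionally with a `seed`) permutes the indices before they are cut into folds, which is the behaviour the post expected by default.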