Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Mathieu Blondel
It seems to me that you assume the order in which a dataset is laid out is meaningful. I think there are cases where this order is completely artificial and does not reflect the true distribution of the data. For me, the order is an implementation detail: it depends on the way the samples were fetc

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Olivier Grisel
If the mean validation score is significantly higher when you shuffle, it means that you have some dependency structure in your samples that got broken by the shuffling. Having a dependency structure in your samples means that your samples do not follow the i.i.d. assumption. Shuffling by default
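
The effect described above can be reproduced with a small synthetic sketch. Everything in it is an assumption made for illustration and none of it comes from the thread: the grouped toy data, the 1-nearest-neighbour classifier, and the use of the current `sklearn.model_selection` module (at the time of this thread the same classes lived in `sklearn.cross_validation`). Adjacent samples are near-duplicates that share a label, so they are not i.i.d.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Hypothetical toy data: 20 groups of 10 adjacent, near-duplicate samples.
# Each group has its own random centre and its own random binary label.
n_groups, group_size, n_features = 20, 10, 5
centres = rng.normal(size=(n_groups, n_features))
group_labels = rng.randint(0, 2, size=n_groups)

X = np.repeat(centres, group_size, axis=0)
X = X + 0.01 * rng.normal(size=X.shape)  # small within-group noise
y = np.repeat(group_labels, group_size)

clf = KNeighborsClassifier(n_neighbors=1)

# Contiguous folds (shuffle=False): each test fold holds 4 whole groups,
# so no near-duplicate of a test sample is available for training.
contiguous_score = cross_val_score(clf, X, y, cv=KFold(n_splits=5)).mean()

# Shuffled folds: near-duplicates of most test samples end up in the
# training set, which breaks the dependency structure.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_score = cross_val_score(clf, X, y, cv=shuffled_cv).mean()

print("contiguous folds:", contiguous_score)
print("shuffled folds:  ", shuffled_score)
```

In this sketch the shuffled score is the optimistic one: it mostly measures how well the model recognises near-duplicates it has already seen, not how well it generalises to a new group.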

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Robert Layton
I think having it off by default is a good thing. Generally, you want as little as possible to happen when you use the defaults. For example, if you "preshuffle" the data for some reason, you just want the KFold to split it up for you.

On 17 April 2014 19:57, Mathieu Blondel wrote:
> I think the main reaso
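
A minimal sketch of the two options under discussion, written against the current `sklearn.model_selection.KFold` (an assumption about the reader's version; in 2014 the class lived in `sklearn.cross_validation` and took different constructor arguments). The six-sample array and the seed are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)  # six samples, indexed 0..5

# Default behaviour (shuffle=False): test folds are contiguous index blocks.
contiguous = [test.tolist() for _, test in KFold(n_splits=3).split(X)]
print(contiguous)  # [[0, 1], [2, 3], [4, 5]]

# "Preshuffling": permute the data yourself, then let KFold just split it.
rng = np.random.RandomState(0)
perm = rng.permutation(len(X))
X_pre = X[perm]
preshuffled = [perm[test].tolist() for _, test in KFold(n_splits=3).split(X_pre)]
print(preshuffled)  # original indices that land in each test fold

# Alternatively, ask KFold to shuffle the indices itself.
builtin = [
    test.tolist()
    for _, test in KFold(n_splits=3, shuffle=True, random_state=0).split(X)
]
print(builtin)
```

In every variant each sample appears in exactly one test fold; only the assignment of samples to folds changes.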

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Mathieu Blondel
I think the main reason shuffle=False is the default is historical: the option was added later, so shuffle=False was chosen for backward compatibility. The time-oriented data use case sounds pretty minor, and the reason the textbook definition doesn't shuffle is that data are usually assu

Re: [Scikit-learn-general] KFold cross validation strangely defaults to not shuffle

2014-04-17 Thread Gael Varoquaux
On Thu, Apr 17, 2014 at 01:08:10PM +1000, Joel Nothman wrote:
> I think Olivier brought up before that one reason for keeping folds
> contiguous is that in some real datasets the input ordering includes
> some amount of correlation between near-adjacent samples, for example,
> time oriented data su