It seems to me that you assume the order in which a dataset is laid out is
meaningful. I think there are cases where this order might be completely
artificial and not reflect the true distribution of the data. For me, the
order is an implementation detail: it depends on the way the samples were
fetched.
If the mean validation score is significantly higher when you shuffle,
it means that you have some dependency structure in your samples that
got broken by the shuffling. Having a dependency structure in your
samples means that they do not satisfy the i.i.d. assumption.
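To make that concrete, here is a minimal sketch (my own illustration, not from
the thread, and written against the current scikit-learn API in
sklearn.model_selection rather than the 2014 sklearn.cross_validation module)
comparing the mean validation score with contiguous and shuffled folds on a
dataset whose rows happen to be sorted by class:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)  # rows are stored sorted by class
    clf = LogisticRegression(max_iter=1000)

    # Contiguous folds respect the storage order of the samples.
    plain = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=False))
    # Shuffled folds break any structure tied to that order.
    shuffled = cross_val_score(clf, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0))

    print("mean score, contiguous folds:", plain.mean())
    print("mean score, shuffled folds:  ", shuffled.mean())
    # A large gap between the two suggests the sample order carries a
    # dependency structure (here: sorting by class), i.e. the samples are
    # not i.i.d. with respect to their ordering.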
Shuffling by default
I think having it off by default is a good thing. Generally, you want as
little as possible to happen when you use the defaults.
For example, if you "preshuffle" the data for some reason, you just want
the KFold to split it up for you.
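For illustration (again a sketch of my own, using the current API), the default
shuffle=False means KFold just slices the data in the order you give it, so a
preshuffled dataset is split exactly as you arranged it:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(10).reshape(-1, 1)  # pretend this order is your preshuffle

    for train_idx, test_idx in KFold(n_splits=5).split(X):
        print("test fold:", test_idx)
    # test fold: [0 1]
    # test fold: [2 3]
    # ... contiguous blocks, no reordering applied by the default.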
On 17 April 2014 19:57, Mathieu Blondel wrote:
> I think the main reason shuffle=False by default is historical [...]
I think the main reason shuffle=False by default is historical: the option
was added later on and so shuffle=False was chosen for backward
compatibility.
The time-oriented data use case sounds pretty minor, and the reason the
textbook definition doesn't shuffle is that data are usually assumed to be
i.i.d.
On Thu, Apr 17, 2014 at 01:08:10PM +1000, Joel Nothman wrote:
> I think Olivier brought up before that one reason for keeping folds
> contiguous is that in some real datasets the input ordering includes
> some amount of correlation between near-adjacent samples, for example,
> time-oriented data such as time series.
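A small sketch of why that matters (my own example, current scikit-learn API):
with contiguous folds a test block only touches the training data at its edges,
whereas shuffling puts almost every test point right next to a training
neighbour, which inflates scores when near-adjacent samples are correlated.

    import numpy as np
    from sklearn.model_selection import KFold

    def mean_gap(cv, X):
        # Average index distance from each test sample to its nearest
        # training sample; small gaps mean test points sit right next to
        # (potentially correlated) training neighbours.
        gaps = []
        for train, test in cv.split(X):
            gaps.append(np.mean([np.abs(train - t).min() for t in test]))
        return np.mean(gaps)

    X = np.arange(100).reshape(-1, 1)  # stand-in for time-ordered samples
    print("contiguous folds, mean gap:", mean_gap(KFold(5), X))
    print("shuffled folds,   mean gap:",
          mean_gap(KFold(5, shuffle=True, random_state=0), X))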