Hi Afarin,

You appear to be describing a situation in which your dataset is divided into subsets within which the data are highly correlated (but perhaps conditionally independent given the subject / group identifier). In scikit-learn 0.18 these are handled by what might be called "grouped cross-validation" strategies, which keep all rows that share a group identifier on the same side of each split. See http://scikit-learn.org/dev/modules/cross_validation.html#cross-validation-iterators-for-grouped-data
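For the two-rows-per-subject case, a grouped split might look roughly like the following. This is only a minimal sketch under stated assumptions: the toy data, the names X, y, groups and n_subjects, and the choice of LogisticRegression are invented for illustration, while GroupShuffleSplit, GroupKFold and the groups argument of cross_val_score are the sklearn.model_selection tools available from 0.18.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

# Hypothetical toy data: 10 subjects, two rows each (assumption for illustration).
n_subjects = 10
rng = np.random.RandomState(0)
X = rng.randn(2 * n_subjects, 5)               # 5 made-up features per row
y = np.tile([1, 0], n_subjects)                # 1 = successful, 0 = unsuccessful encoding
groups = np.repeat(np.arange(n_subjects), 2)   # subject identifier for each row

# Grouped analogue of a single train/test split:
# both rows of a subject always land on the same side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
shared_subjects = set(groups[train_idx]) & set(groups[test_idx])

# Grouped k-fold cross-validation: no subject is split across folds.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)
```

Here shared_subjects should come out empty, because the split is made over subject identifiers rather than over individual rows. With random features like these the scores themselves carry no meaning; the point is only how groups is passed through.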
(In earlier versions of scikit-learn, the corresponding CV objects are named LabelKFold, LeaveOneLabelOut, etc. We renamed them for clarity when redesigning the CV objects and moving them to the new sklearn.model_selection subpackage.)

I hope that helps.

Joel

On 27 September 2016 at 07:06, Afarin Famili <afarin.fam...@utsouthwestern.edu> wrote:

> Hi David,
>
> When applying train_test_split to the sample space, we have a single
> row per subject. I am looking for another function like
> train_test_split that can deal with pairs of rows (one pair per
> subject) without leading to a biased accuracy. We are studying memory:
> each subject has one row of features for successful memory encoding
> and a second row for unsuccessful memory encoding. The target is 1 for
> successful and 0 for unsuccessful encoding.
> How would you recommend splitting this data set in order to get a
> reasonable, unbiased accuracy?
>
> Thanks,
> Afarin
>
> ________________________________________
> From: scikit-learn <scikit-learn-bounces+afarin.famili=utsouthwestern.edu@python.org>
>   on behalf of scikit-learn-requ...@python.org <scikit-learn-requ...@python.org>
> Sent: Monday, September 26, 2016 2:43 PM
> To: scikit-learn@python.org
> Subject: scikit-learn Digest, Vol 6, Issue 40
>
> Today's Topics:
>
>    1. header intact (Afarin Famili)
>    2. Is there a built-in function for pairs of data? (Afarin Famili)
>    3. Re: Is there a built-in function for pairs of data?
>       (Pedro Pazzini)
>    4. Re: Is there a built-in function for pairs of data?
>       (David Nicholson)
>    5. Large computation time for homogeneous data with
>       agglomerative clustering (Md. Khairullah)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 26 Sep 2016 18:03:27 +0000
> From: Afarin Famili <afarin.fam...@utsouthwestern.edu>
> Subject: [scikit-learn] header intact
>
> ?
>
> ________________________________
> UT Southwestern Medical Center
> The future of medicine, today.
>
> ------------------------------
>
> Message: 2
> Date: Mon, 26 Sep 2016 18:06:49 +0000
> From: Afarin Famili <afarin.fam...@utsouthwestern.edu>
> Subject: [scikit-learn] Is there a built-in function for pairs of data?
>
> Dear Scikit-learn team,
>
> We need to deal with pairs of data in our classification task. I was
> wondering if there is already a built-in function in Scikit-learn that
> can partition the pairs of data into train and test sets?
>
> Regards,
> Afarin
>
> ------------------------------
>
> Message: 3
> Date: Mon, 26 Sep 2016 15:47:26 -0300
> From: Pedro Pazzini <pedropazz...@gmail.com>
> Subject: Re: [scikit-learn] Is there a built-in function for pairs of data?
>
> Like this?:
> http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
>
> ------------------------------
>
> Message: 4
> Date: Mon, 26 Sep 2016 14:53:05 -0400
> From: David Nicholson <nichol...@gmail.com>
> Subject: Re: [scikit-learn] Is there a built-in function for pairs of data?
>
> Do you mean like train_test_split?
> http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
>
> ------------------------------
>
> Message: 5
> Date: Mon, 26 Sep 2016 21:43:05 +0200
> From: "Md. Khairullah" <md.khairul...@gmail.com>
> Subject: [scikit-learn] Large computation time for homogeneous data
>   with agglomerative clustering
>
> Dear Scikit-learners,
> This is my first post here and I hope you experts can help me a lot.
>
> We are using agglomerative clustering with Ward's linkage and a
> connectivity constraint. The data size is around 205,000 (each is a
> single scalar feature). The data set is dynamic (in time) and we need
> to apply clustering at different times throughout the process.
> Initially all data are 0 and they increase gradually. In other words,
> in the early stage the data are more homogeneous, and the
> heterogeneity among the data increases gradually.
> If the clustering is applied at the final stage (most heterogeneous
> data, but of course having patterns/clusters), requesting 20 clusters
> takes only 61s of CPU time. But if clustering is run at an early stage
> (more homogeneous data, but not all 0, and of course there are
> patterns/clusters in the data) with the same settings, the time rises
> to 1h 5m. The CPU time is in between these two if the data come from
> an in-between time stamp. I also tried the other linkage options, but
> the situation does not improve. My understanding is that the
> homogeneity is playing the role.
>
> Have you experienced this too? What solution do you suggest?
>
> Thanks in advance for your attention and help.
>
> --
> Best regards
>
> Md. Khairullah
> PhD Student, KU Leuven
> Numerical Analysis and Applied Mathematics Section
> Celestijnenlaan 200a - box 2402
> 3001 Leuven
> room: 03.18
> tel. +32 16 37 39 66
> fax +32 16 3 27996
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 6, Issue 40
> *******************************************
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn