Re: [scikit-learn] scikit-learn Digest, Vol 6, Issue 40

Joel Nothman Mon, 26 Sep 2016 15:08:06 -0700

Hi Afarin,

You appear to be describing a situation in which your dataset is divided
into subsets within which the samples are highly correlated (but perhaps
conditionally independent given the subject / group identifier). In
scikit-learn 0.18 the strategies for this case are called "grouped cross
validation": all samples that share a group identifier are kept on the same
side of every split. See
http://scikit-learn.org/dev/modules/cross_validation.html#cross-validation-iterators-for-grouped-data


(In earlier versions of scikit-learn, you will find the corresponding CV
objects as LabelKFold, LeaveOneLabelOut, etc. We renamed them for clarity,
to GroupKFold, LeaveOneGroupOut, etc., when redesigning the CV objects and
moving them to the new sklearn.model_selection subpackage.)
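
As a minimal sketch of how this might look for two rows per subject: the
feature values, array shapes and subject ids below are made up for
illustration, and only GroupShuffleSplit and GroupKFold are taken from
sklearn.model_selection (0.18 or later).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical data: 6 subjects, two rows each
# (one for successful, one for unsuccessful encoding).
n_subjects = 6
rng = np.random.RandomState(0)
X = rng.randn(2 * n_subjects, 4)              # 4 made-up features per row
y = np.tile([1, 0], n_subjects)               # 1 = successful, 0 = unsuccessful
groups = np.repeat(np.arange(n_subjects), 2)  # subject id of every row

# A single train/test split that keeps both rows of a subject together.
# An integer test_size is the number of groups (subjects) held out.
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Subjects present on both sides of the split (should be none).
shared = set(groups[train_idx]) & set(groups[test_idx])

# Grouped k-fold: every subject ends up in exactly one test fold.
cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups=groups))
```

The same cv object and groups array can also be passed on for model
evaluation, e.g. cross_val_score(clf, X, y, groups=groups, cv=cv), where
clf stands for whatever classifier you are using.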

I hope that helps.

Joel

On 27 September 2016 at 07:06, Afarin Famili <
[email protected]> wrote:

> Hi David,
>
> When applying train_test_split to the sample space, we have a single row
> per subject. I am looking for some other function like train_test_split
> that can deal with pairs of rows (one pair for each subject) without
> leading to a biased accuracy. We are studying memory and have a row of
> features for successful memory encoding, and a second row for unsuccessful
> memory encoding, for each of the subjects. Our target is 1 for successful
> and 0 for unsuccessful encoding.
> How would you recommend splitting this set of data in order to get a
> reasonable/unbiased accuracy?
>
> Thanks,
> Afarin
>
>
>
> ________________________________________
> From: scikit-learn <scikit-learn-bounces+afarin.famili=utsouthwestern.edu@
> python.org> on behalf of [email protected] <
> [email protected]>
> Sent: Monday, September 26, 2016 2:43 PM
> To: [email protected]
> Subject: scikit-learn Digest, Vol 6, Issue 40
>
> Today's Topics:
>
>    1. header intact (Afarin Famili)
>    2. Is there a built-in function for pairs of data? (Afarin Famili)
>    3. Re: Is there a built-in function for pairs of data?
>       (Pedro Pazzini)
>    4. Re: Is there a built-in function for pairs of data?
>       (David Nicholson)
>    5. Large computation time for homogeneous data with
>       agglomerative clustering (Md. Khairullah)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 26 Sep 2016 18:03:27 +0000
> From: Afarin Famili <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [scikit-learn] header intact
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> ?
>
> ________________________________
>
> UT Southwestern Medical Center
> The future of medicine, today.
>
> ------------------------------
>
> Message: 2
> Date: Mon, 26 Sep 2016 18:06:49 +0000
> From: Afarin Famili <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [scikit-learn] Is there a built-in function for pairs of
>         data?
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
> Dear Scikit-learn team,
>
>
> We need to deal with pairs of data in our classification task. I was
> wondering if there is already a built-in function in Scikit-learn that can
> partition the pairs of data into train and test sets?
>
>
> Regards,
>
> Afarin
>
> ------------------------------
>
> Message: 3
> Date: Mon, 26 Sep 2016 15:47:26 -0300
> From: Pedro Pazzini <[email protected]>
> To: Scikit-learn user and developer mailing list
>         <[email protected]>
> Subject: Re: [scikit-learn] Is there a built-in function for pairs of
>         data?
> Message-ID: <CAAY8FkB2LjnegwFbn=gSOawLBcBQ3dnYa6BxDxN6-cvLT1RsfA@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Like this?
> http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
>
> ------------------------------
>
> Message: 4
> Date: Mon, 26 Sep 2016 14:53:05 -0400
> From: David Nicholson <[email protected]>
> To: Scikit-learn user and developer mailing list
>         <[email protected]>
> Subject: Re: [scikit-learn] Is there a built-in function for pairs of
>         data?
> Message-ID: <CAMabFbXamB5KzQY9_WU+8BFxpSECbs2fSiQqad18zi9zmOjvVQ@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Do you mean like train_test_split?
> http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
>
> ------------------------------
>
> Message: 5
> Date: Mon, 26 Sep 2016 21:43:05 +0200
> From: "Md. Khairullah" <[email protected]>
> To: [email protected]
> Subject: [scikit-learn] Large computation time for homogeneous data
>         with agglomerative clustering
> Message-ID: <CA+xrTcKMkwSN2Y7jFg12nEx-Ch_V5bw7eLhG5UO39wN+ebBozg@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Dear Scikit-learners,
> This is my first post here, and I hope you experts can help me.
>
> We are using agglomerative clustering with Ward's linkage and a
> connectivity constraint. The data size is around 205,000 (each is a single
> scalar feature). The data set is dynamic (in time) and we need to apply
> clustering at different times throughout the process. Initially all data
> are 0 and they increase gradually. In other words, in the early stage the
> data are more homogeneous and the heterogeneity increases gradually.
> If the clustering is applied at the final stage (most heterogeneous data,
> but of course having patterns/clusters), requesting 20 clusters takes
> only 61s of CPU time. But if clustering is run at an early stage (more
> homogeneous data, though not all 0 and of course with patterns/clusters
> in the data) with the same settings, the time rises to 1h 5m. The CPU
> time falls between these two if the data come from an intermediate time
> stamp. I also tried the other linkage options, but the situation does not
> improve. My understanding is that the homogeneity is playing a role.
>
> Have you experienced this too? What solution do you suggest?
>
> Thanks in advance for your attention and help.
>
> --
> Best regards
>
> Md. Khairullah
> PhD Student, KU Leuven
> Numerical Analysis and Applied Mathematics Section
> Celestijnenlaan 200a - box 2402
> 3001 Leuven
> room: 03.18
> tel. +32 16 37 39 66
> fax +32 16 3 27996
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 6, Issue 40
> *******************************************

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
