Are you sure it is train_test_split itself that is taking a long time?

What are the dimensions of your data? Are they stored in memory as a numpy
array when you call train_test_split?

On my MacBook with 16GB RAM, train_test_split on
np.empty((1000000, 500), dtype=np.float64), a 3.7 GiB array, takes 11 seconds.
An array of the same size but a different shape,
np.empty((5000000, 100), dtype=np.float64), takes 19 seconds.
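
To check whether your own data should fit in memory, here is a minimal sketch
of the size arithmetic behind the 3.7 GiB figure. The helper name array_gib is
hypothetical (it is not part of numpy or scikit-learn), and it assumes a dense
array with no overhead beyond the raw buffer:

```python
import math

import numpy as np


def array_gib(shape, dtype=np.float64):
    """Size in GiB of a dense array with the given shape and dtype.

    Hypothetical helper for illustration; counts only the raw data buffer.
    """
    n_items = math.prod(shape)                # total number of elements
    n_bytes = n_items * np.dtype(dtype).itemsize
    return n_bytes / 2**30                    # bytes -> GiB


if __name__ == "__main__":
    for shape in [(1000000, 500), (5000000, 100)]:
        print(shape, round(array_gib(shape), 2), "GiB")
```

Any approach that copies the data needs roughly this much free memory on top
of the original array.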

That said, this can, and should, be sped up. We currently use X[indices] to
extract the selected samples, but the following benchmark shows that
a.take(indices, axis=0) is about 3.5 times faster on this array:

In [1]: a = np.empty((5000000, 100),dtype=np.float64)

In [2]: p = np.random.permutation(len(a))

In [3]: %timeit a[p]
1 loops, best of 3: 12.9 s per loop

In [4]: %timeit a.take(p, axis=0)
1 loops, best of 3: 3.7 s per loop

In [5]: x, y = a[p], a.take(p, axis=0)

In [6]: np.all(x == y)
True
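
For anyone who wants to repeat this comparison outside IPython, here is a
scaled-down sketch using the standard timeit module. The function name
compare_row_selection and the default sizes are my own choices for
illustration, and it fills the array with random values rather than np.empty
so that the equality check is well defined. Timings will differ by machine and
numpy version:

```python
import timeit

import numpy as np


def compare_row_selection(n_rows=20000, n_cols=100, repeat=3, seed=0):
    """Time a[p] against a.take(p, axis=0) for a random row permutation p.

    Returns (best time for a[p], best time for take, results identical?).
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n_rows, n_cols))
    p = rng.permutation(n_rows)

    # Best of `repeat` single runs, similar in spirit to %timeit's "best of 3".
    t_index = min(timeit.repeat(lambda: a[p], number=1, repeat=repeat))
    t_take = min(timeit.repeat(lambda: a.take(p, axis=0), number=1, repeat=repeat))

    same = np.array_equal(a[p], a.take(p, axis=0))
    return t_index, t_take, same


if __name__ == "__main__":
    t_index, t_take, same = compare_row_selection()
    print("a[p]:              %.4f s" % t_index)
    print("a.take(p, axis=0): %.4f s" % t_take)
    print("identical results:", same)
```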

In terms of memory consumption:

In [7]: %memit train_test_split(a)
peak memory: 11672.50 MiB, increment: 2890.67 MiB

In [8]: %memit a[p]
peak memory: 11636.03 MiB, increment: 3852.15 MiB

In [9]: %memit a.take(p, axis=0)
peak memory: 11628.34 MiB, increment: 3806.30 MiB

Note that this is about as much memory as we should expect for any approach
to train_test_split that copies the data: roughly one extra copy of the
3.7 GiB array.
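
As a rough illustration of what a take-based split could look like, here is a
minimal sketch. It is not scikit-learn's implementation: split_with_take and
its parameters are hypothetical, and it handles only a single 2-D numpy array.
It still copies every selected row, so the memory cost noted above applies:

```python
import numpy as np


def split_with_take(X, test_size=0.25, seed=0):
    """Shuffle the rows of X and split them into (train, test) copies.

    Hypothetical helper: uses ndarray.take instead of X[indices].
    """
    n_samples = len(X)
    n_test = int(np.ceil(test_size * n_samples))

    # One random permutation of the row indices, cut into two parts.
    p = np.random.default_rng(seed).permutation(n_samples)
    test_idx, train_idx = p[:n_test], p[n_test:]

    return X.take(train_idx, axis=0), X.take(test_idx, axis=0)


if __name__ == "__main__":
    X = np.arange(40, dtype=np.float64).reshape(20, 2)
    X_train, X_test = split_with_take(X)
    print(X_train.shape, X_test.shape)
```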

- Joel

On 7 August 2014 18:44, ZORAIDA HIDALGO SANCHEZ <
[email protected]> wrote:

> Hi all,
>
> I have a dataset of 600M that I need to split into train and test. I am
> using cross_validation.train_test_split to achieve it but it keeps running
> for one hour more or less and it ends consuming all the memory of the
> system (and thus I need to kill the process). My laptop has 8G of
> memory(Mac OS), if I tried it with a smaller dataset(137M) then it works
> perfectly and only takes 40 sec!
>
> Any suggestion?
>
> Thanks.
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>