Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
> Take the example of 1-NN: it can very well happen that for samples close to the Voronoi boundary, the closest neighbor is on the other side of the boundary.
Indeed, I was bullshitting.
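The quoted objection can be illustrated with a tiny 1-D sketch. The centroids, training points and query below are made-up values, not data from the thread; the centroids stand in for what a KMeans fit might return. The point it shows is that restricting the search to the query's own cell can miss the true nearest neighbor.

```python
import numpy as np

# Hypothetical 1-D setup: two centroids, so the cell boundary sits at x = 5.
centers = np.array([0.0, 10.0])
train = np.array([0.0, 4.9, 5.2, 10.0])
query = 5.02


def cell(x):
    """Index of the centroid closest to x, i.e. the Voronoi cell of x."""
    return int(np.argmin(np.abs(centers - x)))


train_cells = np.array([cell(x) for x in train])

# Exact 1-NN: search every training point.
exact_nn = train[np.argmin(np.abs(train - query))]

# Cell-restricted 1-NN: search only the training points in the query's cell.
same_cell = train[train_cells == cell(query)]
restricted_nn = same_cell[np.argmin(np.abs(same_cell - query))]

print("exact:", exact_nn, "restricted to cell:", restricted_nn)
```

Here the query falls just right of the boundary, while its closest training point lies just left of it, so the two searches disagree. A cell-restricted search is therefore an approximation rather than exact k-NN.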

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Mathieu Blondel
Solving this issue in a generic way would be nice: https://github.com/scikit-learn/scikit-learn/issues/325
On Mon, Jul 30, 2012 at 6:43 PM, Olivier Grisel wrote:
> Actually I think the KNeighborsClassifier implementation in scikit-learn has a real memory usage issue in "brute" mode

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
2012/7/30 Gael Varoquaux:
> On Mon, Jul 30, 2012 at 11:52:36AM +0200, Olivier Grisel wrote:
>> > In addition, a Voronoi tessellation computed with a KMeans during training could be used to avoid testing all the samples in the large-n situation.
>> Hum, that won't work for exact k-NN.

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Andreas Müller
well intuition on that works in high dimensions, though.
----- Original Message -----
From: "Gael Varoquaux"
To: scikit-learn-general@lists.sourceforge.net
Sent: Monday, 30 July 2012 10:54:48
Subject: Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Alg

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
On Mon, Jul 30, 2012 at 11:52:36AM +0200, Olivier Grisel wrote:
> > In addition, a Voronoi tessellation computed with a KMeans during training could be used to avoid testing all the samples in the large-n situation.
> Hum, that won't work for exact k-NN.
I don't understand. Yes I do belie

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
2012/7/30 Gael Varoquaux:
> On Mon, Jul 30, 2012 at 11:43:01AM +0200, Olivier Grisel wrote:
>> This could be worked around by chunking the data argument of the predict calls instead.
> Indeed.
> In addition, a Voronoi tessellation computed with a KMeans during training could be used to av

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
Ji, could you please create a new GitHub issue to track this bug? https://github.com/scikit-learn/scikit-learn/issues
Please include the Python snippets of your notebook as a verbatim markdown block in the issue:
```python
from sklearn.neighbors import KNeighborsClassifier
...
```
-- O

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
On Mon, Jul 30, 2012 at 11:43:01AM +0200, Olivier Grisel wrote:
> This could be worked around by chunking the data argument of the predict calls instead.
Indeed. In addition, a Voronoi tessellation computed with a KMeans during training could be used to avoid testing all the samples in the larg
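The chunking workaround quoted above might look roughly like the sketch below. `predict_in_chunks` and `chunk_size` are hypothetical names introduced here, and the random data is only a placeholder for the real dataset; this is one possible reading of the suggestion, not code from the thread.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def predict_in_chunks(model, X, chunk_size=1000):
    """Call model.predict on consecutive row slices of X and join the results."""
    parts = [model.predict(X[start:start + chunk_size])
             for start in range(0, X.shape[0], chunk_size)]
    return np.concatenate(parts)


# Placeholder data standing in for the real train and test sets.
rng = np.random.RandomState(0)
X_train = rng.randn(500, 10)
y_train = (X_train[:, 0] > 0).astype(int)
X_new = rng.randn(2500, 10)

knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X_train, y_train)
chunked = predict_in_chunks(knn, X_new, chunk_size=400)
```

Each call only ever sees `chunk_size` query rows, so any intermediate array whose size scales with the number of query rows is bounded by the chunk size rather than by the full test set.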

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Olivier Grisel
Actually I think the KNeighborsClassifier implementation in scikit-learn has a real memory usage issue in "brute" mode (which is selected). I suspect it is materializing the whole (n_samples_train, n_samples_predict) distances array in memory before computing the (n_samples_predict * k
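To get a feel for why a dense (n_samples_train, n_samples_predict) array could be a problem, here is a back-of-the-envelope estimate. The sample counts are hypothetical values chosen for illustration, not figures from the thread, and 8 bytes per entry assumes float64 distances.

```python
def distance_matrix_bytes(n_samples_train, n_samples_predict, itemsize=8):
    """Bytes needed for one dense (n_samples_train, n_samples_predict) array."""
    return n_samples_train * n_samples_predict * itemsize


# Hypothetical sizes: all query rows at once versus 1,000 query rows at a time.
full = distance_matrix_bytes(42_000, 28_000)
per_chunk = distance_matrix_bytes(42_000, 1_000)

print(f"all at once: {full / 2**30:.2f} GiB")
print(f"per chunk:   {per_chunk / 2**20:.1f} MiB")
```

The estimate only covers the distance array itself; any temporary copies made while sorting or selecting neighbors would come on top of it.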

Re: [Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-30 Thread Gael Varoquaux
Hi Ji, What you are trying to do is called 'online fitting'. Only a small number of models can do online fitting. In scikit-learn, this is implemented with a 'partial_fit' method. As far as supervised learning goes, only SGD does online learning, I believe. http://scikit-learn.org/stable/module
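A minimal sketch of what online fitting with `partial_fit` could look like, assuming the data arrives as a sequence of (X, y) chunks. `fit_in_chunks` and `synthetic_chunks` are hypothetical helpers, and the synthetic data only stands in for chunks read from a large file.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def fit_in_chunks(chunks, classes):
    """Train an SGDClassifier incrementally on an iterable of (X, y) chunks."""
    clf = SGDClassifier(random_state=0)
    for X_chunk, y_chunk in chunks:
        # `classes` lists every label that can ever appear, because a single
        # chunk is not guaranteed to contain all of them.
        clf.partial_fit(X_chunk, y_chunk, classes=classes)
    return clf


def synthetic_chunks(n_chunks=20, chunk_size=200, seed=0):
    """Hypothetical stand-in for chunks read from a large CSV file."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, 5)
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = fit_in_chunks(synthetic_chunks(), classes=np.array([0, 1]))
X_test, y_test = next(synthetic_chunks(n_chunks=1, seed=1))
accuracy = clf.score(X_test, y_test)
```

For a real CSV, `pandas.read_csv(path, chunksize=...)` returns an iterator of DataFrames that could feed the same loop once each chunk is split into features and labels, so the whole file never has to sit in memory at once.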

[Scikit-learn-general] Problem in Reading Large CSV and Fitting to ML Algorithm

2012-07-28 Thread Ji H. Park
I'm using the IPython notebook as my programming environment, and the pandas and sklearn packages to analyze data from the Digit Recognizer Tutorial. The data is available on the webpage (link above), and attached is my IPython notebook. KNeighborsClassifi