[Scikit-learn-general] Joblib dump memory error

2012-11-16 Thread Ak
Hello, I am dumping the dataset vectorized with TfidfVectorizer, the target array, and the classifier OneVsRestClassifier(SGDClassifier(loss=log, n_iter=50, alpha=0.1)), since I want to add it to a package. I use the joblib library from sklearn.externals to dump the vectors. The max memory used wh

[Scikit-learn-general] naive bayes question

2012-11-16 Thread Peter Maseter
Hi all, I'm trying to write my own code for the NB classifier method, just so I could use prior distributions other than, for example, Gaussian. To start with, I scripted something similar to the GaussianNB function in scikit-learn (see the code below), but the two approaches give me different results (means

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Leon Palafox
Indeed, when I tried to rerun it on my Windows PC at home it also found NaN. The problem appears to be when I download the data using the script, since I tried it with the data I downloaded from the Linux server and it ran fine. Best On Sat, Nov 17, 2012 at 11:44 AM, Leon Palafox wrote: > He

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Leon Palafox
Hey guys, I think I figured out the problem, well, sorta. I downloaded everything on my Ubuntu server, and everything worked fine and dandy; the problem seems to be when I was running on my Windows machine. That's odd. It's on win32 and Python 2.7. Greets On Sat, Nov 17, 2012 at 3:43 AM, Jake Van

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-16 Thread Ronnie Ghose
:/ darnit. I wanted to run CARTs and Neural Nets on it >_<. though it was a mystery to me how that would work. On 16 November 2012 19:06, Olivier Grisel wrote: > You can also have a look at this answer on stackoverflow for more details: > > > http://stackoverflow.com/questions/12460077/possibil

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-16 Thread Olivier Grisel
You can also have a look at this answer on stackoverflow for more details: http://stackoverflow.com/questions/12460077/possibility-to-apply-online-algorithms-on-big-data-files-with-sklearn -- Monitor your physical, virtua

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-16 Thread Olivier Grisel
Read your data from the hard drive or database in chunks of ~1000 samples, for instance, and use the partial_fit method of the models that support it, typically online linear models such as Perceptron, SGDClassifier (or PassiveAggressiveClassifier in master).
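The chunked approach described in this message can be sketched as follows. This is a minimal illustration, not code from the thread: the generator iter_chunks, the synthetic data, the chunk count, and the two-class labels are all assumptions made for the example. Only SGDClassifier and its partial_fit method come from the message itself.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_chunks(n_chunks=20, chunk_size=1000, n_features=5, seed=0):
    """Hypothetical stand-in for reading chunks from disk or a database."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        # The label depends on the sign of the first feature, so it is learnable.
        y = (X[:, 0] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

for X_chunk, y_chunk in iter_chunks():
    # Each call updates the model using only the current chunk,
    # so the full data set never has to fit in memory at once.
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)

# Evaluate on a fresh synthetic chunk drawn with a different seed.
X_test, y_test = next(iter_chunks(n_chunks=1, seed=123))
accuracy = clf.score(X_test, y_test)
```

In a real setting iter_chunks would be replaced by whatever reads the next ~1000 rows from the file or database; the training loop itself stays the same.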

Re: [Scikit-learn-general] RandomForest benchmark

2012-11-16 Thread Olivier Grisel
You can retry by replacing the sklearn/externals/joblib folder with the joblib folder of this branch: https://github.com/joblib/joblib/pull/44

Re: [Scikit-learn-general] RandomForest benchmark

2012-11-16 Thread Satrajit Ghosh
this would also be consistent with the evaluation done here: http://wise.io/wiserf.html cheers, satra On Fri, Nov 16, 2012 at 2:25 PM, Peter Prettenhofer < peter.prettenho...@gmail.com> wrote: > Hi, > > I did a quick benchmark to compare sklearn's RandomForestClassifier > against R's randomFo

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
On 15 November 2012 23:20, Andreas Mueller wrote: > [...] > You can give GridSearchCV not only a grid but also a list of grids. > I would go with that. > (is that sufficiently documented?) > This doesn't appear to be documented (at least not at http://scikit-learn.org/dev/modules/generated/sklearn
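For readers unfamiliar with the "list of grids" form mentioned here, a small sketch follows. It uses the import path of current scikit-learn releases (sklearn.model_selection); at the time of this thread the class lived in sklearn.grid_search. The estimator, data set, and parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A list of grids: each dict is expanded separately, so gamma is only
# varied together with the RBF kernel and never with the linear one.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # 2 + 4 combinations
```

A single dict containing all three parameters would instead try every gamma value with the linear kernel as well, which wastes fits on settings that cannot differ.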

[Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-16 Thread Ronnie Ghose
So I have ~20 GB and growing of data that I want to run some algorithms on... how should I do so as this is... a giant amount of data. Besides online techniques such as partials, is there any way to modify the train method so it works on all of the data but queries ... as a stream? or in chunks or t

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Ronnie Ghose
Ahh.. sorry >_<. I thought I made a new thread... sigh. On 16 November 2012 15:33, Fred Mailhot wrote: > Check out SGDClassifier and partial_fit()...I've used these to good effect. > > Also, PROTIP: if you want decent help, don't piggy-back on threads that > have nothing to do with your questio

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
Check out SGDClassifier and partial_fit()...I've used these to good effect. Also, PROTIP: if you want decent help, don't piggy-back on threads that have nothing to do with your question. Just sayin'. On 16 November 2012 12:23, Ronnie Ghose wrote: > Any ideas for online learning with Scikit? I

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Ronnie Ghose
Any ideas for online learning with Scikit? I have a data set that is > 20 GB that I want to train on. I don't think I can do that easily, so what should I do? Thanks, Shomiron Ghose On 15 November 2012 15:45, Fred Mailhot wrote: > Dear list, > > I'm using GridSearchCV to do some simple model

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Olivier Grisel
This is a really weird low-level error. Maybe a Python bug. I don't have time to investigate, but if someone else can reproduce it, it would be interesting to try and make a minimalistic reproduction script that just uses the Python multiprocessing API.

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Jake Vanderplas
If npy files do in fact work cross-platform, then I'm baffled. Any ideas about what could be causing these NaNs in Leon's script? The files on the website haven't been modified since they were put online. Here's a more compact version of the NaN checking: >>> import numpy as np >>> data = np
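The snippet in this message is cut off in the archive. A generic way to check a NumPy array for NaNs is sketched below; it is not necessarily the check that was posted, and the array is a synthetic stand-in rather than the tutorial data.

```python
import numpy as np

# Synthetic stand-in for the loaded data set (assumption, not the tutorial data).
data = np.array([[1.0, 2.0, np.nan],
                 [4.0, 5.0, 6.0]])

nan_mask = np.isnan(data)

has_nan = bool(nan_mask.any())                # True if any entry is NaN
nan_count = int(nan_mask.sum())               # number of NaN entries
nan_rows = np.where(nan_mask.any(axis=1))[0]  # indices of rows containing a NaN
```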

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
Thanks to all for the tips on GridSearch with FeatureUnion, I'll be trying those out today. And @amueller I've been following the development of your PR for the random sampling of param space with great interest. But back to the initial problem...it seems that an empty input is the cause. My raw d

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Nelle Varoquaux
On 16 November 2012 17:14, Robert Kern wrote: > On Fri, Nov 16, 2012 at 4:03 PM, Nelle Varoquaux > wrote: > >> Hi Leon, > >> When I run your script, I get no instances of NaN in the data. > >> > >> I wonder if it's a problem with storing the data as a npy file. I asked > >> around last spring a

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Robert Kern
On Fri, Nov 16, 2012 at 4:03 PM, Nelle Varoquaux wrote: >> Hi Leon, >> When I run your script, I get no instances of NaN in the data. >> >> I wonder if it's a problem with storing the data as a npy file. I asked >> around last spring and everybody seemed to think that the format is >> compatible

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Gael Varoquaux
On Fri, Nov 16, 2012 at 05:03:14PM +0100, Nelle Varoquaux wrote: > I think numpy relies on pickle for those. If you store only one array per file it doesn't. It uses a stable cross-platform format. > Saving as txt is more reliable. But completely inefficient. G
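The two points made in this reply (a single array saved as .npy does not go through pickle, and text output is much larger) can be checked with a short sketch. The array contents and file names are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

arr = np.linspace(0.0, 1.0, 1000).reshape(100, 10)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "arr.npy")
    txt_path = os.path.join(tmp, "arr.txt")

    np.save(npy_path, arr)     # binary .npy: a small header plus the raw array bytes
    np.savetxt(txt_path, arr)  # plain text, one row per line

    # allow_pickle=False makes the load fail if the file needed pickle,
    # so a successful load shows this array was stored without it.
    loaded = np.load(npy_path, allow_pickle=False)
    from_txt = np.loadtxt(txt_path)

    npy_size = os.path.getsize(npy_path)
    txt_size = os.path.getsize(txt_path)
```

For a numeric array like this one the text file comes out several times larger than the .npy file, which is the inefficiency the reply refers to.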

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Nelle Varoquaux
> > Hi Leon, > When I run your script, I get no instances of NaN in the data. > > I wonder if it's a problem with storing the data as a npy file. I asked > around last spring and everybody seemed to think that the format is > compatible across platforms and numpy versions, but I may be wrong. Doe

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-16 Thread Jake Vanderplas
Hi Leon, When I run your script, I get no instances of NaN in the data. I wonder if it's a problem with storing the data as a npy file. I asked around last spring and everybody seemed to think that the format is compatible across platforms and numpy versions, but I may be wrong. Does anybody